Welcome to Intel® Software Network Quick Login | Join | Help |
Search in Intel® Software Network Forums
in Go

How Zero in Performance / execution is possible. ?

Last post 11-20-2007, 1:56 AM by canepag. 26 replies.
Sort Posts: Previous Next
 11-06-2007, 8:00 AM 30243258  

How Zero in Performance / execution is possible. ?

I'd like to know how the judges have quote the performance / execution part ?. Because I wonder how
- a working application
- with all the read/write rewritten using multi tread for conversion,
- And 40 000 000 numbers sorted in 1 sec.
Can have 0 in performance and execution.

Is it because we have not write the last CRLF at the end on the output file? thus the file compare they might have used would have failed ?

Thanks for en lighting me on this.

Jean-Philippe D.

 
 11-07-2007, 7:59 AM 30243330 in reply to 30243258  

Re: How Zero in Performance / execution is possible. ?

Hei Jean-Philippe,
   Have you used quicksort? How many processors have you used? If you used less than 8 processors you are my winner!
   I compiled and tested the solution of the winner using intel compiler under windows and it is 23% slower than my solution (on an average of 895 runs over a random sequence of 100000 int32) . I tested the winner solution on 40 000 000 int32 and it takes 4.28 seconds on my two processors.
   My routine take 3.23 seconds on 40 000 000 int32 (two processors only) but my algorithm scales very well.
   As a consequence you are my winner. Please, my you send your solution? I'm really curious to see such a nice piece of code.
   Anyway I'm going to write a nice post about the judges: their are not adequate at all and I think I will make an official complaint to Intel: a letter full of interesting points. Look at the post in the problem 2 to see my remark with title: "Judges" !
   Gaetano
 
 11-07-2007, 9:45 AM 30243351 in reply to 30243330  

Re: How Zero in Performance / execution is possible. ?

This is not about being the winner, or the best. It is only a question on fair judging.

The problem specified 2^32 for the limit, doing a sort on 250 000 INT is trivial. This is less than 0.006% of the specified range. Everybody will say that most of our algorithm were tuned for an extremely higher quantity of integers. 

Our sort for a random 50 000 000 signed longs takes 1.12 sec on a quad core 3ghz
On the same computer the winner's solution took 7.0593 sec for the same 50 000 000 elements. That is 6 times slower.

It was said that those who will optimize the read/write will have an advantage, and over 50 000 000 elements, we read, sort and write in 13 secs, when the winner took 45 secs for the same operation. This is an overall 3.4 times slower.

Also, using Intel Vtune Thread Analyzer on Wei-Yin Chen's application crash it and the software do not write the result. I mean, those are Intel tools, they should be used for scoring the application, since we used them to develop the application.

Here is the conclusive result using the Intel specified test files.

random 100 000 :

 Jean-Philippe Doiron's Version:
  read_data 0.0025199
  sort 0.0031155
  write_data 0.00761727
  overall 0.0135201

 Wei-Yin Chen's version:
  read_data 0.0274919
  sort 0.0024196788
  write_data 0.0352129824
  overall 0.0663712130 

 JPD's version overall runtime is 4.91x faster

random 250 000 :

 Jean-Philippe Doiron's Version:
  read_data 0.00798503
  sort 0.00794066
  write_data 0.0218019
  overall 0.0380246

 Wei-Yin Chen's version:
  read_data 0.0711102
  sort 0.0061736824
  write_data 0.0917116927
  overall 0.1703533901

 JPD's version overall runtime is 4.48x faster

sorted 250 000 with inversed 0-125000 :

 Jean-Philippe Doiron's Version:
  read_data 0.00713012
  sort 0.000928429
  write_data 0.0221195
  overall 0.0304531 

 Wei-Yin Chen's version:
  read_data 0.0675642
  sort 0.0015339248
  write_data 0.0914670419
  overall 0.1618304710

 JPD's version overall runtime is 5.31x faster

random 1 000 000 :

 Jean-Philippe Doiron's Version:
  read_data 0.0419785
  sort 0.0304754
  write_data 0.11227
  overall 0.185196 

 Wei-Yin Chen's version:
  read_data 0.293949
  sort 0.0267700296
  write_data 0.3752933219
  overall 0.6975016388 

 JPD's version overall runtime is 3.77x faster

random 50 000 000 :

 Jean-Philippe Doiron's Version:
  read_data 2.65815
  sort 1.12388
  write_data 9.93338
  overall 13.7214

 Wei-Yin Chen's version:
  read_data 15.7649
  sort 7.6782576421
  write_data 21.6691115932
  overall 45.1196687726 

 JPD's version overall runtime is 3.29x faster (6.83192x faster sort!!!)

sorted 50 000 000 with inversed 0-25000000:

 Jean-Philippe Doiron's Version:
  read_data 1.90845
  sort 0.367165
  write_data 9.62512
  overall 11.9175

 Wei-Yin Chen's version:
  read_data 14.7956
  sort 7.0593976408
  write_data 21.8377143905
  overall 43.7023949770  

 JPD's version overall runtime is 3.67x faster (19.2267x faster sort!!!)

Intel Calculation for Execution & Time Score:

 Jean-Philippe Doiron's version:
0.013 + 0.0380246 + 0.030 = 0.081
300 - 0.08 = 299.918 / 3 = 99.972
 Penalties : Incorrect output : no. All output verified.(no CRLF on last line)
                    No use of thread : no. Optimal use of thread.
      incorrect use of file : no. Optimal read/write function.
      amount of code changed : no. MS VC solution ready to build with no change.

 Wei-Yin Chen's version:
  0.0663 + 0.1703 + 0.1618 = 0.398
  300 - 0.3985 = 299.6014 / 3 = 99.867

How It is possible to receive a ZERO score, this is a little bit confusing?!.

Real-Test that should have been done at least since the problem describes a quantity up to ~4 000 000 000 int.
[adding the 50 000 000 sorted test]

 Jean-Philippe Doiron's version:
 0.01352 + 0.0380 + 0.0304 + 11.9175 = 12.024
 400 - 12.024 = 387.975 / 4 = 96.993

 Wei-Yin Chen's version:
 0.06637 + 0.17035 + 0.16183 + 43.70239 = 44.1009
 400 - 44.1009 = 355.8990 / 4 = 88.97

 

Correction from first post : I removed : read/write rewritten using multi tread for conversion,

 
 11-07-2007, 12:52 PM 30243368 in reply to 30243351  

Re: How Zero in Performance / execution is possible. ?

Well, it is exactly what I wanted to say: I participated because I wanted to learn. The judgement seems not fair to me and I am writing a letter about this fact that is very hard. It also includes very hard remark about the second problem that explain that our judges cannot be fair because they do not know what they are doing (I mean programming).

Anyway, I sent you a message (I use the your user name as address) because I would like to learn something on fast programming and your program is really fast. How can be so quick? Not for reading/writing but for ordering. Are you sure your algorithm is only a quicksort (Knut like)?
Can you please send me your code? I think I could learn from it more than from any webinar Intel course.
I have to say that I just started to  use VTune: but my code did not crash using it! Have you optimized a lot using it?

Gaetano

P. S. Is the reading writing really limited by the conversion or is the disk speed that limit the time? Anover question that I would like to study on your code.
 
 11-07-2007, 2:46 PM 30243374 in reply to 30243368  

Re: How Zero in Performance / execution is possible. ?

In short, (I'll send you the code once the dispute with the score will be solve).

In short,

I'm using a Normal quick sort, derived from the qsort sent by intel, with a Insertion_sort when reaching less than 400 element to sort. (the basic rules for a quick sort).

I'm using a parallel_while with a concurrent_queue and a center point pivot (to counter already sorted files).

And a std:swap for swapping method. Nothing much, no cache optimization, no assembler function. And I guess your code will run essentially the same speed on the same machine.

As for IO,  Conversion is really the bottleneck of the process, scanf  need to validate each character and make a conversion. disabling the system cache, reading the file at once and converting the number manually (stay clear of atoi, atol ,strok if you need speed).  And for the character conversion : 2548 will gives 50 53 52 56 in char. removing 48 to each gives 2 5 4 8, multiplying by 1000/100/10/1 will gives your final number.

    //read all the number using the same technique as above.
    //loop until LF and convert char to int.
    for (i = 0; i < N; i++)
    {
        A[i] = 0;
        lp = p;

       
        while (t[p++] != 10 && p <=s+1);   

        si = false;
        z = (p-lp)-2;
        for (j = 0; j < z; j++)
        {
            x = *(t+lp++);
            if (x == '-')
                si = true;
            else
                A[i] = A[i]*10 + x-48;
        }
        if (si) A[i] *= -1;
    }

note: this code support minus value.

Like you, I have joined this contest to learn more on Parallel programming, I hope we will have some feedback from the Intel representative shortly.

thanks.

JP.


 
 11-08-2007, 6:14 AM 30243425 in reply to 30243374  

Re: How Zero in Performance / execution is possible. ?

It might sound weird, but I, too, think the scoring criteria strange. First of all, the array size is too limited to really test the scalability and I think under the current criteria, even the unparallelized sorting can get a full score in this part.
Arrays with size greater than 10M or even 100M should be tested.
As for the elegance part, I don't see "use of given code" positive because the given code is not really in a good shape. And "cut-off sort". Hmm...actually I didn't even know it is allowed to do so....Orz. I assumed the rule is the strict original quick sort. It would have been the largest speed gain next to the IO part.
Timing of the IO part is also controversial. The official response said optimizing the IO is advantageous, but it seems like the IO timing doesn't matter so much. (or even the whole timing doesn't matter so much because the input size is too small.) I started spending time on it too late (you can tell from my using of fprintf/fscanf) because I think it is not as interesting as parallelizing the sorting and partitioning.
The result of JPD's version looks very impressive, especially in the 50M case. If I interpolate from my experimental results, JPD's version should be 5 times faster than the built-in parallel_sort! Good job, JP! Maybe Intel should consider using your routine as the built-in parallel_sort.
 
 11-08-2007, 7:59 AM 30243436 in reply to 30243425  

Re: How Zero in Performance / execution is possible. ?

Hello!

My solution is a little bit faster than Intel's parallel_sort... not 5x faster.

For example:

[Test 1] 50 000 000 random numbers:

JPD's Version:

read_data 2.56957
sort 1.07407
write_data 8.53645
overall 12.186

Intel Parallel Sort:

read_data 2.5695
sort 1.13325
write_data 8.72662
overall 12.4352

[Test 2] 100 000 000 random numbers (1 gig file):

JPD's Version:

read_data 10.5217
sort 2.19227
write_data 12.2159
overall 24.9478

Intel Parallel Sort:

read_data 10.5707
sort 2.19149
write_data 12.1613
overall 24.9411

Wei-Yin Chen's Version [Intel's solution winner]:

read_data 31.2981
sort 28.0532037433
write_data 45.0566705043
overall 104.4079742476

[Test 3] 100 000 000 sorted numbers with 1-50 000 000 inversed (1 gig file):

JPD's Version:

read_data 8.86079
sort 0.752496
write_data 12.4276
overall 22.058

Intel Parallel Sort:

read_data 8.84091
sort 0.880217
write_data 13.048
overall 22.7838

Ok that's maybe just a difference of 0.127 sec in this sort, but that is still 17% faster.

Since my version is comparable to Intel's hand-optimized parallel_sort, it would be fair to receive my 100 points in Execution & Time Score...

Also, This would be appreciated if an official Intel representative could read and comment this thread..

Thanks.

 
 11-08-2007, 8:19 AM 30243440 in reply to 30243436  

Re: How Zero in Performance / execution is possible. ?

It seems like my parameters tuned in my environment do not fit in your environment. I ran a benchmark on 100M random numbers, and my solution is slightly 3% faster than parallel_sort.
Being 17% faster than parallel_sort is pretty good. I do think you qualify 100 points in timing score.
 
 11-08-2007, 8:32 AM 30243443 in reply to 30243425  

Re: How Zero in Performance / execution is possible. ?

Hi again JP,
   I think My algorith is approximately like yours except I stopped the insertion ordering at 50 elements because using some test I found that after that limit it will run slower. I also check some other ordering algorithm and they were slower (only a few percent). In many internet site (and in the Knuth book) the limit for insertion  should be 20! My question is: how can your insertion limit be 400? Without special optimization? Are you sure your routine is working? Or you wrote in the mail one zero too much (400 instead than 40)?

Gaetano
 
 11-08-2007, 8:51 AM 30243444 in reply to 30243443  

Re: How Zero in Performance / execution is possible. ?

Hi Gaetano,

   400 is really the minimum number in my solution before calling std::sort on the remaining items... In our test, it was the fastest value with very high quantities of numbers to sort... And yes, our routine is working ;) When using values lower than 100, the overhead obtained from creating many parallel threads to quicksort these small subset before reaching that value were adding too much time... We used Thread Profiler to observe these overhead and adjust our threshold to 400. By the way, Intel's Parallel Sort uses 500:

template<typename RandomAccessIterator, typename Compare>

void parallel_sort( RandomAccessIterator begin, RandomAccessIterator end, const Compare& comp) { 
   const int min_parallel_size = 500; 
   if( end > begin ) {
      if (end - begin < min_parallel_size) { 
         std::sort(begin, end, comp);
      }
  else {
         internal::parallel_quick_sort(begin, end, comp);
      }
   }
}

Regards.

 
 11-09-2007, 1:47 AM 30243497 in reply to 30243425  

Re: How Zero in Performance / execution is possible. ?

Hi JP,
   Actually, to reduce the problem, I stop parallelizing the quicksort at 1000 samples and I start insertion at 40 samples. Starting insertion at 400 make everithing slower. If you do like this you should be able to reduce the time even more. My algorithm is 17% faster than intel parallel_sort at 40M samples. I now think that the difference in speed is only due to cache, processor type and bus speed. I'm looking forward to compare your algorithm on my pc (2Gbyte RAM, dual CPU Xeon 3.06MHz) preatty old with restpect to yours (I think).

Gaetano
 
 11-09-2007, 6:17 AM 30243522 in reply to 30243497  

Re: How Zero in Performance / execution is possible. ?

Hi,

That is a nice algorithm, are you using Standard template library insertion sort or your own method, because the STD sort method have both quick sort and insertion and choose which one fits the best for the sent element. So, if you have coded this yourself, you should save lines of code, and call directly std::sort when under 1000, it will quick sort then insertion sort, or even heap sort to counter the worst case when quick sorting.

void _Sort(_RanIt _First, _RanIt _Last, _Diff _Ideal)

{ // order [_First, _Last), using operator<

_Diff _Count;

for (; _ISORT_MAX < (_Count = _Last - _First) && 0 < _Ideal; )

{ // divide and conquer by quicksort

pair<_RanIt, _RanIt> _Mid = _Unguarded_partition(_First, _Last);

_Ideal /= 2, _Ideal += _Ideal / 2; // allow 1.5 log2(N) divisions

if (_Mid.first - _First < _Last - _Mid.second) // loop on larger half

_Sort(_First, _Mid.first, _Ideal), _First = _Mid.second;

else

_Sort(_Mid.second, _Last, _Ideal), _Last = _Mid.first;

}

if (_ISORT_MAX < _Count)

{ // heap sort if too many divisions

std::make_heap(_First, _Last);

std::sort_heap(_First, _Last);

}

else if (1 < _Count)

_Insertion_sort(_First, _Last); // small, insertion sort

}

Regards, JP.

 
 11-09-2007, 9:11 AM 30243540 in reply to 30243522  

Re: How Zero in Performance / execution is possible. ?

No,
   We had to order integers so I used this one (Numerical recipes in C) rearranged:

inline void InsertSort(int* const A, int p, int r) {

   int i, j;
   int a;
   for(j=p+1;j<=r;j++) {
      a=A[j];
      i=j-1;
      while(i>=p && A[i]> a) {
         A[i+1]=A[i];
         i--;
      }
      A[i+1]=a;
   }

}


And I used thid for quicksort (non parallel):
inline void QuickSortTask::QuickSort(int p, int r) {

   if (p < r) {
#ifdef M_INSERTION
      if((r-p)< M_INSERTION) InsertSort(A, p, r);
      else
#endif // M_INSERTION
      {
         int q = Partition(A, p, r);
           QuickSort (p, q-1);
           QuickSort (q+1, r);
       }
   }
}

Try it!

Gaetano
 
 11-09-2007, 9:20 AM 30243541 in reply to 30243540  

Re: How Zero in Performance / execution is possible. ?

Could you send me your executable file (for Windows), I will try it on my 100 000 000 numbers file and compare the time.

Have a nice day!

 
 11-09-2007, 9:22 AM 30243543 in reply to 30243258  

Re: How Zero in Performance / execution is possible. ?

I suppose I had a completely different reaction when I saw the contest results.  My thought after looking at the points was that I spend too much time worrying about performance.  I believe there is more to programming (and this contest) than simply having the fastest code.  My impression is that we recieved 100 points for submitting, 100 if our code executed correctly (I am not in a position to know why JPD's code was overlooked, but feel sure the judges will review it and respond appropriately).  When I looked at Wei's code I noticed that he provided a good usage write-up and a program to generate test data-sets.  I think the devil (and the points) are in those types of details.  The fastest code on a particular platform with a particular data-set may not be the "best" code.  I think about Knuth, he has spend a good part of his life writing code for MIX machines, which don't exist.  Is his code bad because it's performance is basically zero?

My main purpose here is the learn more about parallel programming.  For myself, I've decided to worry alot less about performance and a lot more about writing "interesting" code and having some fun with it.  My next submission will not be as fast as I can make it, it will be as interesting as I can make it.  My one wish is that Intel would allow us to use non-com versions of their other software tools on Windows. 
 
 11-09-2007, 6:36 PM 30243574 in reply to 30243444  

Re: How Zero in Performance / execution is possible. ?

jpd@internetimds.com:

Hi Gaetano,

   400 is really the minimum number in my solution before calling std::sort on the remaining items... In our test, it was the fastest value with very high quantities of numbers to sort... And yes, our routine is working ;) When using values lower than 100, the overhead obtained from creating many parallel threads to quicksort these small subset before reaching that value were adding too much time... We used Thread Profiler to observe these overhead and adjust our threshold to 400. By the way, Intel's Parallel Sort uses 500:



If I understand your statement correctly, I think this could explain the 0 points you received for your solution. Reverting to a library function when there are still 400 elements really deviating from the quicksort algorithm. The contest was not to make the fastest sort program, but a fast/multi-threaded QUICKSORT program. There is even another sort contest coming up - mergesort.
 
 11-09-2007, 6:59 PM 30243575 in reply to 30243574  

Re: How Zero in Performance / execution is possible. ?

No, check at the criteria :

"Cut-off sort - does the code use quicksort only to a certain size of elements to sort"

In other words : The judge expects that you use a more efficient sorting algorithm than quicksoft when less than a certain size of elements to sort.

I'm still waiting an answer from Intel to explain the '0' score....

 
 11-10-2007, 7:50 AM 30243589 in reply to 30243541  

Re: How Zero in Performance / execution is possible. ?

Hi JP,
   I sent my program to the address jpd@internetimds.com

Gaetano
 
 11-10-2007, 7:58 AM 30243590 in reply to 30243574  

Re: How Zero in Performance / execution is possible. ?

Hi mgarlanger,
   Just check in the Internet: the quicksort standard algorithm include insert sorting for a small number of sorting element. Also, a parallelization algorithm have to be a correct algorithm to parallelize. Also, under a certain threshold (number of array elements), the algorithm cannot parallelized more and a serial algorithm is better.
Finally, everything in the algorithm I sent was optional (using #define: I think they had 4 options). If they did not like the algorithm, they only had to switch of the option!

Bye
Gaetano
 
 11-10-2007, 8:07 AM 30243593 in reply to 30243543  

Re: How Zero in Performance / execution is possible. ?

Hi DweeberlyLoom,
Dr Knuth has produced one of the best text processing software ever written (word cannot even kiss is h...): Tex and Latex.
His book are a bit old, expecially considering actual computer world development, but the algorithm are still the same: moreover, his algorithms could be implemented on FPGA. probably 1000 time faster than on any multi CPU processors: in this sense the algorithms are ultra modern!

Gaetano
 
 11-12-2007, 9:57 AM 30243658 in reply to 30243593  

Re: How Zero in Performance / execution is possible. ?

Hi Gaetano, [wychen please read at the end]

I compiled your solution. I can understand that they did not give you the whole 100 points since your program is not reading the InputData.in file. You are making your own data set with IPP generating random value... However, giving you 0 for execution is exagerated since your solution works and you've put a lot of efforts inside.

I've changed your DEBUG_N value to 100000000 instead of 100000 since is it the number of int in my test file.

I've removed everything related to IPP and replaced it with our read function to get the same input data in the var A.

Here's the result:

100 000 000 Mixed Integers:

Gaetano's Version:
no read function in his code: so we used our function : read_data 18.7767
sort 30.4818
write_data 44.8657
overall 94.1249

Our Version:
read_data 18.4319
sort 2.2552
write_data 14.3311
overall 35.0314

Please explain how you compared your speed with parallel_sort, since if I replace your :

QuickSortPar (A, 0, DEBUG_N-1);

with :

parallel_sort (A, A+DEBUG_N);

The sort is done in your version in around 2 seconds and half.

Wychen: Can you send me a windows executable of your code, since I played with all flags in the compilation and I can't make your solution fas