Welcome to Intel® Software Network Quick Login | Join | Help |
Search in Intel® Software Network Forums
in Go

What about solutions

Last post 11-01-2007, 6:31 AM by bugman. 8 replies.
Sort Posts: Previous Next
 10-25-2007, 8:28 AM 30242643  

What about solutions

What about solutions after deadline? It would be interesting to read: who and how made his/her code, or (like me) has an idea for solution, but has no time to implement it. If it is interesting I can tell more...
 
 10-26-2007, 6:14 AM 30242703 in reply to 30242643  

Re: What about solutions

In my solution I just distributed large chunk (>1000) of queued things to do using tbb::parallel_while and tbb:concurrent_queue

I got better performance than tbb::parallel_sort on normal case. In worst case, it keeps the complexity of quicksort.

For small chunk, I used the normal quicksort algorithm with C++'s std::queue

For the IO, I tested :
1) C++'s iostream,
2) C fscanf/fprintf,
3) getting the file into strings and then distributing the conversion to int in multiple threads

2 was twice faster as 1 with g++/linux . 3 gives some improvement but could consume two much memory and is hard to read/maintain. So I finally used 2) for the IO part.

Best regards.
 
 10-26-2007, 7:07 AM 30242712 in reply to 30242643  

Re: What about solutions

Totally not enough time in two weekends, I expected there to be more pipelining in the contest stages. I dont suppose we could hope for late entry with penalty, in the future?
I was too busy at work, and figuring out tasking, and searching for access to a dual core for testing (HT didnt do much here) to check the calendar. (Wasted several hours because MS C++2005 cannot use separate templated member functions...)
Next month, watch out...

I had hopes for parallelizing Partition... Anyone do anything with that?

Shouldn't give away all the secrets (sort will almost certainly be useful later):
* Iterative: breadth first (stack) better than depth first (queue) [locality of reference]; tail "recursion"
* concurrent_queue is slow
* recursive threshold helps iterative
* for presorted: Median of 3 partition good (but possibly bad in general); Random pivot good
* pivot balancing, for when there are large amounts of data that match pivot ["discrete" below, 100 of each value randomly distributed]

Didnt get around to testing tasked-single threaded, or parallel_sort.

A few numbers:
Thousands of Int's sorted per sec [bigger is better, I/O excluded]
datasets and code NOT identical across tests (specifically 1T to 2T gains are not all because of multithread)
Earlier tests approx +/-100, later +/-50 (Look at that RDRAM go!)

K-ints sorted/sec
T=threads
10^4
rand
10^7
rand
10^6
discrete
10^5
semisorted
10^5
sorted
P4 2.2 512k RDRAM




1T Base Recursive7966251512987220
1T Base Iter79238861391
1T Opt+ Iter (P=Med3)693159528733409816000
recursive to iter factor0.872.3767.74.7800
P4 2.8 HT 1M




1T Optimized Iter74403078156878
2T Opt+ Tasked981934966308106815930
speedup factor1.321.1440.4412.17
AMD64x2 2.2




1T Optimized Iter137602668128906
2T Opt+ Tasked1516044988437174677956
speedup factor1.11.6965.9119.28

 
 10-28-2007, 1:42 AM 30242770 in reply to 30242712  

Re: What about solutions

An idea only (I had no time to test it):

 

1)      read input file to array A and find min and max values;

2)      let splitter = (max+min)/2;

3)      for all A members: if A[I] < splitter then A1[I] = A[I] else A2[I] = A[I];

4)      run thread1 to sort A1, run thread2 to sort A2;

5)      wait for finish of thread1 and thread2;

6)      write A1 to output;

7)      write A2 to output;

 

It will be fast for random input values, i.e. for uniform distribution. (I suspect that random numbers will be used to test our solutions ;-)).

 
 10-29-2007, 1:09 PM 30242820 in reply to 30242703  

Re: What about solutions

dgeld:
In my solution I just distributed large chunk (>1000) of queued things to do using tbb::parallel_while and tbb:concurrent_queue

I got better performance than tbb::parallel_sort on normal case. In worst case, it keeps the complexity of quicksort.

For small chunk, I used the normal quicksort algorithm with C++'s std::queue

For the IO, I tested :
1) C++'s iostream,
2) C fscanf/fprintf,
3) getting the file into strings and then distributing the conversion to int in multiple threads

2 was twice faster as 1 with g++/linux . 3 gives some improvement but could consume two much memory and is hard to read/maintain. So I finally used 2) for the IO part.

Best regards.

I used the MS c++ compiler on windows so my results will certainly vary.

I found that when reverting to a normal quicksort it was very tough to beat standard stack recursion.  I added to that some optimization that kept the tree from getting so "leafy" (bushy) at the ends.  I wanted to add optimization that reduced the malloc overhead in the non-recusive case, but ran out of time :-(

As to the I/O I'm not surprised that C++'s iostream didn't fair well.  I was able to (pretty much) double my I/O performance by dumping printf/scanf (at least on the box where I tested), but it was a bit of a hack (can you say no error checking).  Generally I found that printf/scanf were pretty darn good at what they do.  Of course as I said it's a bit of an apple/orange (MS/Penguin) comparison and I also believe that the actually hardware plays a big part in these numbers.  My HD and CPU are around five years old.

 
 10-29-2007, 2:00 PM 30242821 in reply to 30242712  

Re: What about solutions

ddrogahn:
Totally not enough time in two weekends, I expected there to be more pipelining in the contest stages. I dont suppose we could hope for late entry with penalty, in the future?
I was too busy at work, and figuring out tasking, and searching for access to a dual core for testing (HT didnt do much here) to check the calendar. (Wasted several hours because MS C++2005 cannot use separate templated member functions...)
Next month, watch out...

I too felt time constrained and I think that is just the nature of the beast.  If Intel had given us three months it still wouldn't have been enough time, cause there is always that "one more thing ...".  I had not used C++ in years and was "syntactically challenged" ... unlike you I'm not likely to be any better at it next month :-(    

Anyone have a nice C# port of tbb :-D

ddrogahn:

I had hopes for parallelizing Partition... Anyone do anything with that?

I think that might be pretty tough.  I would think that the costs of locking the elements so you could do a safe compare/swap would be higher than the gain you could reasonable expect to be returned.  Did you have ideas about ways to overcome this?

ddrogahn:

Shouldn't give away all the secrets (sort will almost certainly be useful later):
* Iterative: breadth first (stack) better than depth first (queue) [locality of reference]; tail "recursion"
* concurrent_queue is slow
* recursive threshold helps iterative
* for presorted: Median of 3 partition good (but possibly bad in general); Random pivot good
* pivot balancing, for when there are large amounts of data that match pivot ["discrete" below, 100 of each value randomly distributed]

I doubt that any of us will have many secrets ... I'd hope that Intel will make all the code (or at least the winning code) available to all.  After all when we hit submit that code became Intel's property and I would expect them to use it to provide both tbb examples and to improve tbb as well as to promote their multi-core chips.

I wanted to reduce the number of malloc's needed for the queue/while, but didn't have the time.  From what I can tell I think that a random pivot is good for diminishing worst case behavior, but I chose a fixed pivot because I figured the time I spent generating a random number could be better spent sorting (and hoping I didn't hit a worst-case case).  I also used thresholds (to move from parallel to serial and to move from recursive to in-line).  I didn't notice concurrent_queue being slow, but then I don't think I compared it to anything else.  I didn't notice that in the example code they gave us, there was a queue class that claimed to be thread safe, but clearly was not (sneaky sneaky Intel).

ddrogahn:

Didnt get around to testing tasked-single threaded, or parallel_sort.

A few numbers:
Thousands of Int's sorted per sec [bigger is better, I/O excluded]

...

Sorry I don't remember my numbers, probably wouldn't mean much since my machine is 5+ years old :-).   You did a nice job of laying out your results.

 
 10-29-2007, 2:09 PM 30242822 in reply to 30242770  

Re: What about solutions

mt2:

An idea only (I had no time to test it):

 

1)      read input file to array A and find min and max values;

2)      let splitter = (max+min)/2;

3)      for all A members: if A[I] < splitter then A1[I] = A[I] else A2[I] = A[I];

4)      run thread1 to sort A1, run thread2 to sort A2;

5)      wait for finish of thread1 and thread2;

6)      write A1 to output;

7)      write A2 to output;

 

It will be fast for random input values, i.e. for uniform distribution. (I suspect that random numbers will be used to test our solutions ;-)).

I had a similar idea but did not get around to implementing it.  Basically, I thought why "waste" the read loop, I should be able to do the first partition while I read in the data.  Instead of the min/max I thought I would just take the first number in the file as the pivot.

 
 10-30-2007, 1:39 PM 30242876 in reply to 30242822  

Re: What about solutions

My strategy was to minimize the number of nodes that get created.  First, I made it so that every node is re-used, so after a partition I would take the existing node and use it as the 'left' side of the partition.  Then I created a rudamentary memory manager to allocate nodes in blocks of ~50 and to recycle nodes that were freed.  Of course, I used a mutex to protect the critical blocks of the memory manager.  I also stopped using qsort for lists < 100 and instead used a simple insertion sort.

Finally, I'm kicking myself for allowing lists < 1000 to be queued with the parallel_while.  After seeing that so many people had tried that out, I implemented it just now and saw that it would have saved me an extra 7% in sort execution time.  I had seen forum posts about it, but I didn't realize it would make such a big difference when I was already recycling nodes (turns out, parallel_while doesn't do a good job of recycling its own internal nodes which results in the overhead).  However, I only have a p4 3gz w/HT, so I'm hoping the test machines will minimize that performance loss to maybe just a few %.

I think there were a few more minor tweaks I made, but my above strategies saw the most performance gain.

Matthew Ferguson

PS. This is what part of the alphabet would look like if Q and R were eliminated.
 
 11-01-2007, 6:31 AM 30242975 in reply to 30242643  

Re: What about solutions

I have implemented the following tricks:

  1. A few functions for sorting arrays of length up to 12. This provided the most significant performance gain
  2. Iterative version was modified to perform a single memory allocation at startup time. Furthermore, the stack was implemented using an array instead of a list
  3. The recursive version of the program was used as a basis for parallel implementation. The program estimates the  maximal number of threads T, and generates 4*T sorting subtasks in a recursive way. Each task employes the iterative version of the algorithm.
 
View as RSS news feed in XML

Shortcuts


Tags For This Post

...

Community Tags

...