-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Post-submission Analysis I
Do I get to write "first!"?
So, in a bit of a marathon coding weekend, I did most of the game of life, and spent most of today adding TBB to it (which had some extra fun, since I had to install it on a Linux system for the first time, but I started that yesterday).
Anyway, some feedback to the TBB people: - the parallel_do loop doesn't exist in the version of TBB (stable) I took, although it's talked about in the docs. That's probablly okay. - Task.Destroy should probably be a static function of the class -- you also avoid the strange this->destroy(*this) syntax if it is. - it would be nice to be able to re-use/refresh task lists. At the moment, I had to always rebuild them new, even though they were always the same. But I'm probably using tasks in a strange way.
Some comments to the contest: I think you get huge gains by shrinking the space used. I currently use 1 byte per cell, although that could have been reduced to 6 bits. I use 32-bits per element on the various lists. I think you get a big gain by not using linked lists -- vectors should be much happier You can combine the four lists into two -- one candidate list, and one toggle list. I think this is also a pretty big gain. I don't knowsorting the lists might bring you. (Does a happier cache balance out the sort work? You can do a linear sort, so maybe.)
I hope you all had fun!
|
|
| |
-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Post-submission Analysis II
Some comments on multi-threading strategy:
I chose to split up the data into strips, one per thread. Excluding a
one-cell-thick border, each strip can be run in parallel (within one
generation). It should be pretty easy to make the strip edges dynamic,
to evenly split the work, but I unfortunately didn't get around to this.
I liked this, because it was fairly simple, and required no locking, although it wasn't as *parallel* as a number of other techniques would be. It also was cache friendly.
I considered trying to use "atomic" but in the end decided against it. What did other people think?
So my technique per generation was: Toggle cells (parallel, no border regions) Toggle cells (serial, borders) Check Candidates (parallel, full regions)
The things I didn't get around to trying 1. dynamic region sizing 2. sorting (would help with 1.) 3. better dependencies, so more parallel (I think extraneous if I manage 1). I.e. the border regions only need to come after their neighbour regions, and can run in parallel too (sort of). Similarly, I can check candidates once I've done a region and its borders.
|
|
| |
|
|
Re: Post-submission Analysis II
I just took the sample code and parallelized it --- using concurrent_vector for the lists, atomics for the map and neighbour counts, and parallel_for to traverse the lists. Initially, I tried using concurrent_queue and parallel_while, but though this was faster than the original (I have a dual core CPU), it was still much slower than the concurrent_vector version --- I guess this is due to the extra memory allocations.
|
|
| |
-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Re: Post-submission Analysis II
anthony_williams:I just took the sample code and parallelized it --- using concurrent_vector for the lists, atomics for the map and neighbour counts, and parallel_for to traverse the lists. Initially, I tried using concurrent_queue and parallel_while, but though this was faster than the original (I have a dual core CPU), it was still much slower than the concurrent_vector version --- I guess this is due to the extra memory allocations.
Interesting! Well, I certainly hope my efforts paid off then, as your way sounds easier. I had been thinking of just using atomics, but there were a few reasons why it looked like it wouldn't be that easy. I wasn't able to do any profiling, so all my changes were "gut feeling". (Okay, usually backed up with a little analysis.)
|
|
| |
|
|
Re: Post-submission Analysis II
My first attempt didn't work, as it kept complaining about bad values in Add or Subtract neighbours. I then realised that the algorithm may add the same cell to one list multiple times, or even to both lists, if cells around it both live and die, so you need to use fetch_and_store in Vivify and Kill in order to only add the cell to the newlive/newdie list if it was actually brought to life by this thread.
Anyway, I guess we'll find out how effective the different techniques are in a few weeks.
|
|
| |
-
lianqing
-
-
-
Joined on 12-23-2007
-
-
Posts 2
-
-
|
Re: Post-submission Analysis II
anthony, how did you implement the map and neighbor counts with atomic?
|
|
| |
|
|
Re: Post-submission Analysis II
anthony, how did you implement the map and neighbor counts with atomic?
In life.h, I changed the typedefs: typedef tbb::atomic<char> Grid[MAXROW+2][MAXCOL+2]; typedef tbb::atomic<int> Gridcount[MAXROW+2][MAXCOL+2]; // number of neighbors As I already said, I had to use fetch_and_store on the map elements to ensure only one thread added each cell to newlive or newdie, and I used the atomic increment and decrement on the neighbour counts.
|
|
| |
-
mt2
-
-
-
Joined on 10-19-2007
-
-
Posts 137
-
-
|
Re: Post-submission Analysis II
What about comparison (from getTickCount, for example) for your solutions: parallel vs non- parallel? I tested many possible solutions, but majority of them are slower than for non-parallel sample :(
|
|
| |
-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Re: Post-submission Analysis II
mt2:What about comparison (from getTickCount, for example) for your solutions: parallel vs non- parallel? I tested many possible solutions, but majority of them are slower than for non-parallel sample :(
I wanted to do this, but time was running so tight that I didn't get to it. I also had problems first integrating tbb into my solution (I am new to Linux's .so quirks), so didn't measure my original solution. I also didn't know the standard (non-tbb) timing for Linux, so I just left it out. I may go back and check. It would be interesting and depressing if my serial version were slower, although that should be almost impossible for a somewhat full board (famous last words!). You raise an interesting point though -- it would be nice to see how the original solutions fares in the final performance comparison.
|
|
| |
-
bugman
-
-
-
Joined on 10-14-2007
-
-
Posts 94
-
-
|
Post-submission Analysis III
I have partitioned the grid into a number of horizontal stripes. Each stripe was processed independently (and in parallel) at each iteration using its own set of lists. The most difficult thing was to glue the stripes together.
Parallel processing was implemented by means of parallel_for. A major surprise was that TBB provided 3 times better performance compared to OpenMP.
However, the most significant performance gain (6 times without any parallelization) was achieved by eliminating new and delete operations being performed on list nodes. I have just introduced another list, which was used to store processed list nodes. These list nodes were recycled later.
|
|
| |
-
bugman
-
-
-
Joined on 10-14-2007
-
-
Posts 94
-
-
|
Re: Post-submission Analysis II
I have also tried to use different types of atomic operations. However, all of them caused my program to be 2 times slower compared to the non-atomic version EVEN WHEN THERE WAS JUST ONE THREAD. It seems that atomic operations, when trying to lock system bus, interact badly with caching hardware, or cause the data to ping-pong between CPU cores. Having spent a couple of evenings trying to find a proper way to use atomic operations, I had to abandon them completely.
|
|
| |
-
bugman
-
-
-
Joined on 10-14-2007
-
-
Posts 94
-
-
|
Re: Post-submission Analysis II
jbstjohn: I chose to split up the data into strips, one per thread. Excluding a one-cell-thick border, each strip can be run in parallel (within one generation).
I liked this, because it was fairly simple, and required no locking, although it wasn't as *parallel* as a number of other techniques would be. It also was cache friendly.
We were thinking in a similar way :-)
About cache: it turned out in my program that horizontal stripes provide 10% better performance compared to vertical stripes. I was a bit surprised to get so significant impact of cache-related issues on a seemingly computation-intensive problem.
|
|
| |
-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Re: Post-submission Analysis II
Regarding the striping direction being important, I'm not that surprised. The amount of computation done per element in the list is actually pretty low, and affects a local neighbourhood. I might have thought vertical stripes would have an even larger effect, but I guess it's not too bad -- you basically have a strip that's 1 x 4 instead of 4 x 1. You have to consider that on a cache fetch, the whole line is fetched, which means for vertical strips, whenever you're grabbing something towards the end of a line, you're (may be) grabbing "useless" data, whereas almost all fetches in the horizontal case are 'productive'. I think the difference would be even more pronounced with more threads, or a smaller board. Regarding atomic, very interesting. What I was wondering about is whether atomic variables can be cached at all. My first reaction is, no, which would help explain the poor performance.
Interesting -- I just realized if I had run my program with 8 threads, rather then 4, I might end up with better performance (even on machines with less than 8 cores), because each thread has its own strip, which would be more cache friendly.
Finally, while I'm amazed that the list implementation made such a huge difference (6x!), I've always thought dynamic allocation is the devil. :D I switched over to a (custom) vector implementation for just this reason. The linked list wastes space (on the pointers, which means bad caching again) and time (on the allocation). I was also considering sorting, which would be a nightmare on a linked list implementation. Still, I never would have guessed it would be such a difference.
One PS also interesting that OpenMP did so poorly. Was that also from Intel, or the visual studio version?
|
|
| |
-
YurySerdyuk
-
-
-
Joined on 10-26-2007
-
-
Posts 55
-
-
|
Re: Post-submission Analysis III
petert@dcn.nord.nw.ru:
I have partitioned the grid into a number of horizontal stripes. Each stripe was processed independently (and in parallel) at each iteration using its own set of lists. The most difficult thing was to glue the stripes together.
In any case, threads must to interact with /interlock each other every iteration.
For short lists, this leads to that the overhead exceeds a useful work again.
I have implemented such kind algorithm in MC# (www.mcsharp.net)
but got a slowing down with several threads.
Do any have good scalability results , say , for the gosper sample ?
|
|
| |
-
bugman
-
-
-
Joined on 10-14-2007
-
-
Posts 94
-
-
|
Re: Post-submission Analysis II
jbstjohn: Regarding atomic, very interesting. What I was wondering about is whether atomic variables can be cached at all. My first reaction is, no, which would help explain the poor performance.
The problem is that atomic semantics applies only to operations. As far as I know, the user-mode program has very little control on caching stuff. And the problems start when different cores are accessing the same memory area too frequently. Furthermore, atomic operations require bus locking, which consumes a lot of time and bus bandwidth.
jbstjohn: One PS also interesting that OpenMP did so poorly. Was that also from Intel, or the visual studio version?
Both of them had these problem. It seems to me that the problem was caused by very simplistic thread load balancing used by OpenMP.
|
|
| |
-
jbstjohn
-
-
-
Joined on 10-12-2007
-
-
Posts 70
-
-
|
Re: Post-submission Analysis II
Has anyone used the affinity operations in TBB? I imagine they could play an important role, but didn't have time to use them at all.
|
|
| |
-
mgarlanger
-
-
-
Joined on 10-11-2007
-
-
Posts 165
-
-
|
Re: Post-submission Analysis II
anthony_williams:I just took the sample code and parallelized it --- using concurrent_vector for the lists, atomics for the map and neighbour counts, and parallel_for to traverse the lists.
Very similar to what I did, but I did not use atomics for the map, just the neighbor counts. I don't think the map array has to be protected.
|
|
| |
|
|
Re: Post-submission Analysis II
mgarlanger: anthony_williams:I just took the sample code and parallelized it --- using concurrent_vector for the lists, atomics for the map and neighbour counts, and parallel_for to traverse the lists.
Very similar to what I did, but I did not use atomics for the map, just the neighbor counts. I don't think the map array has to be protected.
I found I had to use fetch_and_store for updating the map in Vivify and Kill in order to ensure that cells were only added to newlive and newdie once --- cells can end up on maylive or maydie multiple times, if cells around them both live and die this generation. The only alternative I saw was to sort the lists, and eliminate duplicates, but I didn't try that.
|
|
| |