It might be a heavier solution than you want, but you could consider Intel Threading Building Blocks (TBB) for C++ programs (and for C programs, if you are fine with some use of C++). I think TBB could give you good data locality, because:
a) it supports two-dimensional data distribution, i.e. you can process your matrices by blocks, not only by rows or columns;
b) its affinity_partitioner feature lets it replay a distribution of chunks among threads that is close to the one used in a previous run;
c) it uses a work-stealing scheduler, which has the property you asked for: "each thread taking chunks from its stack until it empties and then starts stealing from other processors stacks".
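To make points a) and b) more concrete, here is a minimal sketch of what such a loop could look like. The Matrix type, the relax_block kernel, the run_sweeps helper and the grain size of 64 are all hypothetical, made up for illustration; only tbb::parallel_for, tbb::blocked_range2d and tbb::affinity_partitioner are TBB names. The TBB path is compiled only when USE_TBB is defined (an assumed macro, not something TBB provides), so the same file also builds without the library.

```cpp
// Build with TBB:    g++ -O2 -DUSE_TBB example.cpp -ltbb
// Build without TBB: g++ -O2 example.cpp   (sequential fallback, same result)
#include <cassert>
#include <cstddef>
#include <vector>

#ifdef USE_TBB
#include <tbb/blocked_range2d.h>
#include <tbb/parallel_for.h>
#include <tbb/partitioner.h>
#endif

// Hypothetical row-major matrix type, used only for this illustration.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;
    Matrix(std::size_t r, std::size_t c, double v = 0.0)
        : rows(r), cols(c), data(r * c, v) {}
    double& at(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};

// Hypothetical kernel: update one rectangular block [r0, r1) x [c0, c1) in place.
// Every element is updated independently, so blocks can run in any order.
inline void relax_block(Matrix& m, std::size_t r0, std::size_t r1,
                        std::size_t c0, std::size_t c1) {
    for (std::size_t i = r0; i < r1; ++i)
        for (std::size_t j = c0; j < c1; ++j)
            m.at(i, j) = 0.5 * m.at(i, j) + 1.0;
}

// Runs `sweeps` passes over the whole matrix.
inline void run_sweeps(Matrix& m, int sweeps) {
#ifdef USE_TBB
    // The partitioner object lives across all sweeps: it records how blocks
    // were distributed and tries to repeat that placement in later sweeps.
    tbb::affinity_partitioner ap;
    const std::size_t grain = 64;  // assumed block edge; tune for your cache
    for (int s = 0; s < sweeps; ++s) {
        tbb::parallel_for(
            tbb::blocked_range2d<std::size_t>(0, m.rows, grain, 0, m.cols, grain),
            [&m](const tbb::blocked_range2d<std::size_t>& r) {
                relax_block(m, r.rows().begin(), r.rows().end(),
                            r.cols().begin(), r.cols().end());
            },
            ap);
    }
#else
    // Fallback without TBB: one block covering the whole matrix per sweep.
    for (int s = 0; s < sweeps; ++s)
        relax_block(m, 0, m.rows, 0, m.cols);
#endif
}
```

Calling run_sweeps(m, n) on a Matrix runs n passes. The important detail is that the same affinity_partitioner object is passed to every parallel_for call; creating a new one inside the loop would lose the placement history that point b) relies on.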
If you are interested, I will be glad to answer your questions here or at the TBB forum.
Alexey Kukanov
TBB developer