Let me see if I can address your questions and concerns:
1) Actually, Quicksort seems like an excellent problem for parallelism. After partitioning the data around the pivot, each "half" of the data is independently sortable. The more the data is split, the more independent pieces there are for parallel execution. I agree that the traditional recursive version is harder to thread than many other algorithms. However, we feel that the iterative version is much more amenable to parallelism. (Besides, if we made the first problem really easy, it wouldn't be much of a Challenge, would it? :-)
2) The included input file is there only for an example of the format. It will not be used in the final judging since, as you state, it is much too small. Rest assured that the data used for judging will contain many more items to be sorted. We suggest that you create your won large test files for your own testing.
3) Unfortunately, as in real life, we don't know how many items will be included in the data files for judging. In fact, I would expect that there may be several files with differing numbers of items in each. Thus, rather than tune for a specific number of data points, you should tune the algorithm to scale to many different numbers of items to be sorted. To generate a data file, we would suggest a simple loop that use a random number generator to create an input file.
4) How many data items will fit into cache? 1 million? 10 million? 100 million? Different processors have different architectural characteristics. Tuning for one specific architecture or cache size will not make the application portable. You are free to add logic to your code that determines the size of the current cache and optimizes for that size.
I hope this addresses the issues you've raised and answers some of your questions.
--clay
"It's all very complicated and would take a scientist to explain it." -- MST3K