I have been playing around with Intel’s Threading Building Block for a while and have started to really appreciate its simplicity and elegance: Instead of thinking in threads and thread synchronizations, one can just simply concentrate on the problem on the hand. Take finding prime numbers for example, while the problem itself (using the most rudimentary algorithm) is quite simple, getting it to work in a multi-threaded fashion does take a little bit of work. In this particular example, the prime finding algorithm can be easily paralleled by utilizing threads and thread synchronization is almost a non-issue since the problem domain can be divided into totally disjoint regions, but in general dividing the problem domain into multiple sub-domains and performing load balancing among them could take significant work. In the following example, I created two C++ classes that both find prime numbers for a given interval (since all prime numbers are odd numbers except 2, 2 is omitted in the calculates below), one sequential and the other parallel. In the main function, both methods are timed and the results are outputted.(download tbbprime.cpp)
For a very large interval (e.g. 3~1,000,000), the TBB version of the prime program achieved a 4x speed up given a reasonably large grain size (e.g. 100). Smaller grain size resulted in slightly more overhead. On a quad-core machine (Q9450 @ 3.2GHz), it took 217.83 ms for the single threaded routine to find all the prime numbers within 1,000,000, whereas it only took 58.32 ms for the TBB version, which runs roughly four times as fast. The TBB framework took care of dividing the task according to the number of processors automatically.