In an earlier post, I created a simple prime-finding program using Intel's TBB (Threading Building Blocks). The main benefit of using TBB is that threading and thread synchronization mechanisms are abstracted away within the TBB library, so we do not need to deal with threads explicitly. Also, TBB is optimized for performance and scales nicely as the number of processing units increases. In this post, I will show you how to create a Mandelbrot set generator using TBB and how to optimize the algorithm using loop unrolling.
The standard algorithm for generating the Mandelbrot set is extremely easy to adapt to TBB. In fact, the loops look almost identical to those in the single-threaded approach, except that the iterations are computed within a 2D range block (blocked_range2d) instead of over the entire two-dimensional space.
Because Xlib by itself is not thread-safe, special attention must be paid when updating the display concurrently. One way to address this issue is to employ a shared memory region (X11/extensions/XShm.h and sys/shm.h): the image is first built in memory, and the shared memory segment is then attached to the display. In my examples above I used code (video.h, xvideo.cpp) from the samples that come with the TBB library, which uses this shared memory method to make the X11 calls thread-safe.
Many optimization methods can be used to further enhance the performance of the algorithm. One of the most efficient is to utilize the SSE instructions found on all modern Intel processors (examples can be found here: Using SSE3 Technology in Algorithms with Complex Arithmetic). This approach, however, can be difficult to implement and debug, since parallel data structures must be used in order to benefit from SSE instructions. Also, explicit assembly-level coding makes porting the code to other machine architectures a daunting task. Modern compilers can already take good advantage of the underlying machine architecture; for example, the gcc compiler (4.2.3) already generates SSE instructions for the code snippet above. While hand tweaking with SSE instructions might further improve the performance, we would certainly sacrifice code simplicity and portability.
The approach I am going to take to optimize the code further is loop unrolling. Since the inner loop of the standard algorithm is quite short, unrolling it should lessen the loop overhead and decrease the chance of stalling the pipeline on a mispredicted branch. So a high-level loop unrolling should be able to improve the performance.
Here is the code after the inner loop is unrolled:
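A minimal standalone sketch of the idea — iterating the orbits of two points in lockstep, so that two independent z = z^2 + c dependency chains are in flight at once (names and structure are my own, not a copy of the original listing):

```cpp
// Iterate the orbits of two points c0 and c1 in lockstep. Interleaving two
// independent chains gives the CPU instruction-level parallelism to exploit,
// since each chain's multiplications depend only on its own previous values.
// The iteration counts are identical to iterating each point alone.
void escape_pair(double cr0, double ci0, double cr1, double ci1,
                 int max_iter, int* n0, int* n1) {
    double zr0 = 0.0, zi0 = 0.0, zr1 = 0.0, zi1 = 0.0;
    int i0 = 0, i1 = 0;
    bool live0 = true, live1 = true;
    while (live0 || live1) {
        if (live0) {
            if (i0 == max_iter || zr0 * zr0 + zi0 * zi0 > 4.0) {
                live0 = false;  // first orbit escaped or hit the cap
            } else {
                double t = zr0 * zr0 - zi0 * zi0 + cr0;
                zi0 = 2.0 * zr0 * zi0 + ci0;
                zr0 = t;
                ++i0;
            }
        }
        if (live1) {
            if (i1 == max_iter || zr1 * zr1 + zi1 * zi1 > 4.0) {
                live1 = false;  // second orbit escaped or hit the cap
            } else {
                double t = zr1 * zr1 - zi1 * zi1 + cr1;
                zi1 = 2.0 * zr1 * zi1 + ci1;
                zr1 = t;
                ++i1;
            }
        }
    }
    *n0 = i0;
    *n1 = i1;
}
```

The inner loop of the renderer then consumes pixels two at a time (with a single-point fallback for an odd row width). The gain comes from the two updates having no data dependence on each other, so an out-of-order core can overlap them.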
This code generates results identical to the code mentioned previously. As you can see, the inner loop is now unrolled to handle two data points at a time.
As it turns out, this algorithm runs almost twice as fast as the code mentioned earlier (280 ms versus 510 ms on an Intel Q9450 @ 3.4GHz).
Source code for this article can be downloaded here.