A few days ago, I ran across this article by Dmitri Nesteruk, in which he compared the performance of C# and C++ in matrix multiplication. From the data he provided, matrix multiplication using C# is two to three times slower than using C++ in comparable situations.
Even though a lot of optimization has been done in the .NET runtime to make it more efficient, it is apparent that scientific programming still favors C and C++ because the performance advantage is huge.
In this article, I will examine some commonly used matrix multiplication algorithms and illustrate their relative efficiencies. All the tests were done in C++ only, with matrix sizes ranging from 500×500 to 2000×2000. When the matrices are small (e.g. <50), you can use pretty much any matrix multiplication algorithm without observing any significant performance difference. This is largely because the typical stable matrix multiplication algorithms are O(n^3) (each of the n^2 output entries requires n multiply-add operations), and at small sizes array operation overheads outweigh the benefit of algorithmic efficiency. But for matrices of larger dimensions, the efficiency of the multiplication algorithm becomes extremely important.
Since Dmitri's article has already captured fairly detailed data for the standard matrix multiplication algorithm, I will not repeat his findings here. What I intend to show is the performance data for uBLAS, OpenMP, cBLAS, and MATLAB.
The following sample code was compiled under Ubuntu 8.10 64-bit (kernel 2.6.24.23) on an Intel Q9450 @ 3.2GHz.
Standard Matrix Multiplication (Single Threaded)
This is our reference code. Later on, I will show only the critical portion of the code and not repeat the common code that initializes and finalizes the arrays. Similarly, the timing method is the same across all the tests and will be omitted later on.
```cpp
#include <cstdlib>
#include <iostream>
#include <sys/time.h>

using namespace std;

float **A, **B, **C;

// Allocate the matrices.
A = new float*[matrix_size];
B = new float*[matrix_size];
C = new float*[matrix_size];

for (int i = 0; i < matrix_size; i++) {
    A[i] = new float[matrix_size];
    B[i] = new float[matrix_size];
    C[i] = new float[matrix_size];
}

// Initialize A and B with random values.
for (int i = 0; i < matrix_size; i++) {
    for (int j = 0; j < matrix_size; j++) {
        A[i][j] = rand();
        B[i][j] = rand();
    }
}

timeval t1, t2, t;
gettimeofday(&t1, NULL);

// Standard triple-loop multiplication: C = A * B.
for (int i = 0; i < matrix_size; i++) {
    for (int j = 0; j < matrix_size; j++) {
        C[i][j] = 0;
        for (int k = 0; k < matrix_size; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}

gettimeofday(&t2, NULL);
timersub(&t2, &t1, &t);
cout << t.tv_sec + t.tv_usec / 1000000.0 << " Seconds -- Standard" << endl;

// Free the matrices (note the array form of delete).
for (int i = 0; i < matrix_size; i++) {
    delete[] A[i];
    delete[] B[i];
    delete[] C[i];
}
delete[] A;
delete[] B;
delete[] C;
```
OpenMP With Two Dimensional Arrays
Using OpenMP, we are able to use multiple threads via the #pragma omp directives. For the simple algorithm used here, the speed increase is almost proportional to the number of available cores in the system.
```cpp
...
#pragma omp parallel for shared(a, b, c)
for (long i = 0; i < matrix_size; i++) {
    for (long j = 0; j < matrix_size; j++) {
        float sum = 0;
        for (long k = 0; k < matrix_size; k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] = sum;
    }
}
...
```
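How much speedup you actually get depends on how many threads the OpenMP runtime launches. If you want to verify or pin the thread count, a minimal sketch looks like the following (assuming GCC with -fopenmp; the count of 4 is just an illustrative value, not something used in the benchmarks):

```cpp
#include <omp.h>
#include <iostream>

int main()
{
    // Report how many threads a parallel region would use by default.
    std::cout << "Max OpenMP threads: " << omp_get_max_threads() << std::endl;

    // Optionally override the default; setting the OMP_NUM_THREADS
    // environment variable achieves the same thing without recompiling.
    omp_set_num_threads(4);
    std::cout << "Now capped at " << omp_get_max_threads() << " threads" << std::endl;

    return 0;
}
```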
OpenMP With One Dimensional Arrays
Cache locality is poor in the simple algorithm shown above. The performance can be improved easily, however, by improving the locality of the references. One way to achieve better cache locality is to use one-dimensional arrays instead of two-dimensional arrays, and as you will see later, the following implementation achieves speed gains of as much as 50% over the previous OpenMP implementation that uses two-dimensional arrays.
```cpp
float *a, *b, *c;

a = new float[matrix_size * matrix_size];
b = new float[matrix_size * matrix_size];
c = new float[matrix_size * matrix_size];

for (long i = 0; i < matrix_size * matrix_size; i++) {
    a[i] = rand();
    b[i] = rand();
    c[i] = 0;
}

#pragma omp parallel for shared(a, b, c)
for (long i = 0; i < matrix_size; i++) {
    for (long j = 0; j < matrix_size; j++) {
        long idx = i * matrix_size;
        float sum = 0;
        for (long k = 0; k < matrix_size; k++) {
            sum += a[idx + k] * b[k * matrix_size + j];
        }
        c[idx + j] = sum;
    }
}

delete[] a;
delete[] b;
delete[] c;
```
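For what it is worth (this variant is not part of the benchmarks below), locality can be pushed a little further by transposing b once up front so that the innermost loop walks both operands sequentially. A minimal sketch over the same row-major 1D layout, with bt as a hypothetical scratch buffer:

```cpp
// Sketch only: copy b into its transpose bt so that the innermost loop
// reads both a and bt sequentially. Same square, row-major layout as above.
float *bt = new float[matrix_size * matrix_size];

for (long i = 0; i < matrix_size; i++)
    for (long j = 0; j < matrix_size; j++)
        bt[j * matrix_size + i] = b[i * matrix_size + j];

#pragma omp parallel for shared(a, bt, c)
for (long i = 0; i < matrix_size; i++) {
    for (long j = 0; j < matrix_size; j++) {
        float sum = 0;
        for (long k = 0; k < matrix_size; k++) {
            // a is walked along row i, bt along row j -- both are contiguous.
            sum += a[i * matrix_size + k] * bt[j * matrix_size + k];
        }
        c[i * matrix_size + j] = sum;
    }
}

delete[] bt;
```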
Boost Library uBLAS (Single Threaded)
The Boost library provides a convenient way to perform matrix multiplication. However, its performance is very poor compared to all the other approaches mentioned in this article. The performance of the uBLAS implementation is largely on par with that of C# (see the benchmarks toward the end of the article). Intel's Math Kernel Library (MKL) 10.1 does provide functionality to dynamically convert code using uBLAS syntax into highly efficient code using MKL, through the inclusion of the header file mkl_boost_ublas_matrix_prod.hpp. I have not tried it myself, but the performance should be comparable to that of algorithms using the native MKL BLAS interface.
By default (without MKL's uBLAS capability), uBLAS is single threaded, and given its poor performance I would strongly suggest avoiding uBLAS in any high-performance scientific application.
```cpp
#include <cstdlib>
#include <boost/numeric/ublas/matrix.hpp>

using namespace boost::numeric::ublas;

matrix<float> A, B, C;

A.resize(matrix_size, matrix_size);
B.resize(matrix_size, matrix_size);

for (int i = 0; i < matrix_size; ++i) {
    for (int j = 0; j < matrix_size; ++j) {
        A(i, j) = rand();
        B(i, j) = rand();
    }
}

C = prod(A, B);
```
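For completeness, and based purely on Intel's description above (I have not tested this, and multiply is just a hypothetical wrapper), switching the listing over to MKL's accelerated prod() would presumably amount to including the MKL substitution header and linking the program against MKL:

```cpp
// Assumed usage only, based on the MKL 10.1 description above -- untested here.
// The header is said to substitute MKL kernels for uBLAS prod() calls;
// the program must also be linked against the MKL libraries.
#include <mkl_boost_ublas_matrix_prod.hpp>
#include <boost/numeric/ublas/matrix.hpp>
#include <cstdlib>

using namespace boost::numeric::ublas;

void multiply(int matrix_size)
{
    matrix<float> A(matrix_size, matrix_size), B(matrix_size, matrix_size), C;

    for (int i = 0; i < matrix_size; ++i)
        for (int j = 0; j < matrix_size; ++j) {
            A(i, j) = rand();
            B(i, j) = rand();
        }

    C = prod(A, B);   // dispatched to MKL, per Intel's documentation
}
```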
Intel Math Kernel Library (MKL) cBLAS
Intel's Math Kernel Library (MKL) is highly optimized for Intel's microprocessor platforms. Given that Intel developed this library for its own processors, we can expect significant performance gains. I am still surprised at how fast the code runs using cBLAS, though. In fact, it was so fast that I doubted the validity of the results at first. But after checking the results against those obtained by other means, those doubts were put to rest.
The cBLAS matrix multiplication uses a blocked matrix multiplication method that further improves cache locality, and it is more than thirty times faster than the fastest OMP 1D algorithm listed above! Another benefit is that, by default, it automatically detects the number of CPUs/cores available and uses all available threads. This greatly simplifies the code, since threading is handled transparently within the library.
```cpp
#include <cstdlib>
#include <mkl_cblas.h>

float *A, *B, *C;

A = new float[matrix_size * matrix_size];
B = new float[matrix_size * matrix_size];
C = new float[matrix_size * matrix_size];

for (int i = 0; i < matrix_size * matrix_size; i++) {
    A[i] = rand();
    B[i] = rand();
    C[i] = 0;
}

// C = 1.0 * A * B + 0.0 * C, with all matrices stored in row-major order.
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
            matrix_size, matrix_size, matrix_size,
            1.0, A, matrix_size,
            B, matrix_size,
            0.0, C, matrix_size);
```
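To give a feel for the blocking idea mentioned above, here is a minimal single-threaded sketch of a blocked (tiled) multiplication over the same row-major 1D layout, using a hypothetical blocked_multiply helper. The tile size of 64 is an arbitrary illustrative choice, and this is of course nowhere near as tuned as MKL's implementation:

```cpp
#include <algorithm>   // for std::min

// Sketch of blocked (tiled) multiplication: c = a * b for square n x n
// matrices stored as row-major 1D arrays. The tile size is illustrative only.
void blocked_multiply(const float *a, const float *b, float *c, long n)
{
    const long block = 64;

    for (long i = 0; i < n * n; i++)
        c[i] = 0;

    for (long ii = 0; ii < n; ii += block)
        for (long kk = 0; kk < n; kk += block)
            for (long jj = 0; jj < n; jj += block)
                // Multiply the current tile of a with the current tile of b,
                // accumulating into the corresponding tile of c.
                for (long i = ii; i < std::min(ii + block, n); i++)
                    for (long k = kk; k < std::min(kk + block, n); k++) {
                        float aik = a[i * n + k];
                        for (long j = jj; j < std::min(jj + block, n); j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}
```

Each set of tiles stays in cache while it is being reused, which is part of why the blocked approach keeps scaling well as the matrices grow.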
MATLAB (Single Threaded)
MATLAB is known for its efficient algorithms; in fact, it uses BLAS libraries for its own matrix calculation routines. The version of MATLAB I have is a little dated (7.0.1), but it is nevertheless interesting to see how its performance compares with that of the latest MKL. MATLAB 7 is single threaded, and given the same matrix size, it runs roughly three times slower than the fastest MKL routine listed above (per core).
a = rand(i,i); b = rand(i,i); tic; c = a*b; t = toc
The following table shows the results I obtained by running the code listed above. The results are times in seconds (S.TH means single threaded and M.TH means multi-threaded).
Size/Algorithm | uBLAS S.TH | STD S.TH | OMP 2D | OMP 1D | MATLAB S.TH | cBLAS M.TH |
--- | --- | --- | --- | --- | --- | --- |
500×500 | 3.2435 | 0.5253 | 0.1939 | 0.0536 | 0.0810 | 0.0206 |
600×600 | 5.7854 | 0.9349 | 0.3223 | 0.1655 | 0.1410 | 0.0093 |
700×700 | 9.2292 | 1.2928 | 0.3529 | 0.2797 | 0.2230 | 0.0122 |
800×800 | 13.7711 | 2.3746 | 0.7259 | 0.4135 | 0.3320 | 0.0310 |
900×900 | 20.3245 | 3.4983 | 1.0146 | 0.7449 | 0.4740 | 0.0306 |
1000×1000 | 28.8345 | 3.4983 | 1.4748 | 1.0548 | 0.6530 | 0.0700 |
1100×1100 | 38.2545 | 7.0240 | 1.9383 | 1.6257 | 0.8620 | 0.1250 |
1200×1200 | 50.4964 | 9.9319 | 2.8411 | 2.1215 | 1.1170 | 0.0440 |
1300×1300 | 64.5064 | 12.8344 | 3.6277 | 2.9720 | 1.4250 | 0.0440 |
1400×1400 | 81.1826 | 17.1119 | 4.8309 | 3.5977 | 1.7760 | 0.0938 |
1500×1500 | 100.1330 | 21.0622 | 6.1689 | 4.8022 | 2.1870 | 0.1111 |
1600×1600 | 120.3400 | 26.4316 | 7.3189 | 5.0451 | 2.6490 | 0.1699 |
1700×1700 | 145.8550 | 31.2706 | 8.7525 | 6.8915 | 3.1870 | 0.1452 |
1800×1800 | 174.6860 | 38.9293 | 11.1060 | 8.1316 | 3.7940 | 0.1989 |
1900×1900 | 206.0520 | 45.8589 | 13.0832 | 9.9527 | 4.4450 | 0.2725 |
2000×2000 | 240.7820 | 55.4392 | 16.0542 | 11.0314 | 5.1820 | 0.3359 |
The following figure shows the results. Since uBLAS and the single-threaded standard multiplication took significantly longer to compute, I did not include them in the figure below.
The following figure shows the same data but uses a log-scale Y axis instead so that all the data can be displayed together. You can get a sense of the various algorithms' efficiencies here: