I need to multiply a symmetric sparse matrix A with a dense matrix X (Y = A*X) using multiple threads/cores. The matrices I'm using are adjacency matrices of graphs with a large number of nodes (up to 2 million).
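For context, the data in my code is laid out roughly like this (a simplified sketch; the helper name and the sizes are only illustrative, and the CSR arrays are actually filled from the graph):

    #include <mkl.h>

    // Illustrative setup only; in the real code the CSR arrays are filled
    // from the graph's adjacency structure.
    void allocate_data(MKL_INT m, MKL_INT n, MKL_INT nnz,
                       double *&values, MKL_INT *&rowIndex, MKL_INT *&columns,
                       double **&X, double **&Y)
    {
        // A in 3-array CSR format; only one triangle is stored since A is symmetric.
        values   = new double[nnz];       // non-zero values
        rowIndex = new MKL_INT[m + 1];    // row pointers
        columns  = new MKL_INT[nnz];      // column indices

        // X and Y are kept as n separate column vectors of length m,
        // so X[i] / Y[i] can be handed directly to mkl_dcsrsymv.
        X = new double*[n];
        Y = new double*[n];
        for (MKL_INT i = 0; i < n; i++) {
            X[i] = new double[m];
            Y[i] = new double[m];
        }
    }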
I have tried two approaches:
- mkl_dcsrmm() with matdescra[0] set to 's' (a rough sketch of this call is included after the loop code below).
- mkl_dcsrsymv() in a for-loop, looping over the column vectors of X. Below is the code I used.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        // one column of X per iteration; matdescra[1] holds the 'u'/'l' uplo flag
        mkl_dcsrsymv(&matdescra[1], &m, values, rowIndex, columns, X[i], Y[i]);
    }
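For completeness, my call for option 1 looks roughly like the sketch below. The helper name multiply_option1, the contiguous buffers Xd/Yd, and the 'l' (lower triangle) flag are only for illustration; as far as I understand, zero-based indexing ('c' in matdescra[3]) implies row-major storage for the dense matrices.

    #include <mkl.h>

    // Option 1 sketch: Y = A*X via Sparse BLAS level 3.
    // Xd and Yd are assumed to be contiguous m-by-n buffers in row-major order
    // (leading dimension n), which matches the zero-based indexing flag below.
    void multiply_option1(MKL_INT m, MKL_INT n,
                          double *values, MKL_INT *rowIndex, MKL_INT *columns,
                          double *Xd, double *Yd)
    {
        char transa = 'n';                         // no transpose
        char matdescra[6] = {'s', 'l', 'n', 'c'};  // symmetric, lower triangle stored,
                                                   // non-unit diagonal, zero-based indexing
        double alpha = 1.0, beta = 0.0;            // Y = 1.0*A*X + 0.0*Y
        MKL_INT ldx = n, ldy = n;                  // leading dimensions of row-major X, Y

        mkl_dcsrmm(&transa, &m, &n, &m, &alpha, matdescra,
                   values, columns, rowIndex, rowIndex + 1,
                   Xd, &ldx, &beta, Yd, &ldy);
    }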
Initially, I thought that the first option (Sparse BLAS level 3) should be faster than the second one, but I'm getting the opposite timing results.
Below are timings for a symmetric sparse matrix A with about 1.7M rows/columns and 42M non-zero entries, and a dense matrix X with the same number of rows and 100 columns, with the number of threads set to 2, 4, and 8, respectively.
- option 1 (mkl_dcsrmm): 19.17 sec, 9.38 sec, 5.20 sec
- option 2 (mkl_dcsrsymv loop): 13.26 sec, 6.83 sec, 3.84 sec
Is there any particular reason for this, or am I missing something? It seems that mkl_dcsrmm() should be doing things more efficiently than my for-loop.
I compiled the code with the following command:

    icpc -mkl=parallel -I$(MKLROOT)/include -O3 -openmp -o test test.cpp -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm