Hi,
I had a problem with Pardiso in the past (https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/...) and, thanks to Alex, we were able to come up with a solution in 2015.
Now I am running on an i7-8700K PC (3.70 GHz, 6 cores, 12 threads) with MKL 2017 Update 3. We found that the Pardiso solve phase does not scale at all. NOTE: the times below are solve times only, since we factorize the matrix once and solve thousands of times.
The test matrix has 5811 DOF and 618378 non-zero elements (about 1.8% of the entries are non-zero). It was ordered with Metis.
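For reference, the 1.8% figure is just nnz divided by n squared. A minimal sketch of that arithmetic (the function name is only for illustration):

```cpp
#include <cstdint>

// Percentage of entries that are non-zero in an n-by-n sparse matrix.
double density_percent(std::int64_t n, std::int64_t nnz) {
    return 100.0 * static_cast<double>(nnz)
                 / (static_cast<double>(n) * static_cast<double>(n));
}
```

Calling density_percent(5811, 618378) gives roughly 1.83.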
I set up the options, load the matrix, factorize it, and then solve with the same RHS 1000 times.
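The timing of the repeated solves is done roughly as in the sketch below. This is not our production code: time_repeated_solves is a name made up for this post, and the callable stands in for one Pardiso solve call (phase 33) on an already factorized matrix.

```cpp
#include <chrono>

// Runs `solve` `repetitions` times and returns the total wall-clock time in seconds.
// `solve` is any callable taking no arguments, for example a lambda wrapping
// a single solve on the factorized matrix.
template <typename Solve>
double time_repeated_solves(Solve&& solve, int repetitions) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repetitions; ++i) {
        solve();
    }
    const std::chrono::duration<double> elapsed = clock::now() - start;
    return elapsed.count();
}
```

A steady clock is used so that the measurement is not affected by system clock adjustments during the run.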
Here are the problems.
1) When I set the MKL thread number to 6 (the number of physical cores), Pardiso runs on 6 cores no matter what value of i I pass to Domain_Set_Num_Threads(i, MKL_DOMAIN_PARDISO).
The code looks like this (where NT is the number of threads: 1, 2, 4, 6):
GetMKL_Service()->Set_Num_Threads(6);
GetMKL_Service()->Domain_Set_Num_Threads(NT, 4); //4 stands for MKL_DOMAIN_PARDISO
SetOption(3, NT); // set Pardiso Option[3] to NT;
Here is the Task Manager screenshot:
https://drive.google.com/file/d/1-hNBA2a82qIyy4DiZ0WXSveRqGuWZ2yN/view?u...
NOTE a) There is a lot of red (internal operation) while Pardiso is running. b) The memory keeps creeping up even though I call phase=-1 after the test for each thread count.
Here are the timings:
1 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 8.6750000000 s to run on 1 threads.
2 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 8.1410000000 s to run on 2 threads.
4 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 8.1460000000 s to run on 4 threads.
6 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 7.9120000000 s to run on 6 threads.
Pardiso scaling 1.0000000220 1 threads
Pardiso scaling 1.0655939308 2 threads
Pardiso scaling 1.0649398712 4 threads
Pardiso scaling 1.0964358178 6 threads
So there is pretty much no gain in solve time going from 1 thread to 6 threads.
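The "Pardiso scaling" numbers above are the 1-thread time divided by the N-thread time. A small sketch of that calculation, plus the parallel efficiency (speedup divided by thread count), which I did not print in the log; the function names are only for illustration:

```cpp
// Speedup of an N-thread run relative to the 1-thread run.
double speedup(double t1_seconds, double tn_seconds) {
    return t1_seconds / tn_seconds;
}

// Parallel efficiency: 1.0 would be perfect scaling on `threads` threads.
double parallel_efficiency(double t1_seconds, double tn_seconds, int threads) {
    return speedup(t1_seconds, tn_seconds) / threads;
}
```

With the times above, speedup(8.675, 7.912) is about 1.096, so the efficiency on 6 threads is only about 0.18.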
2) When I set both MKL and Pardiso to use the same number of threads NT:
GetMKL_Service()->Set_Num_Threads(NT);
GetMKL_Service()->Domain_Set_Num_Threads(NT, 4); //4 stands for MKL_DOMAIN_PARDISO
SetOption(3, NT); // set Pardiso Option[3] to NT;
Here is the Task Manager screenshot:
https://drive.google.com/file/d/1H377rFFTYmEqWxTvuzGUnJhBMqD2aEZ1/view?u...
NOTE 1) At least now Pardiso runs with different numbers of threads, consistent with Domain_Set_Num_Threads and Pardiso Option[3]. 2) There is still a lot of red (spinning threads). 3) The memory still creeps up after each run.
Here is the data:
1 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 8.3920000000 s to run on 1 threads.
2 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 7.7710000000 s to run on 2 threads.
4 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 7.6470000000 s to run on 4 threads.
6 threads are running for Pardiso.solve 1000 times
Pardiso.solve takes: 8.0690000000 s to run on 6 threads.
Pardiso scaling 1.0000000236 1 threads
Pardiso scaling 1.0799125207 2 threads
Pardiso scaling 1.0974238523 4 threads
Pardiso scaling 1.0400297680 6 threads
We still use Pardiso the same way as in our earlier post (https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/...). We want Pardiso to use NT threads, but other MKL functions (GEMM, GESVD, etc.) to use 1 thread, because they are parallelized through TBB. But we could not make that happen.
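To be explicit about the configuration we are after, here is a sketch of the intent. It is written against any object that exposes the two wrapper methods used above (Set_Num_Threads and Domain_Set_Num_Threads); their exact signatures are an assumption here, and RecordingService is a stand-in invented for this post so that the sketch compiles without MKL. Whether MKL actually honors this combination is exactly what I am asking about.

```cpp
// Domain id used in the snippets above (4 stands for MKL_DOMAIN_PARDISO).
constexpr int kDomainPardiso = 4;

// Intended setup: 1 thread for MKL in general, `pardiso_threads` for Pardiso only.
// Service is any type with the two wrapper methods named in this post.
template <typename Service>
void configure_threads(Service& mkl, int pardiso_threads) {
    mkl.Set_Num_Threads(1);                                       // GEMM, GESVD, etc.
    mkl.Domain_Set_Num_Threads(pardiso_threads, kDomainPardiso);  // Pardiso only
}

// Hypothetical stand-in that only records what was requested.
struct RecordingService {
    int global_threads = 0;
    int pardiso_threads = 0;
    void Set_Num_Threads(int n) { global_threads = n; }
    void Domain_Set_Num_Threads(int n, int domain) {
        if (domain == kDomainPardiso) pardiso_threads = n;
    }
};
```

In our code the call would look like configure_threads(*GetMKL_Service(), NT).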
The test matrix can be downloaded from here (it is also attached as a zip file):
https://drive.google.com/file/d/1wcl8cRaKq704-nFwlScgLTbTIdmhIWdd/view?u...
The format is like this
# of rows
IA
# of NNZ
JA, Valr, Vali
You read it in like this
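// Needs <fstream>; the m_* names are members of our matrix class.
std::ifstream is("DSparseDebug.txt", std::ios_base::in);
if (is) {
    float val, valr, vali;
    // Header value; the row count is taken as val - 1.
    is >> val;
    m_nRows = static_cast<int>(val) - 1;
    m_nCols = m_nRows;
    // IA: row pointer array with m_nRows + 1 entries.
    m_row_ptr.resize(m_nRows + 1);
    for (int ir = 0; ir < m_nRows + 1; ++ir)
        is >> m_row_ptr[ir];
    // Number of non-zeros.
    is >> val;
    m_nnz = static_cast<int>(val);
    m_col_ind.resize(m_nnz);
    m_val.resize(m_nnz);
    // One line per non-zero: column index (JA), real part, imaginary part.
    for (int iz = 0; iz < m_nnz; ++iz) {
        is >> val >> valr >> vali;
        m_col_ind[iz] = static_cast<int>(val);
        m_val[iz] = complex(valr, vali);
    }
}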
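After reading the file it may be worth checking that IA and JA form a structurally valid CSR matrix before handing them to Pardiso. The sketch below is one way to do that; csr_is_consistent is a name made up for this post, and the index base (0 or 1) is passed in because I am not asserting here which one the file uses.

```cpp
#include <vector>

// Returns true when row_ptr and col_ind describe a structurally valid square
// CSR matrix whose indices start at `base` (0 or 1, supplied by the caller).
bool csr_is_consistent(const std::vector<int>& row_ptr,
                       const std::vector<int>& col_ind,
                       int base) {
    if (row_ptr.empty()) return false;
    const int n = static_cast<int>(row_ptr.size()) - 1;
    if (row_ptr.front() != base) return false;
    // The last row pointer must point one past the final non-zero.
    if (row_ptr.back() - base != static_cast<int>(col_ind.size())) return false;
    for (int r = 0; r < n; ++r) {
        if (row_ptr[r] > row_ptr[r + 1]) return false;  // must be non-decreasing
    }
    for (int c : col_ind) {
        if (c < base || c >= base + n) return false;    // column out of range
    }
    return true;
}
```

This only checks structure. It does not check that column indices are sorted within each row or that the values are sensible.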
In Pardiso's defense, I did see some speedup in the factorization step.
Factor Step:
Pardiso scaling 1.0000000019 1 threads
Pardiso scaling 1.5593220369 2 threads
Pardiso scaling 2.1904761947 4 threads
Pardiso scaling 2.4864864913 6 threads
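One way to compare the factor and solve phases is the Karp-Flatt metric, which estimates the serial fraction from a measured speedup S on p threads as (1/S - 1/p) / (1 - 1/p). A sketch, with a function name chosen only for this post:

```cpp
// Karp-Flatt metric: experimentally determined serial fraction for a
// measured speedup on `threads` threads (threads must be greater than 1).
double serial_fraction(double measured_speedup, int threads) {
    const double inv_p = 1.0 / threads;
    return (1.0 / measured_speedup - inv_p) / (1.0 - inv_p);
}
```

For the 6-thread factor step (speedup 2.4864864913) this gives about 0.28, while for the 6-thread solve in case 1 (speedup 1.0964358178) it gives about 0.89. This is only a rough diagnostic, since it lumps memory-bandwidth limits and threading overhead into the "serial" part.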
However, in our application we use the matrix as a preconditioner, so we solve with it thousands of times.
One of my coworkers pointed out this passage in the manual:
IPARM(3): Number of processors. Input. On entry: IPARM(3) must contain the number of processors that are available for parallel execution. The number must be equal to the OpenMP environment variable OMP_NUM_THREADS. Note: If the user has not explicitly set OMP_NUM_THREADS, then this value can be set by the operating system to the maximal numbers of processors on the system. It is therefore always recommended to control the parallel execution of the solver by explicitly setting OMP_NUM_THREADS. If fewer processors are available than specified, the execution may slow down instead of speeding up. There is no default value for IPARM(3).
I never set OMP_NUM_THREADS, so IPARM(3) cannot be equal to it. Will this cause a problem? Also, I found these two pages, which describe IPARM(3) quite differently. Which one should I use?
https://software.intel.com/en-us/mkl-developer-reference-fortran-pardiso...
https://pardiso-project.org/manual/manual.pdf
In the Intel manual, iparm(3) is described as "Reserved. Set to zero." So how do I set Pardiso to run with a number of threads different from the number of physical cores?
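If the pardiso-project wording applies, the check it implies could be sketched as below. The helper names are made up for this post. The sketch only handles a single positive integer, so a list value such as "4,2" is treated as invalid, and it says nothing about whether Intel's Pardiso looks at this variable at all.

```cpp
#include <cstdlib>

// Parses a thread-count string such as the value of OMP_NUM_THREADS.
// Returns 0 when the text is null (variable unset), empty, or not a
// single positive integer.
int parse_thread_count(const char* text) {
    if (text == nullptr || *text == '\0') return 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (*end != '\0' || value <= 0 || value > (1L << 20)) return 0;
    return static_cast<int>(value);
}

// True when the environment value is set and equals the value given as IPARM(3).
bool env_matches_iparm3(const char* env_value, int iparm3) {
    const int n = parse_thread_count(env_value);
    return n != 0 && n == iparm3;
}
```

Usage would be env_matches_iparm3(std::getenv("OMP_NUM_THREADS"), NT). In my runs the variable is not set, so this would return false for every NT.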
Thanks