I have the same multi-physics finite element code generating the same matrix on two machines, and the older machine with a Ryzen 1700 (8 cores) is faster than the one with a Threadripper 2990WX (32 cores). Both run Windows 10 and link against intel64 mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if that helps. Here are the PARDISO options, which are the same in both builds:
#include <algorithm> // std::fill

struct pardiso_struct
{
    void *pt[64];    // internal solver memory pointers; must be zeroed before the first call
    int maxfct{ 1 };
    int mnum{ 1 };
    int mtype{ 11 }; // real nonsymmetric
    int n{ 0 };
    int idum{ 0 };   // dummy perm array; not used by PARDISO when iparm[4] != 1
    int nrhs{ 1 };
    int iparm[64];
    int msglvl{ 1 };
    double ddum{ 0. };
    int error{ 0 };

    pardiso_struct()
    {
        std::fill(pt, pt + 64, nullptr); // fill(pt, pt + 64, void(0)) does not compile
        std::fill(iparm, iparm + 64, 0);
        iparm[0] = 1;   // 0 for all defaults, != 0 for any custom values
        iparm[1] = 3;   // 0 minimum degree, 2 METIS, 3 OpenMP METIS
        //iparm[2]      reserved
        iparm[3] = 0;   // 0 disables preconditioned CGS/CG iterations
        iparm[4] = 0;   // user fill-in reducing permutation
        iparm[5] = 0;   // 0 - solution written to x, 1 - solution overwrites b
        //iparm[6]      output: number of iterative refinement steps performed
        iparm[7] = 0;   // iterative refinement steps
        //iparm[8]      reserved
        iparm[9] = 13;  // pivoting perturbation: 13 for nonsymmetric, 8 for symmetric
        iparm[10] = 1;  // 0 no scaling, 1 scaling (default for nonsymmetric)
        iparm[12] = 1;  // 0 disables weighted matching, 1 default for nonsymmetric
        //iparm[13]-iparm[19] outputs
        //iparm[20]     special pivoting for symmetric indefinite matrices
        //iparm[21]     output: number of positive eigenvalues
        //iparm[22]     output: number of negative eigenvalues
        iparm[23] = 1;  // 0 classic algorithm, 1 two-level algorithm (scales better above 8 threads)
        iparm[24] = 0;  // 0 parallel solve, 1 sequential solve
        //iparm[25]     reserved
        iparm[26] = 0;  // 0 do not check the sparse matrix, 1 check it
        iparm[27] = 0;  // 0 double precision, 1 single precision
        //iparm[28]     reserved
        //iparm[29]     output: zero or negative pivots for symmetric matrices
        //iparm[30]     partial solve for selected solution components
        //iparm[31], iparm[32] reserved
        //iparm[33]     CNR-mode reproducibility options
        iparm[34] = 1;  // 0 one-based indexing, 1 zero-based indexing
        //iparm[35]     Schur complement options
        iparm[36] = 0;  // 0 CSR, > 0 BSR, < 0 convert to BSR
        //iparm[59]     out-of-core (OOC) options
    }
};
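For context, here is a minimal sketch of how the struct is driven through the analysis, factorization, and solve phases. The driver name and the matrix arrays (a, ia, ja in zero-based CSR, matching iparm[34] = 1) are placeholders, and it assumes the LP64 interface where MKL_INT is int, consistent with the int members above:

#include <mkl.h>

// Illustrative driver; a, ia, ja describe the assembled CSR matrix.
void factor_and_solve(pardiso_struct& p, int n,
                      double* a, int* ia, int* ja,
                      double* b, double* x)
{
    p.n = n;
    int phase = 11; // analysis / reordering
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);

    phase = 22; // numerical factorization (the step timed below)
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);

    phase = 33; // forward/backward solve
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            a, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            b, x, &p.error);

    phase = -1; // release all internal memory
    pardiso(p.pt, &p.maxfct, &p.mnum, &p.mtype, &phase, &p.n,
            &p.ddum, ia, ja, &p.idum, &p.nrhs, p.iparm, &p.msglvl,
            &p.ddum, &p.ddum, &p.error);
}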
The reordering and factorization results are below. The solve phase (omitted here) is also slower on the 2990WX, but my main concern is the numerical factorization time.
*************** Ryzen 7 1700 **********************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder) : 7.678907 s
Time spent in symbolic factorization (symbfct) : 2.075314 s
Time spent in data preparations for factorization (parlist) : 0.098494 s
Time spent in allocation of internal data structures (malloc) : 4.281882 s
Time spent in additional calculations : 3.785140 s
Total time spent : 18.767665 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 53.846398 s
Time spent in allocation of internal data structures (malloc) : 0.000878 s
Time spent in additional calculations : 0.000001 s
Total time spent : 53.847277 s
Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 795666
size of largest supernode: 9159
number of non-zeros in L: 673935341
number of non-zeros in U: 631031607
number of non-zeros in L+U: 1304966948
gflop for the numerical factorization: 2903.934836
gflop/s for the numerical factorization: 53.929973
****************** Threadripper 2990wx *********************************
=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON
Summary: ( reordering phase )
================
Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder) : 10.085178 s
Time spent in symbolic factorization (symbfct) : 2.207123 s
Time spent in data preparations for factorization (parlist) : 0.101967 s
Time spent in allocation of internal data structures (malloc) : 3.143640 s
Time spent in additional calculations : 3.677500 s
Total time spent : 20.135269 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON
Summary: ( factorization phase )
================
Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct) : 61.520888 s
Time spent in allocation of internal data structures (malloc) : 0.001112 s
Time spent in additional calculations : 0.000002 s
Total time spent : 61.522003 s
Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP
< Linear system Ax = b >
number of equations: 1928754
number of non-zeros in A: 46843184
number of non-zeros in A (%): 0.001259
number of right-hand sides: 1
< Factors L and U >
number of columns for each panel: 72
number of independent subgraphs: 0
number of supernodes: 794723
size of largest supernode: 7005
number of non-zeros in L: 683894639
number of non-zeros in U: 640539323
number of non-zeros in L+U: 1324433962
gflop for the numerical factorization: 2879.931235
gflop/s for the numerical factorization: 46.812250
Nearly 2 million unknowns should provide enough work for every core. Manually capping the run at 16 threads gives a modest speedup (about 53 seconds for the numerical factorization), which suggests to me that this is a PARDISO scaling issue rather than a hardware limit, although it could also stem from the memory architecture of the 2990WX, where only two of the four dies have directly attached memory.
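For reference, the 16-thread run simply caps MKL's thread pool before any PARDISO call (setting MKL_NUM_THREADS=16 in the environment before launching has the same effect):

#include <mkl.h>

int main()
{
    mkl_set_num_threads(16); // cap MKL at 16 OpenMP threads for all subsequent calls
    // ... assemble the matrix and run the PARDISO phases as usual ...
    return 0;
}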
Any suggestions?