Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Pardiso Threadripper 2990wx versus Ryzen 1700


I run the same multi-physics finite element code, which generates the same matrix, on two machines. An old machine with a Ryzen 1700 (8 cores) is faster than a Threadripper 2990WX (32 cores). Both use Windows 10, intel64, and mkl_rt.lib; the MKL versions are 2018.1.156 on the Ryzen 1700 and 2019.0.117 on the Threadripper. I can provide an example matrix if it helps. Here are the options, which are the same on both builds:

 

#include <algorithm> // std::fill

struct pardiso_struct
{
    void *pt[64];        // internal solver handle, must be zeroed before the first call
    int maxfct{ 1 };
    int mnum{ 1 };
    int mtype{ 11 };     // real nonsymmetric
    int n{ 0 };
    int idum{ 0 };       // dummy, not used by PARDISO when iparm[4] != 1
    int nrhs{ 1 };
    int iparm[64];
    int msglvl{ 1 };     // print statistics
    double ddum{ 0. };
    int error{ 0 };

    pardiso_struct()
    {
        // fill(pt, pt + 64, void(0)); does not compile, so zero the handle in a loop
        for (int i = 0; i < 64; ++i)
            pt[i] = nullptr;
        std::fill(iparm, iparm + 64, 0);

        iparm[0] = 1;    // 0 for all defaults, != 0 for any custom values
        iparm[1] = 3;    // 0 minimum degree, 2 METIS, 3 OpenMP METIS
        // iparm[2]      reserved
        iparm[3] = 0;    // for iterative methods
        iparm[4] = 0;    // user fill-in reducing permutation
        iparm[5] = 0;    // 0 solution written to x, 1 solution written to b
        // iparm[6]      output: number of iterative refinement steps
        iparm[7] = 0;    // iterative refinement steps
        // iparm[8]      reserved
        iparm[9] = 13;   // pivoting, 13 for nonsymmetric, 8 for symmetric
        iparm[10] = 1;   // 0 no scaling, 1 scaling (1 is the default for nonsymmetric)

        iparm[12] = 1;   // 0 to disable weighted matching? 1 is the default for nonsymmetric
        // iparm[13]-iparm[19] outputs
        // iparm[20]     special pivoting for symmetric indefinite
        // iparm[21]     output: number of positive eigenvalues
        // iparm[22]     output: number of negative eigenvalues
        iparm[23] = 1;   // 0 classic algorithm, 1 OpenMP scalable for > 8 procs
        iparm[24] = 0;   // 0 parallel solve, 1 sequential solve
        // iparm[25]     reserved
        iparm[26] = 0;   // 0 do not check sparse matrix, 1 check sparse matrix
        iparm[27] = 0;   // 0 double precision, 1 single precision
        // iparm[28]     reserved
        // iparm[29]     output: zero or negative pivots in symmetric case
        // iparm[30]     only solve for certain components...?
        // iparm[31], iparm[32] reserved
        // iparm[33]     some reproducibility stuff
        iparm[34] = 1;   // 0 one-based indexing, 1 zero-based indexing
        // iparm[35]     something with Schur complements
        iparm[36] = 0;   // 0 CSR, > 0 BSR, < 0 convert to BSR
        // iparm[59]     out-of-core options
    }
};

 

 

The reordering and factorization results are below. The solve phase (omitted here) is also slower on the 2990WX, but my main concern is the numerical factorization time.

*************** Ryzen 7 1700 **********************

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.847928 s
Time spent in reordering of the initial matrix (reorder)         : 7.678907 s
Time spent in symbolic factorization (symbfct)                   : 2.075314 s
Time spent in data preparations for factorization (parlist)      : 0.098494 s
Time spent in allocation of internal data structures (malloc)    : 4.281882 s
Time spent in additional calculations                            : 3.785140 s
Total time spent                                                 : 18.767665 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    795666
             size of largest supernode:               9159
             number of non-zeros in L:                673935341
             number of non-zeros in U:                631031607
             number of non-zeros in L+U:              1304966948

=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 53.846398 s
Time spent in allocation of internal data structures (malloc)    : 0.000878 s
Time spent in additional calculations                            : 0.000001 s
Total time spent                                                 : 53.847277 s

Statistics:
===========
Parallel Direct Factorization is running on 8 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    795666
             size of largest supernode:               9159
             number of non-zeros in L:                673935341
             number of non-zeros in U:                631031607
             number of non-zeros in L+U:              1304966948
             gflop   for the numerical factorization: 2903.934836

             gflop/s for the numerical factorization: 53.929973

 

 

 

 

****************** Threadripper 2990wx *********************************

=== PARDISO: solving a real nonsymmetric system ===
0-based array is turned ON
PARDISO double precision computation is turned ON
Parallel METIS algorithm at reorder step is turned ON
Scaling is turned ON
Matching is turned ON

Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 0.919861 s
Time spent in reordering of the initial matrix (reorder)         : 10.085178 s
Time spent in symbolic factorization (symbfct)                   : 2.207123 s
Time spent in data preparations for factorization (parlist)      : 0.101967 s
Time spent in allocation of internal data structures (malloc)    : 3.143640 s
Time spent in additional calculations                            : 3.677500 s
Total time spent                                                 : 20.135269 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    794723
             size of largest supernode:               7005
             number of non-zeros in L:                683894639
             number of non-zeros in U:                640539323
             number of non-zeros in L+U:              1324433962

=== PARDISO: solving a real nonsymmetric system ===
Two-level factorization algorithm is turned ON

Summary: ( factorization phase )
================

Times:
======
Time spent in copying matrix to internal data structure (A to LU): 0.000000 s
Time spent in factorization step (numfct)                        : 61.520888 s
Time spent in allocation of internal data structures (malloc)    : 0.001112 s
Time spent in additional calculations                            : 0.000002 s
Total time spent                                                 : 61.522003 s

Statistics:
===========
Parallel Direct Factorization is running on 32 OpenMP

< Linear system Ax = b >
             number of equations:           1928754
             number of non-zeros in A:      46843184
             number of non-zeros in A (%): 0.001259

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 72
             number of independent subgraphs:  0
             number of supernodes:                    794723
             size of largest supernode:               7005
             number of non-zeros in L:                683894639
             number of non-zeros in U:                640539323
             number of non-zeros in L+U:              1324433962
             gflop   for the numerical factorization: 2879.931235

             gflop/s for the numerical factorization: 46.812250

 

Nearly 2 million unknowns should provide enough work for every core. Manually limiting the run to 16 threads gives a modest speedup (53 seconds for numerical factorization, versus 61.5 seconds with 32 threads), which suggests to me that this is a PARDISO scaling issue rather than a hardware issue, although it may also be due to the memory architecture of the 2990WX.

Any suggestions?

 

