Hi,
We are currently developing a distributed version of our C++ finite element program. We planned to use the Intel Parallel Direct Sparse Solver for Clusters (cluster_sparse_solver), but it seems we cannot reach good scalability with our settings. The matrix is assumed non-symmetric and is built in the distributed CSR (DCSR) format.
The test case is a simple thermal diffusion problem on a square grid. Problem sizes ranging from 1M to 25M DOF have been tested with many combinations of MPI processes and OpenMP threads (usually with one MPI process per node or per socket). The memory allocated at the factorization phase scales down, but we observe only a small speed-up in running time.
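For reference, the rows of the global matrix are simply split into contiguous blocks, one per MPI rank, to build the distributed CSR input; a simplified sketch of that split (illustrative names, not the actual code) is:

// Simplified sketch of the contiguous row split used for the distributed
// CSR input; zero-based indexing, names are illustrative.
#include <mpi.h>
#include <algorithm>
#include <cstdint>

struct RowRange { std::int64_t first; std::int64_t last; };   // inclusive bounds

RowRange local_row_range(std::int64_t n_global_rows, MPI_Comm comm)
{
    int rank = 0, size = 1;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    const std::int64_t base = n_global_rows / size;
    const std::int64_t rem  = n_global_rows % size;

    // The first 'rem' ranks receive one extra row to keep the split balanced.
    const std::int64_t first = rank * base + std::min<std::int64_t>(rank, rem);
    const std::int64_t count = base + (rank < rem ? 1 : 0);
    return { first, first + count - 1 };   // later passed to iparm(40)/iparm(41)
}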
More precisely, we observed the following behavior:
- Symbolic factorization benefits from more MPI processes but is not affected by threads.
- Numerical factorization scales with the number of OpenMP threads and sometimes with the number of MPI processes.
- Most of the time, the results show no significant gain in the solve phase from either form of parallelization.
I must be doing something wrong, but I can't seem to find what it is.
Thanks a lot for any advice
The following iparm values are used (zero-based C indexing):
iparm(0) = 1;
iparm(1) = 10;
iparm(7) = 2;
iparm(9) = 13;
iparm(10) = 1;
iparm(12) = 1;
iparm(34) = 1;
iparm(39) = 2;
iparm(40) = first row index of the local matrix block;
iparm(41) = last row index of the local matrix block;
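For completeness, the call sequence around these settings looks roughly like the sketch below (variable names are illustrative and error handling is omitted; the actual code checks the error argument after each phase):

// Simplified sketch of the cluster_sparse_solver call sequence with the
// iparm values listed above (real unsymmetric matrix, distributed CSR input,
// zero-based indexing). Names and structure are illustrative.
#include <mpi.h>
#include <mkl_cluster_sparse_solver.h>
#include <vector>

void solve_distributed(const std::vector<MKL_INT>& ia,   // local row pointers
                       const std::vector<MKL_INT>& ja,   // local column indices
                       const std::vector<double>&  a,    // local values
                       std::vector<double>&        b,    // local RHS
                       std::vector<double>&        x,    // local solution
                       MKL_INT n_global,                 // global number of rows
                       MKL_INT first_row, MKL_INT last_row,
                       MPI_Comm comm)
{
    void*   pt[64]    = {};      // internal solver handle, must start zeroed
    MKL_INT iparm[64] = {};
    MKL_INT maxfct = 1, mnum = 1, mtype = 11;   // real, non-symmetric
    MKL_INT nrhs = 1, msglvl = 1, error = 0;
    MKL_INT phase;

    iparm[0]  = 1;          // do not use default values
    iparm[1]  = 10;         // MPI version of nested dissection
    iparm[7]  = 2;          // max number of iterative refinement steps
    iparm[9]  = 13;         // pivoting perturbation 1e-13
    iparm[10] = 1;          // scaling
    iparm[12] = 1;          // weighted matching
    iparm[34] = 1;          // zero-based indexing
    iparm[39] = 2;          // distributed CSR input, distributed RHS and solution
    iparm[40] = first_row;  // first row of the local block
    iparm[41] = last_row;   // last row of the local block

    int fcomm = MPI_Comm_c2f(comm);   // solver expects a Fortran communicator

    phase = 11;  // analysis / symbolic factorization
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n_global,
                          a.data(), ia.data(), ja.data(), nullptr, &nrhs,
                          iparm, &msglvl, nullptr, nullptr, &fcomm, &error);

    phase = 22;  // numerical factorization
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n_global,
                          a.data(), ia.data(), ja.data(), nullptr, &nrhs,
                          iparm, &msglvl, nullptr, nullptr, &fcomm, &error);

    phase = 33;  // forward/backward solve with iterative refinement
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n_global,
                          a.data(), ia.data(), ja.data(), nullptr, &nrhs,
                          iparm, &msglvl, b.data(), x.data(), &fcomm, &error);

    phase = -1;  // release internal memory
    cluster_sparse_solver(pt, &maxfct, &mnum, &mtype, &phase, &n_global,
                          a.data(), ia.data(), ja.data(), nullptr, &nrhs,
                          iparm, &msglvl, nullptr, nullptr, &fcomm, &error);
}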
The code is compiled with the 2017 Intel compiler and Intel MPI. The compilation flags used are -O3 -qopenmp -mkl=parallel.
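The link line is essentially of the following form (file names are illustrative; as far as I understand, the MPI-specific BLACS library from MKL has to be added explicitly on top of -mkl=parallel for the cluster solver):

mpiicpc -O3 -qopenmp -mkl=parallel fem_solver.o -o fem_solver -lmkl_blacs_intelmpi_lp64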