A parallel Fortran code that solves a set of simultaneous linear equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but the code works for every value I have tested up to N = 50000 and fails at N = 94423.
In particular, the failure appears to occur during the call to the ScaLAPACK routine (i.e. not when allocating or deallocating memory): execution enters PDGESV but never returns from it.
I have prepared a small Fortran example code (see attachment below) that exhibits this problem. This code simply 1) allocates space for the matrix A and vector b, 2) fills them with random values, 3) calls PDGESV and then 4) deallocates the memory. The code has been tested on a variety of matrix sizes (NxN) and with various BLACS process grids without any errors until N becomes large.
The problem does not seem to be a lack of memory: the machine on which I run the code has 192 GB available, whereas the code uses only about 65 GB when N = 94423. I have tried the 'ulimit -s unlimited' command, but this did not resolve the problem. My feeling is that I may instead be exceeding some default limit on the memory available to a single MPI process, i.e. perhaps I am simply missing some appropriate flags at compile time or run time?
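For reference, the 65 GB figure is consistent with a back-of-envelope estimate. The following is a minimal Python sketch (not part of the Fortran program; the function name is made up for illustration) which assumes that A is stored as dense 8-byte double-precision values and ignores b, the pivot vector and any library workspace:

```python
def dense_matrix_bytes(n, bytes_per_entry=8):
    """Bytes needed to store a dense n x n matrix of double-precision reals."""
    return n * n * bytes_per_entry

n = 94423
total = dense_matrix_bytes(n)
gib = 2**30

# Whole matrix A, summed over all processes
print(f"total for A:       {total / gib:.1f} GiB")
# On a 2x2 process grid each process holds roughly a quarter of A
print(f"per process (2x2): {total / 4 / gib:.1f} GiB")
```

This gives roughly 66 GiB in total, i.e. about 16.6 GiB per process on a 2x2 grid, which is well below the 192 GB on the node.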
I am running the program on a Linux cluster using Red Hat Enterprise Linux Server release 7.3 (Maipo).
I compiled the code with:
mpiifort -mcmodel=medium -m64 -mkl=cluster -o para.exe solve_by_lu_parallelmpi_simple_light2.for
and ran it using (for example, with N = 9445)
mpiexec.hydra -n 4 ./para.exe 9445 2 2 32
The command-line arguments here select N = 9445 and a 2x2 BLACS process grid with block size 32.
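For reference, the local sizes MLOC and NLOC that each process reports in the output are what the usual block-cyclic rule gives for these arguments. Below is a Python transcription of the logic of ScaLAPACK's NUMROC, as a sketch only: it assumes the first block lives on process row/column 0, and it is written in Python rather than Fortran purely for illustration.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) of a block-cyclically distributed dimension of
    size n, with block size nb, that are owned by process iproc."""
    # Distance of this process from the one holding the first block
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                    # number of complete blocks
    count = (nblocks // nprocs) * nb     # blocks every process receives
    extrablks = nblocks % nprocs         # leftover complete blocks
    if mydist < extrablks:
        count += nb                      # one extra complete block
    elif mydist == extrablks:
        count += n % nb                  # the final partial block
    return count

# Block size 32 on a grid dimension of 2 processes (assumed source process 0)
for n in (9445, 94423):
    print(n, [numroc(n, 32, p, 0, 2) for p in range(2)])
```

For N = 9445 this gives 4736 and 4709, and for N = 94423 it gives 47223 and 47200, which are the values the program prints.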
For this smaller matrix size the program runs without any problems, producing the output
WE ARE SOLVING A SYSTEM OF 9445 LINEAR EQUATIONS
PROC: 0 0 HAS MLOC, NLOC = 4736 4736
PROC: 0 0 ALLOCATING SPACE ...
PROC: 0 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 0 1 HAS MLOC, NLOC = 4736 4709
PROC: 0 1 ALLOCATING SPACE ...
PROC: 0 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 0 HAS MLOC, NLOC = 4709 4736
PROC: 1 0 ALLOCATING SPACE ...
PROC: 1 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 1 HAS MLOC, NLOC = 4709 4709
PROC: 1 1 ALLOCATING SPACE ...
PROC: 1 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 1 0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 0 1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 0 0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
INFO code returned by PDGESV = 0
So far so good. But when I try to solve a larger system using
mpiexec.hydra -n $NUM_PROCS ./para.exe 94423 2 2 32
the program crashes during the call to PDGESV with the output
WE ARE SOLVING A SYSTEM OF 94423 LINEAR EQUATIONS
PROC: 0 0 HAS MLOC, NLOC = 47223 47223
PROC: 0 0 ALLOCATING SPACE ...
PROC: 0 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 0 1 HAS MLOC, NLOC = 47223 47200
PROC: 0 1 ALLOCATING SPACE ...
PROC: 0 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 0 HAS MLOC, NLOC = 47200 47223
PROC: 1 0 ALLOCATING SPACE ...
PROC: 1 1 HAS MLOC, NLOC = 47200 47200
PROC: 1 1 ALLOCATING SPACE ...
PROC: 1 0 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 1 1 CONSTRUCTING MATRIX A AND RHS VECTOR B ...
PROC: 0 1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 0 0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 1 1
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
PROC: 1 0
NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image PC Routine Line Source
libifcore.so.5 00002B0D716C19AF for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B0D712335D0 Unknown Unknown Unknown
libmkl_avx512.so 00002B11A45E5A47 mkl_blas_avx512_x Unknown Unknown
libmkl_intel_lp64 00002B0D68E8BB55 dger_ Unknown Unknown
libmkl_scalapack_ 00002B0D69F972AE pdger_ Unknown Unknown
libmkl_scalapack_ 00002B0D69E53541 pdgetf3_ Unknown Unknown
libmkl_scalapack_ 00002B0D69E53688 pdgetf3_ Unknown Unknown
libmkl_scalapack_ 00002B0D69C2E13B pdgetf2_ Unknown Unknown
libmkl_scalapack_ 00002B0D69C2E836 pdgetrf2_ Unknown Unknown
libmkl_scalapack_ 00002B0D6A014F6E pdgetrf_ Unknown Unknown
libmkl_scalapack_ 00002B0D69C29C7D pdgesv_ Unknown Unknown
para.exe 0000000000401F8C Unknown Unknown Unknown
para.exe 00000000004011BE Unknown Unknown Unknown
libc-2.17.so 00002B0D73DFC3D5 __libc_start_main Unknown Unknown
para.exe 00000000004010C9 Unknown Unknown Unknown
The first error line, beginning with forrtl:, can be translated as
forrtl: Fatal error (154): Array index out of bounds.
The problem seems to be occurring somewhere in the ScaLAPACK routines.
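One thing I have not ruled out: the traceback goes through libmkl_intel_lp64, which is the MKL interface with 32-bit integers. The quick check below (a Python sketch, not part of the Fortran code; the helper name is made up for illustration) tests whether the local block on process (0, 0) has more elements than a signed 32-bit integer can index, using the MLOC and NLOC values copied from the two outputs above. Whether this is actually what triggers the error inside dger_ I cannot say.

```python
INT32_MAX = 2**31 - 1  # largest signed 32-bit integer

# (MLOC, NLOC) on process (0, 0), copied from the two program outputs
local_shapes = {9445: (4736, 4736), 94423: (47223, 47223)}

def exceeds_int32(mloc, nloc):
    """True if a local mloc x nloc array has more elements than a
    signed 32-bit integer can count."""
    return mloc * nloc > INT32_MAX

for n, (mloc, nloc) in local_shapes.items():
    print(n, mloc * nloc, exceeds_int32(mloc, nloc))
```

For N = 94423 the local block has 2,230,011,729 elements, which is above the 32-bit limit of 2,147,483,647, whereas for N = 9445 it has about 22 million.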
Does anyone have any recommendations or possible solutions?
Any advice or pointers will be gratefully received,
Many thanks,
Dan.