Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Problem when solving large system using Scalapack PDGESV


A parallel Fortran code that solves a set of simultaneous linear equations Ax = b using the ScaLAPACK routine PDGESV fails (exiting with a segmentation fault) when the number of equations, N, becomes large. I have not identified the exact value of N at which problems arise, but the code works for every value I have tested up to N = 50000 and fails at N = 94423.

In particular, the failure appears to occur during the call to the ScaLAPACK routine (i.e. not when allocating or deallocating memory): execution enters PDGESV but never returns from it.

I have prepared a small, simple Fortran example code (see attachment below) that exhibits this problem. The code simply 1) allocates space for the matrix A and vector b, 2) fills them with random entries, 3) calls PDGESV and then 4) deallocates the memory. It has been tested on a variety of matrix sizes (N x N) and with various BLACS process grids without any errors until N becomes large.

The problem does not seem to be a lack of memory: the machine on which I run the code has 192 GB available, whereas the code uses only 65 GB when N = 94423. I have tried the 'ulimit -s unlimited' command, but this did not resolve the problem. My feeling is that I may instead be exceeding some default limit on the memory available to a single MPI process, i.e. perhaps I am simply missing some appropriate flags at compile time or run time?
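
As a rough check of the memory figure, the footprint of a dense N x N double-precision matrix can be estimated directly. The sketch below is in Python rather than Fortran purely for brevity. It assumes 8 bytes per entry (the D in PDGESV denotes double precision) and ignores the right-hand side and any workspace; the function name is illustrative only.

```python
def dense_matrix_bytes(n, bytes_per_element=8):
    """Bytes needed to store a dense n x n matrix (assumes 8-byte reals)."""
    return n * n * bytes_per_element


N = 94423                 # failing problem size quoted in the post
AVAILABLE_BYTES = 192e9   # memory reported as available on the machine

total = dense_matrix_bytes(N)
print(f"matrix A needs about {total / 1e9:.1f} GB")
print(f"fits in available memory: {total < AVAILABLE_BYTES}")
```

This gives roughly 71 GB for the whole matrix, of the same order as the 65 GB observed and well below the 192 GB available, which supports the view that total memory is not the limiting factor.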

I am running the program on a Linux cluster running Red Hat Enterprise Linux Server release 7.3 (Maipo).

I compiled the following code with:

mpiifort -mcmodel=medium    -m64  -mkl=cluster  -o para.exe  solve_by_lu_parallelmpi_simple_light2.for

 

and ran it with, for example (here N = 9445):

mpiexec.hydra  -n 4 ./para.exe  9445 2 2 32

The command-line arguments here select N = 9445 and a 2x2 BLACS process grid with block size 32.
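
To make the meaning of these arguments concrete, the local dimensions each process holds under a block-cyclic distribution can be computed by hand. The sketch below is a Python re-implementation of the arithmetic performed by the ScaLAPACK tool routine NUMROC; it assumes the example code obtains MLOC and NLOC this way (the attachment is not reproduced here) and that the distribution starts at process 0.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Number of rows or columns of a block-cyclically distributed
    dimension of global size n, block size nb, owned by process iproc
    (mirrors the logic of ScaLAPACK's NUMROC)."""
    mydist = (nprocs + iproc - isrcproc) % nprocs  # distance from source process
    nblocks = n // nb                              # number of complete blocks
    local = (nblocks // nprocs) * nb               # blocks every process receives
    extra = nblocks % nprocs                       # leftover complete blocks
    if mydist < extra:
        local += nb        # this process gets one more complete block
    elif mydist == extra:
        local += n % nb    # this process gets the final partial block
    return local


# N = 9445, block size 32, 2 process rows (or columns)
for coord in range(2):
    print(coord, numroc(9445, 32, coord, 0, 2))
```

For N = 9445 this yields 4736 for coordinate 0 and 4709 for coordinate 1, which are the MLOC, NLOC values printed in the output that follows.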

For this smaller matrix size the program runs without any problems, producing the output

WE ARE SOLVING A SYSTEM OF         9445  LINEAR EQUATIONS
 PROC:            0           0 HAS  MLOC, NLOC =        4736        4736
 PROC:            0           0  ALLOCATING SPACE ...
 PROC:            0           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1 HAS  MLOC, NLOC =        4736        4709
 PROC:            0           1  ALLOCATING SPACE ...
 PROC:            0           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           0 HAS  MLOC, NLOC =        4709        4736
 PROC:            1           0  ALLOCATING SPACE ...
 PROC:            1           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1 HAS  MLOC, NLOC =        4709        4709
 PROC:            1           1  ALLOCATING SPACE ...
 PROC:            1           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 
 INFO code returned by PDGESV =            0

So far so good. But when I try to solve a larger system (N = 94423) using

mpiexec.hydra -n $NUM_PROCS ./para.exe  94423 2 2 32

the program crashes during the call to PDGESV with the output

WE ARE SOLVING A SYSTEM OF        94423  LINEAR EQUATIONS
 PROC:            0           0 HAS  MLOC, NLOC =       47223       47223
 PROC:            0           0  ALLOCATING SPACE ...
 PROC:            0           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1 HAS  MLOC, NLOC =       47223       47200
 PROC:            0           1  ALLOCATING SPACE ...
 PROC:            0           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           0 HAS  MLOC, NLOC =       47200       47223
 PROC:            1           0  ALLOCATING SPACE ...
 PROC:            1           1 HAS  MLOC, NLOC =       47200       47200
 PROC:            1           1  ALLOCATING SPACE ...
 PROC:            1           0  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            1           1  CONSTRUCTING MATRIX A AND RHS VECTOR B ...
 PROC:            0           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            0           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           1
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..
 PROC:            1           0
 NOW SOLVING SYSTEM AX = B USING SCALAPACK PDGESV ..

forrtl: 致命的なエラー (154): 配列インデックスが境界外です。
Image              PC                Routine            Line        Source             
libifcore.so.5     00002B0D716C19AF  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B0D712335D0  Unknown               Unknown  Unknown
libmkl_avx512.so   00002B11A45E5A47  mkl_blas_avx512_x     Unknown  Unknown
libmkl_intel_lp64  00002B0D68E8BB55  dger_                 Unknown  Unknown
libmkl_scalapack_  00002B0D69F972AE  pdger_                Unknown  Unknown
libmkl_scalapack_  00002B0D69E53541  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B0D69E53688  pdgetf3_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C2E13B  pdgetf2_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C2E836  pdgetrf2_             Unknown  Unknown
libmkl_scalapack_  00002B0D6A014F6E  pdgetrf_              Unknown  Unknown
libmkl_scalapack_  00002B0D69C29C7D  pdgesv_               Unknown  Unknown
para.exe           0000000000401F8C  Unknown               Unknown  Unknown
para.exe           00000000004011BE  Unknown               Unknown  Unknown
libc-2.17.so       00002B0D73DFC3D5  __libc_start_main     Unknown  Unknown
para.exe           00000000004010C9  Unknown               Unknown  Unknown

The first error line (beginning "forrtl:") is in Japanese and translates as

forrtl: Fatal error (154): Array index out of bounds.

The problem seems to be occurring somewhere inside the ScaLAPACK routines.

Does anyone have any recommendations or possible solutions?
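
One hypothesis worth checking, not confirmed here as the cause, is 32-bit integer overflow. The stack trace names libmkl_intel_lp64, and the LP64 interface of MKL uses 32-bit integers, so any element offset into a local array larger than 2^31 - 1 cannot be represented. The sketch below, in Python for brevity, compares the local array sizes printed in the two runs against that limit; the function names are illustrative only.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer


def local_element_count(mloc, nloc):
    """Number of entries in an mloc x nloc local array."""
    return mloc * nloc


def fits_in_int32(mloc, nloc):
    """True if every element offset of the local array fits in 32 bits."""
    return local_element_count(mloc, nloc) <= INT32_MAX


# (MLOC, NLOC) pairs taken from the program output for each run
runs = {
    "N=9445": [(4736, 4736), (4736, 4709), (4709, 4736), (4709, 4709)],
    "N=94423": [(47223, 47223), (47223, 47200), (47200, 47223), (47200, 47200)],
}

for label, shapes in runs.items():
    for mloc, nloc in shapes:
        count = local_element_count(mloc, nloc)
        print(label, mloc, nloc, count, fits_in_int32(mloc, nloc))
```

Every local array in the N = 9445 run is far below the limit, whereas all four local arrays in the N = 94423 run exceed it. If this is indeed the cause, spreading the matrix over a larger process grid (so each local array is smaller) or linking against the ILP64 interface with 64-bit integers would be the usual things to try.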

 Any advice or pointers will be gratefully received,

     Many thanks,

             Dan.

 

 

