Hi,
If I run the compiled executable on the head node without limiting the number of CPUs or the amount of memory, PARDISO runs to completion and gives correct results on the Linux server. On Windows, the code always works. However, PARDISO crashes randomly when I limit the memory usage with qsub on the Linux server (see runCWNLAT.sub below). The crashes happen without any error message, and PARDISO does not return any error information. Could you please help me find out what the reason might be?
My C++ code calls several functions from Fortran code, and PARDISO is called from the Fortran code. In one execution, PARDISO is called many times. A simplified version of the Fortran code that calls PARDISO is:
…
! Init or set PARDISO parameters
maxfct = 1
mnum = 1
mtype = 6 ! complex and symmetric matrix
phase = 13 ! analysis, numerical factorization, solve, iterative refinement
nrhs = n_recei + 1 ! number of right-hand sides that need to be solved for
msglvl = 0 ! if msglvl=1, print statistical information
error = 0
call pardisoinit(pt, mtype, iparm) ! init pardiso with default parameters in accordance with the matrix type
iparm(4) = 0 ! no iterative solver, use direct algorithm
iparm(28) = 0 ! use type double precision "double complex" instead of "complex"
iparm(35) = 0 ! one-based indexing (Fortran-style indexing)
! Solve A*u = f with mkl PARDISO
call pardiso(pt, maxfct, mnum, mtype, phase, &
& n_totNodes, & ! num of rows of A, ~ num of equations in A*u = f
& csrA_vals, csrA_rows, csrA_cols, & ! CSR3 A
& perm, nrhs, iparm, msglvl, &
& f, & ! right-hand side vector/matrix
& u, & ! solution vector/matrix
& error)
if (error /= 0) then
write(6,*) 'ERROR during PARDISO backslash! Error = ', error
stop "*** ERROR during PARDISO backslash! ***"
endif
phase = -1 ! release all internal memory held by PARDISO
call pardiso(pt, maxfct, mnum, mtype, phase, n_totNodes, &
& dummy, csrA_rows, csrA_cols, perm, nrhs, iparm, msglvl, &
& dummy, dummy, error)
…
The matrix A above is a sparse complex symmetric matrix with about 1 million to 2 million nonzeros. I measured the peak memory usage during execution at about 3 GB, but PARDISO crashes even when I assign 16 GB of memory on the server.
In the makefile, I first compile the C++ and Fortran source code to object files, then link them together. Here are the details:
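For reference, one way to check a peak-memory figure like this on Linux is the report printed by GNU time with `-v`. The sketch below assumes `/usr/bin/time` is GNU time; `peak_rss_gb` is a hypothetical helper name, and this is not necessarily how the 3 GB figure above was obtained.

```shell
# Hypothetical helper: read GNU "time -v" output on stdin, find the
# "Maximum resident set size (kbytes)" line, and print the value in GB
# (1 GB taken as 1024*1024 kB).
peak_rss_gb() {
    awk -F': ' '/Maximum resident set size/ { printf "%.2f\n", $2 / 1048576 }'
}

# Usage sketch (GNU time writes its report to stderr):
#   /usr/bin/time -v ./CWNLAT ENmodel_33.DAT 2>&1 >/dev/null | peak_rss_gb
```

Running this once on the head node and once inside the qsub job would show whether the resident memory differs between the two environments.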
- Operating system and version
-bash-4.1$ lsb_release -a
LSB Version: :core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: RedHatEnterpriseServer
Description: Red Hat Enterprise Linux Server release 6.3 (Santiago)
Release: 6.3
Codename: Santiago
- Library version: Intel MKL from composer_xe_2013
-bash-4.1$ echo $MKLROOT
/usr/opt/intel/composer_xe_2013.1.117/mkl
- Compiler version
-bash-4.1$ ifort -v
ifort version 13.0.1
- GNU Compiler Collection (GCC)* or Microsoft Visual Studio* version (if applicable)
-bash-4.1$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/sb/gcc-5.2.0/libexec/gcc/x86_64-unknown-linux-gnu/5.2.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: /sb/objdir/../gcc-5.2.0/configure --prefix=/sb/gcc-5.2.0 --enable-languages=c,c++,fortran,go --disable-multilib
Thread model: posix
gcc version 5.2.0 (GCC)
- Steps to reproduce the error (include makefiles, command lines, small test cases, and build instructions)
Makefile:
SRCFDIR = $(realpath ./)/src_Fortran
SRCCDIR = $(realpath ./)/src_Cpp_cw5
OBJDIR = $(realpath ./)/obj
MKDIR = if [ ! -d $(@D) ]; then mkdir -p $(@D); fi
PROGRAM=cw5
ARCH = $(shell uname -m)
TARGET = ${PROGRAM}.${ARCH}
#include Makefile.${ARCH}
CPPC = g++
FC_SEQ = ifort
FC_PAR = ifort
FC_LINK = ifort
MKL_LINK_FLAGS =-L$(MKLROOT)/lib/intel64 -Wl,--start-group -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -Wl,--end-group -openmp -lpthread -lm
#Begin Optimized options
CPP_FLAGS = -std=c++17 -mcmodel=large -w
F_SEQ_FLAGS = -O3 -shared-intel
F_PAR_FLAGS = -O3 -shared-intel
F_LINK_FLAGS = -O3 -static-intel -cxxlib -lrt
#End Optimized options
C_MAIN = cw5.cpp
F_SRCS = dcwnlatg4.f
F90_SRCS = dcwnlatf4.f90
OBJS = $(OBJDIR)/${C_MAIN:.cpp=.o} $(OBJDIR)/${F_SRCS:.f=.o} $(OBJDIR)/${F90_SRCS:.f90=.o}
all: ${TARGET} CWNLAT
# ********* First Program: cw5 ************ #
#${TARGET}: ${OBJS}
# $(FC_LINK) -o $@ $(F_LINK_FLAGS) ${OBJS}
${TARGET}: ${OBJS}
	$(FC_LINK) $(F_LINK_FLAGS) -nofor_main -o $@ ${OBJS} $(MKL_LINK_FLAGS)
$(OBJDIR)/dcwnlatg4.o : $(SRCFDIR)/${F_SRCS}
	@$(MKDIR)
	$(FC_PAR) $(F_PAR_FLAGS) -o $(OBJDIR)/dcwnlatg4.o -c $(SRCFDIR)/${F_SRCS}
$(OBJDIR)/dcwnlatf4.o : $(SRCFDIR)/${F90_SRCS}
	@$(MKDIR)
	$(FC_PAR) $(F_PAR_FLAGS) -o $(OBJDIR)/dcwnlatf4.o -c $(SRCFDIR)/${F90_SRCS}
$(OBJDIR)/cw5.o : $(SRCCDIR)/${C_MAIN}
	@$(MKDIR)
	$(CPPC) $(CPP_FLAGS) -o $(OBJDIR)/cw5.o -c $(SRCCDIR)/${C_MAIN} -lrt
# ********* Second Program CWNLAT ************ #
CWNLAT : $(SRCFDIR)/runCWNLAT.f
	$(FC_SEQ) $(F_SEQ_FLAGS) -o CWNLAT $(SRCFDIR)/runCWNLAT.f
.PHONY: clean cleanall
clean:
	rm $(OBJS) CWNLAT
cleanall:
	rm $(OBJS) *~
runCWNLAT.sub used for qsub:
# Tell PBS which shell to use on the compute nodes Options are: /bin/bash or /bin/tcsh
#PBS -S /bin/bash
# Tell PBS the name to use for your job
#PBS -N runCWNLAT
# request #nodes:#cpus/node:#memory/node,requested time
#PBS -l select=1:ncpus=8:mem=16gb,walltime=00:04:00
# queue group
#PBS -q normal
# Tell PBS to join the output (.o) and error (.e) files into one file
#PBS -j oe
# *********** Commands **********#
# Tell PBS to run the job in the directory your job was submitted from
cd $PBS_O_WORKDIR
# Set up env for Intel MKL
source /opt/intel/bin/ifortvars.sh intel64
./CWNLAT ENmodel_33.DAT
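Since the crash only shows up under qsub, it may help to record the limits the job actually sees on the compute node. The following is a minimal sketch that could be placed in the job script before the run; `report_limits` is a hypothetical helper that is not part of the original script, and `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the standard OpenMP and MKL thread-count variables, which may simply be unset here.

```shell
# Hypothetical helper (not in the original job script): print the resource
# limits and thread settings visible to the job on the compute node.
report_limits() {
    echo "stack_kb=$(ulimit -s)"                    # per-process stack limit
    echo "vmem_kb=$(ulimit -v)"                     # virtual memory limit
    echo "omp_threads=${OMP_NUM_THREADS:-unset}"    # OpenMP thread count
    echo "mkl_threads=${MKL_NUM_THREADS:-unset}"    # MKL thread count
}

report_limits
```

Comparing this output between a head-node run and a qsub run would show whether the stack limit, virtual memory limit, or thread count differs between the two.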