Dear all,
Unfortunately, I am again having trouble with mkl_cluster_sparse_solver, as in my previous topic. I took one of the examples Intel ships in the MKL examples directory and modified it in two ways: first, the code now reads an arbitrary matrix from the file fort.110 (the layout I expect in that file is sketched after the program listing below); second, I loop over the solver routines, because later on I want to change the matrix from one cycle to the next. The first problem arises for large system sizes.
For this case you can find the matrix in fort1.zip. The program aborts with a segmentation fault after reaching 18%: forrtl: severe (174): SIGSEGV, segmentation fault occurred. It is hard to track down what exactly goes wrong, but it must happen inside the solver routine, since the routine does start. As I said, this only happens for large matrices, and I do not know how to get rid of the problem.
The next problem occurs for small matrices, such as the one in fort.zip, and seems to be related to the loop: in the first cycle everything works fine, but the second cycle aborts with an error message I have already seen in one of my earlier topics:
Fatal error in PMPI_Reduce: Message truncated, error stack:
PMPI_Reduce(2334).................: MPI_Reduce(sbuf=0x7d7d7f8, rbuf=0x7f0b900, count=22912, MPI_DOUBLE, MPI_SUM, root=0, comm=0x84000004) failed
MPIR_Reduce_impl(1439)............: fail failed
I_MPIR_Reduce_intra(1533).........: Failure during collective
MPIR_Reduce_intra(1201)...........: fail failed
MPIR_Reduce_Shum_ring(833)........: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 1 and tag 11 truncated; 14000 bytes received but buffer size is 1296
I have tried what helped last time, providing all parameters (nrhs, msglvl, iparm, ...) on all ranks again, but it does not seem to fix the issue this time.
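To be precise, by "providing all parameters on all ranks" I mean that every rank enters the solver with identical nrhs, msglvl and iparm values; in the full program below each rank simply sets them itself. A minimal, self-contained sketch of the equivalent broadcast variant is shown here. It assumes the same -i8 compilation as the program below, so the 8-byte default integers are matched with MPI_INTEGER8 (an assumption about the MPI library configuration), and the iparm(1) value is only toy data:

program bcast_controls
    implicit none
    include 'mpif.h'
    ! With -i8 these default integers are 8 bytes, hence MPI_INTEGER8 below
    ! (an assumption about the MPI library configuration).
    integer :: nrhs, msglvl, iparm(64)
    integer*4 :: mpi_stat, rank
    integer*4 :: one4 = 1, count64 = 64, root = 0  ! 4-byte MPI count/root arguments

    call mpi_init(mpi_stat)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, mpi_stat)

    if (rank.eq.0) then
        nrhs     = 1
        msglvl   = 1
        iparm    = 0
        iparm(1) = 1   ! non-default control, only as toy data for the sketch
    end if

    ! Rank 0 owns the solver controls; all other ranks receive identical copies.
    call mpi_bcast(nrhs,   one4,    MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)
    call mpi_bcast(msglvl, one4,    MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)
    call mpi_bcast(iparm,  count64, MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)

    write(*,*) 'rank', rank, ': iparm(1) =', iparm(1)
    call mpi_finalize(mpi_stat)
end program bcast_controls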
This is the program code (cl_solver_f90.f90):
program cluster_sparse_solver
    use mkl_cluster_sparse_solver
    implicit none
    include 'mpif.h'
    integer, parameter :: dp = kind(1.0D0)
    !.. Internal solver memory pointer for 64-bit architectures
    TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: pt(64)
    integer :: maxfct, mnum, mtype, phase, nrhs, error, msglvl, i, ik, l1, k1, idum(1), DimensionL, Nsparse
    integer*4 :: mpi_stat, rank, num_procs
    double precision :: ddum(1)
    integer, allocatable :: IA( : ), JA( : ), iparm( : )
    double precision, allocatable :: VAL( : ), rhodot( : ), rho( : )
    integer(4) :: MKL_COMM

    MKL_COMM = MPI_COMM_WORLD
    call mpi_init(mpi_stat)
    call mpi_comm_rank(MKL_COMM, rank, mpi_stat)

    do l1 = 1, 64
        pt(l1)%dummy = 0
    end do

    error  = 0   ! initialize error flag
    msglvl = 1   ! print statistical information
    mtype  = 11  ! real, non-symmetric
    nrhs   = 1
    maxfct = 1
    mnum   = 1

    allocate(iparm(64))
    do l1 = 1, 64
        iparm(l1) = 0
    end do
    ! Setup PARDISO control parameters
    iparm(1)  = 1    ! do not use default values
    iparm(2)  = 3    ! fill-in reordering from METIS
    iparm(8)  = 100  ! max. number of iterative refinement steps on entry
    iparm(10) = 13   ! perturb the pivot elements with 1E-13
    iparm(11) = 1    ! use nonsymmetric permutation and scaling MPS
    iparm(13) = 1    ! improved accuracy using nonsymmetric weighted matching
    iparm(27) = 1    ! check whether column indices are sorted in increasing order within each row

    ! Read the matrix in CSR format from fort.110: the dimension and the number
    ! of nonzeros on the first line, then the values, the row pointers and the
    ! column indices, one entry per line.
    read(110,*) DimensionL, Nsparse
    allocate(VAL(Nsparse), JA(Nsparse), IA(DimensionL))
    if (rank.eq.0) then
        do k1 = 1, Nsparse
            read(110,*) VAL(k1)
        end do
        do k1 = 1, DimensionL+1
            read(110,*) IA(k1)
        end do
        do k1 = 1, Nsparse
            read(110,*) JA(k1)
        end do
    end if

    ! Right-hand side and solution vector; initialized on the master rank only
    allocate(rhodot(DimensionL), rho(DimensionL))
    if (rank.eq.0) then
        rhodot    = 0.0d0
        rhodot(1) = 1.0d0
        rho       = 0.0d0
    end if

    if (rank.eq.0) write(*,*) 'INIT PARDISO'

    ik = 0
    Pardisoloop: do
        ik = ik + 1

        phase = 12  ! analysis and numerical factorization
        call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, &
                                      iparm, msglvl, ddum, ddum, MKL_COMM, error)
        if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

        phase = 33  ! solve with iterative refinement
        call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, &
                                      iparm, msglvl, rhodot, rho, MKL_COMM, error)
        if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

        if (ik.ge.4) exit Pardisoloop
    end do Pardisoloop

    call MPI_BARRIER(MKL_COMM, mpi_stat)

    phase = -1  ! release internal solver memory
    call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, ddum, idum, idum, idum, nrhs, &
                                  iparm, msglvl, ddum, ddum, MKL_COMM, error)
    if (error.ne.0.and.rank.eq.0) write(*,*) 'Release of memory: ', error

    call mpi_finalize(mpi_stat)
end
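For reference, fort.110 is expected to contain the dimension and the number of nonzeros on the first line, followed by the CSR values, the DimensionL+1 row pointers and the column indices, one entry per line, all read list-directed. A minimal sketch that writes such a file for a 2x2 identity matrix (toy data, only to illustrate the format; it relies on unit 110 being written to fort.110 by default, as ifort does):

program write_fort110
    implicit none
    integer, parameter :: n = 2, nnz = 2
    double precision :: val(nnz) = (/ 1.0d0, 1.0d0 /)  ! CSR values
    integer :: ia(n+1) = (/ 1, 2, 3 /)                 ! row pointers (n+1 entries)
    integer :: ja(nnz) = (/ 1, 2 /)                    ! column indices
    integer :: k

    write(110,*) n, nnz
    do k = 1, nnz
        write(110,*) val(k)
    end do
    do k = 1, n+1
        write(110,*) ia(k)
    end do
    do k = 1, nnz
        write(110,*) ja(k)
    end do
end program write_fort110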
I compile with
mpiifort -i8 -I${MKLROOT}/include -c -o mkl_cluster_sparse_solver.o ${MKLROOT}/include/mkl_cluster_sparse_solver.f90
mpiifort -i8 -I${MKLROOT}/include -c -o cl_solver_f90.o cl_solver_f90.f90
mpiifort mkl_cluster_sparse_solver.o cl_solver_f90.o -o MPI.out -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
and run the program with mpiexec -n 2 ./MPI.out. Our cluster has 16 cores per node and I request two nodes. RAM should not be the problem (64 GB per node), since the same case runs perfectly with the plain PARDISO on a single node. I set export MKL_NUM_THREADS=16. Am I right that the slave MPI process should automatically obtain parts of the factorization, or do I have to use the distributed version to achieve that? The reason I ask is that I cannot observe any process running on the second node.
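To see where the ranks are actually placed, a small check along the following lines can be used (a minimal sketch; it only prints the host name each rank runs on via MPI_GET_PROCESSOR_NAME):

program where_are_my_ranks
    implicit none
    include 'mpif.h'
    integer*4 :: mpi_stat, rank, num_procs, namelen
    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

    call mpi_init(mpi_stat)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, mpi_stat)
    call mpi_comm_size(MPI_COMM_WORLD, num_procs, mpi_stat)
    ! Print which host each rank was placed on.
    call mpi_get_processor_name(hostname, namelen, mpi_stat)
    write(*,*) 'rank', rank, 'of', num_procs, 'runs on ', trim(hostname)
    call mpi_finalize(mpi_stat)
end program where_are_my_ranks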
The versions are: MKL 2017.4.256, ifort 17.0.6.256 and Intel MPI 2017.4.239, but my colleague can also reproduce the issue with other versions and on other clusters.
Thanks in advance,
Horst