Dear all,
unfortunately, I am again having trouble with mkl_cluster_sparse_solver, as in my previous topic. I took one of the examples Intel provides in the MKL examples directory and modified it in two ways: the code now reads an arbitrary matrix from the file fort.110, and it loops over the solver routines, because I want to change the matrix between cycles later on.
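For reference, this is the layout of fort.110 that the reads in the code below expect (one value per line, since each list-directed read consumes a record):
first line: DimensionL Nsparse
next Nsparse lines: VAL (the nonzero values)
next DimensionL+1 lines: IA (the CSR row pointers)
next Nsparse lines: JA (the column indices)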
The first problem arises when treating large system sizes; the matrix for that case can be found in fort1.zip. The program aborts with a segmentation fault after reaching 18%: forrtl: severe (174): SIGSEGV, segmentation fault occurred. This is hard to track down further, but it must happen inside the solver routine, since the factorization does start. As I said, this only happens for large matrices, and I do not know how to get rid of it.
The next problem occurs for small matrices, such as the one in fort.zip. It seems to be caused by the loop: the first cycle works fine, but the second cycle aborts with an error message I have already seen in one of my previous topics:
Fatal error in PMPI_Reduce: Message truncated, error stack:
PMPI_Reduce(2334).................: MPI_Reduce(sbuf=0x7d7d7f8, rbuf=0x7f0b900, count=22912, MPI_DOUBLE, MPI_SUM, root=0, comm=0x84000004) failed
MPIR_Reduce_impl(1439)............: fail failed
I_MPIR_Reduce_intra(1533).........: Failure during collective
MPIR_Reduce_intra(1201)...........: fail failed
MPIR_Reduce_Shum_ring(833)........: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 1 and tag 11 truncated; 14000 bytes received but buffer size is 1296
I have tried what helped the last time, namely providing all parameters (nrhs, msglvl, iparm, ..) on all ranks again, but it does not seem to fix the issue.
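To make precise what I mean by that, here is a minimal sketch (not my actual code; the variable names just follow the program below) of broadcasting the control data from rank 0 so that every rank enters the solver calls with identical input. The explicit integer(4) count/root arguments are my own choice to stay within the default MPI interface while compiling with -i8:

subroutine share_solver_input(MKL_COMM, iparm, nrhs, msglvl, mpi_stat)
   ! Broadcast the solver controls from rank 0 so that cluster_sparse_solver
   ! sees identical nrhs, msglvl and iparm on every MPI rank.
   implicit none
   include 'mpif.h'
   integer(4), intent(in)    :: MKL_COMM
   integer,    intent(inout) :: iparm(64), nrhs, msglvl   ! 8-byte integers with -i8
   integer(4), intent(out)   :: mpi_stat
   call MPI_BCAST(iparm,  int(64,4), MPI_INTEGER8, int(0,4), MKL_COMM, mpi_stat)
   call MPI_BCAST(nrhs,   int(1,4),  MPI_INTEGER8, int(0,4), MKL_COMM, mpi_stat)
   call MPI_BCAST(msglvl, int(1,4),  MPI_INTEGER8, int(0,4), MKL_COMM, mpi_stat)
end subroutine share_solver_input

In the program below I simply set these values on every rank before the first call instead, which should be equivalent.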
This is the program code (cl_solver_f90.f90):
program cluster_sparse_solver
use mkl_cluster_sparse_solver
implicit none
include 'mpif.h'
integer, parameter :: dp = kind(1.0D0)
!.. Internal solver memory pointer for 64-bit architectures
TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: pt(64)
integer maxfct, mnum, mtype, phase, nrhs, error, msglvl, i, ik, l1, k1, idum(1), DimensionL, Nsparse
integer*4 mpi_stat, rank, num_procs
double precision :: ddum(1)
integer, allocatable :: IA( : ), JA( : ), iparm( : )
double precision, allocatable :: VAL( : ), rhodot( : ), rho( : )
integer(4) MKL_COMM

MKL_COMM = MPI_COMM_WORLD
call mpi_init(mpi_stat)
call mpi_comm_rank(MKL_COMM, rank, mpi_stat)

! initialize the internal solver memory pointer
do l1 = 1, 64
   pt(l1)%dummy = 0
end do

error  = 0   ! initialize error flag
msglvl = 1   ! print statistical information
mtype  = 11  ! real, non-symmetric
nrhs   = 1
maxfct = 1
mnum   = 1

allocate(iparm(64))
do l1 = 1, 64
   iparm(l1) = 0
end do

! Setup PARDISO control parameters
iparm(1)  = 1   ! do not use default values
iparm(2)  = 3   ! fill-in reordering from METIS
iparm(8)  = 100 ! max. number of iterative refinement steps on entry
iparm(10) = 13  ! perturb the pivot elements with 1E-13
iparm(11) = 1   ! use nonsymmetric permutation and scaling MPS
iparm(13) = 1   ! improved accuracy using nonsymmetric weighted matching
iparm(27) = 1   ! check whether column indices are sorted in increasing order within each row

read(110,*) DimensionL, Nsparse   ! unit 110 is preconnected to the file fort.110
allocate(VAL(Nsparse), JA(Nsparse), IA(DimensionL+1))

if (rank.eq.0) then
   do k1 = 1, Nsparse
      read(110,*) VAL(k1)
   end do
   do k1 = 1, DimensionL+1
      read(110,*) IA(k1)
   end do
   do k1 = 1, Nsparse
      read(110,*) JA(k1)
   end do
end if

allocate(rhodot(DimensionL), rho(DimensionL))
if (rank.eq.0) then
   rhodot    = 0.0d0
   rhodot(1) = 1.0d0
   rho       = 0.0d0
end if

if (rank.eq.0) write(*,*) 'INIT PARDISO'

ik = 0
Pardisoloop: do
   ik = ik + 1

   phase = 12   ! analysis and numerical factorization
   call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, &
                                   nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
   if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

   phase = 33   ! solve
   call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, &
                                   nrhs, iparm, msglvl, rhodot, rho, MKL_COMM, error )
   if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

   if (ik.ge.4) exit Pardisoloop
end do Pardisoloop

call MPI_BARRIER(MKL_COMM, mpi_stat)

phase = -1   ! release internal memory
call cluster_sparse_solver_64 ( pt, maxfct, mnum, mtype, phase, DimensionL, ddum, idum, idum, idum, &
                                nrhs, iparm, msglvl, ddum, ddum, MKL_COMM, error )
if (error.ne.0.and.rank.eq.0) write(*,*) 'Release of memory: ', error

call mpi_finalize(mpi_stat)
end
I compile with
mpiifort -i8 -I${MKLROOT}/include -c -o mkl_cluster_sparse_solver.o ${MKLROOT}/include/mkl_cluster_sparse_solver.f90
mpiifort -i8 -I${MKLROOT}/include -c -o cl_solver_f90.o cl_solver_f90.f90
mpiifort mkl_cluster_sparse_solver.o cl_solver_f90.o -o MPI.out -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
and run the program with mpiexec -n 2 ./MPI.out. Our cluster has 16 cores per node and I request two nodes. RAM should not be the problem (64 GB per node), since the matrix runs perfectly with the normal PARDISO on a single node. I set export MKL_NUM_THREADS=16. Am I right that the slave MPI process should automatically obtain parts of the factors (LU here, since mtype = 11), or do I have to use the distributed version in order to do so? The reason I ask is that I cannot observe any process running on the second node.
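To clarify what I mean by "the distributed version": as far as I understand, that is the distributed assembled input format of cluster_sparse_solver, selected via iparm(40)-iparm(42) instead of the centralized input I use above (iparm(40) = 0, matrix and right-hand side only on the master rank). Roughly, it would look like the following fragment; first_row and last_row are placeholders, not values from my program:

iparm(40) = 1          ! distributed assembled input: each rank holds a row slice of A, b and x
iparm(41) = first_row  ! first row of this rank's input domain
iparm(42) = last_row   ! last row of this rank's input domain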
The versions are MKL 2017.4.256, ifort 17.0.6.256, and Intel MPI 2017.4.239, but my colleague can also reproduce the issue with other versions and on other clusters.
Thanks in advance,
Horst