Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

PARDISO crashes randomly with different messages


Hi,

I have a control volume code that calls PARDISO several times to solve a large sparse linear system (with a different matrix on each call). It needs to be called several times because the entries in the matrix change with time. The options I am currently using to compile my code are

<code> -openmp -traceback -g -check all -fpe0 -fp-stack-check -warn all -gen-interfaces -r8 -debug -fp-model strict -auto -module $(OBJDIR) -I$(MKL)/include </code>

and for the link line I use
 <code> -L$(MKL)/lib/intel64 -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -lpthread -lm </code>

In the past I have also tried the following compile and link options

<code> -openmp -traceback -g -check all -fpe0 -fp-stack-check -warn all -gen-interfaces -r8 -debug -fp-model strict -auto -module $(OBJDIR) -mkl=parallel </code>

<code> -lpthread -lm </code>

which also compiles successfully. The Fortran compiler I am using is intel-fc/14.1.106. The version of MKL I am using is whatever is bundled with that version of ifort (I THINK it's Composer XE 2013 SP1). The specs of the machine I am running my code on can be found at the following link (http://nci.org.au/nci-systems/national-facility/peak-system/raijin/detailed-system-configuration/).

I have followed the instructions found on http://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors. That is, I have tried compiling both with and without -heap-arrays, and I have tried "ulimit -s unlimited" and "ulimit -s 999999999999". I have set KMP_STACKSIZE to values ranging from 16M to 2GB. I have also set MKL_PARDISO_OOC_MAX_CORE_SIZE=16384 and MKL_PARDISO_OOC_MAX_SWAP_SIZE=16384. Within my code, pardisoinit is called at the beginning of my solve subroutine using

<code> Call pardisoinit(pt,mtype,iparm) </code>

where mtype = 11. I then set the values

iparm(1)=1
iparm(2)=3
iparm(24)=1
iparm(25)=0
iparm(27)=1
iparm(60)=0

and then call pardiso four more times using phase = 11, 22, 33 and then -1. This works for several time steps, but at some point I get an error message, usually in the form of a segfault:

<code>

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
main 0000000000A807F9 Unknown Unknown Unknown
main 0000000000A7F170 Unknown Unknown Unknown
main 0000000000A357E2 Unknown Unknown Unknown
main 00000000009E6F68 Unknown Unknown Unknown
main 00000000009EB0AB Unknown Unknown Unknown
libpthread.so.0 0000146C1FBFA500 Unknown Unknown Unknown
libmkl_core.so 0000146C21ACF025 Unknown Unknown Unknown
libmkl_core.so 0000146C21AC80AD Unknown Unknown Unknown
libmkl_core.so 0000146C21AC7D7A Unknown Unknown Unknown
libmkl_intel_thre 0000146C20775F0F Unknown Unknown Unknown
libmkl_intel_thre 0000146C20776107 Unknown Unknown Unknown
libmkl_intel_thre 0000146C207761C3 Unknown Unknown Unknown
libmkl_intel_thre 0000146C20776594 Unknown Unknown Unknown
libiomp5.so 0000146C1F6D4623 Unknown Unknown Unknown

</code>

This error NEVER appears at the same iteration, i.e. the program can run through 10 time steps before producing this error, and then I can run the program again and it will run for 1000 time steps. In addition, I have sometimes gotten two other errors. One is a glibc error that produces a really long output (unfortunately, I haven't kept an output file for this one... if I get it again I'll post it if someone thinks it may be useful), and the other states "*** glibc detected *** ./main: free(): corrupted unsorted chunks: 0x00002b91bb025040 ***".

I know the program crashes during the phase=11 stage because I put print statements before and after each call to PARDISO. I have tried to find out more by setting msglvl=1; however, the problem does not occur when msglvl=1. I will post a stripped-down version of the code once I get the OK from my manager. Lastly, I would also like to add that when the program doesn't crash, it produces results that seem reasonable (or even correct against known test cases).
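
To make the run-time setup described earlier easier to reproduce, the stack and environment settings look roughly like this in a bash job script. This is a sketch only: the KMP_STACKSIZE value is just one example from the 16M to 2GB range I tried, and the script layout (including the fallback message and the echo lines) is an assumption for illustration rather than a copy of my actual job script.

```shell
#!/bin/bash
# Sketch of the run-time environment described in this post (assumed bash job script).

# Raise the stack limit for the main thread; warn instead of failing
# if the hard limit on the machine does not allow it.
ulimit -s unlimited 2>/dev/null || echo "warning: could not raise the stack limit" >&2

# Stack size for the OpenMP worker threads (example value; I tried 16M up to 2G).
export KMP_STACKSIZE=16M

# PARDISO out-of-core limits, with the values quoted in this post.
export MKL_PARDISO_OOC_MAX_CORE_SIZE=16384
export MKL_PARDISO_OOC_MAX_SWAP_SIZE=16384

# Print what is actually in effect before launching the solver.
echo "stack limit: $(ulimit -s)"
echo "KMP_STACKSIZE=$KMP_STACKSIZE"

# ./main    # the executable would be launched here
```

Printing the limits just before the run is only there to confirm that the settings really reached the process, since a batch system may apply its own limits.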

Also, I apologise in advance: I'm new to this forum and I'm not sure which tags I'm meant to use for code... My background is not in computer science/engineering or programming in general. I have been using Fortran 90 for approximately 5 years now, but that has all been self-taught, reading bits and pieces from the internet, and I'm fairly sure my knowledge is very spotty. Also, I have never used Valgrind (I noticed that seems to be the first response to segfaults around here), but I have used a tiny bit of TotalView (they're basically the same, right?), though I have never been successful in finding the source of any of my problems using it... Printing random strings of text before and after where I suspect problems are occurring is how I usually fix my problems...

Thanks in advance to anybody who spends some time trying to help me here.

Eugene

