When repeatedly calling MKL's distributed (cluster) DFT library via the FFTW3 interface library, it fails with a floating-point exception for certain combinations of grid sizes and MPI process counts (e.g., a 1024 x 256 x 1 grid run with 17 MPI processes). The failure is repeatable, and I have uploaded an example code that demonstrates the problem. I am compiling with the "Composer XE 2015" tools (MKL), e.g.:
[jshaw@cl4n074]: make fft_mpi
echo "Building ... fft_mpi"
Building ... fft_mpi
g++ -I/home/jshaw/XCuda-2.4.0/inc -DNDEBUG -O2 -I/home/jshaw/XCuda-2.4.0/inc -I/opt/lic/intel13/impi/4.1.0.024/include64 -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include -I/opt/lic/intel15/composer_xe_2015.1.133/mkl/include/fftw fft_mpi.cpp -o fft_mpi -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCuda -L/home/jshaw/XCuda-2.4.0/lib/x86_64-CentOS-6.5 -lXCut -L/opt/lic/intel13/impi/4.1.0.024/lib64 -lmpi -L/home/jshaw/fftw-libs-1.1/x86_64-CentOS-6.5/lib -lfftw3x_cdft_lp64 -L/opt/lic/intel15/composer_xe_2015.1.133/mkl/lib/intel64 -lmkl_cdft_core -lmkl_intel_lp64 -lmkl_core -lmkl_sequential -lmkl_blacs_intelmpi_lp64 -lpthread -lm -ldl -lirc
[jshaw@cl4n074]: mpirun -np 17 fft_mpi
FFT-MPI for Complex Multi-dimensional Transforms (float)
MPI procs: 17, dim: 1024x256x1, loops: 100 (2 MBytes)
Allocating FFT memory ...
rank: 0 (16592 61 0)
rank: 2 (16592 61 122)
rank: 7 (16592 60 424)
rank: 8 (16592 60 484)
rank: 9 (16592 60 544)
rank: 10 (16592 60 604)
rank: 12 (16592 60 724)
rank: 13 (16592 60 784)
rank: 15 (16592 60 904)
rank: 16 (16592 60 964)
rank: 1 (16592 61 61)
rank: 3 (16592 61 183)
rank: 4 (16592 60 244)
rank: 5 (16592 60 304)
rank: 6 (16592 60 364)
rank: 11 (16592 60 664)
rank: 14 (16592 60 844)
Initializing ...
Planning ...
loop: <1fnr> <2fnr> <3fnr> <4fnr> <5fnr> <6fnr> <7fnr> <8fnr> <9fnr> <10fnr> <11fnr> <12fnr> <13fAPPLICATION TERMINATED WITH THE EXIT STRING: Floating point exception (signal 8)
[jshaw@cl4n074]:
Note that the XCuda libraries referenced in the compile/link line are NOT required for this example! The code runs correctly with many other grid sizes and numbers of MPI processes. It's not as simple as having too many or too few MPI processes, or the FFT grid being too small or too large; I regularly run codes with large 3D grids. The issue is that one cannot pick arbitrary grid sizes and MPI rank counts (as one should be able to). Any help or thoughts on how this can be fixed are appreciated. Is this a legitimate bug in MKL?