Hello all,
I recently implemented a convolution using the Intel MKL DNN API, following the example included with the library. Everything is fine and dandy on my laptop with an i5-3210M. However, when I tried to run the code on the big machine, with an Intel(R) Xeon(R) CPU E5-2650 v3, I ran into some bugs/problems.
For outputs whose channel count is a multiple of 8, the order of the output values is wrong. This is either a mistake on my side (probably in the compile options) or, in the worst case, a bug in MKL. I wrote a short test program, similar to the example file, that implements a standard forward convolution:
#include <iostream>
#include <vector>
#include "mkl_dnn.h"

using namespace std;

#define dimension (4)

int main() {
    dnnPrimitiveAttributes_t attributes;
    dnnPrimitive_t conv_prim = NULL;
    float* resConv1[dnnResourceNumber] = {0};
    size_t batch_num = 1;
    bool use_bias = false;
    size_t xinp = 4, yinp = 4, xout = 4, yout = 4,
           inpchannels = 1, outchannels = 8, xfilt = 3, yfilt = 3;

    // Sizes/strides are given innermost-first: width, height, channels, batch.
    size_t outputSize[dimension]    = { xout, yout, outchannels, batch_num };
    size_t outputStrides[dimension] = { 1, xout, xout * yout, xout * yout * outchannels };
    size_t inputSize[dimension]     = { xinp, yinp, inpchannels, batch_num };
    size_t inputStrides[dimension]  = { 1, xinp, xinp * yinp, xinp * yinp * inpchannels };
    size_t filterSize[dimension]    = { xfilt, yfilt, inpchannels, outchannels };
    size_t filterStrides[dimension] = { 1, xfilt, xfilt * yfilt, xfilt * yfilt * inpchannels };
    size_t biasSize[1]    = { outputSize[2] };   // unused while use_bias is false
    size_t biasStrides[1] = { outputStrides[2] };
    size_t convolutionStride[dimension - 2] = { 1, 1 };
    // "Same" padding: works out to -1 on each border for a 3x3 filter.
    int inputOffset[dimension - 2] = {
        -(int)(outputSize[0] / 2) - (int)(filterSize[0] / 2) + (int)(inputSize[0] / 2),
        -(int)(outputSize[0] / 2) - (int)(filterSize[0] / 2) + (int)(inputSize[0] / 2)
    };

    dnnLayout_t lt_conv1_input = NULL, lt_conv1_filt = NULL,
                lt_conv1_bias = NULL, lt_conv1_output = NULL;

    if (dnnPrimitiveAttributesCreate_F32(&attributes) != E_SUCCESS) {
        std::cout << "error" << std::endl;
    }

    dnnError_t err;
    if (use_bias) {
        err = dnnConvolutionCreateForwardBias_F32(&conv_prim, attributes,
                  dnnAlgorithmConvolutionDirect, dimension, inputSize, outputSize,
                  filterSize, convolutionStride, inputOffset, dnnBorderZeros);
    } else {
        err = dnnConvolutionCreateForward_F32(&conv_prim, attributes,
                  dnnAlgorithmConvolutionDirect, dimension, inputSize, outputSize,
                  filterSize, convolutionStride, inputOffset, dnnBorderZeros);
    }
    if (err != E_SUCCESS) {
        switch (err) {
        case E_INCORRECT_INPUT_PARAMETER:
            std::cout << "incorrect input parameter while creating the convolution" << std::endl;
            break;
        default:
            std::cout << "error while creating convolution" << std::endl;
        }
    }

    // Layouts the primitive prefers for each resource.
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_input, conv_prim, dnnResourceSrc);
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_filt, conv_prim, dnnResourceFilter);
    if (use_bias) {
        dnnLayoutCreateFromPrimitive_F32(&lt_conv1_bias, conv_prim, dnnResourceBias);
    }
    dnnLayoutCreateFromPrimitive_F32(&lt_conv1_output, conv_prim, dnnResourceDst);

    std::vector<float> input(xinp * yinp * inpchannels, 1.0f);
    std::vector<float> output(xout * yout * outchannels, 1.0f);
    std::vector<float> filter(xfilt * yfilt * inpchannels * outchannels, 1.0f);
    std::vector<float> bias(outchannels, 1.0f);

    // Plain user-layout buffers, attached directly to the primitive.
    resConv1[dnnResourceSrc] = &input[0];
    resConv1[dnnResourceFilter] = &filter[0];
    if (use_bias) resConv1[dnnResourceBias] = &bias[0];
    resConv1[dnnResourceDst] = &output[0];

    dnnError_t err_exe = dnnExecute_F32(conv_prim, (void**)resConv1);
    if (err_exe != E_SUCCESS) {
        std::cout << "Error during forward propagation in convolutional layer" << std::endl;
        if (err_exe == E_MEMORY_ERROR) std::cout << "Memory Error" << std::endl;
        if (err_exe == E_UNIMPLEMENTED) std::cout << "Unimplemented" << std::endl;
        if (err_exe == E_UNSUPPORTED_DIMENSION) std::cout << "Unsupported dimension" << std::endl;
        if (err_exe == E_INCORRECT_INPUT_PARAMETER) std::cout << "Incorrect input parameter" << std::endl;
    }

    std::cout << "output" << std::endl;
    for (size_t i = 0; i < output.size(); i++) {
        std::cout << output[i] << " ";
    }
    std::cout << std::endl;
    return 0;
}
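Note that I attach my plain NCHW buffers directly to the primitive without checking whether the primitive actually expects that layout. If I read mkl_dnn.h correctly, such a check would look roughly like the following untested sketch (lt_user_output is a name I made up for a plain layout built from my own sizes and strides; dnnLayoutCompare_F32 should return 1 when two layouts match):

// Untested sketch: compare the plain user layout against the layout the
// primitive reports for its destination resource.
dnnLayout_t lt_user_output = NULL;
dnnLayoutCreate_F32(&lt_user_output, dimension, outputSize, outputStrides);
if (!dnnLayoutCompare_F32(lt_user_output, lt_conv1_output)) {
    std::cout << "primitive uses an internal output layout" << std::endl;
}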
The desired output for a 4x4 input image of 1s, convolved with eight 3x3 filters of 1s and zero padding, is the following (one line per output channel; corner pixels overlap only 4 filter taps, edge pixels 6, and interior pixels 9):
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
4 6 6 4 6 9 9 6 6 9 9 6 4 6 6 4
This is also what my mobile CPU gives me when I run the code. However, on the big PC I get (broken into lines of eight values to make the pattern visible):
4 4 4 4 4 4 4 4
6 6 6 6 6 6 6 6
6 6 6 6 6 6 6 6
4 4 4 4 4 4 4 4
6 6 6 6 6 6 6 6
9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9
6 6 6 6 6 6 6 6
6 6 6 6 6 6 6 6
9 9 9 9 9 9 9 9
9 9 9 9 9 9 9 9
6 6 6 6 6 6 6 6
4 4 4 4 4 4 4 4
6 6 6 6 6 6 6 6
6 6 6 6 6 6 6 6
4 4 4 4 4 4 4 4
The values themselves are correct, but they appear in runs of eight, as if the channel dimension had become the innermost dimension. When I change the output channel count to something that is not a multiple of 8, the code runs fine even on the Xeon CPU. This might be due to MKL switching to a different (and slower) algorithm in that case, as explained in this post:
https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/...
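If that is what is happening, I assume the output would have to be written into a buffer with the primitive's internal layout and then converted back into my plain vector with a conversion primitive. Again an untested sketch pieced together from mkl_dnn.h, reusing the hypothetical lt_user_output layout from the check above (conv_output_internal and to_user are names I made up):

// Untested sketch: execute into an internal-layout buffer, then convert
// the result back into the plain NCHW output vector.
float* conv_output_internal = NULL;  // buffer in the primitive's own layout
dnnAllocateBuffer_F32((void**)&conv_output_internal, lt_conv1_output);
resConv1[dnnResourceDst] = conv_output_internal;
dnnExecute_F32(conv_prim, (void**)resConv1);

dnnPrimitive_t to_user = NULL;
dnnConversionCreate_F32(&to_user, lt_conv1_output, lt_user_output);
dnnConversionExecute_F32(to_user, conv_output_internal, &output[0]);

dnnDelete_F32(to_user);
dnnReleaseBuffer_F32(conv_output_internal);

But I am not sure this is the intended way to handle it.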
Does anybody have an explanation or even a fix for this issue? Is this known behaviour on Xeon CPUs, or a bug in the software? I don't necessarily want to switch to the open-source implementation, since that would mean a week of reimplementing and testing.
For compilation I used the following link line and include flags on both systems:
-L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
-I${MKLROOT}/include -I${MKLROOT}/../lib/intel64_lin
Any help would be appreciated.