Hello There,
I have recently been using the MKL FFT code to measure the cycle count of DftiComputeForward. According to the MKL documentation, DFTI_NUMBER_OF_USER_THREADS is no longer used in the latest MKL version, but I ran a test anyway.
I added "status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, n);" to my test code, with n set to 1, 2, 3, or 4 in turn, and the results are:
Cycle count of DftiComputeForward by FFT size and thread setting:

FFT size      No thread setting   1 thread   2 threads   3 threads   4 threads
128-point     740                 800        698         540         448
256-point     1418                923        956         920         960
512-point     3002                2263       1968        1984        1968
1024-point    5848                5044       4130        4185        4113
2048-point    24262               21624      9782        9714        9825
The test code is below:

//DFTI_SINGLE is single precision, DFTI_DOUBLE is double precision
status = DftiCreateDescriptor(&FFT_desc, DFTI_SINGLE, DFTI_COMPLEX, 1, FFTSize);
//DFTI_INPLACE: FFT output overwrites input; DFTI_NOT_INPLACE: FFT output does not overwrite input
status = DftiSetValue(FFT_desc, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
status = DftiSetValue(FFT_desc, DFTI_NUMBER_OF_USER_THREADS, 4);
//commit (freeze) the FFT descriptor
status = DftiCommitDescriptor(FFT_desc);
j = 0;
for (idxTimeLoop = 0; idxTimeLoop < taskCallsNumber / internalLoopCounter; idxTimeLoop++)
{
    unsigned __int64 clockStart, clockEnd;
    clockStart = GetTickAndTime(&getStartTick, &getStartTime);
    for (idxLoop = 0; idxLoop < internalLoopCounter; idxLoop++)
    {
        //run the FFT with the forward method
        status = DftiComputeForward(FFT_desc, FFT_in_singlePrecision, FFT_out_singlePrecision);
    }
    clockEnd = GetTickAndTime(&getEndTick, &getEndTime);
    //tick count and duration (ms) for one batch of internalLoopCounter calls
    clockNumArray[j] = getEndTick - getStartTick;
    timeDurationArray[j] = (getEndTime - getStartTime)*1000.0;
    j++;
}
My MKL version information:
Major version: 11
Minor version: 2
Update version: 3
Product status: Product
Build: 20150413
Platform: Intel(R) 64 architecture
Processor optimization: Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors
OS: Windows 7
Processor: i5-3320M 2.6GHz.
My questions: why is the cycle count of the 2048-point MKL FFT DftiComputeForward about 4 times that of the 1024-point one? Is this caused by the data cache or by something else? And why does setting DFTI_NUMBER_OF_USER_THREADS affect the performance of the 2048-point DftiComputeForward? Please feel free to contact me if you need more information about my test code.
Thanks a lot!
Lei Fu