I have an OpenMP loop:
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    // routine that calls MKL FFT
}
The threading performance is abysmal: on an 8-core machine, just over 1 core is being used.
What is surprising is that Intel VTune Amplifier shows most of the time is spent in DftiCommitDescriptor, not in the actual computation:
Function / Call Stack   CPU Time   Module       Function (Full)        Source File   Start Address
DftiCommitDescriptor    83.7%      mkl_rt.dll   DftiCommitDescriptor   [Unknown]     0x180a45b68
.....
DftiComputeForward      0.5%       mkl_rt.dll   DftiComputeForward     [Unknown]     0x180a45f10
Are there any suggested best practices here? Typically the FFT function is called with the same data length each time, say 10K-20K.
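
To make the question concrete, here is a minimal sketch of one restructuring that may be worth trying when the length is fixed: do the descriptor setup once per thread, before the work-sharing loop, and reuse it for every iteration. The names plan_t, plan_commit, plan_execute and run_batch are hypothetical stand-ins, not MKL API. In real code they would roughly correspond to a DFTI_DESCRIPTOR_HANDLE, DftiCreateDescriptor followed by DftiCommitDescriptor, and DftiComputeForward. Whether this actually removes the DftiCommitDescriptor hotspot is an assumption that would need to be measured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an FFT descriptor
   (with MKL this would be a DFTI_DESCRIPTOR_HANDLE). */
typedef struct { int length; int commits; } plan_t;

/* Stand-in for the expensive setup step
   (DftiCreateDescriptor + DftiCommitDescriptor in MKL). */
static void plan_commit(plan_t *p, int length) {
    p->length = length;
    p->commits++;
}

/* Stand-in for the transform (DftiComputeForward in MKL).
   Here it only sums the input so the sketch runs without MKL. */
static double plan_execute(const plan_t *p, const double *x) {
    double s = 0.0;
    for (int k = 0; k < p->length; k++) s += x[k];
    return s;
}

/* Processes n inputs of the same length stored back to back in data.
   Returns the total number of setup calls, which is one per thread
   rather than one per loop iteration. */
int run_batch(int n, int length, const double *data, double *out) {
    assert(length > 0);
    int total_commits = 0;

    #pragma omp parallel reduction(+:total_commits)
    {
        /* Per-thread setup, hoisted out of the loop. */
        plan_t plan = {0, 0};
        plan_commit(&plan, length);

        #pragma omp for
        for (int i = 0; i < n; i++) {
            out[i] = plan_execute(&plan, data + (size_t)i * length);
        }

        /* With MKL, DftiFreeDescriptor would be called here. */
        total_commits += plan.commits;
    }
    return total_commits;
}
```

Each thread owns its own plan, so nothing is shared between threads, and the setup cost is paid once per thread instead of once per call. The sketch compiles with or without OpenMP enabled; without it the pragmas are ignored and the loop runs serially with a single setup call.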