I'm using SVD for least-squares fitting, typically operating on spectral data (1000-2000 data points) and fitting with very few parameters (2-5).
For this, I have generally been using a direct implementation of the SVD routines from "Numerical Recipes" (single-threaded).
When I started needing SVDs in other areas (bigger matrices with a less extreme aspect ratio, typically ~ 10000 x 1000), I switched to MKL LAPACKE, currently version 2017_4_210, and there the routines greatly outperform the NR routines.
So I also started using them for the fitting described above. However, when applied to the "extreme" case of only very few parameters (typical matrix size 2048 x 3), the LAPACKE routines fall behind and the NR routines are simply faster.
Just as a guideline: running the same (iterative) fit on a typical standard data set, my profiler tells me I spend about 4 sec in the SVD routines with NR and about 7 sec with MKL.
Now, when MKL 2018 was announced a month ago, I was quite excited to read in the Release Notes (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-release-notes):
LAPACK:
- Added the following improvements and optimizations for small matrices (N<16):
- Added ?gesvd, ?geqr/?gemqr, ?gelq/?gemlq optimizations for tall-and-skinny/short-and-wide matrices
So I gave it a try, but was quite disappointed. Not only did the NR routines still outperform the MKL routines, but for reasons that are not clear to me, performance actually dropped significantly in MKL 2018_0_124 compared to version 2017_4_210.
The same benchmark, for comparison:
- NR routines: 4 sec
- MKL 2017: 7 sec
- MKL 2018: 14 sec
The only change I made between the two variants was to re-compile and re-link against the newer version and to use the corresponding new DLLs.
Did I miss something? Or did I misunderstand the release notes? Does anybody have comparative data for running SVDs on matrices of size (2048 x 3) that would help me figure out whether this is a problem with the library or with the way I am using it?
I ran my tests with all 8 logical cores enabled on a 4-core, hyper-threaded i7-4712HQ.