Dear support team!
We have run into scalability problems with the QR algorithm, which we implemented with two MKL functions: LAPACKE_sgehrd to reduce the matrix to Hessenberg form, and LAPACKE_shseqr to perform the QR iterations themselves.
Here is the code, which we ran on a Xeon E5 v3 processor with 14 cores:
omp_set_num_threads(threads_count);
cout << "threads count: "<< omp_get_max_threads() << endl;
double t1 = omp_get_wtime();
LAPACKE_sgehrd(LAPACK_ROW_MAJOR, size, 1, size, A, size, tau);
double t2 = omp_get_wtime();
cout << "LAPACKE_sgehrd time: "<< t2 - t1 << " sec"<< endl;
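float *re = new float[size]; // real parts of the computed eigenvalues
float *im = new float[size]; // imaginary parts of the computed eigenvalues
float *z = nullptr; // not referenced because compz = 'N'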
t1 = omp_get_wtime();
LAPACKE_shseqr(LAPACK_ROW_MAJOR, 'E', 'N', size, 1, size, A, size, re, im, z, size);
t2 = omp_get_wtime();
cout << "LAPACKE_shseqr time: "<< t2 - t1 << " sec"<< endl;
The compiler we used is icc (ICC) 15.0.3 20150407. Here are the timings for runs with 1, 2, 3, 4 and 14 threads:
threads count: 1
LAPACKE_sgehrd time: 84.4017 sec
LAPACKE_shseqr time: 30.4593 sec
threads count: 2
LAPACKE_sgehrd time: 45.2026 sec
LAPACKE_shseqr time: 27.8578 sec
threads count: 3
LAPACKE_sgehrd time: 35.0818 sec
LAPACKE_shseqr time: 25.2905 sec
threads count: 4
LAPACKE_sgehrd time: 27.3022 sec
LAPACKE_shseqr time: 28.1272 sec
threads count: 14
LAPACKE_sgehrd time: 19.8118 sec
LAPACKE_shseqr time: 27.1131 sec
As the timings show, LAPACKE_sgehrd scales poorly (about a 4.3x speedup on 14 threads), while LAPACKE_shseqr barely scales at all (about 1.1x on 14 threads). Is there any way to improve the scalability of both these routines, or is this working as intended?
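To put numbers on this, here is a minimal sketch that turns the timings above into speedup and parallel efficiency. The names Timing, speedup, efficiency, serial_fraction and print_scaling are our own helpers for illustration and are not part of MKL. The serial fraction uses the Karp-Flatt metric, which is our choice of estimate and assumes the single-thread run is the baseline.

```cpp
#include <cassert>
#include <iomanip>
#include <ostream>
#include <vector>

// One measurement from the runs above: thread count and wall time in seconds.
struct Timing {
    int threads;
    double seconds;
};

// Speedup relative to the single-thread run: S(p) = T(1) / T(p).
double speedup(double t1, double tp) { return t1 / tp; }

// Parallel efficiency: E(p) = S(p) / p, where 1.0 would be ideal scaling.
double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

// Karp-Flatt metric: experimentally determined serial fraction,
// e = (1/S - 1/p) / (1 - 1/p). Only defined for p > 1.
double serial_fraction(double t1, double tp, int p) {
    assert(p > 1);
    const double inv_s = tp / t1;
    const double inv_p = 1.0 / p;
    return (inv_s - inv_p) / (1.0 - inv_p);
}

// Timings copied from the output above.
const std::vector<Timing> sgehrd_runs = {
    {1, 84.4017}, {2, 45.2026}, {3, 35.0818}, {4, 27.3022}, {14, 19.8118}};
const std::vector<Timing> shseqr_runs = {
    {1, 30.4593}, {2, 27.8578}, {3, 25.2905}, {4, 28.1272}, {14, 27.1131}};

// Print one line per run; the first entry is taken as the baseline.
void print_scaling(std::ostream& out, const char* name,
                   const std::vector<Timing>& runs) {
    const double t1 = runs.front().seconds;
    out << name << '\n' << std::fixed << std::setprecision(2);
    for (const Timing& r : runs) {
        out << "  threads=" << r.threads
            << "  speedup=" << speedup(t1, r.seconds)
            << "  efficiency=" << efficiency(t1, r.seconds, r.threads);
        if (r.threads > 1) {
            out << "  serial_fraction="
                << serial_fraction(t1, r.seconds, r.threads);
        }
        out << '\n';
    }
}
```

Calling print_scaling(std::cout, "LAPACKE_sgehrd", sgehrd_runs) and the same for shseqr_runs from main (with <iostream> included) prints one line per thread count. An efficiency well below 1 together with a large serial fraction means that most of the run time does not benefit from additional threads.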
Sincerely,
Vladislav Shishvatov