Hi,
I was running two consecutive dgemm operations: T=AB and C=A'T with A=(56,000x400,000), B=(400,000x30), T=(56,000x30) and C having the same dimensions as B (400,000x30).
Depending on the CPU, I measured these wall clock times (for the dgemm operations only):
Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz with 36 (real) cores, 46080 KB cache, 250GB of RAM
T=AB: 3.73 seconds
C=A'T: 4.17 seconds
Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 56 (real) cores, 19712 KB cache, 2TB of RAM
T=AB: 91.47 seconds
C=A'T: 232.78 seconds
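To put these timings in perspective, here is a rough back-of-the-envelope sketch that converts them into achieved GFLOP/s and into the rate at which A would have to be streamed from memory. It assumes the usual 2*m*n*k flop count for a GEMM and 8 bytes per double; the function and variable names are only illustrative and not part of MKL.

```python
# Dimensions from the post: A is M x K, B is K x N, T is M x N.
M, K, N = 56_000, 400_000, 30

def gemm_flops(m, n, k):
    """Approximate flop count of an (m x k) by (k x n) product."""
    return 2 * m * n * k

def gflops(flops, seconds):
    """Achieved rate in GFLOP/s for a given wall clock time."""
    return flops / seconds / 1e9

# Both products (T=AB and C=A'T) have the same flop count.
FLOPS_PER_GEMM = gemm_flops(M, N, K)

# A dominates the memory traffic; assume double precision (8 bytes).
A_BYTES = M * K * 8

# Wall clock times in seconds, copied from the measurements above.
TIMINGS = {
    "E5-2697 v4, T=AB": 3.73,
    "E5-2697 v4, C=A'T": 4.17,
    "Gold 5117, T=AB": 91.47,
    "Gold 5117, C=A'T": 232.78,
}

for label, seconds in TIMINGS.items():
    rate = gflops(FLOPS_PER_GEMM, seconds)
    # Lower bound on bandwidth if A is read exactly once per product.
    bandwidth = A_BYTES / seconds / 1e9
    print(f"{label}: {rate:.1f} GFLOP/s, at least {bandwidth:.1f} GB/s for A")
```

With only 30 columns in B, each element of A is used in just 30 multiply-adds, so both products are likely limited by how fast A can be read rather than by the cores' floating point throughput. That is an assumption of this sketch, not something I measured.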
What was particularly striking was that T=AB used all 56 cores, whereas C=A'T used only half of them.
The KMP setting was: KMP_AFFINITY=compact,1,0,granularity=fine
I am wondering whether the poor performance of the Gold 5117 system is solely attributable to its architecture and is therefore set in stone, or whether I can tune the MKL/KMP environment variables to improve performance.
Thanks