I'm running MKL Linpack (xhpl.2018.3.222.static) with MKL_ENABLE_INSTRUCTIONS=SSE4_2 on Skylake with turbo enabled.
I've tried this with three different releases of MKL and three different Skylake processors. They all show the same effect, but with different frequencies, of course.
The base thread of each of the MPI ranks runs at the AVX512 turbo frequency, while the other threads run at the expected non-AVX frequency.
If I specify AVX2, all threads run at the AVX 2.0 frequency, as expected.
If I specify AVX512, all threads run at the AVX 512 frequency, as expected.
At first I thought the SSE 4.2 run might be using 512-bit instructions on those two CPUs, but reading the counters through the performance MSRs shows that only the expected Floating Point Double Precision instructions are being retired.
Here are some characteristics of my Skylake processor and the Linpack run (frequencies are all-cores-active max frequencies, in GHz):
Cores per processor: 8

                 Frequency (GHz)   GFlops         Run time (sec)
non-AVX turbo    4.1               2.07505e+02    222.87
AVX 2.0 turbo    3.7               8.22624e+02     56.22
AVX 512 turbo    3.0               1.30613e+03     35.41
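
As a rough sanity check on the table above, the GFlops figures can be converted to double-precision flops per cycle per core. This is only a sketch under stated assumptions: it assumes the two-rank run used all 16 cores (two 8-core processors) and that every core ran at the listed all-core frequency. `TOTAL_CORES` and `flops_per_cycle_per_core` are illustrative names, not part of the attached script.

```python
# Rows copied from the table above: (label, all-core turbo in GHz, measured GFlops).
RUNS = [
    ("non-AVX (SSE4_2)", 4.1, 207.505),
    ("AVX 2.0", 3.7, 822.624),
    ("AVX 512", 3.0, 1306.13),
]

# Assumption: two processors with 8 cores each, all of them busy.
TOTAL_CORES = 16


def flops_per_cycle_per_core(gflops, ghz, cores=TOTAL_CORES):
    """Divide the aggregate GFlops by the aggregate GHz across all cores."""
    return gflops / (ghz * cores)


for label, ghz, gflops in RUNS:
    value = flops_per_cycle_per_core(gflops, ghz)
    print(f"{label:18s} {value:6.2f} flops/cycle/core")
```

For the SSE4_2 run, two of the cores actually ran closer to 3.0 GHz, so dividing by 4.1 GHz for every core slightly understates that row.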
Below is a turbostat snapshot taken while running with SSE4_2.
(There's a bit of bouncing around of frequencies as the job runs, but you can see that the CPU 0 & 8 frequencies are low, tending toward 3.0 GHz, and the other 14 CPUs' frequencies are high, tending toward 4.1 GHz.)
Core  CPU  Avg_MHz  Busy%   Bzy_MHz  TSC_MHz  IRQ    SMI  CPU%c1  CPU%c3  CPU%c6  CPU%c7  CoreTmp  PkgTmp  PkgWatt  RAMWatt  PKG_%  RAM_%
-     -    3957     100.00  3967     3891     15914  0    0.00    0.00    0.00    0.00    69       69      317.50   0.00     0.00   0.00
4     1    4090     100.00  4100     3891     5011   0    0.00    0.00    0.00    0.00    54
8     2    4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    66
9     3    4090     100.00  4100     3891     84     0    0.00    0.00    0.00    0.00    67
11    4    4090     100.00  4100     3891     8      0    0.00    0.00    0.00    0.00    63
16    0    3047     100.00  3054     3891     5626   0    0.00    0.00    0.00    0.00    55       67      153.59   0.00     0.00   0.00
18    5    4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    67
19    6    4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    64
25    7    4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    63
1     8    3006     100.00  3013     3891     5080   0    0.00    0.00    0.00    0.00    50       69      163.91   0.00     0.00   0.00
2     9    4090     100.00  4100     3891     10     0    0.00    0.00    0.00    0.00    56
3     10   4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    66
4     11   4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    67
8     12   4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    67
18    13   4090     100.00  4100     3891     9      0    0.00    0.00    0.00    0.00    69
24    14   4090     100.00  4100     3891     8      0    0.00    0.00    0.00    0.00    69
27    15   4090     100.00  4100     3891     15     0    0.00    0.00    0.00    0.00    67
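
For anyone who wants to pick out the slow CPUs from a snapshot like this automatically, here is a minimal sketch that parses whitespace-separated turbostat text and lists the CPUs whose Bzy_MHz falls below a cutoff. The function names and the 3500 MHz threshold are assumptions made for illustration (the threshold is simply a value between the 3.0 GHz and 4.1 GHz levels seen here), and the sample is trimmed to the first five columns of a few rows from the snapshot above.

```python
# A few rows from the snapshot above, trimmed to the first five columns.
SAMPLE = """\
Core CPU Avg_MHz Busy% Bzy_MHz
- - 3957 100.00 3967
4 1 4090 100.00 4100
16 0 3047 100.00 3054
18 5 4090 100.00 4100
1 8 3006 100.00 3013
2 9 4090 100.00 4100
"""


def busy_mhz_by_cpu(turbostat_text):
    """Map each CPU number to its Bzy_MHz, skipping the '-' summary row."""
    rows = [ln.split() for ln in turbostat_text.splitlines() if ln.strip()]
    header, data = rows[0], rows[1:]
    cpu_col = header.index("CPU")
    mhz_col = header.index("Bzy_MHz")
    result = {}
    for fields in data:
        if fields[cpu_col] == "-":
            continue  # system-wide summary line, not a single CPU
        result[int(fields[cpu_col])] = int(fields[mhz_col])
    return result


def slow_cpus(turbostat_text, threshold_mhz=3500):
    """Return the sorted CPU numbers running below threshold_mhz (assumed cutoff)."""
    mhz = busy_mhz_by_cpu(turbostat_text)
    return sorted(cpu for cpu, value in mhz.items() if value < threshold_mhz)


print(slow_cpus(SAMPLE))
```

Looking the columns up by header name rather than by fixed position keeps the sketch working if the text is pasted with the full set of turbostat columns.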
I used the attached script to reproduce this. It takes an optional argument for the desired setting for MKL_ENABLE_INSTRUCTIONS, defaulting to SSE4_2. It will create an HPL.dat file if it does not exist, and run Linpack with two MPI ranks.
-- Chuck Newman