In my computation, I manually offload some work to the MIC using offload pragmas. The offloaded computation includes a call to MKL's double-precision general matrix-matrix multiplication (dgemm). Work is divided between the host CPU and the MIC based on a performance model. The performance model relies on DGEMM performance (in Gflop/s), which I recorded offline by running a microbenchmark over various operand sizes (m, n and k).
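For context, here is a minimal sketch of the kind of offloaded dgemm call I am using (the array names, row-major layout and leading dimensions are illustrative assumptions, not my exact code):

```c
#include <mkl.h>

/* Hypothetical helper: compute C = A*B on the coprocessor, where
   A is m x k, B is k x n and C is m x n, stored row-major. */
void offload_dgemm(const double *A, const double *B, double *C,
                   int m, int n, int k)
{
    #pragma offload target(mic:0) \
        in(A : length(m * k))     \
        in(B : length(k * n))     \
        inout(C : length(m * n))
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 0.0, C, n);
    }
}
```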
Before the actual computation starts, I run a warm-up dgemm call on the largest operand sizes I will encounter in my computation (in my case m = n ~ 10000 and k ~ 200). Even after the warm-up call, I observe that for some dgemm calls performance is still unexpectedly low:
k0 = 2,   m = 2405, n = 903,  k = 192, flop rate =  67.2766
k0 = 2,   m = 2405, n = 903,  k = 192, flop rate = 440.115
k0 = 17,  m = 2422, n = 1066, k = 192, flop rate =  67.5244
k0 = 17,  m = 2422, n = 1066, k = 192, flop rate = 599.45
k0 = 346, m = 2812, n = 1280, k = 2,   flop rate =   1.49697
k0 = 346, m = 2812, n = 1280, k = 2,   flop rate =  15.2189
Above are some of the anomalous performance figures I observed. m, n and k are the dimensions of the dgemm call (k0 is the iteration number and is irrelevant for the present discussion). Note that I ran each call twice; the second time, the measured flop rate corroborates the estimated value nicely. However, in the real computation I may not have the option of running each dgemm twice.
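The flop rates above come from wall-clock timing around each offloaded call, roughly as follows (a sketch only; `offload_dgemm` and `measure_twice` are the hypothetical helpers from/for the snippet above, not my actual instrumentation):

```c
#include <stdio.h>
#include <omp.h>

void offload_dgemm(const double *A, const double *B, double *C,
                   int m, int n, int k);   /* hypothetical helper above */

/* Time the same offloaded dgemm twice back to back and report Gflop/s;
   dgemm performs roughly 2*m*n*k floating-point operations. */
void measure_twice(const double *A, const double *B, double *C,
                   int m, int n, int k, int k0)
{
    for (int rep = 0; rep < 2; ++rep) {
        double t0 = omp_get_wtime();
        offload_dgemm(A, B, C, m, n, k);
        double t = omp_get_wtime() - t0;
        printf("k0 = %d, m = %d, n = %d, k = %d, flop rate = %g\n",
               k0, m, n, k, 2.0 * m * n * k / t / 1e9);
    }
}
```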
I am trying to understand what might cause such behaviour. Can such performance anomalies be mitigated by warming up dgemm for different sizes? If so, which sizes should I run for the warm-up, and what is the minimum number of calls required? (I am presently using trial and error, assuming that the anomaly can be mitigated by performing a series of warm-up calls of suitable sizes, along the lines of the sketch below.)
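What I am currently experimenting with looks roughly like this: before the iterative computation starts, issue a handful of warm-up calls on dummy buffers covering the extreme operand shapes I expect. The specific shapes below are assumptions chosen for illustration, not a recommendation:

```c
#include <stdlib.h>

void offload_dgemm(const double *A, const double *B, double *C,
                   int m, int n, int k);   /* hypothetical helper above */

/* Hypothetical warm-up: run the offloaded dgemm for a few
   representative shapes before the real iterations begin. */
void warm_up_mic(void)
{
    /* Candidate (m, n, k) shapes; chosen by trial and error so far. */
    const int shapes[][3] = {
        { 10000, 10000, 200 },   /* largest shape in the computation    */
        {  3000,  1500, 192 },   /* typical mid-size shape              */
        {  3000,  1500,   2 },   /* small-k shape (slowest in the log)  */
    };
    const int nshapes = sizeof(shapes) / sizeof(shapes[0]);

    for (int i = 0; i < nshapes; ++i) {
        int m = shapes[i][0], n = shapes[i][1], k = shapes[i][2];
        double *A = calloc((size_t)m * k, sizeof(double));
        double *B = calloc((size_t)k * n, sizeof(double));
        double *C = calloc((size_t)m * n, sizeof(double));
        offload_dgemm(A, B, C, m, n, k);
        free(A); free(B); free(C);
    }
}
```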
(The computation is iterative in nature, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, it causes a load imbalance between the host CPU and the MIC, which may have a cascading effect on subsequent iterations due to the nature of the computation.)