Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
Warming up strategy for MIC dgemm call

In my computation, I manually offload part of the work to the MIC using offload pragmas. The offloaded work includes a call to MKL's double-precision general matrix-matrix multiplication (dgemm). Work is divided between the host CPU and the MIC according to a performance model. The model relies on dgemm performance (in Gflop/s), which I record offline by running a microbenchmark over a range of operand sizes (m, n and k).

Before the actual computation starts, I run a warm-up dgemm call with the largest operand sizes I will encounter in my computation (in my case n = m ~ 10000 and k ~ 200). Even after the warm-up call, I observe that performance is still unexpectedly low for some dgemm calls:

k0 =2, m 2405 n 903 ,k 192, flop rate 67.2766

k0 =2, m 2405 n 903 ,k 192, flop rate 440.115

k0 =17, m 2422 n 1066 ,k 192, flop rate 67.5244

k0 =17, m 2422 n 1066 ,k 192, flop rate 599.45

k0 =346, m 2812 n 1280 ,k 2, flop rate 1.49697

k0 =346, m 2812 n 1280 ,k 2, flop rate 15.2189

Above are some of the anomalous results I observed. m, n and k are the dimensions of the dgemm call (k0 is the iteration number, which is irrelevant to this discussion). Note that I ran each call twice, and the second time the measured flop rate agrees well with the estimated value. However, in the real computation I may not have the option of running each dgemm twice.
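One way to read these numbers is to convert each rate back into a wall-clock time. The sketch below assumes the usual dgemm operation count of 2*m*n*k and that the logged rates are in Gflop/s; both are assumptions about how the log was produced, and the function names are made up for illustration. Under those assumptions it computes the time implied by the first and second call of each shape and the difference between them.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed operation count for C = A*B with A (m x k) and B (k x n):
   one multiply and one add per inner-product term. */
double gemm_flops(int m, int n, int k) {
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Wall-clock time in milliseconds implied by a rate given in Gflop/s. */
double implied_ms(int m, int n, int k, double gflops) {
    assert(gflops > 0.0);
    return gemm_flops(m, n, k) / (gflops * 1e9) * 1e3;
}

/* Extra time of the slow first call relative to the fast second call. */
double first_call_penalty_ms(int m, int n, int k,
                             double first_gflops, double second_gflops) {
    return implied_ms(m, n, k, first_gflops)
         - implied_ms(m, n, k, second_gflops);
}

/* Print the implied times for the three shapes logged in the post. */
void print_logged_cases(void) {
    static const struct { int m, n, k; double first, second; } cases[] = {
        {2405,  903, 192, 67.2766, 440.115},
        {2422, 1066, 192, 67.5244, 599.45},
        {2812, 1280,   2, 1.49697, 15.2189},
    };
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; ++i) {
        printf("m %d n %d k %d: first %.3f ms, second %.3f ms, extra %.3f ms\n",
               cases[i].m, cases[i].n, cases[i].k,
               implied_ms(cases[i].m, cases[i].n, cases[i].k, cases[i].first),
               implied_ms(cases[i].m, cases[i].n, cases[i].k, cases[i].second),
               first_call_penalty_ms(cases[i].m, cases[i].n, cases[i].k,
                                     cases[i].first, cases[i].second));
    }
}
```

With these assumptions the slow calls come out roughly 9 to 13 ms longer than the fast ones for all three shapes, even though the shapes differ in flop count by a factor of more than 50. That would be consistent with a roughly fixed one-time cost per slow call rather than a uniformly slower kernel, but the reading is only as good as the assumptions above.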

I am trying to understand what might cause this behaviour. Can this performance anomaly be mitigated by warming up dgemm with several different sizes? If so, which sizes should I run for the warm-up, and what is the minimum number of calls required? (I am currently using trial and error, assuming that the anomaly can be mitigated by a series of warm-up calls of suitable sizes.)
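The trial-and-error search can be made more systematic with a small harness that runs a candidate list of warm-up shapes and then times the first call of the shape of interest. A minimal sketch follows. The names `gemm_shape`, `gemm_fn`, `naive_dgemm`, `timed_gemm` and `run_warmups` are hypothetical, and the naive kernel is only a stand-in so that the sketch compiles without MKL; in the real code the callback would wrap the offloaded dgemm call. Whether any set of warm-up shapes actually removes the slow first call is exactly what such an experiment would have to show.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* exposes clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <time.h>

typedef struct { int m, n, k; } gemm_shape;

/* Callback type: C = A*B, column-major, A is m x k, B is k x n. */
typedef void (*gemm_fn)(int m, int n, int k,
                        const double *a, const double *b, double *c);

/* Stand-in kernel, NOT MKL: replace with a wrapper around the real call. */
void naive_dgemm(int m, int n, int k,
                 const double *a, const double *b, double *c) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            double sum = 0.0;
            for (int p = 0; p < k; ++p)
                sum += a[i + (size_t)p * m] * b[p + (size_t)j * k];
            c[i + (size_t)j * m] = sum;
        }
    }
}

static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Allocate `count` doubles and write `value` to each, so that the pages
   are touched before the timed region. Returns NULL on failure. */
static double *filled(size_t count, double value) {
    double *p = malloc(count * sizeof *p);
    if (p != NULL)
        for (size_t i = 0; i < count; ++i) p[i] = value;
    return p;
}

/* Run one gemm of shape `s` and return the elapsed seconds,
   or -1.0 if a buffer could not be allocated. */
double timed_gemm(gemm_fn gemm, gemm_shape s) {
    assert(gemm != NULL && s.m > 0 && s.n > 0 && s.k > 0);
    double *a = filled((size_t)s.m * s.k, 1.0);
    double *b = filled((size_t)s.k * s.n, 1.0);
    double *c = filled((size_t)s.m * s.n, 0.0);
    double elapsed = -1.0;
    if (a && b && c) {
        double t0 = now_seconds();
        gemm(s.m, s.n, s.k, a, b, c);
        elapsed = now_seconds() - t0;
    }
    free(a); free(b); free(c);
    return elapsed;
}

/* Run every warm-up shape `reps` times; return the number of calls
   that completed. */
int run_warmups(gemm_fn gemm, const gemm_shape *shapes, int count, int reps) {
    int calls = 0;
    for (int i = 0; i < count; ++i)
        for (int r = 0; r < reps; ++r)
            if (timed_gemm(gemm, shapes[i]) >= 0.0) ++calls;
    return calls;
}
```

A trial would call `run_warmups` with one candidate list, then compare `timed_gemm` for the target shape against the rate from the offline microbenchmark, and start a fresh process for the next candidate list so that earlier warm-ups do not leak into later trials. Note that this sketch allocates fresh buffers for every call; a real experiment should reuse whatever buffer and offload pattern the production run uses.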

(The computation is iterative, so a large number of offloads are performed. If I incorrectly estimate the time taken by the computation on the MIC, the resulting load imbalance between the host CPU and the MIC may have a cascading effect on subsequent iterations, because of the nature of the computation.)
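To make the load-imbalance concern concrete, here is a minimal sketch of a rate-proportional split. The post does not say how its performance model divides the work, so the proportional rule, the function names and the host rate used in the note after the code are assumptions chosen purely for illustration.

```c
#include <assert.h>

/* Columns of an n-column update assigned to the coprocessor so that both
   sides are predicted to finish together (share proportional to rate).
   Assumed rule, not taken from the post. */
int mic_columns(int n, double host_gflops, double mic_gflops) {
    assert(n >= 0 && host_gflops > 0.0 && mic_gflops > 0.0);
    double share = mic_gflops / (host_gflops + mic_gflops);
    return (int)(share * n + 0.5); /* round to nearest column */
}

/* Seconds for an (m x k) by (k x cols) gemm at the given rate in Gflop/s,
   assuming 2*m*cols*k floating-point operations. */
double gemm_seconds(int m, int cols, int k, double gflops) {
    assert(gflops > 0.0);
    return 2.0 * (double)m * (double)cols * (double)k / (gflops * 1e9);
}

/* Time one side spends waiting for the other when the split was planned
   with `mic_predicted` but the coprocessor really runs at `mic_actual`. */
double imbalance_seconds(int m, int n, int k, double host_gflops,
                         double mic_predicted, double mic_actual) {
    int n_mic = mic_columns(n, host_gflops, mic_predicted);
    double t_mic = gemm_seconds(m, n_mic, k, mic_actual);
    double t_host = gemm_seconds(m, n - n_mic, k, host_gflops);
    return t_mic > t_host ? t_mic - t_host : t_host - t_mic;
}
```

For example, taking a hypothetical host rate of 200 Gflop/s and the logged shape m 2405, n 903, k 192, a predicted MIC rate of 440.115 sends 621 of the 903 columns to the coprocessor. If the MIC then runs at the slow rate of 67.2766 instead, the host would sit idle for about 7 ms in that one step under this model.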