Performance of DGEMM on Core2 Duo P9600 2.66 GHz

Hello,

I'm running some benchmarks using DGEMM from MKL and OpenBLAS (the GotoBLAS successor). I'm using code similar to the following (for some reason I can't put links in the post, but the code comes from this MKL forum):


/* mkl.h is required for dsecnd and DGEMM; stdio.h for printf */
#include <stdio.h>
#include <mkl.h>

/* initialization code is skipped for brevity (do a dummy dsecnd() call to improve accuracy of timing) */

double alpha = 1.0, beta = 1.0;
/* first call which does the thread/buffer initialization */
DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
/* start timing after the first GEMM call */
double time_st = dsecnd();
for (i = 0; i < LOOP_COUNT; ++i)
{
     DGEMM("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
}
double time_end = dsecnd();
double time_avg = (time_end - time_st)/LOOP_COUNT;
double gflop = (2.0*m*n*k)*1E-9;
printf("Average time: %e secs\n", time_avg);
printf("GFlop       : %.5f\n", gflop);
printf("GFlop/sec   : %.5f\n", gflop/time_avg);

I change only the timing function when OpenBLAS is used, and I run the program on square matrices of sizes from 1000 to 5000, with several repetitions for each size.

As a reference, I take the theoretical peak performance for my processor from intel com/support/processors/sb/CS-032819 htm (sorry for the ugly link format). For the Core2 Duo P9600 (P9000 series) at 2.66 GHz, the theoretical peak using 2 cores is 21.328 GFLOP/s. Running my program, I obtain a relative performance (R/Rmax) of about 95.2% for sizes between 3000 and 5000. This is very good performance, so congratulations to Intel. With OpenBLAS, the performance is very similar.

Then I also tested the performance using only one thread. The document about theoretical peak does not give a figure for one thread, so I use Rmax = 21.328/2 = 10.664 GFLOP/s. Running the benchmark program for sizes 3000 to 5000, I obtain about 10.68 to 10.76 GFLOP/s, i.e. R/Rmax = 100.15% to 100.9% (!!!!). With OpenBLAS, similar results are obtained.

How is this possible? How can the theoretical peak performance be reached? Is it correct to calculate the theoretical peak for 1 thread as R2thread/2? How can the strange value R/Rmax > 100% for 1 thread be explained? Has anyone tested DGEMM on a similar processor?
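As a rough cross-check of the numbers above, here is a minimal C sketch of the usual peak formula (cores x clock x flops per cycle). It assumes 4 double-precision flops per cycle per core (one 2-wide SSE2 add plus one 2-wide SSE2 multiply) and a clock of 2.666 GHz. Both values are assumptions, chosen because they reproduce the 21.328 figure; the function names are made up for the illustration.

```c
#include <assert.h>

/* Theoretical double-precision peak in GFLOP/s:
   cores * clock (GHz) * flops per cycle per core.
   flops_per_cycle = 4.0 is an ASSUMPTION for this processor family. */
double peak_gflops(int cores, double clock_ghz, double flops_per_cycle)
{
    assert(cores > 0 && clock_ghz > 0.0 && flops_per_cycle > 0.0);
    return cores * clock_ghz * flops_per_cycle;
}

/* Measured rate expressed as a percentage of the theoretical peak. */
double efficiency_percent(double measured_gflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_gflops / peak;
}
```

Under these assumptions, peak_gflops(1, 2.666, 4.0) gives 10.664, the same as 21.328/2, and efficiency_percent(10.76, 10.664) gives about 100.9%. Note that Rmax scales directly with the clock value used, so the ratio is sensitive to how the nominal 2.66 GHz is rounded and to the frequency the core actually runs at during the test.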

The FLOP count for DGEMM is 2*M*N*K, split between M*N*K multiplications and M*N*K additions. Does a multiplication take the same time as an addition, or is it slower?
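To make the 2*M*N*K count concrete, here is a small sketch of a naive, untuned C = A*B + C update (alpha = beta = 1, no transposes, column-major storage as in the DGEMM call) that counts the operations it performs. The type and function names are hypothetical, and this is in no way how MKL or OpenBLAS implement DGEMM.

```c
#include <assert.h>

/* Operation counters for one C = A*B + C update. */
typedef struct {
    long long mults;
    long long adds;
} op_count;

/* Naive column-major C = A*B + C, with A m-by-k, B k-by-n, C m-by-n.
   Leading dimensions are taken as m, k and m, as in the benchmark call. */
op_count naive_dgemm_count(int m, int n, int k,
                           const double *A, const double *B, double *C)
{
    op_count ops = {0, 0};
    assert(m > 0 && n > 0 && k > 0);
    for (int j = 0; j < n; ++j) {
        for (int l = 0; l < k; ++l) {
            for (int i = 0; i < m; ++i) {
                double prod = A[i + l * m] * B[l + j * k]; /* one multiplication */
                ops.mults++;
                C[i + j * m] += prod;                      /* one addition */
                ops.adds++;
            }
        }
    }
    return ops;
}
```

For m = n = k = 1000 both counters end at 10^9, so the total is 2*10^9 operations, which matches the gflop value of 2.0 computed in the benchmark code.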

Thanks
