Hello,
I am trying to compare my own GEMV implementation against Intel MKL. For benchmarking I use the following code:
size_t M = 64; // rows
size_t N = 2;  // columns

// allocate memory
float *matrix = (float*) mkl_malloc(M*N * sizeof(float), 64);
float *vector = (float*) mkl_malloc(N * sizeof(float), 64);
float *result = (float*) mkl_malloc(M * sizeof(float), 64);

// execute warm up calls
for (size_t i = 0; i < NUM_WARMUPS; ++i) {
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N,
                vector, 1, 0.0f, result, 1);
}

// measure runtime
float avg_runtime = 0;
for (size_t i = 0; i < NUM_EVALUATIONS; ++i) {
    auto start = dsecnd();
    cblas_sgemv(CblasRowMajor, CblasNoTrans, M, N, 1.0f, matrix, N,
                vector, 1, 0.0f, result, 1);
    auto end = dsecnd();
    float runtime = (end - start) * 1000;
    avg_runtime += runtime;
}
avg_runtime /= NUM_EVALUATIONS;
std::cout << "avg_runtime: " << avg_runtime << std::endl;

// free buffers
mkl_free(matrix);
mkl_free(vector);
mkl_free(result);
On my system this gives an average runtime of around 0.0003 ms, with the first evaluation taking around 0.002 ms. Because the average seemed implausibly fast, even for such a small input size, I printed the runtimes of all 200 evaluations to make sure my calculation of the average was correct. If I add a
std::cout << runtime << std::endl;
in line 29 (inside the measurement loop), the measured runtimes are much higher: every one of the 200 evaluations now takes around 0.002 ms. That seems more plausible compared to other libraries and to my own implementation.
It seems like the compiler notices that I call the routine repeatedly with the exact same input and the result is never used, and optimizes some of the work away. Can anyone confirm this? What is the recommended way of benchmarking MKL routines?
Thanks in advance!