Hello everyone,
I am a PhD student researching parallel programming. In my next research paper, I aim to present high-performance OpenCL implementations of the Basic Linear Algebra Subroutines (BLAS) -- in particular the matrix multiplication routine GEMM -- for matrix sizes typical of deep learning; my target hardware is Intel Xeon CPUs. To strengthen my evaluation, I want to compare against the fastest state-of-the-art BLAS implementation for Intel Xeon CPUs.
My question is: Which is currently the fastest BLAS implementation for Intel Xeon CPUs on matrix sizes typical of deep learning? Is it the Intel Math Kernel Library (MKL)?
Many thanks in advance.
Best,
Richard