I'm an experienced user of Intel MKL and OpenMP. The parallelism in my application is simple, so although I have used OpenMP for a long time, I have never needed its more complex features. One typical case is a parallel for loop in which each iteration makes several CBLAS or LAPACK calls. With MKL threaded with OpenMP I got very good performance, so I didn't pay much attention to TBB. However, a benchmark on the official Intel MKL website changed my mind (https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-...). It basically says that MKL threaded with TBB is roughly 2x faster than MKL threaded with OpenMP when multiple LAPACK functions are called in parallel. However, when I tried this myself by changing everything from OpenMP to TBB, I didn't get the same result. What am I missing, or am I misunderstanding something?
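To make the pattern concrete, here is a minimal sketch of the kind of loop described above. It is an illustration only, not the actual application code: `small_gemm` is a hypothetical stand-in for a real `cblas_dgemm` call so that the snippet compiles without MKL, and `run_batch` is likewise a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for cblas_dgemm on n x n row-major matrices:
// computes C = A * B. In the real application this would be an MKL call.
void small_gemm(std::size_t n, const std::vector<double>& a,
                const std::vector<double>& b, std::vector<double>& c) {
    assert(a.size() == n * n && b.size() == n * n);
    c.assign(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Outer loop over independent problems; every iteration makes its own
// BLAS-style call. If OpenMP is not enabled at compile time, the pragma
// is ignored and the loop simply runs serially.
std::vector<std::vector<double>> run_batch(
        std::size_t n,
        const std::vector<std::vector<double>>& as,
        const std::vector<std::vector<double>>& bs) {
    assert(as.size() == bs.size());
    std::vector<std::vector<double>> cs(as.size());
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(as.size()); ++i) {
        small_gemm(n, as[i], bs[i], cs[i]);
    }
    return cs;
}
```

In the TBB variant, the `#pragma omp parallel for` loop would be replaced by a `tbb::parallel_for` over the same index range, with the body otherwise unchanged.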