I've discovered a strange performance problem: replacing a daxpy() call with daxpby() significantly degrades program scalability. I've written a simple test case (hopefully attached) showing that daxpy() throughput scales with the number of threads, while daxpby() throughput stays flat. It's as if daxpby() isn't parallelized!
I'm using dense, aligned, million-element vectors of doubles. The vectors are uninitialized, but that shouldn't matter for this comparison, since the same vectors are passed to both calls.
I see the same behavior with mkl_2013.0.079 and mkl_2013_sp1.1.106.
Here's my program's output:
Threads=1; 1450.095253 daxpy() calls/sec
Threads=1; 1391.271988 daxpby() calls/sec
Threads=2; 2810.048711 daxpy() calls/sec
Threads=2; 1387.726834 daxpby() calls/sec
Threads=3; 4056.252211 daxpy() calls/sec
Threads=3; 1371.165840 daxpby() calls/sec
Threads=4; 5288.407756 daxpy() calls/sec
Threads=4; 1390.385973 daxpby() calls/sec
Threads=5; 6248.746533 daxpy() calls/sec
Threads=5; 1394.361390 daxpby() calls/sec
Threads=6; 7418.605490 daxpy() calls/sec
Threads=6; 1388.381384 daxpby() calls/sec
Threads=7; 8796.486642 daxpy() calls/sec
Threads=7; 1384.350895 daxpby() calls/sec
Threads=8; 9896.086698 daxpy() calls/sec
Threads=8; 1387.321742 daxpby() calls/sec
Note that the number of daxpy() calls per second increases with the thread count, while daxpby() throughput stays flat at roughly 1390 calls/sec regardless of the number of threads.