Hi guys,
Are there any highly optimized MKL routines, or maybe performance primitives, that can do rectangular matrix transposition without scaling?
I've been using mkl_omatcopy, but it seems to perform worse than a naive baseline implementation, and I suspect this is due to the additional scaling it performs. I've attached a plot comparing a naive baseline implementation against omatcopy and imatcopy. The latter, I know, runs very poorly on non-square matrices.
I just want to know whether I should start spending some time optimizing my own transpose routine with AVX/AVX2 and blocking or whether there's a very efficient one out there already.
Also, just swapping the indices instead of physically transposing the data is not viable for what I am trying to achieve.
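For reference, a cache-blocked transpose of the kind mentioned above might look roughly like this. It is only a minimal sketch, assuming row-major double data and an out-of-place copy. The name transpose_blocked and the block size of 32 are placeholders for illustration, not tuned values, and there is no AVX/AVX2 in it yet.

```c
#include <assert.h>
#include <stddef.h>

/* Block edge in elements; 32 is a placeholder, not a tuned value. */
#define TR_BLOCK 32

/* Out-of-place transpose of a rows x cols row-major matrix.
   dst must hold cols x rows elements: dst[j*rows + i] = src[i*cols + j].
   No scaling is applied. */
void transpose_blocked(const double *src, double *dst,
                       size_t rows, size_t cols)
{
    assert(src != dst); /* this sketch does not handle in-place */

    for (size_t ib = 0; ib < rows; ib += TR_BLOCK) {
        /* Clamp the tile so edge tiles of non-multiple sizes work. */
        size_t imax = (ib + TR_BLOCK < rows) ? ib + TR_BLOCK : rows;
        for (size_t jb = 0; jb < cols; jb += TR_BLOCK) {
            size_t jmax = (jb + TR_BLOCK < cols) ? jb + TR_BLOCK : cols;
            /* Transpose one tile; both the source and destination
               tiles are small enough to stay in cache together. */
            for (size_t i = ib; i < imax; ++i)
                for (size_t j = jb; j < jmax; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

The inner tile loop is where AVX/AVX2 shuffles would go, and the block size would need to be tuned per machine.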
Thank you in advance!
Ioan