I need an in-place transpose of a very large matrix. I am using mkl_simatcopy, but I am seeing a performance problem when transposing in place. The machine is an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz with 72 physical cores, running Red Hat.
My observation is that only a single core is used during the transpose, not all of the cores. I have tried setting environment variables such as MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compilation command is as follows:
gcc -std=c99 -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out
Timings obtained are as follows:

S.No.  No. of Rows  No. of Cols  Time (sec)
1      16384        8192         16
2      16384        32768        68
3      32768        65536        233
The data type is float. Please let me know if there is a more efficient way to transpose in place, how to make the transpose use multiple cores, or how else to reduce the execution time.
Below is the code snippet from transpose.c:
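#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>

/* initalizeData() and cpuSecond() are defined elsewhere in transpose.c (not shown). */

int main(int argc,char *argv[])
{
    if(argc!=3)
    {
        printf("Usage : exe NoofScan and NoofPix \n");
        exit(0);
    }
    unsigned long noOfScan = atol(argv[1]);
    unsigned long noOfPix = atol(argv[2]);
    /* %lu matches unsigned long; %d here was a format mismatch */
    printf("----->>>> noOfScan = %lu and noOfPix =%lu \n",noOfScan,noOfPix);
    size_t nEle = noOfScan * noOfPix;
    float *data = (float *)calloc(nEle,sizeof(float));
    if(data==NULL)
    {
        printf("Allocation of %zu floats failed \n",nEle);
        exit(1);
    }
    initalizeData(data,noOfScan,noOfPix);
    /* mkl_get_max_threads() returns int */
    int nt = mkl_get_max_threads();
    printf("No Of threads are = %d \n",nt);
    mkl_set_num_threads_local(nt);
    //mkl_set_num_threads(nt);
    double time1 = cpuSecond();
    /* in-place transpose of a row-major noOfScan x noOfPix matrix, alpha = 1 */
    mkl_simatcopy('R','T',noOfScan,noOfPix,1.0f,data,noOfPix,noOfScan);
    printf("Time elapsed is %lf \n",cpuSecond()-time1);
    memset(data,0,nEle*sizeof(float));
    free(data);
    return 0;
}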