Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library
MKL Rectangular matrix Inplace transpose performance issue

I want to transpose a very large matrix in place in memory. I am using mkl_simatcopy, but I am seeing a performance problem with the in-place transpose. I am currently using an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz system with 72 physical cores, running Red Hat.

My observation is that the transpose runs on a single core instead of using all of the cores. I have tried setting environment variables such as MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compile command is as follows:

gcc  -std=c99    -m64 -I $MKLROOT/include transpose.c  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group  -lstdc++ -lpthread -lm -ldl -o transpose.out

The timings obtained are as follows:

S.No.   No. of Rows   No. of Cols   Time (sec)
1       16384         8192          16
2       16384         32768         68
3       32768         65536         233
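
To put these timings in perspective, the matrix sizes and the implied memory throughput can be worked out with a small sketch like the one below. The helper names matrix_bytes and effective_mib_per_s are hypothetical, and the throughput figure assumes that every element is read once and written once, which is only a rough model of an in-place transpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes occupied by a rows x cols matrix. */
static size_t matrix_bytes(size_t rows, size_t cols, size_t elem_size)
{
    assert(elem_size > 0);
    return rows * cols * elem_size;
}

/* Hypothetical helper: MiB moved per second, assuming (as a rough model)
   that each element is read once and written once during the transpose. */
static double effective_mib_per_s(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)bytes / (1024.0 * 1024.0) / seconds;
}
```

Under that assumption the three float matrices are 0.5 GiB, 2 GiB and 8 GiB, and the measured times correspond to roughly 64, 60 and 70 MiB/s.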

The data type is float. Please let me know whether there is a more efficient way to transpose in place, how to make the transpose use multiple cores, or how else to reduce the execution time.

Below is the code snippet from transpose.c:

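#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mkl.h>

/* initalizeData() and cpuSecond() are helpers defined elsewhere in transpose.c. */

int main(int argc, char *argv[])
{
        if (argc != 3)
        {
                printf("Usage : exe NoofScan NoofPix \n");
                exit(1);
        }
        unsigned long noOfScan = (unsigned long)atol(argv[1]);  /* number of rows */
        unsigned long noOfPix = (unsigned long)atol(argv[2]);   /* number of columns */
        printf("----->>>>  noOfScan = %lu and noOfPix = %lu \n", noOfScan, noOfPix);
        size_t nEle = (size_t)noOfScan * noOfPix;

        float *data = (float *)calloc(nEle, sizeof(float));
        if (data == NULL)
        {
                printf("Allocation of %zu floats failed \n", nEle);
                exit(1);
        }
        initalizeData(data, noOfScan, noOfPix);
        int nt = mkl_get_max_threads();
        printf("No Of threads are = %d \n", nt);
        mkl_set_num_threads_local(nt);
        //mkl_set_num_threads(nt);
        double time1 = cpuSecond();
        /* Row-major in-place transpose: lda = columns before, ldb = columns after. */
        mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1.0f, data, noOfPix, noOfScan);
        printf("Time elapsed is %lf \n", cpuSecond() - time1);
        memset(data, 0, nEle * sizeof(float));
        free(data);
        return 0;
}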
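
As a baseline for comparison, a plain cache-blocked transpose parallelized with OpenMP could look like the sketch below. This is not MKL code and it is not an in-place algorithm: it assumes a second buffer of the same size is affordable. The function name transpose_blocked and the tile size of 32 are illustrative assumptions, and gcc needs -fopenmp for the loop to actually run on multiple threads (without it the pragma is ignored and the code runs serially).

```c
#include <assert.h>
#include <stddef.h>

#define TILE 32  /* assumed tile edge; tune for the target cache */

/* Out-of-place transpose of the row-major rows x cols matrix `src`
   into the row-major cols x rows matrix `dst`. */
static void transpose_blocked(const float *src, float *dst,
                              size_t rows, size_t cols)
{
    assert(src != dst);  /* the buffers must not be the same */

    /* Each (ib, jb) tile writes a disjoint part of dst,
       so the tiles can be processed in parallel. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ib = 0; ib < rows; ib += TILE) {
        for (size_t jb = 0; jb < cols; jb += TILE) {
            size_t iend = ib + TILE < rows ? ib + TILE : rows;
            size_t jend = jb + TILE < cols ? jb + TILE : cols;
            for (size_t i = ib; i < iend; ++i)
                for (size_t j = jb; j < jend; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```

A call such as transpose_blocked(data, out, noOfScan, noOfPix), with out being a hypothetical second buffer of nEle floats, would give a reference time to compare against mkl_simatcopy on the same machine.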

