MKL Rectangular matrix In-place transpose performance issue

I want an in-place transpose of a very large matrix, and I am using mkl_simatcopy for it. However, I am seeing a performance issue when transposing in place. The machine is an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz system with 72 physical cores, running Red Hat.

What I observe is that during the transpose only a single core is used, not all of them. I have tried environment variables such as MKL_NUM_THREADS, MKL_DYNAMIC="FALSE", etc. My compilation script is as follows:

gcc -std=c99 -m64 -I $MKLROOT/include transpose.c ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_cdft_core.a ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out

Timings obtained are as follows:

No.   Rows     Cols     Time (s)
1     16384     8192     16
2     16384    32768     68
3     32768    65536    233
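To put the timings in perspective, a quick calculation converts each run into matrix size per second. This is a minimal sketch that assumes 4-byte float elements (the data type stated in the post); the function name mib_per_second is hypothetical, and the figure counts each element once even though a transpose both reads and writes it.

```c
/* Hypothetical helper: matrix size in MiB divided by elapsed seconds,
 * assuming 4-byte float elements. Each element is counted once, so the
 * real memory traffic of a transpose (read + write) is at least 2x this. */
double mib_per_second(unsigned long rows, unsigned long cols, double seconds)
{
    double bytes = (double)rows * (double)cols * (double)sizeof(float);
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

For the three runs this gives 32 MiB/s (512 MiB in 16 s), about 30 MiB/s (2048 MiB in 68 s) and about 35 MiB/s (8192 MiB in 233 s). The time grows roughly linearly with the element count, and a few tens of MiB/s is far below what main memory on a server of this class can deliver, which suggests the cost is dominated by the access pattern or by running on one thread rather than by raw bandwidth.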

The data type is float. Please let me know if there is an efficient way to transpose in place, how I can make this use multiple cores, or how else I can reduce the execution time.
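If twice the memory is acceptable (for the 32768 x 65536 case that means a second 8 GiB buffer), an out-of-place transpose is much easier to spread across cores, because every output tile can be written independently. MKL's out-of-place counterpart is mkl_somatcopy; whether it runs multithreaded with a given MKL version and threading layer is not something this sketch assumes. Below is a minimal, hypothetical cache-blocked version using OpenMP (transpose_blocked and TILE are illustrative names; TILE = 64 is a starting guess to tune, not a measured optimum).

```c
#include <stddef.h>

/* Hypothetical tile edge in elements; tune for the target machine. */
#define TILE 64

/* Hypothetical out-of-place transpose.
 * src: rows x cols, row-major.  dst: cols x rows, row-major.
 * With -fopenmp the tiles are shared among threads; without it the
 * pragma is ignored and the same loops run serially. */
void transpose_blocked(const float *restrict src, float *restrict dst,
                       size_t rows, size_t cols)
{
#pragma omp parallel for collapse(2) schedule(static)
    for (size_t rb = 0; rb < rows; rb += TILE) {
        for (size_t cb = 0; cb < cols; cb += TILE) {
            /* Clamp the tile at the matrix edges. */
            size_t rend = (rb + TILE < rows) ? rb + TILE : rows;
            size_t cend = (cb + TILE < cols) ? cb + TILE : cols;
            for (size_t r = rb; r < rend; r++)
                for (size_t c = cb; c < cend; c++)
                    dst[c * rows + r] = src[r * cols + c];
        }
    }
}
```

In the program from this post it would be called as transpose_blocked(data, out, noOfScan, noOfPix), with out a second buffer of nEle floats, and built with -fopenmp so that OMP_NUM_THREADS controls the thread count. Working tile by tile keeps both the reads and the writes within a small block that fits in cache, instead of striding across the whole destination for every source row.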

Below is a code snippet from transpose.c:

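#define _POSIX_C_SOURCE 200809L   /* exposes clock_gettime under -std=c99 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <mkl.h>

/* The two helpers were not shown in the original; these bodies are assumptions. */
static double cpuSecond(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);   /* wall-clock seconds */
        return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

static void initalizeData(float *data, size_t rows, size_t cols)
{
        for (size_t i = 0; i < rows * cols; i++)
                data[i] = (float)i;            /* simple fill pattern */
}

int main(int argc, char *argv[])
{
        if (argc != 3) {
                printf("Usage : exe NoofScan NoofPix\n");
                return EXIT_FAILURE;
        }
        size_t noOfScan = strtoul(argv[1], NULL, 10);   /* rows */
        size_t noOfPix  = strtoul(argv[2], NULL, 10);   /* columns */
        printf("----->>>> noOfScan = %zu and noOfPix = %zu\n", noOfScan, noOfPix);
        size_t nEle = noOfScan * noOfPix;

        float *data = calloc(nEle, sizeof(float));
        if (data == NULL) { perror("calloc"); return EXIT_FAILURE; }
        initalizeData(data, noOfScan, noOfPix);

        int nt = mkl_get_max_threads();        /* returns int, not long */
        printf("No of threads = %d\n", nt);
        mkl_set_num_threads_local(nt);
        //mkl_set_num_threads(nt);
        double time1 = cpuSecond();
        /* Row-major in-place transpose: noOfScan x noOfPix -> noOfPix x noOfScan */
        mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1.0f, data, noOfPix, noOfScan);
        printf("Time elapsed is %f s\n", cpuSecond() - time1);
        free(data);
        return 0;
}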
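For reference, a generic way to transpose a rectangular matrix in place is cycle following. It is not known from this post how mkl_simatcopy is implemented internally; the sketch below only illustrates the general technique. For an R x C row-major matrix with n = R*C elements, the element at index i = r*C + c must move to c*R + r, which for 0 < i < n-1 equals (i*R) mod (n-1); indices 0 and n-1 never move. The permutation splits into cycles of uneven length whose elements are scattered through memory, so each move is likely a cache miss and the cycles are awkward to divide among threads, which is one plausible reason an in-place rectangular transpose is so much slower than an out-of-place one. The function name transpose_inplace_cycles is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cycle-following in-place transpose of a row-major
 * rows x cols matrix. Uses one bit of extra memory per element to mark
 * positions already placed. Assumes i * rows fits in size_t (true with a
 * 64-bit size_t for the sizes in this post).
 * Returns 0 on success, -1 if the bitmap cannot be allocated. */
int transpose_inplace_cycles(float *a, size_t rows, size_t cols)
{
    size_t n = rows * cols;
    if (n < 3)
        return 0;                 /* 0, 1 or 2 elements: layout is unchanged */
    size_t m = n - 1;
    uint8_t *visited = calloc(n / 8 + 1, 1);
    if (visited == NULL)
        return -1;

    for (size_t start = 1; start < m; start++) {
        if (visited[start >> 3] & (1u << (start & 7)))
            continue;             /* already moved as part of an earlier cycle */
        float carry = a[start];
        size_t i = start;
        do {
            size_t d = (i * rows) % m;   /* destination of the value from i */
            float tmp = a[d];
            a[d] = carry;                /* drop the carried value in place */
            carry = tmp;                 /* pick up the displaced value */
            visited[d >> 3] |= (uint8_t)(1u << (d & 7));
            i = d;
        } while (i != start);
    }
    free(visited);
    return 0;
}
```

Calling transpose_inplace_cycles(data, noOfScan, noOfPix) should leave data in the same layout that mkl_simatcopy('R', 'T', ...) with alpha = 1 produces. It is a correctness reference, not a faster replacement: its memory access is deliberately unblocked and single-threaded.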
