Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Optimizing matrix multiplication algorithm on Intel Xeon Gold (DevCloud)


Hi,

 

I am working on Case #03357624 - Benchmarking algorithms on Intel Xeon Gold (DevCloud):

https://communities.intel.com/thread/124090

 

Summary:

The concern is the time overhead observed while running the compiled mmatest1.c attached to the linked article: Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System | Intel® Software

 

Observation:

The first iteration of the loop takes a huge amount of time, and the second iteration also takes comparatively more time; the times for the remaining iterations are similar.

I ran the code with a loop count of 16 and a matrix size of 256, and got the following result for each iteration:

        MKL:

        MKL  - Completed 1 in: 0.2302730 seconds

        MKL  - Completed 2 in: 0.0001534 seconds

        MKL  - Completed 3 in: 0.0001267 seconds

        MKL  - Completed 4 in: 0.0001275 seconds

        ..................

        MKL  - Completed 15 in: 0.0001280 seconds

        MKL  - Completed 16 in: 0.0001347 seconds

 

        CMMA:

        CMMA - Completed 1 in: 0.0504993 seconds

        CMMA - Completed 2 in: 0.0003169 seconds

        CMMA - Completed 3 in: 0.0001666 seconds

        CMMA - Completed 4 in: 0.0001687 seconds

        ................

        CMMA - Completed 15 in: 0.0001638 seconds

        CMMA - Completed 16 in: 0.0001636 seconds

 

The time taken by the first iteration should be due to warm-up (the initial loading of data into the cache, Translation Look-aside Buffer (TLB) population, etc.).

 

=> I need advice on, and confirmation of, the following questions and the answers I have arrived at as per my understanding:

1) Should the first result (the time taken by the first iteration) be included in the time estimate while benchmarking?

Ans I have) No, it should be excluded. 

Further Q) Why does the second iteration take more time than the following ones? Should it also be excluded from benchmarking? How many initial iterations should we exclude from the time estimate?

 

2) Is the overhead primarily due to cache misses or to warm-up time?

Ans I have) It’s due to warm-up time. If we use large matrices, cache misses will also come into effect.

Further Q) According to the user it’s due to cache misses. But how can cache misses have an effect initially, when the cache holds no data yet? Isn’t warm-up the more accurate term?

 

3) If it is indeed cache misses, how can he work on that? He thought the matrix is always accessed in row-major format, and thus cache misses would be avoided if he accessed it in that same format.

Ans I have) It’s correct; the data layout in memory and the data access pattern should be kept the same. Possible solutions (if it’s a big matrix) are listed below, and a sketch combining both follows:

a) Transpose matrix B so that it, too, is accessed row-major.

b) Use the loop blocking optimization technique (LBOT) with a block size equal to the virtual page size.
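
For illustration, a minimal sketch combining a) and b) (not taken from mmatest1.c; the float type, the names, and the block size 64 are my own assumptions, and LBOT proper would tie the block size to the page size instead):

#define BS 64  /* illustrative block size */

/* C = A * B for n x n row-major matrices, with B supplied pre-transposed
   as BT (BT[j*n + k] == B[k*n + j]) so both inner streams are unit-stride */
void cmma_blocked(int n, const float *A, const float *BT, float *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS && i < n; i++)
                for (int j = jj; j < jj + BS && j < n; j++) {
                    float s = 0.0f;
                    for (int k = 0; k < n; k++)
                        s += A[i*n + k] * BT[j*n + k];
                    C[i*n + j] = s;
                }
}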

 

4) How can one debug cblas_sgemm(), or where can its source code be found for debugging with gdb?

 

Please advise.

Thanks and regards,

Rishabh Kumar Jain


LINPACK with multiple MPI ranks


Hello,

To benchmark our new Skylake cluster, consisting of two- and four-socket machines together with a Broadwell system, I want to be able to run LINPACK with a different number of MPI ranks per node. My problem is that too many processes are spawned on the four-socket nodes where I launch two MPI ranks.
I tried to limit the number of threads via OMP_NUM_THREADS and MKL_NUM_THREADS, but without effect. TBB seems to be the cause here, because some MKL functions (which are probably used in LINPACK) use it:
https://software.intel.com/en-us/mkl-macos-developer-guide-functions-thr...

As far as I know, there is no way to influence the number of threads created by TBB through environment variables.

So my question is: how do I run LINPACK with two MPI ranks on one node (and get the full performance)? A sketch of the kind of launch I mean follows.
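
For concreteness, a minimal sketch of such a launch, assuming Intel MPI (the counts and the pinning are placeholders, not a verified recipe):

# two ranks on one node, one per socket, with a capped thread count per rank
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mpirun -np 2 -ppn 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpl_intel64_dynamic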

Best regards,
Holger

Stability Functions and Pardiso


Dear Guru:

I have been using PARDISO for a while with a structural analysis program from Harrison's 1973 Fortran, updated of course.

I have no problems with it in the standard solve form. 

I am now looking at high-compression members, which require a reformulation of the stiffness matrix at each stage using STABILITY FUNCTIONS. These now work, and the problems solve in reasonable time.

I have now started a problem, dnl.inp, which is an analysis of a 1 km arch, although that is not relevant. If I just use the self-weight of the beams it solves nicely, but I started to add the mass of the trucks on the bridge, in this case using a simple method just to develop the model; the 1.1 in the equation increases the dead load. The actual code is in the ELEMENTS.F90 file in the attached solution folder.

do i = 1,n
    write(*,104) i, nodeloads(i,2), -(nodemass(i)*gr), (-(nodemass(i)*gr) + nodeloads(i,2))
104 format(' Node : ',I4,' Applied Load :: ',F15.3,' Self Weight :: ',F15.3,' Total Load :: ',F15.3)
    ! note: this assignment keeps only the factored self weight; the applied
    ! load printed above is not included in it (left as in the original)
    nodeloads(i,2) = ZERO - ((nodemass(i)*gr)*1.1)
end do

If I set the 1.1 to 1.0 it runs nicely; at about 1.01 it spits out a PARDISO -2 error code at about the 5th iteration. I cannot see in the help files how to solve the memory size problem.
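
In case it is genuinely memory (error -2 is "not enough memory" in the PARDISO documentation), a hedged thing to try, assuming the usual iparm array is already passed to pardiso elsewhere in the program, is out-of-core mode, which trades memory for disk:

! sketch: set before the analysis/factorization calls;
! iparm(60) = 0 is in-core (default), 1 lets PARDISO decide, 2 forces
! out-of-core; MKL_PARDISO_OOC_PATH selects the scratch-file location
iparm(60) = 2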

Help please - to run, just type dnl at the input.

John

Attachment: BalorN.zip (2.13 MB)

issue of 2d fft with openmp


Hi, I'm having trouble with a 2D FFT using the Intel 2018 compiler. I tracked it down and reproduced the behavior in the simple code below. The sample code applies a forward and then a backward 2D FFT, which should bring back the original values. But if I do

setenv MKL_DYNAMIC false 

setenv OMP_NUM_THREADS 10

and run the code, the result is wrong.

Here is my sample code. Any insight is appreciated. Thank you.

ifort version: 18.0.2 20180210

MKL version: ics-2018.update.2/compilers_and_libraries_2018.2.199

Xin

 

 

 


Program fft2dcc

! to compile the code, do the following:
! ifort -c -DMKLI8 -qopenmp -I${MKLROOT}/include/intel64/ilp64 -O3 -o fft2dcc.o fft2dcc.F90
! ifort -DMKLI8 -static-intel -qopenmp -O3 -o fft2dcc fft2dcc.o -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group

use MKL_DFTI
use omp_lib

implicit none

integer :: nx, ny, nw, nfreq, ier, iw, iy, ix
complex, allocatable, dimension(:,:,:) :: wave

integer :: nthreads
type(DFTI_DESCRIPTOR), pointer :: plan
real :: fscale, bscale

nx = 640
ny = 640
nw = 224

nthreads = omp_get_num_threads()
fscale = 1.0
bscale = 1.0/float(nx*ny)
ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)

allocate(wave(nx,ny,nw))

!$omp parallel
!$omp master
print *, omp_get_num_threads()
!$omp end master
!$omp end parallel

! first touch memory
!$omp parallel do default(shared), private(ix,iy,iw)
do iw = 1, nw
   do iy = 1, ny
   do ix = 1, nx
      wave(ix,iy,iw) = cmplx(0.0,0.0)
   enddo
   enddo
   ! inject source
   wave(nx/2,ny/2,iw) = cmplx(1.0,0.0)
enddo

print *
print *, "Before", wave(318:322,320,1)
print *

!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)
do iw=1,nw
   call ccfft2d_exe(1,wave(1,1,iw),plan)
   call ccfft2d_exe(-1,wave(1,1,iw),plan)
enddo

print *, "After", wave(318:322,320,1)

deallocate(wave)

contains

integer function ccfft2d_plan &
   ( n1, n2, ldn1, fscale, bscale, plan, nthreads )

use MKL_DFTI
implicit none

integer, intent(in), optional :: nthreads
integer, intent(in)  :: n1, n2, ldn1
real,    intent(in)  :: fscale, bscale
type(DFTI_DESCRIPTOR), pointer :: plan

integer :: ier
#ifdef MKLI8
integer(kind=8) :: dim, nsize(2), strides(3), nthreads_in, status
#else
integer(kind=4) :: dim, nsize(2), strides(3), nthreads_in, status
#endif

ier = 0
!--- add some error checking for sizes

dim = 2

nsize(1) = n1
nsize(2) = n2

strides(1) = 0
strides(2) = 1
strides(3) = ldn1

status = DftiCreateDescriptor(plan,DFTI_SINGLE,DFTI_COMPLEX,dim,nsize(1:dim))
if (.NOT. DftiErrorClass(Status, DFTI_NO_ERROR)) then
   print *, 'error 1'
   call DftiStatusPrint(Status)
   ier = -1
   return
endif

nthreads_in = 1
if (present(nthreads)) nthreads_in = nthreads
status = DftiSetValue(plan,DFTI_NUMBER_OF_USER_THREADS,nthreads_in)

status = DftiSetValue(plan,DFTI_INPUT_STRIDES,strides)

!--- this corresponds to isign < 0 usage at BP
! bscale= 1.0/float(n1*n2)
status = DftiSetValue(plan,DFTI_FORWARD_SCALE,bscale)

!--- this corresponds to isign > 0 usage at BP
! fscale= 1.0
status = DftiSetValue(plan,DFTI_BACKWARD_SCALE,fscale)

status = DftiCommitDescriptor(plan)
if (status /= 0) print *, 'DftiCommitDescriptor returned ', status

ccfft2d_plan = ier

end function ccfft2d_plan

subroutine ccfft2d_exe( isign, cdata, plan )

use MKL_DFTI
implicit none

integer, intent(in)  :: isign
complex, intent(inout) :: cdata(*)
type(DFTI_DESCRIPTOR), pointer :: plan

#ifdef MKLI8
integer(kind=8) :: status
#else
integer(kind=4) :: status
#endif

if ( isign < 0 ) then
   status = DftiComputeForward( plan, cdata )
else
   status = DftiComputeBackward( plan, cdata )
endif

end subroutine ccfft2d_exe

subroutine DftiStatusPrint(status)

use MKL_DFTI
implicit none

#ifdef MKLI8
integer(kind=8) :: status
#else
integer(kind=4) :: status
#endif
character(DFTI_MAX_MESSAGE_LENGTH) :: error_message

error_message = DftiErrorMessage(status)
print *, 'Error message: ', error_message
print *, 'Error status = ', status

end subroutine DftiStatusPrint

end Program
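
One detail worth double-checking (my hedged observation, not a confirmed diagnosis): omp_get_num_threads() returns 1 outside a parallel region, so the plan above is committed with DFTI_NUMBER_OF_USER_THREADS = 1, yet up to OMP_NUM_THREADS threads later call ccfft2d_exe on the shared descriptor; the MKL documentation requires that setting to cover every thread using the descriptor concurrently. A sketch of the change:

! use the maximum team size instead of the current (serial) team size
nthreads = omp_get_max_threads()
ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)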

 

 

 

The support period for my license has expired.


Hi, folks.

I tried to download the latest Parallel Studio using my 'old' profile, which I created several years ago, and I got:

"The support period for your license has expired. To download this product update, you will need to purchase renewal licenses to extend your support from your expiration date (07 Aug 2015) to the build date of this product update (15 Mar 2018). Note that support for new renewal licenses will begin on 07 Aug 2015."

My serial number is not commercial and cannot be renewed.

So to download the latest 'free' MKL I have to create a new profile.

To be honest, the situation looks a bit strange to me. Is it possible to keep my profile and still get the latest MKL?

 

Tensorflow performance w/ MKL


Hi,

I am trying to use tensorflow-1.8.0 compiled with MKL-2018.2.199 enabled. I use it to run MobileNet image classification and object detection models. I compared the performance with and without MKL; in general, with MKL it is much slower in most cases. I am posting here to see whether I did something wrong or whether this is what I should expect.

All the following comparison numbers were collected from running the corresponding inference models on an i7-5557U CPU. I also ran the tests on other CPUs and got similar results. NOTE: the times in 1-4 are per 16 frames; 5-6 are per 320x180 frame.

1. Mobilenet_v2_1_4_224          w/ MKL 1463 ms    w/o MKL 2486 ms   (this is good)
2. Mobilenet_v2_1_0_96           w/ MKL  481 ms    w/o MKL  276 ms   (~1 time slower!)
3. Mobilenet_v1_1_0_224_quant    w/ MKL  903 ms    w/o MKL  664 ms   (~50% slower)
4. Mobilenet_v1_1_0_128_quant    w/ MKL  469 ms    w/o MKL  233 ms   (~1 time slower)
5. ssd_mobilenet_v1_coco         w/ MKL  142 ms    w/o MKL  116 ms
6. ssd_mobilenet_v2_coco         w/ MKL  212 ms    w/o MKL  130 ms

I used "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CALL -march=native -mtune=native" to compile tensorflow .

You can find the code here. The benchmark data is here and here.
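
For what it's worth, the environment settings usually suggested for MKL builds of TensorFlow (generic guidance, not verified on these models) are:

# common starting point for MKL-enabled TensorFlow
export OMP_NUM_THREADS=2    # number of physical cores (2 on an i7-5557U)
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,compact,1,0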

Storing Tri-diagonal matrix in coo or csr format.


Hi,

Is there any function/method to store a tridiagonal matrix in CSR or COO format?
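
I am not aware of a dedicated MKL routine for this; the three CSR arrays can be built directly. A minimal sketch (0-based indexing; all names are mine):

#include "mkl_types.h"

/* Fill CSR arrays for an n x n tridiagonal matrix with sub-diagonal l,
   main diagonal d and super-diagonal u; a and ja need 3n - 2 slots, ia
   needs n + 1. */
void tridiag_to_csr(MKL_INT n, const double *l, const double *d, const double *u,
                    double *a, MKL_INT *ja, MKL_INT *ia)
{
    MKL_INT k = 0;
    for (MKL_INT i = 0; i < n; i++) {
        ia[i] = k;
        if (i > 0)     { a[k] = l[i - 1]; ja[k] = i - 1; k++; }  /* sub   */
        a[k] = d[i]; ja[k] = i; k++;                             /* main  */
        if (i < n - 1) { a[k] = u[i];     ja[k] = i + 1; k++; }  /* super */
    }
    ia[n] = k;  /* k == 3n - 2 */
}

For COO, the same loop can instead append (row, column, value) triplets.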

 

FFT performance issue


I am evaluating the MKL library for a new project. While computing FFTs, I can see that the performance of the FFT (in terms of GFLOP/s) drops when the threads are placed on different sockets (using thread affinity). The test was carried out on an Intel(R) Xeon(R) CPU E5-2650 processor and the compiler is gcc. Please let me know the reason.
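
A hedged guess at the reason: E5-2650 systems are typically two-socket, so once the team spans sockets every FFT pass pays remote-memory (NUMA) latency and cross-socket bandwidth. One way to confirm is to pin both execution and allocation to a single socket and compare, e.g.:

# pin execution and memory to socket 0 (the binary name is a placeholder)
numactl --cpunodebind=0 --membind=0 ./fft_bench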


Numpy + MKL only using 1 core of cpu on Ubuntu 16.04


Hi,

I am using Ubuntu 16.04 on my pc with "Intel® Core™ i7-7700K CPU @ 4.20GHz × 8".

My numpy and scipy use only one CPU core when I do element-wise calculations on a numpy ndarray.

Something like:

numpy.power(matrix, 1.5)

I compiled numpy and scipy following https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl?page=1

The numpy configuration is as follows.

blas_opt_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

lapack_opt_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

blas_mkl_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

lapack_mkl_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

 

I tried modifying environment variables like MKL_NUM_THREADS, OMP_NUM_THREADS, MKL_DOMAIN_NUM_THREADS, and MKL_DYNAMIC, but they do nothing for my situation.
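
A hedged note: as far as I know, element-wise ufuncs such as numpy.power do not go through MKL's BLAS/LAPACK at all, so the MKL threading variables would not be expected to affect them; they mainly govern linear-algebra calls. A quick check that MKL threading itself works:

import numpy as np

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = a @ b  # dispatches to MKL dgemm and should load all cores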

Thanks for your help. 

New Floating Point Math Error in Subtraction for Intel Core Processors?


To duplicate, type the following into an Excel spreadsheet:

=(6.377-6.376)*1000

I get 0.999999999999446 if I allow the decimal places to be shown. That is three significant digits in error, way too many for rounding error.

 

I get a true rounding error using Google (0.99999999999).

 

I understand rounding error, but look at the significant digits, and this occurs in subtraction. Second, there are only four significant digits to begin with, so there should not be any rounding error at all.

 

The error exists before multiplying by 1000; that is done just for convenience.

 

This error occurs on the latest generation of Intel Processors (i7-8550U as well as my older i7-4770).

 

I had to track down this error from a 1 > 1 problem (when the number was put into a logical statement it failed).
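
For comparison, the same arithmetic in plain C doubles (a sketch; Excel also uses IEEE-754 binary doubles, and neither 6.377 nor 6.376 is exactly representable in binary, so the subtraction exposes the tiny representation errors and the multiply by 1000 merely magnifies them):

#include <stdio.h>

int main(void)
{
    double a = 6.377, b = 6.376;          /* neither value is exact in binary */
    printf("%.15f\n", (a - b) * 1000.0);  /* prints ~0.999999999999446 */
    return 0;
}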

getting MKL thread IDs


Hi, 

We have a problem regarding MKL threads, and we would really appreciate your help. We are using MKL function calls in the nested parallel region below:

omp_set_num_threads( NUM_OF_THREADS );
omp_set_nested(1);
omp_set_max_active_levels(2);

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, 1, pA, p, pB, n, 0, pC1, n);
    } else {
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, 1, pD, p, pE, n, 0, pC2, n);
    }
}

Using VTune Amplifier, we can verify that the expected total of 32 threads is created. However, the output of the print statements is as follows:

My ID is 0
My ID is 1

It seems we cannot observe the MKL threads using omp_get_thread_num(). Is there a similar function for accessing the thread IDs of MKL threads, or another way to do that? (We need this information for affinity and thread placement decisions.)
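
For reference, thread placement for nested OpenMP/MKL teams is usually steered with environment variables rather than per-thread IDs (a hedged starting point; the exact values depend on the machine):

# two outer threads spread across the machine, 16 close-packed threads each
export OMP_NESTED=true
export KMP_HOT_TEAMS_MODE=1
export KMP_HOT_TEAMS_MAX_LEVEL=2
export OMP_PLACES=cores
export OMP_PROC_BIND=spread,close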

Thank you very much, 

Sanaz 

Support for legacy software migration project - Intel MKL


Hi,

We have a legacy software utility which we are planning to migrate from Visual Studio 2003 to Visual Studio 2010. The migration will be from Windows Server 2003 (x86) to the target platform Windows Server 2012 (x64).

On Windows Server 2003 (x86), the Intel Math Kernel Library components used in our utility are:

• mkl_c
• mkl_lapack
• mkl_ia32
• libguide40
• libguide

We have downloaded the latest version of the Intel Math Kernel Library, v2018.0.2.1, from the https://software.intel.com/en-us/mkl website. We are not able to find equivalents of the components listed above.

Could you please clarify whether these components were removed or replaced by other components in the latest version?
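
For reference, a plausible mapping (hedged; please confirm with the MKL Link Line Advisor): the monolithic libraries were replaced by the layered model in MKL 10.0, roughly as follows:

mkl_c.lib + mkl_lapack.lib + mkl_ia32.lib  ->  mkl_intel_c.lib + mkl_intel_thread.lib + mkl_core.lib   (IA-32)
                                               mkl_intel_lp64.lib + mkl_intel_thread.lib + mkl_core.lib (x64)
libguide40.lib / libguide.lib              ->  libiomp5md.lib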

Thanks & Regards,
Vijayakumar R

COO (with duplicates) to CSR

Hi, I wrote the following routine to convert the e-vectors (in COO format, with many duplicates) to the CSR format. The steps are: (1) sort ia; (2) sort ja; (3) condense, i.e. sum up the duplicate values (those sharing the same ia & ja indices); (4) convert using the mkl_dcsrcoo function. This works fine for a small test example; however, it takes too much time for a big example.

Can anyone help? (An O(nnz log nnz) alternative is sketched after the code.)

***HERE IS THE CODE*** (the list numbering from the forum paste has been stripped; a missing ';' and several out-of-bounds loop bounds are fixed and commented):

void Sort_COO(MKL_INT nnz, std::vector<MKL_INT> &ia, std::vector<MKL_INT> &ja, std::vector<double> &val) {
    // sort by row, then by column within each row, reordering all three
    // arrays together (bubble sort, O(nnz^2): this is the main bottleneck)
    MKL_INT holdia, holdja;
    double holdVal;
    for (int i = 0; i < nnz; i++) {
        for (int j = 0; j < nnz - 1; j++) {
            if (ia[j] > ia[j + 1]) {
                // swap elements w.r.t. row rearrangement
                holdia = ia[j];
                holdja = ja[j];
                holdVal = val[j];   // a ';' was missing here in the original
                ia[j] = ia[j + 1];
                ja[j] = ja[j + 1];
                val[j] = val[j + 1];
                ia[j + 1] = holdia;
                ja[j + 1] = holdja;
                val[j + 1] = holdVal;
            }
        }
    }
    for (int i = 0; i < nnz; i++) {
        for (int j = 0; j < nnz - 1; j++) {
            if ((ja[j] > ja[j + 1]) && (ia[j] == ia[j + 1])) {
                // swap elements w.r.t. column rearrangement (in ascending order)
                holdia = ia[j];
                holdja = ja[j];
                holdVal = val[j];
                ia[j] = ia[j + 1];
                ja[j] = ja[j + 1];
                val[j] = val[j + 1];
                ia[j + 1] = holdia;
                ja[j + 1] = holdja;
                val[j + 1] = holdVal;
            }
        }
    }
}

void Condense_COO(MKL_INT &nnz, std::vector<MKL_INT> &ia, std::vector<MKL_INT> &ja, std::vector<double> &val) {
    // val, ia and ja are in COO with duplicate entries; condense to COO
    // without duplicates by summing values that share the same (ia, ja)
    MKL_INT nduplic = 0;
    for (int i = 1; i < nnz; i++) {   // starts at 1: the original started at 0 and read ia[i - 1] out of bounds
        if ((ia[i] == ia[i - 1]) && (ja[i] == ja[i - 1])) {
            nduplic++;
            val[i - 1] = val[i] + val[i - 1];
            for (int k = i; k < nnz - 1; k++) {   // was k < nnz: the k + 1 accesses then ran past the end
                if ((ia[k + 1] == ia[i - 1]) && (ja[k + 1] == ja[i - 1])) {
                    /*nduplic++;*/
                    val[i - 1] = val[k + 1] + val[i - 1];
                    std::cout << "nduplic=" << nduplic << std::endl;
                    for (int j = k + 1; j < nnz - 1; ++j) {
                        val[j] = val[j + 1];
                        ia[j] = ia[j + 1];
                        ja[j] = ja[j + 1];
                    }
                }
            }
            for (int l = i; l < nnz - 1; ++l) {
                val[l] = val[l + 1];
                ia[l] = ia[l + 1];
                ja[l] = ja[l + 1];
            }
        }
    }
    nnz = nnz - nduplic;
    std::cout << "nduplic=" << nduplic << std::endl;
    // erase the surplus elements at the end of each array
    for (int i = 0; i < nduplic; ++i) {
        val.pop_back();
        ia.pop_back();
        ja.pop_back();
    }
}

void Convert_CSR(MKL_INT &job, int &N, double &ASCR, MKL_INT &AII, MKL_INT &AJJ, MKL_INT &NNZ, double &VAL, MKL_INT &IA, MKL_INT &JA, MKL_INT &info) {
    mkl_dcsrcoo(&job, &N, &ASCR, &AJJ, &AII, &NNZ, &VAL, &IA, &JA, &info);
}
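
The sketched alternative (my own code, not a drop-in-tested routine): sort a permutation of the triplets once with std::sort, then merge duplicates in a single linear pass, replacing both Sort_COO and Condense_COO:

#include <algorithm>
#include <vector>
#include "mkl_types.h"

// O(nnz log nnz) sort + condense of COO triplets with duplicate entries
void SortAndCondenseCOO(MKL_INT &nnz, std::vector<MKL_INT> &ia,
                        std::vector<MKL_INT> &ja, std::vector<double> &val)
{
    std::vector<std::size_t> p(static_cast<std::size_t>(nnz));
    for (std::size_t k = 0; k < p.size(); ++k) p[k] = k;
    std::sort(p.begin(), p.end(), [&](std::size_t a, std::size_t b) {
        return (ia[a] != ia[b]) ? (ia[a] < ia[b]) : (ja[a] < ja[b]);
    });

    std::vector<MKL_INT> ia2, ja2;
    std::vector<double>  val2;
    for (std::size_t k = 0; k < p.size(); ++k) {
        const std::size_t s = p[k];
        if (!ia2.empty() && ia2.back() == ia[s] && ja2.back() == ja[s]) {
            val2.back() += val[s];  // duplicate (same row, col): accumulate
        } else {
            ia2.push_back(ia[s]);
            ja2.push_back(ja[s]);
            val2.push_back(val[s]);
        }
    }
    ia.swap(ia2); ja.swap(ja2); val.swap(val2);
    nnz = static_cast<MKL_INT>(ia.size());
}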

Skylake HPL with Intel MKL


Hi Team,

 

We are unable to run HPL on Skylake.

Environment

CentOS 7.4

A single server with two Intel Xeon Gold 6148 CPUs (20 cores each, 2.4 GHz, 32 FLOPs per cycle per core), for an Rpeak of 3072 GFLOP/s.

Running the "xhpl_intel64_dynamic" precompiled binary from the Linux* package (l_mklb_p_2018.2.010).

 

References

https://software.intel.com/en-us/mkl-linux-developer-guide-configuring-parameters

https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite

 

We are unable to determine the P & Q values for the HPL.dat file, and how to launch the run.

We submitted a run with the input values 5 & 8 for P & Q respectively, along with mpirun -np 40. But the system hung.

We need your input for running HPL on the Skylake architecture; a sketch of a launch is given below.
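
For what it's worth, a sketch of the hybrid launch the prebuilt threaded binary is usually run with (assumptions: Intel MPI, one rank per socket, and P x Q in HPL.dat equal to the number of ranks, so P = 1, Q = 2 here; with mpirun -np 40 each of the 40 ranks can itself spawn threads, and that oversubscription is one hedged explanation for the hang):

# one MPI rank per socket on a 2-socket, 20-cores-per-socket node
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mpirun -np 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpl_intel64_dynamic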

 

Thank You

Atul Yadav

 

 

Memory leak in FEAST routines


Hi,

I was experimenting with the FEAST routines and found a problem with memory leaks.

In particular, the zfeast_heev routine works perfectly fine when the parameter "x" is allocated on the stack, e.g., when one defines:

 MKL_Complex16 x[DIM*m0];

but when it is dynamically allocated and stored on the heap, I get a segmentation fault upon calling free (or delete). This implies that the routine does something with this pointer that it is not supposed to do.

I append a minimal example. If you comment out "#define X_ON_HEAP_CPP", the memory is allocated on the stack and everything works fine. With the definition, or that of "#define X_ON_HEAP", the memory is allocated on the heap and the program segfaults when free (or delete) is called.

I am using mkl version 2018.2.199 and compile using g++ version 5.4.0, which I call with the compile flags "-I${MKLROOT}/include  -L${MKLROOT}/lib/intel64  -lmkl_intel_thread -lmkl_rt -lmkl_core -lmkl_intel_lp64 -lm"

 

Can somebody reproduce this issue, maybe also with icc? How would one correctly allocate dynamic memory for use in the FEAST routines?

Regards, Moritz

 

#include <iostream>
#include <cmath>
#include <string>
#include <complex>
#include <malloc.h>

#include "mkl.h"
#include "mkl_solvers_ee.h"

using namespace std;

int main(int args, char** argv){
  const MKL_INT m0=4;
  double lambdamin=-1,lambdamax=2;
 

  const MKL_INT DIM=4;
  MKL_Complex16 A[4][4]={  \
{{0.,0.},{sqrt(3.)/2.,0.},{0.,0.},{0.,0.}},  \
{{sqrt(3.)/2.,0.},{0.,0.},{1.,0.},{0.,0.}},  \
{{0.,0.},{1.,0.},{0.,0.},{sqrt(3.)/2.,0.}},  \
{{0.,0.},{0.,0.},{sqrt(3.)/2.,0.},{0.,0.}} };
//The exact eigenvalues are: -1.5, -0.5, 0.5 and 1.5

 

  MKL_Complex16 zero = {0.0, 0.0};
  char uplo = 'F';
  MKL_INT fpm[128];
  MKL_INT n=DIM;
  double epsout=0.;
  MKL_INT loop=0;
  MKL_INT m0var=m0;
  MKL_INT m=m0;
  double res=0.;
  int info=0;
  double lambdaptr[DIM];

#define X_ON_HEAP_CPP
#ifdef X_ON_HEAP   /* the original post had X_ON_HEAP_C here, which never
                      matched the free() guard at the bottom; names now agree */
  MKL_Complex16 *x=(MKL_Complex16*)malloc(sizeof(MKL_Complex16)*DIM*m0);
#elif defined(X_ON_HEAP_CPP)
  MKL_Complex16 *x=new MKL_Complex16[DIM*m0];
#else
  MKL_Complex16 x[DIM*m0];
#endif

  feastinit(fpm);
  fpm[0]=1;

  zfeast_heev( &uplo, &n,  \
      (MKL_Complex16 *) &A[0][0], &n,  \
      fpm, &epsout, &loop, &lambdamin, &lambdamax, &m0var, \
      lambdaptr, (MKL_Complex16 *) x, &m, &res, &info);

  cout<<"Nr of eigenvalues found: "<<m<<endl<<flush;
   
  for(int i=0;i<m;i++)cout<<lambdaptr[i]<<endl;

#ifdef X_ON_HEAP
  free(x);
#elif defined(X_ON_HEAP_CPP)
  delete[] x;
#endif
  cout<<"Done"<<endl<<flush;

  return 0;
}
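
If the heap path still crashes, a hedged workaround to try (not a confirmed fix) is MKL's own aligned allocator in place of malloc/new:

/* hedged alternative for x; the 64-byte alignment is a guess at what the
   routine's internals might expect */
MKL_Complex16 *x = (MKL_Complex16*)mkl_malloc(sizeof(MKL_Complex16)*DIM*m0, 64);
/* ... call zfeast_heev exactly as above ... */
mkl_free(x);

It may also be worth dropping -lmkl_rt from the link line: mixing the single dynamic library with the layered libraries (mkl_intel_lp64/mkl_core/mkl_intel_thread) can load two copies of MKL, which is another hedged guess at the crash.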

 

 

 

 

 


MKL Operations in TensorFlow


 

Hi all,

I run a TensorFlow model Inception with Intel MKL-DNN support. 

The execution times of MKL operations (e.g. _MklConv2DBackpropFilter, _MklConv2D, _MklConv2DBackpropInput, etc.) do not change when I change the number of intra-op threads (the number of threads running one operation). The results are below. Does anyone know why the performance of the MKL operations does not change with the number of threads running the operation? Or can anyone explain how the MKL-DNN support in TensorFlow is implemented, with respect to this situation? Thank you!

Name                       Intra-threads   Time
_MklConv2DBackpropFilter    8              470
                           17              466
                           34              468
                           68              467

_MklConv2DBackpropInput     8              354
                           17              344
                           34              347
                           68              347

_MklConv2D                  8              300
                           17              304
                           34              311
                           68              311
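
A hedged guess at an explanation: in MKL builds the threads inside each MKL op come from OpenMP and are governed by OMP_NUM_THREADS (plus the KMP_* affinity settings), while the intra-op setting sizes TensorFlow's own Eigen pool; varying only the intra-threads would then leave the MKL kernels' thread count unchanged. A sketch of varying both (TF 1.x API):

import os
import tensorflow as tf

os.environ["OMP_NUM_THREADS"] = "8"  # threads inside each MKL primitive
config = tf.ConfigProto(intra_op_parallelism_threads=8,
                        inter_op_parallelism_threads=2)
sess = tf.Session(config=config)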

Kevin

Uninstall of MKL 2017 Update 2 FAILING on Windows 10 (version 1803)


Uninstall of MKL 2017 Update 2 from my PC using the usual Control Panel uninstaller fails as follows:

Intel(R) Math Kernel Library 2017 Update 2 for Windows* Intel® Software Setup Assistant ended prematurely because of an error(s).

  Windows Installer failed to configure product:

Error message: The specified account already exists. 

Error code: 1603

Product code: 36D90182-C708-49CC-A6B9-6D9148159A01

Module name: ide_c_common_vs2012_p_17.0.2.187.msi

 

Installer logs location: C:\Users\Andrew\AppData\Local\Temp\pset_tmp_PSXE2017_Andrew\2018.05.12_21.01.08_0000354c\log\

 

It is being uninstalled from the same account that it was installed under, and the account has admin privileges.

 

Log files from the installer logs location are attached.

 

Support for legacy software migration


Hi,

We have a legacy software application currently running on Windows Server 2003 (x86) which we are planning to move to Windows Server 2012.

Our existing application uses the below-mentioned Intel Math Kernel Library components, v9.0.0.1.

• mkl_c
• mkl_lapack
• mkl_ia32
• libguide40
• libguide

Please let us know whether the same version of the libraries (32-bit) will be supported on Windows Server 2012.

Regards,
Tamil Selvan.

Mistake in the mkl_dnn documentation


Hi Everybody,

I have been using the Intel mkl_dnn library for convolutions and batch normalization, and I noticed a simple naming mistake in the documentation that took me around two hours to track down in my code.

dnnResourceScaleShift | Scale and shift data.

dnnResourceScaleShift | Gradient with respect to scale and shift.

The second enum should be dnnResourceDiffScaleShift, if I am not mistaken?

https://software.intel.com/en-us/mkl-developer-reference-c-enumerated-types

That's all.

PSGESVD: Illegal parameter 19


Hello,

I want to perform the SVD of a matrix in parallel (C), but I am facing issues with psgesvd. When I try to perform the SVD of a matrix of size 4750x4750, I get an error saying:

"{    0,    0}:  On entry to
PSGESVD parameter number   19 had an illegal value"

Parameter 19 is 'lwork', for which I first make a workspace query that returns the minimum required size of the 'work' array. The SVD works fine for smaller matrices, but big matrices give this error. Can anyone please help me understand the reason for it? The value of lwork in this case is 45362800.

When I use serial "sgesvd" I don't get any such issue and the results are perfectly correct. But I am not sure what is causing an issue in this case.

Thanks in advance!

-Shailesh Tripathi
