Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Optimizing matrix multiplication algorithm on Intel Xeon Gold (DevCloud)


Hi,

 

I am working on Case #03357624 - Benchmarking algorithms on Intel Xeon Gold (DevCloud):

https://communities.intel.com/thread/124090

 

Summary:

The concern is the time overhead observed while running the compiled mmatest1.c attached to the linked article: Performance of Classic Matrix Multiplication Algorithm on Intel® Xeon Phi™ Processor System | Intel® Software

 

Observation:

The first iteration of the loop takes a huge amount of time, and the second iteration also takes comparatively more time; the times for the remaining iterations are similar.

I ran the code with a loop count of 16 and a matrix size of 256, and got the following result for each iteration:

        MKL:

        MKL  - Completed 1 in: 0.2302730 seconds

        MKL  - Completed 2 in: 0.0001534 seconds

        MKL  - Completed 3 in: 0.0001267 seconds

        MKL  - Completed 4 in: 0.0001275 seconds

        ..................

        MKL  - Completed 15 in: 0.0001280 seconds

        MKL  - Completed 16 in: 0.0001347 seconds

 

        CMMA:

        CMMA - Completed 1 in: 0.0504993 seconds

        CMMA - Completed 2 in: 0.0003169 seconds

        CMMA - Completed 3 in: 0.0001666 seconds

        CMMA - Completed 4 in: 0.0001687 seconds

        ................

        CMMA - Completed 15 in: 0.0001638 seconds

        CMMA - Completed 16 in: 0.0001636 seconds

 

The time taken by the first iteration should be due to warm-up (the initial loading of data into the cache, Translation Look-aside Buffer (TLB) population, etc.).

 

=> I need advice on, and confirmation of, the following questions and the answers I have arrived at as per my understanding:

1) Should the first result (the time taken by the first iteration) be included in the time estimate while benchmarking?

Ans I have) No, it should be excluded. 

Further Q) Why does the second iteration take more time than the following ones? Should it also be excluded from benchmarking? How many initial iterations should we exclude from the time estimate?

 

2) Is the overhead primarily due to cache misses or to warm-up time?

Ans I have) It’s due to warm-up time. If we use large matrices, cache misses will also come into effect.

Further Q) According to the user it’s due to cache misses. But how can cache misses have an effect initially, when the cache holds no data yet? Isn’t warm-up the more accurate term?

 

3) If it is indeed cache misses, how can he work on that? He thought the matrix is always accessed in row-major format, and thus cache misses would be avoided if he accessed it in that same format.

Ans I have) It’s correct; the data layout in memory and the data access pattern should be kept the same. Possible solutions (if it’s a big matrix) are listed below, and a sketch combining both follows:

a) Transpose matrix B so that it, too, is accessed row-major.

b) Use the loop blocking optimization technique (LBOT) with a block size equal to the virtual page size.
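
For illustration, a minimal sketch combining a) and b) (not taken from mmatest1.c; the float type, the names, and the block size 64 are my own assumptions, and LBOT proper would tie the block size to the page size instead):

#define BS 64  /* illustrative block size */

/* C = A * B for n x n row-major matrices, with B supplied pre-transposed
   as BT (BT[j*n + k] == B[k*n + j]) so both inner streams are unit-stride */
void cmma_blocked(int n, const float *A, const float *BT, float *C)
{
    for (int ii = 0; ii < n; ii += BS)
        for (int jj = 0; jj < n; jj += BS)
            for (int i = ii; i < ii + BS && i < n; i++)
                for (int j = jj; j < jj + BS && j < n; j++) {
                    float s = 0.0f;
                    for (int k = 0; k < n; k++)
                        s += A[i*n + k] * BT[j*n + k];
                    C[i*n + j] = s;
                }
}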

 

4) How can one debug cblas_sgemm(), or where can its source code be found for debugging with gdb?

 

Please advise.

Thanks and regards,

Rishabh Kumar Jain


LINPACK with multiple MPI ranks


Hello,

To benchmark our new Skylake cluster, consisting of two- and four-socket machines together with a Broadwell system, I want to be able to run LINPACK with a different number of MPI ranks per node. My problem is that too many processes are spawned on the four-socket nodes where I launch two MPI ranks.
I tried to limit the number of threads via OMP_NUM_THREADS and MKL_NUM_THREADS, but without effect. TBB seems to be the cause here, because some MKL functions (which are probably used in LINPACK) use it:
https://software.intel.com/en-us/mkl-macos-developer-guide-functions-thr...

As far as I know, there is no way to influence the number of threads created by TBB through environment variables.

So my question is: how do I run LINPACK with two MPI ranks on one node (and get the full performance)? A sketch of the kind of launch I mean follows.
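
For concreteness, a minimal sketch of such a launch, assuming Intel MPI (the counts and the pinning are placeholders, not a verified recipe):

# two ranks on one node, one per socket, with a capped thread count per rank
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mpirun -np 2 -ppn 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpl_intel64_dynamic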

Best regards,
Holger

Stability Functions and Pardiso


Dear Guru:

I have been using PARDISO for a while with a structural analysis program from Harrison's 1973 Fortran, updated of course.

I have no problems with it in the standard solve form. 

I am now looking at high-compression members, which require a reformulation of the stiffness matrix at each stage using STABILITY FUNCTIONS. These now work, and the problems solve in reasonable time.

I have now started a problem, dnl.inp, which is an analysis of a 1 km arch, although that is not relevant. If I just use the self-weight of the beams it solves nicely, but I started to add the mass of the trucks on the bridge, in this case using a simple method just to develop the model; the 1.1 in the equation increases the dead load. The actual code is in the ELEMENTS.F90 file in the attached solution folder.

do i = 1,n
    write(*,104) i, nodeloads(i,2), -(nodemass(i)*gr), (-(nodemass(i)*gr) + nodeloads(i,2))
104 format(' Node : ',I4,' Applied Load :: ',F15.3,' Self Weight :: ',F15.3,' Total Load :: ',F15.3)
    ! note: this assignment keeps only the factored self weight; the applied
    ! load printed above is not included in it (left as in the original)
    nodeloads(i,2) = ZERO - ((nodemass(i)*gr)*1.1)
end do

If I set the 1.1 to 1.0 it runs nicely; at about 1.01 it spits out a PARDISO -2 error code at about the 5th iteration. I cannot see in the help files how to solve the memory size problem.
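
In case it is genuinely memory (error -2 is "not enough memory" in the PARDISO documentation), a hedged thing to try, assuming the usual iparm array is already passed to pardiso elsewhere in the program, is out-of-core mode, which trades memory for disk:

! sketch: set before the analysis/factorization calls;
! iparm(60) = 0 is in-core (default), 1 lets PARDISO decide, 2 forces
! out-of-core; MKL_PARDISO_OOC_PATH selects the scratch-file location
iparm(60) = 2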

Help please - to run, just type dnl at the input.

John

Attachment: BalorN.zip (2.13 MB)

issue of 2d fft with openmp


Hi, I'm having trouble with a 2D FFT using the Intel 2018 compiler. I tracked it down and reproduced the behavior in the simple code below. The sample code applies a forward and then a backward 2D FFT, which should bring back the original values. But if I do

setenv MKL_DYNAMIC false 

setenv OMP_NUM_THREADS 10

and run the code, the result is wrong.

Here is my sample code. Any insight is appreciated. Thank you.

ifort version: 18.0.2 20180210

MKL version: ics-2018.update.2/compilers_and_libraries_2018.2.199

Xin

 

 

 


Program fft2dcc

! to compile the code, do the following:
! ifort -c -DMKLI8 -qopenmp -I${MKLROOT}/include/intel64/ilp64 -O3 -o fft2dcc.o fft2dcc.F90
! ifort -DMKLI8 -static-intel -qopenmp -O3 -o fft2dcc fft2dcc.o -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a -Wl,--end-group

use MKL_DFTI
use omp_lib

implicit none

integer :: nx, ny, nw, nfreq, ier, iw, iy, ix
complex, allocatable, dimension(:,:,:) :: wave

integer :: nthreads
type(DFTI_DESCRIPTOR), pointer :: plan
real :: fscale, bscale

nx = 640
ny = 640
nw = 224

nthreads = omp_get_num_threads()
fscale = 1.0
bscale = 1.0/float(nx*ny)
ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)

allocate(wave(nx,ny,nw))

!$omp parallel
!$omp master
print *, omp_get_num_threads()
!$omp end master
!$omp end parallel

! first touch memory
!$omp parallel do default(shared), private(ix,iy,iw)
do iw = 1, nw
   do iy = 1, ny
   do ix = 1, nx
      wave(ix,iy,iw) = cmplx(0.0,0.0)
   enddo
   enddo
   ! inject source
   wave(nx/2,ny/2,iw) = cmplx(1.0,0.0)
enddo

print *
print *, "Before", wave(318:322,320,1)
print *

!$OMP PARALLEL DO DEFAULT(SHARED), PRIVATE(iw)
do iw=1,nw
   call ccfft2d_exe(1,wave(1,1,iw),plan)
   call ccfft2d_exe(-1,wave(1,1,iw),plan)
enddo

print *, "After", wave(318:322,320,1)

deallocate(wave)

contains

integer function ccfft2d_plan &
   ( n1, n2, ldn1, fscale, bscale, plan, nthreads )

use MKL_DFTI
implicit none

integer, intent(in), optional :: nthreads
integer, intent(in)  :: n1, n2, ldn1
real,    intent(in)  :: fscale, bscale
type(DFTI_DESCRIPTOR), pointer :: plan

integer :: ier
#ifdef MKLI8
integer(kind=8) :: dim, nsize(2), strides(3), nthreads_in, status
#else
integer(kind=4) :: dim, nsize(2), strides(3), nthreads_in, status
#endif

ier = 0
!--- add some error checking for sizes

dim = 2

nsize(1) = n1
nsize(2) = n2

strides(1) = 0
strides(2) = 1
strides(3) = ldn1

status = DftiCreateDescriptor(plan,DFTI_SINGLE,DFTI_COMPLEX,dim,nsize(1:dim))
if (.NOT. DftiErrorClass(Status, DFTI_NO_ERROR)) then
   print *, 'error 1'
   call DftiStatusPrint(Status)
   ier = -1
   return
endif

nthreads_in = 1
if (present(nthreads)) nthreads_in = nthreads
status = DftiSetValue(plan,DFTI_NUMBER_OF_USER_THREADS,nthreads_in)

status = DftiSetValue(plan,DFTI_INPUT_STRIDES,strides)

!--- this corresponds to isign < 0 usage at BP
! bscale= 1.0/float(n1*n2)
status = DftiSetValue(plan,DFTI_FORWARD_SCALE,bscale)

!--- this corresponds to isign > 0 usage at BP
! fscale= 1.0
status = DftiSetValue(plan,DFTI_BACKWARD_SCALE,fscale)

status = DftiCommitDescriptor(plan)
if (status /= 0) print *, 'DftiCommitDescriptor returned ', status

ccfft2d_plan = ier

end function ccfft2d_plan

subroutine ccfft2d_exe( isign, cdata, plan )

use MKL_DFTI
implicit none

integer, intent(in)  :: isign
complex, intent(inout) :: cdata(*)
type(DFTI_DESCRIPTOR), pointer :: plan

#ifdef MKLI8
integer(kind=8) :: status
#else
integer(kind=4) :: status
#endif

if ( isign < 0 ) then
   status = DftiComputeForward( plan, cdata )
else
   status = DftiComputeBackward( plan, cdata )
endif

end subroutine ccfft2d_exe

subroutine DftiStatusPrint(status)

use MKL_DFTI
implicit none

#ifdef MKLI8
integer(kind=8) :: status
#else
integer(kind=4) :: status
#endif
character(DFTI_MAX_MESSAGE_LENGTH) :: error_message

error_message = DftiErrorMessage(status)
print *, 'Error message: ', error_message
print *, 'Error status = ', status

end subroutine DftiStatusPrint

end Program
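
One detail worth double-checking (my hedged observation, not a confirmed diagnosis): omp_get_num_threads() returns 1 outside a parallel region, so the plan above is committed with DFTI_NUMBER_OF_USER_THREADS = 1, yet up to OMP_NUM_THREADS threads later call ccfft2d_exe on the shared descriptor; the MKL documentation requires that setting to cover every thread using the descriptor concurrently. A sketch of the change:

! use the maximum team size instead of the current (serial) team size
nthreads = omp_get_max_threads()
ier = ccfft2d_plan(nx,ny,nx,fscale,bscale,plan,nthreads)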

 

 

 

The support period for my license has expired.


Hi, folks.

I tried to download the latest Parallel Studio using my 'old' profile, which I created several years ago, and I got:

"The support period for your license has expired. To download this product update, you will need to purchase renewal licenses to extend your support from your expiration date (07 Aug 2015) to the build date of this product update (15 Mar 2018). Note that support for new renewal licenses will begin on 07 Aug 2015."

My serial number is not commercial and cannot be renewed.

So to download the latest 'free' MKL I have to create a new profile.

To be honest, the situation looks a bit strange to me. Is it possible to keep my profile and still get the latest MKL?

 

Tensorflow performance w/ MKL


Hi,

I am trying to use tensorflow-1.8.0 compiled with MKL-2018.2.199 enabled. I use it to run MobileNet image classification and object detection models. I compared the performance with and without MKL; in general, with MKL it is much slower in most cases. I am posting here to see whether I did something wrong or whether this is what I should expect.

All the following comparison numbers were collected from running the corresponding inference models on an i7-5557U CPU. I also ran the tests on other CPUs and got similar results. NOTE: the times in 1-4 are per 16 frames; 5-6 are per 320x180 frame.

1. Mobilenet_v2_1_4_224          w/ MKL 1463 ms    w/o MKL 2486 ms   (this is good)
2. Mobilenet_v2_1_0_96           w/ MKL  481 ms    w/o MKL  276 ms   (~1 time slower!)
3. Mobilenet_v1_1_0_224_quant    w/ MKL  903 ms    w/o MKL  664 ms   (~50% slower)
4. Mobilenet_v1_1_0_128_quant    w/ MKL  469 ms    w/o MKL  233 ms   (~1 time slower)
5. ssd_mobilenet_v1_coco         w/ MKL  142 ms    w/o MKL  116 ms
6. ssd_mobilenet_v2_coco         w/ MKL  212 ms    w/o MKL  130 ms

I used "-DINTEL_MKL -DINTEL_MKL_ML -DEIGEN_USE_MKL_ALL -DMKL_DIRECT_CALL -march=native -mtune=native" to compile tensorflow .

You can find the code here. The benchmark data is here and here.
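
For what it's worth, the environment settings usually suggested for MKL builds of TensorFlow (generic guidance, not verified on these models) are:

# common starting point for MKL-enabled TensorFlow
export OMP_NUM_THREADS=2    # number of physical cores (2 on an i7-5557U)
export KMP_BLOCKTIME=0
export KMP_AFFINITY=granularity=fine,compact,1,0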

Storing Tri-diagonal matrix in coo or csr format.


Hi,

Is there any function/method to store a tridiagonal matrix in CSR or COO format?
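
I am not aware of a dedicated MKL routine for this; the three CSR arrays can be built directly. A minimal sketch (0-based indexing; all names are mine):

#include "mkl_types.h"

/* Fill CSR arrays for an n x n tridiagonal matrix with sub-diagonal l,
   main diagonal d and super-diagonal u; a and ja need 3n - 2 slots, ia
   needs n + 1. */
void tridiag_to_csr(MKL_INT n, const double *l, const double *d, const double *u,
                    double *a, MKL_INT *ja, MKL_INT *ia)
{
    MKL_INT k = 0;
    for (MKL_INT i = 0; i < n; i++) {
        ia[i] = k;
        if (i > 0)     { a[k] = l[i - 1]; ja[k] = i - 1; k++; }  /* sub   */
        a[k] = d[i]; ja[k] = i; k++;                             /* main  */
        if (i < n - 1) { a[k] = u[i];     ja[k] = i + 1; k++; }  /* super */
    }
    ia[n] = k;  /* k == 3n - 2 */
}

For COO, the same loop can instead append (row, column, value) triplets.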

 

FFT performance issue


I am evaluating the MKL library for a new project. While computing FFTs, I can see that the performance of the FFT (in terms of GFLOP/s) drops when the threads are placed on different sockets (using thread affinity). The test was carried out on an Intel(R) Xeon(R) CPU E5-2650 processor and the compiler is gcc. Please let me know the reason.
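
A hedged guess at the reason: E5-2650 systems are typically two-socket, so once the team spans sockets every FFT pass pays remote-memory (NUMA) latency and cross-socket bandwidth. One way to confirm is to pin both execution and allocation to a single socket and compare, e.g.:

# pin execution and memory to socket 0 (the binary name is a placeholder)
numactl --cpunodebind=0 --membind=0 ./fft_bench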


Numpy + MKL only using 1 core of cpu on Ubuntu 16.04


Hi,

I am using Ubuntu 16.04 on my pc with "Intel® Core™ i7-7700K CPU @ 4.20GHz × 8".

My numpy and scipy use only one CPU core when I do element-wise calculations on a numpy ndarray.

Something like:

numpy.power(matrix, 1.5)

I compiled numpy and scipy following https://software.intel.com/en-us/articles/numpyscipy-with-intel-mkl?page=1

The numpy configuration is as follows.

blas_opt_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

lapack_opt_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

blas_mkl_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

lapack_mkl_info:

    include_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/include']

    library_dirs = ['/opt/intel/compilers_and_libraries_2018/linux/mkl/lib/intel64']

    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]

    libraries = ['mkl_rt', 'pthread']

 

I tried modifying environment variables like MKL_NUM_THREADS, OMP_NUM_THREADS, MKL_DOMAIN_NUM_THREADS, and MKL_DYNAMIC, but they do nothing for my situation.
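
A hedged note: as far as I know, element-wise ufuncs such as numpy.power do not go through MKL's BLAS/LAPACK at all, so the MKL threading variables would not be expected to affect them; they mainly govern linear-algebra calls. A quick check that MKL threading itself works:

import numpy as np

a = np.random.rand(4000, 4000)
b = np.random.rand(4000, 4000)
c = a @ b  # dispatches to MKL dgemm and should load all cores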

Thanks for your help. 

New Floating Point Math Error in Subtraction for Intel Core Processors?


To duplicate, type the following into an Excel spreadsheet:

=(6.377-6.376)*1000

I get 0.999999999999446 if I allow the decimal places to be shown. That is three significant digits in error, way too many for rounding error.

 

I get a true rounding error using Google (0.99999999999).

 

I understand rounding error, but look at the significant digits, and this occurs in subtraction. Second, there are only four significant digits to begin with, so there should not be any rounding error at all.

 

The error exists before multiplying by 1000; that is done just for convenience.

 

This error occurs on the latest generation of Intel Processors (i7-8550U as well as my older i7-4770).

 

I had to track down this error from a 1 > 1 problem (when the number was put into a logical statement it failed).
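
For comparison, the same arithmetic in plain C doubles (a sketch; Excel also uses IEEE-754 binary doubles, and neither 6.377 nor 6.376 is exactly representable in binary, so the subtraction exposes the tiny representation errors and the multiply by 1000 merely magnifies them):

#include <stdio.h>

int main(void)
{
    double a = 6.377, b = 6.376;          /* neither value is exact in binary */
    printf("%.15f\n", (a - b) * 1000.0);  /* prints ~0.999999999999446 */
    return 0;
}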

getting MKL thread IDs


Hi, 

We have a problem regarding MKL threads, and we would really appreciate your help. We are using MKL function calls in the nested parallel region below:

omp_set_num_threads( NUM_OF_THREADS );
omp_set_nested(1);
omp_set_max_active_levels(2);

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, 1, pA, p, pB, n, 0, pC1, n);
    } else {
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, 1, pD, p, pE, n, 0, pC2, n);
    }
}

Using VTune Amplifier, we can verify that the expected total of 32 threads is created. However, the output of the print statements is as follows:

My ID is 0
My ID is 1

It seems we cannot observe the MKL threads using omp_get_thread_num(). Is there a similar function for accessing the thread IDs of MKL threads, or another way to do that? (We need this information for affinity and thread placement decisions.)
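
For reference, thread placement for nested OpenMP/MKL teams is usually steered with environment variables rather than per-thread IDs (a hedged starting point; the exact values depend on the machine):

# two outer threads spread across the machine, 16 close-packed threads each
export OMP_NESTED=true
export KMP_HOT_TEAMS_MODE=1
export KMP_HOT_TEAMS_MAX_LEVEL=2
export OMP_PLACES=cores
export OMP_PROC_BIND=spread,close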

Thank you very much, 

Sanaz 

Support for legacy software migration project - Intel MKL


Hi,

We have a legacy software utility which we are planning to migrate from Visual Studio 2003 to Visual Studio 2010. The migration will be from Windows Server 2003 (x86) to the target platform Windows Server 2012 (x64).

On Windows Server 2003 (x86), the Intel Math Kernel Library components used in our utility are:

• mkl_c
• mkl_lapack
• mkl_ia32
• libguide40
• libguide

We have downloaded the latest version of the Intel Math Kernel Library, v2018.0.2.1, from the https://software.intel.com/en-us/mkl website. We are not able to find equivalents of the components listed above.

Could you please clarify whether these components were removed or replaced by other components in the latest version?
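
For reference, a plausible mapping (hedged; please confirm with the MKL Link Line Advisor): the monolithic libraries were replaced by the layered model in MKL 10.0, roughly as follows:

mkl_c.lib + mkl_lapack.lib + mkl_ia32.lib  ->  mkl_intel_c.lib + mkl_intel_thread.lib + mkl_core.lib   (IA-32)
                                               mkl_intel_lp64.lib + mkl_intel_thread.lib + mkl_core.lib (x64)
libguide40.lib / libguide.lib              ->  libiomp5md.lib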

Thanks & Regards,
Vijayakumar R

COO (with duplicates) to CSR

Hi, I wrote the following routine to convert the e-vectors (in COO format, with many duplicates) to the CSR format. The steps are: (1) sort ia; (2) sort ja; (3) condense, i.e. sum up the duplicate values (those sharing the same ia & ja indices); (4) convert using the mkl_dcsrcoo function. This works fine for a small test example; however, it takes too much time for a big example.

Can anyone help? (An O(nnz log nnz) alternative is sketched after the code.)

***HERE IS THE CODE*** (the list numbering from the forum paste has been stripped; a missing ';' and several out-of-bounds loop bounds are fixed and commented):

void Sort_COO(MKL_INT nnz, std::vector<MKL_INT> &ia, std::vector<MKL_INT> &ja, std::vector<double> &val) {
    // sort by row, then by column within each row, reordering all three
    // arrays together (bubble sort, O(nnz^2): this is the main bottleneck)
    MKL_INT holdia, holdja;
    double holdVal;
    for (int i = 0; i < nnz; i++) {
        for (int j = 0; j < nnz - 1; j++) {
            if (ia[j] > ia[j + 1]) {
                // swap elements w.r.t. row rearrangement
                holdia = ia[j];
                holdja = ja[j];
                holdVal = val[j];   // a ';' was missing here in the original
                ia[j] = ia[j + 1];
                ja[j] = ja[j + 1];
                val[j] = val[j + 1];
                ia[j + 1] = holdia;
                ja[j + 1] = holdja;
                val[j + 1] = holdVal;
            }
        }
    }
    for (int i = 0; i < nnz; i++) {
        for (int j = 0; j < nnz - 1; j++) {
            if ((ja[j] > ja[j + 1]) && (ia[j] == ia[j + 1])) {
                // swap elements w.r.t. column rearrangement (in ascending order)
                holdia = ia[j];
                holdja = ja[j];
                holdVal = val[j];
                ia[j] = ia[j + 1];
                ja[j] = ja[j + 1];
                val[j] = val[j + 1];
                ia[j + 1] = holdia;
                ja[j + 1] = holdja;
                val[j + 1] = holdVal;
            }
        }
    }
}

void Condense_COO(MKL_INT &nnz, std::vector<MKL_INT> &ia, std::vector<MKL_INT> &ja, std::vector<double> &val) {
    // val, ia and ja are in COO with duplicate entries; condense to COO
    // without duplicates by summing values that share the same (ia, ja)
    MKL_INT nduplic = 0;
    for (int i = 1; i < nnz; i++) {   // starts at 1: the original started at 0 and read ia[i - 1] out of bounds
        if ((ia[i] == ia[i - 1]) && (ja[i] == ja[i - 1])) {
            nduplic++;
            val[i - 1] = val[i] + val[i - 1];
            for (int k = i; k < nnz - 1; k++) {   // was k < nnz: the k + 1 accesses then ran past the end
                if ((ia[k + 1] == ia[i - 1]) && (ja[k + 1] == ja[i - 1])) {
                    /*nduplic++;*/
                    val[i - 1] = val[k + 1] + val[i - 1];
                    std::cout << "nduplic=" << nduplic << std::endl;
                    for (int j = k + 1; j < nnz - 1; ++j) {
                        val[j] = val[j + 1];
                        ia[j] = ia[j + 1];
                        ja[j] = ja[j + 1];
                    }
                }
            }
            for (int l = i; l < nnz - 1; ++l) {
                val[l] = val[l + 1];
                ia[l] = ia[l + 1];
                ja[l] = ja[l + 1];
            }
        }
    }
    nnz = nnz - nduplic;
    std::cout << "nduplic=" << nduplic << std::endl;
    // erase the surplus elements at the end of each array
    for (int i = 0; i < nduplic; ++i) {
        val.pop_back();
        ia.pop_back();
        ja.pop_back();
    }
}

void Convert_CSR(MKL_INT &job, int &N, double &ASCR, MKL_INT &AII, MKL_INT &AJJ, MKL_INT &NNZ, double &VAL, MKL_INT &IA, MKL_INT &JA, MKL_INT &info) {
    mkl_dcsrcoo(&job, &N, &ASCR, &AJJ, &AII, &NNZ, &VAL, &IA, &JA, &info);
}
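
The sketched alternative (my own code, not a drop-in-tested routine): sort a permutation of the triplets once with std::sort, then merge duplicates in a single linear pass, replacing both Sort_COO and Condense_COO:

#include <algorithm>
#include <vector>
#include "mkl_types.h"

// O(nnz log nnz) sort + condense of COO triplets with duplicate entries
void SortAndCondenseCOO(MKL_INT &nnz, std::vector<MKL_INT> &ia,
                        std::vector<MKL_INT> &ja, std::vector<double> &val)
{
    std::vector<std::size_t> p(static_cast<std::size_t>(nnz));
    for (std::size_t k = 0; k < p.size(); ++k) p[k] = k;
    std::sort(p.begin(), p.end(), [&](std::size_t a, std::size_t b) {
        return (ia[a] != ia[b]) ? (ia[a] < ia[b]) : (ja[a] < ja[b]);
    });

    std::vector<MKL_INT> ia2, ja2;
    std::vector<double>  val2;
    for (std::size_t k = 0; k < p.size(); ++k) {
        const std::size_t s = p[k];
        if (!ia2.empty() && ia2.back() == ia[s] && ja2.back() == ja[s]) {
            val2.back() += val[s];  // duplicate (same row, col): accumulate
        } else {
            ia2.push_back(ia[s]);
            ja2.push_back(ja[s]);
            val2.push_back(val[s]);
        }
    }
    ia.swap(ia2); ja.swap(ja2); val.swap(val2);
    nnz = static_cast<MKL_INT>(ia.size());
}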

Skylake HPL with Intel MKL


Hi Team,

 

We are unable to run HPL on Skylake.

Environment

CentOS 7.4

A single server with two Intel Xeon Gold 6148 CPUs (20 cores each, 2.4 GHz, 32 FLOPs per cycle per core), for an Rpeak of 3072 GFLOP/s.

Running the "xhpl_intel64_dynamic" precompiled binary from the Linux* package (l_mklb_p_2018.2.010).

 

References

https://software.intel.com/en-us/mkl-linux-developer-guide-configuring-parameters

https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite

 

We are unable to determine the P & Q values for the HPL.dat file, and how to launch the run.

We submitted a run with the input values 5 & 8 for P & Q respectively, along with mpirun -np 40. But the system hung.

We need your input for running HPL on the Skylake architecture; a sketch of a launch is given below.
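
For what it's worth, a sketch of the hybrid launch the prebuilt threaded binary is usually run with (assumptions: Intel MPI, one rank per socket, and P x Q in HPL.dat equal to the number of ranks, so P = 1, Q = 2 here; with mpirun -np 40 each of the 40 ranks can itself spawn threads, and that oversubscription is one hedged explanation for the hang):

# one MPI rank per socket on a 2-socket, 20-cores-per-socket node
export OMP_NUM_THREADS=20
export MKL_NUM_THREADS=20
mpirun -np 2 -genv I_MPI_PIN_DOMAIN=socket ./xhpl_intel64_dynamic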

 

Thank You

Atul Yadav

 

 

Memory leak in FEAST routines


Hi,

I was experimenting with the FEAST routines and found a problem with memory leaks.

In particular, the zfeast_heev routine works perfectly fine when the parameter "x" is allocated on the stack, e.g., when one defines:

 MKL_Complex16 x[DIM*m0];

but when it is dynamically allocated and stored on the heap, I get a segmentation fault upon calling free (or delete). This implies that the routine does something with this pointer that it is not supposed to do.

I append a minimal example. If you comment out "#define X_ON_HEAP_CPP", the memory is allocated on the stack and everything works fine. With the definition, or that of "#define X_ON_HEAP", the memory is allocated on the heap and the program segfaults when free (or delete) is called.

I am using mkl version 2018.2.199 and compile using g++ version 5.4.0, which I call with the compile flags "-I${MKLROOT}/include  -L${MKLROOT}/lib/intel64  -lmkl_intel_thread -lmkl_rt -lmkl_core -lmkl_intel_lp64 -lm"

 

Can somebody reproduce this issue, maybe also with icc? How would one correctly allocate dynamic memory for use in the FEAST routines?

Regards, Moritz

 

#include <iostream>
#include <cmath>
#include <string>
#include <complex>
#include <malloc.h>

#include "mkl.h"
#include "mkl_solvers_ee.h"

using namespace std;

int main(int args, char** argv){
  const MKL_INT m0=4;
  double lambdamin=-1,lambdamax=2;
 

  const MKL_INT DIM=4;
  MKL_Complex16 A[4][4]={  \
{{0.,0.},{sqrt(3.)/2.,0.},{0.,0.},{0.,0.}},  \
{{sqrt(3.)/2.,0.},{0.,0.},{1.,0.},{0.,0.}},  \
{{0.,0.},{1.,0.},{0.,0.},{sqrt(3.)/2.,0.}},  \
{{0.,0.},{0.,0.},{sqrt(3.)/2.,0.},{0.,0.}} };
//The exact eigenvalues are: -1.5, -0.5, 0.5 and 1.5

 

  MKL_Complex16 zero = {0.0, 0.0};
  char uplo = 'F';
  MKL_INT fpm[128];
  MKL_INT n=DIM;
  double epsout=0.;
  MKL_INT loop=0;
  MKL_INT m0var=m0;
  MKL_INT m=m0;
  double res=0.;
  int info=0;
  double lambdaptr[DIM];

#define X_ON_HEAP_CPP
#ifdef X_ON_HEAP   /* the original post had X_ON_HEAP_C here, which never
                      matched the free() guard at the bottom; names now agree */
  MKL_Complex16 *x=(MKL_Complex16*)malloc(sizeof(MKL_Complex16)*DIM*m0);
#elif defined(X_ON_HEAP_CPP)
  MKL_Complex16 *x=new MKL_Complex16[DIM*m0];
#else
  MKL_Complex16 x[DIM*m0];
#endif

  feastinit(fpm);
  fpm[0]=1;

  zfeast_heev( &uplo, &n,  \
      (MKL_Complex16 *) &A[0][0], &n,  \
      fpm, &epsout, &loop, &lambdamin, &lambdamax, &m0var, \
      lambdaptr, (MKL_Complex16 *) x, &m, &res, &info);

  cout<<"Nr of eigenvalues found: "<<m<<endl<<flush;
   
  for(int i=0;i<m;i++)cout<<lambdaptr[i]<<endl;

#ifdef X_ON_HEAP
  free(x);
#elif defined(X_ON_HEAP_CPP)
  delete[] x;
#endif
  cout<<"Done"<<endl<<flush;

  return 0;
}
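
If the heap path still crashes, a hedged workaround to try (not a confirmed fix) is MKL's own aligned allocator in place of malloc/new:

/* hedged alternative for x; the 64-byte alignment is a guess at what the
   routine's internals might expect */
MKL_Complex16 *x = (MKL_Complex16*)mkl_malloc(sizeof(MKL_Complex16)*DIM*m0, 64);
/* ... call zfeast_heev exactly as above ... */
mkl_free(x);

It may also be worth dropping -lmkl_rt from the link line: mixing the single dynamic library with the layered libraries (mkl_intel_lp64/mkl_core/mkl_intel_thread) can load two copies of MKL, which is another hedged guess at the crash.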

 

 

 

 

 


MKL Operations in TensorFlow


 

Hi all,

I run a TensorFlow model Inception with Intel MKL-DNN support. 

The execution times of MKL operations (e.g. _MklConv2DBackpropFilter, _MklConv2D, _MklConv2DBackpropInput, etc.) do not change when I change the number of intra-op threads (the number of threads running one operation). The results are below. Does anyone know why the performance of the MKL operations does not change with the number of threads running the operation? Or can anyone explain how the MKL-DNN support in TensorFlow is implemented, with respect to this situation? Thank you!

Name                       Intra-threads   Time
_MklConv2DBackpropFilter    8              470
                           17              466
                           34              468
                           68              467

_MklConv2DBackpropInput     8              354
                           17              344
                           34              347
                           68              347

_MklConv2D                  8              300
                           17              304
                           34              311
                           68              311
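
A hedged guess at an explanation: in MKL builds the threads inside each MKL op come from OpenMP and are governed by OMP_NUM_THREADS (plus the KMP_* affinity settings), while the intra-op setting sizes TensorFlow's own Eigen pool; varying only the intra-threads would then leave the MKL kernels' thread count unchanged. A sketch of varying both (TF 1.x API):

import os
import tensorflow as tf

os.environ["OMP_NUM_THREADS"] = "8"  # threads inside each MKL primitive
config = tf.ConfigProto(intra_op_parallelism_threads=8,
                        inter_op_parallelism_threads=2)
sess = tf.Session(config=config)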

Kevin

Uninstall of MKL 2017 Update 2 FAILING on Windows 10 (version 1803)


Uninstall of MKL 2017 Update 2 from my PC using the usual Control Panel uninstaller fails as follows:

Intel(R) Math Kernel Library 2017 Update 2 for Windows* Intel® Software Setup Assistant ended prematurely because of an error(s).

  Windows Installer failed to configure product:

Error message: The specified account already exists. 

Error code: 1603

Product code: 36D90182-C708-49CC-A6B9-6D9148159A01

Module name: ide_c_common_vs2012_p_17.0.2.187.msi

 

Installer logs location: C:\Users\Andrew\AppData\Local\Temp\pset_tmp_PSXE2017_Andrew\2018.05.12_21.01.08_0000354c\log\

 

It is being uninstalled from the same account that it was installed under, and the account has admin privileges.

 

Log files from the installer logs location are attached.

 

Support for legacy software migration


Hi,

We have a legacy software application currently running on Windows Server 2003 (x86) which we are planning to move to Windows Server 2012.

Our existing application uses the below-mentioned Intel Math Kernel Library components, v9.0.0.1.

• mkl_c
• mkl_lapack
• mkl_ia32
• libguide40
• libguide

Please let us know whether the same version of the libraries (32-bit) will be supported on Windows Server 2012.

Regards,
Tamil Selvan.

Mistake in the mkl_dnn documentation


Hi Everybody,

I have been using the Intel mkl_dnn library for convolutions and batch normalization, and I noticed a simple naming mistake in the documentation that took me around two hours to track down in my code.

dnnResourceScaleShift | Scale and shift data.

dnnResourceScaleShift | Gradient with respect to scale and shift.

The second enum should be dnnResourceDiffScaleShift, if I am not mistaken?

https://software.intel.com/en-us/mkl-developer-reference-c-enumerated-types

That's all.

PSGESVD: Illegal parameter 19


Hello,

I want to perform the SVD of a matrix in parallel (C), but I am facing issues with psgesvd. When I try to perform the SVD of a matrix of size 4750x4750, I get an error saying:

"{    0,    0}:  On entry to
PSGESVD parameter number   19 had an illegal value"

Parameter 19 is 'lwork', for which I first make a workspace query that returns the minimum required size of the 'work' array. The SVD works fine for smaller matrices, but big matrices give this error. Can anyone please help me understand the reason for it? The value of lwork in this case is 45362800.

When I use serial "sgesvd" I don't get any such issue and the results are perfectly correct. But I am not sure what is causing an issue in this case.

Thanks in advance!

-Shailesh Tripathi
