Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Is DTRSM a sequential code in the case of a single right hand side?


Hello,

DTRSM is the BLAS Level 3 function which I use to solve triangular linear systems with multiple right-hand sides.
I am perfectly happy with the performance and the parallel scalability of the multi-threaded variant when the number of right-hand sides n is sufficiently large.

My question concerns the special case of a single right-hand side. Here I see no evidence that the function has been parallelized, as the run-time is independent of the number of threads. The run-time does increase quadratically with the dimension m of the matrix, exactly as predicted by the raw flop count. Moreover, the run-time of DTRSM is approximately twice that of DTRSV. I can find no evidence that DTRSV has been parallelized either, as its run-time is also independent of the number of threads.

My specific questions are:
1) Is it correct that DTRSM defaults to sequential code in the case of a single RHS?
2) Is it correct that DTRSV is sequential code?

My motivation is the following: in LAPACK the function DLATRS can be used to solve triangular linear systems in a manner which eliminates the possibility of floating-point overflow. This is a sequential code. My colleagues and I at Umeå University are developing a parallel version of DLATRS. We need to make a fair comparison against the standard solvers DTRSM and DTRSV when the systems do not require overflow protection.
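For reference, a minimal timing sketch along the lines of the comparison I have in mind (the matrix size and thread count below are placeholders, not our benchmark settings):

/* Sketch: compare DTRSM with a single RHS against DTRSV on the same system. */
#include <stdio.h>
#include "mkl.h"

int main(void)
{
    const MKL_INT m = 8000;              /* placeholder dimension */
    double *A = (double*)mkl_malloc((size_t)m * m * sizeof(double), 64);
    double *b = (double*)mkl_malloc((size_t)m * sizeof(double), 64);

    /* well-conditioned lower-triangular matrix and a single RHS */
    for (MKL_INT j = 0; j < m; ++j) {
        for (MKL_INT i = 0; i < m; ++i)
            A[i + j * m] = (i > j) ? 1.0 / (double)m : 0.0;
        A[j + j * m] = 2.0;
        b[j] = 1.0;
    }

    mkl_set_num_threads(4);              /* vary this to test scalability */

    double t0 = dsecnd();                /* DTRSM with a single RHS (n = 1) */
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, m, 1, 1.0, A, m, b, m);
    double t1 = dsecnd();

    for (MKL_INT j = 0; j < m; ++j) b[j] = 1.0;   /* reset the RHS */

    double t2 = dsecnd();                /* DTRSV on the same system */
    cblas_dtrsv(CblasColMajor, CblasLower, CblasNoTrans, CblasNonUnit,
                m, A, m, b, 1);
    double t3 = dsecnd();

    printf("dtrsm (nrhs=1): %.3f s   dtrsv: %.3f s\n", t1 - t0, t3 - t2);

    mkl_free(A); mkl_free(b);
    return 0;
}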

 


MKL cluster solver: number of threads always equal 1


Hello everyone,

I'm using the MKL cluster sparse solver, but I have run into a problem with the number of threads: in my example the number of threads stays at 1 for every calculation instead of scaling up to the maximum number of threads available on my nodes.

Can someone help me?
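For reference, this is how I check the thread counts on each MPI rank before the solver call (a sketch only; the cluster_sparse_solver arguments themselves are omitted):

/* Sketch: report the OpenMP/MKL thread counts seen by each MPI rank
   before calling cluster_sparse_solver (solver call omitted). */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>
#include "mkl.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* If these print 1, the thread count is being limited before MKL is
       even reached (e.g. OMP_NUM_THREADS/MKL_NUM_THREADS set to 1, or the
       MPI launcher pinning each rank to a single core). */
    printf("rank %d: omp_get_max_threads = %d, mkl_get_max_threads = %d\n",
           rank, omp_get_max_threads(), mkl_get_max_threads());

    /* mkl_set_num_threads() can force a value per rank for testing. */
    /* mkl_set_num_threads(8); */

    /* ... build the distributed CSR data and call cluster_sparse_solver ... */

    MPI_Finalize();
    return 0;
}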

Sparse matrix complex number


We are going to solve linear systems with sparse matrices of complex numbers, and we would like to know which MKL routine is the most suitable to use.
We have read the documentation, but we still have some doubts.

We saw that we can use PARDISO functions or Direct Sparse Solver (DSS) Interface Routines.

We would appreciate some guidance. Any suggestion or help?
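To make the question concrete, the kind of call sequence we have in mind is sketched below (a toy 2x2 complex system solved with PARDISO and mtype = 13, i.e. complex and nonsymmetric; this is only our reading of the documentation, so please correct us if DSS or another interface fits better):

/* Sketch: PARDISO on a tiny complex unsymmetric system (1-based CSR). */
#include <stdio.h>
#include "mkl.h"

int main(void)
{
    MKL_INT n = 2, nrhs = 1;
    MKL_INT ia[3] = { 1, 2, 3 };                 /* row pointers, 1-based */
    MKL_INT ja[2] = { 1, 2 };                    /* column indices        */
    MKL_Complex16 a[2] = { {2.0, 1.0}, {3.0, -1.0} };
    MKL_Complex16 b[2] = { {1.0, 0.0}, {1.0, 0.0} };
    MKL_Complex16 x[2];

    void   *pt[64] = { 0 };                      /* internal solver handle */
    MKL_INT iparm[64];
    MKL_INT maxfct = 1, mnum = 1, msglvl = 0, error = 0;
    MKL_INT mtype = 13;                          /* complex, nonsymmetric  */
    MKL_INT phase = 13;                          /* analysis + factorization + solve */
    MKL_INT idum  = 0;                           /* perm is not used here  */

    pardisoinit(pt, &mtype, iparm);              /* default iparm for this mtype */

    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
            a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);
    printf("pardiso error = %d, x[0] = (%g, %g)\n",
           (int)error, x[0].real, x[0].imag);

    phase = -1;                                  /* release internal memory */
    pardiso(pt, &maxfct, &mnum, &mtype, &phase, &n,
            a, ia, ja, &idum, &nrhs, iparm, &msglvl, b, x, &error);
    return 0;
}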

Thank you

I get different results from MKL and LAPACK.


Has anybody experienced getting different results from -mkl and -llapack?

I've developed a Fortran code that uses zgeev as in

LWORK=-1
call zgeev('N','N',Ndim,U,Ndim,Eig,VL,1,VR,1,WORKdummy,LWORK,RWORK,Info)
LWORK=INT(DBLE(WORKdummy(1)))
allocate (WORK(LWORK))
call zgeev('N','N',Ndim,U,Ndim,Eig,VL,1,VR,1,WORK,LWORK,RWORK,INFO)

For some parameters (NOT always), I get different results when I link with MKL and when I link with LAPACK.
Here are the compiling commands:
ifort -O3 -mkl -heap_arrays QWrandomEig.f90
ifort -O3 -heap_arrays -llapack QWrandomEig.f90

I'm attaching the source code as well as the input file and the output files,
QWrandomEig_output_mkl.dat
QWrandomEig_output_lapack.dat
respectively, which are different from each other. From the anticipated behavior,
I believe the latter (with lapack) is the correct result. More disturbingly, when
I compile the code with -O0 -mkl as in
ifort -O0 -heap_arrays -mkl QWrandomEig.f90
I get the following error message when I run it:
Intel MKL ERROR: Parameter 2 was incorrect on entry to ZGEHD2.
although the code runs till the end and the result
QWrandomEig_output_mkl_O0.dat
seems basically the same as the one with -O3 -mkl.

I do not get this problem for other parameter sets, which is very annoying.
Could someone guess what might be wrong with the code (or the compiler, for that matter)?
The OS is macOS 10.13.4 and the Fortran compiler is ifort 18.0.2.

I appreciate your opinions.

non-negative matrix factorization


Hello,

Does anyone know if non-negative matrix factorization is available in MKL, or whether it will be implemented in MKL in the near future?

Thanks! 

Impossible to use MKL in Visual Studio 2017


Hello. I know that this topic has been widely addressed, but in my case I have tried all the proposed solutions I could find and none of them worked for me. The test code is the following:

// ConsoleApplication6.cpp : Defines the entry point for the console application.
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"
#define min(x,y) (((x) < (y)) ? (x) : (y))

int main()
{
	double *A, *B, *C;
	int m, n, k, i, j;
	double alpha, beta;

	m = 2000, k = 200, n = 1000;
	alpha = 1.0; beta = 0.0;
		
	A = new double[m*k];
	B = new double[k*n];
	C = new double[m*n];

	for (i = 0; i < (m*k); i++)
		A[i] = (double)(i + 1);

	for (i = 0; i < (k*n); i++)
		B[i] = (double)(-i - 1);

	for (i = 0; i < (m*n); i++)
		C[i] = 0.0;

	cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
		m, n, k, alpha, A, k, B, n, beta, C, n);

	delete[] A; delete[] B; delete[] C;

	return 0;
}

I get the following errors:

Error (active) E1696: cannot open source file "mkl.h"

Error C1083: Cannot open include file: 'mkl.h': No such file or directory

 

The program versions are:

  • Visual Studio Community 2017 15.6.0
  • Microsoft .NET Framework 4.7.03056
  • Intel MKL 2018.3.210
  • Intel C++ Compiler 18.0

The Intel Performance Libraries option inside Configuration Properties (Use Intel MKL) is set to Sequential, but the project still does not build correctly.

I don't know if it could be a compatibility issue between the latest releases of those programs... I would appreciate your responses.

Thanks in advance.

 

CNR Support Functions in Intel MKLML small libraries


Hi,

I'm using the Intel MKLML small libraries shipped with MKLDNN [1] to build a custom inference engine for neural networks. It's great to get MKL performance in such a small package.

However, it appears that the CNR support functions [2] are not part of this library. I'm currently using an internal function to get the current CNR branch:

extern "C" {
  int mkl_serv_cbwr_get_auto_branch();
}

Are there plans to add this set of functions? More generally, how are functions selected for inclusion in MKLML (for example, vsSin and vsCos are not included either)?
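For context, the calls I would normally use with the full MKL are along these lines (a short sketch):

/* Sketch: the documented CNR control calls in the full MKL, which do not
   appear to be exported by the MKLML small libraries. */
#include <stdio.h>
#include "mkl.h"

int main(void)
{
    /* Pin code paths to a common instruction set for reproducible results;
       must be called before any MKL compute function. */
    if (mkl_cbwr_set(MKL_CBWR_COMPATIBLE) != MKL_CBWR_SUCCESS)
        printf("mkl_cbwr_set failed\n");

    /* Query which branch CNR would pick automatically on this CPU. */
    printf("auto CNR branch = %d\n", mkl_cbwr_get_auto_branch());
    return 0;
}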

Thanks,

Guillaume

[1] https://github.com/intel/mkl-dnn/releases/download/v0.14/mklml_lnx_2018.0.3.20180406.tgz
[2] https://software.intel.com/en-us/mkl-developer-reference-c-conditional-numerical-reproducibility-control

MKL 2018 Update 3 (Windows) zgetri crash


MKL 2018 Update 3 has broken zgetri.

Our QA process now crashes in a call to zgetri. Reverting to the MKL 2018 Update 2 DLLs resolves the issue. It does not happen on every call to zgetri.

I will point out that Intel Inspector complains about zgetri and data races (and has for a long time...). The call stack at the crash is:

Not Flagged	>	14996	0	Main Thread	Main Thread	libiomp5md.dll!__kmp_task_team_wait
 	 	 	 	 	 	[External Code]
 	 	 	 	 	 	libiomp5md.dll!__kmp_task_team_wait(kmp_info * this_thr, kmp_team * team, void * itt_sync_obj, int wait) Line 401
 	 	 	 	 	 	libiomp5md.dll!__kmp_join_barrier(int gtid) Line 2037
 	 	 	 	 	 	libiomp5md.dll!__kmp_join_call(ident * loc, int gtid, fork_context_e fork_context, int exit_teams) Line 7493
 	 	 	 	 	 	libiomp5md.dll!__kmpc_fork_call(ident * loc, int argc, void(*)(int *, int *) microtask) Line 372
 	 	 	 	 	 	mkl_intel_thread.dll!000007feb0d84b70()
 	 	 	 	 	 	mkl_core.dll!000007feacbc61fe()

 


Strange partial pivoting of LAPACKE_dgetrf


Hello,

I'm using LAPACKE_dgetrf to compute the LU factorization of square matrices in double precision. The matrix is stored in column-major order. Here is what I am doing; the environment is MKL 2018 Update 3 for Windows with Visual Studio 2017.

for (...)
{
    lapack_int m = dim;    // dim is around 40 to 80

    double * A = (double*)mkl_malloc(m * m * sizeof(double), 64);
    memcpy(A, source, m * m * sizeof(double));

    lapack_int * ipiv = (lapack_int*)mkl_malloc(m * sizeof(lapack_int), 64);
    lapack_int info = 0;
    int mat_layout = LAPACK_COL_MAJOR;

    for (int k = 0; k < m; k++)    // flush ipiv before calling dgetrf
        ipiv[k] = -1;

    info = LAPACKE_dgetrf(mat_layout, m, m, A, m, ipiv);

    if (info != 0)
    {
        /* clean up memory */
        break;
    }
}

I checked the info returned by LAPACKE_dgetrf and it was always 0. However, I found duplicate entries in the ipiv array after each dgetrf call.

I mean I get ipiv[i] == ipiv[j] for some i != j (0 < i, j < m), which does not seem to make sense for partial pivoting. I also flush the ipiv array before calling dgetrf.

What is the possible reason for getting duplicate values in partial pivoting array?

Thank you.

 

largest matrix size by the LAPACKE_dsyevr


 Hi,

  I am trying to diagonalize a square matrix of size N = 73789 with MKL's LAPACKE_dsyevr, and I
need to find all the eigenvalues and eigenvectors. The machine I am running on uses
Intel(R) C++ Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.2.199 Build 20180210
and has 3 TB of RAM. The matrix itself occupies ~44 GB (73789^2 * 8 bytes ≈ 44 GB). The code calls
  info = LAPACKE_dsyevr( LAPACK_COL_MAJOR, 'V', 'A', 'U', N, A, LDA,
                        vl, vu, il, iu, abstol, &m, W, Z, LDZ, ISUPPZ );
 successfully, spends about ~6000 minutes in the routine (it multithreads across the available processors),
and the memory requirement is steady.
 However, when it seems to be nearly finished, there is a segmentation fault (core dumped).
 
 Can anyone tell me why this is happening, or whether this matrix is too big to be used with dsyevr?
 Thanks,

 Debasish

PDGEMM returns wrong result under specific conditions


Hello,

I encountered a problem using the ScaLAPACK routine PDGEMM under specific conditions. After some investigation, I managed to reproduce the problem with a simple test program. In my case, the error occurs for MKL versions 16 to 18 (I tested 17, 18.0 and 18.3 on our local cluster, linking with Intel MPI and libmkl_blacs_intelmpi_lp64; 16 and 17 on our national cluster Curie, linking with bullxmpi and libmkl_blacs_openmpi_lp64; and locally the latest MKL 18 release, linking with OpenMPI and libmkl_blacs_openmpi_lp64). Linking with a manually compiled ScaLAPACK 2.0.2 resolves the problem. Linking with MKL versions 14 and 15 runs fine too.

The test consists of multiplying two matrices with all coefficients set to 1 and then checking the result. It appears that for a 2 by 2 processor grid (i.e. mpirun -n 4), the resulting matrix can be wrong. Increasing the grid size corrects the problem. The error is silent, as the code runs and terminates normally. Apart from MKL, the test code is standalone; it is pasted below and attached.

Cordially,

Ivan Duchemin.

! 
! The program test pdgemm matrix x matrix multiplication under fixed condition 
! on a square processor grid provided by the user.
! 
! The product tested is:
! 
!  C = A * B
! 
! with A being a 8160 x 8160  matrix with all coeffs set to 1
! and  B being a 8160 x 19140 matrix with all coeffs set to 1
! The result expected is thus all coeffs of C equal to 8160
! 
PROGRAM TEST
  
  ! Parameters
  INTEGER         , PARAMETER :: M=8160, N =19140, K=8160, DLEN_=9
  INTEGER         , PARAMETER :: CSRC=1, RSRC=1
  DOUBLE PRECISION, PARAMETER :: ONE=1.0D+0, ZERO=0.0D+0
  
  ! work variables
  INTEGER                                       :: ICTXT
  INTEGER                                       :: IAM
  INTEGER                                       :: NPROCS
  INTEGER                                       :: NPROW
  INTEGER                                       :: NPCOL
  INTEGER                                       :: MYROW
  INTEGER                                       :: MYCOL
  INTEGER                                       :: DESCA(9)
  INTEGER                                       :: DESCB(9)
  INTEGER                                       :: DESCC(9)
  INTEGER                                       :: M_A
  INTEGER                                       :: N_A
  INTEGER                                       :: M_B
  INTEGER                                       :: N_B
  INTEGER                                       :: M_C
  INTEGER                                       :: N_C
  INTEGER                                       :: MB_A
  INTEGER                                       :: NB_A
  INTEGER                                       :: MB_B
  INTEGER                                       :: NB_B
  INTEGER                                       :: MB_C
  INTEGER                                       :: NB_C
  DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: A
  DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: B
  DOUBLE PRECISION, ALLOCATABLE, DIMENSION(:,:) :: C
  
  ! Get starting information
  CALL BLACS_PINFO( IAM, NPROCS )
  
  ! try setting square grid
  NPROW = sqrt(REAL(NPROCS,kind=8))
  NPCOL = sqrt(REAL(NPROCS,kind=8))
  if ( NPROW*NPCOL .ne. NPROCS ) then
    print *,"please provide a square number of procs"
    stop 1
  end if
  
  ! Define process grid
  CALL BLACS_GET( -1, 0, ICTXT )
  CALL BLACS_GRIDINIT( ICTXT, 'R', NPROW, NPCOL )
  CALL BLACS_GRIDINFO( ICTXT, NPROW, NPCOL, MYROW, MYCOL )
  
  ! set A matrix dimensions
  M_A = M
  N_A = K
  
  ! set B matrix dimensions
  M_B = K
  N_B = N
  
  ! set C matrix dimensions
  M_C = M
  N_C = N
  
  ! set blocking factors for A matrix
  MB_A = M_A/NPROW
  NB_A = N_A/NPCOL
  
  ! set blocking factors for B matrix
  MB_B = M_B/NPROW
  NB_B = 32
  
  ! set blocking factors for C matrix
  MB_C = M_C/NPROW
  NB_C = 32
  
  ! get A local dimensions
  MLOC_A = NUMROC( M_A, MB_A, MYROW, 0, NPROW )
  NLOC_A = NUMROC( N_A, NB_A, MYCOL, 0, NPCOL )
  
  ! get B local dimensions
  MLOC_B = NUMROC( M_B, MB_B, MYROW, 0, NPROW )
  NLOC_B = NUMROC( N_B, NB_B, MYCOL, 0, NPCOL )
  
  ! get C local dimensions
  MLOC_C = NUMROC( M_C, MB_C, MYROW, 0, NPROW )
  NLOC_C = NUMROC( N_C, NB_C, MYCOL, 0, NPCOL )
  
  ! Initialize the array descriptor for the matrix A, B and C
  CALL DESCINIT( DESCA, M_A, N_A, MB_A, NB_A, 0, 0, ICTXT, max(MLOC_A,1), INFO )
  CALL DESCINIT( DESCB, M_B, N_B, MB_B, NB_B, 0, 0, ICTXT, max(MLOC_B,1), INFO )
  CALL DESCINIT( DESCC, M_C, N_C, MB_C, NB_C, 0, 0, ICTXT, max(MLOC_C,1), INFO )
  
  ! print grid infos
  do IPROC=0,NPROCS-1
    if ( IPROC .eq. IAM ) then
      print *,""
      print *,"-------------------------"
      print *,"PROC, MYROW, MYCOL :",PROC,MYROW,MYCOL
      print *,"MLOC_A, NLOC_A :",MLOC_A,NLOC_A
      print *,"MLOC_B, NLOC_B :",MLOC_B,NLOC_B
      print *,"MLOC_C, NLOC_C :",MLOC_C,NLOC_C
      print *,"DESCA :",DESCA
      print *,"DESCB :",DESCB
      print *,"DESCC :",DESCC
      print *,"-------------------------"
      print *,""
    end if
    CALL SLEEP(2)
  end do
  
  ! allocate and set matrices
  ALLOCATE( A(MLOC_A,NLOC_A) )
  ALLOCATE( B(MLOC_B,NLOC_B) )
  ALLOCATE( C(MLOC_C,NLOC_C) )
  
  ! init A matrix
  do j=1,NLOC_A
    do i=1,MLOC_A
      A(i,j)=ONE
    end do
  end do
  
  ! init B matrix
  do j=1,NLOC_B
    do i=1,MLOC_B
      B(i,j)=ONE
    end do
  end do
  
  ! compute A * B
  CALL PDGEMM('N', 'N',       &
&             M, N, K,        &
&             ONE,            &
&             A, 1, 1, DESCA, &
&             B, 1, 1, DESCB, &
&             ZERO,           &
&             C, 1, 1, DESCC )
  
  ! check result
  do j=1,NLOC_C
    do i=1,MLOC_C
      if ( abs(C(i,j)-K) .gt. 1.0D-8 ) then
        print *,"Error: result differs from exact"
        print *,"C(",i,",",j,")=",C(i,j)
        print *,"expected ",K
        print *,"TEST FAILED!"
        stop 2
      end if
    end do
  end do
 
  ! inform that everything is ok
  if ( IAM .eq. 0 ) then
    print *,"TEST PASSED!"
  end if 

  ! terminate
  CALL BLACS_GRIDEXIT( ICTXT )
  CALL BLACS_EXIT( 0 )
  
END PROGRAM

Attachment: test_pdgemm.f90 (4.94 KB)

Shared library using MKL crashes on runtime with EXC_BAD_ACCESS


Hi everyone!

I'm trying to create an application for macOS that uses some MKL methods for its math computations, namely dfdInterpolate1D(). I'm facing a difficult issue with random crashes due to bad-access exceptions during the interpolation call. I have tried several things and can say, fairly safely, that it is not an obvious buffer-allocation error on my side. Generally, I'm following the example templates that can be found at https://software.intel.com/en-us/mkl-developer-reference-c-data-fitting-....
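For reference, my call sequence follows the documented pattern roughly like the sketch below (placeholder sizes and data, written from memory of the example templates, not my exact production code):

/* Reduced sketch of the data-fitting calls I use (placeholder data). */
#include <stdio.h>
#include "mkl.h"

#define NX     10                               /* breakpoints */
#define NSITE  100                              /* interpolation sites */

int main(void)
{
    double x[NX], y[NX], site[NSITE], r[NSITE];
    double scoeff[(NX - 1) * DF_PP_CUBIC];      /* cubic spline coefficients */
    MKL_INT dorder[1] = { 1 };                  /* function values only */

    for (int i = 0; i < NX; ++i)    { x[i] = i;  y[i] = (double)i * i; }
    for (int i = 0; i < NSITE; ++i) { site[i] = 9.0 * i / (NSITE - 1); }

    DFTaskPtr task;
    int status = dfdNewTask1D(&task, NX, x, DF_NO_HINT, 1, y, DF_NO_HINT);
    status = dfdEditPPSpline1D(task, DF_PP_CUBIC, DF_PP_NATURAL,
                               DF_BC_FREE_END, 0, DF_NO_IC, 0,
                               scoeff, DF_NO_HINT);
    status = dfdConstruct1D(task, DF_PP_SPLINE, DF_METHOD_STD);
    status = dfdInterpolate1D(task, DF_INTERP, DF_METHOD_PP,
                              NSITE, site, DF_NO_HINT,
                              1, dorder, 0, r, DF_NO_HINT, 0);
    dfDeleteTask(&task);

    printf("status = %d, r[0] = %g\n", status, r[0]);
    return 0;
}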

After a lengthy investigation, I think the issue comes from the linking scheme of the project. Due to some design specs, the base code is wrapped into a static library that is then used to build the application. My findings are as follows:

  • Isolating the code region that crashes from its surroundings and compiling it into a standalone executable runs fine with no problem.
  • Wrapping the same isolated code into a static or dynamic library, and then having a standalone executable load and use that library, produces the exact same crash.
  • Turning on Xcode's memory-management diagnostics makes the code crash consistently at the same point, instead of the initial random crashes.

So, my question is: is there a known issue, or is there a particular way the MKL spline interpolation has to be wrapped into a library? Is there anything I specifically have to be careful with for this use case? I'm using the linker and compiler options proposed by the MKL link line advisor.

Thank you very much in advance and sorry for the long post.

P.S. I'm using macOS 10.13.5, Xcode 9.4.1, MKL 2018.0.104.

iparm[35] = 1 or 2 question


Dear Intel

I'm a user of MKL 2018 Update 1 (Parallel Studio).

Usually I handle large matrices (>50M x 50M) using the Schur complement.

'iparm[35] = 1' is my usual setting, and it returns the Schur complement matrix.

However, when I change 'iparm[35]' from 1 to 2, I get an error during factorization (phase = 2).

The error code is -4 (not SPD).

I have already checked iparm[29] and matrix->a[iparm[29]], but there is no zero pivot or violation of the SPD conditions.

I think that if the matrix can be factorized with 'iparm[35] = 1', then 'iparm[35] = 2' should also be OK.

I would appreciate your advice.

Thank you very much in advance!!!

Regards,
 Yong-hee

Must the Block in sparse BSR format be Square?


Dear all,

I am new to the MKL Sparse BLAS and want to use BSR as my sparse matrix format.

I notice that the block size is specified by a single integer. Does this mean the block must be a square matrix?

Is there a way to specify a non-square block in BSR?
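For reference, the creation call I am looking at takes a single block_size argument, as in the sketch below (a toy 4x4 matrix stored as 2x2 blocks, written from my reading of the inspector-executor interface):

/* Sketch: creating a BSR matrix with the inspector-executor interface.
   block_size is a single integer, which is what prompted my question. */
#include <stdio.h>
#include "mkl.h"
#include "mkl_spblas.h"

int main(void)
{
    /* 4x4 matrix made of 2x2 blocks: two block rows, two block columns,
       with one dense block on each block-diagonal position. */
    MKL_INT block_size    = 2;
    MKL_INT rows_start[2] = { 0, 1 };      /* block-row pointers (0-based) */
    MKL_INT rows_end[2]   = { 1, 2 };
    MKL_INT col_indx[2]   = { 0, 1 };      /* block-column indices */
    double  values[8]     = { 1, 2, 3, 4,  5, 6, 7, 8 };  /* 2 blocks, row-major */

    sparse_matrix_t A;
    sparse_status_t st = mkl_sparse_d_create_bsr(
        &A, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR,
        2, 2,                              /* block rows / block columns */
        block_size,
        rows_start, rows_end, col_indx, values);

    printf("mkl_sparse_d_create_bsr status = %d\n", (int)st);
    mkl_sparse_destroy(A);
    return 0;
}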

 

Best,

Zhihao

 

dyld: Library not loaded: @rpath/libmkl_intel_lp64.dylib when running scripts


I tried to run a series of simulations using GROMACS 4.6.5 on a Mac Pro on which Intel Composer XE 2013 is installed. If I manually submit an individual job using commands in a terminal, I do not have any problems. However, if I submit a series of jobs using Perl scripts to run those same commands, I always get the error message

dyld: Library not loaded: @rpath/libmkl_intel_lp64.dylib

The $DYLD_LIBRARY_PATH variable has been set and the directory is accessible.

I do not know what is going wrong here. I read the discussion at https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/..., but I am not sure how to set up the library linking properly. Do I need to recompile GROMACS, or simply set the "link with libiomp5.dylib and MKL lib path" mentioned in that link? If I run "icc -o dgemm_example dgemm_example.c -L$MKLROOT/lib -Wl,-rpath,$MKLROOT/lib -Wl,-rpath,$MKLROOT/../compiler/lib -mkl" as suggested in that discussion, how do I back up the original settings in case it fails, so that I can recover them? Any further information and help would be sincerely appreciated.

 


Potential Buffer overwrite in DGEEV when compiled in a debug mode.


Hi, 

We recently upgraded our MKL to version 18.3 and TBB to version 18.4. When running a debug version of our code we noticed heap corruption when calling the DGEEV MKL function: the work vector passed into the function had been overwritten. Importantly, one can query the optimum length of the work vector by passing -1 as its length. In this particular case we have a 60x60 input matrix, and the optimum work-vector length was returned as 2040; however, we were passing in a work vector of length 6x60 = 360. After running DGEEV in debug mode, it appears that the function assumed the work vector had the optimum length, as all of the data up to that length (2040) had been modified.

When running a release version of our application, the DGEEV function did not write beyond the specified work-vector length of 360.

I then created a console application that loads the input matrix from a file and calls DGEEV. It showed the same behavior: potential heap corruption of the work vector.

Is this a known issue? If so, are there any other MKL functions that we must worry about?

Thanks in advance for your response. 

Murray

P.S. I realize that querying the function for the optimum work vector length is a workaround. 
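For reference, the query pattern we have switched to looks like the sketch below (a generic example with placeholder matrix contents, not our exact call):

/* Sketch of the LWORK = -1 workspace query for DGEEV. */
#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

int main(void)
{
    MKL_INT n = 60, lda = 60, ldvl = 1, ldvr = 60, info = 0;
    double *a  = (double*)mkl_malloc((size_t)lda * n * sizeof(double), 64);
    double *wr = (double*)mkl_malloc((size_t)n * sizeof(double), 64);
    double *wi = (double*)mkl_malloc((size_t)n * sizeof(double), 64);
    double *vr = (double*)mkl_malloc((size_t)ldvr * n * sizeof(double), 64);
    double vldum[1];                       /* not referenced with jobvl = 'N' */

    for (MKL_INT i = 0; i < lda * n; ++i)
        a[i] = (double)rand() / RAND_MAX;

    /* Workspace query: lwork = -1 returns the optimal size in wkopt. */
    MKL_INT lwork = -1;
    double wkopt = 0.0;
    dgeev("N", "V", &n, a, &lda, wr, wi, vldum, &ldvl, vr, &ldvr,
          &wkopt, &lwork, &info);

    lwork = (MKL_INT)wkopt;
    double *work = (double*)mkl_malloc((size_t)lwork * sizeof(double), 64);
    dgeev("N", "V", &n, a, &lda, wr, wi, vldum, &ldvl, vr, &ldvr,
          work, &lwork, &info);

    printf("info = %lld, optimal lwork = %lld\n",
           (long long)info, (long long)lwork);

    mkl_free(a); mkl_free(wr); mkl_free(wi); mkl_free(vr); mkl_free(work);
    return 0;
}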


MKL GEMM slower for larger matrices


For matrix mul A(m,k) * B(k,n):

The case m=9, k=256, n=256 is faster than m=9, k=512, n=512 and all larger k and n.

On my E5-2630 v3 (16 cores, HT disabled), k,n=256 gets 850 GFLOPS while k,n=512 only gets 257 GFLOPS.

Here is my testing code. I am doing 64 GEMMs here:

#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <algorithm>
#include <functional>
#include <random>
#include <omp.h>
#include <mkl.h>
#include <unistd.h>


#define ITERATION 10000

int main(int argc, char *argv[])
{
	int opt;
	int n = 9;
	int c = 256; int c_block = 256;
	int k = 256; int k_block = 256;
	int t = 1;
	while ((opt = getopt(argc, argv, "n:c:k:t:")) != -1) {
		switch (opt) {
			case 'n': n = strtol(optarg, NULL, 10); break;
			case 'c': c = strtol(optarg, NULL, 10); break;
			case 'k': k = strtol(optarg, NULL, 10); break;
			case 't': t = strtol(optarg, NULL, 10); break;
			default: printf("unknown option\n");
		}
	}

	omp_set_dynamic(0);
	omp_set_num_threads(t);
	
	float *AS[64], *BS[64], *CS[64];
	for (int i = 0; i < 64; ++i) {
		AS[i] = (float*)mkl_malloc(sizeof(float)*n*c, 64);
		BS[i] = (float*)mkl_malloc(sizeof(float)*c*k, 64);
		CS[i] = (float*)mkl_malloc(sizeof(float)*n*k, 64);
	} 
	
	auto randgen = std::bind(std::uniform_real_distribution<float>(), std::mt19937(0));
	for (int i = 0; i < 64; ++i) {
		std::generate(AS[i], AS[i]+n*c, std::ref(randgen));
		std::generate(BS[i], BS[i]+c*k, std::ref(randgen));
		// std::generate(CS[i], CS[i]+n*k, std::ref(randgen));
	}

	using Clock = std::chrono::high_resolution_clock;
	auto t1 = Clock::now();
	for (int iter = 0; iter < ITERATION; ++iter) {
		#pragma omp parallel
		{
			const int nthreads = omp_get_num_threads();
    		const int mythread = omp_get_thread_num();
    		const int start = mythread*64/nthreads;
   			const int finish = (mythread+1)*64/nthreads;  
			mkl_set_num_threads_local(1);
			for (int i = start; i < finish; ++i)
			{
				float * A = AS[i];
				float * B = BS[i];
				float * C = CS[i];
				cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, n, k, c, 1, A, c, B, k, 0, C, k);
			}
		}
		
	}
	auto t2 = Clock::now();
	auto elapsed = t2 - t1;
	auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
	// printf("%.1lfs\n", 1e-9 * time);
	printf("%.lfGFLOPS\n", 1.0 * ITERATION * 64 * 2 * n * c * k / time);
	
	for (int i = 0; i < 64; ++i) {
		mkl_free(AS[i]);
		mkl_free(BS[i]);
		mkl_free(CS[i]);
	} 
	return 0;
}

 

Segfault when using cblas_dgemm()


When attempting to use the cblas_dgemm() function, I am experiencing a segmentation fault. The segfault occurs only when running the code on a Linux machine (it works on a Windows machine).

The code is set up as follows:

#define ML_ROWS_A 5
#define ML_COLS_B 7
#define ML_COLS_A 6

double a_d[ML_ROWS_A*ML_COLS_A], b_d[ML_COLS_A*ML_COLS_B], c_d[ML_ROWS_A*ML_COLS_B];

const double alpha = 1.0;
const double beta  = 1.0;

int m = ML_ROWS_A;
int n = ML_COLS_B;
int k = ML_COLS_A;

cblas_dgemm( CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, k, alpha, a_d, k, b_d, n, beta, c_d, n );

If I add the calls:

int n_alloc;

mkl_mem_stat(&n_alloc);

Before calling cblas_dgemm(), the function works correctly (even though no memory has been allocated through MKL calls). Similarly, if I instead allocate memory through an MKL call (such as mkl_calloc) before calling cblas_dgemm(), the function also works correctly (even though I never pass the allocated memory to the function call). Can anyone explain why the segfault is occurring, and why calling certain MKL functions before cblas_dgemm remedies the issue?

side change for mkl_dcsrmm


Hi all

The current version of mkl_dcsrmm implements the operations

C = A*B + C and C = A'*B + C, where in both cases A is the CSR sparse matrix.

Is there any way or alternative routine such that B is the CSR sparse matrix?

In my specific application, C and A are large dense matrices of dimension e.g. 3,000,000 x 70, while B is 70 x 70 with up to 80% zero coefficients. Ignoring the sparse structure of B would lead to dgemm, but that causes a large overhead from multiplications by zero. mkl_dcsrmm would also be feasible if A and C were transposed before the operation; however, due to their size this is not possible speed-wise (mkl_dcsrmm is called up to 10,000 times).

Any suggestions appreciated.

Cheers

Problems with environment variables and using code examples


I was following the installation guide up to the environment-variable step: https://software.intel.com/en-us/mkl-windows-developer-guide-setting-env...

I entered
mklvars intel64
in the command shell. The script executed fine, and when I checked the PATH variable in that command shell I could see the added path. But if I check System Properties -> Environment Variables, the Path variable remains unchanged. Also, when I restart the command shell, the Path variable is the same as before running the script.

Also, if I try to run a sample code in VS2017, the line

#include "mkl.h"

gets underlined with the message "cannot open source file mkl.h".

The MKL installation completed without errors, and all the files mentioned in https://software.intel.com/en-us/mkl-windows-developer-guide-checking-yo... are in place.

If I understand correctly, this error is due to mkl.h not being on the include path.

Am I missing something? How can I correct it?

Thanks.
