Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

Error in PARDISO memory allocation: MATCHING_REORDERING_DATA


Hi,
I am running PARDISO routines to solve a linear equation system with more than 40000 parameters. I use the "intel mkl 2016.1.150" libraries and set the environment variable "export MKL_PARDISO_OOC_MAX_CORE_SIZE=120000". My system has 138 GB of RAM. Below is an extract of my code, which takes the A matrix in CSR3 format as input and returns the solution vector x as output.

SUBROUTINE parsol(neq, a, ja, ia)

IMPLICIT NONE

c List of Parameters
c ------------------
        TYPE(t_neq)               :: neq
        INTEGER*4, DIMENSION(*)   :: ia
        INTEGER*4, DIMENSION(*)   :: ja
        REAL*8, DIMENSION(*)      :: a

c Local Parameters
c ----------------

c Local Variables
c ---------------
C..     Internal solver memory pointer
        INTEGER*8 pt(64)

C..     All other variables
        INTEGER*4    maxfct, mnum, mtype, phase, error, nrhs, msglvl
        INTEGER*4    iparm(64)
        REAL*8       dparm(64)
        REAL*8       b(neq%misc%npar)
        REAL*8       x(neq%misc%npar)

        INTEGER*4 i, j, idum, solver
        REAL*8  waltime1, waltime2, ddum, normb, normr

C.. Fill all arrays containing matrix data.

C   Number of right-hand-sides to solve
      nrhs = 1
C   Other parameters
      maxfct = 1
      mnum = 1

C
C  .. Setup Pardiso control parameters and initialize the solver's
C     internal address pointers. This is only necessary for the FIRST
C     call of the PARDISO solver.

C  mtype = ...
C       1    real and structurally symmetric
C       2    real and symmetric positive definite
C       -2    real and symmetric indefinite
C       3    complex and structurally symmetric
C       4    complex and Hermitian positive definite
C       -4    complex and Hermitian indefinite
C       6    complex and symmetric
C       11    real and nonsymmetric
C       13    complex and nonsymmetric

      mtype     = 2

C  Initialisation
      pt(:) = 0
      iparm(1) = 0      ! initializes all iparm to their default values

      CALL pardisoinit(pt, mtype, iparm)

C  .. Memory use (in or out core)
      iparm(27) = 1
      iparm(60) = 1

C..   Reordering and Symbolic Factorization, This step also allocates
C     all memory that is necessary for the factorization

      phase     = 11  ! only reordering and symbolic factorization
      msglvl    = 1   ! with (1) or without (0) statistical information

      WRITE(*,*) 'Starting reordering ...'

      CALL pardiso (pt, maxfct, mnum, mtype, phase,
     1              neq%misc%npar, a, ia, ja,
     1              idum, nrhs, iparm, msglvl, ddum, ddum, error, dparm)

      WRITE(*,*) 'Reordering completed ! ',
     1            max(iparm(15), iparm(16)+iparm(63))

      IF (error .NE. 0) THEN
        WRITE(*,*) 'The following ERROR was detected: ', error
        STOP
      END IF

C.. Factorization.
C  phase = ...
C       11    Analysis
C       12    Analysis, numerical factorization
C       13    Analysis, numerical factorization, solve, iterative refinement
C       22    Numerical factorization
C       23    Numerical factorization, solve, iterative refinement
C       33    Solve, iterative refinement
C       331   like phase=33, but only forward substitution
C       332   like phase=33, but only diagonal substitution (if available)
C       333   like phase=33, but only backward substitution
C       0    Release internal memory for L and U matrix number mnum
C       -1   Release all internal memory for all matrices

      phase     = 22  ! only factorization
      CALL pardiso (pt, maxfct, mnum, mtype, phase,
     1              neq%misc%npar, a, ia, ja, idum,
     2              nrhs, iparm, msglvl, ddum, ddum, error, dparm)

      WRITE(*,*) 'Factorization completed ... '
      IF (error .NE. 0) THEN
         WRITE(*,*) 'The following ERROR was detected: ', error
         STOP
      ENDIF

C.. Back substitution and iterative refinement
      iparm(8)  = 1   ! max numbers of iterative refinement steps
      phase     = 33  ! only solve

      b = neq%bnor

      CALL pardiso (pt, maxfct, mnum, mtype, phase,
     1              neq%misc%npar, a, ia, ja,
     1              idum, nrhs, iparm, msglvl, b, x, error, dparm)

      WRITE(*,*) 'Solve completed ...  '

      neq%xxx = x

C.. Memory release
      phase     = -1  ! release all internal memory

      CALL pardiso (pt, maxfct, mnum, mtype, phase,
     1              neq%misc%npar, a, ia, ja,
     1              idum, nrhs, iparm, msglvl, b, x, error, dparm)

      WRITE(*,*) 'Memory released ...  '

Below is the program output, which reports a memory allocation problem. When I run the same program with the same configuration and up to around 32000 parameters, everything works smoothly (and it is astonishingly fast and efficient!).

 Starting reordering ...
*** Error in PARDISO  (     insufficient_memory) error_num= 1
*** Error in PARDISO memory allocation: MATCHING_REORDERING_DATA, allocation of 1 bytes failed
total memory wanted here: 6388548 kbyte

=== PARDISO: solving a symmetric positive definite system ===
1-based array indexing is turned ON
PARDISO double precision computation is turned ON
METIS algorithm at reorder step is turned ON


Summary: ( reordering phase )
================

Times:
======
Time spent in calculations of symmetric matrix portrait (fulladj): 22.780998 s
Time spent in reordering of the initial matrix (reorder)         : 0.000000 s
Time spent in symbolic factorization (symbfct)                   : 0.000000 s
Time spent in allocation of internal data structures (malloc)    : 1.213865 s
Time spent in additional calculations                            : 9.502059 s
Total time spent                                                 : 33.496922 s

Statistics:
===========
Parallel Direct Factorization is running on 16 OpenMP

< Linear system Ax = b >
             number of equations:           40397
             number of non-zeros in A:      815979003
             number of non-zeros in A (%): 50.001238

             number of right-hand sides:    1

< Factors L and U >
             number of columns for each panel: 64
             number of independent subgraphs:  0
< Preprocessing with state of the art partitioning metis >
             number of supernodes:                    0
             size of largest supernode:               0
             number of non-zeros in L:                0
             number of non-zeros in U:                0
             number of non-zeros in L+U:              0
 Reordering completed !            0
 The following ERROR was detected:           -2

Any idea what the problem is? I do not think it is a hardware limitation, since the machine has 136 GB of RAM and the reported request is only about 6 GB ... I also tried to solve the problem in OOC mode or with fewer threads, without any luck.

Thanks for your help,

Stefano


Problems with linking libraries


Dear all,

I'm using ifort composer_xe_2013_sp1.2.144 with Ubuntu 14.04.4

I want to install a program following the instruction given for other compilers and platforms.

I compiled the relevant source files and created a library, libtrlan.a. With "ar t" I checked that all objects are included.

Then I wanted to create a test program that uses the routines in libtrlan.a, which themselves need LAPACK and BLAS routines.

Problem 1:

Irrespective of whether I try to link the MKL libraries into the exe or not, none of the routines for which interfaces are given in a MOD file is found:

ifort -o simple.exe  libtrlan.a  simple.o

yields    simple.f90:(.text+0x88): undefined reference to `trl_init_info_' etc.

If I change the order of the files

ifort -o simple.exe    simple.o libtrlan.a

there are undefined references to the BLAS routines that are not yet added.

Problem 2:

When I try to link the MKL libraries with

ifort  -o simple.exe -L/opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/  libmkl_lapack95_lp64.a  libmkl_blas95_lp64.a  simple.o  libtrlan.a

I get the following Fortran errors:

ifort: error #10236: File not found:  'libmkl_lapack95_lp64.a'
ifort: error #10236: File not found:  'libmkl_blas95_lp64.a'

When I do not use the -Ldir option

ifort  -o simple.exe /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_lapack95_lp64.a /opt/intel/composer_xe_2013_sp1.2.144/mkl/lib/intel64/libmkl_blas95_lp64.a simple.o  libtrlan.a

I again get the undefined references to the BLAS routines.

I have no idea what's going wrong.

Please, help.

alex

 

Bug in GESDD (but not GESVD)


I have found a bug in Parallel Studio 16.0.2 where I get an error when computing the SVD with GESDD in the Python package SciPy. It can be reproduced on an MKL-built SciPy with this array, which is finite (contains no NaN or inf), as follows:

>>> import numpy as np
>>> from scipy import linalg
>>> linalg.svd(np.load('fail.npy'), full_matrices=False)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/larsoner/.local/lib/python2.7/site-packages/scipy/linalg/decomp_svd.py", line 119, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.linalg.LinAlgError: SVD did not converge

I am curious if anyone has insight into why this fails, or can reproduce it themselves. I do have access to older MKL routines so if it's helpful I could see if I get the error elsewhere, too.

I have tried this with MKL-enabled Anaconda, and it does not fail, although I do experience similar failures with other arrays with the Anaconda version, which seem to only happen on systems with SSE4.2 but no AVX extensions.

I recently worked on SciPy's SVD routines to add a wrapper for a GESVD backend (to complement the existing GESDD routine) here, and this command passes on bleeding-edge SciPy, so it does seem to be a problem with the GESDD implementation specifically:

>>> linalg.svd(np.load('fail.npy'), full_matrices=False, lapack_driver='gesvd')

 

Error compiling MKL for Xeon Phi (MIC) with Compiler assisted Offload mode in Visual Studio


Hello all,

I am new to Xeon Phi and MKL and I am trying to compile the Compiler assisted Offload for the sgemm example in C++ in Windows using Visual Studio (VS) Prof. R12.

I have set the properties in VS project to include MKL libraries, parallel for ilp64 (32bit ints) and set "mandatory" in the offload mode.

The tests I am running are on a 7200 KNC MIC. I have tried other simple applications, all successful, including reductions with compiler-assisted offloading, the implicit memory model, and native execution. The driver is working fine and I can SSH to the MIC and mount a shared NFS directory. But calling an MKL function from VS using compiler-assisted offloading has not worked.

The error I get follows:

"error : function "sgemm" called in offload region must have been declared with compatible "target" attribute  sgemm(&transA, &transB, &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);"

It would seem that the cross compilation is simply not happening because the compiler is not informed of it and there is no forward declaration instructing the compiler to do it. I would assume that this gets done in the "mkl.h" header file or some of the other ones called therein.
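For reference, the kind of "target" declaration the error message asks for is usually supplied by wrapping the header include in the Intel compiler's offload_attribute pragmas; the sketch below only illustrates that pattern (it assumes the Intel C/C++ compiler's offload pragmas and uses illustrative array sizes; whether mkl.h already carries the attribute depends on the MKL version and project settings):

#pragma offload_attribute(push, target(mic))
#include <mkl.h>
#pragma offload_attribute(pop)

/* Hedged sketch: run sgemm inside an offload region on the coprocessor. */
void run_sgemm(float *A, float *B, float *C, MKL_INT n)
{
    char transA = 'N', transB = 'N';
    float alpha = 1.0f, beta = 0.0f;

    /* Ship A and B to the MIC, bring C back. */
    #pragma offload target(mic) in(A : length(n * n)) in(B : length(n * n)) inout(C : length(n * n))
    {
        sgemm(&transA, &transB, &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
    }
}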

I have also added the MKL include and lib paths in the VS paths entry in my solution's properties to no avail.

David

Convolution transformation using SDCON


Hi all!

I'm using the SDCON routine to perform a convolution transformation with Fortran, and I have to admit that it performs (by far) better than what we had been writing ourselves.

I did a simple profiling of my code to check which part is the heaviest in terms of CPU time. It turned out to be the part including the convolution calculation, essentially because I'm calling the SDCON routine 7 times at each time step!

I'm wondering whether a parallel (or multi-threaded) version of it exists. I'm looking to enhance the performance of my code, and it would be really helpful.

Thank you very much for helping me.

With best regards,

Ouissem

Intel MKL BSR sparse matrix product operations


Hello,

In an MSc project we are implementing Hadamard, Khatri-Rao, and Kronecker products on BSR sparse matrices. Is there any paper/reference regarding the algorithms implemented in the standard Intel MKL sparse matrix * matrix multiplication?

Regards,

Sérgio Caldas

Should I expect this difference from DGETRS?


Hi, I've got a program that uses MKL and the Intel Composer 2015 compiler on Mac OS X that is giving me different results on an Ivy Bridge (a MacMini6,1 system with an i5-3210M CPU) and a Haswell (MacMini7,1 with an i5-4278U CPU) for the same binary.  Both systems are running Yosemite (10.10.5).  I've boiled it down to the following test case:

#include <stdio.h>
#include <mkl.h>
int main(int argc, char **argv) {
  int m=3, n=3, lda=5, ldb=3, nrhs=1, info;
  double a[] = {
    1., 2., 3., 0., 0.,
    4., 5., 6., 0., 0.,
    7., 8., 0., 0., 0.,
    0., 0., 0., 0., 0.
  };
  double b[] = { 6., 15., 15. };
  int ipiv[3];
  int i;

  dgetrf(&m, &n, a, &lda, ipiv, &info);
  for(i = 0; i < 3; i++) printf("ipiv[%d] = %d (expected 3)\n", i, ipiv[i]);
  dgetrs("T", &n, a, &lda, ipiv, b, &ldb, &info);
  for(i = 0; i < 3; i++) printf("b[%d] = %.17f (expected 1.00000000000000000)\n", i, b[i]);
}

Since our application uses OpenMP and statically links MKL, the following Makefile should build the test case in the same way:

repro: repro.o
        icc -qopenmp -o repro repro.o ${MKLROOT}/lib/libmkl_intel_lp64.a ${MKLROOT}/lib/libmkl_core.a ${MKLROOT}/lib/libmkl_intel_thread.a -lpthread -lm -ldl
repro.o: repro.c
        icc -qopenmp -I${MKLROOT}/include -o repro.o -c repro.c

On the Ivy Bridge system, I get exactly the expected values.  My problem is on the Haswell system, where I get the following results:

ipiv[0] = 3 (expected 3)
ipiv[1] = 3 (expected 3)
ipiv[2] = 3 (expected 3)
b[0] = 0.99999999999999978 (expected 1.00000000000000000)
b[1] = 1.00000000000000022 (expected 1.00000000000000000)
b[2] = 1.00000000000000000 (expected 1.00000000000000000)

I realize that this difference is extremely insignificant, but it's the difference between the two systems (running the same version of the same OS, differing only based on the CPU) that has us concerned.  Is this something we should expect from MKL, or is it a bug?
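For context on the cross-CPU difference: MKL dispatches different optimized kernels on Ivy Bridge (AVX) and Haswell (AVX2), and last-bit differences between those code paths are ordinary floating-point behaviour rather than a defect. If bit-for-bit agreement between the two machines matters, MKL's conditional numerical reproducibility (CNR) controls can pin the dispatched path; here is a minimal sketch, assuming an MKL version that provides the mkl_cbwr API:

#include <stdio.h>
#include <mkl.h>

int main(void) {
    /* Request the AVX kernels on both boxes; must be called before any other MKL call.
       MKL_CBWR_COMPATIBLE is the most conservative (and slowest) alternative. */
    if (mkl_cbwr_set(MKL_CBWR_AVX) != MKL_CBWR_SUCCESS) {
        printf("CNR mode not supported by this CPU/MKL combination\n");
        return 1;
    }
    /* ... run the dgetrf/dgetrs repro from above ... */
    return 0;
}

Setting the environment variable MKL_CBWR=AVX before launching should have the equivalent effect without code changes.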

Fast poisson solver threading control


I'm trying to control the number of threads used by the fast Poisson solver *_Helmholtz_3D.  The code is compiled with -tbb and -mkl, and I'm using the tbb::task_scheduler_init() function to control the thread count.  However, this solver seems to be multithreaded regardless of the number of threads I tell TBB to use.  How can I control this?
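For reference, when MKL is linked in its default OpenMP-threaded configuration (which is what plain -mkl selects), its internal thread count is governed by MKL's own controls rather than by tbb::task_scheduler_init; a minimal sketch of capping it, assuming mkl.h and the standard service functions:

#include <mkl.h>

/* Hedged sketch: limit the threads MKL itself spawns before calling the
   Helmholtz/Poisson routines. mkl_domain_set_num_threads() or
   mkl_set_num_threads_local() are finer-grained alternatives. */
void limit_mkl_threads(int nthreads) {
    mkl_set_num_threads(nthreads);
}

The MKL_NUM_THREADS environment variable should have the same effect without recompiling.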

MKL Version 11.3.2
ICPC Version 2016.2.062

Thanks!


unexpected outputs of lapack_cheev


Hello guys,

I am now using LAPACKE_cheev from MKL. I have written a piece of code with this API; however, its output is not what I expect. The code is attached.

	lapack_complex_float *a = (lapack_complex_float*)malloc(sizeof(lapack_complex_float)*N*N);
	a[0].imag = 0;
	a[0].real = 0;
	a[1].imag = 1;
	a[1].real = 0;
	a[2].imag = 0;
	a[2].real = 1;
	a[3].imag = 0;
	a[3].real = 0;
	a[4].imag = 0;
	a[4].real = 0;
	a[5].imag = 0;
	a[5].real = 0;

	int matrix_order = LAPACK_ROW_MAJOR; //LAPACK_COL_MAJOR
	const char jobz = 'N';
	const char uplo = 'U';
	lapack_int n = N;
	lapack_int lda = N;
	float *w = (float*)malloc(sizeof(float)*N);

LAPACKE_cheev(matrix_order, jobz, uplo, n, a, lda, w);

	for (int i = 0; i < N; i++)
	{
		cout << w[i] << endl;
	}

I expect the outputs as follows:

[-1.414, 0, 1.414]

However, the output from my local machine is as follows:

[-4.31602e+008, -1, 1]

Could anybody help me with this issue?

dynamic load/free library who used mkl will result in the main program crash


Dear Team,

Hope you are doing well! I have a dynamic library sample that uses some simple MKL code. If I load/free this dynamic library in the main program with LoadLibrary/FreeLibrary 542 times, the main program crashes and exits, but if I link the dynamic library with its export .lib everything is fine. I'm confused: are there limitations in the trial MKL? Could you please confirm? The tested studio is "parallel_studio_xe_2016_update2_setup.exe" with VS2010.

The dynamic library's source code and the main program's source code can be found in the attached "purl mkl.rar":
build.bat -- script to build the sample
Example_with_exporttable.cpp -- source code that use the dynamic library with its export library (no problem)
Example_with_dynamic_load_free.cpp -- source code that use LoadLibrary/FreeLibrary to call the dynamic library (crash)

Download: purl mkl.rar (application/rar)

Problems with dgeqrf and dorgqr

$
0
0

Hello,

I'm getting some unexpected results with the function LAPACKE_dgeqrf. Apparently I'm unable to get the expected QR decomposition in some cases; rather, I'm obtaining a QR decomposition with some unexpected sign orientations for the columns of the orthogonal matrix Q.

Here is an MWE of the problem:

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

#define N 2

int main()
{
    double *x   = (double *) malloc( sizeof(double) * N * N );
    double *tau = (double *) malloc( sizeof(double) * N );
    int i, j;

    /* Pathological example */
    x[0] = 4.0, x[1] = 1.0, x[2] = 3.0, x[3] = 1.0;

    printf("\n INITIAL MATRIX\n\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            printf(" %3.2lf\t", x[i*N+j]);
        }
        printf("\n");
    }

    LAPACKE_dgeqrf ( LAPACK_ROW_MAJOR, N, N, x, N, tau);

    printf("\n R MATRIX\n\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            if ( j >= i ){
                printf(" %3.2lf\t", x[i*N+j]);
            }else{
                printf(" %3.2lf\t", 0.0);
            }
        }
        printf("\n");
    }

    LAPACKE_dorgqr ( LAPACK_ROW_MAJOR, N, N, N, x, N, tau);

    printf("\n Q MATRIX\n\n");
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            printf(" %3.2lf\t", x[i*N+j]);
        }
        printf("\n");
    }

    printf("\n");

    return 0;
}

With this example, the output I get is:

 INITIAL MATRIX

 4.00     1.00    
 3.00     1.00    

 R MATRIX

 -5.00     -1.40    
 0.00     0.20    

 Q MATRIX

 -0.80     -0.60    
 -0.60     0.80   

However, the expected QR decomposition would be:

 R MATRIX

 5.00     1.40    
 0.00     0.20    

 Q MATRIX

 0.80     -0.60    
 0.60     0.80  

I have found this problem with other initial matrices as well.

Thanks in advance,

Paulo

 

Error : The CALL statement is invoking a function subprogram as a subroutine.


I am trying to learn to use MKL routines. I have written a simple code for calculating the Jacobian matrix. However, it ends up with the error below:

The CALL statement is invoking a function subprogram as a subroutine. [DJACOBI]

 

Here is my code and I am compiling it with: ifort -mkl newton2.f90 

INCLUDE '/home/vahid/intel/composer_xe_2015.3.187/mkl/include/mkl_rci.f90'

program newton2
    implicit none
    real*8, dimension(3) :: x
    real*8, dimension(3,3) :: fjac
    real*8 :: phi1,phi2,phi3
    integer :: i,m,n
    integer, parameter :: maxit=100
    real*8, parameter :: eps=0.0001

    INCLUDE '/home/vahid/intel/composer_xe_2015.3.187/mkl/include/mkl_rci.fi'

    external fcn

    x(1)=0
    x(2)=0
    x(3)=0
    n=3;m=3

    !Call fcn (m, n, x, f)

    call djacobi(fcn , n, m, fjac, x, eps)

    write(*,*) fjac

end program newton2

subroutine fcn (m, n, x, f)
    real*8, dimension(3) :: x,f
    real*8 :: phi1,phi2,phi3

    phi1=0 ; phi2 = 0.1; phi3= 0.9;
    f(1) = 2*(800) *(x(1)-0.7)  - 2*(12000)*(x(2)-0.45);
    f(2) = 2*(12000)*(x(2)-0.45)- 2*(1200)*(x(3)-0.9)  ;
    f(3) = x(1)*phi1 + x(2)*phi2 + x(3)*phi3 - 0.42  ;

end subroutine fcn

 

How to optimize A'PA computation for memory use


Hi,

I am setting up several weighted normal matrices as A'PA, where A is the first design matrix of size nObs x nPar and P is the weight matrix of size nObs x nObs. A'PA is then a symmetric matrix of size nPar x nPar. I perform this operation using the DSYMM + DGEMM routines.

! Memory allocation
       IF (ASSOCIATED(AA))    DEALLOCATE(AA)
       ALLOCATE(AA(NOBS,NPAR),stat=ii)

       IF (ASSOCIATED(PA))    DEALLOCATE(PA)
       ALLOCATE(PA(NOBS,NPAR),stat=ii)

! I actually just need the U or L triangular part of it (symmetric)
       IF (ASSOCIATED(ATPA))    DEALLOCATE(ATPA)
       ALLOCATE(ATPA(NPAR,NPAR),stat=ii)

! Set up of PA
       CALL dsymm('L', 'U', nObs, nPar, 1.d0, P_f, nObs, AA, nObs, 0.d0, PA, nObs)

! Setup of A'PA
       CALL dgemm('T','N',nPar,nPar,nObs,1.d0,AA,nObs,PA,nObs,1.d0,ATPA,nPar)

! Deallocation of AA, PA
    IF (ASSOCIATED(AA))    DEALLOCATE(AA)
    IF (ASSOCIATED(PA))    DEALLOCATE(PA)

Now, in my case nPar is as large as 90000 (or more), so that I need to allocate a very large amount of memory for the output (A'PA). In principle, I just need the upper or lower triangular matrix (since it's symmetric) but I cannot find a way to avoid the simultaneous allocation of the 90000x90000 matrix (needed as output by DGEMM) and of the triangular matrix (which is approximately half the size) where I would copy the part I am interested in before deallocating the full one.

Do you have any suggestion, or see any option, to compute this product using optimized parallel routines without allocating the full A'PA matrix? I checked all routines and packed formats but cannot find a viable way to avoid allocating the full matrix at some point.

Thanks!

PS: I have an alternative using the intrinsic MATMUL function and small batches of 5 observations that are then added together in a triangular matrix (allocated and accessed as a vector). Unfortunately, as you can imagine, this is hardly very efficient.

 

Pardiso always crashing on Linux


I consistently have Pardiso crash when running on Linux. I've trimmed it down to a relatively simple test case of building a Poisson matrix and solving it with Pardiso.  The same program runs fine on Windows 8 and 10. It segfaults on both Centos 6.7 and Ubuntu 14.04. I assume I'm doing something wrong, but I can't tell what.

I'm building and testing on CentOS 6.7 (upgraded from 6.5, not originally installed as 6.7).  I'm using g++ 4.8.2.  I put a fresh install of MKL from "parallel_studio_xe_2016_composer_edition_for_cpp_update2.tgz" into my home directory. I'm intentionally using the TBB version of MKL, and not using OpenMP at all. When I started working on this, I was using OpenMP, and Pardiso was still crashing under OpenMP rather than TBB.

GDB shows the segfault is during Pardiso phase 11, inside some tbbmalloc stuff.

Full Source file: https://drive.google.com/open?id=0B1-EScYNsy0uTDZNX2Etck9NM1U

Makefile: https://drive.google.com/file/d/0B1-EScYNsy0uNEZLQmx5dl80ZUk/view?usp=sh...

Lots of console output, versions, make, gdb, ldd: https://drive.google.com/file/d/0B1-EScYNsy0uTjFqU2lyQTVjdUU/view?usp=sh...

Any suggestions on what I should look at?

-Essex

Cholesky factorization guarantees on failure?


If potrf is called on an indefinite matrix A, are there any guarantees about what state it leaves A in when it returns?

For example, if the submatrix A'=A[0:k,0:k] is positive definite, but A[0:k+1,0:k+1] is not, can it be assumed that when potrf returns, the A[0:k,0:k] region of memory contains the Cholesky factorization of A'?

This looks like it might be true, but I don't see any guarantees listed in the documentation (https://software.intel.com/en-us/node/520881). There are some algorithms that use the partial factorization to compute a bound on the smallest eigenvalue (for example, Moré and Sorensen's algorithm in the paper "Computing a Trust Region Step"), so it would be helpful if such guarantees could be made part of the contract of potrf.
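For concreteness, here is a small probe of the behaviour in question, assuming LAPACKE from mkl.h; it only demonstrates the documented meaning of info (the order of the first non-positive-definite leading minor) and then inspects the leading block, which is exactly the part the documentation does not promise anything about:

#include <stdio.h>
#include <mkl.h>

int main(void) {
    /* Row-major, upper triangle used: A = [4 2; 2 0]. The 1x1 leading minor (4) is
       positive definite, but det(A) = -4, so the full matrix is not. */
    double a[4] = { 4.0, 2.0,
                    2.0, 0.0 };
    lapack_int info = LAPACKE_dpotrf(LAPACK_ROW_MAJOR, 'U', 2, a, 2);

    /* info = k > 0 means the leading minor of order k is not positive definite. */
    printf("info = %lld\n", (long long)info);
    /* If the already-processed block is left factored, a[0] holds sqrt(4) = 2. */
    printf("a(1,1) after the call = %f\n", a[0]);
    return 0;
}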

 


Intel® Math Kernel Library 11.3 Update 3 is now available


Intel® Math Kernel Library 11.3 Update 3 is now available

Intel® Math Kernel Library (Intel® MKL) is a highly optimized, extensively threaded, and thread-safe library of mathematical functions for engineering, scientific, and financial applications that require maximum performance.

Intel MKL 11.3 Update 3 packages are now ready for download. Intel MKL is available as part of the Intel® Parallel Studio XE and Intel® System Studio. Please visit the Intel® Math Kernel Library Product Page.

Intel® MKL 11.3 Update 3 Bug fixes

New Features in MKL 11.3 Update 3

  • Improved Intel Optimized MP LINPACK Benchmark performance for Clusters on Intel® Advanced Vector Extensions 512 (Intel® AVX-512) and Second generation of Intel® Xeon Phi™ coprocessor
  • BLAS:
    • Improved small matrix [S,D]GEMM performance on Intel® Advanced Vector Extensions 2 (Intel AVX2), Intel® Xeon® product family, Intel AVX-512 and on second generation of Intel® Xeon Phi™ coprocessor
    • Improved threading (OpenMP) performance of xGEMMT, xHEMM, xHERK, xHER2K, xSYMM, xSYRK, xSYR2K on Intel AVX-512, and on second generation of Intel® Xeon Phi™ coprocessor
    • Improved [C,Z]GEMV, [C,Z]TRMV, and [C,Z]TRSV performance on Intel AVX2, Intel AVX-512, Intel® Xeon® product family, and on second generation of Intel® Xeon Phi™ coprocessor
    • Fixed CBLAS_?GEMMT interfaces to correctly call underlying Fortran interface for row-major storage
  • LAPACK:
    • Updated Intel MKL LAPACK functionality to latest Netlib version 3.6. New features introduced in this version are:
      • SVD by Jacobi ([CZ]GESVJ) and preconditioned Jacobi ([CZ]GEJSV) algorithms
      • SVD via EVD allowing computation of a subset of singular values and vectors (?GESVDX)
      • Level 3 BLAS versions of generalized Schur (?GGES3), generalized EVD (?GGEV3), generalized SVD (?GGSVD3) and reduction to generalized upper Hessenberg form (?GGHD3)
      • Multiplication of general matrix by a unitary/orthogonal matrix possessing 2x2 structure ( [DS]ORM22/[CZ]UNM22)
    • Improved performance of LU (?GETRF) and QR(?GEQRF) on Intel AVX-512 and on second generation of Intel® Xeon Phi™ Coprocessor
    • Improved check of parameters for correctness in all LAPACK routines to enhance security
  • SCALAPACK:
    • Improved hybrid (MPI + OpenMP) performance of ScaLAPACK/PBLAS by increasing default block size returned by pilaenv
  • SparseBlas:
    • Added examples that cover spmm and spmmd functionality
    • Improved performance of parallel mkl_sparse_d_mv for general BSR matrices on Intel AVX2
  • Parallel Direct Sparse Solver for Clusters:
    • Improved performance of solving step for small matrices (less than 10000 elements)
    • Added mkl_progress support in Parallel Direct sparse solver for Clusters and fixed mkl_progress in Intel MKL PARDISO
  • Vector Mathematical Functions:
    • Improved implementation of Thread Local Storage (TLS) allocation/de-allocation, which helps with thread safety for DLLs in Windows when they are custom-made from static libraries
    • Improved the automatic threading algorithm leading to more even distribution of vectors across larger numbers of threads and improved the thread creation logic on Intel Xeon Phi, leading to improved performance on average

New Features in MKL 11.3 Update 2

  • Introduced the mkl_finalize function to facilitate usage models in which Intel MKL dynamic libraries, or third-party dynamic libraries that link Intel MKL statically, are loaded and unloaded explicitly
  • Compiler offload mode now allows using Intel MKL dynamic libraries
  • Added Intel TBB threading for all BLAS level-1 functions
  • Intel MKL PARDISO:
    • Added support for block compressed sparse row (BSR) matrix storage format
    • Added optimization for matrices with variable block structure
    • Added support for mkl_progress in Parallel Direct Sparse Solver for Clusters
    • Added cluster_sparse_solver_64 interface
  • Introduced sorting algorithm in Summary Statistics

What's New in Intel MKL 11.3

  • Batch GEMM Functions (see the sketch after this list)
  • Introduced new 2-stage (inspector-executor) APIs for Level 2 and Level 3 sparse BLAS functions
  • Introduced MPI wrappers that allow users to build custom BLACS library for most MPI implementations
  • Cluster components (Cluster Sparse Solver, Cluster FFT, ScaLAPACK) are now available for OS X*
  • Extended the Intel MKL memory manager to improve scaling on large SMP systems
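As a sketch of the Batch GEMM item above, the batch interface groups many independent small multiplications into one call; the example below uses cblas_dgemm_batch with a single group of two 2x2 products (typedefs and the prototype as in mkl.h; all sizes are illustrative):

#include <stdio.h>
#include <mkl.h>

int main(void) {
    /* Two independent 2x2 products, C = A * B, gathered into one group. */
    double A0[4] = {1, 2, 3, 4}, B0[4] = {1, 0, 0, 1}, C0[4] = {0};
    double A1[4] = {5, 6, 7, 8}, B1[4] = {2, 0, 0, 2}, C1[4] = {0};

    const double *a_array[2] = {A0, A1};
    const double *b_array[2] = {B0, B1};
    double       *c_array[2] = {C0, C1};

    CBLAS_TRANSPOSE transA[1] = {CblasNoTrans}, transB[1] = {CblasNoTrans};
    MKL_INT m[1] = {2}, n[1] = {2}, k[1] = {2};
    MKL_INT lda[1] = {2}, ldb[1] = {2}, ldc[1] = {2};
    double alpha[1] = {1.0}, beta[1] = {0.0};
    MKL_INT group_count = 1;
    MKL_INT group_size[1] = {2};      /* two matrices in the single group */

    cblas_dgemm_batch(CblasRowMajor, transA, transB, m, n, k,
                      alpha, a_array, lda, b_array, ldb,
                      beta, c_array, ldc, group_count, group_size);

    printf("C0 = [%g %g; %g %g]\n", C0[0], C0[1], C0[2], C0[3]);
    printf("C1 = [%g %g; %g %g]\n", C1[0], C1[1], C1[2], C1[3]);
    return 0;
}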

Check out the latest Release Notes for more updates

Dynamic Loading Issues With MKL From SWIG Module in Python


I have a real head-scratcher of a problem here, and was hoping that someone can help me resolve it. The issue is to do with a fatal error generated when dynamically loading MKL from a Linux shared library, which is in turn referenced by a Python module created with the SWIG interface-generation tool.

I'm encountering this issue on an Ubuntu Linux system, using gcc 4.6.3 to compile against the version of MKL included in Parallel Studio 2016 Update 2. I'm also using Python 2.7.3 and SWIG 2.0.4.  (I can consistently reproduce the problem using the Intel C compiler, Python 2.7.11, and/or Parallel Studio 2015 as well.) I am running in an environment with all necessary environment variables set, as produced by mklvars.sh (MKLROOT is there, LD_LIBRARY_PATH includes the MKL libraries, etc.).

I've created a stripped-down example to demonstrate my issue, but it's still a little complicated, so I will explain as I go. First, we define a C library called foo in the header/source pair foo.h/foo.c. This library exposes a single function, bar(), which makes a trivial BLAS call. (First code block is foo.h, second is foo.c.)

#ifndef _FOO_H
#define _FOO_H

void bar();

#endif//_FOO_H

#include "mkl.h"

void bar() {
    double arr[1] = { 1.0 };
    cblas_daxpy(1, 1, arr, 1, arr, 1);
}

To check that this function runs without errors, we use it in a simple executable, defined in main.c:

#include "foo.h"

int main() {
    bar();
    return 0;
}

Then we create a simple SWIG interface file foo.i, allowing generation of a Python interface for the foo library:

%module foo

%{
#define SWIG_FILE_WITH_INIT
#include "foo.h"
%}

%include "foo.h"

The main executable and the Python/SWIG module can be built using gcc, with the following sequence of commands. Note that the MKL linking options are precisely as recommended by the MKL link line advisor tool. With the exception of a warning about a set-but-unused variable in the SWIG wrapper, compilation proceeds cleanly.

gcc -Wall -Wextra -O0 -fPIC -I$MKLROOT/include -c -o foo.o foo.c
gcc -Wall -Wextra -O0 -shared -L$MKLROOT/lib/intel64 -Wl,-rpath=./ -o libfoo.so foo.o -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -ldl -lpthread -lm
gcc -Wall -Wextra -O0 -L. -Wl,-rpath=./ -o main main.c -lfoo
swig -python foo.i
gcc -Wall -Wextra -O0 -fPIC -I/usr/include/python2.7 -c -o foo_wrap.o foo_wrap.c
gcc -Wall -Wextra -O0 -shared -L. -L$MKLROOT/lib/intel64 -Wl,-rpath=./ -o _foo.so foo_wrap.o -lfoo -Wl,--no-as-needed -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -ldl -lpthread -lm

After building, the main executable runs without errors. However, attempting to use the generated SWIG module from within a Python interpreter (launched from the directory containing the various outputs of the compilation process) produces the following error:

Python 2.7.3 (default, Jun 22 2015, 19:33:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.>>> import foo>>> foo.bar()
Intel MKL FATAL ERROR: Cannot load libmkl_avx2.so or libmkl_def.so.

By setting LD_DEBUG=libs and trying again, I can see that the error is connected to a symbol lookup error, with the error message:

[...]/libmkl_def.so: error: symbol lookup error: undefined symbol: mkl_dft_fft_fix_twiddle_table_32f (fatal)

This symbol is defined in libmkl_core.so, which I believe everything should be linked against. The same error (or at least, the same "Intel MKL FATAL ERROR: ..." output) is reported in a post on this forum from December 2015: <https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/.... Another forum post, linked from the original (<https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/...), suggests using LD_PRELOAD to attempt to resolve the problem. Sure enough, if I open the Python interpreter with LD_PRELOAD=$MKLROOT/lib/intel64/libmkl_core.so python, the call to foo.bar() executes without issue. (Attempting to preload libmkl_avx2.so or libmkl_def.so without libmkl_core.so produces a symbol lookup error for the exact same symbol as before.)

So, the question is: can anybody suggest why this is happening, and hopefully suggest a fix that does not involve LD_PRELOAD? (We can't ship code that requires LD_PRELOAD...) My first thought was that this was a Python issue, but I'm not sure -- the other forum post reporting this problem was in relation to a tool called FuPerMod, which (from a quick look at the relevant git repo) doesn't seem to make any use of Python at all...
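One way to take Python out of the picture while investigating is to reproduce the dlopen step directly in C; a hedged sketch (libfoo.so as built above; whether RTLD_LOCAL versus RTLD_GLOBAL changes the outcome is exactly the symbol-visibility question hinted at by the LD_PRELOAD workaround):

#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* RTLD_LOCAL mimics how interpreters often load extension modules; with MKL this
       can hide libmkl_core's symbols from the computational libraries (libmkl_avx2.so,
       libmkl_def.so) that MKL loads later. Compare with RTLD_GLOBAL. */
    void *h = dlopen("./libfoo.so", RTLD_NOW | RTLD_LOCAL);
    if (!h) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    void (*bar)(void) = (void (*)(void))dlsym(h, "bar");
    if (!bar) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    bar();          /* the trivial daxpy call from foo.c */
    dlclose(h);
    return 0;
}

Build with something like gcc -O0 -o dltest dltest.c -ldl and run it from the same directory as libfoo.so.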

static linking 11.3, mkl_tbb_thread : mismatch detected for '_MSC_VER'


I'm building a DLL as a plugin for Autodesk Maya. I'm trying to statically link MKL into my DLL and having some difficulty. I'm using Visual Studio 2013 as IDE and its default platform toolset (v120). I'm not using the Intel Visual Studio integration, because it doesn't seem to offer a static-linking option.

With Parallel Studio 11.2, I didn't have any difficulty. Static linking works fine. With 11.3 update 3, when I link to mkl_intel_thread.lib, it still works fine. However, when I link with mkl_tbb_thread.lib instead, linking fails with the following error:

mkl_tbb_thread.lib(lnnt_omp_tbb_lp64.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in PardisoTest.obj

mkl_tbb_thread.lib(lnnt_omp_tbb.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in PardisoTest.obj

I'll point out that 1800 is the MSVC compiler version that I'm using. 1600 is the Visual Studio 2010 compiler, and I'm surprised to see it here.

My entire linker arguments are:

/OUT:"C:\ziva\dev\Spikes\PardisoTest\x64\Release\PardisoTest.exe" /MANIFEST /LTCG /NXCOMPAT /PDB:"C:\ziva\dev\Spikes\PardisoTest\x64\Release\PardisoTest.pdb" /DYNAMICBASE "freeglut.lib""opengl32.lib""glew32.lib""mkl_core.lib""mkl_tbb_thread.lib""mkl_intel_lp64.lib""tbb.lib""kernel32.lib""user32.lib""gdi32.lib""winspool.lib""comdlg32.lib""advapi32.lib""shell32.lib""ole32.lib""oleaut32.lib""uuid.lib""odbc32.lib""odbccp32.lib" /DEBUG /MACHINE:X64 /OPT:REF /INCREMENTAL:NO /PGD:"C:\ziva\dev\Spikes\PardisoTest\x64\Release\PardisoTest.pgd" /SUBSYSTEM:CONSOLE /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /ManifestFile:"x64\Release\PardisoTest.exe.intermediate.manifest" /OPT:ICF /ERRORREPORT:PROMPT /NOLOGO /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\tbb\lib\intel64_win\vc12" /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\mkl\lib\intel64_win" /LIBPATH:"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2016\windows\compiler\lib\intel64_win" /TLBID:1 

I'm not sure what to do with this. There's only one mkl_tbb_thread.lib file, so it's not like I am linking with the wrong one.

Thank you for any help you can provide,

-Essex

 

FEAST Eigenvalue Solver


This is an interesting observation from FEAST.  I am using a bridge as the first big test of HARRISON, PARDISO, and FEAST.  The first things in are the piles; each has the same vertical length, but they are not connected, so one gets an interesting matrix.  It all works fine, but one of the sixteen piles has a natural frequency that is slightly different from the rest.  It is not a problem, as it is just part of the overall structure, but it has been intriguing.

 

Are we going to see the latest FEAST in MKL?

I can get 584 vectors in 12 seconds on a DELL Precision. 

John

Reproducing Xeon Phi Linpack (GEMM) results


Hello all,

I am trying to reproduce the Matrix Multiply results presented on the following website, but I am not getting the same results.

http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-linpack-stream.html

Attached is the modified file; I started from the code that comes with the MKL library (under C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\examples), with no buffer reuse, initially doing single-precision computations.

Does anyone know if this is the code used for the benchmark or if there is a specific linpack library that I should be using, like the one found here:

https://software.intel.com/en-us/articles/intel-mkl-benchmarks-suite

The Xeon Phi model I am using is the 7200P with 61 cores and 16 GB RAM.

Also, it is curious that with rank-30000 matrices (~10.1 GB for the three matrices) the MIC reserves the memory (checked with micsmc and by ssh-ing into the MIC and using the top command) but performs no computations and seems to hang.

Best regards,

David
