undefined symbol mkl_blas_avx2_cgemm_copyb_ext

September 25, 2015, 9:21 pm

Latest and popular articles on Intel Technologies

I have tried to embed MKL in a Java application by linking the static MKL libraries into a shared library of my own and then loading that shared library through JNI. However, I got a link error when I created the shared library: the symbol mkl_blas_avx2_cgemm_copyb_ext was undefined. I have looked at the static libraries, and indeed, *all* static MKL libraries that contain this symbol list it as 'U' (undefined). On the other hand, the symbol does not appear in the shared libraries at all.

So I'm wondering, why does this symbol appear only in the static libs and why is it not defined?

Thanks,

--Laci

PS: The same issue occurs both in versions 11.2 and 11.3 (distributed with icc versions 2015.5 and 2016).

↧

mkl_ddiamm bug ?

September 26, 2015, 7:39 am

I am trying to multiply two matrices using mkl_ddiamm method:

C = A * B

where A is a diagonal 3x3 matrix and B is a general 3x3 matrix. No matter what I try, I get B*A as the result instead of A*B. This is my sample code. It performs an SVD of matrix A and checks whether the computed matrices U, S and VT satisfy the requirements given by theory, i.e.

1. U * UT = I, where I is the identity matrix

2. V * VT = I

3. U * S * VT = A

Result of temporary operation S * VT is not correct. In fact, the function mkl_ddiamm computes VT * S.

   // requirement: m >= n
   int m = 3;
   int n = 3;
   double *a = (double *)mkl_malloc(m * n * sizeof(double), 16);
   double *s = (double *)mkl_malloc(n * sizeof(double), 16);
   double *u = (double *)mkl_malloc(m * n * sizeof(double), 16);
   double *vt = (double *)mkl_malloc(n * n * sizeof(double), 16);
   double *superb = (double *)mkl_malloc((n-1) * sizeof(double), 16);
   // identity matrix m x m
   double *unit_m = (double *)mkl_malloc(m * m * sizeof(double), 16);

   for (int i = 0; i < m; i++)
      for (int j = 0; j < m; j++)
         unit_m[i*m+j] = i == j ? 1.0 : 0;

   // identity matrix n x n
   double *unit_n = (double *)mkl_malloc(n * n * sizeof(double), 16);

   for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++)
         unit_n[i*n+j] = i == j ? 1.0 : 0;

   a[0] = 1;   a[1] = 1; a[2] = 1;
   a[3] = 2.5; a[4] = 3; a[5] = 4;
   a[6] = 3;   a[7] = 2; a[8] = 1;

   lapack_int res = LAPACKE_dgesvd(LAPACK_ROW_MAJOR, 'S', 'S', m, n, a, n, s, u, n, vt, n, superb);

   // Checking correctness of SVD calculation ...

   // u * ut = I
   double *temp = (double *)mkl_malloc(m * m * sizeof(double), 16);
   memset(temp, 0, m * m * sizeof(double));
   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, m, n, 1.0, u, n, u, n, 0, temp, m);

   // v * vt = I
   memset(temp, 0, n * n * sizeof(double));
   cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans, n, n, n, 1.0, vt, n, vt, n, 0, temp, n);

   // u * s * vt = a
   memset(temp, 0, n * n * sizeof(double));
   int lval = 3;
   int idiag = 0;
   int ndiag = 1;
   double alpha = 1.0;
   double beta = 0;
   mkl_ddiamm("N", &n, &n, &n, &alpha, "DLNF", s, &lval, &idiag, &ndiag, vt, &n, &beta, temp, &n);
   double *temp2 = (double *)mkl_malloc(m * n * sizeof(double), 16);
   memset(temp2, 0, m * n * sizeof(double));
   cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, n, 1.0, u, n, temp, n, 0, temp2, n);

↧

Difference between C++ code with Blas/Lapack and Matlab

September 27, 2015, 12:06 am

Hi All!

I am trying to implement a linear programming algorithm in C++. For the matrix operations, I use BLAS and LAPACK. However, I find that the C++ code performs worse than MATLAB when the problem size is large, and the difference becomes significant as the size increases.

I am wondering whether this is caused by optimization tricks that MATLAB uses when it calls Intel MKL. Could someone help explain why MATLAB sometimes outperforms C++ with BLAS/LAPACK? Is there any way to improve this version of the C++ code, or any compiler option that would help?

Thank you for your time!

The following is my simplified code.

#include <math.h>
#include <mex.h>
#include <string.h>
#include "blas.h"
#include "lapack.h"


#if !defined(MAX)
#define  MAX(A, B)   ((A) > (B) ? (A) : (B))
#endif

#if !defined(MIN)
#define  MIN(A, B)   ((A) < (B) ? (A) : (B))
#endif

void mexFunction( int nlhs,   mxArray  *plhs[], int nrhs,   const mxArray  *prhs[] )
{
    double *A, *b, *c, *Ac, *AAT, *peigATA;
    double *x, *Ax, *tx, *y, *ATy, *s, *As, *lambda1, *lambda2, *ATlambda1, *Alambda2;
    double *ppobj, *pdobj, *ppresi, *pdresi, *pdgap, *piter;
    double *tmp, *tmpinv, *tmpres, *res1, *res2;
    double pobj, dobj, presi, dresi, dgap, duration;
    double gamma, alpha, beta_max, beta_min, tol, final_mu, final_tol_admm, feas_mul;
    double temp, beta1 = 1.0, beta2 = 1.0, one = 1.0, mone = -1.0, zero = 0.0;
    double mu, tol_admm, bnorm, cnorm, tau, eigATA, lambdaRes1, lambdaRes2;
    ptrdiff_t i, j, m, n, m2, mn, feas_count, info, verbose, inc = 1;
    ptrdiff_t k, outer, iter_all = 0, count_pbig = 0, count_dbig = 0, count1 = 0, count2 = 0, iter = 0;

    char *NTRANS, *TTRANS, *uplo;

    NTRANS = "N"; TTRANS = "T"; uplo = "U";

    mu = final_mu/pow(gamma,(int)outer);
    tol_admm = final_tol_admm/pow(gamma,(int)outer);

    /* Ax = A*x */
    dgemv(NTRANS,&m,&n,&one,A,&m,x,&inc,&zero,Ax,&inc);

    bnorm = dnrm2(&m,b,&inc);
    cnorm = dnrm2(&n,c,&inc);

    /*prepare stats*/
    /* AAT = A*A' */
    dgemm(NTRANS,TTRANS,&m,&m,&n,&one,A,&m,A,&m,&zero,AAT,&m);
    /* Ac = A*c */
    dgemv(NTRANS,&m,&n,&one,A,&m,c,&inc,&zero,Ac,&inc);

    /* Compute largest eigenvalue */
    memcpy(mxGetPr(rhsAAT[0]),AAT,(m*m)*sizeof(double));
    mexCallMATLAB(1,lhsAAT,1,rhsAAT,"eigs");
    peigATA = mxGetPr(lhsAAT[0]);
    eigATA = *peigATA;

    /* Cholesky factorization: AAT = U'*U; U overwrites the upper triangle of AAT */
    dpotrf(uplo, &m, AAT, &m, &info);

    while (mu >= final_mu) {
        k = 0;
        iter = iter + 1;
        presi = 10000; dresi = 10000; lambdaRes1 = 10000; lambdaRes2 = 10000;
        while ((MAX(MAX(MAX(presi,dresi),lambdaRes1),lambdaRes2) > tol_admm)&&(k < round(pow(1/mu,0.5)))) {
            k = k + 1;
            tau = 0.99/(beta1*eigATA);
            // Update x
            for (i=0; i<2; i++) {
                /* Ax = -b + Ax */
                daxpy(&m, &mone, b, &inc, Ax, &inc);
                /* Ax = beta1*Ax */
                dscal(&m, &beta1, Ax, &inc);
                /* Ax = lambda1 + Ax */
                daxpy(&m, &one, lambda1, &inc, Ax, &inc);
                /* tx = tau*A'*Ax */
                dgemv(TTRANS,&m,&n,&tau,A,&m,Ax,&inc,&zero,tx,&inc);
                /* tx = -x + tx */
                daxpy(&n, &mone, x, &inc, tx, &inc);
                /* tx = tau*c + tx */
                daxpy(&n, &tau, c, &inc, tx, &inc);
                /* x = (-tx+sqrt(tx.^2+4*mu*tau))/2 */
                for(j=0; j<n; j++) {
                    *(x+j) = (-*(tx+j)+sqrt((*(tx+j))*(*(tx+j))+4.0*mu*tau))/2.0;
                }
                /* Ax = A*x */
                dgemv(NTRANS,&m,&n,&one,A,&m,x,&inc,&zero,Ax,&inc);
            }

            // Update s
            /* ATy = -c+ATy */
            daxpy(&n, &mone, c, &inc, ATy, &inc);
            /* ATy = beta2*ATy*/
            dscal(&n, &beta2, ATy, &inc);
            /* ATy = lambda2+ATy*/
            daxpy(&n, &one, lambda2, &inc, ATy, &inc);
            /* s = (-ts+sqrt(ts.^2+4*mu*beta2))/(2*beta2) */
            for(j=0; j<n; j++) {
                *(s+j) = (-*(ATy+j)+sqrt((*(ATy+j))*(*(ATy+j))+4.0*mu*beta2))/(2.0*beta2);
            }
            /* As = A*s */
            dgemv(NTRANS,&m,&n,&one,A,&m,s,&inc,&zero,As,&inc);

            //Update y
            /* As = -Ac + As */
            daxpy(&m, &mone, Ac, &inc, As, &inc);
            /* As = -beta2*As */
            temp = mone*beta2; dscal(&m, &temp, As, &inc);
            /* As = b + As */
            daxpy(&m, &one, b, &inc, As, &inc);
            /* As = -Alambda2 + As */
            daxpy(&m, &mone, Alambda2, &inc, As, &inc);
            /* As = As/beta2 */
            temp = 1.0/beta2; dscal(&m, &temp, As, &inc);
            /* y = (AAT)^(-1)*As */
            dpotrs(uplo, &m, &inc, AAT, &m, As, &m, &info);
            dcopy(&m, As, &inc, y, &inc);
            /* ATy = A'*y */
            dgemv(TTRANS,&m,&n,&one,A,&m,y,&inc,&zero,ATy,&inc);

            //Update multipliers
            dcopy(&m, Ax, &inc, res1, &inc);
            dcopy(&n, ATy, &inc, res2, &inc);
            /* res1 = -b + res1 */
            daxpy(&m, &mone, b, &inc, res1, &inc);
            /* res2 = s + res2 */
            daxpy(&n, &one, s, &inc, res2, &inc);
            /* res2 = -c + res2 */
            daxpy(&n, &mone, c, &inc, res2, &inc);
            /* lambda1 = alpha*beta1*res1+lambda1 */
            temp = alpha*beta1; daxpy(&m, &temp, res1, &inc, lambda1, &inc);
            /* ATlambda1 = A'*lambda1 */
            dgemv(TTRANS,&m,&n,&one,A,&m,lambda1,&inc,&zero,ATlambda1,&inc);
            /* lambda2 = alpha*beta2*res2+lambda2 */
            temp = alpha*beta2; daxpy(&n, &temp, res2, &inc, lambda2, &inc);
            /* Alambda2 = A*lambda2 */
            dgemv(NTRANS,&m,&n,&one,A,&m,lambda2,&inc,&zero,Alambda2,&inc);

            //Stats
            /* presi = ||Ax - b||/(1+||b||) */
            temp = dnrm2(&m,res1,&inc);
            presi = temp/(1.0+bnorm);
            /* dresi = ||A'*y+s-c||/(1+||c||) */
            temp = dnrm2(&n,res2,&inc);
            dresi = temp/(1.0+cnorm);

            pobj = ddot(&n, c, &inc, x, &inc);
            dobj = ddot(&m, b, &inc, y, &inc);
            dgap = fabs(pobj-dobj)/(1.0+fabs(pobj)+fabs(dobj));

            /* tmpinv = 1.0./s */
            for(j=0; j<n; j++) { *(tmpinv+j) = 1.0/(*(s+j));}
            /* tmpres = (mu/beta2)*A*tmpinv */
            temp = mu/beta2; dgemv(NTRANS,&m,&n,&temp,A,&m,tmpinv,&inc,&zero,tmpres,&inc);
            /* tmpres = -b + tmpres (b and tmpres have length m, not n) */
            daxpy(&m, &mone, b, &inc, tmpres, &inc);
            temp = dnrm2(&m,tmpres,&inc); lambdaRes1 = temp/bnorm;

            /* tmpinv = 1.0./x */
            for(j=0; j<n; j++) { *(tmpinv+j) = 1.0/(*(x+j));}
            /* tmpinv = -mu*tmpinv */
            temp = mone*mu; dscal(&n, &temp, tmpinv, &inc);
            /* tmpinv = c + tmpinv */
            daxpy(&n, &one, c, &inc, tmpinv, &inc);
            /* tmpinv = ATlambda1 + tmpinv*/
            daxpy(&n, &one, ATlambda1, &inc, tmpinv, &inc);
            temp = dnrm2(&n,tmpinv,&inc); lambdaRes2 = temp/cnorm;

            if (MAX(MAX(presi,dresi),dgap) < tol) {
                iter_all = iter_all + k;
                return;
            }
        }
        iter_all = iter_all + k;
        mu= mu*gamma;
        tol_admm = tol_admm*gamma*0.5;
    }
    return;
}

The following is the MATLAB script used to compile it.

function Installmex

% src = pwd;
% sdet = 'src';
fname{1} = 'BLAS-BRADMM'; ofname{1} = 'BRADMMw'; fcc{1} = 'cpp';

hasMKL = 0; % with MKL or not

details = 0 ;	    % 1 if details of each command are to be printed

v = version ;
try
    % ispc does not appear in MATLAB 5.3
    pc = ispc ;
    mac = ismac ;
catch                                                                       %#ok
    % if ispc fails, assume we are on a Windows PC if it's not unix
    pc = ~isunix ;
    mac = 0 ;
end

% if (~pc) && (~mac)
%     mex -O -largeArrayDims -lmwlapack -lmwblas  sfmult.cpp
%     mex -O -largeArrayDims -lmwlapack -lmwblas  dfeast.cpp
%     return
% end

flags = '' ;
is64 = ~isempty (strfind (computer, '64')) ;
if (is64)
    % 64-bit MATLAB
    flags = '-largeArrayDims' ;
end

% MATLAB 8.3.0 now has a -silent option to keep 'mex' from burbling too much
if (~verLessThan ('matlab', '8.3.0'))
    flags = ['-silent ' flags] ;
end

 %---------------------------------------------------------------------------
 % BLAS option
 %---------------------------------------------------------------------------

 % This is exceedingly ugly.  The MATLAB mex command needs to be told where to
 % find the LAPACK and BLAS libraries, which is a real portability nightmare.

if (pc)
    if (verLessThan ('matlab', '6.5'))
        % MATLAB 6.1 and earlier: use the version supplied here
        lapack = 'lcc_lib/libmwlapack.lib' ;
    elseif (verLessThan ('matlab', '7.5'))
        lapack = 'libmwlapack.lib' ;
    else
        lapack = 'libmwlapack.lib libmwblas.lib' ;
    end
else
    if (verLessThan ('matlab', '7.5'))
        lapack = '-lmwlapack' ;
    else
        lapack = '-lmwlapack -lmwblas' ;
    end
end

if (is64 && ~verLessThan ('matlab', '7.8'))
    % versions 7.8 and later on 64-bit platforms use a 64-bit BLAS
    fprintf ('with 64-bit BLAS\n') ;
    flags = [flags ' -DBLAS64'] ;
end

if (~(pc || mac))
    % for POSIX timing routine
    lapack = [lapack ' -lrt'] ;
end

include = '';
mkl = '';
if hasMKL
    include = ['-I',MKLHOMEINCLUDE];
    if mac
        mkl = ['', MKLHOMELIB,filesep,'libmkl_intel_lp64.dylib '];
        mkl = [mkl, '', MKLHOMELIB,filesep,'libmkl_core.dylib '];
        mkl = [mkl, '', MKLHOMELIB,filesep,'libmkl_intel_thread.dylib   -ldl  -lm '];
    else
        mkl = ['', MKLHOMELIB,filesep,'libmkl_intel_lp64.so '];
        mkl = [mkl, '', MKLHOMELIB,filesep,'libmkl_core.so '];
        mkl = [mkl, '', MKLHOMELIB,filesep,'libmkl_intel_thread.so   -ldl  -lm  -lrt'];
    end
end

if (verLessThan ('matlab', '7.0'))
    % do not attempt to compile CHOLMOD with large file support
    include = [include ' -DNLARGEFILE'] ;
elseif (~pc)
    % Linux/Unix require these flags for large file support
    include = [include ' -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE'] ;
end

if (verLessThan ('matlab', '6.5'))
    % logical class does not exist in MATLAB 6.1 or earlier
    include = [include ' -DMATLAB6p1_OR_EARLIER'] ;
end

% compile each mexFunction
for k = 1:length(fname)
    s = sprintf ('mex  %s -DDLONG -O %s %s.%s -output %s', flags, ...
            include, fname{k}, fcc{k}, ofname{k}) ;
    s = [s ' ' lapack ' ' mkl] ;                                            %#ok
    %s = [s ''  mkl] ;
    cmd (s, details) ;
end

 %------------------------------------------------------------------------------
function  cmd (s, details)
 %DO_CMD: evaluate a command, and either print it or print a "."
if (details)
    fprintf ('%s\n', s) ;
end
eval (s) ;

↧

Extra precision

September 29, 2015, 12:44 am

I am evaluating MKL 11.3 for Windows. Could you tell me whether MKL provides functionality like XBLAS (which supports extra precision, up to twice the working precision)? I notice that MKL has some high-level functions, e.g., dposvxx, which use extra-precise refinement to compute the solution (at least twice the working precision). So MKL should have some underlying functions that operate in extra precision. Do those underlying functions come from XBLAS? Can we access them?

Thank you very much for your advice.

↧

mkl_zcsrcoo faster computation on subsequent calls?

September 29, 2015, 8:46 am

Hi,

I have a sparse matrix in coordinate format (row, col, A) and I transform it to CSR to be used for PARDISO. The sparsity pattern never changes (that is, row and col are always the same). Vector A changes from time to time. As I understand, I can run with job(6)=1 to get only ia. Is this any faster? What does job(6)=2 do?

The documentation for job(6) is quoted below. Thanks!

For conversion to the CSR format:
If job(6)=0, all arrays acsr, ja, ia are filled in for the output storage.
If job(6)=1, only array ia is filled in for the output storage.
If job(6)=2, then it is assumed that the routine already has been called with the job(6)=1, and the user allocated the required space for storing the output arrays acsr and ja.

↧

Error in running VASP 5.3.5 with mkl, ifort and mpif90

October 1, 2015, 5:00 am

I built the VASP executable successfully by following the instructions in the link below; the only change I made was FC=mpif90 (Open MPI compiled with the Intel compiler).

https://software.intel.com/en-us/articles/building-vasp-with-intel-mkl-and-intel-compilers?page=1#comment-1842228

But I get an error when I run it:

mpirun -np 4 /opt/VASP/vasp.5.3/vasp

The output is as follows:

WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
WARNING: for PREC=h ENMAX is automatically increase by 25 %
this was not the case for versions prior to vasp.4.4
LDA part: xc-table for Ceperly-Alder, standard interpolation
POSCAR, INCAR and KPOINTS ok, starting setup
FFT: planning ...
WAVECAR not read
entering main loop
N E dE d eps ncg rms rms(c)
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
libmpi.so.1 00002B3133018DE9 Unknown Unknown Unknown
libmkl_blacs_inte 00002B3130D8B273 Unknown Unknown Unknown
libmkl_blacs_inte 00002B3130D7D9FB Unknown Unknown Unknown
libmkl_blacs_inte 00002B3130D7D409 Unknown Unknown Unknown
vasp 00000000004D7BCD Unknown Unknown Unknown
vasp 00000000004CA239 Unknown Unknown Unknown
vasp 0000000000E23D62 Unknown Unknown Unknown
vasp 0000000000E447AD Unknown Unknown Unknown
vasp 0000000000472BC5 Unknown Unknown Unknown
vasp 000000000044D25C Unknown Unknown Unknown
libc.so.6 00002B31340C1C36 Unknown Unknown Unknown
vasp 000000000044D159 Unknown Unknown Unknown

--------------------------------------------------------------------------
mpirun has exited due to process rank 6 with PID 12042 on
node node01 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Here are all the libraries linked to the vasp executable:

ldd vasp

linux-vdso.so.1 => (0x00007fffcd1d5000)
libmkl_intel_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_intel_lp64.so (0x00002b7018572000)
libmkl_cdft_core.so => /opt/intel/mkl/lib/intel64/libmkl_cdft_core.so (0x00002b7018c84000)
libmkl_scalapack_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_scalapack_lp64.so (0x00002b7018ea0000)
libmkl_blacs_intelmpi_lp64.so => /opt/intel/mkl/lib/intel64/libmkl_blacs_intelmpi_lp64.so (0x00002b701968b000)
libmkl_sequential.so => /opt/intel/mkl/lib/intel64/libmkl_sequential.so (0x00002b70198c8000)
libmkl_core.so => /opt/intel/mkl/lib/intel64/libmkl_core.so (0x00002b7019f66000)
libiomp5.so => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libiomp5.so (0x00002b701b174000)
libmpi_f90.so.3 => /opt/intel/openmpi-icc/lib/libmpi_f90.so.3 (0x00002b701b477000)
libmpi_f77.so.1 => /opt/intel/openmpi-icc/lib/libmpi_f77.so.1 (0x00002b701b67b000)
libmpi.so.1 => /opt/intel/openmpi-icc/lib/libmpi.so.1 (0x00002b701b8b8000)
libdl.so.2 => /lib64/libdl.so.2 (0x00002b701bd03000)
libm.so.6 => /lib64/libm.so.6 (0x00002b701bf07000)
librt.so.1 => /lib64/librt.so.1 (0x00002b701c180000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x00002b701c38a000)
libutil.so.1 => /lib64/libutil.so.1 (0x00002b701c5a2000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b701c7a5000)
libc.so.6 => /lib64/libc.so.6 (0x00002b701c9c3000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002b701cd37000)
libifport.so.5 => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libifport.so.5 (0x00002b701cf4d000)
libifcore.so.5 => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libifcore.so.5 (0x00002b701d17d000)
libimf.so => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libimf.so (0x00002b701d4b3000)
libintlc.so.5 => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libintlc.so.5 (0x00002b701d96f000)
libsvml.so => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libsvml.so (0x00002b701dbbe000)
libifcoremt.so.5 => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libifcoremt.so.5 (0x00002b701e48c000)
libirng.so => /opt/intel/composer_xe_2013.1.117/compiler/lib/intel64/libirng.so (0x00002b701e7f1000)
/lib64/ld-linux-x86-64.so.2 (0x00002b7018351000)

Please take a look at this and help me get it running.

↧

Problem with PARDISO in Windows 10

October 1, 2015, 7:47 am

I have used PARDISO from Fortran successfully on Windows 7. When I moved to Windows 10, my code stopped working: it crashes in the call to PARDISO. It seems something changed in Windows between versions 8 and 8.1 (8.1 does not work either). Does anyone have an idea why, or a similar experience? My MKL version is 11.0.

↧

Call dfsNewTask1D from C#

October 1, 2015, 1:38 pm

Hey,

I'm trying to get the spline functionality to work in C#. I don't seem to know what to put in place for the DFTaskPtr. I can see it's a void pointer in the mkl_df_types.h file, but I don't know how to translate that to C#.

I naively did the following:

    [SuppressUnmanagedCodeSecurity]
    internal sealed class MklNative
    {
        [DllImport("mkl_rt.dll", CallingConvention = CallingConvention.Cdecl, ExactSpelling = true, SetLastError = false)]
        internal static extern int dfsNewTask1D(ref IntPtr task, int nx, double[] x, int xhint, int ny, double[] y, int yhint);
    }

Which doesn't work. Any help would be welcome.

↧

Question with df?EditPPSpline1D in Data Fitting Functions

October 1, 2015, 7:00 pm

Hi all,

I have a question about the parameter bc of the function df?EditPPSpline1D. The documentation ( https://software.intel.com/en-us/node/522222#D69BA4E7-E8BD-4413-A2A9-769D5002F832 ) says:

const float* for dfsEditPPSpline1D

const double* for dfdEditPPSpline1D

Pointer to boundary conditions. The size of the array is defined by the value of parameter bc_type:

If you set free-end or not-a-knot boundary conditions, pass the NULL pointer to this parameter.
If you combine boundary conditions at the endpoints of the interpolation interval, pass an array of two elements.
If you set a boundary condition for the default quadratic spline or a periodic condition for Hermite or the default cubic spline, pass an array of one element.

But if y is a vector-valued function with dimension ny (> 1), does the last case need only a one-element array, or an array of ny elements instead?

↧

PARADISO Compile Error

October 2, 2015, 11:26 pm

Hi,

I am trying to use PARDISO from the MKL library. However, I cannot compile my program because of the following error: C2059, syntax error '(', on line 71 of mkl_pardiso.h.

Has anyone else had this problem?

I am using VS2015; the same error occurs whether ILP64 is on or off. I have an Intel i7-4720HQ chip in a Razer laptop running Windows 8.

compiling debug x64

↧

slow out-of-place matrix transposition using mkl_domatcopy

October 3, 2015, 3:09 am

I created a test program to measure the speed of matrix transposition in various scenarios, in particular:

1. in-place transposition of square matrix using mkl_dimatcopy method
2. out-of-place transposition of square matrix using mkl_domatcopy method
3. in-place transposition of rectangular matrix using mkl_dimatcopy method
4. out-of-place transposition of rectangular matrix using mkl_domatcopy method
5. in-place transposition of square matrix using naive solution in C
6. out-of-place transposition of square matrix using naive solution in C
7. out-of-place transposition of rectangular matrix using naive solution in C

I did not implement in-place transposition of a rectangular matrix in C because of its complexity. My solution was single-threaded (I linked mkl_sequential_dll.lib into my binary), and the MKL version used was 11.3. Below is the output of my test program, showing the time needed for one matrix transposition. Each time is an average over 101 loops:

== Square in-place matrix (1000x1000) transposition using mkl_dimatcopy completed ==
== at 3.35496 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using mkl_domatcopy completed ==
== at 9.11405 milliseconds ==

== Rectangular in-place matrix (1500x230) transposition using mkl_dimatcopy completed ==
== at 58.50708 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using mkl_domatcopy completed ==
== at 4.34861 milliseconds ==

== Square in-place matrix (1000x1000) transposition using C completed ==
== at 5.41542 milliseconds ==

== Square out-of-place matrix (1000x1000) transposition using C completed ==
== at 12.04492 milliseconds ==

== Rectangular out-of-place matrix (1500x230) transposition using C completed ==
== at 1.51498 milliseconds ==

In all cases but one, the MKL solution was faster. The exception is out-of-place transposition of the rectangular 1500x230 matrix, where a very simple C solution that makes no attempt to use the cache optimally is 2.87 times faster. The data type used was double. The C solution was also faster than MKL for a larger rectangular matrix, 1500x600, which does not fit into L3 cache; in that case the C solution was 1.25 times faster. The additional operation that MKL performs (multiplication by the parameter alpha) does not explain the difference. There are 1500 x 230 = 345 000 multiplications. According to the Intel optimization manual, the throughput of the mulpd instruction on my architecture is 1 instruction per cycle (i.e. 2 double multiplications per cycle), if I interpret the manual correctly. In the ideal case, 345 000 multiplications therefore need 345 000 / 2 = 172 500 cycles, or about 0.07 ms at 2.40 GHz.

Configuration of my PC is as follows:

Windows 7 Home Premium 64-bit SP1

Intel Core i3 370M @ 2.40GHz
Arrandale 32nm Technology
Cores   2
Threads   4
L1 Data Cache Size   2 x 32 KBytes
L1 Instructions Cache Size   2 x 32 KBytes
L2 Unified Cache Size   2 x 256 KBytes
L3 Unified Cache Size   3072 KBytes

6.00 GB Dual-Channel DDR3 @ 531MHz (7-7-7-20)

Visual Studio Express 2012 for Windows Desktop

Source code follows:

#include <stdio.h>
#include <stdlib.h>
#include <memory.h>
#include "mkl.h"

// Parameters of square matrices
int m1 = 1000;
int n1 = 1000;
int loop_count1 = 101;

// Parameters of rectangular matrices
int m2 = 1500;
int n2 = 230;
int loop_count2 = 101;

void TestTransMklSqIn();
void TestTransMklSqOut();
void TestTransMklRecIn();
void TestTransMklRecOut();
void TestTransCSqIn();
void TestTransCSqOut();
void TestTransCRecOut();
void CheckTranspose(double *A, double *B, int m, int n);

int main()
{
   TestTransMklSqIn();
   TestTransMklSqOut();
   TestTransMklRecIn();
   TestTransMklRecOut();
   TestTransCSqIn();
   TestTransCSqOut();
   TestTransCRecOut();
}

double *CreateMatrix(int m, int n, bool zero = false)
{
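   double *A = (double*)mkl_malloc(m * n * sizeof(double), 16);
   if (A == NULL)
   {
      // Allocation failed: return NULL so the caller's check can handle it
      return NULL;
   }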
   if (zero)
   {
      memset((void*)A, 0, m * n *sizeof(double));
   }
   else
   {
      int count = m * n;
      for (int i = 0; i < count; i++)
      {
         A[i] = (double)(i+1);
      }
   }

   return A;
}

// In-place transposition of square matrix using MKL
void TestTransMklSqIn()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count1; i++)
   {
      mkl_dimatcopy('R', 'T', m1, n1, 1.0, A, n1, m1);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square in-place matrix (%dx%d) transposition using mkl_dimatcopy completed == \n"" == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of square matrix using MKL
void TestTransMklSqOut()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double *B = CreateMatrix(m1, n1);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count1; i++)
   {
      mkl_domatcopy('R', 'T', m1, n1, 1.0, A, n1, B, m1);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square out-of-place matrix (%dx%d) transposition using mkl_domatcopy completed == \n"" == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// In-place transposition of rectangular matrix using MKL
void TestTransMklRecIn()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count2; i++)
   {
      mkl_dimatcopy('R', 'T', m2, n2, 1.0, A, n2, m2);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   printf (" == Rectangular in-place matrix (%dx%d) transposition using mkl_dimatcopy completed == \n"" == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of rectangular matrix using MKL
void TestTransMklRecOut()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }

   double *B = CreateMatrix(m2, n2, true);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int i = 0; i < loop_count2; i++)
   {
      mkl_domatcopy('R', 'T', m2, n2, 1.0, A, n2, B, m2);
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   CheckTranspose(A, B, m2, n2);

   printf (" == Rectangular out-of-place matrix (%dx%d) transposition using mkl_domatcopy completed == \n"" == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// In-place transposition of square matrix using naive solution in C
void TestTransCSqIn()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   double temp;
   for (int k = 0; k < loop_count1; k++)
   {
      for (int i = 0; i < m1; i++)
      {
         int idx1 = i * n1;
         int idx2 = i;
         for (int j = 0; j < i; j++)
         {
            temp = A[idx1];
            A[idx1] = A[idx2];
            A[idx2] = temp;
            idx1++;
            idx2+=n1;
         }
      }
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square in-place matrix (%dx%d) transposition using C completed == \n"" == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
}

// Out-of-place transposition of square matrix using naive solution in C
void TestTransCSqOut()
{
   double *A = CreateMatrix(m1, n1);
   if (A == NULL)
   {
      return;
   }
   double *B = CreateMatrix(m1, n1);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int k = 0; k < loop_count1; k++)
   {
      for (int i = 0; i < m1; i++)
      {
         int idx1 = i * n1;
         int idx2 = i;
         for (int j = 0; j < n1; j++)
         {
            B[idx2] = A[idx1];
            idx1++;
            idx2+=n1;
         }
      }
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count1;

   printf (" == Square out-of-place matrix (%dx%d) transposition using C completed == \n"" == at %.5f milliseconds == \n\n", m1, n1, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

// Out-of-place transposition of rectangular matrix using naive solution in C
void TestTransCRecOut()
{
   double *A = CreateMatrix(m2, n2);
   if (A == NULL)
   {
      return;
   }
   double *B = CreateMatrix(m2, n2, true);
   if (B == NULL)
   {
      mkl_free(A);
      A = NULL;
      return;
   }

   double s_initial, s_elapsed;
   s_initial = dsecnd();

   for (int k = 0; k < loop_count2; k++)
   {
      for (int i = 0; i < m2; i++)
      {
         int idx1 = i * n2;
         int idx2 = i;
         for (int j = 0; j < n2; j++)
         {
            B[idx2] = A[idx1];
            idx1++;
            idx2+=m2;
         }
      }
   }

   s_elapsed = (dsecnd() - s_initial) / loop_count2;

   CheckTranspose(A, B, m2, n2);

   printf (" == Rectangular out-of-place matrix (%dx%d) transposition using C completed == \n"" == at %.5f milliseconds == \n\n", m2, n2, (s_elapsed * 1000));

   mkl_free(A);
   A = NULL;
   mkl_free(B);
   B = NULL;
}

void CheckTranspose(double *A, double *B, int m, int n)
{
   bool res = true;
   for (int i = 0; i < m; i++)
   {
      for (int j = 0; j < n; j++)
      {
         if (A[i*n + j] != B[j*m + i])
         {
            res = false;
            break;
         }
      }
      if (!res)
      {
         break;
      }
   }

   if (!res)
   {
      printf("Matrix transpose error detected !\n\n");
   }
}
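
For comparison with the naive loops above, here is a minimal sketch of a cache-blocked out-of-place transposition. It is not taken from the original benchmark: the function name TransposeBlocked and the tile size parameter `block` are hypothetical, and the default of 32 is only a starting guess that would need tuning on the target machine.

```cpp
#include <algorithm>
#include <cstddef>

// Out-of-place transpose of a row-major m x n matrix A into the row-major
// n x m matrix B, walking the data in tiles of side `block` so that both the
// reads from A and the writes to B stay within a small working set.
// `block` is a hypothetical tuning parameter, not an MKL setting.
void TransposeBlocked(const double *A, double *B, int m, int n, int block = 32)
{
   for (int ii = 0; ii < m; ii += block)
   {
      const int iEnd = std::min(ii + block, m);
      for (int jj = 0; jj < n; jj += block)
      {
         const int jEnd = std::min(jj + block, n);
         for (int i = ii; i < iEnd; i++)
         {
            for (int j = jj; j < jEnd; j++)
            {
               // Same index convention as CheckTranspose: B[j*m + i] == A[i*n + j]
               B[static_cast<std::size_t>(j) * m + i] =
                  A[static_cast<std::size_t>(i) * n + j];
            }
         }
      }
   }
}
```

The result uses the same layout that CheckTranspose verifies, so it could be timed in the same harness as the other variants.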

↧

Xerbla and CBLAS/BLAS

October 9, 2015, 6:02 am

≫ Next: No PZLAHQR in MKL

≪ Previous: slow out-of-place matrix transposition using mkl_domatcopy

Hello,

I have a problem using xerbla and cblas/blas functions. The following code ends up with a segmentation fault:

#include "mkl.h"
#include <iostream>

void XERBLA(const char * Name, const int * Num, const int Len){
  std::cout << "XERBLA CALLED!"<< Name << ": "<< *Num << ": "<< Len;
}

int main(int argc, char** argv){
 
  float a = 3.0f;
 
  int length = 1;
  int increment = 1;
  float* NullPointer = nullptr;

  /*VML*/
  vsSub(length, NullPointer, &a, &a);
  /*BLAS*/
  SCOPY(&length, NullPointer, &increment, &a, &increment);
  /*CBLAS*/
  cblas_scopy(length, NullPointer, increment, &a, increment);

  return 0;
}

This xerbla function was successfully tested with VML, but for CBLAS/BLAS it is not called in the code above. Is it possible to catch invalid lengths, invalid increments, or null pointers in CBLAS/BLAS?

Thank you very much,

Mario
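
As far as I know, the reference Level 1 BLAS routines such as SCOPY perform no argument checking and never call xerbla, so one option is to validate the arguments before the call. The following is a minimal sketch under that assumption. The CopyStatus enum and the ValidateCopyArgs function are hypothetical names made up for illustration; they are not part of MKL.

```cpp
// Hypothetical status codes for this sketch; these are not MKL values.
enum class CopyStatus { Ok, BadLength, NullPointer };

// Checks the arguments that would be passed to cblas_scopy(n, x, incx, y, incy).
// The increments are accepted as-is, because BLAS allows negative strides.
CopyStatus ValidateCopyArgs(int n, const float *x, int incx, const float *y, int incy)
{
  (void)incx;
  (void)incy;
  if (n < 0)
  {
    return CopyStatus::BadLength;
  }
  if (n == 0)
  {
    return CopyStatus::Ok; // nothing would be read or written
  }
  if (x == nullptr || y == nullptr)
  {
    return CopyStatus::NullPointer;
  }
  return CopyStatus::Ok;
}
```

A caller would invoke cblas_scopy only when the function returns CopyStatus::Ok and report the error itself otherwise.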

↧

No PZLAHQR in MKL

October 10, 2015, 9:13 pm

≫ Next: Sparse BLAS - sparse_dense matrix multiplication with zero indexing and colmajor matrix

≪ Previous: Xerbla and CBLAS/BLAS

Hello,

I am using ScaLAPACK from MKL.

Is there a PZLAHQR routine in MKL?

The reference manual says only a real version exists: https://software.intel.com/en-us/node/521530

If not, how can I compute a Schur decomposition of a complex, non-Hermitian matrix in Hessenberg form?

Thanks,

Alex

↧

Sparse BLAS - sparse_dense matrix multiplication with zero indexing and colmajor matrix

October 12, 2015, 2:16 am

≫ Next: Matrix Inversion LAPACKE_zsytri

≪ Previous: No PZLAHQR in MKL

Hello,

I would like to use the MKL Sparse BLAS from C++ to compute a multithreaded sparse-dense matrix product.
I want to compute C = S * B, where S is a sparse matrix in COO or CSC format with 0-based indexing and
C and B are two dense matrices stored in column-major order with 0-based indexing.
With all the mkl_?coomm and mkl_?cscmm functions, it seems this product can only be computed
when the dense matrices use 0-based indexing and row-major order.
Is there an efficient (fast) way to compute this product with column-major, 0-based dense matrices,
that is, without transposing the dense matrices?
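
If I read the documentation correctly, these routines treat the dense matrices as column-major when the sparse matrix uses 1-based indexing, so shifting the index arrays by one may be cheaper than transposing B and C. The sketch below is plain C++ rather than MKL code: ToOneBased and CooTimesColMajor are hypothetical helper names, and the second function is only a single-threaded reference for checking results.

```cpp
#include <cstddef>
#include <vector>

// Returns a copy of zero-based indices shifted to one-based indexing.
std::vector<int> ToOneBased(const std::vector<int> &idx)
{
  std::vector<int> out(idx.size());
  for (std::size_t e = 0; e < idx.size(); ++e)
  {
    out[e] = idx[e] + 1;
  }
  return out;
}

// Reference C = S * B. S is m x k in zero-based COO format (rows, cols, vals),
// B is k x n in column-major order, and the returned C is m x n in column-major order.
std::vector<double> CooTimesColMajor(int m, int k, int n,
                                     const std::vector<int> &rows,
                                     const std::vector<int> &cols,
                                     const std::vector<double> &vals,
                                     const std::vector<double> &B)
{
  std::vector<double> C(static_cast<std::size_t>(m) * n, 0.0);
  for (int col = 0; col < n; ++col)
  {
    for (std::size_t e = 0; e < vals.size(); ++e)
    {
      C[static_cast<std::size_t>(col) * m + rows[e]] +=
          vals[e] * B[static_cast<std::size_t>(col) * k + cols[e]];
    }
  }
  return C;
}
```

The reference product is useful mainly for validating whatever library call ends up being used on a small test case.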

↧

Matrix Inversion LAPACKE_zsytri

October 12, 2015, 4:07 am

≫ Next: Compiling in a mingw64/msys2 environment

≪ Previous: Sparse BLAS - sparse_dense matrix multiplication with zero indexing and colmajor matrix

Hello everyone,

I'm testing the inversion of a symmetric matrix. To do that, I wrote this code:

Complex[,] c = (Complex[,])a.Clone(); //I need to use it later
Complex[,] d = (Complex[,])a.Clone(); //I need to use it later

int n = a.GetLength(0);
int lda = n;
int info = 1;

int[] ipiv = new int[n];

info = MKLWrapper.LAPACKE_zsytrf(LAPACK_ROW_MAJOR, UPLO_U, n, c, lda, ipiv);

info = MKLWrapper.LAPACKE_zsytri(LAPACK_ROW_MAJOR, UPLO_U, n, c, lda, ipiv);

ipiv = new int[n];

info = MKLWrapper.LAPACKE_zsytrf(LAPACK_ROW_MAJOR, UPLO_L, n, d, lda, ipiv);

info = MKLWrapper.LAPACKE_zsytri(LAPACK_ROW_MAJOR, UPLO_L, n, d, lda, ipiv);


// Copy the strictly lower triangle of d into c (C# arrays are 0-based).
for (int i = 0; i < a.GetLength(0); i++)
{
    for (int j = 0; j < i; j++)
    {
        c[i, j] = d[i, j];
    }
}

This code works, but I'm not sure it is the right way to achieve my goal.

Can anyone help me?

Thank you very much

Gianluca
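
Because the inverse of a symmetric matrix is itself symmetric, the second factorization and inversion with UPLO_L may be unnecessary: the triangle returned by the UPLO_U call can simply be mirrored. The sketch below shows this in C++ for a row-major n x n array. MirrorUpperToLower is a hypothetical helper name, and the code assumes the upper triangle already holds the inverse.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Copies the strictly upper triangle of a row-major n x n matrix into its
// strictly lower triangle. No conjugation is applied, because the matrix is
// complex symmetric, not Hermitian.
void MirrorUpperToLower(std::vector<std::complex<double>> &a, int n)
{
  for (int i = 1; i < n; ++i)
  {
    for (int j = 0; j < i; ++j)
    {
      a[static_cast<std::size_t>(i) * n + j] = a[static_cast<std::size_t>(j) * n + i];
    }
  }
}
```

This replaces one factorization and one inversion with a single pass over half of the matrix.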

↧

Compiling in a mingw64/msys2 environment

October 12, 2015, 4:50 am

≫ Next: MKL Pardiso (version 11.2.3): wrong output of phase 331 with multiple rhs and Schur complement enabled

≪ Previous: Matrix Inversion LAPACKE_zsytri

I'm trying to build my C program in a mingw64 (msys2) environment using cmake.

I successfully compiled my example code, including an mkl_malloc() call, but once I include the vdMul() function I get the following error:

Warning: .drectve `-defaultlib:"uuid.lib"' unrecognized
Warning: corrupt .drectve at end of def file
Warning: .drectve `-defaultlib:"uuid.lib"' unrecognized
Warning: corrupt .drectve at end of def file
Warning: .drectve `-defaultlib:"uuid.lib"' unrecognized
Warning: corrupt .drectve at end of def file
C:/Program Files (x86)/IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64/mkl_intel_ilp64_dll.lib(./_tmp/interface_win32e_ilp64_dll/vml_fb_cpuid_iface.obj):(.text[mkl_vml_serv_cpu_detect]+0xcb): undefined reference to `__security_check_cookie'

My libraries are included in my cmake config:

find_library(MKL_CORE_LIBRARY mkl_core_dll PATHS ${MKL_ROOT_DIR}/lib/${MKL_ARCH_DIR} ${MKL_ROOT_DIR}/lib/)
find_library(MKL_INTEL_ILP64_LIBRARY mkl_intel_ilp64_dll PATHS ${MKL_ROOT_DIR}/lib/${MKL_ARCH_DIR} ${MKL_ROOT_DIR}/lib/)
find_library(MKL_SEQUENTIAL_LIBRARY mkl_sequential_dll PATHS ${MKL_ROOT_DIR}/lib/${MKL_ARCH_DIR} ${MKL_ROOT_DIR}/lib/)

Is there a compatibility problem here or am I doing something wrong?

Thanks.

↧

MKL Pardiso (version 11.2.3): wrong output of phase 331 with multiple rhs and Schur complement enabled

October 12, 2015, 9:35 am

≫ Next: Difference of calculation result between i7 and xeon cpu

≪ Previous: Compiling in a mingw64/msys2 environment

Hi,

I recently started using MKL_PARDISO. I noticed that phase 331 gives the wrong result when solving for multiple right-hand sides with the Schur complement feature enabled.

Attached is code to reproduce the problem. I copied the Schur complement example provided with the MKL distribution and added multiple right-hand sides.

I'm using composer_xe_2015.3.187, with MKL 11.2.3.

Stefano

Attachment	Size
Download pardiso_schur_c.c	8.04 KB

↧

Difference of calculation result between i7 and xeon cpu

October 12, 2015, 11:21 pm

≫ Next: Bug in MKL 11.3.0.109 FFT

≪ Previous: MKL Pardiso (version 11.2.3): wrong output of phase 331 with multiple rhs and Schur complement enabled

Hi.

I wrote a numerical calculation program and ran it on two computers (one with an i7 CPU, the other with a Xeon CPU).

The numerical results are slightly different, although the input data is the same.

My program is very sensitive to numerical differences, so I am confused about which result is correct.

Why does the difference occur?

The program was compiled on a single computer in the following environment:

Compiler: MS VS2010 C++ Compiler

MKL library version: 11.1.4.237

Linked libraries: mkl_core.lib, mkl_intel_lp64.lib, mkl_intel_thread.lib, libiomp5md.lib

The program mainly uses the "DGEMM" and "PARDISO" functions to solve linear equations directly (not with an iterative method).

The number of threads is set to 1.
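
One likely cause is that a library may select different code paths on different CPUs (for example, different vector widths), which changes the order of floating-point additions. The small sketch below only illustrates that floating-point addition is not associative; the function names are hypothetical and no MKL call is involved. MKL 11.0 and later also document a Conditional Numerical Reproducibility feature (the mkl_cbwr_set function and the MKL_CBWR environment variable) that may be worth checking for this case.

```cpp
// Adds three values as (a + b) + c.
double SumLeftToRight(double a, double b, double c)
{
  return (a + b) + c;
}

// Adds the same three values as a + (b + c).
double SumRightToLeft(double a, double b, double c)
{
  return a + (b + c);
}
```

With a = 1e20, b = -1e20 and c = 1.0, the first function returns 1.0 and the second returns 0.0, because 1.0 is smaller than the spacing between doubles near 1e20. Neither ordering is "the" correct one; both are valid roundings of the same sum.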

↧

Bug in MKL 11.3.0.109 FFT

October 13, 2015, 3:52 am

≫ Next: Intel MKL 11.3 hotfix release for BLAS, FFT, and sparse BLAS issues

≪ Previous: Difference of calculation result between i7 and xeon cpu

Hello,

I am pretty sure I found a bug in the DFTI (FFT) routine of MKL.

When doing a 2D complex-to-real FFT, for some array sizes, the FFT routine writes precisely 2 real values too many. In those cases, the FFT result is incorrect.

This happens for both single- and double-precision invocations.

For single-precision invocations, this happens precisely when: num_rows >= 30 and num_cols >= 22 and num_cols % 16 == 14.

For double-precision invocations, this happens precisely when: num_rows >= 17 and num_cols >= 22 and num_cols % 8 == 6.

A program that demonstrates the issue is shown below.

Kind regards, Sidney Cadot

//////////////////////////
// InvestigateMklBug.cc //
//////////////////////////

#include <iostream>
#include <complex>
#include <cassert>
#include <vector>

#include <mkl.h>

void testcase(unsigned  num_rows, unsigned num_cols)
{
    // We will do a complex-to-real 2D FFT.
    //
    // The real array is    (num_rows x num_cols).
    // The complex_array is (num_rows x num_cols_complex), where num_cols_complex == num_cols / 2 + 1.
    //
    // It turns out that, for some values of num_rows/num_cols, the FFT writes beyond the last entry
    // of the 'real_array'.
    //
    // We investigate this by allocating 'real_array' with a few elements (EXTRA_ENTRIES) too many.
    //
    // Prior to the FFT, we initialize 'real_array' with a GUARD_VALUE.
    //
    // After the FFT, we check the number of GUARD_VALUEs still present.
    //
    // If this is less than the number of EXTRA_ENTRIES, elements were overwritten that shouldn't have.

    const unsigned num_cols_complex = num_cols / 2 + 1;

    // setup DFTI descriptor

    DFTI_DESCRIPTOR_HANDLE descriptor;

    const MKL_LONG dimensions[2] = {num_rows, num_cols};
    MKL_LONG status = DftiCreateDescriptor(&descriptor, DFTI_SINGLE, DFTI_REAL, 2, dimensions);
    assert(status == DFTI_NO_ERROR);

    status = DftiSetValue(descriptor, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    assert(status == DFTI_NO_ERROR);

    // The manual recommends setting this to DFTI_COMPLEX_COMPLEX.
    status = DftiSetValue(descriptor, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
    assert(status == DFTI_NO_ERROR);

    const MKL_LONG input_strides[3] = {0, num_cols_complex, 1};
    status = DftiSetValue(descriptor, DFTI_INPUT_STRIDES, input_strides);
    assert(status == DFTI_NO_ERROR);

    const MKL_LONG output_strides[3] = {0, num_cols, 1};
    status = DftiSetValue(descriptor, DFTI_OUTPUT_STRIDES, output_strides);
    assert(status == DFTI_NO_ERROR);

    const MKL_LONG thread_limit = 1;
    status = DftiSetValue(descriptor, DFTI_THREAD_LIMIT, thread_limit);
    assert(status == DFTI_NO_ERROR);

    status = DftiCommitDescriptor(descriptor);
    assert(status == DFTI_NO_ERROR);

    // Do inverse-fft

    const unsigned EXTRA_ENTRIES = 64;
    const float    GUARD_VALUE   = 999.0;

    std::vector<std::complex<float>> complex_array(num_rows * num_cols_complex);
    std::vector             <float > real_array   (num_rows * num_cols + EXTRA_ENTRIES, GUARD_VALUE);

    // Execute the FFT and free the descriptor.

    status = DftiComputeBackward(descriptor, complex_array.data(), real_array.data());
    assert(status == DFTI_NO_ERROR);

    status = DftiFreeDescriptor(&descriptor);
    assert(status == DFTI_NO_ERROR);

    // Investigate whether the FFT wrote more entries than expected.

    unsigned count_guards = 0;
    for (unsigned i = 0; i < real_array.size(); ++i)
    {
        if (real_array[i] == GUARD_VALUE)
        {
            ++count_guards;
        }
    }

    assert(count_guards <= EXTRA_ENTRIES);

    bool problem_detected = (count_guards != EXTRA_ENTRIES); // They should be the same.

    if (problem_detected)
    {
        std::cout << "num_rows "<< num_rows << " num_cols "<< num_cols << " num_cols_complex "<< num_cols_complex << " --> wrote "<< (EXTRA_ENTRIES - count_guards) << " real_array entries too many."<< std::endl;
    }
}

int main()
{
    for (unsigned num_rows = 1; num_rows <= 200; ++num_rows)
    {
        for (unsigned num_cols = 1; num_cols <= 200; ++num_cols)
        {
            testcase(num_rows, num_cols);
        }
    }
    return 0;
}
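
The size conditions reported above can be written as a small predicate, for instance to skip or pad the affected sizes until a fix is available. This is only a sketch of the conditions as stated in this post, not a description of the library's internals; IsReportedAffected is a hypothetical name.

```cpp
// Returns true if a 2D complex-to-real transform of the given size falls in
// the set of sizes reported in this post as writing two real values too many.
bool IsReportedAffected(bool single_precision, unsigned num_rows, unsigned num_cols)
{
    if (single_precision)
    {
        return num_rows >= 30 && num_cols >= 22 && num_cols % 16 == 14;
    }
    return num_rows >= 17 && num_cols >= 22 && num_cols % 8 == 6;
}
```

For example, a single-precision 30 x 30 transform matches the condition, while a 30 x 22 one does not, since 22 % 16 is 6.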

↧

Intel MKL 11.3 hotfix release for BLAS, FFT, and sparse BLAS issues

October 5, 2015, 12:24 pm

≫ Next: MKL threading seems using only one core

≪ Previous: Bug in MKL 11.3.0.109 FFT

Intel MKL Users,

We recently discovered that some BLAS, FFT, and sparse BLAS routines produce incorrect results under certain circumstances. We believe the scope of the issues is limited and that they impact few use cases. Please read the description of each issue below to determine if you are impacted. These issues will be addressed in the upcoming Intel MKL 11.3.1 release. If you do not want to wait for the release of Intel MKL 11.3.1, you can obtain a hotfix to replace your current installation by contacting us at intel.mkl@intel.com. Please include your license serial number in all correspondence.

{S,D}GEMM may give incorrect results for beta=0 cases.
1. Scope: Only affects Intel64 architectures with the AVX2 instruction set during multithreaded execution. For DGEMM, this affects matrices with N < 4000 and M/nthreads > 5004. For SGEMM, this affects matrices with N < 4000 and M/nthreads > 10008.
2. Present in Intel MKL versions: 11.2.1, 11.2.2, 11.2.3, 11.2.4, and 11.3

{S,D}SYMM may give incorrect results for beta=0 cases.
1. Scope: Only affects Intel64 architectures with the AVX2 instruction set for both multithreaded and sequential execution. For DSYMM, this affects matrices with M/nthreads > 5004. For SSYMM, this affects matrices with M/nthreads > 10008.
2. Present in Intel MKL versions: 11.2.1, 11.2.2, 11.2.3, 11.2.4, and 11.3

2D real-to-complex FFT produces incorrect results for some non-power-of-2 sizes.
1. Scope: Affects transform sizes 22+8*n for double precision and transform sizes 30+16*n for single precision.
2. Present in Intel MKL version: 11.3

Sparse matrix-vector multiplication using the two-stage, inspector-executor API produces incorrect results.
1. Scope: Only affects the CSR sparse matrix storage format during multithreaded execution. The problem arises when 'mkl_sparse_?_mv' routine is called after calls of the 'mkl_sparse_optimize' routine.
2. Present in Intel MKL version: 11.3
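
To check whether a given {S,D}GEMM call falls within the scope described above, the size conditions can be sketched as a predicate. This is an illustration only: GemmMayBeAffected is a hypothetical name, the code assumes that M/nthreads means real (not integer) division, and it does not check the AVX2 or MKL version conditions, which the caller would need to verify separately.

```cpp
// Returns true if the matrix sizes match the {S,D}GEMM scope listed above:
// beta == 0, multithreaded execution, N < 4000, and M/nthreads above the
// per-precision limit. M/nthreads > limit is evaluated as m > limit * nthreads
// (assumption: real division is meant).
bool GemmMayBeAffected(bool double_precision, long long m, long long n,
                       int nthreads, double beta)
{
    if (beta != 0.0 || nthreads < 2)
    {
        return false;
    }
    const long long limit = double_precision ? 5004 : 10008;
    return n < 4000 && m > limit * nthreads;
}
```

For instance, a DGEMM call with M = 20020, N = 100 and 4 threads matches the condition, while the same sizes with SGEMM do not.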

If you believe you are affected by any one of these issues and wish to receive the hotfix, please send a message to intel.mkl@intel.com. Please include your license serial number. Follow this link for instructions on finding your license serial number.

↧