Dear all,
Unfortunately, I am again having trouble with mkl_cluster_sparse_solver, as in my previous topic. I took one of the examples Intel ships in the MKL examples directory and modified it in two ways: first, the code now reads an arbitrary matrix from the file fort.110 (the layout I expect in that file is sketched after the program listing below); second, I loop over the solver routines, because later on I want to change the matrix from one cycle to the next. The first problem arises for large system sizes.
For this case you can find the matrix in fort1.zip. The program aborts with a segmentation fault after reaching 18%: forrtl: severe (174): SIGSEGV, segmentation fault occurred. It is hard to track down what exactly goes wrong, but it must happen inside the solver routine, since the routine does start. As I said, this only happens for large matrices, and I do not know how to get rid of the problem.
The next problem occurs for small matrices, such as the one in fort.zip, and seems to be related to the loop: in the first cycle everything works fine, but the second cycle aborts with an error message I have already seen in one of my earlier topics:
Fatal error in PMPI_Reduce: Message truncated, error stack:
PMPI_Reduce(2334).................: MPI_Reduce(sbuf=0x7d7d7f8, rbuf=0x7f0b900, count=22912, MPI_DOUBLE, MPI_SUM, root=0, comm=0x84000004) failed
MPIR_Reduce_impl(1439)............: fail failed
I_MPIR_Reduce_intra(1533).........: Failure during collective
MPIR_Reduce_intra(1201)...........: fail failed
MPIR_Reduce_Shum_ring(833)........: fail failed
MPIDI_CH3U_Receive_data_found(131): Message from rank 1 and tag 11 truncated; 14000 bytes received but buffer size is 1296
I have tried what helped last time, providing all parameters (nrhs, msglvl, iparm, ...) on all ranks again, but it does not seem to fix the issue this time.
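To be precise, by "providing all parameters on all ranks" I mean that every rank enters the solver with identical nrhs, msglvl and iparm values; in the full program below each rank simply sets them itself. A minimal, self-contained sketch of the equivalent broadcast variant is shown here. It assumes the same -i8 compilation as the program below, so the 8-byte default integers are matched with MPI_INTEGER8 (an assumption about the MPI library configuration), and the iparm(1) value is only toy data:

program bcast_controls
    implicit none
    include 'mpif.h'
    ! With -i8 these default integers are 8 bytes, hence MPI_INTEGER8 below
    ! (an assumption about the MPI library configuration).
    integer :: nrhs, msglvl, iparm(64)
    integer*4 :: mpi_stat, rank
    integer*4 :: one4 = 1, count64 = 64, root = 0  ! 4-byte MPI count/root arguments

    call mpi_init(mpi_stat)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, mpi_stat)

    if (rank.eq.0) then
        nrhs     = 1
        msglvl   = 1
        iparm    = 0
        iparm(1) = 1   ! non-default control, only as toy data for the sketch
    end if

    ! Rank 0 owns the solver controls; all other ranks receive identical copies.
    call mpi_bcast(nrhs,   one4,    MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)
    call mpi_bcast(msglvl, one4,    MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)
    call mpi_bcast(iparm,  count64, MPI_INTEGER8, root, MPI_COMM_WORLD, mpi_stat)

    write(*,*) 'rank', rank, ': iparm(1) =', iparm(1)
    call mpi_finalize(mpi_stat)
end program bcast_controls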
This is the program code (cl_solver_f90.f90):
program cluster_sparse_solver
    use mkl_cluster_sparse_solver
    implicit none
    include 'mpif.h'
    integer, parameter :: dp = kind(1.0D0)
    !.. Internal solver memory pointer for 64-bit architectures
    TYPE(MKL_CLUSTER_SPARSE_SOLVER_HANDLE) :: pt(64)
    integer :: maxfct, mnum, mtype, phase, nrhs, error, msglvl, i, ik, l1, k1, idum(1), DimensionL, Nsparse
    integer*4 :: mpi_stat, rank, num_procs
    double precision :: ddum(1)
    integer, allocatable :: IA( : ), JA( : ), iparm( : )
    double precision, allocatable :: VAL( : ), rhodot( : ), rho( : )
    integer(4) :: MKL_COMM

    MKL_COMM = MPI_COMM_WORLD
    call mpi_init(mpi_stat)
    call mpi_comm_rank(MKL_COMM, rank, mpi_stat)

    do l1 = 1, 64
        pt(l1)%dummy = 0
    end do

    error  = 0   ! initialize error flag
    msglvl = 1   ! print statistical information
    mtype  = 11  ! real, non-symmetric
    nrhs   = 1
    maxfct = 1
    mnum   = 1

    allocate(iparm(64))
    do l1 = 1, 64
        iparm(l1) = 0
    end do
    ! Setup PARDISO control parameters
    iparm(1)  = 1    ! do not use default values
    iparm(2)  = 3    ! fill-in reordering from METIS
    iparm(8)  = 100  ! max. number of iterative refinement steps on entry
    iparm(10) = 13   ! perturb the pivot elements with 1E-13
    iparm(11) = 1    ! use nonsymmetric permutation and scaling MPS
    iparm(13) = 1    ! improved accuracy using nonsymmetric weighted matching
    iparm(27) = 1    ! check whether column indices are sorted in increasing order within each row

    ! Read the matrix in CSR format from fort.110: the dimension and the number
    ! of nonzeros on the first line, then the values, the row pointers and the
    ! column indices, one entry per line.
    read(110,*) DimensionL, Nsparse
    allocate(VAL(Nsparse), JA(Nsparse), IA(DimensionL))
    if (rank.eq.0) then
        do k1 = 1, Nsparse
            read(110,*) VAL(k1)
        end do
        do k1 = 1, DimensionL+1
            read(110,*) IA(k1)
        end do
        do k1 = 1, Nsparse
            read(110,*) JA(k1)
        end do
    end if

    ! Right-hand side and solution vector; initialized on the master rank only
    allocate(rhodot(DimensionL), rho(DimensionL))
    if (rank.eq.0) then
        rhodot    = 0.0d0
        rhodot(1) = 1.0d0
        rho       = 0.0d0
    end if

    if (rank.eq.0) write(*,*) 'INIT PARDISO'

    ik = 0
    Pardisoloop: do
        ik = ik + 1

        phase = 12  ! analysis and numerical factorization
        call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, &
                                      iparm, msglvl, ddum, ddum, MKL_COMM, error)
        if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

        phase = 33  ! solve with iterative refinement
        call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, VAL, IA, JA, idum, nrhs, &
                                      iparm, msglvl, rhodot, rho, MKL_COMM, error)
        if (error.ne.0.and.rank.eq.0) write(*,*) 'ERROR: ', error

        if (ik.ge.4) exit Pardisoloop
    end do Pardisoloop

    call MPI_BARRIER(MKL_COMM, mpi_stat)

    phase = -1  ! release internal solver memory
    call cluster_sparse_solver_64(pt, maxfct, mnum, mtype, phase, DimensionL, ddum, idum, idum, idum, nrhs, &
                                  iparm, msglvl, ddum, ddum, MKL_COMM, error)
    if (error.ne.0.and.rank.eq.0) write(*,*) 'Release of memory: ', error

    call mpi_finalize(mpi_stat)
end
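For reference, fort.110 is expected to contain the dimension and the number of nonzeros on the first line, followed by the CSR values, the DimensionL+1 row pointers and the column indices, one entry per line, all read list-directed. A minimal sketch that writes such a file for a 2x2 identity matrix (toy data, only to illustrate the format; it relies on unit 110 being written to fort.110 by default, as ifort does):

program write_fort110
    implicit none
    integer, parameter :: n = 2, nnz = 2
    double precision :: val(nnz) = (/ 1.0d0, 1.0d0 /)  ! CSR values
    integer :: ia(n+1) = (/ 1, 2, 3 /)                 ! row pointers (n+1 entries)
    integer :: ja(nnz) = (/ 1, 2 /)                    ! column indices
    integer :: k

    write(110,*) n, nnz
    do k = 1, nnz
        write(110,*) val(k)
    end do
    do k = 1, n+1
        write(110,*) ia(k)
    end do
    do k = 1, nnz
        write(110,*) ja(k)
    end do
end program write_fort110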
I compile with
mpiifort -i8 -I${MKLROOT}/include -c -o mkl_cluster_sparse_solver.o ${MKLROOT}/include/mkl_cluster_sparse_solver.f90
mpiifort -i8 -I${MKLROOT}/include -c -o cl_solver_f90.o cl_solver_f90.f90
mpiifort mkl_cluster_sparse_solver.o cl_solver_f90.o -o MPI.out -Wl,--start-group ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a ${MKLROOT}/lib/intel64/libmkl_intel_thread.a ${MKLROOT}/lib/intel64/libmkl_core.a ${MKLROOT}/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -Wl,--end-group -liomp5 -lpthread -lm -ldl
and run the program with mpiexec -n 2 ./MPI.out. Our cluster has 16 cores per node and I request two nodes. RAM should not be the problem (64 GB per node), since the same case runs perfectly with the plain PARDISO on a single node. I set export MKL_NUM_THREADS=16. Am I right that the slave MPI process should automatically obtain parts of the factorization, or do I have to use the distributed version to achieve that? The reason I ask is that I cannot observe any process running on the second node.
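To see where the ranks are actually placed, a small check along the following lines can be used (a minimal sketch; it only prints the host name each rank runs on via MPI_GET_PROCESSOR_NAME):

program where_are_my_ranks
    implicit none
    include 'mpif.h'
    integer*4 :: mpi_stat, rank, num_procs, namelen
    character(len=MPI_MAX_PROCESSOR_NAME) :: hostname

    call mpi_init(mpi_stat)
    call mpi_comm_rank(MPI_COMM_WORLD, rank, mpi_stat)
    call mpi_comm_size(MPI_COMM_WORLD, num_procs, mpi_stat)
    ! Print which host each rank was placed on.
    call mpi_get_processor_name(hostname, namelen, mpi_stat)
    write(*,*) 'rank', rank, 'of', num_procs, 'runs on ', trim(hostname)
    call mpi_finalize(mpi_stat)
end program where_are_my_ranks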
The versions are: MKL 2017.4.256, ifort 17.0.6.256 and Intel MPI 2017.4.239, but my colleague can also reproduce the issue with other versions and on other clusters.
Thanks in advance,
Horst