Channel: Intel® oneAPI Math Kernel Library & Intel® Math Kernel Library

WordSize of SFMT19937


While generating integer sequences of n elements using the SFMT19937 BRNG, I noticed that the output consisted of exactly n elements, whereas according to the Intel VSL Notes documentation I should have received 4n elements.

I also found that the SFMT19937 WordSize obtained from the VSLBrngProperties struct is 4, when I expected it to be 16.

Is this the correct behavior for the SFMT19937 BRNG?  If so, how would I go about obtaining all 128 bits of output?
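For reference, here is a minimal C sketch, purely as an illustration, of how one might inspect the reported BRNG properties and pull the raw 32-bit integer output with viRngUniformBits; whether those 32-bit chunks should then be grouped four at a time into 128-bit SFMT words is exactly the open question. The seed and counts below are arbitrary.

#include <stdio.h>
#include "mkl.h"

int main(void)
{
    /* Query the properties the SFMT19937 BRNG reports about itself. */
    VSLBRngProperties props;
    vslGetBrngProperties(VSL_BRNG_SFMT19937, &props);
    printf("WordSize = %d bytes, NBits = %d\n", props.WordSize, props.NBits);

    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_SFMT19937, 1);

    /* Ask for 4*n 32-bit chunks, i.e. enough raw output for n 128-bit
       elements if the generator really produces 32 bits per output word. */
    const int n = 4;
    unsigned int raw[16];
    viRngUniformBits(VSL_RNG_METHOD_UNIFORMBITS_STD, stream, 4 * n, raw);

    for (int i = 0; i < 4 * n; ++i)
        printf("%08x%c", raw[i], (i % 4 == 3) ? '\n' : ' ');

    vslDeleteStream(&stream);
    return 0;
}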

 

 

 


Question about vdRngGaussian and vdRngUniform.


Hi!
I want to write the following algorithm using the functions vdRngGaussian and vdRngUniform.

Let X be a vector of n Bernoulli(phi) random variables.
sumx = sum(X);
accept <-- 0


for i = 1:N


1. Propose thetacan ~ Normal(theta, sigma2), where theta = log(1 + phi) - log(1 - phi);


2.  phican = (exp(thetacan)-1)/(exp(thetacan)+1);


3.  Compute
           logcan = sumx*log(phican) + (n -sumx)*log(1-phican) + thetacan -2*log(1 +exp(thetacan));
           logold =  sumx*log(phi) + (n -sumx)*log(1-phi)+ theta-2*log(1 +exp(theta));
           logf = logcan - logold;

4. Propose u ~ Uniform(0, 1)

5. if log(u)<logf
    phi <--- phican;
    accept <-- accept  + 1
end

end of iterations

 

The criterion for choosing sigma2 is that the acceptance rate (accept/N) should lie between 30% and 50%.
If the acceptance rate < 30%, increase sigma2.
If the acceptance rate > 50%, decrease sigma2.

I simulated n = 5000 observations from Bernoulli(0.9). Using MATLAB, I chose sigma2 to be 0.01. I ran my MATLAB code on this dataset
many times and the acceptance rate was around 45%. The problem I face is that I can't pin down sigma2
for the C code. For example, I chose sigma2 to be 0.00001 and for 5 different runs of the algorithm the rate was
{1, 0.2290, 0.0206, 0.3550, 0.3550}.
This is a weird result because, from theory, it is known that for a given value of sigma2 the rate should settle around a single number (for example 0.2290)
and not range from 0.0206 to 1.

I attached my C-code, Matlab-code and the dataset.

Could you please tell me whether I am using the functions properly?

Thank you very much.
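For reference, here is a minimal C sketch (my own illustration, not the attached code) of how one run of the algorithm above might be driven with vdRngGaussian and vdRngUniform; the variable names, seed and iteration count are taken from, or assumed to match, the pseudocode. One detail worth stressing: vdRngGaussian takes the standard deviation as its last argument, so the proposal spread must be passed as sqrt(sigma2).

#include <math.h>
#include <stdio.h>
#include "mkl.h"

int main(void)
{
    const int n = 5000, N = 10000;      /* sample size and MH iterations (assumed) */
    double phi = 0.5, sigma2 = 0.01;    /* starting value and proposal variance (assumed) */
    double sumx = 4500.0;               /* sum of the Bernoulli data, placeholder */
    int accept = 0;

    VSLStreamStatePtr stream;
    vslNewStream(&stream, VSL_BRNG_MT19937, 777);

    for (int i = 0; i < N; ++i) {
        double theta = log(1.0 + phi) - log(1.0 - phi);
        double thetacan, u;

        /* 1. Propose thetacan ~ Normal(theta, sigma2).
           Note: the last argument is the standard deviation, not the variance. */
        vdRngGaussian(VSL_RNG_METHOD_GAUSSIAN_ICDF, stream, 1, &thetacan,
                      theta, sqrt(sigma2));

        /* 2.-3. Transform and compute the log acceptance ratio. */
        double phican = (exp(thetacan) - 1.0) / (exp(thetacan) + 1.0);
        double logcan = sumx * log(phican) + (n - sumx) * log(1.0 - phican)
                        + thetacan - 2.0 * log(1.0 + exp(thetacan));
        double logold = sumx * log(phi) + (n - sumx) * log(1.0 - phi)
                        + theta - 2.0 * log(1.0 + exp(theta));
        double logf = logcan - logold;

        /* 4.-5. Propose u ~ Uniform(0, 1) and accept/reject. */
        vdRngUniform(VSL_RNG_METHOD_UNIFORM_STD, stream, 1, &u, 0.0, 1.0);
        if (log(u) < logf) {
            phi = phican;
            ++accept;
        }
    }

    printf("acceptance rate = %f\n", (double)accept / N);
    vslDeleteStream(&stream);
    return 0;
}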

 

mkl_lapack_ao_zgeqrf not located


I am running the Intel Compiler 16.0 in Visual Studio Professional 2013. I have a project that I wish to build as an x64 executable. I am able to compile and run the Win32 version, and I can compile and link the x64 version, but when I try to run it I get the message:

The procedure entry point mkl_lapack_ao_zgeqrf could not be located in the dynamic link library G:\Program Files(x86)\VNI\imsl\fnl701\Intel64\lib\imslmkl_dll.dll.

I have verified that the file exists in that path, and that there is no other file of that name in the system apart from the one in the corresponding Win32 path.

Any ideas on what I need to do to track this problem down?

Cluster Sparse Solver(cpardiso) reordering problem


Hello.

I am trying to solve a large linear system (1,000,000 x 1,000,000, bandwidth = 100 or 1,000) with cpardiso.

(The matrix type is real and symmetric indefinite.)

I have some problems with the reordering time and memory use.

CPARDISO's reordering phase is much slower than the other phases, so I checked the event times using Trace Analyzer.

CPARDISO used only one process (rank 0) for reordering, and rank 0 collected the pieces of the A matrix distributed across the other processes.

As a result, when I solved the bandwidth-1,000 system, an insufficient-memory error occurred. (The bandwidth-100 system was solved successfully.)

 

Should CPARDISO do the reordering and collect the A matrix only on rank 0?

Must rank 0 have a lot of memory to solve a large system?

How can I solve this problem?

 

The MKL version is 11.3, bundled with Parallel Studio XE 2016 Cluster Edition.

 

These are the iparm settings for cpardiso:

iparm[ 0] = 1;
iparm[ 1] = 0;   /* I also tried iparm[1] = 2 and 3 */
iparm[ 5] = 0;
iparm[ 7] = 0;
iparm[ 9] = 8;
iparm[10] = 0;
iparm[12] = 0;
iparm[17] = 0;
iparm[18] = 0;
iparm[20] = 1;
iparm[26] = 0;
iparm[27] = 0;
iparm[34] = 1;
iparm[39] = input_value[1];
iparm[40] = input_value[2];
iparm[41] = input_value[3];

 

I used 4 nodes connected by InfiniBand, and each node has 32 GB of RAM.

Thanks.

Can't get pardiso to multithread (MKL linking issue?)


Hello, I'm currently trying to get pardiso to work with multithreading, and I'm wondering whether it is a linking issue or something else. I have tried some "easy" fixes that didn't work; then I tried the link advisor and got an error when linking.

Question: How do I get pardiso to work with multiple cores?

Background:

When calling pardiso I use the following iparm

  iparm= 0
  iparm(1) = 1 ! 0 = solver default
  iparm(2) = 2
  iparm(3) = 0 !reserved, set to zero
  iparm(4) = 0 ! Preconditioned CGS/CG.
  iparm(5) = 0 !
  iparm(6) = 0 !
  iparm(7) = 0 !
  iparm(8) = 9 ! Iterative refinement step.
  iparm(9) = 0 ! reserved, set to zero
  iparm(10) = 13
  iparm(11) = 1
  iparm(12) = 0
  iparm(13) = 0
  iparm(14) = 0
  iparm(15) = 0
  iparm(16) = 0
  iparm(17) = 0
  iparm(18) = -1
  iparm(19) = -1
  iparm(20) = 0
 

The problem is "large" with

             #equations:                                     71574
             #non-zeros in A:                                3815460
             non-zeros in A (%):                            0.074479

The simplest fix

  call mkl_set_dynamic(0)  ! disable adjustment of the number of threads
  call mkl_set_num_threads(4)
  call omp_set_num_threads(4)

This is done right before I call pardiso and has no effect on the number of cores used. The next step was to check the MKL linking, so I tried the link advisor (https://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/), but I get an error.

In the advisor, I chose "Composer XE 2011", since that is where my MKL root points in the makefile (please correct me if this is a wrong assumption):

MKLroot=/software/intel/parallel_studio/composer_xe_2011_sp1.7.256/mkl/lib/intel64

My OS is Linux, the compiler is ifort (Intel Fortran), the linking is dynamic (unsure of this; trying static yields the same result/error), and the interface is LP64.

For the threading layer I'm uncertain which option to select, sequential or OpenMP. In the makefile there is "sequential" in LLIBS, but some sections of the code are already multithreaded with OpenMP (only do-loops, no explicit MKL routines).

LLIBS =  -L/software/matlab/R2011a/bin/glnxa64 -leng -lmat -lmx -lut -Wl,--start-group ${MKLroot}/libmkl_intel_lp64.a ${MKLroot}/libmkl_sequential.a ${MKLroot}/libmkl_core.a -Wl,--end-group -lpthread 

Lastly, I'm advised to use the following link line and compiler options:

 -lpthread -lm
 -openmp -mkl=parallel 

Adding the compiler options works, but when linking I get the following message/error:

$ ifort -lpthread -lm
ifort: warning #10315: specifying -lm before files may supercede the Intel(R) math library and affect performance
/software/intel/parallel_studio/composer_xe_2011_sp1.7.256/compiler/lib/intel64/for_main.o: In function `main':
/export/users/nbtester/efi2linux_nightly/branch-12_1/20111012_000000/libdev/frtl/src/libfor/for_main.c:(.text+0x38): undefined reference to `MAIN__'

I assume this means that the linking is incomplete. How can I complete it, and hopefully get pardiso to work on more than 1 core?

 

--

Sorry if this post is messy; this is my first time attempting to link anything. If any clarification is needed please ask. Any help is much appreciated.

/Viktor

 

MKL_pardiso weird behaviour with large matrices


Hello, everybody.

I am still trying to solve very large sparse systems of equations with `mkl pardiso` (single-node, multithreaded version). The solver behaves really well for small systems, but not so well for larger ones. In our case, scalability is essential.

With a large enough system of linear equations (79999 x 79999, 100528321 non-zeros), the solver returns a vector of `-nan`s without reporting any errors. The expected result is provided (`expected-result.txt`).

The example code is provided below. The data is linked here: https://www.dropbox.com/s/jcvieffrkb7ivag/data.tar.gz?dl=0. It is a `tar.gz` archive with the binary matrix representation, which is read by the example code.

There is an additional strange behaviour of the solver: the `allocation of internal data structures` step in the factorization phase dominates the runtime. It runs on a single thread (100% user time) and takes the majority of the executable's time. This behaviour is similar to that described here: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/.... I believe the core of the issue is within `parMETIS`, not `pardiso`.

The code is linked against MKL `11.2.3`, release `composer_xe_2015.3.187`.

Thank you for your assistance. I would be happy to provide additional details.

MKL inspector-executor mkl_sparse_optimize returns SPARSE_STATUS_NOT_SUPPORTED


I'm trying to get a test program working with the MKL inspector-executor framework, but I'm getting the rather opaque error code SPARSE_STATUS_NOT_SUPPORTED from mkl_sparse_optimize and it isn't clear what I'm doing wrong. I was hoping I could get some help. Below is a snippet of my code:

71             sparse_matrix_t mkl_ie_csr;
72             CHECK_IE(mkl_sparse_d_create_csr(&mkl_ie_csr,
73                     SPARSE_INDEX_BASE_ZERO, mkl_csr.m, mkl_csr.n,
74                     mkl_csr.row_pointers, rows_end, mkl_csr.columns,
75                     mkl_csr.values));
76
77             struct matrix_descr descr;
78             descr.type = SPARSE_MATRIX_TYPE_GENERAL;
79             descr.mode = SPARSE_FILL_MODE_LOWER;
80             descr.diag = SPARSE_DIAG_NON_UNIT;
81             // May return 6 (SPARSE_STATUS_NOT_SUPPORTED) if repeats == 1
82             CHECK_IE(mkl_sparse_set_mv_hint(mkl_ie_csr,
83                     SPARSE_OPERATION_NON_TRANSPOSE, descr,
84                     repeats));
85             CHECK_IE(mkl_sparse_set_memory_hint(mkl_ie_csr,
86                         SPARSE_MEMORY_AGGRESSIVE));
87             CHECK_IE(mkl_sparse_optimize(mkl_ie_csr));
88             for (r = 0; r < repeats; r++) {
89                status = mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE, 1.0,
90                        mkl_ie_csr, descr, x, 1.0, y);
91                assert(status == SPARSE_STATUS_SUCCESS);
92             }

CHECK_IE is just a macro that verifies SPARSE_STATUS_SUCCESS, and exits otherwise.

When I run this example, I get the error on line 87 after calling mkl_sparse_optimize. Now, if I remove the call to mkl_sparse_optimize, everything runs and validates against my reference data, so it seems like this is mostly correct. I just don't understand what specific misconfiguration could cause an unsupported error to be returned from mkl_sparse_optimize. Can I get any clarification on that?

Thanks!

Max

Intel MKL FATAL ERROR: Error on loading function mkl_blas_avx_xdcopy


 

Hi experts

I am trying to compile Bonmin (a mathematical programming package) using icc, icpc and MKL from Intel System Studio 2016 (on openSUSE Linux 13.1). Bonmin requires LAPACK and BLAS functions, so I am compiling Bonmin with:

-L/opt/intel/lib/intel64/ -L/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/  -lmkl_intel_lp64  -lmkl_sequential   -lmkl_core  -liomp5  -lpthread -lm

The compilation seems to work fine, and I get an executable called Bonmin. However, when I try to run Bonmin, I get the following error message:

Intel MKL FATAL ERROR: Error on loading function mkl_blas_avx_xdcopy

The directories /opt/intel/lib/intel64/ and /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/ are already in my LD_LIBRARY_PATH, and I checked that the file libmkl_avx is in /opt/intel/compilers_and_libraries/linux/mkl/lib/intel64/, so I have no idea what is wrong.

If I compile Bonmin using

-lmkl_avx -lmkl_intel_lp64  -lmkl_sequential   -lmkl_core  -liomp5  -lpthread -lm

I get the same error...

Does anyone have any ideas or tips to help me?

Thanks in advance.


Nonlinear optimization with a matrix form of constraints


I'm trying to solve a nonlinear optimization problem with a matrix of linear constraints using MKL (Intel C++ 16.0).

Although I am aware that bounds can be set for each variable x_i, e.g. LB_i <= x_i <= UB_i (as in the usage example),

I am not quite sure how to impose additional constraints in matrix form, such as Ax = b where A is an m-by-n matrix, i.e. there are m constraints on the variables x.

I'm actually trying to transition from using MATLAB's 'fmincon' function (http://kr.mathworks.com/help/optim/ug/fmincon.html) to using MKL.

Is there any way? Thanks.

MKL reproducibility


Is there any way to get deterministic results from MKL sgemm/dgemm (even if that is much slower)?

What I mean is the following: when I run dgemm or sgemm (a lot of them) on the same input data, I tend to see minor numerical differences. While not large in themselves, they can become quite significant when back-propagating through a very deep neural network (>20 layers). And they are significantly higher than with competing linear algebra packages.

Let me show you what I mean. I instantiated my network twice and initialized both instances using the same parameters. The following tables list the differences between the gradients derived using these networks (each number in the table represents the gradients for an entire parameter bucket).

Parameters (MKL)

MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
MKL_DYNAMIC=FALSE
OMP_DYNAMIC=FALSE

Results (MKL, confirmed single threaded by using MKL_VERBOSE=1)

min-diff: (0,5) -> -2.43985e-07, (0,10) -> -6.88851e-07, (0,15) -> -1.08151e-06, (0,20) -> -2.29150e-07, (0,25) -> -7.78865e-06, (0,30) -> -2.22526e-07, (0,35) -> -2.00457e-05, (0,40) -> -6.31442e-07, (0,45) -> -3.53903e-08, (0,50) -> -1.33878e-09, (0,55) -> -3.72529e-09, (0,60) -> -4.65661e-10, (0,65) -> -1.86265e-09, (0,70) -> -2.32831e-09, (0,75) -> -1.16415e-10, (0,80) -> -1.86265e-08
max-diff: (0,5) ->  3.52116e-07, (0,10) ->  6.34780e-07, (0,15) ->  9.27335e-07, (0,20) ->  2.05655e-07, (0,25) ->  6.20843e-06, (0,30) ->  2.58158e-07, (0,35) ->  2.12293e-05, (0,40) ->  6.60219e-07, (0,45) ->  2.79397e-08, (0,50) ->  1.16415e-09, (0,55) ->  5.87897e-09, (0,60) ->  5.23869e-10, (0,65) ->  1.86265e-09, (0,70) ->  2.56114e-09, (0,75) ->  1.16415e-10, (0,80) ->  1.86265e-08
rel-diff: (0,5) ->  1.70455e-03, (0,10) ->  2.38793e-03, (0,15) ->  1.39107e-03, (0,20) ->  2.02584e-03, (0,25) ->  6.83717e-04, (0,30) ->  9.16173e-04, (0,35) ->  1.73014e-04, (0,40) ->  1.49317e-04, (0,45) ->  2.10977e-07, (0,50) ->  2.14790e-07, (0,55) ->  6.37089e-08, (0,60) ->  8.91096e-08, (0,65) ->  7.81675e-09, (0,70) ->  1.67285e-07, (0,75) ->  3.78540e-10, (0,80) ->  1.72134e-07

min-diff =  min(A - B)

max-diff =  max(A - B)

rel-diff = norm(A - B) / norm(A + B)

If I bind exactly the same application against the current stable OpenBLAS implementation compiled for single threading, I get the following:

Parameters (OpenBLAS)

make BINARY=64 TARGET=SANDYBRIDGE USE_THREAD=0 MAX_STACK_ALLOC=2048

Results OpenBLAS (single threaded)

min-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00

This is actually what I would expect: since there is no multi-threading, and given the same data, exactly the same operations should happen in the same order.

Now, just for fun and because my software can do it, I replaced the BLAS calls with matching modules using cuDNN and cuBLAS function calls (NVIDIA CUDA). Please note that, unlike OpenBLAS and MKL, this is not the same code path.

Results CUDNN + CUBLAS (multi-threaded)

min-diff: (0,5) -> -3.63798e-11, (0,10) -> -1.45519e-10, (0,15) -> -1.96451e-10, (0,20) -> -4.36557e-11, (0,25) -> -1.39698e-09, (0,30) -> -8.00355e-11, (0,35) -> -3.25963e-09, (0,40) -> -2.32831e-10, (0,45) -> -3.72529e-09, (0,50) -> -2.32831e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) ->  2.91038e-11, (0,10) ->  1.40062e-10, (0,15) ->  2.18279e-10, (0,20) ->  4.72937e-11, (0,25) ->  9.31323e-10, (0,30) ->  1.01863e-10, (0,35) ->  2.79397e-09, (0,40) ->  2.32831e-10, (0,45) ->  1.86265e-09, (0,50) ->  2.91038e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) ->  2.06397e-07, (0,10) ->  5.70014e-07, (0,15) ->  2.27241e-07, (0,20) ->  3.68574e-07, (0,25) ->  1.57384e-07, (0,30) ->  2.85175e-07, (0,35) ->  8.01234e-08, (0,40) ->  1.15262e-07, (0,45) ->  1.21201e-08, (0,50) ->  1.45475e-08, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00

As you can see, I get reproducible results for the last layers (right-hand side, fully connected nn-layers = large matrix multiplications). For the other layers (left-hand side, convolution nn-layers = many small matrix multiplications) we see small differences (the CUDA manual suggests this is to be expected due to the way they schedule their underlying multi-threading implementation). Anyway, even with that, the differences have a much smaller magnitude than with MKL on the CPU.

Question:

Considering that I desire reproducibility: how can I configure MKL to produce the same, or at least more similar, results when it is invoked with the same data several times?
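For reference, MKL's Conditional Numerical Reproducibility (CNR) controls (mkl_cbwr_set, or the MKL_CBWR environment variable) are the documented knob for this; see also the "Set the CNR mode for only one function call" post below. A minimal sketch, assuming the whole run can be pinned to one code branch and that the thread count stays fixed:

#include <stdio.h>
#include "mkl.h"

int main(void)
{
    /* Must run before any other MKL call in the process.
       MKL_CBWR_COMPATIBLE is the most conservative (and slowest) choice;
       a specific branch such as MKL_CBWR_AVX may be used instead if every
       machine involved supports it. */
    if (mkl_cbwr_set(MKL_CBWR_COMPATIBLE) != MKL_CBWR_SUCCESS)
        fprintf(stderr, "could not enable CNR mode\n");

    /* ... the sgemm/dgemm calls go here; with a fixed branch and a fixed
       number of threads, repeated runs on the same data should match
       bitwise ... */
    return 0;
}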

Calling Python Developers - High performance Python powered by Intel MKL is here!


We are introducing a Technical Preview of Intel® Distribution of Python*, with packages such as NumPy* and SciPy* accelerated using Intel MKL. Python developers can now enjoy much improved performance of many mathematical and linear algebra functions, with up to ~100x speedup in some cases compared to vanilla Python distributions. The technical preview is available to everybody at no cost. Click here to register and download. For any questions, please head to the user forum.

 

dgeev is much slower than matlab eig


I tested a random 5000 x 5000 matrix using Intel MKL dgeev and MATLAB separately on the same machine (Intel(R) Core(TM) i3-4150 CPU @ 3.50GHz) and recorded the CPU time for just the eigendecomposition step. When I compile with icc ... -mkl:parallel, it takes 541 s; when I compile with icc ... -mkl:sequential, it takes 232 s. However, MATLAB's eig takes just 70 s.

Thus I have two questions:

1. Why is sequential much faster than parallel?

2. According to MATLAB, it uses Intel(R) Math Kernel Library Version 11.1.1 for the eigendecomposition, so why is it so much faster than the dgeev used in my C++ code?

Can you provide any ideas? Any suggestions on how to make the eigendecomposition faster from C++?
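For reference, one thing that is cheap to check: computing the eigenvectors is a large part of dgeev's cost, and MATLAB's eig(A) with a single output returns eigenvalues only. Below is a minimal LAPACKE sketch of an eigenvalues-only call (the matrix contents here are placeholders), which should be compared like-for-like with the MATLAB timing.

#include <stdio.h>
#include <stdlib.h>
#include "mkl.h"

int main(void)
{
    const MKL_INT n = 5000;                 /* matrix size, as in the test above */
    double *a  = (double *)malloc((size_t)n * n * sizeof(double));
    double *wr = (double *)malloc((size_t)n * sizeof(double));   /* real parts */
    double *wi = (double *)malloc((size_t)n * sizeof(double));   /* imaginary parts */
    for (MKL_INT i = 0; i < n * n; ++i)
        a[i] = (double)rand() / RAND_MAX;   /* random test matrix */

    /* jobvl = jobvr = 'N': eigenvalues only, no left/right eigenvectors. */
    MKL_INT info = LAPACKE_dgeev(LAPACK_ROW_MAJOR, 'N', 'N', n, a, n,
                                 wr, wi, NULL, n, NULL, n);
    printf("info = %lld, first eigenvalue = %f %+f*i\n",
           (long long)info, wr[0], wi[0]);

    free(a); free(wr); free(wi);
    return 0;
}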

Unable to install on OSX 10.9


I downloaded the MKL for OSX but am unable to install it. Clicking on the installer opens an empty window with all buttons greyed out. I tried running the installer from the command line, and the problem seems to be connected to Java:

$ ./install.sh
In JPanelLicenseOptions
Inside JPanelRegistrationBegin
Calling initComponents
Calling initializePanel
/tmp/intel/se/channel/msg/../msg/20151115173351155,0,Core,0,SmartEngineJavaGUI
Exception in thread "Thread-0" java.lang.NoClassDefFoundError: org/apache/xpath/XPathAPI
	at com.intel.ISSA.GUI.CoreAdapter.MessageReader.getCoreState(MessageReader.java:142)
	at com.intel.ISSA.GUI.CoreAdapter.MessageReader.run(MessageReader.java:45)
Caused by: java.lang.ClassNotFoundException: org.apache.xpath.XPathAPI
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 2 more

I am running OSX 10.9 and Java 1.8_066-b17.

As a side comment, using this forum is quite frustrating. I'm unable to attach an image to my post with Safari and I've lost my post twice already while trying.

FFT error - DftiFreeDescriptor / DftiCommitDescriptor (Fortran)


I'm getting an error from the DftiFreeDescriptor call in my FFT computation. I'm using Intel Parallel Studio XE Cluster Edition (2016, student download) with Visual Studio Community 2015. Here is the code:

seqc = CMPLX(seq)
Status1 = DftiCreateDescriptor(My_Desc2_Handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, npts)
Status1 = DftiCommitDescriptor(My_Desc2_Handle)
Status1 = DftiComputeForward(My_Desc2_Handle, seqc)
Status1 = DftiFreeDescriptor(My_Desc2_Handle)

I start with this real(8)::seq(npts) and convert it to the complex seqc(npts), where npts is something between 2000 and 6000.

Debugging each step, I have the following values for Status1 and My_Desc2_Handle:

Status1 = 0 / My_Desc2_Handle: Undefined address

Status1 = DftiCreateDescriptor(My_Desc2_Handle, DFTI_DOUBLE, DFTI_COMPLEX, 1, npts)

Status1 = 0 / My_Desc2_Handle = 0

Status1 = DftiCommitDescriptor(My_Desc2_Handle)

Status1 = 0 / My_Desc2_Handle = -620492544

Status1 = DftiComputeForward(My_Desc2_Handle, seqc)

Status1 = 0 / My_Desc2_Handle = -620492544

Status1 = DftiFreeDescriptor(My_Desc2_Handle)

Error; in the output I get this:  Invalid address specified to RtlValidateHeap( 0000000011FA0000, 0000000014312970 )

So I've tested changing npts to values around 100; then My_Desc2_Handle receives a positive value and I don't get errors from the DftiFreeDescriptor call. But if I put it up to 200, My_Desc2_Handle becomes negative and the error happens again. So I guess it is something with DftiCommitDescriptor, where it can't allocate/address the memory correctly. If I don't call DftiFreeDescriptor, I just get my FFT normally, but I'm not sure whether memory is leaking each time I run it.

Hanging on PARDISO Factorization (update MKL 11.1 update4 -> MKL 11.3)


Hello,

I'm using the sparse matrix solver (PARDISO) in MKL.

I recently updated from MKL 11.1 update 4 to MKL 11.3.

A problem that did not occur with the previous version is now happening, so I am asking about it here.

After the update, the solver hangs during the factorization phase.

 

Please help.


A question about the comparison between DSYMM and DGEMM in the BLAS package


I want to compare the performance of DSYMM and DGEMM in calculating matrix multiplication C=A*B, where A is a double precision symmetric matrix.

I use two different versions of code.

The first version is:

!time of DGEMM
call date_and_time(date,time,zone,values1)
call dgemm('n','n',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
call date_and_time(date,time,zone,values2)
time_ms1=values2(8)-values1(8)
time_ms1=1000*(values2(7)-values1(7))+time_ms1
time_ms1=60*1000*(values2(6)-values1(6))+time_ms1


!time DSYMM
call date_and_time(date,time,zone,values1)
call dsymm('L','U',n,n,1.0d0,a1,n,b1,n,0.0d0,c1,n)
call date_and_time(date,time,zone,values2)
time_ms2=values2(8)-values1(8)
time_ms2=1000*(values2(7)-values1(7))+time_ms2
time_ms2=60*1000*(values2(6)-values1(6))+time_ms2

!print out the time
print*,time_ms1,time_ms2

 

Unlike the first one, in the second version I call DGEMM/DSYMM once before timing them.

In detail, the second version is:

call dgemm('n','n',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)

call date_and_time(date,time,zone,values1)
call dgemm('n','n',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
call date_and_time(date,time,zone,values2)
time_ms1=values2(8)-values1(8)
time_ms1=1000*(values2(7)-values1(7))+time_ms1
time_ms1=60*1000*(values2(6)-values1(6))+time_ms1


call dsymm('L','U',n,n,1.0d0,a1,n,b1,n,0.0d0,c1,n)


call date_and_time(date,time,zone,values1)
call dsymm('L','U',n,n,1.0d0,a1,n,b1,n,0.0d0,c1,n)
call date_and_time(date,time,zone,values2)
time_ms2=values2(8)-values1(8)
time_ms2=1000*(values2(7)-values1(7))+time_ms2
time_ms2=60*1000*(values2(6)-values1(6))+time_ms2
print*,time_ms1,time_ms2

 

When I set n = 600 in the code and run the calculations on a 12-core Dell PC, in the first version the times used by DGEMM and DSYMM are ~35 ms and ~11 ms, indicating that DSYMM is faster than DGEMM.

But in the second version, the times of DGEMM and DSYMM are not very stable: sometimes they are 21 ms vs. 23 ms, sometimes 7 ms vs. 8 ms. So in the second version, DGEMM and DSYMM have similar performance.

I am confused about why the two versions of the code lead to different conclusions. Why are the times reported by the second version not very stable? Which one gives the right answer to the question of which routine performs better for matrix multiplication involving a symmetric matrix?

c# Marshalling LAPACKE_zgesvxx


 

I implemented a C# program that tests the LAPACKE_zgesvxx function.

The example works on the x64 platform but fails on x86, returning an info = -7 error.

I suppose there is a difference in the function declaration, but I don't know where, or how to solve it.

I made many changes in the declaration part, but with 25 parameters it became quite difficult to find the solution.

any ideas?

I attach the example code and the data files used for testing.

 

Thank you

Gianluca

Attachments:
MKLTest.zip (18.21 KB)
sol2.zip (3.34 MB)

Set the CNR mode for only one function call


Hi everyone,

I would like to do something like:

... // a lot of stuff with several calls to MKL routines
int cbwrStatus = mkl_cbwr_set(MKL_CBWR_AUTO);
int status = LAPACKE_dsyev( LAPACK_ROW_MAJOR, 'V', 'L', rank, matrix, rank, eigenValues );
mkl_cbwr_set(MKL_CBWR_BRANCH_OFF);
... // a lot of stuff with several calls to MKL routines

Here, mkl_cbwr_set calls fail with MKL_CBWR_ERR_MODE_CHANGE_FAILURE, obviously.

So, two questions:

- Why is it forbidden (impossible?) to change the CNR mode after a call to some MKL function?

- Is there a way to set the CNR mode for only one function?

In my case, "dsyev" is the only function that is really critical for reproducibility. I do not want to force the CNR mode at the beginning of, or before, the execution, because I am afraid of a performance regression.

Thanks in advance for your help,

Guix

HPCC - using mkl=cluster and crashing in MPIFFT


I am trying to run the HPCC benchmark on our cluster.

The system has Intel compilers and MKL release 2015.5.223

I am also compiling with Open MPI version 1.8.5 (though I do not think this is the problem).

I am following https://software.intel.com/en-us/articles/performance-tools-for-software-developers-use-of-intel-mkl-in-hpcc-benchmark

It compiled OK, but the article seems a bit out of date.

With an up-to-date MKL, should I compile with -mkl=cluster for HPCC?

Are there any more up-to-date write-ups?

When I run it, it crashes in the MPIFFT portion of the run.

For what it's worth:

[comp06:10941] *** Process received signal ***

[comp06:10941] Signal: Segmentation fault (11)

[comp06:10941] Signal code: Address not mapped (1)

[comp06:10941] Failing at address: 0x9b

[comp12:10910] [comp09:10876] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2aaaaacde130]

[comp09:10876] [ 1] [comp12:10928] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2aaaaacde130]

[comp12:10928] [ 1] [comp09:10890] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2aaaaacde130]

[comp09:10890] [ 1] [comp13:10948] Signal: Segmentation fault (11)

[comp13:10948] Signal code: Address not mapped (1)

[comp13:10948] Failing at address: 0x9b

[comp07:10876] [ 0] /lib64/libpthread.so.0(+0xf130)[0x2aaaaacde130]

[comp07:10876] [ 1] [comp06:10945] *** Process received signal ***

[comp06:10945] Signal: Segmentation fault (11)

[comp06:10945] Signal code: Address not mapped (1)

[comp06:10945] Failing at address: 0x9b

/cm/shared/apps/openmpi/intel/64/1.8.5/lib64/libmpi.so.1(MPI_Comm_size+0x59)[0x2aaaab250969]

[comp12:10928] [ 2] ./hpcc[0x198e560]

[comp12:10928] [ 3] ./hpcc[0x4537f4]

 

Using MKL 11.3 with Intel Parallel Studio 2015 XE (Windows)


I am using MKL with Visual Studio and Intel Parallel Studio 2015

For various reasons I want to stay with the Intel Compiler 2015, but I want to use MKL 11.3. The Intel integrations with Visual Studio make using MKL, IPP or TBB very easy: you just select them in the "Intel Performance Libraries" option. The problem is that when using Intel Compiler 2015, the MKL that is used for compiling and linking is MKL 11.2.

Is there a 'better' way to select MKL 11.3, apart from manually setting the include and link paths in my projects?
