Is there any way to get deterministic results from MKL sgemm/dgemm (even if that is much slower)?
What I mean is the following: when I run a lot of dgemm or sgemm calls on the same input data, I tend to see minor numerical differences between runs. While small individually, they can become quite significant when back-propagating through a very deep neural network (>20 layers), and they are significantly larger than with competing linear algebra packages.
Let me show you what I mean. I instantiated my network twice and initialized both instances with the same parameters. The following tables list the differences between the gradients computed by the two instances (each number in a table represents the gradient difference for an entire parameter bucket).
Parameters (MKL)
MKL_NUM_THREADS=1
OMP_NUM_THREADS=1
MKL_DYNAMIC=FALSE
OMP_DYNAMIC=FALSE
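For reference, the same restrictions can also be applied at runtime through the MKL and OpenMP service functions. I set them via the environment, but a call sequence like the following sketch, issued before the first BLAS call, should have the same effect:

#include <mkl.h>
#include <omp.h>

// Programmatic equivalent of the environment variables listed above.
void force_single_threaded_blas() {
    mkl_set_num_threads(1);  // MKL_NUM_THREADS=1
    omp_set_num_threads(1);  // OMP_NUM_THREADS=1
    mkl_set_dynamic(0);      // MKL_DYNAMIC=FALSE
    omp_set_dynamic(0);      // OMP_DYNAMIC=FALSE
}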
Results (MKL, confirmed single-threaded via MKL_VERBOSE=1)
min-diff: (0,5) -> -2.43985e-07, (0,10) -> -6.88851e-07, (0,15) -> -1.08151e-06, (0,20) -> -2.29150e-07, (0,25) -> -7.78865e-06, (0,30) -> -2.22526e-07, (0,35) -> -2.00457e-05, (0,40) -> -6.31442e-07, (0,45) -> -3.53903e-08, (0,50) -> -1.33878e-09, (0,55) -> -3.72529e-09, (0,60) -> -4.65661e-10, (0,65) -> -1.86265e-09, (0,70) -> -2.32831e-09, (0,75) -> -1.16415e-10, (0,80) -> -1.86265e-08
max-diff: (0,5) -> 3.52116e-07, (0,10) -> 6.34780e-07, (0,15) -> 9.27335e-07, (0,20) -> 2.05655e-07, (0,25) -> 6.20843e-06, (0,30) -> 2.58158e-07, (0,35) -> 2.12293e-05, (0,40) -> 6.60219e-07, (0,45) -> 2.79397e-08, (0,50) -> 1.16415e-09, (0,55) -> 5.87897e-09, (0,60) -> 5.23869e-10, (0,65) -> 1.86265e-09, (0,70) -> 2.56114e-09, (0,75) -> 1.16415e-10, (0,80) -> 1.86265e-08
rel-diff: (0,5) -> 1.70455e-03, (0,10) -> 2.38793e-03, (0,15) -> 1.39107e-03, (0,20) -> 2.02584e-03, (0,25) -> 6.83717e-04, (0,30) -> 9.16173e-04, (0,35) -> 1.73014e-04, (0,40) -> 1.49317e-04, (0,45) -> 2.10977e-07, (0,50) -> 2.14790e-07, (0,55) -> 6.37089e-08, (0,60) -> 8.91096e-08, (0,65) -> 7.81675e-09, (0,70) -> 1.67285e-07, (0,75) -> 3.78540e-10, (0,80) -> 1.72134e-07
Here A and B are the corresponding gradient buckets from the two network instances, and:
min-diff = min(A - B)
max-diff = max(A - B)
rel-diff = norm(A - B) / norm(A + B)
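For clarity, this is roughly how each bucket is compared; a sketch with a hypothetical helper (the arithmetic follows the definitions above, accumulation in double):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>

struct BucketDiff { double min_diff, max_diff, rel_diff; };

// Compare the gradients of one parameter bucket from the two network instances.
BucketDiff compare_bucket(const float* A, const float* B, std::size_t n) {
    double mn = std::numeric_limits<double>::infinity();
    double mx = -std::numeric_limits<double>::infinity();
    double num = 0.0, den = 0.0;  // accumulate ||A - B||^2 and ||A + B||^2
    for (std::size_t i = 0; i < n; ++i) {
        const double d = static_cast<double>(A[i]) - static_cast<double>(B[i]);
        const double s = static_cast<double>(A[i]) + static_cast<double>(B[i]);
        mn = std::min(mn, d);
        mx = std::max(mx, d);
        num += d * d;
        den += s * s;
    }
    return { mn, mx, std::sqrt(num) / std::sqrt(den) };
}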
If I link exactly the same application against the current stable OpenBLAS release, compiled for single threading, I get the following:
Parameters (OpenBLAS)
make BINARY=64 TARGET=SANDYBRIDGE USE_THREAD=0 MAX_STACK_ALLOC=2048
Results (OpenBLAS, single-threaded)
min-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) -> 0.00000e+00, (0,10) -> 0.00000e+00, (0,15) -> 0.00000e+00, (0,20) -> 0.00000e+00, (0,25) -> 0.00000e+00, (0,30) -> 0.00000e+00, (0,35) -> 0.00000e+00, (0,40) -> 0.00000e+00, (0,45) -> 0.00000e+00, (0,50) -> 0.00000e+00, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
This is exactly what I would expect: with no multi-threading and identical input data, the same operations should be executed in the same order, so the results should be bit-identical.
Now, just for fun and because my software supports it, I replace the BLAS calls with the matching cuDNN and cuBLAS modules (NVIDIA CUDA). Note that, unlike the OpenBLAS/MKL comparison, this does not exercise the same code path.
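To make the comparison concrete, here is a rough sketch of what a single GEMM boils down to on the two paths. The helper names are hypothetical, data is assumed column-major, the cuBLAS handle is created once elsewhere with cublasCreate, and error handling is omitted:

#include <mkl.h>            // or <cblas.h> when linking against OpenBLAS
#include <cublas_v2.h>
#include <cuda_runtime.h>

// CPU path: C (m x n) = A (m x k) * B (k x n), column-major.
void gemm_cpu(int m, int n, int k, const float* A, const float* B, float* C) {
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, 1.0f, A, m, B, k, 0.0f, C, m);
}

// GPU path: same GEMM via cuBLAS; copies operands to the device and the result back.
void gemm_gpu(cublasHandle_t handle, int m, int n, int k,
              const float* A, const float* B, float* C) {
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * m * k);
    cudaMalloc((void**)&dB, sizeof(float) * k * n);
    cudaMalloc((void**)&dC, sizeof(float) * m * n);
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(C, dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}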
Results (cuDNN + cuBLAS, multi-threaded)
min-diff: (0,5) -> -3.63798e-11, (0,10) -> -1.45519e-10, (0,15) -> -1.96451e-10, (0,20) -> -4.36557e-11, (0,25) -> -1.39698e-09, (0,30) -> -8.00355e-11, (0,35) -> -3.25963e-09, (0,40) -> -2.32831e-10, (0,45) -> -3.72529e-09, (0,50) -> -2.32831e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
max-diff: (0,5) -> 2.91038e-11, (0,10) -> 1.40062e-10, (0,15) -> 2.18279e-10, (0,20) -> 4.72937e-11, (0,25) -> 9.31323e-10, (0,30) -> 1.01863e-10, (0,35) -> 2.79397e-09, (0,40) -> 2.32831e-10, (0,45) -> 1.86265e-09, (0,50) -> 2.91038e-10, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
rel-diff: (0,5) -> 2.06397e-07, (0,10) -> 5.70014e-07, (0,15) -> 2.27241e-07, (0,20) -> 3.68574e-07, (0,25) -> 1.57384e-07, (0,30) -> 2.85175e-07, (0,35) -> 8.01234e-08, (0,40) -> 1.15262e-07, (0,45) -> 1.21201e-08, (0,50) -> 1.45475e-08, (0,55) -> 0.00000e+00, (0,60) -> 0.00000e+00, (0,65) -> 0.00000e+00, (0,70) -> 0.00000e+00, (0,75) -> 0.00000e+00, (0,80) -> 0.00000e+00
As you can see, I get fully reproducible results for the last layers (right-hand side: fully connected layers, i.e. large matrix multiplications). For the earlier layers (left-hand side: convolution layers, i.e. many small matrix multiplications) there are small differences, which the CUDA documentation suggests are to be expected because of how work is scheduled in the underlying multi-threading implementation. Even so, the differences are much smaller in magnitude than what I get with MKL on the CPU.
Question:
Given that I need reproducibility: how can I configure MKL so that repeated invocations on the same input data produce identical, or at least much more similar, results?