DGESVD slow compared with Intel implementation #1077
Comments
One trivial factor is probably that the OpenBLAS build system did not pass the "-O2" compiler option to the Fortran compiler until very recently. As #843 showed, this has a noticeable impact on calculations, at least compared to netlib LAPACK (practically identical code, but built with the -O2 optimization level by default). You need to either correct the "override FFLAGS" line in Makefile.system or set the FFLAGS environment variable accordingly before invoking "make".
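A quick way to see whether a given source tree is affected is to look at the "override FFLAGS" line directly. The following is only a sketch: the helper name fflags_has_o2 and the sample file content are made up for illustration, and the real line in Makefile.system may look different.

```shell
#!/bin/sh
# Sketch only: report whether an "override FFLAGS" line in a makefile mentions -O2.
# fflags_has_o2 is a hypothetical helper; the sample content below is invented.

fflags_has_o2() {
    # $1: path to a makefile; succeeds if an "override FFLAGS" line contains -O2
    grep -E '^[[:space:]]*override[[:space:]]+FFLAGS' "$1" | grep -q -e '-O2'
}

sample=$(mktemp)
printf 'override FFLAGS += $(SOME_OPTS)\n' > "$sample"

if fflags_has_o2 "$sample"; then
    echo "FFLAGS already has -O2"
else
    echo "FFLAGS lacks -O2"
fi
rm -f "$sample"
```

In a real tree you would point the helper at Makefile.system instead of the temporary file.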
Are you actually running with OpenBLAS?
You are not using OpenBLAS by default. You have to rebuild Scilab to use OpenBLAS, whose port in turn is "good enough", i.e. likely half-speed LAPACK versus MKL without what @martin-frbg offered.
I did a rebuild before opening this issue. This is my system status:
[ota@nostromo /usr/ports/math/openblas]$ ldd /usr/local/bin/scilab-bin | grep blas
[ota@nostromo /usr/ports/math/openblas]$ pkg info scilab
What is your CPU? Maybe it is not detected by 0.2.18? (It should not be 40x slower.)
i7 3517U. I'm rebuilding OpenBLAS without OpenMP support. I have a feeling that OpenMP is running only two threads.
You can set OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1,1 to disable threading; there is no need to rebuild.
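For a one-off test the variables can be set for a single command only, so the rest of the session is not affected. A minimal sketch, assuming a POSIX shell; run_single_threaded is a hypothetical helper name, and a harmless command stands in for Scilab here.

```shell
#!/bin/sh
# Sketch: run any command with the two threading variables set to 1,
# without exporting them into the current shell.
# run_single_threaded is a hypothetical helper name.

run_single_threaded() {
    env OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1 "$@"
}

# Demo: show what the child process sees.
run_single_threaded sh -c 'echo "OPENBLAS_NUM_THREADS=$OPENBLAS_NUM_THREADS OMP_NUM_THREADS=$OMP_NUM_THREADS"'
```

The same wrapper could be put in front of the Scilab binary. Note that the variables can only matter if the library was built with threading support in the first place.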
With 1 thread:
A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 46.151
This machine is virtualized under VirtualBox and I'm running with 2 virtual CPUs. Maybe this is related?
Yes, VirtualBox always filters AVX2 and FMA4, and often AVX and SSE 4, depending on version and luck.
I did not enable AVX2 because my processor does not have this instruction set, but AVX and SSE 4 are enabled at compilation.
So far you have discovered virtualisation overhead. Also, the CPUID (from the real CPU) is overdue. Can you repeat the measurement on a real CPU?
root@nostromo:~ # grep -e Features -e CPU /var/log/dmesg.today
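To check a single instruction set in that kind of output, the feature list can be searched for an exact flag name. This is only a sketch: it assumes FreeBSD-style lines of the form Features2=0x...<A,B,C>, the helper name has_cpu_flag is hypothetical, and the sample line is invented rather than taken from either machine.

```shell
#!/bin/sh
# Sketch: test whether a dmesg-style "Features...=<...>" line lists a given flag.
# has_cpu_flag is a hypothetical helper; the sample line below is invented.

has_cpu_flag() {
    # $1: flag name (e.g. AVX); reads dmesg-style text on stdin.
    # The flag must be delimited by '<', ',' or '>' so that AVX does not match AVX2.
    grep -E 'Features[0-9]*=' | grep -E -q "[<,]$1[,>]"
}

sample='  Features2=0x1<SSE3,SSSE3,AVX>'

for flag in AVX AVX2; do
    if printf '%s\n' "$sample" | has_cpu_flag "$flag"; then
        echo "$flag: listed"
    else
        echo "$flag: not listed"
    fi
done
```

On a real system the input would come from /var/log/dmesg.today, once inside the virtual machine and once on the host.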
I have a non-virtualized machine, but it is a little old. I'm rebuilding there to test.
root@squitch:/home/ota # grep -e Features -e CPU /var/log/dmesg.today
The rebuild in the virtualized machine without OpenMP is done. No luck, same behavior. So it is not OpenMP related.
Since your CPU supports no virtualisation, VirtualBox completely emulates the CPU seen in the virtual machine. The more advanced the instruction set you use, the slower the C emulation gets.
But the Intel specs for this CPU claim that it supports virtualization: http://ark.intel.com/pt-BR/products/65714/Intel-Core-i7-3517U-Processor-4M-Cache-up-to-3_00-GHz
Can you get serious and choose the processor for both timing tests, and especially confirm on WHICH CPU your initial timings were measured?
What did the OpenBLAS build detect your (virtual) CPU as? (There will be a libopenblas_cputype.so in addition to libopenblas.so.) The actual i7-35xx will be sandybridge, I guess.
I have finished the tests on my old machine. Same behavior. Now there is a real machine and a virtualized one with the same behavior. These are the real machine results:
[ota@squitch ~]$ OPENBLAS_NUM_THREADS=1; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=1 ; export OMP_NUM_THREADS
-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
-->exit
-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
-->exit
-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
On both installs there is no libopenblas_cputype.so.
Name : openblas
There is an option in the FreeBSD port that claims to add support for "multiple CPU types". I'm disabling this option and rebuilding to test. Maybe it is disabling optimizations.
You have to download the OpenBLAS 0.2.18 tarball and type 'make' (with gcc and gfortran available). It should build sandybridge-specific code only and confirm this in a short summary at the end of the build (or barf out with errors if it did not detect the CPU).
E5-2697 v2 (a bit bigger cache and more hertz, yours could be 1.5-2x slower)
The only libraries created are:
-rw-r--r-- 1 root wheel 68136560 3 fev 14:33 work/OpenBLAS-0.2.19/libopenblasp-r0.2.19.a
I mean not the package build but an independent build in a new directory from the source tarball.
Could it be that the FreeBSD source package has the DYNAMIC_ARCH=1 option permanently set in Makefile.rule, in addition to providing it as a command line option to "make"? Possibly you will find the CPU name in the config.h file that is produced as part of the build process.
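If config.h is long, listing just the macro names and values makes the detected target easier to spot. A rough sketch follows; list_defines is a hypothetical helper, and of the two sample lines only OS_FREEBSD appears in this thread, while EXAMPLE_CORE is an invented placeholder and not a real OpenBLAS macro.

```shell
#!/bin/sh
# Sketch: print NAME=VALUE for every "#define NAME VALUE" line read from stdin.
# list_defines is a hypothetical helper. EXAMPLE_CORE is an invented macro name.

list_defines() {
    awk '$1 == "#define" { print $2 "=" $3 }'
}

list_defines <<'EOF'
#define OS_FREEBSD 1
#define EXAMPLE_CORE 1
EOF
```

Against a real build this would be run as list_defines < config.h in the build directory.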
Here is the config.h:
#define OS_FREEBSD 1
Makefile.rule
Your CPU is detected as Sandy Bridge, which matches the Ivy Bridge listed in its specifications. Will you be able to extract the dgesvd_ arguments on FreeBSD using gdb? (break dgesvd_ ; run; etc.)
Thank you - so it seems to have identified your CPU correctly. That leaves the silly issue with the missing -O2 in Makefile.system - could you try adding that to the "override FFLAGS" line there and recompile once more? (Though I do wonder if that alone could account for a 37-fold speed difference.)
@martin-frbg when I'm compiling I see lots of -O2 on the command line. Look:
@brada4 do you need this to be done in Scilab, or could it be done with a small example program? Debugging Scilab looks a bit tricky:
[ota@nostromo /usr/home/ota/workspace/SOFTWARE_Espelho/ExemploReadFromFile]$ gdb /usr/local/bin/scilab-cli-bin
Program exited with code 01.
Wrong dwarf version means gdb was compiled with a different gcc than the module examined. It is not a problem per se.
Now type svd() in Scilab first. Press 'c' (continue) until the Scilab prompt returns. Repeat svd in Scilab. Back to breakpoints, now in gdb: (wrap the debugger with 'script'; it is not comfortable to scroll through your 3-screen postings, a text file with a 2-sentence description works better)
Certainly "lots of -O2" in the output, but note that these are all in the gcc calls used to build the BLAS part. The LAPACK part is written in Fortran, and the gfortran calls do/did not get the -O2.
Also get 'info shared', i.e. which modules are loaded in the process...
No need to debug as it stands in current FreeBSD ports: Regarding OpenMP and compiling: it matches the Makefile.rule setting, and you don't need it for single-threaded Scilab. It would be best to notify the arpack and scikit FreeBSD port maintainers about this mishap and ask them to save you (or you learn to add openblas to the arpack package and submit a diff). There is nothing that OpenBLAS can do if the library loader overrides it with another package. LD_PRELOAD may or may not help to work around this.
@brada4 the ldd output linking his scilab binary to libopenblasp.so was in the fifth post, so I do not get what you are ranting about?
ldd shows just the first level of imports.
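Whatever tool produces the list of loaded libraries (ldd on the binary, or 'info shared' inside gdb), the thing to look for is more than one BLAS/LAPACK provider in the same process. A small sketch that filters ldd-style output for such names; blas_like_libs is a hypothetical helper, the name patterns are a guess at common providers, and the sample lines are invented.

```shell
#!/bin/sh
# Sketch: from ldd-style output on stdin, print the library names that look like
# BLAS/LAPACK providers. blas_like_libs is a hypothetical helper; the pattern
# list is an assumption and the sample lines below are invented.

blas_like_libs() {
    awk '{ print $1 }' | grep -E -i 'blas|lapack|atlas|mkl' | LC_ALL=C sort -u
}

blas_like_libs <<'EOF'
    libopenblasp.so.0 => /usr/local/lib/libopenblasp.so.0 (0x800a00000)
    libblas.so.2 => /usr/local/lib/libblas.so.2 (0x800c00000)
    libc.so.7 => /lib/libc.so.7 (0x800e00000)
EOF
```

If more than one name comes back, as in the invented sample, it is worth checking which library actually supplies dgesvd_ at run time.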
@martin-frbg
I'm using arpack-ng with Scilab. I made a patch so that arpack-ng uses openblas instead of gotoblas. The patch follows.
After adding -O2 the performance is improved.
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
Attached is the gdb debug log, as suggested by @brada4.
It is not only DGESVD (the Scilab manual does not tell the complete story).
About -DMAX_CPU_NUMBER=1: is this correct even for a multicore CPU?
Without -D please
MAX_CPU_NUMBER appears to be used interchangeably with NUM_THREADS nowadays, while NUM_CORES in config.h seems to reflect the number of physically separate CPU dies rather than cores. So for a dual-core x86 capable of hyperthreading, a default build would have NUM_CORES=1 in config.h and -DMAX_CPU_NUMBER=4 on the command lines of the compiler. Your -DMAX_CPU_NUMBER=1 would then appear to be for a build with no multithreading support (consistent with the "NUM_THREADS=1 USE_THREAD=0" seen on the line below "Building for openblas-0.2.19,1" in the typescript.txt you posted - apologies for not looking at that earlier, but we were only concerned with the -O2 then). Setting OPENBLAS_NUM_THREADS in the tests above cannot have had any effect then. Perhaps your benchmark results would be even closer to the Windows/MKL ones if run with two threads?
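Since the value only shows up on the compiler command lines, it can be pulled out of a saved build log rather than read by eye. A minimal sketch; max_cpu_number is a hypothetical helper name and the two sample compile lines are invented.

```shell
#!/bin/sh
# Sketch: print the distinct values of -DMAX_CPU_NUMBER=<n> found in a build log
# read from stdin. max_cpu_number is a hypothetical helper; the sample lines
# below are invented, not taken from the posted typescript.

max_cpu_number() {
    sed -n 's/.*-DMAX_CPU_NUMBER=\([0-9][0-9]*\).*/\1/p' | LC_ALL=C sort -u
}

max_cpu_number <<'EOF'
cc -O2 -DMAX_CPU_NUMBER=1 -c example_a.c
cc -O2 -DMAX_CPU_NUMBER=1 -c example_b.c
EOF
```

A single value of 1 would be consistent with a build without multithreading, as described above; with a log captured by 'script' the input would be that file.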
When I set the number of CPUs to 1 in the VirtualBox manager I get better performance than with 2 CPUs. So I suppose that only one thread is working to solve the SVD. Testing with procstat seems to confirm this. Even when using "OPENBLAS_NUM_THREADS=3; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=3 ; export OMP_NUM_THREADS" it only starts with three threads, and after a short time only one thread appears to be working; no CPU is at 100%. Look:
PID TID COMM TDNAME CPU PRI STATE WCHAN
So, to me, it looks correct to suppose that only one thread is doing the work even on a multicore CPU. Is it correct to suppose that this is a Scilab issue, or must OpenBLAS create a number of threads that matches the number of CPUs?
You can read the parameters in Makefile.rule; the default build uses pthreads and the number and type of CPUs that your machine has. It is normal that the first thread, which is the main thread, does small matrix operations; the other threads will be employed for bigger matrices only.
I wonder if this is relevant and Intel did it already before:
Dears, when I run this command in Scilab on FreeBSD, compiled with OpenBLAS, it takes 37 seconds:
A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
ans =
The same command on Windows with the Intel library takes 1 second:
A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
ans =
Could you please give me a hint why it is so different?