DGESVD slow compared with Intel implementation #1077

Closed
OtacilioNeto opened this issue Feb 2, 2017 · 49 comments

@OtacilioNeto

Dear all, when I run this command in Scilab on FreeBSD, built against OpenBLAS, it takes 37 seconds:
A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
ans =

37.276 

The same command on Windows with the Intel library takes about 1 second:

A1=rand(1000, 1000); tic();[S, v, D]=svd(A1);toc()
ans =

1.382  

Could you please give me a hint as to why the timings are so different?

@martin-frbg
Collaborator

One trivial factor is probably that, until very recently, the OpenBLAS build system did not pass the "-O2" compiler option to the Fortran compiler. As #843 showed, this has a noticeable impact on calculations, at least compared to netlib LAPACK (practically identical code, but built with the -O2 optimization level by default). You need to either correct the "override FFLAGS" line in Makefile.system or set the FFLAGS environment variable accordingly before invoking "make".
A less trivial factor could be that you are running OpenBLAS multithreaded with too many threads to be efficient, while MKL may be smart enough to use only one or two threads in this case. If so, running with OPENBLAS_NUM_THREADS=1 or 2, or even building OpenBLAS without thread support, may improve the timing.

@brada4
Contributor

brada4 commented Feb 2, 2017

Are you sure you are using OpenBLAS?

R> x<-matrix(rnorm(1e6),1e3,1e3)
R> system.time(r<-svd(x))
   user  system elapsed 
  2.516   0.824   1.032 

@brada4
Contributor

brada4 commented Feb 2, 2017

You are not using OpenBLAS by default:
https://svnweb.freebsd.org/ports/head/math/scilab/Makefile?view=markup#l48

You have to rebuild Scilab to use OpenBLAS. The OpenBLAS port in turn is "good enough", i.e. its LAPACK is likely half the speed of MKL unless you apply what @martin-frbg suggested.
https://svnweb.freebsd.org/ports/head/math/openblas/Makefile?view=markup#l43

@OtacilioNeto
Author

I did a rebuild before opening this issue. This is my system status:

[ota@nostromo /usr/ports/math/openblas]$ ldd /usr/local/bin/scilab-bin | grep blas
libopenblasp.so.0 => /usr/local/lib/libopenblasp.so.0 (0x808a00000)

[ota@nostromo /usr/ports/math/openblas]$ pkg info scilab
scilab-5.5.2_4
Name : scilab
Version : 5.5.2_4
Installed on : Thu Feb 2 15:28:49 2017 BRT
Origin : math/scilab
Architecture : freebsd:11:x86:64
Prefix : /usr/local
Categories : math java cad
Licenses :
Maintainer : [email protected]
WWW : http://www.scilab.org
Comment : Scientific software package for numerical computations
Options :
ATLAS : off
GUI : on
NETLIB : off
OCAML : on
OPENBLAS : on
TK : on
Shared Libs required:
libcurl.so.4
libgcc_s.so.1
libpcre.so.1
libfftw3.so.3
libarpack.so.2
libopenblasp.so.0
libumfpack.so.1
libstdc++.so.6
libxml2.so.2
libmatio.so.4
libcolamd.so.1
libintl.so.8
libtk86.so.1
libpcreposix.so.0
libcholmod.so.1
libtcl86.so.1
libsuitesparseconfig.so.1
libquadmath.so.0
libhdf5_hl.so.100
libamd.so.1
libhdf5.so.100
libgfortran.so.3
libomp.so.0

@brada4
Contributor

brada4 commented Feb 3, 2017

What is your CPU? Maybe it is not detected by 0.2.18? (It should not be 40x slower.)

@OtacilioNeto
Author

i7 3517U

I'm rebuilding OpenBLAS without OpenMP support. I have a feeling that OpenMP is running only two threads.

@brada4
Contributor

brada4 commented Feb 3, 2017

You can set OPENBLAS_NUM_THREADS=1 OMP_NUM_THREADS=1,1 to disable threading, no need to rebuild.
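
A minimal sketch of how such a sweep over thread counts could be scripted without rebuilding anything. The helper name `run_with_threads` and the placeholder command `ECHO_CMD` are assumptions made for illustration; the placeholder only echoes the variable it was started with and would need to be replaced by the real benchmark invocation.

```python
import os
import subprocess
import sys


def run_with_threads(cmd, n_threads):
    """Run cmd with the OpenBLAS and OpenMP thread counts set to n_threads."""
    env = dict(os.environ)
    env["OPENBLAS_NUM_THREADS"] = str(n_threads)
    env["OMP_NUM_THREADS"] = str(n_threads)
    result = subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Hypothetical stand-in for the benchmark: a child process that prints the
# thread count it inherited. Swap in the actual benchmark command here.
ECHO_CMD = [sys.executable, "-c",
            "import os; print(os.environ['OPENBLAS_NUM_THREADS'])"]

for n in (1, 2):
    print(n, "->", run_with_threads(ECHO_CMD, n))
```

The variables are only read when the library is loaded, so they have to be set in the environment of the process that loads OpenBLAS, which is why the sketch passes them through `env` rather than setting them afterwards.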

@OtacilioNeto
Author

OtacilioNeto commented Feb 3, 2017

1 thread A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 46.151
2 thread A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 45.244
3 thread A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 44.917
4 thread A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 45.274
5 thread A1=rand(1000, 1000);tic(); [S, v, D]=svd(A1); toc() -> ans = 47.004

This machine is virtualized under VirtualBox and I'm running it with 2 virtual CPUs. Maybe this is related?
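
To put a number on how little the thread count matters in the five timings above, here is a small sketch that computes the spread between the fastest and slowest run. It uses only the values quoted in this comment; the variable names are illustrative.

```python
# Wall-clock seconds reported above, keyed by the requested thread count.
timings = {1: 46.151, 2: 45.244, 3: 44.917, 4: 45.274, 5: 47.004}

fastest = min(timings.values())
slowest = max(timings.values())

# Relative spread between the best and worst run, in percent.
spread_pct = 100.0 * (slowest - fastest) / fastest

print(f"fastest {fastest:.3f} s, slowest {slowest:.3f} s, spread {spread_pct:.1f}%")
```

A spread of under 5% across one to five requested threads suggests the thread setting had essentially no effect on this workload.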

@brada4
Contributor

brada4 commented Feb 3, 2017

Yes, VirtualBox always filters AVX2 and FMA4, and often AVX and SSE 4, depending on version and luck.

@OtacilioNeto
Author

I did not enable AVX2 because my processor does not have this instruction set, but AVX and SSE 4 are enabled at compilation.

@brada4
Contributor

brada4 commented Feb 3, 2017

So far you have discovered either virtualisation overhead,
or the impact of a fake CPUID presented by the virtual machine,
or simply a very recent or very rare CPU that is handled by generic computation kernels.

Also, the CPUID (from the real CPU) is overdue:
$ grep -e Features -e CPU /var/log/dmesg.boot > ~/cpuid.txt

Can you repeat measurement on real CPU?

@OtacilioNeto
Author

root@nostromo:~ # grep -e Features -e CPU /var/log/dmesg.today
CPU: Intel(R) Core(TM) i7-3517U CPU @ 1.90GHz (2394.63-MHz K8-class CPU)
Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
Features2=0xdc982203<SSE3,PCLMULQDQ,SSSE3,CX16,SSE4.1,SSE4.2,POPCNT,XSAVE,OSXSAVE,AVX,RDRAND,HV>
AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
AMD Features2=0x1
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0: on acpi0
cpu1: on acpi0
SMP: AP CPU #1 Launched!
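
A small sketch for checking which instruction-set flags the guest actually sees, using the Features2 line pasted above as input. The helper `parse_flags` is hypothetical, and the reading of `HV` as the hypervisor-present bit is an assumption about how FreeBSD labels it.

```python
import re


def parse_flags(line):
    """Return the set of flag names between '<' and '>' on a dmesg Features line."""
    match = re.search(r"<([^>]*)>", line)
    return set(match.group(1).split(",")) if match else set()


# Copied from the dmesg output above (virtualized machine).
features2 = ("Features2=0xdc982203<SSE3,PCLMULQDQ,SSSE3,CX16,SSE4.1,SSE4.2,"
             "POPCNT,XSAVE,OSXSAVE,AVX,RDRAND,HV>")

flags = parse_flags(features2)
for name in ("SSE4.1", "SSE4.2", "AVX", "FMA", "HV"):
    print(f"{name:7s}", "yes" if name in flags else "no")
```

On this line AVX and SSE4.x are present, so the guest is not being denied the instruction sets that the Sandy Bridge kernels rely on.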

@OtacilioNeto
Author

I have a non-virtualized machine, but it is a little old. I'm rebuilding there to test.
It is this one:

root@squitch:/home/ota # grep -e Features -e CPU /var/log/dmesg.today
CPU: Intel(R) Core(TM)2 CPU T5300 @ 1.73GHz (1729.04-MHz K8-class CPU)
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0xe39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM>
AMD Features=0x20100800<SYSCALL,NX,LM>
AMD Features2=0x1
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
SMP: AP CPU #1 Launched!
cpu0: on acpi0
cpu1: on acpi0
coretemp0: on cpu0
est: CPU supports Enhanced Speedstep, but is not recognized.
p4tcc0: on cpu0
coretemp1: on cpu1
est: CPU supports Enhanced Speedstep, but is not recognized.
p4tcc1: on cpu1

@OtacilioNeto
Author

The rebuild in the virtualized machine without OpenMP is done. No luck, same behavior. So it is not OpenMP related.

@brada4
Contributor

brada4 commented Feb 3, 2017

Since your CPU does not support virtualisation, VirtualBox completely emulates the CPU seen in the virtual machine. The more advanced the instruction set you use, the slower the C emulation gets.
You can close the issue, as it is just observing truths of life, not any particular software problem.

@OtacilioNeto
Author

But the Intel specs for this CPU claim that it supports virtualization:

http://ark.intel.com/pt-BR/products/65714/Intel-Core-i7-3517U-Processor-4M-Cache-up-to-3_00-GHz

@brada4
Contributor

brada4 commented Feb 3, 2017

Can you get serious and choose the processor for both timing tests and especially confirm on WHICH CPU your initial timings were measured?

@martin-frbg
Collaborator

martin-frbg commented Feb 3, 2017

What did the OpenBLAS build detect your (virtual) CPU as? (There will be a libopenblas_cputype.so in addition to libopenblas.so.) The actual i7-35xx will be sandybridge, I guess.

@OtacilioNeto
Author

I have finished the tests on my old machine. Same behavior. So now a real machine and a virtualized one show the same behavior. These are the results from the real machine:

[ota@squitch ~]$ OPENBLAS_NUM_THREADS=1; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=1 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
ans =

79.956  

-->exit
[ota@squitch ~]$ OPENBLAS_NUM_THREADS=2; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=2 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
ans =

79.429  

-->exit
[ota@squitch ~]$ OPENBLAS_NUM_THREADS=3; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=3 ; export OMP_NUM_THREADS
[ota@squitch ~]$ scilab-cli
Scilab 5.5.2 (Feb 3 2017, 00:12:09)

-->A1=rand(1000,1000); tic(); [S, v, D]=svd(A1); toc()
ans =

79.128  

@OtacilioNeto
Author

On both installs there is no libopenblas_cputype.so.

Name : openblas
Version : 0.2.19,1
Installed on : Fri Feb 3 00:05:12 2017 BRT
Origin : math/openblas
Architecture : freebsd:11:x86:64
Prefix : /usr/local
Categories : math
Licenses : BSD3CLAUSE
Maintainer : [email protected]
WWW : https://github.com/xianyi/OpenBLAS
Comment : Optimized BLAS library based on GotoBLAS2
Options :
AVX : on
AVX2 : off
CBLAS : on
DYNAMIC_ARCH : on
INTERFACE64 : off
OPENMP : on
Shared Libs required:
libquadmath.so.0
libgomp.so.1
libgfortran.so.3
Shared Libs provided:
libopenblas.so.0
libopenblasp.so.0

@OtacilioNeto
Author

There is an option in the FreeBSD port that claims to add support for "multiple CPU types". I'm disabling this option and rebuilding to test. Maybe it is disabling optimizations.

@brada4
Contributor

brada4 commented Feb 3, 2017

You have to download the OpenBLAS 0.2.18 tarball and type 'make' (with gcc and gfortran available). It should build sandybridge-specific code only and confirm this in a short summary at the end of the build (or barf out with errors if it did not detect the CPU).
You can use the POSIX script command to record the long output.

@brada4
Contributor

brada4 commented Feb 3, 2017

E5-2697 v2 (a bit bigger cache and more hertz, yours could be 1.5-2x slower)
MKL_NUM_THREADS=2 Rscript dgesvd.R (old Revo version)
1024x1024 : 15915.24 MFlops 0.450000 sec
OPENBLAS_NUM_THREADS=2 Rscript dgesvd.R (pure complete 0.2.19, with pessimal FFLAGS)
1024x1024 : 10440.03 MFlops 0.686000 sec
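
As a quick consistency check on the two result lines above, the following sketch multiplies the reported rate by the reported time to recover the amount of work each run was credited with, and then compares the elapsed times. Only the four numbers quoted above are used; the variable names are illustrative.

```python
# (MFlops, seconds) as quoted above for the 1024x1024 case.
runs = {"MKL": (15915.24, 0.450), "OpenBLAS": (10440.03, 0.686)}

# MFlops * seconds gives millions of floating-point operations; /1000 -> GFlop.
work_gflop = {name: mflops * sec / 1000.0 for name, (mflops, sec) in runs.items()}

# How much longer the OpenBLAS run took than the MKL run.
slowdown = runs["OpenBLAS"][1] / runs["MKL"][1]

for name, gflop in work_gflop.items():
    print(f"{name:9s} {gflop:.3f} GFlop")
print(f"OpenBLAS/MKL time ratio: {slowdown:.2f}")
```

Both lines imply the same amount of work, so the comparison is like for like, and the gap on this machine is roughly a factor of 1.5 rather than the much larger difference reported at the top of the thread.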

@OtacilioNeto
Author

The only libraries created are:

-rw-r--r-- 1 root wheel 68136560 3 fev 14:33 work/OpenBLAS-0.2.19/libopenblasp-r0.2.19.a
-rwxr-xr-x 1 root wheel 39170016 3 fev 14:35 work/OpenBLAS-0.2.19/libopenblasp-r0.2.19.so
lrwxr-xr-x 1 root wheel 22 3 fev 13:58 work/OpenBLAS-0.2.19/libopenblasp.a -> libopenblasp-r0.2.19.a
lrwxr-xr-x 1 root wheel 23 3 fev 14:35 work/OpenBLAS-0.2.19/libopenblasp.so -> libopenblasp-r0.2.19.so

@brada4
Contributor

brada4 commented Feb 3, 2017

I mean not the PKG build but an independent build in a new directory from the source tarball,
i.e.

tar xfz OpenBLAS*gz
cd OpenBLAS-0.2.18
script
make
exit
less typescript

@martin-frbg
Collaborator

Could it be that the FreeBSD source package has the DYNAMIC_ARCH=1 option permanently set in Makefile.rule, in addition to providing it as a command line option to "make"? Possibly you will find the CPU name in the config.h file that is produced as part of the build process.

@OtacilioNeto
Author

Here is the config.h

#define OS_FREEBSD 1
#define ARCH_X86_64 1
#define C_GCC 1
#define 64BIT 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define SANDYBRIDGE
#define L2_SIZE 262144
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 4
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 4
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_AVX
#define HAVE_CFLUSH
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_SANDYBRIDGE
#define CHAR_CORENAME "SANDYBRIDGE"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 16384
#define CLOCAL_BUFFER_SIZE 32768
#define ZLOCAL_BUFFER_SIZE 24576
#define GEMM_MULTITHREAD_THRESHOLD 4
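
For anyone who wants to pull the detected core name out of a generated config.h without reading it by eye, here is a minimal sketch. The excerpt is copied from the file above; the helper name `parse_defines` is an assumption made for illustration and is not part of OpenBLAS.

```python
import re

# A few lines excerpted from the config.h shown above.
CONFIG_H = """\
#define SANDYBRIDGE
#define HAVE_SSE4_2
#define HAVE_AVX
#define NUM_CORES 1
#define CORE_SANDYBRIDGE
#define CHAR_CORENAME "SANDYBRIDGE"
"""


def parse_defines(text):
    """Map each #define name to its value ('' for bare flags, quotes stripped)."""
    defines = {}
    for line in text.splitlines():
        match = re.match(r"#define\s+(\w+)(?:\s+(.*))?$", line.strip())
        if match:
            defines[match.group(1)] = (match.group(2) or "").strip('"')
    return defines


defines = parse_defines(CONFIG_H)
print("core name:", defines.get("CHAR_CORENAME"))
print("AVX kernels enabled:", "HAVE_AVX" in defines)
```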

@OtacilioNeto
Author

OtacilioNeto commented Feb 3, 2017

Makefile.rule

Makefile.rule.txt

@brada4
Contributor

brada4 commented Feb 3, 2017

Your CPU is detected as Sandy Bridge, which matches the Ivy Bridge listed in its specifications.
I would trust the distribution package and not get into over-engineering now.

Will you be able to extract dgesvd_ arguments on FreeBSD using gdb? (break dgesvd_ ; run; etc etc.)

@martin-frbg
Collaborator

Thank you, so it seems to have identified your CPU correctly. That leaves the silly issue with the missing -O2 in Makefile.system: could you try adding that to the "override FFLAGS" line there and recompile once more? (Though I do wonder if that alone could account for a 37-fold speed difference.)

@OtacilioNeto
Author

@martin-frbg when I'm compiling I see lots of -O2 on the command line. Look:
gcc49 -c -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_AVX2 -DNO_WARMUP -DMAX_CPU_NUMBER=1 -DASMNAME=srotmg -DASMFNAME=srotmg_ -DNAME=srotmg_ -DCNAME=srotmg -DCHAR_NAME="srotmg_" -DCHAR_CNAME="srotmg" -DNO_AFFINITY -I.. -I. -UDOUBLE -UCOMPLEX rotmg.c -o srotmg.o
gcc49 -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_AVX2 -DNO_WARMUP -DMAX_CPU_NUMBER=1 -DASMNAME=saxpby -DASMFNAME=saxpby_ -DNAME=saxpby_ -DCNAME=saxpby -DCHAR_NAME="saxpby_" -DCHAR_CNAME="saxpby" -DNO_AFFINITY -I.. -I. -UDOUBLE -UCOMPLEX -c axpby.c -o saxpby.o
gcc49 -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_AVX2 -DNO_WARMUP -DMAX_CPU_NUMBER=1 -DASMNAME=cblas_isamax -DASMFNAME=cblas_isamax_ -DNAME=cblas_isamax_ -DCNAME=cblas_isamax -DCHAR_NAME="cblas_isamax_" -DCHAR_CNAME="cblas_isamax" -DNO_AFFINITY -I.. -I. -UDOUBLE -UCOMPLEX -DCBLAS -c -DUSE_ABS -UUSE_MIN imax.c -o cblas_isamax.o
gcc49 -O2 -DMAX_STACK_ALLOC=2048 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DDYNAMIC_ARCH -DNO_AVX2 -DNO_WARMUP -DMAX_CPU_NUMBER=1 -DASMNAME=cblas_sasum -DASMFNAME=cblas_sasum_ -DNAME=cblas_sasum_ -DCNAME=cblas_sasum -DCHAR_NAME="cblas_sasum_" -DCHAR_CNAME="cblas_sasum" -DNO_AFFINITY -I.. -I. -UDOUBLE -UCOMPLEX -DCBLAS -c asum.c -o cblas_sasum.o

@OtacilioNeto
Author

@brada4 do you need this to be done in Scilab, or could it be done with a small example program?

Because debugging Scilab looks a bit tricky:

[ota@nostromo /usr/home/ota/workspace/SOFTWARE_Espelho/ExemploReadFromFile]$ gdb /usr/local/bin/scilab-cli-bin
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...(no debugging symbols found)...
(gdb) break dgesvd_
Function "dgesvd_" not defined.
Make breakpoint pending on future shared library load? (y or [n]) y
Breakpoint 1 (dgesvd_) pending.
(gdb) run
Starting program: /usr/local/bin/scilab-cli-bin
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...Error while reading shared library symbols:
Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/local/lib/gcc49/libgfortran.so.3]
Error while reading shared library symbols:
Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/local/lib/gcc49/libquadmath.so.0]
Error while reading shared library symbols:
Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/local/lib/gcc49/libgcc_s.so.1]
[New LWP 101077]
Error while reading shared library symbols:
Dwarf Error: wrong version in compilation unit header (is 4, should be 2) [in module /usr/local/lib/gcc49/libgomp.so.1]
Breakpoint 2 at 0x8077430ae
Pending breakpoint "dgesvd_" resolved
SCI environment variable not defined.

Program exited with code 01.
Current language: auto; currently minimal
(gdb)

@brada4
Contributor

brada4 commented Feb 3, 2017

A wrong DWARF version means gdb was compiled with a different gcc than the module being examined. It is not a problem per se.
Let's try to debug the live process instead of emulating the startup script.
Run scilab and type your commands up to and including tic().
Now start a blank gdb:

attach <pid_of_scilab>
break dgesvd_
break dgeqrf_ # this will be the first call that the above makes
continue

Now type svd() in scilab,
and gdb should stop at a breakpoint.

First press 'c' (continue) until the scilab prompt returns;
there should be 2 breaks of each kind.

Repeat svd in scilab.

Back at the breakpoints:
thread apply all backtrace # on all 4 breakpoints

Now in gdb:
'detach'
when the scilab prompt returns, and exit the debugger.

(Wrap the debugger with 'script'; it is not comfortable to scroll through your 3-screen postings, a text file with a 2-sentence description works better.)

@martin-frbg
Collaborator

Certainly "lots of -O2" in the output, but note that these are all in the gcc calls used to build the BLAS part. The LAPACK part is written in Fortran, and the gfortran calls do/did not get the -O2.
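
One way to check this per compiler rather than by eye is to scan the recorded build log and count compiler invocations that lack the flag. The sketch below is only an illustration: `missing_opt` is a hypothetical helper, the first sample line is abbreviated from the gcc49 output posted earlier, and the gfortran49 line is made up to show the failing case.

```python
def missing_opt(log_lines, flag="-O2"):
    """Count, per compiler, the invocations in log_lines that do not pass flag."""
    counts = {}
    for line in log_lines:
        tokens = line.split()
        if not tokens:
            continue
        compiler = tokens[0]
        # Only look at lines that start with a C or Fortran compiler name.
        if not compiler.startswith(("gcc", "gfortran")):
            continue
        if flag not in tokens:
            counts[compiler] = counts.get(compiler, 0) + 1
    return counts


SAMPLE_LOG = [
    # Abbreviated from the build output posted earlier in this thread.
    "gcc49 -c -O2 -DMAX_STACK_ALLOC=2048 -Wall -m64 -fPIC rotmg.c -o srotmg.o",
    # Hypothetical Fortran compile line, invented for illustration only.
    "gfortran49 -Wall -m64 -fPIC -c dgesvd.f -o dgesvd.o",
]

print(missing_opt(SAMPLE_LOG))
```

Run against a real typescript, a non-empty count for the Fortran compiler would point at the LAPACK objects being built without optimization.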

@brada4
Contributor

brada4 commented Feb 3, 2017

also get 'info shared' i.e which modules are loaded in process...

@brada4
Contributor

brada4 commented Feb 3, 2017

no need to debug

as it stands in current FreeBSD ports:
scikit is linked to arpack which in turn is linked to netlib (no alternatives)
which in the end means you are just calling netlib blas where you expect openblas.

Regarding OpenMP and compiling: it matches the Makefile.rule setting; you don't need it for single-threaded scilab.

It would be best to notify the arpack and scikit FreeBSD port maintainers about this mishap and ask them to fix it (or learn to add openblas to the arpack package yourself and submit a diff).

There is nothing that OpenBLAS can do if library loader overrides it with other package.

LD_PRELOAD may or may not help to work around.

@martin-frbg
Collaborator

@brada4 ldd linking his scilab binary to libopenblasp.so was in the fifth post, so I do not get what you are ranting about?

@brada4
Contributor

brada4 commented Feb 3, 2017

ldd shows just 1st level of imports.

@OtacilioNeto
Author

@martin-frbg
I did a rebuild and checked the log, and you are right. The -O2 is not passed to gfortran49. I have added it by hand and am rebuilding now. Attached is the log of the previous build.
typescript.txt

@OtacilioNeto
Author

@brada4

I'm using arpack-ng with Scilab. I made a patch so that arpack-ng uses openblas instead of gotoblas. The patch follows.
arpack-ng.txt

@OtacilioNeto
Author

After adding -O2 the performance is improved.

-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 20.015
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 25.158
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 27.053
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 25.514
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 26.208
-->A1=rand(1000, 1000); tic(); [S, v, D]=svd(A1); toc()
ans = 26.186

@OtacilioNeto
Author

Attached is the gdb session as suggested by @brada4:
typescript.txt

@brada4
Contributor

brada4 commented Feb 5, 2017

It is not only DGESVD (the Scilab manual does not tell the complete story):

octave:3> tic(); [S, v, D]=svd(A1); toc() # SVD+MM?
Elapsed time is 11.6619 seconds.
octave:4> tic(); v=svd(A1); toc() # just SVD like I did in R
Elapsed time is 2.2592 seconds.

@OtacilioNeto
Author

About -DMAX_CPU_NUMBER=1, is this correct even for a multicore CPU?

@brada4
Contributor

brada4 commented Feb 6, 2017

Without -D please

@martin-frbg
Collaborator

martin-frbg commented Feb 6, 2017

MAX_CPU_NUMBER appears to be used interchangeably with NUM_THREADS nowadays while NUM_CORES in config.h seems to reflect the number of physically separate cpu dies rather than cores.

So for a dualcore x86 capable of hyperthreading a default build would have NUM_CORES=1 in config.h and -DMAX_CPU_NUMBER=4 on the command lines of the compiler. Your -DMAX_CPU_NUMBER would then appear to be for a build with no multithreading support (consistent with the "NUM_THREADS=1 USE_THREAD=0" seen on the line below "Building for openblas-0.2.19,1" in the typescript.txt you posted - apologies for not looking at that earlier, but we were only concerned with the -O2 then).

Setting OPENBLAS_NUM_THREADS in the tests above cannot have had any effect then. Perhaps your benchmark results would be even closer to the Windows/MKL ones if run with two threads?
(For actual real-world code the relative performance of threaded vs non-threaded OpenBLAS will depend on the workload and on whether the main program is multithreaded itself)

@OtacilioNeto
Author

OtacilioNeto commented Feb 6, 2017

When I set the number of CPUs to 1 in the VirtualBox manager, I get better performance than with 2 CPUs. So I suppose that only one thread is working on the svd. Testing with procstat seems to confirm this. Even when using "OPENBLAS_NUM_THREADS=3; export OPENBLAS_NUM_THREADS; OMP_NUM_THREADS=3 ; export OMP_NUM_THREADS" it starts with three threads, but after a short time only one thread appears to be working and no CPU is at 100%. Look:

PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - -1 182 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 180 run -
1148 100280 scilab-cli-bin - -1 180 run -
[ota@nostromo /usr/home/ota]$ procstat -t 1148
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - -1 181 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
[ota@nostromo /usr/home/ota]$ procstat -t 1148
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - -1 184 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
[ota@nostromo /usr/home/ota]$ procstat -t 1148
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - 0 185 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
[ota@nostromo /usr/home/ota]$ procstat -t 1148
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - 1 186 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
[ota@nostromo /usr/home/ota]$ procstat -t 1148
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - -1 152 sleep uwait
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
1148 100282 scilab-cli-bin - -1 152 sleep ttyin

So, to me, it looks correct to suppose that only one thread is doing the work even on a multicore CPU. Is it then correct to suppose that this is a Scilab issue, or must OpenBLAS create a number of threads that matches the number of CPUs?
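
The same observation can be made mechanically by counting the threads in the "run" state in each snapshot. This is a minimal sketch that assumes the column layout shown in the procstat output above; `running_threads` is a hypothetical helper and the snapshot is the second one pasted above.

```python
# Second `procstat -t 1148` snapshot from the output above.
SNAPSHOT = """\
PID TID COMM TDNAME CPU PRI STATE WCHAN
1148 100273 scilab-cli-bin - -1 181 run -
1148 100278 scilab-cli-bin - -1 152 sleep uwait
1148 100279 scilab-cli-bin - -1 152 sleep usem
1148 100280 scilab-cli-bin - -1 152 sleep usem
"""


def running_threads(text):
    """Return the TIDs of threads whose STATE column reads 'run'."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    tid_col = header.index("TID")
    state_col = header.index("STATE")
    rows = [line.split() for line in lines[1:]]
    return [row[tid_col] for row in rows if row[state_col] == "run"]


print("running:", running_threads(SNAPSHOT))
```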

@brada4
Contributor

brada4 commented Feb 6, 2017

You can read the parameters in Makefile.rule; the default build uses pthreads and the number and type of CPUs that your machine has.

It is normal that the 1st thread, which is the main thread, does the small matrix operations; other threads will be employed for bigger matrices only.

@brada4
Contributor

brada4 commented Mar 2, 2017

I wonder if this is relevant and Intel did it already before:
http://hg.savannah.gnu.org/hgweb/octave/rev/750c8b4b7164
I.e. it will not hurt anybody to use extra gigabyte for 5x speedup...
