OpenBLAS far slower than Accelerate on IvyBridge #533
@dpo, when you mention "Accelerate", are you referring to the Anaconda Accelerate package by Continuum Analytics? If so, you are actually comparing MKL BLAS to OpenBLAS. Since MKL BLAS is used on so many systems, there is more interest in fixing OpenBLAS shortcomings relative to MKL BLAS than relative to the Anaconda Accelerate package (from a single company). I have some suggestions to help increase the chances of your feedback resulting in improvements to OpenBLAS.
"Accelerate" means the OSX Accelerate framework, i.e., Apple's implementation of the BLAS. |
@dpo, we haven't optimized saxpy/daxpy with AVX instructions yet. So far, these functions still use the old SSE kernels.
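For illustration only, here is a minimal sketch of why an AVX daxpy kernel can outrun an SSE one: a 256-bit AVX register holds four doubles, so each iteration performs four y := alpha*x + y updates instead of two. This is not the actual OpenBLAS kernel; the function name and loop structure are mine, and it assumes unit stride, n divisible by 4, and an AVX-capable CPU (Sandy/Ivy Bridge have AVX but no FMA, hence the separate multiply and add):

```c
/* Illustrative sketch only -- not the OpenBLAS kernel itself.
 * Remainder handling, alignment, and unrolling are omitted. */
#include <immintrin.h>

void daxpy_avx_sketch(long n, double alpha, const double *x, double *y)
{
    __m256d va = _mm256_set1_pd(alpha);       /* broadcast alpha to all 4 lanes */
    for (long i = 0; i < n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);  /* load 4 doubles of x */
        __m256d vy = _mm256_loadu_pd(y + i);  /* load 4 doubles of y */
        /* AVX1 has no FMA: multiply, then add */
        vy = _mm256_add_pd(_mm256_mul_pd(va, vx), vy);
        _mm256_storeu_pd(y + i, vy);          /* store result back into y */
    }
}
```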
Refs #533. Added optimized saxpy and daxpy kernels for Haswell and Sandy Bridge.
Performance on OSX must have remained poor even after the addition of the optimized kernels, due to the .align issue discussed much later in #1470. Closing here.
I have an IvyBridge i7-3720QM. I installed OpenBLAS via Homebrew on OSX 10.9 (built from the `develop` branch; OpenBLAS used SandyBridge as target) and I'm comparing it to Accelerate using the Tokyo Cython interface. I'm wondering why my OpenBLAS lags quite far behind Accelerate. Here are some sample results in single precision on vectors/matrices of size 30, 100, and 1000. The vertical axis represents thousands of calls per second to BLAS 1/2/3 functions. The horizontal axis represents the various tests, and data size increases as you move to the right.
On the plot, the `openblas1` curve corresponds to `OPENBLAS_NUM_THREADS=1`, `openblas2` to two threads, and `openblas4` to four threads. Double precision results are similar. I especially care about `saxpy` and `daxpy`, but those are among the worst performers in single precision, and also in double precision.
Increasing the number of threads typically degrades performance, presumably because for such small operations the cost of dispatching work to threads dwarfs the arithmetic itself. For larger vectors/matrices in double precision, the BLAS libraries are pretty much on par, but not so for small vectors/matrices. In single precision, the picture is bleaker.
Full results: https://gist.github.com/fb08bd53b13728cb7e7c (ignore the numpy stuff; I'm not taking it into account in the present results).
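For anyone wanting to reproduce this without the Tokyo wrapper, a minimal sketch of the same kind of measurement, written directly against the CBLAS interface, could look like the following. The vector length, call count, and build lines are my assumptions, not from the original benchmark; also, `clock()` measures CPU time, so run with `OPENBLAS_NUM_THREADS=1` for numbers comparable to the single-threaded curve:

```c
/* Minimal daxpy throughput sketch: thousands of calls per second for a
 * fixed vector length, in the spirit of the plots above. Not the
 * original Tokyo-based benchmark; absolute numbers will differ. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void)
{
    const int n = 100;         /* vector length, e.g. 30, 100, or 1000 */
    const int calls = 1000000; /* repetitions to average over */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t t0 = clock();
    for (int k = 0; k < calls; k++)
        cblas_daxpy(n, 0.5, x, 1, y, 1);   /* y := 0.5*x + y */
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("n=%d: %.1f kcalls/s\n", n, calls / secs / 1000.0);
    free(x); free(y);
    return 0;
}
```

Building the same file once against OpenBLAS (`gcc bench.c -O2 -lopenblas`) and once against Accelerate on OSX (`clang bench.c -O2 -framework Accelerate`, with the include switched to `<Accelerate/Accelerate.h>`) should give directly comparable numbers for the two libraries.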