Performance regression by -march=native and -DUSE_XSIMD #1626
Comments
I can confirm this on a Ryzen 5 3600 system using clang:

In [6]: timeit loops.loops1(in_arr)
267 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: timeit loopsO3.loops1(in_arr)
96.5 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: timeit loops_xsimd.loops1(in_arr)
269 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [9]: timeit loops_xsimd_native.loops1(in_arr)
321 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [10]: timeit loops_xsimd_nativeO3.loops1(in_arr)
98 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [11]: timeit loops_native.loops1(in_arr)
321 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using gcc I don't see the same effect, however:

In [3]: timeit loops_native.loops1(in_arr)
267 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: timeit loops.loops1(in_arr)
269 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: timeit loops_xsimd.loops1(in_arr)
270 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: timeit loops_xsimd_native.loops1(in_arr)
265 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: timeit loops_xsimd_nativeO3.loops1(in_arr)
129 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [8]: timeit loopsO3.loops1(in_arr)
129 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
The problem with these loops (
I updated my conda environment and repeated the tests. With the updated versions I cannot observe any performance regression. Instead, -march=native accelerates loop1, and the for loops (loop1 and loop1a) got a nice performance boost. For my part the issue can be closed. Thanks a lot for the speed-up.

Used versions: CentOS 7
Results: virtual CPU, Skylake
If my code is compiled with the flags -march=native and -DUSE_XSIMD, the performance decreases instead of increasing; this issue was already mentioned in #1493. For benchmarking I used two servers with different CPUs: one was a virtual server which only supported a limited set of CPU extensions, the other one used Skylake cores with more extensions.
The following code was used as the benchmark.
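The benchmark source itself is not included in this excerpt. As a rough sketch of the kind of kernel being timed, assuming a pythran-exported loops1 that walks a 1-D float64 array with an explicit for loop (the function name comes from the timings above; the body is invented for illustration):

```python
# loops.py -- hypothetical stand-in for the original benchmark kernel.
#pythran export loops1(float64[:])
import numpy as np

def loops1(in_arr):
    # Explicit element-wise loop: the kind of code whose auto-vectorization
    # is influenced by -march=native and -DUSE_XSIMD.
    out = np.empty_like(in_arr)
    acc = 0.0
    for i in range(in_arr.shape[0]):
        acc += in_arr[i] * in_arr[i]
        out[i] = np.sqrt(acc)
    return out
```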
The different functions were called with the script below.
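That script is likewise not reproduced here. A minimal sketch of how the flag combinations could be built and timed, assuming the module names seen in the comments above (loops, loopsO3, loops_xsimd, loops_native, loops_xsimd_native, loops_xsimd_nativeO3), that pythran forwards these flags to the underlying C++ compiler, and an input array size chosen purely for illustration:

```python
# bench.py -- hypothetical driver; the build commands and array size are assumptions.
#
# Assumed build step: one copy of the kernel per flag combination, e.g.
#   pythran loops.py
#   cp loops.py loopsO3.py              && pythran -O3 loopsO3.py
#   cp loops.py loops_xsimd.py          && pythran -DUSE_XSIMD loops_xsimd.py
#   cp loops.py loops_native.py         && pythran -march=native loops_native.py
#   cp loops.py loops_xsimd_native.py   && pythran -march=native -DUSE_XSIMD loops_xsimd_native.py
#   cp loops.py loops_xsimd_nativeO3.py && pythran -O3 -march=native -DUSE_XSIMD loops_xsimd_nativeO3.py
import timeit
import numpy as np

import loops, loopsO3, loops_xsimd, loops_native, loops_xsimd_native, loops_xsimd_nativeO3

in_arr = np.random.rand(10_000_000)  # assumed input size

for mod in (loops, loopsO3, loops_xsimd, loops_native,
            loops_xsimd_native, loops_xsimd_nativeO3):
    # Best of 7 single-call repeats, mirroring the %timeit runs above.
    best = min(timeit.repeat(lambda: mod.loops1(in_arr), number=1, repeat=7))
    print(f"{mod.__name__:25s} {best * 1e3:8.1f} ms")
```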
Results:
virtual CPU
CPU flags:
fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl eagerfpu pni cx16 hypervisor lahf_lm abm
Skylake
CPU flags:
fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1
The performance decrease is only visible on the Skylake system for the functions with the for loops.
Used versions
CentOS 7
miniconda with python 3.8.5, gcc 7.5.0, numpy 1.19.1, pythran 0.9.6 from conda-forge