
Performance regression with -march=native and -DUSE_XSIMD #1626

Closed
rqwa opened this issue Aug 27, 2020 · 3 comments

rqwa commented Aug 27, 2020

If my code is compiled with the flags -march=native and -DUSE_XSIMD, the performance decreases instead of increasing; this issue was already mentioned in #1493. For benchmarking I used two servers with different CPUs: one was a virtual server that only supports a limited set of CPU extensions, the other uses Skylake cores with more extensions.
The following code was used as the benchmark.

#pythran export loops1(float64[][][])
#pythran export loops1a(float64[][][])
#pythran export loops2(float64[][][])
#pythran export loops2a(float64[][][])
import numpy as np

def loops1(np_array):
    shape_x = np_array.shape[0]
    shape_y = np_array.shape[1]
    shape_z = np_array.shape[2]
    for x in range(shape_x):
        for y in range(shape_y):
            for z in range(shape_z):
                if np_array[x][y][z] < 0.3:
                    np_array[x][y][z] = 0
                elif np_array[x][y][z] < 0.6:
                    np_array[x][y][z] = 0.3
                elif np_array[x][y][z] < 0.9:
                    np_array[x][y][z] = 0.6
                else:
                    np_array[x][y][z] = 1
    return np_array

def loops1a(np_array):
    shape_x = np_array.shape[0]
    shape_y = np_array.shape[1]
    shape_z = np_array.shape[2]
    points_z = np.zeros((shape_x,shape_y), dtype=np.int32)
    for x in range(shape_x):
        for y in range(shape_y):
            for z in range(shape_z):
                if np_array[x][y][z] < 0.3:
                    np_array[x][y][z] = 0
                    points_z[x][y] += 1
                elif np_array[x][y][z] < 0.6:
                    np_array[x][y][z] = 0.3
                    points_z[x][y] += 1
                elif np_array[x][y][z] < 0.9:
                    np_array[x][y][z] = 0.6
                    points_z[x][y] += 1
                else:
                    np_array[x][y][z] = 1
    return np_array, points_z

def loops2(np_array):
    np_array[np_array < 0.3] = 0
    np_array[(np_array != 0) & (np_array < 0.6)] = 0.3
    np_array[(np_array != 0) & (np_array < 0.9)] = 0.6
    np_array[np_array >= 0.9] = 1

    return np_array
 
def loops2a(np_array):
    np_array[np_array < 0.3] = 0
    np_array[(np_array != 0) & (np_array < 0.6)] = 0.3
    np_array[(np_array != 0) & (np_array < 0.9)] = 0.6
    np_array[np_array >= 0.9] = 1

    points = np.sum(np.where(np_array < 0.9,1,0),2)

    return np_array, points
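
The variants in the result tables below differ only in the flags passed to pythran. As a sketch, they could be built side by side like this (the module names are made up for illustration; each variant is compiled from its own copy of loops.py so the extension modules get distinct names):

# Hypothetical build helper: one extension module per flag set from the tables below.
import shutil
import subprocess

variants = {
    "loops_plain": [],
    "loops_xsimd": ["-DUSE_XSIMD"],
    "loops_native": ["-march=native"],
    "loops_xsimd_native": ["-DUSE_XSIMD", "-march=native"],
}

for name, flags in variants.items():
    shutil.copy("loops.py", f"{name}.py")
    subprocess.run(["pythran", *flags, f"{name}.py"], check=True)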

The different functions were called with the script below.

import numpy as np
from loops import loops1a, loops2a, loops1, loops2

in_arr = np.random.rand(100,100,1000)
out_arr = <loop_function>(in_arr)
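
The exact measurement script isn't shown; a minimal harness along these lines would produce comparable numbers (a sketch only; note that every function mutates its input in place, so a fresh copy is passed per run):

import timeit

import numpy as np
from loops import loops1, loops1a, loops2, loops2a

in_arr = np.random.rand(100, 100, 1000)

for fn in (loops1, loops2, loops1a, loops2a):
    times = []
    for _ in range(5):
        arr = in_arr.copy()              # fresh input; the copy is not timed
        start = timeit.default_timer()
        fn(arr)
        times.append(timeit.default_timer() - start)
    print(f"{fn.__name__}: {min(times):.4f} s")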

Results:

virtual CPU

CPU flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl eagerfpu pni cx16 hypervisor lahf_lm abm

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.3081 | 0.3005 | 0.4298 | 0.3222 |
| pythran -DUSE_XSIMD | 0.3073 | 0.2969 | 0.4343 | 0.3150 |
| pythran -march=native | 0.3063 | 0.2996 | 0.4312 | 0.3236 |
| pythran -DUSE_XSIMD -march=native | 0.3072 | 0.2919 | 0.4388 | 0.3113 |

Skylake

CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.3058 | 0.3200 | 0.4301 | 0.3363 |
| pythran -DUSE_XSIMD | 0.3081 | 0.3230 | 0.4472 | 0.3225 |
| pythran -march=native | 0.4900 | 0.3240 | 0.5256 | 0.3259 |
| pythran -DUSE_XSIMD -march=native | 0.4957 | 0.3238 | 0.5293 | 0.3279 |

The performance decrease is only visible on the Skylake system, and only for the functions with explicit for loops (loops1 and loops1a).

Used versions

CentOS 7
miniconda with python 3.8.5, gcc 7.5.0, numpy 1.19.1, pythran 0.9.6 from conda-forge

cycomanic (Contributor) commented

I can confirm this on a Ryzen 5 3600 system using clang; using -O3 largely removes the penalty:

[ins] In [6]: timeit loops.loops1(in_arr)                                                                             
267 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [7]: timeit loopsO3.loops1(in_arr)                                                                           
96.5 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [8]: timeit loops_xsimd.loops1(in_arr)                                                                       
269 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [9]: timeit loops_xsimd_native.loops1(in_arr)                                                                
321 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [10]: timeit loops_xsimd_nativeO3.loops1(in_arr)                                                             
98 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [11]: timeit loops_native.loops1(in_arr)                                                                     
321 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using gcc I don't see the same effect; however, with -O3 gcc yields significantly worse performance than clang.

[ins] In [3]: timeit loops_native.loops1(in_arr)                                                                      
267 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[nav] In [4]: timeit loops.loops1(in_arr)                                                                             
269 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [5]: timeit loops_xsimd.loops1(in_arr)                                                                       
270 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [6]: timeit loops_xsimd_native.loops1(in_arr)                                                                
265 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [7]: timeit loops_xsimd_nativeO3.loops1(in_arr)                                                              
129 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [8]: timeit loopsO3.loops1(in_arr)                                                                           
129 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

clang --version
clang version 10.0.1 
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

gcc --version                                             
gcc (SUSE Linux) 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

serge-sans-paille (Owner) commented

The problem with these loops (loops2 and loops2a) is that pythran doesn't know how to vectorize the array filtering, so -DUSE_XSIMD is ignored there :-/ Stated otherwise, pythran doesn't vectorize scatter/gather...
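
For illustration, the thresholding can also be expressed as a plain elementwise select, which avoids the masked-assignment (scatter) pattern described above. This is only a sketch of the alternative formulation, not a claim about what pythran actually vectorizes; loops2_select is a hypothetical name, and the logic matches the branches of loops1:

#pythran export loops2_select(float64[][][])
import numpy as np

def loops2_select(np_array):
    # Elementwise select instead of boolean-mask assignment.
    return np.where(np_array < 0.3, 0.0,
           np.where(np_array < 0.6, 0.3,
           np.where(np_array < 0.9, 0.6, 1.0)))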

rqwa commented Oct 28, 2020

I updated my conda environment and repeated the tests. With the updated versions I cannot observe any performance regression. Instead, -march=native now accelerates loops1, and the functions with for loops (loops1 and loops1a) got a nice performance boost overall. For my part the issue can be closed.

Thanks a lot for the speed-up.

Used versions

CentOS 7
miniconda with python 3.8.6, gcc 7.5.0, numpy 1.19.1, pythran 0.9.7, gast 0.4.0, beniget 0.3.0 from conda-forge

Results:

virtual CPU

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.0751 | 0.3011 | 0.0755 | 0.3101 |
| pythran -DUSE_XSIMD | 0.0744 | 0.2977 | 0.0754 | 0.3006 |
| pythran -march=native | 0.0773 | 0.2881 | 0.0756 | 0.3071 |
| pythran -DUSE_XSIMD -march=native | 0.0762 | 0.3144 | 0.0743 | 0.2938 |

Skylake

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.0738 | 0.2650 | 0.0746 | 0.2810 |
| pythran -DUSE_XSIMD | 0.0743 | 0.2758 | 0.0750 | 0.2901 |
| pythran -march=native | 0.0397 | 0.2709 | 0.0743 | 0.2877 |
| pythran -DUSE_XSIMD -march=native | 0.0397 | 0.2650 | 0.0736 | 0.2848 |
