
Performance regression with -march=native and -DUSE_XSIMD #1626

Closed
rqwa opened this issue Aug 27, 2020 · 3 comments

rqwa commented Aug 27, 2020

If my code is compiled with the flags -march=native and -DUSE_XSIMD, the performance decreases instead of increasing; this issue was already mentioned in #1493. For benchmarking I used two servers with different CPUs: one was a virtual server that only supports a limited set of CPU extensions, the other uses Skylake cores with more extensions.
The following code was used as the benchmark.

#pythran export loops1(float64[][][])
#pythran export loops1a(float64[][][])
#pythran export loops2(float64[][][])
#pythran export loops2a(float64[][][])
import numpy as np

def loops1(np_array):
    shape_x = np_array.shape[0]
    shape_y = np_array.shape[1]
    shape_z = np_array.shape[2]
    for x in range(shape_x):
        for y in range(shape_y):
            for z in range(shape_z):
                if np_array[x][y][z] < 0.3:
                    np_array[x][y][z] = 0
                elif np_array[x][y][z] < 0.6:
                    np_array[x][y][z] = 0.3
                elif np_array[x][y][z] < 0.9:
                    np_array[x][y][z] = 0.6
                else:
                    np_array[x][y][z] = 1
    return np_array

def loops1a(np_array):
    shape_x = np_array.shape[0]
    shape_y = np_array.shape[1]
    shape_z = np_array.shape[2]
    points_z = np.zeros((shape_x,shape_y), dtype=np.int32)
    for x in range(shape_x):
        for y in range(shape_y):
            for z in range(shape_z):
                if np_array[x][y][z] < 0.3:
                    np_array[x][y][z] = 0
                    points_z[x][y] += 1
                elif np_array[x][y][z] < 0.6:
                    np_array[x][y][z] = 0.3
                    points_z[x][y] += 1
                elif np_array[x][y][z] < 0.9:
                    np_array[x][y][z] = 0.6
                    points_z[x][y] += 1
                else:
                    np_array[x][y][z] = 1
    return np_array, points_z

def loops2(np_array):
    np_array[np_array < 0.3] = 0
    np_array[(np_array != 0) & (np_array < 0.6)] = 0.3
    np_array[(np_array != 0) & (np_array < 0.9)] = 0.6
    np_array[np_array >= 0.9] = 1

    return np_array
 
def loops2a(np_array):
    np_array[np_array < 0.3] = 0
    np_array[(np_array != 0) & (np_array < 0.6)] = 0.3
    np_array[(np_array != 0) & (np_array < 0.9)] = 0.6
    np_array[np_array >= 0.9] = 1

    points = np.sum(np.where(np_array < 0.9,1,0),2)

    return np_array, points
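
The variants in the result tables below differ only in the flags passed to pythran. As a sketch, they could be built side by side like this (the module names are made up for illustration; each variant is compiled from its own copy of loops.py so the extension modules get distinct names):

# Hypothetical build helper: one extension module per flag set from the tables below.
import shutil
import subprocess

variants = {
    "loops_plain": [],
    "loops_xsimd": ["-DUSE_XSIMD"],
    "loops_native": ["-march=native"],
    "loops_xsimd_native": ["-DUSE_XSIMD", "-march=native"],
}

for name, flags in variants.items():
    shutil.copy("loops.py", f"{name}.py")
    subprocess.run(["pythran", *flags, f"{name}.py"], check=True)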

The different functions were called with the script below.

import numpy as np
from loops import loops1a, loops2a, loops1, loops2

in_arr = np.random.rand(100,100,1000)
out_arr = <loop_function>(in_arr)
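
The exact measurement script isn't shown; a minimal harness along these lines would produce comparable numbers (a sketch only; note that every function mutates its input in place, so a fresh copy is passed per run):

import timeit

import numpy as np
from loops import loops1, loops1a, loops2, loops2a

in_arr = np.random.rand(100, 100, 1000)

for fn in (loops1, loops2, loops1a, loops2a):
    times = []
    for _ in range(5):
        arr = in_arr.copy()              # fresh input; the copy is not timed
        start = timeit.default_timer()
        fn(arr)
        times.append(timeit.default_timer() - start)
    print(f"{fn.__name__}: {min(times):.4f} s")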

Results:

virtual CPU

CPU flags: fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm rep_good nopl eagerfpu pni cx16 hypervisor lahf_lm abm

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.3081 | 0.3005 | 0.4298 | 0.3222 |
| pythran -DUSE_XSIMD | 0.3073 | 0.2969 | 0.4343 | 0.3150 |
| pythran -march=native | 0.3063 | 0.2996 | 0.4312 | 0.3236 |
| pythran -DUSE_XSIMD -march=native | 0.3072 | 0.2919 | 0.4388 | 0.3113 |

Skylake

CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap xsaveopt xsavec xgetbv1

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.3058 | 0.3200 | 0.4301 | 0.3363 |
| pythran -DUSE_XSIMD | 0.3081 | 0.3230 | 0.4472 | 0.3225 |
| pythran -march=native | 0.4900 | 0.3240 | 0.5256 | 0.3259 |
| pythran -DUSE_XSIMD -march=native | 0.4957 | 0.3238 | 0.5293 | 0.3279 |

The performance decrease is only visible on the Skylake system, and only for the functions with explicit for loops (loops1 and loops1a).

Used versions

CentOS 7
miniconda with python 3.8.5, gcc 7.5.0, numpy 1.19.1, pythran 0.9.6 from conda-forge

cycomanic (Contributor) commented

I can confirm this on a Ryzen 5 3600 system using clang; using -O3 largely removes the penalty:

[ins] In [6]: timeit loops.loops1(in_arr)                                                                             
267 ms ± 204 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [7]: timeit loopsO3.loops1(in_arr)                                                                           
96.5 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [8]: timeit loops_xsimd.loops1(in_arr)                                                                       
269 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [9]: timeit loops_xsimd_native.loops1(in_arr)                                                                
321 ms ± 2.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [10]: timeit loops_xsimd_nativeO3.loops1(in_arr)                                                             
98 ms ± 166 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [11]: timeit loops_native.loops1(in_arr)                                                                     
321 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using gcc I don't see the same effect; however, with -O3 gcc yields significantly worse performance than clang.

[ins] In [3]: timeit loops_native.loops1(in_arr)                                                                      
267 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[nav] In [4]: timeit loops.loops1(in_arr)                                                                             
269 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [5]: timeit loops_xsimd.loops1(in_arr)                                                                       
270 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [6]: timeit loops_xsimd_native.loops1(in_arr)                                                                
265 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

[ins] In [7]: timeit loops_xsimd_nativeO3.loops1(in_arr)                                                              
129 ms ± 77.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

[ins] In [8]: timeit loopsO3.loops1(in_arr)                                                                           
129 ms ± 198 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

clang --version
clang version 10.0.1 
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

gcc --version                                             
gcc (SUSE Linux) 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

serge-sans-paille (Owner) commented

The problem with these loops (loops2 and loops2a) is that pythran doesn't know how to vectorize the array filtering, so -DUSE_XSIMD is ignored there :-/ Stated otherwise, pythran doesn't vectorize scatter/gather...
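
For illustration, the thresholding can also be expressed as a plain elementwise select, which avoids the masked-assignment (scatter) pattern described above. This is only a sketch of the alternative formulation, not a claim about what pythran actually vectorizes; loops2_select is a hypothetical name, and the logic matches the branches of loops1:

#pythran export loops2_select(float64[][][])
import numpy as np

def loops2_select(np_array):
    # Elementwise select instead of boolean-mask assignment.
    return np.where(np_array < 0.3, 0.0,
           np.where(np_array < 0.6, 0.3,
           np.where(np_array < 0.9, 0.6, 1.0)))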

rqwa commented Oct 28, 2020

I updated my conda environment and repeated the tests. With the updated versions I cannot observe any performance regression. Instead, -march=native now accelerates loops1, and the functions with for loops (loops1 and loops1a) got a nice performance boost overall. For my part the issue can be closed.

Thanks a lot for the speed-up.

Used versions

CentOS 7
miniconda with python 3.8.6, gcc 7.5.0, numpy 1.19.1, pythran 0.9.7, gast 0.4.0, beniget 0.3.0 from conda-forge

Results:

virtual CPU

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.0751 | 0.3011 | 0.0755 | 0.3101 |
| pythran -DUSE_XSIMD | 0.0744 | 0.2977 | 0.0754 | 0.3006 |
| pythran -march=native | 0.0773 | 0.2881 | 0.0756 | 0.3071 |
| pythran -DUSE_XSIMD -march=native | 0.0762 | 0.3144 | 0.0743 | 0.2938 |

Skylake

| command | loops1 | loops2 | loops1a | loops2a |
| --- | --- | --- | --- | --- |
| pythran | 0.0738 | 0.2650 | 0.0746 | 0.2810 |
| pythran -DUSE_XSIMD | 0.0743 | 0.2758 | 0.0750 | 0.2901 |
| pythran -march=native | 0.0397 | 0.2709 | 0.0743 | 0.2877 |
| pythran -DUSE_XSIMD -march=native | 0.0397 | 0.2650 | 0.0736 | 0.2848 |
