Nbabel again: Julia ~30% faster than Pythran (Julia using 4D points instead of 3D points) #1709
Are you sure it is SIMD instructions? At least, removing the `@simd` from the Julia loop does not make a difference (does Julia use SIMD instructions on broadcasting?). Similarly, compiling with or without `-DUSE_XSIMD` does not seem to make a difference for the Pythran code. One thing I notice is that when I change:

```python
for i in range(dim):
    vector[i] = position0[i] - positions[index_p1, i]
...
for i in range(dim):
    accelerations[index_p0, i] -= coef * mass1 * vector[i]
    accelerations[index_p1, i] += coef * mass0 * vector[i]
```

to

```python
vector = position0 - positions[index_p1]
...
accelerations[index_p0] -= coef * mass1 * vector
accelerations[index_p1] += coef * mass0 * vector
```

I see a 10x slowdown.
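For what it's worth, a 10x slowdown is consistent with each sliced expression allocating a small temporary array on every pair. In plain NumPy (outside Pythran) the per-iteration allocation can be avoided with ufunc `out=` arguments; a minimal sketch with made-up values, not the benchmark code itself:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((8, 3))
accelerations = np.zeros_like(positions)
vector = np.empty(3)
scratch = np.empty(3)

index_p0, index_p1 = 0, 1
coef, mass0, mass1 = 2.0, 1.5, 0.5  # arbitrary illustrative values

# allocating form: the expression builds a fresh length-3 temporary
expected = positions[index_p0] - positions[index_p1]

# allocation-free form: the ufunc writes into a preallocated buffer
np.subtract(positions[index_p0], positions[index_p1], out=vector)
assert np.array_equal(vector, expected)

# in-place scaled updates, matching the explicit loop's semantics
np.multiply(vector, coef * mass1, out=scratch)
accelerations[index_p0] -= scratch
np.multiply(vector, coef * mass0, out=scratch)
accelerations[index_p1] += scratch
```

Whether a compiler such as Pythran elides those temporaries for the sliced form is exactly the question at hand.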
Can you give a try to the following code (both functions)?

```python
from math import sqrt

import numpy as np

dim = 3


#pythran export compute(float[:,3], float[:], float[:,3])
def compute(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty_like(positions[0])
    acc0 = np.empty_like(accelerations[0])
    for index_p0 in range(nb_particules - 1):
        np.copyto(acc0, accelerations[index_p0])
        for index_p1 in range(index_p0 + 1, nb_particules):
            np.copyto(vector, positions[index_p0] - positions[index_p1])
            distance = sqrt(np.sum(vector ** 2))
            coef = np.power(distance, -3)
            np.copyto(acc0, acc0 + coef * masses[index_p1] * vector)
            accelerations[index_p1] += coef * masses[index_p0] * vector
        accelerations[index_p0] = -acc0


#pythran export compute_opt(float[:,:], float[:], float[:,:])
def compute_opt(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs
```
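(For anyone reproducing this outside Pythran: variants like these can be checked against a vectorized O(N²) NumPy reference. The sketch below repeats `compute_opt` so it runs standalone and compares it on random data with zero-initialized accelerations; `compute_ref` is only an illustrative checker, not part of the benchmark.)

```python
from math import sqrt

import numpy as np

dim = 3


def compute_opt(accelerations, masses, positions):
    # same function as in the message above, repeated to be self-contained
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.0
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs


def compute_ref(masses, positions):
    # vectorized reference: acc_i = -sum_j m_j (p_i - p_j) / |p_i - p_j|^3
    diff = positions[:, None, :] - positions[None, :, :]
    d2 = (diff ** 2).sum(-1)
    np.fill_diagonal(d2, 1.0)       # avoid 0/0 on the diagonal
    inv_d3 = d2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)   # a particle exerts no force on itself
    return -(inv_d3[:, :, None] * masses[None, :, None] * diff).sum(axis=1)


rng = np.random.default_rng(42)
n = 16
positions = rng.random((n, dim))
masses = rng.random(n) + 0.5
accelerations = np.zeros((n, dim))
compute_opt(accelerations, masses, positions)
assert np.allclose(accelerations, compute_ref(masses, positions))
```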
Not faster at home :-( (opt1 and opt2 are the functions in your previous message)
Same here (2 is the `compute` where I changed the loop over `vector` to vector operations, 3 are the functions from your message).
@cycomanic / @paugier even with `-mfma`?
@serge-sans-paille yes, `-mfma` did not make a difference (neither with gcc nor clang). Could it be an alignment issue, so that SIMD instructions are not used? Is there a way to check?
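(One quick heuristic, plain NumPy and nothing Pythran-specific: the buffer address is exposed via `ctypes.data`, so you can at least see what alignment the allocator happened to give you:)

```python
import numpy as np

a = np.empty((1024, 3))   # float64 buffer, layout similar to the benchmark
addr = a.ctypes.data      # integer address of the first element

# NumPy only guarantees itemsize alignment (the ALIGNED flag);
# 16/32-byte alignment for SSE/AVX loads is allocator luck, not a guarantee
print("ALIGNED flag:", a.flags["ALIGNED"])
print("16-byte aligned (SSE):", addr % 16 == 0)
print("32-byte aligned (AVX):", addr % 32 == 0)
```

If the address is not 32-byte aligned, the compiler has to fall back to unaligned loads, which may or may not matter on recent CPUs.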
I confirm again.
Looking at the discussion of the Julia folks about optimising this here: should the arrays be transposed for better SIMD access? They discuss that they get a significant benefit from laying the arrays out as Nx4; does that not mean the Python code should be laid out as 4xN?
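(In NumPy terms, the two layouts differ in which axis is contiguous; a quick way to see it, with illustrative sizes:)

```python
import numpy as np

n = 1000

# "array of points" (like Julia's vector of 4-tuples): the 4 components
# of one point sit in 32 consecutive bytes
points_n_by_4 = np.zeros((n, 4))
print(points_n_by_4.strides)   # (32, 8)

# transposed "structure of arrays": each component is a contiguous
# length-n vector, which favours vectorizing across particles instead
points_4_by_n = np.zeros((4, n))
print(points_4_by_n.strides)   # (8000, 8)

# one row of the N-by-4 float64 layout is exactly one 32-byte AVX register
assert points_n_by_4[0].nbytes == 32
```

So Nx4 lets one point fill a vector register (what the Julia tuples exploit), while 4xN would instead vectorize one component across many particles; they are different strategies, not transposes of the same one.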
@paugier / @cycomanic can you do an extra run with an updated master?
Nope, it's not better with master.
This is my laptop, so slightly different numbers, but I still don't see a difference with and without
@cycomanic, it would be interesting to compare with what you get on your computer with Julia. What is the output for you of
@paugier sorry, yes, I forgot.
So I don't see the large difference between the 3D and 4D methods with Julia on this computer (I recall it was larger on my desktop at home; I will update once I get home). Both methods are about 40% faster than Pythran.
As a follow-up: on my home desktop (Ryzen 3600), compared to my laptop (i7), Pythran fares slightly better, but there is also no difference with and without
I got a new PR in the project https://github.com/paugier/nbabel for a more efficient Julia implementation. It's now 30% faster than what we get with Pythran. 30% is of course not that big, and for real codes it would not matter. However, since I started to write a reply to the original paper that should be published somewhere, in terms of communication 30% is a bit too much. Moreover, it would be interesting to understand.
Of course, all the time is spent in one function (compute_acceleration). I wrote a small microbenchmark to show the performance issue (see https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_ju_tuple.jl and https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_pythran.py).
The particularity of the Julia implementation is to use a `Tuple{Float64, Float64, Float64, Float64}` to represent a point in 3D (a Julia `Tuple` is immutable). More SIMD instructions can be used, which makes the function 20% faster than with a `Tuple` of 3 floats.
With Pythran, this is what it gives with `dim = 3` (30% slower than Julia with 4 floats): (There are 2 slightly different implementations to ease experiments.)
And with `dim = 4` (this value has to be modified directly in the file), Pythran is slower (40% slower than Julia with 4 floats).
@serge-sans-paille do you understand why Pythran is a bit slower? Do we face a case where Python-NumPy is really not expressive enough?
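(For reference, the fourth component in the 4-float layout can simply be zero padding: it changes no distance, and hence no acceleration. A quick check of that claim; the `pairwise_distance` helper is only illustrative, not from the benchmark:)

```python
from math import sqrt

import numpy as np

rng = np.random.default_rng(0)
p3 = rng.random((10, 3))   # 3D points
p4 = np.zeros((10, 4))     # same points, padded with a zero 4th component
p4[:, :3] = p3


def pairwise_distance(positions, i, j):
    # Euclidean distance between particles i and j, any dimension
    v = positions[i] - positions[j]
    return sqrt(np.sum(v ** 2))


# the zero component contributes exactly 0.0 to every squared distance
assert abs(pairwise_distance(p3, 0, 1) - pairwise_distance(p4, 0, 1)) < 1e-12
```

So the extra component only pays off through better vectorization (a point filling a full 32-byte register), not through any change in the math.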
A remark that I guess is not the problem, but still: I was surprised by the fact that `pythran -P` gives something like this. I guess the C++ compiler is smart enough to simplify that, but it's a bit strange...