Nbabel again: Julia ~30% faster than Pythran (Julia using 4D points instead of 3D points) #1709
Are you sure it is SIMD instructions? At least, removing the `@simd` from the Julia loop does not make a difference (does Julia use SIMD instructions on broadcasting?). Similarly, compiling with or without `-DUSE_XSIMD` does not seem to make a difference for the Pythran code. One thing I notice is that when I change:

```python
for i in range(dim):
    vector[i] = position0[i] - positions[index_p1, i]
...
for i in range(dim):
    accelerations[index_p0, i] -= coef * mass1 * vector[i]
    accelerations[index_p1, i] += coef * mass0 * vector[i]
```

to

```python
vector = position0 - positions[index_p1]
...
accelerations[index_p0] -= coef * mass1 * vector
accelerations[index_p1] += coef * mass0 * vector
```

I see a 10x slowdown.
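For what it's worth, a 10x slowdown is consistent with each sliced expression allocating a small temporary array on every pair. In plain NumPy (outside Pythran) the per-iteration allocation can be avoided with ufunc `out=` arguments; a minimal sketch with made-up values, not the benchmark code itself:

```python
import numpy as np

rng = np.random.default_rng(0)
positions = rng.random((8, 3))
accelerations = np.zeros_like(positions)
vector = np.empty(3)
scratch = np.empty(3)

index_p0, index_p1 = 0, 1
coef, mass0, mass1 = 2.0, 1.5, 0.5  # arbitrary illustrative values

# allocating form: the expression builds a fresh length-3 temporary
expected = positions[index_p0] - positions[index_p1]

# allocation-free form: the ufunc writes into a preallocated buffer
np.subtract(positions[index_p0], positions[index_p1], out=vector)
assert np.array_equal(vector, expected)

# in-place scaled updates, matching the explicit loop's semantics
np.multiply(vector, coef * mass1, out=scratch)
accelerations[index_p0] -= scratch
np.multiply(vector, coef * mass0, out=scratch)
accelerations[index_p1] += scratch
```

Whether a compiler such as Pythran elides those temporaries for the sliced form is exactly the question at hand.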
Can you give a try to the following code (both functions)?

```python
from math import sqrt

import numpy as np

dim = 3


#pythran export compute(float[:,3], float[:], float[:,3])
def compute(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty_like(positions[0])
    acc0 = np.empty_like(accelerations[0])
    for index_p0 in range(nb_particules - 1):
        np.copyto(acc0, accelerations[index_p0])
        for index_p1 in range(index_p0 + 1, nb_particules):
            np.copyto(vector, positions[index_p0] - positions[index_p1])
            distance = sqrt(np.sum(vector ** 2))
            coef = np.power(distance, -3)
            np.copyto(acc0, acc0 + coef * masses[index_p1] * vector)
            accelerations[index_p1] += coef * masses[index_p0] * vector
        accelerations[index_p0] = -acc0


#pythran export compute_opt(float[:,:], float[:], float[:,:])
def compute_opt(accelerations, masses, positions):
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs
```
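(For anyone reproducing this outside Pythran: variants like these can be checked against a vectorized O(N²) NumPy reference. The sketch below repeats `compute_opt` so it runs standalone and compares it on random data with zero-initialized accelerations; `compute_ref` is only an illustrative checker, not part of the benchmark.)

```python
from math import sqrt

import numpy as np

dim = 3


def compute_opt(accelerations, masses, positions):
    # same function as in the message above, repeated to be self-contained
    nb_particules = masses.size
    vector = np.empty(dim)
    accs = np.empty(dim)
    for index_p0 in range(nb_particules - 1):
        position0 = positions[index_p0]
        mass0 = masses[index_p0]
        accs.fill(0)
        for index_p1 in range(index_p0 + 1, nb_particules):
            for i in range(dim):
                vector[i] = position0[i] - positions[index_p1, i]
            d2 = 0.0
            for i in range(dim):
                d2 = d2 + vector[i] * vector[i]
            distance3 = d2 * sqrt(d2)
            coef1 = masses[index_p1] / distance3
            coef0 = mass0 / distance3
            for i in range(dim):
                accs[i] = accs[i] + coef1 * vector[i]
                accelerations[index_p1, i] = accelerations[index_p1, i] + coef0 * vector[i]
        accelerations[index_p0] -= accs


def compute_ref(masses, positions):
    # vectorized reference: acc_i = -sum_j m_j (p_i - p_j) / |p_i - p_j|^3
    diff = positions[:, None, :] - positions[None, :, :]
    d2 = (diff ** 2).sum(-1)
    np.fill_diagonal(d2, 1.0)       # avoid 0/0 on the diagonal
    inv_d3 = d2 ** -1.5
    np.fill_diagonal(inv_d3, 0.0)   # a particle exerts no force on itself
    return -(inv_d3[:, :, None] * masses[None, :, None] * diff).sum(axis=1)


rng = np.random.default_rng(42)
n = 16
positions = rng.random((n, dim))
masses = rng.random(n) + 0.5
accelerations = np.zeros((n, dim))
compute_opt(accelerations, masses, positions)
assert np.allclose(accelerations, compute_ref(masses, positions))
```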
Not faster at home :-( (opt1 and opt2 are the functions in your previous message)
Same here (2 is the `compute` where I changed the loop over `vector` to vector operations, 3 are the functions from your message).
@cycomanic / @paugier even with `-mfma`?
@serge-sans-paille yes, `-mfma` did not make a difference (neither with gcc nor clang). Could it be an alignment issue, so that SIMD instructions are not used? Is there a way to check?
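(One quick heuristic, plain NumPy and nothing Pythran-specific: the buffer address is exposed via `ctypes.data`, so you can at least see what alignment the allocator happened to give you:)

```python
import numpy as np

a = np.empty((1024, 3))   # float64 buffer, layout similar to the benchmark
addr = a.ctypes.data      # integer address of the first element

# NumPy only guarantees itemsize alignment (the ALIGNED flag);
# 16/32-byte alignment for SSE/AVX loads is allocator luck, not a guarantee
print("ALIGNED flag:", a.flags["ALIGNED"])
print("16-byte aligned (SSE):", addr % 16 == 0)
print("32-byte aligned (AVX):", addr % 32 == 0)
```

If the address is not 32-byte aligned, the compiler has to fall back to unaligned loads, which may or may not matter on recent CPUs.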
I confirm again.
Looking at the discussion of the Julia folks about optimising this here: should the arrays be transposed for better SIMD access? They discuss that they get a significant benefit from laying the arrays out as Nx4; does that not mean the Python code should be laid out as 4xN?
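(In NumPy terms, the two layouts differ in which axis is contiguous; a quick way to see it, with illustrative sizes:)

```python
import numpy as np

n = 1000

# "array of points" (like Julia's vector of 4-tuples): the 4 components
# of one point sit in 32 consecutive bytes
points_n_by_4 = np.zeros((n, 4))
print(points_n_by_4.strides)   # (32, 8)

# transposed "structure of arrays": each component is a contiguous
# length-n vector, which favours vectorizing across particles instead
points_4_by_n = np.zeros((4, n))
print(points_4_by_n.strides)   # (8000, 8)

# one row of the N-by-4 float64 layout is exactly one 32-byte AVX register
assert points_n_by_4[0].nbytes == 32
```

So Nx4 lets one point fill a vector register (what the Julia tuples exploit), while 4xN would instead vectorize one component across many particles; they are different strategies, not transposes of the same one.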
@paugier / @cycomanic can you do an extra run with an updated master?
Nope, it's not better with master.
This is my laptop, so slightly different numbers, but I still don't see a difference with and without
@cycomanic, it would be interesting to compare with what you get on your computer with Julia. What is the output for you of
@paugier sorry, yes, I forgot.
So I don't see the large difference between the 3D and 4D methods with Julia on this computer (I recall it was larger on my desktop at home; I will update once I get home). Both methods are about 40% faster than Pythran.
As a follow-up: on my home desktop (Ryzen 3600), compared to my laptop (i7), Pythran fares slightly better, but there is also no difference with and without
I got a new PR in the project https://github.com/paugier/nbabel for a more efficient Julia implementation. It's now 30% faster than what we get with Pythran. 30% is of course not that big, and for real codes it would not matter. However, since I started to write a reply to the original paper that should be published somewhere, in terms of communication 30% is a bit too much. Moreover, it would be interesting to understand.
Of course, all the time is spent in one function (compute_acceleration). I wrote a small microbenchmark to show the performance issue (see https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_ju_tuple.jl and https://github.com/paugier/nbabel/blob/master/py/microbench/microbench_pythran.py).
The particularity of the Julia implementation is to use a `Tuple{Float64, Float64, Float64, Float64}` to represent a point in 3D (a Julia `Tuple` is immutable). More SIMD instructions can be used, which makes the function 20% faster than with a `Tuple` of 3 floats.
With Pythran, this is what it gives with `dim = 3` (30% slower than Julia with 4 floats): (There are 2 slightly different implementations to ease experiments.)
And with `dim = 4` (this value has to be modified directly in the file), Pythran is slower (40% slower than Julia with 4 floats).
@serge-sans-paille do you understand why Pythran is a bit slower? Do we face a case where Python-NumPy is really not expressive enough?
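(For reference, the fourth component in the 4-float layout can simply be zero padding: it changes no distance, and hence no acceleration. A quick check of that claim; the `pairwise_distance` helper is only illustrative, not from the benchmark:)

```python
from math import sqrt

import numpy as np

rng = np.random.default_rng(0)
p3 = rng.random((10, 3))   # 3D points
p4 = np.zeros((10, 4))     # same points, padded with a zero 4th component
p4[:, :3] = p3


def pairwise_distance(positions, i, j):
    # Euclidean distance between particles i and j, any dimension
    v = positions[i] - positions[j]
    return sqrt(np.sum(v ** 2))


# the zero component contributes exactly 0.0 to every squared distance
assert abs(pairwise_distance(p3, 0, 1) - pairwise_distance(p4, 0, 1)) < 1e-12
```

So the extra component only pays off through better vectorization (a point filling a full 32-byte register), not through any change in the math.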
A remark that I guess is not the problem, but still: I was surprised by the fact that `pythran -P` gives something like this. I guess the C++ compiler is smart enough to simplify that, but it's a bit strange...