Enable parallelism #93
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main      #93      +/-  ##
==========================================
- Coverage   82.60%   81.25%    -1.36%
==========================================
  Files           4        4
  Lines          92       96        +4
==========================================
+ Hits           76       78        +2
- Misses         16       18        +2
==========================================
```
Thanks for getting this one going! After thinking this one over a bit, I'm not sure that an environment variable is the best way to go here. (If for no other reason, it's kind of a pain to set up unit tests for.) What do you think about making this more configurable via parameters to the user-facing API? E.g.

```python
resampy.resample(..., parallel=True)  # or parallel=False
```

We could have two jit-wrapped versions of the core function on hand by replacing Lines 7 to 8 in ee82e25 with something like

```python
def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...

resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)
```

and then have the user-facing API select between the two. It's a little hacky, but should not present significant overhead and would certainly be easier to test and benchmark.

As an aside, we might want to consider adding cached compilation (to both) so that we don't have to pay the startup cost each time.
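To make the dispatch concrete, here is a self-contained toy sketch of that pattern; the kernel, names, and keyword below are illustrative stand-ins, not resampy's actual core:

```python
import numba
import numpy as np

def _scale(x, out):
    # stand-in kernel; resampy's real core loop is elided above
    for i in numba.prange(len(out)):
        out[i] = 2.0 * x[i]

# wrap the same Python function with two jit configurations
scale_p = numba.jit(nopython=True, nogil=True, parallel=True)(_scale)
scale_s = numba.jit(nopython=True, nogil=True, parallel=False)(_scale)

def scale(x, parallel=False):
    # the user-facing API selects between the two variants
    out = np.empty_like(x)
    (scale_p if parallel else scale_s)(x, out)
    return out

x = np.random.randn(1000)
assert np.allclose(scale(x, parallel=True), scale(x, parallel=False))
```

(With parallel=False, numba.prange falls back to plain range, so a single kernel definition can serve both variants.)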
Dear @bmcfee, I like your solution a lot.
I'd like to benchmark it on typical loads before making a decision on this.
Nah, I think a simple numerical equivalence check for one run would be sufficient. Note that we might not get exact floating-point equivalence here, because fp arithmetic is not associative, but it shouldn't drift too far from the serial implementation.
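A minimal sketch of such an equivalence check, assuming the parallel= keyword proposed above:

```python
import numpy as np
import resampy

sr_orig, sr_new = 44100, 32000
x = np.random.randn(10 * sr_orig).astype(np.float32)

y_s = resampy.resample(x, sr_orig, sr_new, parallel=False)
y_p = resampy.resample(x, sr_orig, sr_new, parallel=True)

# exact bitwise equality may fail: a parallel loop can reorder
# floating-point operations, and fp addition is not associative
assert y_s.shape == y_p.shape
assert np.allclose(y_s, y_p, rtol=1e-5, atol=1e-7)
```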
Thanks @avalentino ! This is looking good - any sense of how it benchmarks on typical loads?
Quick follow-up: I tried benchmarking this on some typical loads (10s, 60s, 600s signals at 44100→32000, float32) and didn't see any consistent benefit. I also tried this on both my laptop (8 cores) and a 24-core server, with no clear difference in behavior. A couple of thoughts:
@bmcfee I did some benchmarking when I initially proposed activating the parallelism.
Regarding the possible use of …
OK @bmcfee, it seems that the …
I have used the following script to benchmark the resample function:

```python
#!/usr/bin/env python3

import os
import timeit

import numpy as np
import resampy


# CPU info
# https://stackoverflow.com/questions/4842448/getting-processor-information-in-python
def get_processor_name():
    import platform
    import subprocess
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/usr/sbin'
        command = "sysctl -n machdep.cpu.brand_string"
        return subprocess.check_output(command.split()).strip().decode()
    elif platform.system() == "Linux":
        import re
        model_name = None
        cpu_cores = None
        with open('/proc/cpuinfo') as fd:
            cpuinfo = fd.read()
        for line in cpuinfo.split("\n"):
            if "model name" in line:
                model_name = re.sub(r".*model name.*:", "", line, count=1).strip()
            elif "cpu cores" in line:
                cpu_cores = re.sub(r".*cpu cores.*:", "", line, count=1).strip()
            if None not in (cpu_cores, model_name):
                break
        return f'{model_name} ({cpu_cores} physical cores)'
    return ""


print('CPU:', get_processor_name())
print('CPU count:', os.cpu_count())

# setup
sr_orig = 22050.0
f0 = 440
T = 10
ovs = 4
t = np.arange(T * sr_orig) / sr_orig
x = np.sin(2 * np.pi * f0 * t)
sr_new = sr_orig * ovs
print('Input size:', len(x))
print('Output size:', len(x) * ovs)

# trigger jit compilation of both variants before timing
resampy.resample(x, sr_orig, sr_new, parallel=False)
resampy.resample(x, sr_orig, sr_new, parallel=True)

# timeit
debug = False
repeat = 5
results = {}
for label, parallel in zip(('sequential', 'parallel'), (False, True)):
    timer = timeit.Timer(
        f'resampy.resample(x, sr_orig, sr_new, parallel={parallel})',
        globals=globals().copy())
    number, _ = timer.autorange()
    number = max(3, number)
    times = timer.repeat(repeat, number)
    if debug:
        print(label, times)
    results[label] = min(times) / number * 1e3
    print(f'{label}: best of {repeat}: {results[label]:.3f} [ms]')

print('speedup:', results['sequential'] / results['parallel'])
```
Nice - confirming that I also see some good speedups on my laptop with a basic test load (2 runs each, to rule out compilation effects):

```python
In [4]: sr = 44100

In [5]: sr_new = 32000

In [6]: x = np.random.randn(120 * sr).astype(np.float32)

In [7]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.14 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.41 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
723 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
791 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

and on the 24-core server:

```python
In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
298 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
5 s ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

I'm curious about how the cache= parameter is affecting this, though, especially given this bit of the documentation.

We should probably dig into this and get a better understanding of what the trouble is. Otherwise, I'm happy for parallel=True to be the default going forward.
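For reference, a sketch of what cached compilation would look like here; whether the cache interacts badly with wrapping the same function twice is exactly the open question (the core body is elided, as in the snippet earlier in the thread):

```python
import numba

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...  # core loop elided; see the snippet earlier in the thread

# cache=True asks numba to persist compiled machine code to disk,
# avoiding the jit startup cost on subsequent runs
resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True, cache=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False, cache=True)(_resample_f)
```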
It seems that the numba caching gets somewhat confused if one applies the numba.jit decorator two times to the same function.
Ah! That makes sense, and is silly. Maybe we could just do something like

```python
resample_f_s = jit(..., parallel=False)(_resample_f)
resample_f_p = jit(..., parallel=True)(copy(_resample_f))
```

? But otherwise agreed that caching is not necessary for this PR.
The solution using …
Alright, let's punt it then. Probably this is an issue to raise with the numba folks - caching should depend on the jit parameters as well as the function body, and it seems like a bug if that's not the case.
There is indeed an open numba issue: numba/numba#6845
LGTM! Is there anything else to do here, or shall we merge and release?
For me it is complete.
This PR enables the numba jit parallel option by default.
The option can be disabled by setting an environment variable.
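For context, a minimal sketch of the environment-variable gating described here, which the discussion above replaced with a parallel= keyword; the variable name and the elided kernel are illustrative assumptions, not necessarily what the PR used:

```python
import os
import numba

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...  # core loop elided

# hypothetical: read the switch once at import time (name is illustrative)
_PARALLEL = os.environ.get('RESAMPY_PARALLEL', '1') != '0'
resample_f = numba.jit(nopython=True, nogil=True, parallel=_PARALLEL)(_resample_f)
```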