Enable parallelism #93
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             main      #93      +/-  ##
==========================================
- Coverage   82.60%   81.25%    -1.36%
==========================================
  Files           4        4
  Lines          92       96        +4
==========================================
+ Hits           76       78        +2
- Misses         16       18        +2
==========================================
```
Thanks for getting this one going! After thinking this one over a bit, I'm not sure that an environment variable is the best way to go here. (If for no other reason, it's kind of a pain to set up unit tests for.) What do you think about making this more configurable via parameters to the user-facing API? E.g.

```python
resampy.resample(..., parallel=True)  # or parallel=False
```

We could have two jit-wrapped versions of the core function on hand by replacing Lines 7 to 8 in ee82e25 with something like

```python
def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...

resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)
```

and then have the user-facing API select between the two. It's a little hacky, but should not present significant overhead and would certainly be easier to test and benchmark.

As an aside, we might want to consider adding cached compilation (to both) so that we don't have to pay the startup cost each time.
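To make the dispatch concrete, here is a self-contained toy sketch of that pattern; the kernel, names, and keyword below are illustrative stand-ins, not resampy's actual core:

```python
import numba
import numpy as np

def _scale(x, out):
    # stand-in kernel; resampy's real core loop is elided above
    for i in numba.prange(len(out)):
        out[i] = 2.0 * x[i]

# wrap the same Python function with two jit configurations
scale_p = numba.jit(nopython=True, nogil=True, parallel=True)(_scale)
scale_s = numba.jit(nopython=True, nogil=True, parallel=False)(_scale)

def scale(x, parallel=False):
    # the user-facing API selects between the two variants
    out = np.empty_like(x)
    (scale_p if parallel else scale_s)(x, out)
    return out

x = np.random.randn(1000)
assert np.allclose(scale(x, parallel=True), scale(x, parallel=False))
```

(With parallel=False, numba.prange falls back to plain range, so a single kernel definition can serve both variants.)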
Dear @bmcfee, I like your solution a lot.
I'd like to benchmark it on typical loads before making a decision on this.
Nah, I think a simple numerical equivalence check for one run would be sufficient. Note that we might not get exact floating-point equivalence here, because fp arithmetic is not associative, but it shouldn't drift too far from the serial implementation.
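A minimal sketch of such an equivalence check, assuming the parallel= keyword proposed above:

```python
import numpy as np
import resampy

sr_orig, sr_new = 44100, 32000
x = np.random.randn(10 * sr_orig).astype(np.float32)

y_s = resampy.resample(x, sr_orig, sr_new, parallel=False)
y_p = resampy.resample(x, sr_orig, sr_new, parallel=True)

# exact bitwise equality may fail: a parallel loop can reorder
# floating-point operations, and fp addition is not associative
assert y_s.shape == y_p.shape
assert np.allclose(y_s, y_p, rtol=1e-5, atol=1e-7)
```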
Thanks @avalentino ! This is looking good - any sense of how it benchmarks on typical loads?
Quick follow-up: I tried benchmarking this on some typical loads (10s, 60s, 600s signals at 44100→32000, float32) and didn't see any consistent benefit. I also tried this on both my laptop (8 cores) and a 24-core server, with no clear difference in behavior. A couple of thoughts:
@bmcfee I did some benchmarking when I initially proposed activating the parallelism.
Regarding the possible use of …
OK @bmcfee, it seems that the …
I have used the following script to benchmark the resample function:

```python
#!/usr/bin/env python3

import os
import timeit

import numpy as np
import resampy


# CPU info
# https://stackoverflow.com/questions/4842448/getting-processor-information-in-python
def get_processor_name():
    import platform
    import subprocess
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/usr/sbin'
        command = "sysctl -n machdep.cpu.brand_string"
        return subprocess.check_output(command.split()).strip().decode()
    elif platform.system() == "Linux":
        import re
        model_name = None
        cpu_cores = None
        with open('/proc/cpuinfo') as fd:
            cpuinfo = fd.read()
        for line in cpuinfo.split("\n"):
            if "model name" in line:
                model_name = re.sub(r".*model name.*:", "", line, count=1).strip()
            elif "cpu cores" in line:
                cpu_cores = re.sub(r".*cpu cores.*:", "", line, count=1).strip()
            if None not in (cpu_cores, model_name):
                break
        return f'{model_name} ({cpu_cores} physical cores)'
    return ""


print('CPU:', get_processor_name())
print('CPU count:', os.cpu_count())

# setup
sr_orig = 22050.0
f0 = 440
T = 10
ovs = 4
t = np.arange(T * sr_orig) / sr_orig
x = np.sin(2 * np.pi * f0 * t)
sr_new = sr_orig * ovs
print('Input size:', len(x))
print('Output size:', len(x) * ovs)

# trigger jit compilation of both variants before timing
resampy.resample(x, sr_orig, sr_new, parallel=False)
resampy.resample(x, sr_orig, sr_new, parallel=True)

# timeit
debug = False
repeat = 5
results = {}
for label, parallel in zip(('sequential', 'parallel'), (False, True)):
    timer = timeit.Timer(
        f'resampy.resample(x, sr_orig, sr_new, parallel={parallel})',
        globals=globals().copy())
    number, _ = timer.autorange()
    number = max(3, number)
    times = timer.repeat(repeat, number)
    if debug:
        print(label, times)
    results[label] = min(times) / number * 1e3
    print(f'{label}: best of {repeat}: {results[label]:.3f} [ms]')

print('speedup:', results['sequential'] / results['parallel'])
```
Nice - confirming that I also see some good speedups on my laptop with a basic test load (2 runs each, to rule out compilation effects):

```python
In [4]: sr = 44100

In [5]: sr_new = 32000

In [6]: x = np.random.randn(120 * sr).astype(np.float32)

In [7]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.14 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.41 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
723 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
791 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

and on the 24-core server:

```python
In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
298 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
5 s ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

I'm curious about how the cache= parameter is affecting this, though, especially given this bit of the documentation.

We should probably dig into this and get a better understanding of what the trouble is. Otherwise, I'm happy for parallel=True to be the default going forward.
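For reference, a sketch of what cached compilation would look like here; whether the cache interacts badly with wrapping the same function twice is exactly the open question (the core body is elided, as in the snippet earlier in the thread):

```python
import numba

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...  # core loop elided; see the snippet earlier in the thread

# cache=True asks numba to persist compiled machine code to disk,
# avoiding the jit startup cost on subsequent runs
resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True, cache=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False, cache=True)(_resample_f)
```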
It seems that the numba caching gets somewhat confused if one applies the numba.jit decorator two times to the same function.
Ah! That makes sense, and is silly. Maybe we could just do something like

```python
resample_f_s = jit(..., parallel=False)(_resample_f)
resample_f_p = jit(..., parallel=True)(copy(_resample_f))
```

? But otherwise agreed that caching is not necessary for this PR.
The solution using …
Alright, let's punt it then. Probably this is an issue to raise with the numba folks - caching should depend on the jit parameters as well as the function body, and it seems like a bug if that's not the case.
There is indeed an open numba issue: numba/numba#6845
LGTM! Is there anything else to do here, or shall we merge and release?
For me it is complete.
This PR enables the numba jit parallel option by default.
The option can be disabled by setting an environment variable.
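For context, a minimal sketch of the environment-variable gating described here, which the discussion above replaced with a parallel= keyword; the variable name and the elided kernel are illustrative assumptions, not necessarily what the PR used:

```python
import os
import numba

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...  # core loop elided

# hypothetical: read the switch once at import time (name is illustrative)
_PARALLEL = os.environ.get('RESAMPY_PARALLEL', '1') != '0'
resample_f = numba.jit(nopython=True, nogil=True, parallel=_PARALLEL)(_resample_f)
```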