
gh-82054: Implements test.regrtest unittest sharding to pull in long tails #99637

Closed
wants to merge 4 commits

Conversation

gpshead (Member) commented on Nov 21, 2022

Currently a draft demonstration, this incomplete implementation has details to be worked out, but it works! It makes long tail tests take significantly less wall time on many core systems.

Example: 25 seconds wall time to run test_multiprocessing_spawn and test_concurrent_futures on a 12 thread zen2 (see below). #win

Known issues to work out: result reporting and libregrtest accounting. You'll see sharded tests "complete" multiple times, and your total tests-run count goes higher than the total number of tests. 😂

Real caveat: This exposes a few ordering and concurrency weaknesses in some tests like test_asyncio that'll need fixing. This will be good for code health anyways!

Which tests get sharded is explicitly opt-in, and the list currently isn't in a maintainable spot. How best to maintain it needs to be worked out, but I expect we'll only ever have 10-20 slowest test modules that we declare worth sharding. If you go to the extreme and enable sharding for everything, a full test suite run takes 50%+ more wall time due to the overhead; but enabling it for the slowest dozen saves wall time. It'll vary based on host and parallelism level.

This implementation is inspired by, and its unittest TestLoader bits derived directly from, the Apache 2.0 licensed absltest:
https://github.com/abseil/abseil-py/blob/v1.3.0/absl/testing/absltest.py#L2359

```
:~/oss/cpython (performance/test-sharding)$ ../b/python -m test -r -j 20 test_multiprocessing_spawn test_concurrent_futures
Using random seed 8555091
0:00:00 load avg: 0.98 Run tests in parallel using 20 child processes
0:00:08 load avg: 1.30 [1/2] test_multiprocessing_spawn passed
0:00:10 load avg: 1.68 [2/2] test_concurrent_futures passed
0:00:11 load avg: 1.68 [3/2] test_multiprocessing_spawn passed
0:00:12 load avg: 1.68 [4/2] test_multiprocessing_spawn passed
0:00:12 load avg: 1.68 [5/2] test_multiprocessing_spawn passed
0:00:14 load avg: 1.87 [6/2] test_multiprocessing_spawn passed
0:00:15 load avg: 1.87 [7/2] test_multiprocessing_spawn passed
0:00:16 load avg: 1.87 [8/2] test_concurrent_futures passed
0:00:16 load avg: 1.87 [9/2] test_multiprocessing_spawn passed
0:00:18 load avg: 1.87 [10/2] test_concurrent_futures passed
0:00:20 load avg: 1.72 [11/2] test_concurrent_futures passed
0:00:20 load avg: 1.72 [12/2] test_concurrent_futures passed
0:00:21 load avg: 1.72 [13/2] test_multiprocessing_spawn passed
0:00:21 load avg: 1.72 [14/2] test_concurrent_futures passed
0:00:22 load avg: 1.72 [15/2] test_concurrent_futures passed
0:00:25 load avg: 1.58 [16/2] test_concurrent_futures passed

== Tests result: SUCCESS ==

All 16 tests OK.

Total duration: 25.6 sec
Tests result: SUCCESS
```

gpshead added the type-feature, tests, and 🔨 test-with-buildbots labels on Nov 21, 2022
bedevere-bot

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 2748b2d 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

bedevere-bot removed the 🔨 test-with-buildbots label on Nov 21, 2022
gpshead (Member, Author) commented on Nov 21, 2022

There's not a lot of measurable benefit as-is on CI and buildbots running our entire test suite with -j3 and -j4.

It is definitely a useful feature when working on something like multiprocessing, where you yourself are regularly rerunning just that targeted test suite. But from an all-tests perspective, we should probably chart the parallelism long tail, with blame ascribed toward the end. The total area of that chart ultimately constrains how tight a rectangle of parallelism we can pull it into for minimal wall time.

I'm not sure the way I've hooked things up in this draft PR is really right. The `__new__` on TestLoader feels like the wrong place to do this, for example. Tests that verify the sharding does what is actually intended are also needed.

carljm (Member) commented on Apr 25, 2023

@zitterbewegung here's how we did something like this in (an earlier version of) Cinder, for comparison: facebookincubator/cinder@3c6be01

gpshead (Member, Author) commented on Apr 25, 2023

here's how we did something like this in (an earlier version of) Cinder, for comparison: facebookincubator/cinder@3c6be01

Nice! An interesting, easy approach: reading that and staring at it for a while, it looks like for suites that are already a sub-package of smaller suites, you just treat those in the somewhat natural way as individual test items, so regrtest runs them separately?

That approach is likely simple enough to immediately adopt. It could also provide a way for us to refactor other large single test_xxx.py files into a package for curated parallelization. There's a lot of merit in that.

gpshead (Member, Author) commented on Apr 25, 2023

What I have in this PR offers a "full" abstraction where every test in a class can be considered independent. This is what we allow people to opt into per test .py file at work. On existing test suites, though, it has a problem: tests that inadvertently depend on ordering, or on other tests having already run, start acting flaky based on the random set chosen to run in each process. Worthwhile cleanup if anyone takes it on, but a pain to do to a large old test suite full of tech debt.

While I like using this PR whenever I'm working on a multiprocessing or concurrent.futures issue, as it significantly speeds those up, I'm not currently motivated to work out the flakiness kinks needed to propose actually landing this as a supported thing.

Related alternative: look at what pytest offers, if anything, for this kind of per-function and per-method parallelism. There's a can of worms of the "how to use pytest for the stdlib test suite" variety, but it is solvable and could be a motivating reason to pick a solution.

gpshead (Member, Author) commented on Sep 23, 2023

The test_multiprocessing, test_concurrent_futures, and test_asyncio long tails have been meaningfully split up recently, enough that the motivation for doing this within the CPython project is far less. Our full test suite run on a 20+ physical core host with python -m test -j50 or -j500 now completes in 45-50 seconds, as opposed to the 2-3 minutes it took before.

I still think this is a nice idea, but it is an "ideal" goal and difficult to land in practice. It would likely not buy us more than a 10 second reduction off the above, though getting to the point we could do this would mean we've made meaningful improvements in our test suite to avoid currently mostly hidden flakiness or ordering dependencies.

We do still have a long tail of ~10 tests that consume the final half of the above-quoted total time on highly parallel systems.

gpshead closed this on Sep 23, 2023
vstinner (Member)

What is difficult is estimating whether a test method is "slow" or "quick". For example, most test_tools tests are done in less than a second, but just one takes between 5 and 30 minutes (if not longer): the evil "test_freeze" test.

What can be done is to keep track of previous runs to build a knowledge base to measure which test methods are slow. This knowledge could be used later to run slowest tests differently. Recently, @serhiy-storchaka annotated manually dozens of tests to skip tests unless a test resource is enabled: issue #108416. More recently, I wrote PR #109570 which adds the concept of "slow CI" and "fast CI" to decide which tests should be run or not. Again, it's hard to decide, the separation is done manually with some heuristics. @serhiy-storchaka did what I described: look at JUnit XML files to discover the slowest tests, and annotate them.
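The "knowledge base of previous runs" idea can be sketched as a simple scheduler: record per-test durations from a prior run, then assign the slowest tests first to the least-loaded worker (the classic longest-processing-time heuristic). This is a hypothetical illustration; the function name and the duration data are made up.

```python
# Sketch of duration-aware test partitioning using the
# longest-processing-time-first heuristic.  Hypothetical helper, not
# part of libregrtest.
import heapq


def partition_by_duration(durations, num_workers):
    """durations: {test_name: seconds}.

    Returns one (total_seconds, [test_names]) bucket per worker, with
    slow tests spread out so no worker is left with the whole tail.
    """
    # Min-heap of (accumulated_seconds, worker_index, assigned_tests);
    # the worker index breaks ties so lists are never compared.
    heap = [(0.0, i, []) for i in range(num_workers)]
    heapq.heapify(heap)
    # Slowest first: each test goes to the currently least-loaded worker.
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, i, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (total + secs, i, names))
    return [(total, names) for total, _, names in sorted(heap, key=lambda b: b[1])]
```

With this kind of scheduling, an outlier like test_freeze lands alone on one worker while the quick tests fill the others, instead of the slowest test starting last and dragging out the run.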

Well, slowly, we are making progress on better using all available CPUs, and making the overall task of "running the Python suite" faster :-)

vstinner (Member)

I didn't know this PR, interesting work!
