
gh-82054: Implements test.regrtest unittest sharding to pull in long tails #99637

Closed
wants to merge 4 commits

Conversation

gpshead (Member) commented on Nov 21, 2022

Currently a draft demonstration, this incomplete implementation has details to be worked out, but it works! It makes long tail tests take significantly less wall time on many core systems.

Example: 25 seconds wall time to run test_multiprocessing_spawn and test_concurrent_futures on a 12 thread zen2 (see below). #win

Known issues to work out: result reporting and libregrtest accounting. You'll see sharded tests "complete" multiple times, and your total tests-run count goes higher than the total number of tests. 😂

Real caveat: This exposes a few ordering and concurrency weaknesses in some tests like test_asyncio that'll need fixing. This will be good for code health anyways!

Which tests get sharded is explicitly opt-in, and the list currently isn't in a maintainable spot. How best to maintain it needs to be worked out, but I expect we'll only ever have 10-20 slowest test modules that we declare worth sharding. If you go to the extreme and enable sharding for everything, a full test suite run takes 50%+ more wall time due to the overhead; but enabling it for the slowest dozen saves wall time. It'll vary based on host and parallelism level.

This implementation is inspired by, and its unittest TestLoader bits derived directly from, the Apache 2.0 licensed absltest:
https://github.com/abseil/abseil-py/blob/v1.3.0/absl/testing/absltest.py#L2359

```
:~/oss/cpython (performance/test-sharding)$ ../b/python -m test -r -j 20 test_multiprocessing_spawn test_concurrent_futures
Using random seed 8555091
0:00:00 load avg: 0.98 Run tests in parallel using 20 child processes
0:00:08 load avg: 1.30 [1/2] test_multiprocessing_spawn passed
0:00:10 load avg: 1.68 [2/2] test_concurrent_futures passed
0:00:11 load avg: 1.68 [3/2] test_multiprocessing_spawn passed
0:00:12 load avg: 1.68 [4/2] test_multiprocessing_spawn passed
0:00:12 load avg: 1.68 [5/2] test_multiprocessing_spawn passed
0:00:14 load avg: 1.87 [6/2] test_multiprocessing_spawn passed
0:00:15 load avg: 1.87 [7/2] test_multiprocessing_spawn passed
0:00:16 load avg: 1.87 [8/2] test_concurrent_futures passed
0:00:16 load avg: 1.87 [9/2] test_multiprocessing_spawn passed
0:00:18 load avg: 1.87 [10/2] test_concurrent_futures passed
0:00:20 load avg: 1.72 [11/2] test_concurrent_futures passed
0:00:20 load avg: 1.72 [12/2] test_concurrent_futures passed
0:00:21 load avg: 1.72 [13/2] test_multiprocessing_spawn passed
0:00:21 load avg: 1.72 [14/2] test_concurrent_futures passed
0:00:22 load avg: 1.72 [15/2] test_concurrent_futures passed
0:00:25 load avg: 1.58 [16/2] test_concurrent_futures passed

== Tests result: SUCCESS ==

All 16 tests OK.

Total duration: 25.6 sec
Tests result: SUCCESS
```

gpshead added the type-feature, tests, and 🔨 test-with-buildbots labels on Nov 21, 2022
bedevere-bot

🤖 New build scheduled with the buildbot fleet by @gpshead for commit 2748b2d 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

bedevere-bot removed the 🔨 test-with-buildbots label on Nov 21, 2022
gpshead (Member, Author) commented on Nov 21, 2022

There's not a lot of measurable benefit as-is on CI and buildbots running our entire test suite with -j3 and -j4.

It is definitely a useful feature when working on something like multiprocessing, where you yourself are regularly rerunning just that targeted test suite. But from an all-tests perspective, we should probably chart the parallelism long tail, with blame ascribed toward the end. The total area of that chart ultimately constrains how tight a rectangle of parallelism we can pull it into for minimal wall time.

I'm not sure the way I've hooked things up in this draft PR is really right. The `__new__` on TestLoader feels like the wrong place to do this, for example. Tests that verify the sharding does what is actually intended are also needed.

carljm (Member) commented on Apr 25, 2023

@zitterbewegung here's how we did something like this in (an earlier version of) Cinder, for comparison: facebookincubator/cinder@3c6be01

gpshead (Member, Author) commented on Apr 25, 2023

here's how we did something like this in (an earlier version of) Cinder, for comparison: facebookincubator/cinder@3c6be01

Nice! An interesting, easy approach: reading that and staring at it for a while, it looks like for suites that are already a sub-package of smaller suites, you just treat those in the somewhat natural way as individual test items, so regrtest runs them separately?

That approach is likely simple enough to immediately adopt. It could also provide a way for us to refactor other large single test_xxx.py files into a package for curated parallelization. There's a lot of merit in that.

gpshead (Member, Author) commented on Apr 25, 2023

What I have in this PR offers a "full" abstraction where every test in a class can be considered independent. This is what we allow people to opt into per test .py file at work. On existing test suites, though, it has a problem: tests that inadvertently depend on ordering, or on other tests having already run, start acting flaky based on the random set chosen to run in each process. Worthwhile cleanup if anyone takes it on, but a pain to do to a large old test suite full of tech debt.

While I like using this PR whenever I'm working on a multiprocessing or concurrent.futures issue, as it significantly speeds those up, I'm not currently motivated to work out the flakiness kinks needed to propose actually landing this as a supported thing.

Related alternative: look at what pytest offers, if anything, for this kind of per-function and per-method parallelism. There's a can of worms of the "how to use pytest for the stdlib test suite" variety, but it is solvable and could be a motivating reason to pick a solution.

gpshead (Member, Author) commented on Sep 23, 2023

The test_multiprocessing, test_concurrent_futures, and test_asyncio long tails have been meaningfully split up recently, enough that the motivation for doing this within the CPython project is far less. Our full test suite run on a 20+ physical core host with python -m test -j50 or -j500 now completes in 45-50 seconds, as opposed to the 2-3 minutes it took before.

I still think this is a nice idea, but it is an "ideal" goal and difficult to land in practice. It would likely not buy us more than a 10 second reduction off the above, though getting to the point we could do this would mean we've made meaningful improvements in our test suite to avoid currently mostly hidden flakiness or ordering dependencies.

We do still have a long tail of ~10 tests that consume the final half of the above-quoted total time on highly parallel systems.

gpshead closed this on Sep 23, 2023
vstinner (Member)

What is difficult is estimating whether a test method is "slow" or "quick". For example, most test_tools tests are done in less than a second, but just one takes between 5 and 30 minutes (if not longer): the evil "test_freeze" test.

What can be done is to keep track of previous runs to build a knowledge base to measure which test methods are slow. This knowledge could be used later to run slowest tests differently. Recently, @serhiy-storchaka annotated manually dozens of tests to skip tests unless a test resource is enabled: issue #108416. More recently, I wrote PR #109570 which adds the concept of "slow CI" and "fast CI" to decide which tests should be run or not. Again, it's hard to decide, the separation is done manually with some heuristics. @serhiy-storchaka did what I described: look at JUnit XML files to discover the slowest tests, and annotate them.
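The "knowledge base of previous runs" idea can be sketched as a simple scheduler: record per-test durations from a prior run, then assign the slowest tests first to the least-loaded worker (the classic longest-processing-time heuristic). This is a hypothetical illustration; the function name and the duration data are made up.

```python
# Sketch of duration-aware test partitioning using the
# longest-processing-time-first heuristic.  Hypothetical helper, not
# part of libregrtest.
import heapq


def partition_by_duration(durations, num_workers):
    """durations: {test_name: seconds}.

    Returns one (total_seconds, [test_names]) bucket per worker, with
    slow tests spread out so no worker is left with the whole tail.
    """
    # Min-heap of (accumulated_seconds, worker_index, assigned_tests);
    # the worker index breaks ties so lists are never compared.
    heap = [(0.0, i, []) for i in range(num_workers)]
    heapq.heapify(heap)
    # Slowest first: each test goes to the currently least-loaded worker.
    for name, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, i, names = heapq.heappop(heap)
        names.append(name)
        heapq.heappush(heap, (total + secs, i, names))
    return [(total, names) for total, _, names in sorted(heap, key=lambda b: b[1])]
```

With this kind of scheduling, an outlier like test_freeze lands alone on one worker while the quick tests fill the others, instead of the slowest test starting last and dragging out the run.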

Well, slowly, we are making progress on better using all available CPUs, and making the overall task of "running the Python suite" faster :-)

vstinner (Member)

I didn't know this PR, interesting work!
