[Develop] Run 8 and 16 node performance tests in parallel #6308

Merged
merged 10 commits into aws:develop on Jul 5, 2024

Conversation

@hehe7318 (Contributor) commented Jun 24, 2024

Description of changes

  • Modify test_starccm and test_openfoam
    • Add logic to run the 8- and 16-node performance tests in parallel (see the sketch below).
      • The cluster is created with 32 compute nodes, and previously the 8-, 16-, and 32-node tests ran one after another. 24 of those nodes can instead be used to run the 8- and 16-node performance tests in parallel, saving time.
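The overall shape of the change, as a minimal sketch (run_performance_test here is a hypothetical placeholder for the per-node-count test body, not the exact helper added in this PR):

    from concurrent.futures import ThreadPoolExecutor

    def run_performance_test(number_of_nodes):
        # Hypothetical placeholder: submit the job for the given node count,
        # wait for it, and return the observed value.
        ...

    # The 8- and 16-node runs use 24 of the 32 compute nodes, so they can run
    # side by side; the 32-node run still happens on its own afterwards.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = {n: executor.submit(run_performance_test, n) for n in (8, 16)}
        results = {n: future.result() for n, future in futures.items()}
    results[32] = run_performance_test(32)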

Tests

  • test_starccm: 6 passed, 1 failed; the failure is unrelated to this PR and was already failing before this change.

Improvement

  • Tests succeeded; a significant improvement can be seen in the running time.
    • Previously 2 hr 13 min for test_openfoam[alinux2] and 2 hr 13 min for test_openfoam[ubuntu2004]
    • Now 1 hr 49 min for test_openfoam[alinux2] and 2 hr 4 min for test_openfoam[ubuntu2004]

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hehe7318 added the skip-changelog-update (disables the check that enforces changelog updates in PRs) and 3.x labels on Jun 24, 2024
@hehe7318 hehe7318 requested review from a team as code owners June 24, 2024 13:38
# Copy additional files in advance to avoid conflicts when running the 8- and 16-node tests in parallel
remote_command_executor._copy_additional_files([str(test_datadir / "starccm.slurm.sh")])
# Run 8 and 16 node tests in parallel
with ThreadPoolExecutor(max_workers=2) as executor:
Contributor commented:

Why do we need a thread pool executor?
What about submitting the two jobs to the scheduler and waiting for them to complete?

@hehe7318 (Contributor Author) commented Jun 25, 2024:

With the thread pool executor, what we do is exactly that: submit the two jobs to the scheduler at the same time and wait for them to complete.
If we don't use the thread pool executor, we can also submit them one by one; of course they can still run in parallel on the cluster. We would then wait for one of them to complete, then for the other, and finally run the scripts one by one to get observed_value.
Compared to the thread pool executor, submitting the two jobs to the scheduler directly has:
pros:

  1. Simpler logic to implement.
  2. Saves some processor overhead (maybe, but I don't think that is something we need to be concerned about).

cons:

  1. Duplicated code, worse readability.
  2. A bit more time spent: while the jobs still run in parallel on the cluster, the script itself does not use Python's concurrency as effectively.

I prefer the thread pool approach (a rough sketch of the scheduler-submission alternative is below, for comparison).
Which one do you prefer?
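For comparison, a rough sketch of the scheduler-submission alternative, assuming plain sbatch on the head node via the existing remote_command_executor (the job script name, the --nodes usage, and the assumption that the result exposes stdout are placeholders, not the actual test code):

    import re

    def submit_job(remote_command_executor, number_of_nodes):
        # Placeholder submission: sbatch returns immediately with a job id,
        # so the two jobs still run in parallel on the cluster.
        result = remote_command_executor.run_remote_command(
            f"sbatch --nodes={number_of_nodes} starccm.slurm.sh"
        )
        return re.search(r"Submitted batch job (\d+)", result.stdout).group(1)

    job_ids = [submit_job(remote_command_executor, n) for n in (8, 16)]

    # Block until both jobs have finished, in whichever order they complete,
    # using a trivial job that depends on both.
    remote_command_executor.run_remote_command(
        f"sbatch --wait --dependency=afterany:{job_ids[0]}:{job_ids[1]} --wrap 'true'"
    )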

@hehe7318 (Contributor Author) commented Jul 2, 2024:

Hi Giacomo, after investigating, I still think ThreadPoolExecutor is the better approach, and maybe the only approach we can take.

Let me explain why:

First point: if I remember correctly, you mentioned that I split out the helper functions run_starccm_test and run_openfoam_test and then call them three times, and that this makes the code duplicated and hard to maintain. It does not: if you look at the code, you can see that previously the test ran three times in a loop; I just moved that body into a separate function to accommodate the changes.

Second point: the reason above is not decisive, but this one is important. Let's set test_starccm aside for now; it can definitely use the approach you suggested, it just needs a bit more time to calculate perf_test_result sequentially, and we can afford that. But what about test_openfoam? With the approach you suggested, the code would look like:

    remote_command_executor.run_remote_command(
        f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1',
        timeout=OPENFOAM_JOB_TIMEOUT,
    )
    remote_command_executor.run_remote_command(
        f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1',
        timeout=OPENFOAM_JOB_TIMEOUT,
    )

But unlike sbatch, these two commands cannot run in parallel, unless we use something like the following (the code below is assumed to run on the HeadNode):

bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1 &
bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1 &
wait

But first, I am afraid the timeout parameter would not work as expected. Second, we cannot be sure that the wait command would not cause unexpected interactions with other commands. With the thread pool, each command keeps its own timeout (see the sketch below).
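For contrast, a simplified sketch of the shape of the thread pool version, where each run_remote_command call keeps its own timeout (the actual PR code may differ in detail, for example by using a separate command executor per thread):

    from concurrent.futures import ThreadPoolExecutor

    def run_openfoam_job(number_of_nodes):
        # Each call keeps its own OPENFOAM_JOB_TIMEOUT, unlike the
        # backgrounded "&" + wait approach on the head node.
        remote_command_executor.run_remote_command(
            f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "{number_of_nodes}" 2>&1',
            timeout=OPENFOAM_JOB_TIMEOUT,
        )

    # Submit the 8- and 16-node runs at the same time and wait for both.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_openfoam_job, n) for n in (8, 16)]
        for future in futures:
            future.result()  # re-raises any failure, including timeouts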

@hehe7318 (Contributor Author) commented:

Hi Giacomo, after we agreed on the use of the thread pool executor, I made changes to the PR. Now we only use it in test_openfoam.

@hehe7318 hehe7318 merged commit 60184ba into aws:develop Jul 5, 2024
27 of 28 checks passed