[Develop] Run 8 and 16 node performance tests in parallel #6308

Merged
merged 10 commits into aws:develop on Jul 5, 2024

Conversation

@hehe7318 (Contributor) commented Jun 24, 2024

Description of changes

  • Modify test_starccm and test_openfoam
    • Add logic to run the 8- and 16-node performance tests in parallel (see the sketch below).
      • The cluster is created with 32 compute nodes, and previously the 8-, 16-, and 32-node tests ran one after another. 24 of those nodes can instead be used to run the 8- and 16-node performance tests in parallel, saving time.
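The overall shape of the change, as a minimal sketch (run_performance_test here is a hypothetical placeholder for the per-node-count test body, not the exact helper added in this PR):

    from concurrent.futures import ThreadPoolExecutor

    def run_performance_test(number_of_nodes):
        # Hypothetical placeholder: submit the job for the given node count,
        # wait for it, and return the observed value.
        ...

    # The 8- and 16-node runs use 24 of the 32 compute nodes, so they can run
    # side by side; the 32-node run still happens on its own afterwards.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = {n: executor.submit(run_performance_test, n) for n in (8, 16)}
        results = {n: future.result() for n, future in futures.items()}
    results[32] = run_performance_test(32)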

Tests

  • test_starccm: 6 passed, 1 failed; the failure is unrelated to this PR and was already failing before this change.

Improvement

  • Tests succeeded; a significant improvement can be seen in the running time.
    • Previously 2 hr 13 min for test_openfoam[alinux2] and 2 hr 13 min for test_openfoam[ubuntu2004]
    • Now 1 hr 49 min for test_openfoam[alinux2] and 2 hr 4 min for test_openfoam[ubuntu2004]

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hehe7318 added the skip-changelog-update (disables the check that enforces changelog updates in PRs) and 3.x labels on Jun 24, 2024
@hehe7318 hehe7318 requested review from a team as code owners June 24, 2024 13:38
# Copy additional files in advance to avoid conflicts when running the 8- and 16-node tests in parallel
remote_command_executor._copy_additional_files([str(test_datadir / "starccm.slurm.sh")])
# Run 8 and 16 node tests in parallel
with ThreadPoolExecutor(max_workers=2) as executor:
Contributor commented:

Why do we need a thread pool executor?
What about submitting the two jobs to the scheduler and waiting for them to complete?

@hehe7318 (Contributor Author) commented Jun 25, 2024:

With the thread pool executor, what we do is exactly that: submit the two jobs to the scheduler at the same time and wait for them to complete.
If we don't use the thread pool executor, we can also submit them one by one; of course they can still run in parallel on the cluster. We would then wait for one of them to complete, then for the other, and finally run the scripts one by one to get observed_value.
Compared to the thread pool executor, submitting the two jobs to the scheduler directly has:
pros:

  1. Simpler logic to implement.
  2. Saves some processor overhead (maybe, but I don't think that is something we need to be concerned about).

cons:

  1. Duplicated code, worse readability.
  2. A bit more time spent: while the jobs still run in parallel on the cluster, the script itself does not use Python's concurrency as effectively.

I prefer the thread pool approach (a rough sketch of the scheduler-submission alternative is below, for comparison).
Which one do you prefer?
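For comparison, a rough sketch of the scheduler-submission alternative, assuming plain sbatch on the head node via the existing remote_command_executor (the job script name, the --nodes usage, and the assumption that the result exposes stdout are placeholders, not the actual test code):

    import re

    def submit_job(remote_command_executor, number_of_nodes):
        # Placeholder submission: sbatch returns immediately with a job id,
        # so the two jobs still run in parallel on the cluster.
        result = remote_command_executor.run_remote_command(
            f"sbatch --nodes={number_of_nodes} starccm.slurm.sh"
        )
        return re.search(r"Submitted batch job (\d+)", result.stdout).group(1)

    job_ids = [submit_job(remote_command_executor, n) for n in (8, 16)]

    # Block until both jobs have finished, in whichever order they complete,
    # using a trivial job that depends on both.
    remote_command_executor.run_remote_command(
        f"sbatch --wait --dependency=afterany:{job_ids[0]}:{job_ids[1]} --wrap 'true'"
    )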

@hehe7318 (Contributor Author) commented Jul 2, 2024:

Hi Giacomo, after investigating, I still think ThreadPoolExecutor is the better approach, and maybe the only approach we can take.

Let me explain why:

First point: if I remember correctly, you mentioned that I split out the helper functions run_starccm_test and run_openfoam_test and then call them three times, and that this makes the code duplicated and hard to maintain. It does not: if you look at the code, you can see that previously the test ran three times in a loop; I just moved that body into a separate function to accommodate the changes.

Second point: the reason above is not decisive, but this one is important. Let's set test_starccm aside for now; it can definitely use the approach you suggested, it just needs a bit more time to calculate perf_test_result sequentially, and we can afford that. But what about test_openfoam? With the approach you suggested, the code would look like:

    remote_command_executor.run_remote_command(
        f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1',
        timeout=OPENFOAM_JOB_TIMEOUT,
    )
    remote_command_executor.run_remote_command(
        f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1',
        timeout=OPENFOAM_JOB_TIMEOUT,
    )

But unlike sbatch, these two commands cannot run in parallel, unless we use something like the following (the code below is assumed to run on the HeadNode):

bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1 &
bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1 &
wait

But first, I am afraid the timeout parameter would not work as expected. Second, we cannot be sure that the wait command would not cause unexpected interactions with other commands. With the thread pool, each command keeps its own timeout (see the sketch below).
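For contrast, a simplified sketch of the shape of the thread pool version, where each run_remote_command call keeps its own timeout (the actual PR code may differ in detail, for example by using a separate command executor per thread):

    from concurrent.futures import ThreadPoolExecutor

    def run_openfoam_job(number_of_nodes):
        # Each call keeps its own OPENFOAM_JOB_TIMEOUT, unlike the
        # backgrounded "&" + wait approach on the head node.
        remote_command_executor.run_remote_command(
            f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "{number_of_nodes}" 2>&1',
            timeout=OPENFOAM_JOB_TIMEOUT,
        )

    # Submit the 8- and 16-node runs at the same time and wait for both.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_openfoam_job, n) for n in (8, 16)]
        for future in futures:
            future.result()  # re-raises any failure, including timeouts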

@hehe7318 (Contributor Author) commented:

Hi Giacomo, after we agreed on the use of the thread pool executor, I made changes to the PR. Now we only use it in test_openfoam.

@hehe7318 hehe7318 merged commit 60184ba into aws:develop Jul 5, 2024
27 of 28 checks passed