[Develop] Run 8 and 16 node performance tests in parallel #6308
Conversation
```python
# Copy additional files in advance to avoid conflicts when running the 8 and 16 node tests in parallel
remote_command_executor._copy_additional_files([str(test_datadir / "starccm.slurm.sh")])
# Run 8 and 16 node tests in parallel
with ThreadPoolExecutor(max_workers=2) as executor:
```
Why do we need a ThreadPoolExecutor?
What about submitting the two jobs to the scheduler and waiting for them to complete?
With the ThreadPoolExecutor, that is exactly what we do: we submit the two jobs to the scheduler at the same time and wait for them to complete.
Without the ThreadPoolExecutor, we could also submit them one by one; of course they can still run in parallel on the cluster. We would then wait for one of them to complete, then wait for the other, and finally run the scripts one by one to get observed_value.
Compared to using the ThreadPoolExecutor, submitting the two jobs to the scheduler sequentially has:
pros:
- Simpler logic to implement.
- Slightly lower processor overhead (maybe, but I don't think that's something we need to be concerned about).
cons:
- Duplicated code and worse readability.
- Some extra wall-clock time: while the jobs run in parallel on the cluster, the script itself does not take advantage of Python's concurrency.
I prefer the ThreadPoolExecutor approach (see the sketch below).
Which one do you prefer?
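For concreteness, here is a minimal sketch of the ThreadPoolExecutor approach described above. The helper run_starccm_test is the one mentioned later in this thread, but its exact signature is an assumption here:

```python
from concurrent.futures import ThreadPoolExecutor

# Submit both jobs to the scheduler at essentially the same time; each worker
# thread blocks on its own job while the jobs run in parallel on the cluster.
# run_starccm_test's signature is assumed for illustration.
with ThreadPoolExecutor(max_workers=2) as executor:
    future_8 = executor.submit(run_starccm_test, remote_command_executor, 8)
    future_16 = executor.submit(run_starccm_test, remote_command_executor, 16)
    # result() waits for each job to complete and returns its observed value.
    observed_value_8 = future_8.result()
    observed_value_16 = future_16.result()
```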
Hi Giacomo, after investigating, I still think ThreadPoolExecutor is the better approach, and maybe the only approach we can take.
Let me explain why:
First point: if I remember correctly, you mentioned that I split out the helper functions run_starccm_test and run_openfoam_test and then call them three times, and you said this makes the code duplicated and hard to maintain. But it doesn't: if you look at the code, you can see that it previously ran three times in a loop; I just moved the body into a separate function to accommodate the changes.
Second point: the reason above is not decisive, but this one is important. Let's forget test_starccm for now; it can definitely use the approach you suggested, it would just need a bit more time to calculate perf_test_result sequentially rather than in parallel, and we can afford that. But what about test_openfoam? If we use the approach you suggested, the code would look like:
```python
remote_command_executor.run_remote_command(
    f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1',
    timeout=OPENFOAM_JOB_TIMEOUT,
)
remote_command_executor.run_remote_command(
    f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1',
    timeout=OPENFOAM_JOB_TIMEOUT,
)
```
But unlike sbatch, these two commands cannot run in parallel: each run_remote_command call blocks until its script finishes. Unless we use something like the following (assumed to run on the HeadNode):
```bash
bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "8" 2>&1 &
bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "16" 2>&1 &
wait
```
But first, I am afraid the timeout parameter would not work as expected. Second, we cannot be sure that the wait command would not cause unintended differences for other commands.
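By contrast, here is a minimal sketch of the ThreadPoolExecutor alternative, reusing the same run_remote_command calls; the run_openfoam_job wrapper is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_openfoam_job(node_count):
    # Hypothetical wrapper: each blocking run_remote_command call keeps its
    # own timeout because it runs in its own worker thread.
    return remote_command_executor.run_remote_command(
        f'bash openfoam.slurm.sh "{subspace_benchmarks_dir}" "{node_count}" 2>&1',
        timeout=OPENFOAM_JOB_TIMEOUT,
    )

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(run_openfoam_job, [8, 16]))
```

This keeps both jobs running concurrently without backgrounding processes with & and wait on the head node.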
Hi Giacomo, after we agreed on the use of ThreadPoolExecutor, I made changes to the PR. Now we only use it in test_openfoam.
Description of changes

Run the 8 and 16 node performance tests in parallel in test_starccm and test_openfoam.
Tests

test_starccm: 6 passed, 1 failed, but the failed test is not related to this PR; it was already failing before.

Improvement
Checklist

- Make sure you are pointing to the right branch and, if you're creating a patch for a branch other than develop, add the branch name as prefix in the PR title (e.g. [release-3.6]).

Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.