REALIZATION_MEMORY keyword not working for SLURM #10124

Closed
2 of 9 tasks
smeenks opened this issue Feb 20, 2025 · 0 comments · Fixed by #10130
smeenks commented Feb 20, 2025

What happened? (You can include a screenshot if it helps explain)

In ert 13.0.4, using REALIZATION_MEMORY to allocate RAM on the compute nodes does not seem to work. The SLURM utility seff reports that the compute nodes were granted only 2 GB of memory (the default for our system configuration) despite "REALIZATION_MEMORY 10G" being set.
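For context, the relevant part of the configuration looks roughly like this (a sketch only; the keyword names come from the ERT documentation, and the exact file in the attached ert_test.zip may differ):

```
QUEUE_SYSTEM SLURM
NUM_REALIZATIONS 1
REALIZATION_MEMORY 10G
```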

We have included a zip file with an ERT case that runs a very simple forward model that allocates a NumPy array of variable size.
The 'logs' folder contains two files: one from a run configured for low memory consumption (which finished) and one from a run with high memory consumption (which failed). One thing that stands out is that the command which submits the job to SLURM does not include the --mem option, which is typically used to request memory for a node.
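For comparison, a submission that does carry the memory request would include --mem (illustrative only; the script name is made up, and the exact command line ERT builds may differ):

```
sbatch --mem=10G realization_script.sh
```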

NOTE: the SLURM tool sacct does not always give the same error message for a killed job: sometimes it reports OUT_OF_MEMORY and sometimes simply FAILED, but our system administrator confirmed that the failures were caused by OOM.

ert_test.zip

What did you expect to happen?

We expect the forward model not to crash if its memory consumption remains less than REALIZATION_MEMORY at all times. This does not happen.

Steps to reproduce

- Download the attached ZIP.
- Change the path of ROOTDIR accordingly.
- You can set ARG0 of the forward model USE_MEMORY to a high or low number. It indicates the number of MB the Python script (which uses only Python and the sys library) should allocate, but actual consumption will always be somewhat higher than indicated, so a few tries may be needed.
- Run `ert ensemble_experiment ert_test_anonymous.yml`.
- Use SLURM's seff to monitor the memory consumption (and the amount of memory allocated to it) and check the log files when the case fails.
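A forward-model step along these lines can be sketched in a few lines of Python (a hypothetical stand-in for the USE_MEMORY script in the ZIP; the real script is in ert_test.zip and may differ — the issue text mentions both a NumPy array and a pure-Python script, and this sketch uses a plain bytearray):

```python
# Hypothetical stand-in for the USE_MEMORY forward-model script: allocate
# roughly the requested number of MB and touch every page so the memory is
# resident and visible to SLURM accounting (seff/sacct).
import sys

MB = 1024 * 1024

def use_memory(mb: int) -> int:
    """Allocate ~mb megabytes, touch each 4 KiB page, return the size in MB."""
    buf = bytearray(mb * MB)                   # one byte per element
    step = 4096
    buf[::step] = b"\x01" * len(buf[::step])   # write to every page
    return len(buf) // MB

if __name__ == "__main__":
    print(use_memory(int(sys.argv[1])))
```

Running it as `python use_memory.py 1024` should drive resident memory to roughly 1 GB, which is enough to trigger the OOM kill described above when the job's SLURM allocation stays at the 2 GB default.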

NOTE: our system uses RHEL 9.5.

Environment where bug has been observed

  • python 3.11
  • python 3.12
  • macosx
  • rhel7
  • rhel8
  • local queue
  • lsf queue
  • slurm queue
  • openPBS queue
@smeenks smeenks added the bug label Feb 20, 2025
@jonathan-eq jonathan-eq self-assigned this Feb 21, 2025
@jonathan-eq jonathan-eq moved this to In Progress in SCOUT Feb 21, 2025
@jonathan-eq jonathan-eq moved this from In Progress to Ready for Review in SCOUT Feb 21, 2025
@xjules xjules moved this from Ready for Review to Reviewed in SCOUT Feb 21, 2025
@github-project-automation github-project-automation bot moved this from Reviewed to Done in SCOUT Feb 21, 2025