Slurm engine, handle AssocMaxSubmitJobLimit #241
Comments
I think manual configuration in the engine is fine for starters
probably count from the |
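One possible source to count from is `squeue` output. A minimal sketch, assuming the `--me` flag is available (recent Slurm versions; older ones would need `-u $USER`) and that counting job IDs is what's wanted here:

```python
import subprocess

def count_own_jobs():
    """Count this user's pending and running Slurm jobs.

    Assumes `squeue --me` is available; older Slurm versions would need
    `squeue -u $USER` instead.
    """
    out = subprocess.run(
        ["squeue", "--me", "-h", "-o", "%i"],  # -h: no header, %i: job id only
        capture_output=True, text=True, check=True,
    ).stdout
    return len(out.splitlines())
```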
Via
As you can see, this is not just a single limit; it is a whole matrix depending on user, account, and partition. Just using a single limit for all kinds of jobs would not work well.
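For illustration, such a per-association limit matrix could be read with `sacctmgr`. A sketch, where the exact format fields (and whether `MaxSubmit` is set per association at all) depend on the Slurm version and site configuration:

```python
import subprocess

def read_submit_limits():
    """Read per-association MaxSubmit limits via sacctmgr.

    Returns a dict mapping (user, account, partition) -> limit, or None
    where no limit is set. Field names may need adapting to the local
    Slurm setup.
    """
    out = subprocess.run(
        ["sacctmgr", "-n", "-P", "show", "assoc",
         "format=User,Account,Partition,MaxSubmit"],
        capture_output=True, text=True, check=True,
    ).stdout
    limits = {}
    for line in out.splitlines():
        user, account, partition, max_submit = line.split("|")
        limits[(user, account, partition)] = int(max_submit) if max_submit else None
    return limits
```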
I guess really identifying the limits will be quite hard, as the current implementation is, for example, not really aware of the partition a job is submitted to. What about just stopping submission when we get an AssocMaxSubmitJobLimit error?
Now, what would be the correct way to implement this?
For a quick-and-dirty implementation, should we just take the min of all of the limits returned? But in the example above this would be 100, which is too low if one only wanted to use the larger partitions. Make it configurable by the user instead?
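A sketch of that quick-and-dirty variant, assuming a hypothetical user-facing engine option (called `configured_limit` here) and the `read_submit_limits` helper sketched above; if the option is unset, fall back to the minimum of all reported limits, with the caveat that this can be overly conservative:

```python
def effective_submit_limit(configured_limit=None):
    """Return the submit limit the engine should enforce.

    configured_limit stands for a hypothetical user-facing engine option;
    when it is not set, fall back to the minimum of all limits reported
    by sacctmgr, which (as discussed) may be too low for some use cases.
    """
    if configured_limit is not None:
        return configured_limit
    limits = [v for v in read_submit_limits().values() if v is not None]
    return min(limits) if limits else None
```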
Yes, this is also what I think would be the easiest / most pragmatic solution. My suggestion was not to wait for a fixed amount of time, but to have all further jobs on the engine wait to be submitted until at least one job on the engine finishes. Or maybe do both, and retry once the timeout has passed or some job has finished.
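A sketch of the "do both" variant: pause submissions and retry either when a timeout has passed or as soon as the engine's own job count drops, reusing the `count_own_jobs` helper sketched above; the timeout and polling interval are arbitrary placeholder values:

```python
import time

def wait_before_resubmit(timeout=300, poll_interval=30):
    """Block until the timeout passes or at least one of our jobs finishes."""
    start = time.monotonic()
    initial = count_own_jobs()
    while time.monotonic() - start < timeout:
        if count_own_jobs() < initial:
            return  # some job finished (or was cancelled), retry submitting
        time.sleep(poll_interval)
    # timeout reached: retry submitting anyway
```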
I think handling all of this would add way too much complexity.
At the RWTH ITC cluster, I get this:
I think they have a limit of 1000 jobs.
The Sisyphus Slurm engine should handle this: When it knows that it has already submitted that number of jobs, it should not submit new jobs.
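Putting the pieces above together, that check could look like this sketch, where `submit_fn` stands in for whatever actually calls sbatch and `limit` comes from a configured or queried value as discussed:

```python
def submit_when_allowed(submit_fn, limit):
    """Gate a submission on the known submit limit.

    submit_fn is a placeholder for whatever actually runs sbatch; limit
    is the value from effective_submit_limit() or a configured option.
    """
    while limit is not None and count_own_jobs() >= limit:
        # Already at the limit: wait until a job finishes or the timeout
        # passes before trying again.
        wait_before_resubmit()
    return submit_fn()
```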
I'm not sure how exactly to handle this: maybe just check for AssocMaxSubmitJobLimit when a submission fails, and if we get this, stop submitting new jobs in this engine until at least one job finishes?
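A minimal sketch of that check: attempt the submission and look for the AssocMaxSubmitJobLimit string in sbatch's stderr. The exact wording of the rejection message is an assumption and may differ between Slurm versions and site policies:

```python
import subprocess

def try_submit(sbatch_args):
    """Attempt a submission; return the job id, or None when we hit the
    submit limit and the engine should pause until a job finishes."""
    result = subprocess.run(["sbatch"] + sbatch_args,
                            capture_output=True, text=True)
    if result.returncode != 0:
        if "AssocMaxSubmitJobLimit" in result.stderr:
            # Per-association submit limit reached: signal the engine to
            # stop submitting until at least one job finishes.
            return None
        raise RuntimeError("sbatch failed: " + result.stderr.strip())
    # On success sbatch prints e.g. "Submitted batch job 123456".
    return result.stdout.split()[-1]
```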