Slurm engine, handle AssocMaxSubmitJobLimit #241

Open
albertz opened this issue Feb 12, 2025 · 5 comments

@albertz (Member) commented Feb 12, 2025

At the RWTH ITC cluster, I get this:

[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: AssocMaxSubmitJobLimit
[2025-02-12 16:37:20,610] ERROR: SBATCH command: sbatch -J i6_core.returnn.forward.ReturnnForwardJobV2.oEi02THsWnau.run --mail-type=None --mem=6G --gres=gpu:1 --cpus-per-task=2 --time=1440 --export=all --ntasks-per-node=1 -A p0023565 -p c23g -o work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau/engine/%x.%A.%a.batch -a 1-1:1 --wrap=srun -o work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau/engine/%x.%A.%a /home/az668407/work/py-envs/py3.12-torch2.5/bin/python tools/sisyphus/sis worker --engine long work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau run
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: AssocMaxSubmitJobLimit

I think they have a limit of 1000 jobs.

The Sisyphus Slurm engine should handle this: once it knows that it has already submitted that many jobs, it should not submit new ones.

I'm not sure how exactly to handle this:

  • How to determine the current limit? Automatically, or must the user specify it explicitly?
  • How to know the current number of jobs? Only what the Slurm engine knows about, or check this directly from Slurm somehow?
  • Or just parse the error message for AssocMaxSubmitJobLimit and, if we get it, stop submitting new jobs in this engine until at least one job finishes? (See the sketch below.)
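
A minimal sketch of the last option, assuming the engine runs sbatch via subprocess and can inspect its stderr (the function name and return convention are purely illustrative, not the current Sisyphus API):

import subprocess

def sbatch_submit(sbatch_cmd):
    """Run sbatch; return (job_id, hit_submit_limit)."""
    proc = subprocess.run(sbatch_cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        if "AssocMaxSubmitJobLimit" in proc.stderr:
            # Submit limit reached: tell the caller to pause submissions for now.
            return None, True
        raise RuntimeError("sbatch failed: %s" % proc.stderr.strip())
    # sbatch prints e.g. "Submitted batch job 123456".
    job_id = int(proc.stdout.strip().split()[-1])
    return job_id, False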
@michelwi (Contributor)

How to determine the current limit? Automatically, or must the user specify it explicitly?

I think manual configuration in the engine is fine for starters.

How to know the current number of jobs?

Probably count from the squeue --me output? That should also include all tasks from other managers or manual submissions.
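
For reference, a rough sketch of such a counter, assuming that array tasks count individually towards the submit limit (the function name is purely illustrative):

import subprocess

def count_my_slurm_jobs():
    # --me: only this user's jobs; --array expands job arrays into individual tasks;
    # --noheader and -o %i give one job/task ID per line.
    out = subprocess.run(
        ["squeue", "--me", "--noheader", "--array", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())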

@albertz (Member, Author) commented Feb 12, 2025

Via sacctmgr show User $(whoami) -s, you can query the information about the limits. This gives:

      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
  az668407    default      None        rcc   rwth1750      c23ms         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   rwth1750   c23g_low         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   rwth1750       c23g         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   p0023565       c23g         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc    default  c23ms_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c23g_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c23g         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default      c23ms         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c18m_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c18g_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c18g         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c18m         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc   supp0003       dgx2         1                800             57600      1000  5-00:00:00                           normal           
  az668407    default      None        rcc   supp0003   dgx2_low         1                800             57600      1000  5-00:00:00                           normal           

As you can see, this is not just a single limit; it is a whole matrix, depending on user, account, and partition. Just using a single limit for all kinds of jobs would not work well.
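
For completeness, a hedged sketch of how these per-(account, partition) limits could be queried programmatically via sacctmgr's machine-readable output (the exact format field name, MaxSubmit vs. MaxSubmitJobs, may differ between Slurm versions; the function name is illustrative):

import getpass
import subprocess

def query_max_submit_limits():
    """Return {(account, partition): max_submit} for the current user."""
    out = subprocess.run(
        ["sacctmgr", "show", "user", getpass.getuser(), "-s",
         "format=Account,Partition,MaxSubmit", "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    limits = {}
    for line in out.splitlines():
        account, partition, max_submit = line.split("|")
        if max_submit:  # empty means no explicit limit for this association
            limits[(account, partition)] = int(max_submit)
    return limits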

@dthulke (Member) commented Mar 4, 2025

I guess really identifying the limits will be quite hard, as the current implementation is, for example, not really aware of the partition a job is submitted to.

What about just stopping submission when we get an sbatch: error: AssocMaxSubmitJobLimit, waiting for a certain amount of time, and trying again?

@michelwi (Contributor) commented Mar 4, 2025

the current implementation is, for example, not really aware of the partition a job is submitted to

Now, what would be the correct way to implement this?

  • Should we add partition as an rqmt flavor, so that each Job gets its partition defined "manually"? (This is what we do now, and I don't like it.)
  • Should the Slurm engine know all partitions and then submit jobs based on the rqmts of the job (and maybe on the current usage of the partitions)?
  • Should there be one Slurm engine per partition, and then the manager needs better support for managing multiple engines beyond "short" and "long"?

For a quick-and-dirty implementation, should we just take the min of all of the limits returned? But in the example above this would be 100, which is too low if one only wanted to use the larger partitions.

Make it configurable by the user instead?
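
A small sketch of what that could look like: a user-configured limit takes precedence, with the min-of-all-limits fallback as the quick-and-dirty default (all names are illustrative, not part of the current Sisyphus settings):

def effective_submit_limit(configured_limit=None, queried_limits=None):
    # configured_limit: a hypothetical engine setting, e.g. max_submit_jobs=800.
    # queried_limits: {(account, partition): max_submit}, e.g. from sacctmgr as above.
    if configured_limit is not None:
        return configured_limit
    if queried_limits:
        # Quick-and-dirty fallback: the most conservative (smallest) limit.
        return min(queried_limits.values())
    return None  # unknown: no client-side throttling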

@albertz (Member, Author) commented Mar 4, 2025

What about just stopping submission when we get an sbatch: error: AssocMaxSubmitJobLimit, waiting for a certain amount of time, and trying again?

Yes, this is also what I think would be the easiest / most pragmatic solution.

My suggestion was not to wait for a fixed amount of time, but to hold back further submissions on the engine until at least one job on the engine finishes. Or maybe do both, and retry once the timeout has passed or some job has finished.
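
A purely illustrative sketch of that combined back-off; current_job_count would be any callable returning the user's number of queued/running jobs (e.g. based on squeue --me as sketched above):

import time

def wait_before_resubmitting(current_job_count, jobs_at_error, timeout=600.0, poll=30.0):
    # Hold back new submissions until either some job finished or the timeout passed.
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if current_job_count() < jobs_at_error:
            return  # some job finished; try submitting again
        time.sleep(poll)
    # Timeout passed without any job finishing; retry submission anyway.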

Should the Slurm engine know all partitions and then submit jobs based on the rqmts of the job (and maybe on the current usage of partitions)?

I think handling all of this would add way too much complexity.
