Slurm engine, handle AssocMaxSubmitJobLimit #241

Open
albertz opened this issue Feb 12, 2025 · 5 comments

@albertz (Member) commented Feb 12, 2025

At the RWTH ITC cluster, I get this:

[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: AssocMaxSubmitJobLimit
[2025-02-12 16:37:20,610] ERROR: SBATCH command: sbatch -J i6_core.returnn.forward.ReturnnForwardJobV2.oEi02THsWnau.run --mail-type=None --mem=6G --gres=gpu:1 --cpus-per-task=2 --time=1440 --export=all --ntasks-per-node=1 -A p0023565 -p c23g -o work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau/engine/%x.%A.%a.batch -a 1-1:1 --wrap=srun -o work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau/engine/%x.%A.%a /home/az668407/work/py-envs/py3.12-torch2.5/bin/python tools/sisyphus/sis worker --engine long work/i6_core/returnn/forward/ReturnnForwardJobV2.oEi02THsWnau run
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
[2025-02-12 16:37:20,610] ERROR: Error: sbatch: error: AssocMaxSubmitJobLimit

I think they have a limit of 1000 jobs.

The Sisyphus Slurm engine should handle this: once it knows that it has already submitted that many jobs, it should not submit new ones.

I'm not sure how exactly to handle this:

  • How to determine the current limit? Automatically, or must the user specify it explicitly?
  • How to know the current number of jobs? Only what the Slurm engine knows about, or check this directly from Slurm somehow?
  • Or just parse the error message for AssocMaxSubmitJobLimit and, if we get it, stop submitting new jobs in this engine until at least one job finishes? (See the sketch below.)
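
A minimal sketch of the last option, assuming the engine runs sbatch via subprocess and can inspect its stderr (the function name and return convention are purely illustrative, not the current Sisyphus API):

import subprocess

def sbatch_submit(sbatch_cmd):
    """Run sbatch; return (job_id, hit_submit_limit)."""
    proc = subprocess.run(sbatch_cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        if "AssocMaxSubmitJobLimit" in proc.stderr:
            # Submit limit reached: tell the caller to pause submissions for now.
            return None, True
        raise RuntimeError("sbatch failed: %s" % proc.stderr.strip())
    # sbatch prints e.g. "Submitted batch job 123456".
    job_id = int(proc.stdout.strip().split()[-1])
    return job_id, False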
@michelwi (Contributor)

How to determine the current limit? Automatically, or must the user specify it explicitly?

I think manual configuration in the engine is fine for starters.

How to know the current number of jobs?

Probably count from the squeue --me output? That should also include all tasks from other managers or manual submissions.
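
For reference, a rough sketch of such a counter, assuming that array tasks count individually towards the submit limit (the function name is purely illustrative):

import subprocess

def count_my_slurm_jobs():
    # --me: only this user's jobs; --array expands job arrays into individual tasks;
    # --noheader and -o %i give one job/task ID per line.
    out = subprocess.run(
        ["squeue", "--me", "--noheader", "--array", "-o", "%i"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(1 for line in out.splitlines() if line.strip())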

@albertz (Member, Author) commented Feb 12, 2025

Via sacctmgr show User $(whoami) -s, you can query the information about the limits. This gives:

      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
  az668407    default      None        rcc   rwth1750      c23ms         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   rwth1750   c23g_low         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   rwth1750       c23g         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc   p0023565       c23g         1                800               768      1000  7-00:00:00                           normal           
  az668407    default      None        rcc    default  c23ms_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c23g_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c23g         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default      c23ms         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c18m_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default   c18g_low         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c18g         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc    default       c18m         1                 80                         100  7-00:00:00                           normal           
  az668407    default      None        rcc   supp0003       dgx2         1                800             57600      1000  5-00:00:00                           normal           
  az668407    default      None        rcc   supp0003   dgx2_low         1                800             57600      1000  5-00:00:00                           normal           

As you can see, this is not just a single limit; it is a whole matrix, depending on user, account, and partition. Just using a single limit for all kinds of jobs would not work well.
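
For completeness, a hedged sketch of how these per-(account, partition) limits could be queried programmatically via sacctmgr's machine-readable output (the exact format field name, MaxSubmit vs. MaxSubmitJobs, may differ between Slurm versions; the function name is illustrative):

import getpass
import subprocess

def query_max_submit_limits():
    """Return {(account, partition): max_submit} for the current user."""
    out = subprocess.run(
        ["sacctmgr", "show", "user", getpass.getuser(), "-s",
         "format=Account,Partition,MaxSubmit", "--parsable2", "--noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    limits = {}
    for line in out.splitlines():
        account, partition, max_submit = line.split("|")
        if max_submit:  # empty means no explicit limit for this association
            limits[(account, partition)] = int(max_submit)
    return limits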

@dthulke (Member) commented Mar 4, 2025

I guess really identifying the limits will be quite hard, as the current implementation is, for example, not really aware of the partition a job is submitted to.

What about just stopping submission when we get an sbatch: error: AssocMaxSubmitJobLimit, waiting for a certain amount of time, and trying again?

@michelwi (Contributor) commented Mar 4, 2025

the current implementation is, for example, not really aware of the partition a job is submitted to

Now, what would be the correct way to implement this?

  • Should we add partition as an rqmt flavor, so that each Job gets its partition defined "manually"? (This is what we do now, and I don't like it.)
  • Should the Slurm engine know all partitions and then submit jobs based on the rqmts of the job (and maybe on the current usage of the partitions)?
  • Should there be one Slurm engine per partition, and then the manager needs better support for managing multiple engines beyond "short" and "long"?

For a quick-and-dirty implementation, should we just take the min of all of the limits returned? But in the example above this would be 100, which is too low if one only wanted to use the larger partitions.

Make it configurable by the user instead?
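
A small sketch of what that could look like: a user-configured limit takes precedence, with the min-of-all-limits fallback as the quick-and-dirty default (all names are illustrative, not part of the current Sisyphus settings):

def effective_submit_limit(configured_limit=None, queried_limits=None):
    # configured_limit: a hypothetical engine setting, e.g. max_submit_jobs=800.
    # queried_limits: {(account, partition): max_submit}, e.g. from sacctmgr as above.
    if configured_limit is not None:
        return configured_limit
    if queried_limits:
        # Quick-and-dirty fallback: the most conservative (smallest) limit.
        return min(queried_limits.values())
    return None  # unknown: no client-side throttling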

@albertz (Member, Author) commented Mar 4, 2025

What about just stopping submission when we get an sbatch: error: AssocMaxSubmitJobLimit, waiting for a certain amount of time, and trying again?

Yes, this is also what I think would be the easiest / most pragmatic solution.

My suggestion was not to wait for a fixed amount of time, but to hold back further submissions on the engine until at least one job on the engine finishes. Or maybe do both, and retry once the timeout has passed or some job has finished.
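
A purely illustrative sketch of that combined back-off; current_job_count would be any callable returning the user's number of queued/running jobs (e.g. based on squeue --me as sketched above):

import time

def wait_before_resubmitting(current_job_count, jobs_at_error, timeout=600.0, poll=30.0):
    # Hold back new submissions until either some job finished or the timeout passed.
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        if current_job_count() < jobs_at_error:
            return  # some job finished; try submitting again
        time.sleep(poll)
    # Timeout passed without any job finishing; retry submission anyway.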

Should the Slurm engine know all partitions and then submit jobs based on the rqmts of the job (and maybe on the current usage of partitions)?

I think handling all of this would add way too much complexity.
