[SARC-395] Adjust the gpu->rgu conversion function to support different versions over time. #155

Open
wants to merge 2 commits into
base: master


notoraptor
Contributor

  • Read GPU billing from database, not from config anymore
  • Add dependency iguane to get GPU->RGU values
  • Add new client function get_rgus()
  • Move series function into client: update_job_series_rgu()
  • update_job_series_rgu(): take into account the evolution of GPU billing across time and the type of GPU billing (billing_is_gpu) on each cluster
  • load_job_series(): make sure user columns are included only if the job user column is included in the data frame.
  • tests: allow to create entries for all testing clusters: read cluster names from sarc-test.json

@notoraptor
Contributor Author

@bouthilx Here is a PR to finish the RGU handling!

Compared to the reference document, I did make one small change. In the reference document ("GPU vs RGU"), the following computation was planned for RGUs on DRAC:

DRAC

allocated.gres_gpu: scaled number of RGUs

nb_rgu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_rgu

allocated.gres_gpu = nb_rgu / IGUANE[allocated.gpu_type]

However, looking at real jobs, it seems to me that allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time) directly returns the number of GPUs, not the number of RGUs.

As an example, I have jobs like this one:

 {
  "cluster_name": "beluga",
  "job_id": 47622739,
  "job_state": "CANCELLED",
  "exit_code": 0,
  "partition": "gpubase_bynode_b1",
  "nodes": [
   "bg12106",
   "bg12107",
   "bg12108",
   "bg12113"
  ],
  "submit_time": "2024-05-23 19:15:55-04:00",
  "start_time": "2024-05-23 19:15:57-04:00",
  "end_time": "2024-05-23 19:55:28-04:00",
  "elapsed_time": 2371,
  "requested": {
   "cpu": 160,
   "mem": 737280,
   "node": 4,
   "billing": 35555,
   "gres_gpu": 16,
   "gpu_type": null
  },
  "allocated": {
   "cpu": 160,
   "mem": 737280,
   "node": 4,
   "billing": 35555,
   "gres_gpu": 16,
   "gpu_type": "Tesla V100-SXM2-16GB"
  }
 },

The billing here is 35555, the GPU type is "Tesla V100-SXM2-16GB", and the partition is gpubase_bynode_b1. Looking at the properties of this partition in beluga's Slurm config file, I find this:

PartitionName=gpubase_bynode_b1 MaxTime=3:00:00 Default=no MinNodes=1 AllowGroups=ALL 
PriorityJobFactor=11 DisableRootJobs=YES RootOnly=NO Hidden=NO OverSubscribe=NO GraceTime=0 PreemptMode=OFF
PriorityTier=10 ReqResv=NO DefMemPerCPU=256 AllowAccounts=ALL AllowQos=ALL 
Nodes=bg[11201-11214,11301-11313,11401-11414,11501-11513,11601-11614,11701-11713,11801-11814,11901-11913,12001-12014,12101-12113,12201-12214,12301-12313,12401-12410] 
TRESBillingWeights=CPU=222.22,Mem=47.62G,GRES/gpu=2200.0 DefaultTime=1:00:00 ExclusiveUser=NO

This tells me that GRES/gpu=2200.0. So if I compute 35555 / 2200.0, I get 16.161363636363635, i.e. about 16, which matches the "gres_gpu": 16 in the job's allocated field.
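
The check above can be reproduced with a small sketch. The helper name `parse_tres_billing_weights` is hypothetical and not part of this PR; it only splits the `TRESBillingWeights=` token of the partition line shown above and divides the job's billing by the `GRES/gpu` weight.

```python
def parse_tres_billing_weights(partition_line: str) -> dict:
    """Return the raw TRESBillingWeights entries of a Slurm partition line.

    Hypothetical helper for illustration; values are kept as strings
    because some of them carry a unit suffix (e.g. "47.62G").
    """
    prefix = "TRESBillingWeights="
    for token in partition_line.split():
        if token.startswith(prefix):
            entries = token[len(prefix):].split(",")
            return dict(entry.split("=", 1) for entry in entries)
    return {}


# Last line of the gpubase_bynode_b1 partition definition quoted above.
line = (
    "TRESBillingWeights=CPU=222.22,Mem=47.62G,GRES/gpu=2200.0 "
    "DefaultTime=1:00:00 ExclusiveUser=NO"
)
weights = parse_tres_billing_weights(line)
gpu_weight = float(weights["GRES/gpu"])

billing = 35555  # allocated.billing of job 47622739
estimated_gpus = billing / gpu_weight
print(round(estimated_gpus))  # compare with allocated.gres_gpu of the job
```

The quotient is not an exact integer, presumably because the billing value is not derived from the GPU weight alone, so rounding is only used here for the comparison.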

PS: As a reminder, allocated.gres_gpu takes the billing value in the job series.

So I replaced:

DRAC

allocated.gres_gpu: scaled number of RGUs

nb_rgu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_rgu

allocated.gres_gpu = nb_rgu / IGUANE[allocated.gpu_type]

With:

DRAC

allocated.gres_gpu: scaled number of RGUs

nb_gpu = allocated.gres_gpu / config[grappe].scaling_rgu(allocated.start_time)

allocated.gres_rgu = nb_gpu * IGUANE[allocated.gpu_type]

allocated.gres_gpu = nb_gpu
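
A minimal sketch of the revised computation, assuming the per-cluster scaling is stored as a dated history and looked up by job start time. All names here (`BILLING_HISTORY`, `RGU_PER_GPU`, `scaling_at`, `convert`) are hypothetical and are not the functions added by this PR; the dates are made up, and the RGU ratio is a placeholder rather than a value read from `iguane`.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Optional, Tuple

# Hypothetical billing history for one cluster, sorted by date:
# (date from which the weight applies, GRES/gpu billing weight).
BILLING_HISTORY = [
    (datetime(2023, 1, 1), 1000.0),  # made-up earlier weight
    (datetime(2024, 4, 1), 2200.0),  # weight from the config above; date is made up
]

# Placeholder GPU -> RGU ratio; the PR reads real values from iguane.
RGU_PER_GPU = {"Tesla V100-SXM2-16GB": 2.2}


def scaling_at(history, start_time: datetime) -> Optional[float]:
    """Return the weight in effect at start_time, or None if none applies yet."""
    dates = [date for date, _ in history]
    index = bisect_right(dates, start_time) - 1
    return history[index][1] if index >= 0 else None


def convert(billed_gres_gpu: float, gpu_type: str,
            start_time: datetime) -> Optional[Tuple[float, float]]:
    """Return (nb_gpu, nb_rgu), or None when scaling or ratio is unknown."""
    scaling = scaling_at(BILLING_HISTORY, start_time)
    ratio = RGU_PER_GPU.get(gpu_type)
    if scaling is None or ratio is None:
        return None
    nb_gpu = billed_gres_gpu / scaling  # billing divided by weight gives GPUs
    nb_rgu = nb_gpu * ratio             # GPUs times ratio gives RGUs
    return nb_gpu, nb_rgu


result = convert(35555, "Tesla V100-SXM2-16GB", datetime(2024, 5, 23, 19, 15, 57))
print(result)
```

Returning `None` when no weight or ratio is known leaves the job's values untouched rather than guessing; whether the real implementation behaves this way is an assumption.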

- harmonize names of billed GPUs
- get GPU nodes as a list instead of a string, as some nodes may have many GPUs (e.g. MIG GPUs)

Improve RGU function to handle harmonized names of MIG GPUs.

Improve update_allocated_gpu_type():
- check default allocated.gpu_type if a single gpu_type cannot be inferred from nodes
- harmonize GPU name using __DEFAULTS__ if available even if job does not have nodes
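
The fallback described in these commit messages could look roughly like the sketch below. This is not the code of update_allocated_gpu_type(): the function name `infer_gpu_type`, the shape of the node-to-GPUs mapping, the node names, and the assumption that the defaults map a raw GPU name to a harmonized one are all hypothetical.

```python
from typing import Dict, List, Optional


def infer_gpu_type(
    nodes: List[str],
    current_gpu_type: Optional[str],
    node_to_gpus: Dict[str, List[str]],
    defaults: Dict[str, str],
) -> Optional[str]:
    """Pick a GPU type for a job (hypothetical sketch).

    1. If the job's nodes expose exactly one GPU type, use it.
    2. Otherwise fall back to the job's current gpu_type, harmonized
       through the defaults table when an entry exists.
    """
    candidates = {gpu for node in nodes for gpu in node_to_gpus.get(node, [])}
    if len(candidates) == 1:
        return next(iter(candidates))
    if current_gpu_type is not None:
        return defaults.get(current_gpu_type, current_gpu_type)
    return None


# Made-up data: each node maps to a list because a node may expose several GPUs.
node_to_gpus = {
    "node-a": ["A100-SXM4-40GB"],
    "node-b": ["V100-SXM2-32GB", "A100-SXM4-40GB"],
}
defaults = {"v100": "V100-SXM2-32GB"}

print(infer_gpu_type(["node-a"], None, node_to_gpus, defaults))
print(infer_gpu_type(["node-b"], "v100", node_to_gpus, defaults))
print(infer_gpu_type([], "v100", node_to_gpus, defaults))
```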