-
Hey Delaunay, I'm working with an HPC team to test our systems. We're looking to run some models on our GPUs and figure out how scaling works for our multi-node setup. I didn't want to flood the issues board with questions, but I wanted to check that my understanding of milabench is correct. I'm using the non-dockerized version on my Slurm system, installed via the pip install guide.

Q1. When I'm running a multi-node benchmark like llm-lora-ddp-nodes in an sbatch/salloc session, what happens if the main/master process from my system.yaml file is interrupted? Is there some sort of cleanup mechanism if a node drops out mid-benchmark?

Q2. When using milabench system_slurm to generate my system.yaml config, what happens if the default SSH port 22 isn't used? In my current setup the nodes are not using port 22, and I find the generated config file ends up looking a bit off, so I have to tweak it manually. I get output like "couldn't resolve hostname" as well as "connection closed by port 22". Is it possible to adjust the code to work with non-standard SSH ports?

Q3. When running a multi-node benchmark on a 2-node x 8-GPU (A100) configuration, my understanding of the launch process is that milabench run launches a main benchmark process on each node, which then launches child processes on the GPUs, one process per GPU. My understanding comes from the image in the git repo. As a follow-up, do these benchmarks run independently, or do they coordinate with each other during a multi-node benchmark? From what I understand of llm-lora-ddp-nodes, PyTorch uses FSDP to distribute the model across the GPUs, and I'm assuming the processes coordinate their gradients.

Q4. Lastly, in the report, is the value of "n" in the columns the number of main processes launched in a multi-node benchmark? And what is the difference between the "perf" and "score" metrics? I looked through the code and noticed that only for the single-GPU benchmarks does it look like a weighted average, while on multi-node or multi-GPU benchmarks perf and score are the same for the most part.

Thanks for your time, and hopefully these are easy questions to answer.
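To illustrate Q2, the hand-edited config I end up with looks roughly like the sketch below. The field name I use for the SSH port is my own guess rather than something I found in the docs, which is part of why I'm asking:

```yaml
# Rough shape of my manually tweaked system.yaml (2 nodes, non-standard SSH port).
# "sshport" is a guessed field name, not confirmed against milabench's schema.
system:
  arch: cuda
  nodes:
    - name: node0
      ip: 10.0.0.10
      main: true
      user: benchuser
      sshport: 2222
    - name: node1
      ip: 10.0.0.11
      main: false
      user: benchuser
      sshport: 2222
```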
-
Milabench does nothing specific, it relies on
You can use the
Yes, milabench does something similar to the example below to launch multi-node experiments, assuming each node has 8 GPUs.
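A simplified sketch of the idea, not milabench's exact launcher code (`train.py` is a placeholder for the benchmark's entry point, and `$MAIN_ADDR` stands for the node marked `main: true`):

```bash
# Run on every node; one worker process per GPU (8 per node, 2 nodes = 16 ranks).
torchrun \
  --nnodes=2 \
  --nproc-per-node=8 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="$MAIN_ADDR:29400" \
  train.py
```

The workers on all nodes join the same rendezvous, so they form one process group and synchronize gradients together rather than running independently.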
For clarity, we added
The score is a normalized "per-node" performance measure.
There is an ongoing discussion on how to handle the perf normalization.
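As a rough illustration of the idea only (not the exact formula used in the report code), the per-node normalization amounts to something like:

```python
# Illustrative only: turn an aggregate throughput into a "per-node" figure
# so runs of different sizes are comparable. The real report code weighs
# and aggregates benchmarks differently.
def per_node_score(total_items_per_second: float, n_nodes: int) -> float:
    return total_items_per_second / n_nodes

print(per_node_score(12800.0, 2))  # 6400.0
```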
-
Here's what I'm trying to achieve: I've created my own copy of a slurm.yaml file with some extra configurations, I want to test it with schedule.py, and ultimately have it call milabench/scripts/milabench_run.bash. How would I specify it using the:
Is there a mechanism to use custom profiles? I found the section of code responsible for grabbing the parameters from the slurm.yaml file. What would be the best practice for substituting the parameters required there?
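What I'm imagining is roughly the sketch below; the profile layout and helper names here are entirely my own, not milabench's actual code:

```python
# Hypothetical sketch: read a custom slurm profile and turn it into sbatch arguments.
# The profile layout below is an assumption, not milabench's real slurm.yaml schema.
import subprocess
import yaml

def sbatch_args_from_profile(path: str, profile_name: str) -> list[str]:
    with open(path) as f:
        profiles = yaml.safe_load(f)
    # e.g. profiles["my-2x8-a100"] -> ["--nodes=2", "--gpus-per-node=8", ...]
    return [f"--{key}={value}" for key, value in profiles[profile_name].items()]

args = sbatch_args_from_profile("slurm.yaml", "my-2x8-a100")
subprocess.run(["sbatch", *args, "milabench/scripts/milabench_run.bash"], check=True)
```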
-
Yet another question: my goal is just to check whether the auto-sizing is working properly. After reading the documentation, is it sufficient to set the capacity in my system.yaml and run the benchmark with the environment variable MILABENCH_SIZER_AUTO=1? Also, should I include multirun in my existing system.yaml file, and what is its purpose?

I've done some digging, and this is what I was able to piece together. I understand that the purpose of scaling.yaml is to adjust the batch size for a given benchmark, but I also noticed that in your example system.yaml you have three different things set up under multirun. Could you explain the design purpose and how to use them? Based on my reading of the code base, execution goes from run.py -> system.py -> the multirun() function -> apply_system -> then the actual benchmark launch. From what I understand, matrix lets you dynamically set up multiple configurations for the same benchmark you're about to run, auto enables dynamic batch resizing to fit your available VRAM (in multiples of 8), and batch size just forces a particular size.

One more unrelated question that's not tied to auto-sizing: when I run this job script on Slurm, what happens if I don't know ahead of time which nodes I'm being allocated, so I can't set main: true in our system.yaml file? Suppose I have a 6-node x 8-GPU config but I ask sbatch to give me 2 nodes at random to test. Is there a way to handle that case, since I can't mark a node as main before the allocation is known? I'm picturing something like the sketch after this question.
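For the last point, the workaround I have in mind is to generate the node list inside the job script once Slurm has allocated the nodes. The field names in the generated system.yaml are copied from my current config and may not match milabench's schema exactly, and I'm assuming `milabench run --system` is the right way to point at it:

```bash
#!/bin/bash
set -euo pipefail

# Nodes are only known once Slurm allocates them.
mapfile -t NODES < <(scontrol show hostnames "$SLURM_JOB_NODELIST")

{
  echo "system:"
  echo "  nodes:"
  for i in "${!NODES[@]}"; do
    echo "    - name: node${i}"
    echo "      ip: ${NODES[$i]}"
    # Arbitrarily promote the first allocated node to main.
    if [ "$i" -eq 0 ]; then echo "      main: true"; else echo "      main: false"; fi
    echo "      user: $USER"
  done
} > system.yaml

# Auto batch sizing via the documented env var.
MILABENCH_SIZER_AUTO=1 milabench run --system system.yaml
```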