From c9f8bc6bb84a6e5cc31acab07a658453aba63216 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Wed, 22 Jan 2025 07:25:26 +0000 Subject: [PATCH 1/9] Dev-docker Documentation Updates Minor updates to several sections, with links to other documents where appropriate. --- docs/dev-docker/README.md | 85 ++++++++++++--------------------------- 1 file changed, 26 insertions(+), 59 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index c3496358c15d9..cc7ac5fd18158 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -14,29 +14,6 @@ It includes: - vLLM 0.6.3 - PyTorch 2.6dev (nightly) -## System configuration - -The performance data below was measured on a server with MI300X accelerators with the following system configuration. The performance might vary with different system configurations. - -| System | MI300X with 8 GPUs | -|---|---| -| BKC | 24.13 | -| ROCm | version ROCm 6.3 | -| amdgpu | build 2009461 | -| OS | Ubuntu 22.04 | -| Linux Kernel | 5.15.0-117-generic | -| BMCVersion | C2789.BC.0809.00 | -| BiosVersion | C2789.5.BS.1C11.AG.1 | -| CpldVersion | 02.02.00 | -| DCSCMCpldVersion | 02.02.00 | -| CX7 | FW 28.40.1000 | -| RAM | 1 TB | -| Host CPU | Intel(R) Xeon(R) Platinum 8480C | -| Cores | 224 | -| VRAM | 192 GB | -| Power cap | 750 W | -| SCLK/MCLK | 2100 Mhz / 1300 Mhz | - ## Pull latest You can pull the image with `docker pull rocm/vllm-dev:main` @@ -137,28 +114,18 @@ Download and launch the docker, There are some system settings to be configured for optimum performance on MI300X. -#### NUMA balancing setting - -To optimize performance, disable automatic NUMA balancing. Otherwise, the GPU might hang until the periodic balancing is finalized. For further details, refer to the AMD Instinct MI300X system optimization guide. - -Disable automatic NUMA balancing - - sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' - -Check if NUMA balancing is disabled (returns 0 if disabled) +#### System optimization - cat /proc/sys/kernel/numa_balancing - 0 +Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation][https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html]. In particular, it is important to ensure that NUMA auto-balancing is disabled. #### LLM performance settings Some environment variables enhance the performance of the vLLM kernels and PyTorch's tunableOp on the MI300X accelerator. The settings below are already preconfigured in the Docker image. See the AMD Instinct MI300X workload optimization guide for more information. -##### vLLM performance environment variables +##### Performance environment variables export VLLM_USE_TRITON_FLASH_ATTN=0 export NCCL_MIN_NCHANNELS=112 - export VLLM_FP8_PADDING=1 You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to performance GEMM tuning for the 1st benchmark run. It will take some time to complete the tuning during the benchmark. After tuning, it will generate several csv files as the performance lookup database. For the subsequent benchmark runs, you can keep @@ -169,13 +136,13 @@ PYTORCH_TUNABLEOP_TUNING to 0 to use the selected kernels. ##### vLLM engine performance settings vLLM provides a number of engine options which can be changed to improve performance. -Refer for the complete list of vLLM engine options. +Refer to the [vLLM Engine Args][https://docs.vllm.ai/en/stable/usage/engine_args.html] documentation for the complete list of vLLM engine options. 
Below is a list of options which are useful: - **--max-model-len** : Maximum context length supported by the model instance. Can be set to a lower value than model configuration value to improve performance and gpu memory utilization. - **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher gpu memory utilization. 65536 works well for LLama models. -- **--max-num-seqs** : The maximum decode batch size. Set to a value higher than the default(256) to improve decode throughput. Higher values will also utilize more KV cache memory. Too high values can cause KV cache space to run out which will lead to decode preemption. 512/1024 works well for LLama models. +- **--max-num-seqs** : The maximum decode batch size (default 256). Using larger values will allow more prompts to be processed concurrently, resulting in increased throughput (possibly at the expense of higher latency). If the value is too large, there may not be enough GPU memory for the KV cache, resulting in requests getting preempted. The optimal value will depend on the GPU memory, model size, and maximum context length. - **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance. -- **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. It's recommended to set this to 0.99 to increase KV cache space. +- **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph. Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments. @@ -195,7 +162,7 @@ If you want to do limited online tuning use --enforce-eager and tune for particu Run the following command for BS=1/2/4/8: python /app/vllm/benchmarks/benchmark_latency.py \ - --model \ + --model \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype float16 \ @@ -209,16 +176,16 @@ If you want to do limited online tuning use --enforce-eager and tune for particu --num-scheduler-steps 10 \ --enforce-eager -The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device__full.csv for D=1 through 7. +The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_<D>_full.csv for D=1 through 7. After the above steps, retain the environment variables set earlier, but set export PYTORCH_TUNABLEOP_TUNING=0 to disable online tuning, and use the tuned solutions. 
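The copy step above can be scripted; here is a minimal sketch, assuming the default PYTORCH_TUNABLEOP_FILENAME pattern shown earlier:

    # Replicate the device-0 tuned GEMM solutions for the remaining seven GPUs
    for D in 1 2 3 4 5 6 7; do
        cp /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv \
           /app/tuned_gemm_csv/bench_latency_tune_device_${D}_full.csv
    done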
##### Latency Benchmark -Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batch size 32 and tensor parallelism 8 as an example, +Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batch size 32 and tensor parallelism 8 as an example, python /app/vllm/benchmarks/benchmark_latency.py \ - --model /data/llm/Meta-Llama-3.1-405B-Instruct-FP8-KV \ + --model /data/llm/Llama-3.1-405B-Instruct-FP8-KV \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype half \ @@ -229,10 +196,10 @@ Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batc --input-len 128 \ --output-len 128 -If you want to run Meta-Llama-3.1-405B FP16, please run +If you want to run Llama-3.1-405B FP16, please run python /app/vllm/benchmarks/benchmark_latency.py \ - --model /data/llm/Meta-Llama-3.1-405B-Instruct \ + --model /data/llm/Llama-3.1-405B-Instruct \ --dtype float16 \ --gpu-memory-utilization 0.99 \ --distributed-executor-backend mp \ @@ -250,10 +217,10 @@ For more information about the parameters, please run ##### Throughput Benchmark -Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor parallelism 8 as an example, +Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor parallelism 8 as an example, python /app/vllm/benchmarks/benchmark_throughput.py \ - --model /data/llm/Meta-Llama-3.1-405B-Instruct-FP8-KV \ + --model /data/llm/Llama-3.1-405B-Instruct-FP8-KV \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype half \ @@ -265,10 +232,10 @@ Benchmark Meta-Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and t --input-len 128 \ --output-len 128 -If you want to run Meta-Llama-3.1-405B FP16, please run +If you want to run Llama-3.1-405B FP16, please run python /app/vllm/benchmarks/benchmark_throughput.py \ - --model /data/llm/Meta-Llama-3.1-405B-Instruct \ + --model /data/llm/Llama-3.1-405B-Instruct \ --dtype float16 \ --gpu-memory-utilization 0.9 \ --num-prompts 2000 \ @@ -311,9 +278,9 @@ line 245 - interval = np.random.exponential(1.0 / request_rate) line 245 + ## interval = np.random.exponential(1.0 / request_rate) line 246 + interval = 1.0 / request_rate -Benchmark Meta-Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example, +Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example, - vllm serve /data/llm/Meta-Llama-3.1-70B-Instruct-FP8-KV \ + vllm serve /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ --swap-space 16 \ --disable-log-requests \ --quantization fp8 \ @@ -331,7 +298,7 @@ run client in a separate terminal. Use port_id from previous step else port-id=8 python /app/vllm/benchmarks/benchmark_serving.py \ --port 8000 \ - --model /data/llm/Meta-Llama-3.1-70B-Instruct-FP8-KV \ + --model /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ --dataset-name random \ --random-input-len 4096 \ --random-output-len 512 \ @@ -357,7 +324,7 @@ Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. --max-model-len 4608 \ --num-scheduler-steps 10 \ --num-prompts 100 \ - --model /data/llm/Meta-Llama-3.1-70B-Instruct-FP8-KV \ + --model /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ --input-len 4096 \ --output-len 512 \ --dtype float16 \ @@ -376,11 +343,11 @@ Speculative decoding is one of the key features in vLLM. 
It has been supported o Without Speculative Decoding - - python benchmark_latency.py --model /models/models--amd--Meta-Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 + python benchmark_latency.py --model /models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 With Speculative Decoding - - python benchmark_latency.py --model /models/models--amd--Meta-Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model /models/models--amd--Meta-Llama-3.1-8B-Instruct-FP8-KV/ --num-speculative-tokens 5 + python benchmark_latency.py --model /models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model /models/models--amd--Llama-3.1-8B-Instruct-FP8-KV/ --num-speculative-tokens 5 You should see some performance improvement about the e2e latency. @@ -388,7 +355,7 @@ You should see some performance improvement about the e2e latency. ### fp16 -vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64 +vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64 | Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr| |-------|------:|--------------|-----:|-----------|---|-----:|---|-----:| @@ -396,7 +363,7 @@ vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/0699 ### fp8 -vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,quantization=fp8,quantized_weights_path=/llama.safetensors,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 32 +vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,quantization=fp8,quantized_weights_path=/llama.safetensors,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 32 | Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr| |-------|------:|--------------|-----:|-----------|---|----:|---|-----:| @@ -404,9 +371,9 @@ vllm (pretrained=models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/0699 ## Performance -### LLaMA2/3 *MLPerf* 70B +### *MLPerf* Llama-2-70B -Please refer to the MLPerf instructions for recreating the MLPerf numbers. +Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission — ROCm Blogs][https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inf-4-1/README.html] for information on reproducing MLPerf 4.1 Inference results. Note that due to changes in vLLM, it is not possible to use these instructions with the current rocm/vllm-dev docker image. 
## Version From 066c262c591014cf620ee7d32126d4ffe66131ef Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 03:41:23 +0000 Subject: [PATCH 2/9] Fix formatting of GEMM filename --- docs/dev-docker/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index cc7ac5fd18158..ed08367004efc 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -176,7 +176,7 @@ If you want to do limited online tuning use --enforce-eager and tune for particu --num-scheduler-steps 10 \ --enforce-eager -The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_<D>_full.csv for D=1 through 7. +The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_\_full.csv for D=1 through 7. After the above steps, retain the environment variables set earlier, but set export PYTORCH_TUNABLEOP_TUNING=0 to disable online tuning, and use the tuned solutions. From 72e4dd14c26653816b98399807476afe36ef6131 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 05:19:10 +0000 Subject: [PATCH 3/9] README cleanup - Reorder some sections of the README to make them easier to follow - Improve formatting of bash commands - Prefer use of huggingface model names instead of hard-coded directories - Clean up wording --- docs/dev-docker/README.md | 240 +++++++++++++++++++------------------- 1 file changed, 117 insertions(+), 123 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index ed08367004efc..1607bdbfd68a5 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -6,9 +6,9 @@ Documentation for vLLM Inferencing on AMD Instinct platforms. vLLM is a toolkit and library for large language model (LLM) inference and serving. It deploys the PagedAttention algorithm, which reduces memory consumption and increases throughput by leveraging dynamic key and value allocation in GPU memory. vLLM also incorporates many recent LLM acceleration and quantization algorithms, such as fp8 GeMM, fp8 KV cache, continuous batching, flash attention, hip graph, tensor parallel, GPTQ, AWQ, and token speculation. In addition, AMD implements high-performance custom kernels and modules in vLLM to enhance performance further. -This documentation shows some reference performance numbers and the steps to reproduce it for the popular Llama 3.1 series models from Meta with a pre-built AMD vLLM docker optimized for an AMD Instinct™ MI300X accelerator. +This documentation includes information for running the popular Llama 3.1 series models from Meta using a pre-built AMD vLLM docker image optimized for an AMD Instinct™ MI300X or MI325X accelerator. 
-It includes: +The pre-built image includes: - ROCm™ 6.3 - vLLM 0.6.3 @@ -16,79 +16,76 @@ It includes: ## Pull latest -You can pull the image with `docker pull rocm/vllm-dev:main` +You can pull the most recent validated docker image with `docker pull rocm/vllm-dev:main` -### What is New +## What is New - ROCm 6.3 support - Potential bug with Tunable Ops not saving due to a PyTorch issue -Gemms are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md) +Gemms are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md) The gemms are automatically enabled in the docker image, and all stored gemm configs are kept in /app/_gemm_csv in the same image -### Reproducing benchmark results +## Obtaining models -### Use pre-quantized models - -To make it easier to run fp8 Llama 3.1 models on MI300X, the quantized checkpoints are available on AMD Huggingface space as follows +The vllm-dev docker image should work with any model supported by vLLM. When running with FP8, AMD has quantized models available for a variety of popular models, or you can quantize models yourself using Quark. The vLLM benchmark scripts will download models automatically if needed, and then store them in a HuggingFace cache directory for reuse in future tests. Alternatively you can choose to download the model to the cache (or to another directory on the system) in advance. -- -- -- -- +Many HuggingFace models, including Llama-3.1, have gated access. You will need to an account at (https://huggingface.co), search for the model of interest, and request access to it if necessary. You will also need to create a token for accessing these models from vLLM: open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token. -Currently these models are private. Please join to access. +### Downloading models with huggingface-cli -Download the model you want to run. +If you would like to download models directly (instead of allowing vLLM to download them automatically) you can install the HuggingFace CLI: +```bash +sudo pip install -U "huggingface_hub[cli]" +``` -These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to +Then login using the token that you created earlier. (Note, it is not necessary to save it as a git credential.): +```bash +huggingface-cli login +``` -### Quantize your own models +Note: The instructions in this document use `/data` to store the models. If you choose a different directory, you will also need to make that change to the host volume mount when launching the docker container. Some models can be quite large; please ensure that you have sufficient disk space prior to downloading the model. Since the model download may take a long time, you may wish to use `tmux` or `screen` to avoid getting disconnected. -This step is optional for you to use quantized models on your own. Take Llama 3.1 405B as an example. +You can download a model to the huggingface-cache directory using a command similar to the following (substituting the name of the model you wish to download): +```bash +sudo mkdir -p /data/huggingface-cache +sudo chmod -R a+w /data/huggingface-cache +HF_HOME=/data/huggingface-cache huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --exclude "original/*" +``` -Download the Model View the Llama-3.1-405B model at . 
Ensure that you have been granted access, and apply for it if you do not have access. +Alternatively, you may wish to download the model to a specific directory, e.g. so you can quantize the model with Quark: +```bash +sudo mkdir -p /data/llama-3.1 +sudo chmod -R a+w /data/llama-3.1 +huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Llama-3.1-405B-Instruct +``` -If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token. +In the benchmark commands provided later in this document, replace the model name (e.g. `amd/Llama-3.1-405B-Instruct-FP8-KV`) with the path to the model (e.g. `/data/llama-3.1/Llama-3.1-405B-Instruct`) -Install the `huggingface-cli` (if not already available on your system) and log in with the token you created earlier and download the model. The instructions in this document assume that the model will be stored under `/data/llama-3.1`. You can store the model in a different location, but then you'll need to update other commands accordingly. The model is quite large and will take some time to download; it is recommended to use tmux or screen to keep your session running without getting disconnected. +### Use pre-quantized models - sudo pip install -U "huggingface_hub[cli]" - - huggingface-cli login +AMD has provided FP8-quantized versions of several models in order to make them easier to run on MI300X / MI325X: -Enter the token you created earlier; you do NOT need to save it as a git credential +- +- +- +- -Create the directory for Llama 3.1 models (if it doesn't already exist) +These models are currently private; please join to access. - sudo mkdir -p /data/llama-3.1 - - sudo chmod -R a+w /data/llama-3.1 +These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to -Download the model +### Quantize your own models - huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Llama-3.1-405B-Instruct +This is an optional step if you would like to quantize your own model instead of using AMD's pre-quantized models. These instructions use Llama-3.1-405B as an example, but the commands are similar for other models. -Similarly, you can download Llama-3.1-70B and Llama-3.1-8B. +First download the model from to the /data/llama-3.1 directory as described above. 
[Download and install Quark](https://quark.docs.amd.com/latest/install.html) Run the quantization script in the example folder using the following command line: -export MODEL_DIR = [local model checkpoint folder] or meta-llama/Llama-3.1-405B-Instruct - -#### single GPU - - python3 quantize_quark.py \ - --model_dir $MODEL_DIR \ - --output_dir Llama-3.1-405B-Instruct-FP8-KV \ - --quant_scheme w_fp8_a_fp8 \ - --kv_cache_dtype fp8 \ - --num_calib_data 128 \ - --model_export quark_safetensors \ - --no_weight_matrix_merge - -#### If model size is too large for single GPU, please use multi GPU instead - +```bash +export MODEL_DIR = /data/llama-3.1/Llama-3.1-405B-Instruct python3 quantize_quark.py \ --model_dir $MODEL_DIR \ --output_dir Llama-3.1-405B-Instruct-FP8-KV \ @@ -98,69 +95,67 @@ export MODEL_DIR = [local model checkpoint folder] or meta-llama/Llama-3.1-405B- --model_export quark_safetensors \ --no_weight_matrix_merge \ --multi_gpu +``` + +Note: the `--multi_gpu` parameter can be ommitted for small models that fit on a single GPU. + +## Performance testing with AMD vLLM Docker + +### System optimization + +Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation][https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html]. In particular, it is important to ensure that NUMA auto-balancing is disabled. ### Launch AMD vLLM Docker -Download and launch the docker, +Download and launch the docker. The HF_TOKEN is required to be set (either here or after launching the container) if you want to allow vLLM to download gated models automatically; use your HuggingFace token in place of `` in the command below: +```bash docker run -it --rm --ipc=host --network=host --group-add render \ --privileged --security-opt seccomp=unconfined \ --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \ --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ - -v /data/llama-3.1:/data/llm \ + -e HF_HOME=/data \ + -e HF_TOKEN= \ + -v /data:/data \ rocm/vllm-dev:main +``` -### Benchmark with AMD vLLM Docker - -There are some system settings to be configured for optimum performance on MI300X. - -#### System optimization - -Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation][https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html]. In particular, it is important to ensure that NUMA auto-balancing is disabled. - -#### LLM performance settings - -Some environment variables enhance the performance of the vLLM kernels and PyTorch's tunableOp on the MI300X accelerator. The settings below are already preconfigured in the Docker image. See the AMD Instinct MI300X workload optimization guide for more information. - -##### Performance environment variables +### Performance environment variables +Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information. +```bash export VLLM_USE_TRITON_FLASH_ATTN=0 export NCCL_MIN_NCHANNELS=112 +``` -You can set both PYTORCH_TUNABLEOP_ENABLED and PYTORCH_TUNABLEOP_TUNING to 1 to performance GEMM tuning for the 1st benchmark run. -It will take some time to complete the tuning during the benchmark. After tuning, it will generate several csv files as the performance lookup database. 
For the subsequent benchmark runs, you can keep +### vLLM engine performance settings -PYTORCH_TUNABLEOP_ENABLED as 1 and set -PYTORCH_TUNABLEOP_TUNING to 0 to use the selected kernels. +vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args][https://docs.vllm.ai/en/stable/usage/engine_args.html] documentation for the complete list of vLLM engine options. -##### vLLM engine performance settings - -vLLM provides a number of engine options which can be changed to improve performance. -Refer to the [vLLM Engine Args][https://docs.vllm.ai/en/stable/usage/engine_args.html] documentation for the complete list of vLLM engine options. -Below is a list of options which are useful: +Below is a list of a few of the key vLLM engine arguments for performance; these can be passed to the vLLM benchmark scripts: - **--max-model-len** : Maximum context length supported by the model instance. Can be set to a lower value than model configuration value to improve performance and gpu memory utilization. - **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher gpu memory utilization. 65536 works well for LLama models. - **--max-num-seqs** : The maximum decode batch size (default 256). Using larger values will allow more prompts to be processed concurrently, resulting in increased throughput (possibly at the expense of higher latency). If the value is too large, there may not be enough GPU memory for the KV cache, resulting in requests getting preempted. The optimal value will depend on the GPU memory, model size, and maximum context length. - **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance. - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph. -Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments. - -##### Online Gemm Tuning - -Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8 +### Online Gemm Tuning +Optional: Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8 If you want to do limited online tuning use --enforce-eager and tune for particular batch sizes. See example below. 
+```bash export PYTORCH_TUNABLEOP_TUNING=1 export PYTORCH_TUNABLEOP_ENABLED=1 export PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS=100 export PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS=10 export PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE=1024 export PYTORCH_TUNABLEOP_FILENAME=/app/tuned_gemm_csv/bench_latency_tune_device_%d_full.csv - +``` Run the following command for BS=1/2/4/8: - +```bash + for BS in 1 2 4 8 + do python /app/vllm/benchmarks/benchmark_latency.py \ --model \ --quantization fp8 \ @@ -172,20 +167,22 @@ If you want to do limited online tuning use --enforce-eager and tune for particu --tensor-parallel-size 8 \ --input-len 4096 \ --output-len 512 \ - --batch-size \ + --batch-size ${BS} \ --num-scheduler-steps 10 \ --enforce-eager + done +``` The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_\_full.csv for D=1 through 7. After the above steps, retain the environment variables set earlier, but set export PYTORCH_TUNABLEOP_TUNING=0 to disable online tuning, and use the tuned solutions. -##### Latency Benchmark +### Latency Benchmark Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batch size 32 and tensor parallelism 8 as an example, - +```bash python /app/vllm/benchmarks/benchmark_latency.py \ - --model /data/llm/Llama-3.1-405B-Instruct-FP8-KV \ + --model amd/Llama-3.1-405B-Instruct-FP8-KV \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype half \ @@ -195,11 +192,12 @@ Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batch siz --batch size 32 \ --input-len 128 \ --output-len 128 +``` If you want to run Llama-3.1-405B FP16, please run - +```bash python /app/vllm/benchmarks/benchmark_latency.py \ - --model /data/llm/Llama-3.1-405B-Instruct \ + --model meta-llama/Llama-3.1-405B-Instruct \ --dtype float16 \ --gpu-memory-utilization 0.99 \ --distributed-executor-backend mp \ @@ -207,6 +205,7 @@ If you want to run Llama-3.1-405B FP16, please run --batch size 32 \ --input-len 128 \ --output-len 128 +``` You can change various input-len, output-len, batch size and run the benchmark as well. When output-len is 1, it measures prefill latency (TTFT). Decoding latency (TPOT) can be calculated based on the measured latency. 
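For example, assuming the decode steps are roughly uniform, TPOT can be estimated as (end-to-end latency - TTFT) / (output-len - 1), where TTFT is the latency measured for the same input-len and batch size with output-len set to 1.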
@@ -215,12 +214,12 @@ For more information about the parameters, please run /app/vllm/benchmarks/benchmark_latency.py -h -##### Throughput Benchmark +### Throughput Benchmark Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor parallelism 8 as an example, - +```bash python /app/vllm/benchmarks/benchmark_throughput.py \ - --model /data/llm/Llama-3.1-405B-Instruct-FP8-KV \ + --model amd/Llama-3.1-405B-Instruct-FP8-KV \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype half \ @@ -231,11 +230,11 @@ Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor --tensor-parallel-size 8 \ --input-len 128 \ --output-len 128 - +``` If you want to run Llama-3.1-405B FP16, please run - +```bash python /app/vllm/benchmarks/benchmark_throughput.py \ - --model /data/llm/Llama-3.1-405B-Instruct \ + --model meta-llama/Llama-3.1-405B-Instruct \ --dtype float16 \ --gpu-memory-utilization 0.9 \ --num-prompts 2000 \ @@ -250,7 +249,7 @@ If you want to run Llama-3.1-405B FP16, please run --swap-space --max-model-len --gpu-memory-utilization 0.99 - +``` For fp8 quantized Llama3.18B/70B models: Recommend TP:1 for Llama3.1-8B, 8 for Llama3.1-70B @@ -265,22 +264,11 @@ For more information about the parameters, please run Tensor parallelism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance. -##### Online Server Benchmark - -Make the following changes if required - -/app/vllm/benchmarks/backend_request_func.py - -line 242 + "ignore_eos": True, - -/app/vllm/benchmarks/benchmark_serving.py -line 245 - interval = np.random.exponential(1.0 / request_rate) -line 245 + ## interval = np.random.exponential(1.0 / request_rate) -line 246 + interval = 1.0 / request_rate +### Online Server Benchmark Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example, - - vllm serve /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ +```bash + vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --swap-space 16 \ --disable-log-requests \ --quantization fp8 \ @@ -291,40 +279,42 @@ Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor par --max-num-batched-tokens 65536 \ --gpu-memory-utilization 0.99 \ --num_scheduler-steps 10 - +``` Change port (for example --port 8005) if port=8000 is currently being used by other processes. -run client in a separate terminal. Use port_id from previous step else port-id=8000. - +Run client in a separate terminal. Use port_id from previous step else port-id=8000. +```bash python /app/vllm/benchmarks/benchmark_serving.py \ --port 8000 \ - --model /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ + --model amd/Llama-3.1-70B-Instruct-FP8-KV \ --dataset-name random \ --random-input-len 4096 \ --random-output-len 512 \ --request-rate 1 \ + --ignore-eos \ --num-prompts 500 \ --percentile-metrics ttft,tpot,itl,e2el - +``` Once all prompts are processed, terminate the server gracefully (ctrl+c). -##### CPX mode +### CPX mode Currently only CPX-NPS1 mode is supported. So ONLY tp=1 is supported in CPX mode. But multiple instances can be started simultaneously (if needed) in CPX-NPS1 mode. -Set GPUs in CPX mode - +Set GPUs in CPX mode with: +```bash rocm-smi --setcomputepartition cpx +``` Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1. 
- +```bash HIP_VISIBLE_DEVICES=0 \ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --max-model-len 4608 \ --num-scheduler-steps 10 \ --num-prompts 100 \ - --model /data/llm/Llama-3.1-70B-Instruct-FP8-KV \ + --model amd/Llama-3.1-70B-Instruct-FP8-KV \ --input-len 4096 \ --output-len 512 \ --dtype float16 \ @@ -332,26 +322,29 @@ Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. --output-json \ --quantization fp8 \ --gpu-memory-utilization 0.99 +``` Set GPU to SPX mode. - +```bash rocm-smi --setcomputepartition spx +``` ### Speculative Decoding Speculative decoding is one of the key features in vLLM. It has been supported on MI300. Here below is an example of the performance benchmark w/wo speculative decoding for Llama 3.1 405B with Llama 3.1 8B as the draft model. Without Speculative Decoding - - - python benchmark_latency.py --model /models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 +```bash + python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 +``` With Speculative Decoding - - - python benchmark_latency.py --model /models/models--amd--Llama-3.1-405B-Instruct-FP8-KV/ --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model /models/models--amd--Llama-3.1-8B-Instruct-FP8-KV/ --num-speculative-tokens 5 - +```bash + python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5 +``` You should see some performance improvement about the e2e latency. -### MMLU_PRO_Biology Accuracy Eval +## MMLU_PRO_Biology Accuracy Eval ### fp16 @@ -388,8 +381,9 @@ vLLM: --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 . +``` \ No newline at end of file From bc602ade34539ded53ff785a3691f816f0521d65 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 07:12:59 +0000 Subject: [PATCH 4/9] Expanded sample commands for Latency and Throughput --- docs/dev-docker/README.md | 224 ++++++++++++++++++++++---------------- 1 file changed, 128 insertions(+), 96 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 1607bdbfd68a5..0ec23993880fa 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -110,7 +110,7 @@ Before running performance tests you should ensure that the system is optimized Download and launch the docker. The HF_TOKEN is required to be set (either here or after launching the container) if you want to allow vLLM to download gated models automatically; use your HuggingFace token in place of `` in the command below: ```bash - docker run -it --rm --ipc=host --network=host --group-add render \ +docker run -it --rm --ipc=host --network=host --group-add render \ --privileged --security-opt seccomp=unconfined \ --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \ --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ @@ -124,8 +124,8 @@ Download and launch the docker. The HF_TOKEN is required to be set (either here Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information. 
```bash - export VLLM_USE_TRITON_FLASH_ATTN=0 - export NCCL_MIN_NCHANNELS=112 +export VLLM_USE_TRITON_FLASH_ATTN=0 +export NCCL_MIN_NCHANNELS=112 ``` ### vLLM engine performance settings @@ -145,32 +145,32 @@ Optional: Online Gemm tuning for small decode batch sizes can improve performanc If you want to do limited online tuning use --enforce-eager and tune for particular batch sizes. See example below. ```bash - export PYTORCH_TUNABLEOP_TUNING=1 - export PYTORCH_TUNABLEOP_ENABLED=1 - export PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS=100 - export PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS=10 - export PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE=1024 - export PYTORCH_TUNABLEOP_FILENAME=/app/tuned_gemm_csv/bench_latency_tune_device_%d_full.csv +export PYTORCH_TUNABLEOP_TUNING=1 +export PYTORCH_TUNABLEOP_ENABLED=1 +export PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS=100 +export PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS=10 +export PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE=1024 +export PYTORCH_TUNABLEOP_FILENAME=/app/tuned_gemm_csv/bench_latency_tune_device_%d_full.csv ``` Run the following command for BS=1/2/4/8: ```bash - for BS in 1 2 4 8 - do - python /app/vllm/benchmarks/benchmark_latency.py \ - --model \ - --quantization fp8 \ - --kv-cache-dtype fp8 \ - --dtype float16 \ - --max-model-len 8192 \ - --num-iters-warmup 5 \ - --num-iters 5 \ - --tensor-parallel-size 8 \ - --input-len 4096 \ - --output-len 512 \ - --batch-size ${BS} \ - --num-scheduler-steps 10 \ - --enforce-eager - done +for BS in 1 2 4 8 +do + python /app/vllm/benchmarks/benchmark_latency.py \ + --model \ + --quantization fp8 \ + --kv-cache-dtype fp8 \ + --dtype float16 \ + --max-model-len 8192 \ + --num-iters-warmup 5 \ + --num-iters 5 \ + --tensor-parallel-size 8 \ + --input-len 4096 \ + --output-len 512 \ + --batch-size ${BS} \ + --num-scheduler-steps 10 \ + --enforce-eager +done ``` The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_\_full.csv for D=1 through 7. @@ -179,96 +179,128 @@ After the above steps, retain the environment variables set earlier, but set exp ### Latency Benchmark -Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens, batch size 32 and tensor parallelism 8 as an example, +vLLM's benchmark_latency.py script measures end-to-end latency for a specified model, input/output length, and batch size. + +You can run latency tests for FP8 models with: ```bash - python /app/vllm/benchmarks/benchmark_latency.py \ - --model amd/Llama-3.1-405B-Instruct-FP8-KV \ +MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV +BS=1 +IN=128 +OUT=2048 +TP=8 + +python3 /app/vllm/benchmarks/benchmark_latency.py \ + --distributed-executor-backend mp \ --quantization fp8 \ --kv-cache-dtype fp8 \ - --dtype half \ - --gpu-memory-utilization 0.99 \ - --distributed-executor-backend mp \ - --tensor-parallel-size 8 \ - --batch size 32 \ - --input-len 128 \ - --output-len 128 + --dtype float16 \ + --gpu-memory-utilization 0.95 \ + --num-scheduler-steps 10 \ + --model $MODEL \ + --max-model-len 8192 \ + --batch-size $BS \ + --input-len $IN \ + --output-len $OUT \ + --tensor-parallel-size $TP ``` -If you want to run Llama-3.1-405B FP16, please run +For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. + +When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). 
It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. + +To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` and set `--num-scheduler-steps 1` to get a more accurate measurement of the time that it actually takes to generate the first token. The following command includes these recommendations. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.) + ```bash - python /app/vllm/benchmarks/benchmark_latency.py \ - --model meta-llama/Llama-3.1-405B-Instruct \ - --dtype float16 \ - --gpu-memory-utilization 0.99 \ +MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV +BS=1 +IN=128 +OUT=1 +TP=8 + +python3 /app/vllm/benchmarks/benchmark_latency.py \ --distributed-executor-backend mp \ - --tensor-parallel-size 8 \ - --batch size 32 \ - --input-len 128 \ - --output-len 128 + --quantization fp8 \ + --kv-cache-dtype fp8 \ + --dtype float16 \ + --gpu-memory-utilization 0.95 \ + --num-scheduler-steps 1 \ + --enforce-eager \ + --model $MODEL \ + --max-model-len 8192 \ + --batch-size $BS \ + --input-len $IN \ + --output-len $OUT \ + --tensor-parallel-size $TP ``` -You can change various input-len, output-len, batch size and run the benchmark as well. When output-len is 1, it measures prefill latency (TTFT). -Decoding latency (TPOT) can be calculated based on the measured latency. - -For more information about the parameters, please run - - /app/vllm/benchmarks/benchmark_latency.py -h +For additional information about the available parameters run: +```bash +/app/vllm/benchmarks/benchmark_latency.py -h +``` ### Throughput Benchmark +vLLM's benchmark_throughput.py script measures offline throughput. It can either use an input dataset or random prompts with fixed input/output lengths. 
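As a sketch of the dataset-based mode (the dataset path below is a placeholder, and the exact dataset flags can differ between vLLM versions, so confirm them with `/app/vllm/benchmarks/benchmark_throughput.py -h`):

```bash
# Assumes a ShareGPT-format JSON dataset downloaded to /data (placeholder path)
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --dtype float16 \
    --tensor-parallel-size 1 \
    --num-prompts 1000 \
    --dataset /data/ShareGPT_V3_unfiltered_cleaned_split.json
```

The fixed-length (random prompt) mode used in the rest of this section is shown next.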
-Benchmark Llama-3.1-405B FP8 with input 128 tokens, output 128 tokens and tensor parallelism 8 as an example, +You can run latency tests for FP8 models with: ```bash - python /app/vllm/benchmarks/benchmark_throughput.py \ - --model amd/Llama-3.1-405B-Instruct-FP8-KV \ +MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV +BS=1 +IN=128 +OUT=2048 +TP=8 +PROMPTS=1000 +MAX_NUM_SEQS=2000 + +python3 /app/vllm/benchmarks/benchmark_throughput.py \ + --distributed-executor-backend mp \ --quantization fp8 \ --kv-cache-dtype fp8 \ - --dtype half \ - --gpu-memory-utilization 0.99 \ - --num-prompts 2000 \ - --distributed-executor-backend mp \ - --num-scheduler-steps 10 \ - --tensor-parallel-size 8 \ - --input-len 128 \ - --output-len 128 -``` -If you want to run Llama-3.1-405B FP16, please run -```bash - python /app/vllm/benchmarks/benchmark_throughput.py \ - --model meta-llama/Llama-3.1-405B-Instruct \ --dtype float16 \ - --gpu-memory-utilization 0.9 \ - --num-prompts 2000 \ - --distributed-executor-backend mp \ - --num-scheduler-steps 10 \ - --tensor-parallel-size 8 \ - --input-len 128 \ - --output-len 128 \ - --swap-space 16 \ - --max-model-len 8192 \ + --gpu-memory-utilization 0.95 \ --max-num-batched-tokens 65536 \ - --swap-space - --max-model-len - --gpu-memory-utilization 0.99 + --num-scheduler-steps 10 \ + --enable-chunked-prefill False \ + --model $MODEL \ + --input-len $IN \ + --output-len $OUT \ + --tensor-parallel-size $TP \ + --num-prompts $PROMPTS \ + --max-num-seqs $MAX_NUM_SEQS ``` -For fp8 quantized Llama3.18B/70B models: - Recommend TP:1 for Llama3.1-8B, 8 for Llama3.1-70B - Recommend NSCHED: 10 for Llama3.1-8B, 8 for Llama3.1-70B +For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. + +When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. + +It is important to tune vLLM’s --max-num-seqs value to an appropriate value depending on the model and input/output lengths. Larger values will allow vLLM to leverage more of the GPU memory for KV Cache and process more prompts concurrently. But if the value is too large, the KV cache will reach its capacity and vLLM will have to cancel and re-process some prompts. Suggested values for various models and configurations are listed below. -You can change various input-len, output-len, num-prompts and run the benchmark as well. -Please note num-scheduler-step is a new feature added in vLLM 0.6.0. It can improve the decoding latency and throughput, however, it may increase the prefill latency. +For models that fit on a single GPU, it is usually best to run with `--tensor-parallel-size 1`. Requests can be distributed across multiple copies of vLLM running on different GPUs. This will be more efficient than running a single copy of the model with `--tensor-parallel-size 8`. (Note: the benchmark_throughput.py script does not include direct support for using multiple copies of vLLM) -For more information about the parameters, please run +For optimal performance, the PROMPTS value should be a multiple of the MAX_NUM_SEQS value -- for example, if MAX_NUM_SEQS=2048 then the PROMPTS value could be 2048, 4096, etc. If PROMPTS is smaller than MAX_NUM_SEQS then there won’t be enough prompts for vLLM to maximize concurrency. 
- /app/vllm/benchmarks/benchmark_throughput.py -h +Recommended values for various configurations are listed in this table: -Tensor parallelism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance. +| MODEL | TP | IN | OUT | MAX_NUM_SEQS (MI300X) | MAX_NUM_SEQS (MI325X) | +|------------------------------------|----|------|------|-----------------------|-----------------------| +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 128 | 2500 | 3000 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 2048 | 1500 | 1500 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 128 | 1500 | 1500 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 2048 | 750 | 750 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 128 | 128 | 2000 | 2000 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 128 | 2048 | 250 | 250 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 2048 | 128 | 250 | 250 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 2048 | 2048 | 250 | 250 | + +For additional information about the available parameters run: +```bash +/app/vllm/benchmarks/benchmark_throughput.py -h +``` -### Online Server Benchmark +### Online Serving Benchmark Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example, ```bash - vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ +vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --swap-space 16 \ --disable-log-requests \ --quantization fp8 \ @@ -284,7 +316,7 @@ Change port (for example --port 8005) if port=8000 is currently being used by ot Run client in a separate terminal. Use port_id from previous step else port-id=8000. ```bash - python /app/vllm/benchmarks/benchmark_serving.py \ +python /app/vllm/benchmarks/benchmark_serving.py \ --port 8000 \ --model amd/Llama-3.1-70B-Instruct-FP8-KV \ --dataset-name random \ @@ -304,13 +336,13 @@ But multiple instances can be started simultaneously (if needed) in CPX-NPS1 mod Set GPUs in CPX mode with: ```bash - rocm-smi --setcomputepartition cpx +rocm-smi --setcomputepartition cpx ``` Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1. ```bash - HIP_VISIBLE_DEVICES=0 \ - python3 /app/vllm/benchmarks/benchmark_throughput.py \ +HIP_VISIBLE_DEVICES=0 \ +python3 /app/vllm/benchmarks/benchmark_throughput.py \ --max-model-len 4608 \ --num-scheduler-steps 10 \ --num-prompts 100 \ @@ -326,7 +358,7 @@ Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. Set GPU to SPX mode. ```bash - rocm-smi --setcomputepartition spx +rocm-smi --setcomputepartition spx ``` ### Speculative Decoding @@ -335,12 +367,12 @@ Speculative decoding is one of the key features in vLLM. 
It has been supported o Without Speculative Decoding - ```bash - python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 +python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 ``` With Speculative Decoding - ```bash - python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5 +python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5 ``` You should see some performance improvement about the e2e latency. From 38679012334054a13f727d1413cb3702b08c580d Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 07:16:28 +0000 Subject: [PATCH 5/9] Fix markdown links --- docs/dev-docker/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 0ec23993880fa..d9fed6d23f091 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -103,7 +103,7 @@ Note: the `--multi_gpu` parameter can be ommitted for small models that fit on a ### System optimization -Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation][https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html]. In particular, it is important to ensure that NUMA auto-balancing is disabled. +Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html). In particular, it is important to ensure that NUMA auto-balancing is disabled. ### Launch AMD vLLM Docker @@ -130,7 +130,7 @@ export NCCL_MIN_NCHANNELS=112 ### vLLM engine performance settings -vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args][https://docs.vllm.ai/en/stable/usage/engine_args.html] documentation for the complete list of vLLM engine options. +vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args](https://docs.vllm.ai/en/stable/usage/engine_args.html) documentation for the complete list of vLLM engine options. Below is a list of a few of the key vLLM engine arguments for performance; these can be passed to the vLLM benchmark scripts: - **--max-model-len** : Maximum context length supported by the model instance. Can be set to a lower value than model configuration value to improve performance and gpu memory utilization. @@ -398,7 +398,7 @@ vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75 ### *MLPerf* Llama-2-70B -Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission — ROCm Blogs][https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inf-4-1/README.html] for information on reproducing MLPerf 4.1 Inference results. Note that due to changes in vLLM, it is not possible to use these instructions with the current rocm/vllm-dev docker image. 
+Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission — ROCm Blogs](https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inf-4-1/README.html) for information on reproducing MLPerf 4.1 Inference results. Note that due to changes in vLLM, it is not possible to use these instructions with the current rocm/vllm-dev docker image. ## Version From f3a1845e9eb74ee4cd5b5cf2e3e3b79f5e8c6cca Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 07:28:09 +0000 Subject: [PATCH 6/9] Fix pre-commit errors --- docs/dev-docker/README.md | 31 ++++++++++++++++++++++++++++--- 1 file changed, 28 insertions(+), 3 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index d9fed6d23f091..810955f436bc4 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -35,11 +35,13 @@ Many HuggingFace models, including Llama-3.1, have gated access. You will need ### Downloading models with huggingface-cli If you would like to download models directly (instead of allowing vLLM to download them automatically) you can install the HuggingFace CLI: + ```bash sudo pip install -U "huggingface_hub[cli]" ``` Then login using the token that you created earlier. (Note, it is not necessary to save it as a git credential.): + ```bash huggingface-cli login ``` @@ -47,6 +49,7 @@ huggingface-cli login Note: The instructions in this document use `/data` to store the models. If you choose a different directory, you will also need to make that change to the host volume mount when launching the docker container. Some models can be quite large; please ensure that you have sufficient disk space prior to downloading the model. Since the model download may take a long time, you may wish to use `tmux` or `screen` to avoid getting disconnected. You can download a model to the huggingface-cache directory using a command similar to the following (substituting the name of the model you wish to download): + ```bash sudo mkdir -p /data/huggingface-cache sudo chmod -R a+w /data/huggingface-cache @@ -54,6 +57,7 @@ HF_HOME=/data/huggingface-cache huggingface-cli download meta-llama/Llama-3.1-40 ``` Alternatively, you may wish to download the model to a specific directory, e.g. so you can quantize the model with Quark: + ```bash sudo mkdir -p /data/llama-3.1 sudo chmod -R a+w /data/llama-3.1 @@ -84,6 +88,7 @@ First download the model from [Download and install Quark](https://quark.docs.amd.com/latest/install.html) Run the quantization script in the example folder using the following command line: + ```bash export MODEL_DIR = /data/llama-3.1/Llama-3.1-405B-Instruct python3 quantize_quark.py \ @@ -97,7 +102,7 @@ export MODEL_DIR = /data/llama-3.1/Llama-3.1-405B-Instruct --multi_gpu ``` -Note: the `--multi_gpu` parameter can be ommitted for small models that fit on a single GPU. +Note: the `--multi_gpu` parameter can be omitted for small models that fit on a single GPU. ## Performance testing with AMD vLLM Docker @@ -121,6 +126,7 @@ docker run -it --rm --ipc=host --network=host --group-add render \ ``` ### Performance environment variables + Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information. ```bash @@ -140,6 +146,7 @@ Below is a list of a few of the key vLLM engine arguments for performance; these - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. 
Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph. ### Online Gemm Tuning + Optional: Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8 If you want to do limited online tuning use --enforce-eager and tune for particular batch sizes. See example below. @@ -152,7 +159,9 @@ export PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS=10 export PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE=1024 export PYTORCH_TUNABLEOP_FILENAME=/app/tuned_gemm_csv/bench_latency_tune_device_%d_full.csv ``` - Run the following command for BS=1/2/4/8: + +Run the following command for BS=1/2/4/8: + ```bash for BS in 1 2 4 8 do @@ -182,6 +191,7 @@ After the above steps, retain the environment variables set earlier, but set exp vLLM's benchmark_latency.py script measures end-to-end latency for a specified model, input/output length, and batch size. You can run latency tests for FP8 models with: + ```bash MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV BS=1 @@ -234,14 +244,17 @@ python3 /app/vllm/benchmarks/benchmark_latency.py \ ``` For additional information about the available parameters run: + ```bash /app/vllm/benchmarks/benchmark_latency.py -h ``` ### Throughput Benchmark + vLLM's benchmark_throughput.py script measures offline throughput. It can either use an input dataset or random prompts with fixed input/output lengths. You can run latency tests for FP8 models with: + ```bash MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV BS=1 @@ -292,6 +305,7 @@ Recommended values for various configurations are listed in this table: | amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 2048 | 2048 | 250 | 250 | For additional information about the available parameters run: + ```bash /app/vllm/benchmarks/benchmark_throughput.py -h ``` @@ -299,6 +313,7 @@ For additional information about the available parameters run: ### Online Serving Benchmark Benchmark Llama-3.1-70B with input 4096 tokens, output 512 tokens and tensor parallelism 8 as an example, + ```bash vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --swap-space 16 \ @@ -312,9 +327,11 @@ vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \ --gpu-memory-utilization 0.99 \ --num_scheduler-steps 10 ``` + Change port (for example --port 8005) if port=8000 is currently being used by other processes. Run client in a separate terminal. Use port_id from previous step else port-id=8000. + ```bash python /app/vllm/benchmarks/benchmark_serving.py \ --port 8000 \ @@ -327,6 +344,7 @@ python /app/vllm/benchmarks/benchmark_serving.py \ --num-prompts 500 \ --percentile-metrics ttft,tpot,itl,e2el ``` + Once all prompts are processed, terminate the server gracefully (ctrl+c). ### CPX mode @@ -335,11 +353,13 @@ Currently only CPX-NPS1 mode is supported. So ONLY tp=1 is supported in CPX mode But multiple instances can be started simultaneously (if needed) in CPX-NPS1 mode. Set GPUs in CPX mode with: + ```bash rocm-smi --setcomputepartition cpx ``` Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1. + ```bash HIP_VISIBLE_DEVICES=0 \ python3 /app/vllm/benchmarks/benchmark_throughput.py \ @@ -357,6 +377,7 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ ``` Set GPU to SPX mode. 
+ ```bash rocm-smi --setcomputepartition spx ``` @@ -366,14 +387,17 @@ rocm-smi --setcomputepartition spx Speculative decoding is one of the key features in vLLM. It has been supported on MI300. Here below is an example of the performance benchmark w/wo speculative decoding for Llama 3.1 405B with Llama 3.1 8B as the draft model. Without Speculative Decoding - + ```bash python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 ``` With Speculative Decoding - + ```bash python benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --use-v2-block-manager --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5 ``` + You should see some performance improvement about the e2e latency. ## MMLU_PRO_Biology Accuracy Eval @@ -413,9 +437,10 @@ vLLM: --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 . -``` \ No newline at end of file +``` From 724e07cc8cb4f89869bbb8afc132ec5e7d76ae13 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 19:12:01 +0000 Subject: [PATCH 7/9] Updates from review Initial updates to incorporate feedback from a review session held with @t-parry --- docs/dev-docker/README.md | 106 +++++++++----------------------------- 1 file changed, 24 insertions(+), 82 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 810955f436bc4..6bf52a0005fd6 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -23,31 +23,43 @@ You can pull the most recent validated docker image with `docker pull rocm/vllm- - ROCm 6.3 support - Potential bug with Tunable Ops not saving due to a PyTorch issue -Gemms are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md) -The gemms are automatically enabled in the docker image, and all stored gemm configs are kept in /app/_gemm_csv in the same image +## Preparation -## Obtaining models +### Obtaining access to models The vllm-dev docker image should work with any model supported by vLLM. When running with FP8, AMD has quantized models available for a variety of popular models, or you can quantize models yourself using Quark. The vLLM benchmark scripts will download models automatically if needed, and then store them in a HuggingFace cache directory for reuse in future tests. Alternatively you can choose to download the model to the cache (or to another directory on the system) in advance. Many HuggingFace models, including Llama-3.1, have gated access. You will need to an account at (https://huggingface.co), search for the model of interest, and request access to it if necessary. You will also need to create a token for accessing these models from vLLM: open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token. -### Downloading models with huggingface-cli +### System optimization + +Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html). In particular, it is important to ensure that NUMA auto-balancing is disabled. 
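As a quick check (a minimal sketch, assuming a Linux host that exposes NUMA balancing through procfs), you can verify the current setting and disable it if needed; a value of 0 means auto-balancing is off:

```bash
# Check whether automatic NUMA balancing is currently enabled (1) or disabled (0)
cat /proc/sys/kernel/numa_balancing

# Disable automatic NUMA balancing (requires root privileges)
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```

Refer to the ROCm system optimization guide linked above for the full set of recommended host settings.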
-If you would like to download models directly (instead of allowing vLLM to download them automatically) you can install the HuggingFace CLI: +### Launch AMD vLLM Docker + +Download and launch the docker. The HF_TOKEN is required to be set (either here or after launching the container) if you want to allow vLLM to download gated models automatically; use your HuggingFace token in place of `` in the command below: ```bash -sudo pip install -U "huggingface_hub[cli]" +docker run -it --rm --ipc=host --network=host --group-add render \ + --privileged --security-opt seccomp=unconfined \ + --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \ + --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ + -e HF_HOME=/data \ + -e HF_TOKEN= \ + -v /data:/data \ + rocm/vllm-dev:main ``` -Then login using the token that you created earlier. (Note, it is not necessary to save it as a git credential.): +Note: The instructions in this document use `/data` to store the models. If you choose a different directory, you will also need to make that change to the host volume mount when launching the docker container. For example, `-v /home/username/models:/data` in place of `-v /data:/data` would store the models in /home/username/models on the host. Some models can be quite large; please ensure that you have sufficient disk space prior to downloading the model. Since the model download may take a long time, you may wish to use `tmux` or `screen` to avoid getting disconnected. + +### Downloading models with huggingface-cli + +If you would like to download models directly (instead of allowing vLLM to download them automatically) you can use the huggingface-cli inside the running docker container. Login using the token that you created earlier. (Note, it is not necessary to save it as a git credential.) ```bash huggingface-cli login ``` -Note: The instructions in this document use `/data` to store the models. If you choose a different directory, you will also need to make that change to the host volume mount when launching the docker container. Some models can be quite large; please ensure that you have sufficient disk space prior to downloading the model. Since the model download may take a long time, you may wish to use `tmux` or `screen` to avoid getting disconnected. - You can download a model to the huggingface-cache directory using a command similar to the following (substituting the name of the model you wish to download): ```bash @@ -68,14 +80,13 @@ In the benchmark commands provided later in this document, replace the model nam ### Use pre-quantized models -AMD has provided FP8-quantized versions of several models in order to make them easier to run on MI300X / MI325X: +AMD has provided [FP8-quantized versions](https://huggingface.co/collections/amd/quark-quantized-ocp-fp8-models-66db7936d18fcbaf95d4405c) of several models in order to make them easier to run on MI300X / MI325X, including: - - - -- -These models are currently private; please join to access. +Some models may be private to those who are members of . These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For more information about Quark, please refer to @@ -106,24 +117,6 @@ Note: the `--multi_gpu` parameter can be omitted for small models that fit on a ## Performance testing with AMD vLLM Docker -### System optimization - -Before running performance tests you should ensure that the system is optimized according to the [ROCm Documentation](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html). 
In particular, it is important to ensure that NUMA auto-balancing is disabled. - -### Launch AMD vLLM Docker - -Download and launch the docker. The HF_TOKEN is required to be set (either here or after launching the container) if you want to allow vLLM to download gated models automatically; use your HuggingFace token in place of `` in the command below: - -```bash -docker run -it --rm --ipc=host --network=host --group-add render \ - --privileged --security-opt seccomp=unconfined \ - --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \ - --device=/dev/kfd --device=/dev/dri --device=/dev/mem \ - -e HF_HOME=/data \ - -e HF_TOKEN= \ - -v /data:/data \ - rocm/vllm-dev:main -``` ### Performance environment variables @@ -145,47 +138,6 @@ Below is a list of a few of the key vLLM engine arguments for performance; these - **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance. - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph. -### Online Gemm Tuning - -Optional: Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8 - -If you want to do limited online tuning use --enforce-eager and tune for particular batch sizes. See example below. - -```bash -export PYTORCH_TUNABLEOP_TUNING=1 -export PYTORCH_TUNABLEOP_ENABLED=1 -export PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS=100 -export PYTORCH_TUNABLEOP_MAX_WARMUP_DURATION_MS=10 -export PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE=1024 -export PYTORCH_TUNABLEOP_FILENAME=/app/tuned_gemm_csv/bench_latency_tune_device_%d_full.csv -``` - -Run the following command for BS=1/2/4/8: - -```bash -for BS in 1 2 4 8 -do - python /app/vllm/benchmarks/benchmark_latency.py \ - --model \ - --quantization fp8 \ - --kv-cache-dtype fp8 \ - --dtype float16 \ - --max-model-len 8192 \ - --num-iters-warmup 5 \ - --num-iters 5 \ - --tensor-parallel-size 8 \ - --input-len 4096 \ - --output-len 512 \ - --batch-size ${BS} \ - --num-scheduler-steps 10 \ - --enforce-eager -done -``` - -The tuned file will be generated for device 0 only at /app/tuned_gemm_csv/bench_latency_tune_device_0_full.csv. Copy this file to /app/tuned_gemm_csv/bench_latency_tune_device_\_full.csv for D=1 through 7. - -After the above steps, retain the environment variables set earlier, but set export PYTORCH_TUNABLEOP_TUNING=0 to disable online tuning, and use the tuned solutions. - ### Latency Benchmark vLLM's benchmark_latency.py script measures end-to-end latency for a specified model, input/output length, and batch size. 
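Before running the full parameter sweeps, it can be useful to confirm the setup with a small smoke-test run from inside the container (a minimal sketch; the model name and token counts are placeholders, and the complete recommended command with all performance flags is shown later in this document):

```bash
# Minimal smoke test: short input/output lengths, batch size 1, single GPU
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 128 \
    --output-len 32 \
    --batch-size 1 \
    --tensor-parallel-size 1 \
    --num-iters-warmup 1 \
    --num-iters 3
```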
@@ -424,17 +376,7 @@ vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75 Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission — ROCm Blogs](https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inf-4-1/README.html) for information on reproducing MLPerf 4.1 Inference results. Note that due to changes in vLLM, it is not possible to use these instructions with the current rocm/vllm-dev docker image. -## Version - -### Release Notes - -20240906a: Legacy quantization formats required `--quantization fp8_rocm` as a flag instead of `--quantization fp8` - -Updated: - -vLLM: - -### Docker Manifest +## Docker Manifest To reproduce the release docker: From f5a53175a50e38597a16b54240b16234be0731c9 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Fri, 24 Jan 2025 21:11:46 +0000 Subject: [PATCH 8/9] Update script args to match current recommendations --- docs/dev-docker/README.md | 68 ++++++++++++++------------------------- 1 file changed, 25 insertions(+), 43 deletions(-) diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 6bf52a0005fd6..427f4ddd70955 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -117,7 +117,6 @@ Note: the `--multi_gpu` parameter can be omitted for small models that fit on a ## Performance testing with AMD vLLM Docker - ### Performance environment variables Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information. @@ -156,44 +155,23 @@ python3 /app/vllm/benchmarks/benchmark_latency.py \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype float16 \ - --gpu-memory-utilization 0.95 \ - --num-scheduler-steps 10 \ + --gpu-memory-utilization 0.9 \ + --trust-remote-code \ --model $MODEL \ - --max-model-len 8192 \ --batch-size $BS \ --input-len $IN \ --output-len $OUT \ - --tensor-parallel-size $TP + --tensor-parallel-size $TP \ + --num-iters-warmup 3 \ + --num-iters 5 \ + --output-json output.json ``` For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. -When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. +When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value. It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts. -To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` and set `--num-scheduler-steps 1` to get a more accurate measurement of the time that it actually takes to generate the first token. The following command includes these recommendations. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.) 
- -```bash -MODEL=amd/Llama-3.1-405B-Instruct-FP8-KV -BS=1 -IN=128 -OUT=1 -TP=8 - -python3 /app/vllm/benchmarks/benchmark_latency.py \ - --distributed-executor-backend mp \ - --quantization fp8 \ - --kv-cache-dtype fp8 \ - --dtype float16 \ - --gpu-memory-utilization 0.95 \ - --num-scheduler-steps 1 \ - --enforce-eager \ - --model $MODEL \ - --max-model-len 8192 \ - --batch-size $BS \ - --input-len $IN \ - --output-len $OUT \ - --tensor-parallel-size $TP -``` +To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` to get a more accurate measurement of the time that it actually takes to generate the first token. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.) For additional information about the available parameters run: @@ -221,16 +199,20 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \ --quantization fp8 \ --kv-cache-dtype fp8 \ --dtype float16 \ - --gpu-memory-utilization 0.95 \ - --max-num-batched-tokens 65536 \ + --gpu-memory-utilization 0.9 \ + --trust-remote-code \ --num-scheduler-steps 10 \ --enable-chunked-prefill False \ --model $MODEL \ + --max-model-len 8192 \ + --max-num-batched-tokens 131072 \ + --max-seq-len-to-capture 131072 \ --input-len $IN \ --output-len $OUT \ --tensor-parallel-size $TP \ --num-prompts $PROMPTS \ - --max-num-seqs $MAX_NUM_SEQS + --max-num-seqs $MAX_NUM_SEQS \ + --output-json output.json ``` For FP16 models, remove `--quantization fp8 --kv-cache-dtype fp8`. @@ -245,16 +227,16 @@ For optimal performance, the PROMPTS value should be a multiple of the MAX_NUM_S Recommended values for various configurations are listed in this table: -| MODEL | TP | IN | OUT | MAX_NUM_SEQS (MI300X) | MAX_NUM_SEQS (MI325X) | -|------------------------------------|----|------|------|-----------------------|-----------------------| -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 128 | 2500 | 3000 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 2048 | 1500 | 1500 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 128 | 1500 | 1500 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 2048 | 750 | 750 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 128 | 128 | 2000 | 2000 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 128 | 2048 | 250 | 250 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 2048 | 128 | 250 | 250 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 1 | 2048 | 2048 | 250 | 250 | +| MODEL | TP | IN | OUT | PROMPTS | MAX_NUM_SEQS | +|------------------------------------|----|------|------|---------|--------------| +| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 128 | 2048 | 3200 | 3200 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 128 | 4096 | 1500 | 1500 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 500 | 2000 | 2000 | 2000 | +| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 2048 | 2048 | 1500 | 1500 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 2048 | 1500 | 1500 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 4096 | 1500 | 1500 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 500 | 2000 | 2000 | 2000 | +| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 2048 | 500 | 500 | For additional information about the available parameters run: From a450e14a479877066121245101f7f327f9e8d914 Mon Sep 17 00:00:00 2001 From: Jeremy Arnold Date: Sat, 25 Jan 2025 01:08:17 +0000 Subject: [PATCH 9/9] Remove recommended max-num-seqs values for now --- docs/dev-docker/README.md | 15 +-------------- 1 file changed, 1 insertion(+), 14 deletions(-) diff --git 
a/docs/dev-docker/README.md b/docs/dev-docker/README.md index 427f4ddd70955..e5b7d37b0af29 100644 --- a/docs/dev-docker/README.md +++ b/docs/dev-docker/README.md @@ -223,20 +223,7 @@ It is important to tune vLLM’s --max-num-seqs value to an appropriate value de For models that fit on a single GPU, it is usually best to run with `--tensor-parallel-size 1`. Requests can be distributed across multiple copies of vLLM running on different GPUs. This will be more efficient than running a single copy of the model with `--tensor-parallel-size 8`. (Note: the benchmark_throughput.py script does not include direct support for using multiple copies of vLLM) -For optimal performance, the PROMPTS value should be a multiple of the MAX_NUM_SEQS value -- for example, if MAX_NUM_SEQS=2048 then the PROMPTS value could be 2048, 4096, etc. If PROMPTS is smaller than MAX_NUM_SEQS then there won’t be enough prompts for vLLM to maximize concurrency. - -Recommended values for various configurations are listed in this table: - -| MODEL | TP | IN | OUT | PROMPTS | MAX_NUM_SEQS | -|------------------------------------|----|------|------|---------|--------------| -| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 128 | 2048 | 3200 | 3200 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 128 | 4096 | 1500 | 1500 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 500 | 2000 | 2000 | 2000 | -| amd/Llama-3.1-70B-Instruct-FP8-KV | 8 | 2048 | 2048 | 1500 | 1500 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 2048 | 1500 | 1500 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 128 | 4096 | 1500 | 1500 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 500 | 2000 | 2000 | 2000 | -| amd/Llama-3.1-405B-Instruct-FP8-KV | 8 | 2048 | 2048 | 500 | 500 | +For optimal performance, the PROMPTS value should be a multiple of the MAX_NUM_SEQS value -- for example, if MAX_NUM_SEQS=1500 then the PROMPTS value could be 1500, 3000, etc. If PROMPTS is smaller than MAX_NUM_SEQS then there won’t be enough prompts for vLLM to maximize concurrency. For additional information about the available parameters run: