[V1][TPU] Remove unnecessary padding for running on TPU. #14467

vanbasten23 · 2025-03-08T01:51:00Z

Remove unnecessary padding for running on TPU. Also update the tunable block size (NUM_QUERIES_PER_BLOCK, NUM_KV_PAGES_PER_BLOCK) for the ragged kernel.

Test plan:

$ VLLM_USE_V1=1 pytest -s -v vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine 2>&1 | tee out.txt

VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 8003 --gpu-memory-utilization 0.95 --max-num-batched-tokens 8192 --tensor-parallel-size 1 --max-model-len 2048 &
python3 benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct     --dataset-name sonnet     --dataset-path sonnet_4x.txt     --num-prompts 1000     --sonnet-input-len 2000     --sonnet-output-len 128     --port 8003

Signed-off-by: Xiongfei Wei <[email protected]>

github-actions · 2025-03-08T01:51:08Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

yarongmu-google

NUM_KV_PAGES_PER_BLOCK is no longer used in tpu_model_runner.py after this change. Is that intentional?

Signed-off-by: Xiongfei Wei <[email protected]>

vanbasten23 · 2025-03-08T02:18:56Z

NUM_KV_PAGES_PER_BLOCK is no longer used in tpu_model_runner.py after this change. Is that intentional?

Yea, NUM_KV_PAGES_PER_BLOCK is used for padding. Since we don't need to pad anymore, we no longer need it.

vllm/v1/worker/tpu_model_runner.py

Signed-off-by: Xiongfei Wei <[email protected]>

mgoin · 2025-03-08T09:37:48Z

vllm/v1/worker/tpu_model_runner.py

-        self.max_num_tokens = _get_padded_number(
-            scheduler_config.max_num_batched_tokens, NUM_QUERIES_PER_BLOCK)
-        self.max_num_reqs = _get_padded_number(scheduler_config.max_num_seqs,
-                                               NUM_QUERIES_PER_BLOCK)
+        self.max_num_tokens = scheduler_config.max_num_batched_tokens
+        self.max_num_reqs = scheduler_config.max_num_seqs


Rounding these up was intentional since the user could specify odd values non divisible by our constraints. I think we should keep this

Are these constrains from kernel? If yes, we no longer need these paddings because the new kernel removed all the constrains. We can save tons of memory and computing by removing these paddings.

Thanks for the review!

@mgoin , by "user could specify odd values", what "value" are you referring to?

Here both NUM_KV_PAGES_PER_BLOCK, NUM_QUERIES_PER_BLOCK are tunable parameter of the ragged kernel. The padding is needed mainly because the kernel v1 has the constraint that self.max_num_tokens%NUM_QUERIES_PER_BLOCK==0 and self.max_num_blocks_per_req%NUM_KV_PAGES_PER_BLOCK==0.

Early this week we switched the kernel from v1 to v2 where in v2 we don't have such constraints, that's why I think we can remove these constraints.

Also note that here are the "max" num_tokens instead of the actual num_tokens we would use. For the actual num_tokens in the real workload and "warmup", we still pad to the next power of 2: https://github.com/vllm-project/vllm/blob/8ed5421aaa7da24051acdae53c860e6ce6598403/vllm/v1/worker/tpu_model_runner.py#L420C45-L420C66.

vanbasten23 · 2025-03-09T21:44:22Z

Thanks @mgoin for the review!

Reduce the size of block_table by getting rid of padding.

ad38ffa

Signed-off-by: Xiongfei Wei <[email protected]>

mergify bot added the v1 label Mar 8, 2025

vanbasten23 marked this pull request as ready for review March 8, 2025 01:52

vanbasten23 requested review from WoosukKwon, robertgshaw2-redhat, njhill, ywang96, comaniac and alexm-redhat as code owners March 8, 2025 01:52

yarongmu-google reviewed Mar 8, 2025

View reviewed changes

remove unused import

c881174

Signed-off-by: Xiongfei Wei <[email protected]>

bythew3i reviewed Mar 8, 2025

View reviewed changes

vllm/v1/worker/tpu_model_runner.py Show resolved Hide resolved

remove other padding

541afd6

Signed-off-by: Xiongfei Wei <[email protected]>

vanbasten23 changed the title ~~Reduce the size of block_table by getting rid of padding.~~ [V1][TPU] Remove unnecessary padding for running on TPU. Mar 8, 2025

mgoin reviewed Mar 8, 2025

View reviewed changes

mgoin added tpu Related to Google TPUs ready ONLY add when PR is ready to merge/full CI is needed labels Mar 8, 2025

mgoin approved these changes Mar 9, 2025

View reviewed changes

mgoin merged commit 10f7552 into vllm-project:main Mar 9, 2025
46 checks passed

mgoin mentioned this pull request Mar 9, 2025

[Hardware][TPU] Fix the recompiling issue in logits processor after warmup #14510

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[V1][TPU] Remove unnecessary padding for running on TPU. #14467

[V1][TPU] Remove unnecessary padding for running on TPU. #14467

vanbasten23 commented Mar 8, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Mar 8, 2025

yarongmu-google left a comment

vanbasten23 commented Mar 8, 2025

mgoin Mar 8, 2025

bythew3i Mar 8, 2025

vanbasten23 Mar 8, 2025

vanbasten23 commented Mar 9, 2025

[V1][TPU] Remove unnecessary padding for running on TPU. #14467

[V1][TPU] Remove unnecessary padding for running on TPU. #14467

Conversation

vanbasten23 commented Mar 8, 2025 • edited by github-actions bot Loading

github-actions bot commented Mar 8, 2025

yarongmu-google left a comment

Choose a reason for hiding this comment

vanbasten23 commented Mar 8, 2025

mgoin Mar 8, 2025

Choose a reason for hiding this comment

bythew3i Mar 8, 2025

Choose a reason for hiding this comment

vanbasten23 Mar 8, 2025

Choose a reason for hiding this comment

vanbasten23 commented Mar 9, 2025

vanbasten23 commented Mar 8, 2025 •

edited by github-actions bot

Loading