
[V1][TPU] Remove unnecessary padding for running on TPU. #14467

Merged
merged 3 commits into from
Mar 9, 2025

Conversation

vanbasten23
Collaborator

@vanbasten23 vanbasten23 commented Mar 8, 2025

Remove unnecessary padding for running on TPU. Also update the tunable block sizes (NUM_QUERIES_PER_BLOCK, NUM_KV_PAGES_PER_BLOCK) for the ragged kernel.

Test plan:

  1. $ VLLM_USE_V1=1 pytest -s -v vllm/tests/entrypoints/llm/test_accuracy.py::test_lm_eval_accuracy_v1_engine 2>&1 | tee out.txt
  2. $ VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 8003 --gpu-memory-utilization 0.95 --max-num-batched-tokens 8192 --tensor-parallel-size 1 --max-model-len 2048 &
     $ python3 benchmark_serving.py --model meta-llama/Llama-3.1-8B-Instruct --dataset-name sonnet --dataset-path sonnet_4x.txt --num-prompts 1000 --sonnet-input-len 2000 --sonnet-output-len 128 --port 8003


github-actions bot commented Mar 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 8, 2025
@vanbasten23 vanbasten23 marked this pull request as ready for review March 8, 2025 01:52
Contributor

@yarongmu-google yarongmu-google left a comment


NUM_KV_PAGES_PER_BLOCK is no longer used in tpu_model_runner.py after this change. Is that intentional?

Signed-off-by: Xiongfei Wei <[email protected]>
@vanbasten23
Collaborator Author

NUM_KV_PAGES_PER_BLOCK is no longer used in tpu_model_runner.py after this change. Is that intentional?

Yes, NUM_KV_PAGES_PER_BLOCK was used for padding. Since we no longer need to pad, we no longer need it.

Signed-off-by: Xiongfei Wei <[email protected]>
@vanbasten23 vanbasten23 changed the title Reduce the size of block_table by getting rid of padding. [V1][TPU] Remove unnecessary padding for running on TPU. Mar 8, 2025
Comment on lines -80 to +79

-        self.max_num_tokens = _get_padded_number(
-            scheduler_config.max_num_batched_tokens, NUM_QUERIES_PER_BLOCK)
-        self.max_num_reqs = _get_padded_number(scheduler_config.max_num_seqs,
-                                               NUM_QUERIES_PER_BLOCK)
+        self.max_num_tokens = scheduler_config.max_num_batched_tokens
+        self.max_num_reqs = scheduler_config.max_num_seqs
Member


Rounding these up was intentional, since the user could specify odd values not divisible by our constraints. I think we should keep this.


Are these constraints from the kernel? If so, we no longer need this padding, because the new kernel removed all the constraints. We can save a lot of memory and compute by removing the padding.

Collaborator Author


Thanks for the review!

@mgoin, by "user could specify odd values", which "value" are you referring to?

Here, both NUM_KV_PAGES_PER_BLOCK and NUM_QUERIES_PER_BLOCK are tunable parameters of the ragged kernel. The padding was needed mainly because kernel v1 had the constraints self.max_num_tokens % NUM_QUERIES_PER_BLOCK == 0 and self.max_num_blocks_per_req % NUM_KV_PAGES_PER_BLOCK == 0.
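The v1 constraints described here amount to simple divisibility checks, which is why the runner padded the sizes up front. A hypothetical illustration (the constant values are made up, not the real kernel defaults):

```python
# Illustrative values only; the real tunables live in the ragged kernel.
NUM_QUERIES_PER_BLOCK = 16
NUM_KV_PAGES_PER_BLOCK = 8

max_num_tokens = 8192
max_num_blocks_per_req = 128

# Kernel v1 required both sizes to be exact multiples of the block
# tunables; values like 1000 would have had to be padded to satisfy this.
assert max_num_tokens % NUM_QUERIES_PER_BLOCK == 0
assert max_num_blocks_per_req % NUM_KV_PAGES_PER_BLOCK == 0
```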

Earlier this week we switched the kernel from v1 to v2, and v2 does not have these constraints; that's why I think we can remove the padding.

Also note that these are the "max" num_tokens, not the actual num_tokens we would use. For the actual num_tokens in the real workload and during warmup, we still pad to the next power of 2: https://github.com/vllm-project/vllm/blob/8ed5421aaa7da24051acdae53c860e6ce6598403/vllm/v1/worker/tpu_model_runner.py#L420C45-L420C66.
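The next-power-of-2 padding mentioned above can be sketched as follows (a hypothetical standalone helper for illustration, not the exact vLLM code at the linked line):

```python
def pad_to_next_power_of_2(n: int) -> int:
    """Round n (>= 1) up to the nearest power of two."""
    return 1 << (n - 1).bit_length()

# e.g. an actual batch of 1000 tokens would be padded to 1024
print(pad_to_next_power_of_2(1000))  # 1024
```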

@mgoin mgoin added tpu Related to Google TPUs ready ONLY add when PR is ready to merge/full CI is needed labels Mar 8, 2025
@mgoin mgoin merged commit 10f7552 into vllm-project:main Mar 9, 2025
46 checks passed
@vanbasten23
Collaborator Author

Thanks @mgoin for the review!

Labels
ready ONLY add when PR is ready to merge/full CI is needed tpu Related to Google TPUs v1
4 participants