[V1][Metrics] Handle preemptions #13169
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
pre-commit failure is a
Force-pushed from 28c55d0 to 3876e2c
From the POV of a client, "preemptions" create bumps in ITL (inter token latency) aka TPOT (time per output token).
You can see this clearly when considering streaming:
- The user sends a prompt like `Hello my name is Robert and I`
- The model streams back `like` -> `to` -> `work` -> `on` -> `vllm`
If a request is preempted after the word `to`, we will evict the request and re-run it once we have enough KV cache memory, with the prompt as "Hello my name is Robert and I like to" (so the generated tokens are added to the prompt when recomputing --- this makes the recomputation happen much more quickly --- and prefix caching also helps make this fast).
From the POV of the user, this means that the ITL of the word "work" will be elevated (since all the time associated with being preempted, waiting, and then recomputing happens before that word is emitted). The TTFT is not impacted since we already streamed the word `like`. As a result, I think that we should make the metrics coming out of vLLM match this structure:
- When a PREEMPTION occurs, this should manifest as a higher inter-token interval for that token
- When a PREEMPTION occurs, this should manifest as a higher decode interval for that request
So I think that it should look like this:
[timeline diagram]
One other case we should make sure we cover: what happens if the request is preempted during prefill? E.g. if we have a prompt of length 100k and we get preempted half way through. In this case, the preemption/recomputation time should be allocated to the prefill phase and the TTFT. I'm not sure if this can actually happen, or whether there is some invariant that we always have enough KV cache for the prompt to be processed. Worth asking Woosuk or Cody.
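To make the client-side picture above concrete, here is a minimal sketch (not vLLM code; `measure_itl` and the example numbers are invented for illustration) of how a preemption shows up purely as one elevated inter-token gap in a streaming response:

```python
import time


def measure_itl(token_stream):
    """Yield (token, seconds since the previous token) for a streamed response."""
    prev = time.monotonic()
    for token in token_stream:
        now = time.monotonic()
        yield token, now - prev
        prev = now


# If the request is preempted after "to", all of the eviction + waiting +
# recomputation time is observed on the *next* token, e.g.:
#   ("like", 0.03), ("to", 0.03), ("work", 1.80), ("on", 0.03), ("vllm", 0.03)
# TTFT is unaffected because "like" was already streamed before the preemption.
```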
This pull request has merge conflicts that must be resolved before it can be merged.
In the "preempted prefill" case, I had imagined the queued interval to be up until the final [...]. If we're aligned on the diagrams above, I think the code change is simply to not reset [...].
I am in alignment with the charts above! Thanks for drawing it out!
Add a core engine PREEMPTED event. Add the num_preemptions_total counter from v0. Also, make preemptions reset the scheduled and first token timestamps, resulting in:

```
<< queued timestamp >>
  [ queue interval ]
      |
      | (possible preemptions)
      |   << scheduled timestamp >>
      |   << preempted timestamp >>
      |   << scheduled timestamp >>
      |   << new token timestamp (FIRST) >>
      |   << preempted timestamp >>
      v
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to most recent first token time) ]
  [ inference interval (relative to most recent scheduled time) ]
<< new token timestamp (FINISHED) >>
```

Signed-off-by: Mark McLoughlin <[email protected]>
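As a rough illustration of the bookkeeping the diagram above implies, here is a simplified sketch. The `RequestTimestamps` class and `compute_intervals` function are hypothetical, not vLLM's actual stats code, and corner cases (such as a preemption that lands after the first token) are ignored:

```python
from dataclasses import dataclass


@dataclass
class RequestTimestamps:
    queued: float = 0.0
    first_scheduled: float = 0.0  # first time the request was scheduled
    last_scheduled: float = 0.0   # most recent scheduling (after any preemption)
    first_token: float = 0.0      # "new token (FIRST)" timestamp
    last_token: float = 0.0       # "new token (FINISHED)" timestamp


def compute_intervals(ts: RequestTimestamps) -> dict[str, float]:
    return {
        # queue interval: waiting before the request is first scheduled
        "queue": ts.first_scheduled - ts.queued,
        # prefill interval: from the most recent scheduling to the first new token
        "prefill": ts.first_token - ts.last_scheduled,
        # decode interval: relative to the most recent first-token time
        "decode": ts.last_token - ts.first_token,
        # inference interval: relative to the most recent scheduled time
        "inference": ts.last_token - ts.last_scheduled,
    }
```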
Don't include prefill preemption time in the queued interval. Don't reset first token on preemption - already decoded tokens are retained and reused. Signed-off-by: Mark McLoughlin <[email protected]>
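In the same spirit, a sketch of what those two refinements mean for timestamp handling, reusing the hypothetical `RequestTimestamps` above (illustrative only, not the actual vLLM event handler):

```python
def on_event(ts: RequestTimestamps, event: str, now: float) -> None:
    if event == "SCHEDULED":
        if ts.first_scheduled == 0.0:
            # The queue interval ends at the *first* scheduling, so time spent
            # preempted during prefill is charged to prefill/TTFT, not queueing.
            ts.first_scheduled = now
        # The inference interval is measured from the most recent scheduling.
        ts.last_scheduled = now
    elif event == "PREEMPTED":
        # Deliberately do NOT reset ts.first_token: already-decoded tokens are
        # retained and reused on recomputation, so the preemption surfaces as a
        # larger inter-token interval rather than a new "first token".
        pass
```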
Force-pushed from 3876e2c to ed0dfd8
As per the discussion in vllm-project#13169. Signed-off-by: Mark McLoughlin <[email protected]>
This looks great and is very robust. Thanks!
Signed-off-by: Johnny <[email protected]>
Signed-off-by: Linkun Chen <[email protected]>
Part of #10582
Add a core engine PREEMPTED event. Add the num_preemptions_total counter from v0. Also, make preemptions reset the scheduled and first token timestamps, resulting in:
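```
<< queued timestamp >>
  [ queue interval ]
      |
      | (possible preemptions)
      |   << scheduled timestamp >>
      |   << preempted timestamp >>
      |   << scheduled timestamp >>
      |   << new token timestamp (FIRST) >>
      |   << preempted timestamp >>
      v
<< scheduled timestamp >>
  [ prefill interval ]
<< new token timestamp (FIRST) >>
  [ inter-token interval ]
<< new token timestamp >>
  [ decode interval (relative to most recent first token time) ]
  [ inference interval (relative to most recent scheduled time) ]
<< new token timestamp (FINISHED) >>
```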