[V1][Metrics] Handle preemptions #13169

markmc · 2025-02-12T17:22:59Z

Add a core engine PREEMPTED event.

Add the num_preemptions_total counter from v0.

Also, make preemptions reset the scheduled and first token timestamps resulting in:

  << queued timestamp >>
    [ queue interval ]
  << scheduled timestamp (FIRST) >>
    [ prefill interval ]
  << new token timestamp (FIRST) >>
    [ inter-token interval ]
  << new token timestamp >>
    [ inter-token interval ]
      |
      |	(possible preemptions)
      | << scheduled timestamp >>
      | << preempted timestamp >>
      | << scheduled timestamp >>
      | << new token timestamp >>
      | << preempted timestamp >>
      v
  << new token timestamp >>
    [ decode interval (relative to first token time)
    [ inference interval (relative to first scheduled time)
  << new token timestamp (FINISHED) >>

github-actions · 2025-02-12T17:23:12Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

markmc · 2025-02-12T17:45:19Z

pre-commit failure is a yapf failure that doesn't happen for me locally:

  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pygram.py", line 39, in <module>
    pattern_grammar = driver.load_grammar(_PATTERN_GRAMMAR_FILE)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/driver.py", line 248, in load_grammar
    g.load(gp)
  File "/home/runner/.cache/pre-commit/repom9gt4aao/py_env-python3.12/lib/python3.12/site-packages/yapf_third_party/_ylib2to3/pgen2/grammar.py", line 128, in load
    d = pickle.load(f)
        ^^^^^^^^^^^^^^
EOFError: Ran out of input

robertgshaw2-redhat

From the POV of a client, "preemptions" create bumps in ITL (inter token latency) aka TPOT (time per output token).

You can see this clearly when considering streaming:

The user sends a prompt like Hello my name is Robert and I
The model streams back like -> to -> work -> on -> vllm

If a request is preempted after the word to, we will evict the request and re-run once we have enough KV cache memory with the prompt as "Hello my name is Robert and I like to" (so the generated tokens are added to the prompt when recomputing --- this makes the recomputation happen much more quickly --- and prefix caching also helps to make this happen fast).

From the POV of the user, this means that the ITL of the word "work" will be elevated (since all the time associated with being preempted, waiting, and then recomputing will happen before the word is emitted). The TTFT is not impacted since we already streamed the word like. As a result, I think that we should make the metrics coming out of vLLM match this structure

When a PREEMPTION occurs, this should manifest as a higher inter token interval for that token
When a PREEMPTION occurs, this should manifest as higher decode interval for that request

robertgshaw2-redhat · 2025-02-20T23:02:24Z

So I think that it should look like this:

<< queued timestamp >>  # unique per request and frozen after being set
    [ queue interval ]
<< scheduled timestamp >> # unique per request and frozen after being set
    [ prefill interval ]
<< new token timestamp (FIRST) >> # unique per request and frozen after being set
    [ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
    | request is in the preempted queue
    | request is re-scheduled
    | recompute up to the current token
    [ inter-token interval ]
<< new token timestamp >>
    [ inter-token interval ]
<< new token timestamp >>
<< preempted timestamp >>
    | request is in the preempted queue
    | request is re-scheduled
    | recompute up to the current token
    [ inter-token interval ]
  << new token timestamp >>
    [ decode interval (relative to scheduled) ] # all the time spent in the preempted + recompute state is allocated here
    [ inference interval (relative to scheduled) ]
  << new token timestamp (FINISHED) >>

robertgshaw2-redhat · 2025-02-20T23:11:04Z

One other case we should make sure we cover is --- what happens if the request is preempted during prefill?

E.g. if we have a prompt of length 100k and we get preempted half way through. In this case, the preemption/recomputation time should be allocated to the prefill phase and the TTFT phase

Im not sure if this is possible to happen or whether there is some invariant that we always have enough KV cache for the prompt to be processed. Worth asking Woosuk or Cody

mergify · 2025-02-25T08:21:06Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @markmc.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

markmc · 2025-02-25T17:19:20Z

I drafted some thoughts, and found it super difficult to clarify this in words or ASCII art so let's try this ... does this match your thinking?

markmc · 2025-02-25T17:30:29Z

In the "preempted prefill" case, I had imagined the queued interval to be up until the final SCHEDULED event ... nothing useful happened with the request, its waiting to be prioritized for resources? Not a big deal, I guess - most important that TTFT ~= queued_interval + prefill_interval

If we're aligned on the diagrams above, I think the code change is simply to not reset scheduled_ts or first_token_ts once they've been set?

robertgshaw2-redhat · 2025-02-26T13:01:00Z

I am in alignment with the charts above! Thanks for drawing it out!

Add a core engine PREEMPTED event. Add the num_preemptions_total counter from v0. Also, make preemptions reset the scheduled and first token timestamps resulting in: ``` << queued timestamp >> [ queue interval ] | | (possible preemptions) | << scheduled timestamp >> | << preempted timestamp >> | << scheduled timestamp >> | << new token timestamp (FIRST) >> | << preempted timestamp >> v << scheduled timestamp >> [ prefill interval ] << new token timestamp (FIRST) >> [ inter-token interval ] << new token timestamp >> [ decode interval (relative to most recent first token time) [ inference interval (relative to most recent scheduled time) << new token timestamp (FINISHED) >> ``` Signed-off-by: Mark McLoughlin <[email protected]>

Don't include prefill preemption time in the queued interval. Don't reset first token on preemption - already decoded tokens are retained and reused. Signed-off-by: Mark McLoughlin <[email protected]>

As per the discussion in vllm-project#13169. Signed-off-by: Mark McLoughlin <[email protected]>

robertgshaw2-redhat

This looks great and is very robust. Thanks!

Signed-off-by: Johnny <[email protected]>

Signed-off-by: Linkun Chen <[email protected]>

markmc requested review from DarkLight1337, robertgshaw2-redhat, simon-mo, WoosukKwon, njhill, ywang96, comaniac and alexm-redhat as code owners February 12, 2025 17:23

mergify bot added the v1 label Feb 12, 2025

markmc mentioned this pull request Feb 12, 2025

[Feature][v1]: Add metrics support #10582

Open

1 task

markmc force-pushed the metrics-v1-preemptions branch 2 times, most recently from 28c55d0 to 3876e2c Compare February 15, 2025 17:12

robertgshaw2-redhat requested changes Feb 20, 2025

View reviewed changes

mergify bot added the needs-rebase label Feb 25, 2025

markmc added 2 commits February 26, 2025 12:12

[V1][Metrics] Change where preemption time is accounted

ed0dfd8

Don't include prefill preemption time in the queued interval. Don't reset first token on preemption - already decoded tokens are retained and reused. Signed-off-by: Mark McLoughlin <[email protected]>

markmc force-pushed the metrics-v1-preemptions branch from 3876e2c to ed0dfd8 Compare February 26, 2025 17:17

mergify bot removed the needs-rebase label Feb 26, 2025

markmc added a commit to markmc/vllm that referenced this pull request Feb 26, 2025

[v1][Metrics] Add interval calculation diagrams

b857805

As per the discussion in vllm-project#13169. Signed-off-by: Mark McLoughlin <[email protected]>

robertgshaw2-redhat approved these changes Feb 27, 2025

View reviewed changes

robertgshaw2-redhat enabled auto-merge (squash) February 27, 2025 02:06

robertgshaw2-redhat added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 27, 2025

vllm-bot merged commit cd711c4 into vllm-project:main Feb 27, 2025
45 of 47 checks passed

johnnynunez pushed a commit to johnnynunez/vllm that referenced this pull request Mar 3, 2025

[V1][Metrics] Handle preemptions (vllm-project#13169)

7a8afc0

Signed-off-by: Johnny <[email protected]>

Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025

[V1][Metrics] Handle preemptions (vllm-project#13169)

c9095e0

lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Mar 5, 2025

[V1][Metrics] Handle preemptions (vllm-project#13169)

9dc9a10

Signed-off-by: Linkun Chen <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[V1][Metrics] Handle preemptions #13169

[V1][Metrics] Handle preemptions #13169

markmc commented Feb 12, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Feb 12, 2025

markmc commented Feb 12, 2025

robertgshaw2-redhat left a comment •

edited

Loading

robertgshaw2-redhat commented Feb 20, 2025 •

edited

Loading

robertgshaw2-redhat commented Feb 20, 2025

mergify bot commented Feb 25, 2025

markmc commented Feb 25, 2025 •

edited

Loading

markmc commented Feb 25, 2025

robertgshaw2-redhat commented Feb 26, 2025

robertgshaw2-redhat left a comment

[V1][Metrics] Handle preemptions #13169

[V1][Metrics] Handle preemptions #13169

Conversation

markmc commented Feb 12, 2025 • edited by github-actions bot Loading

github-actions bot commented Feb 12, 2025

markmc commented Feb 12, 2025

robertgshaw2-redhat left a comment • edited Loading

Choose a reason for hiding this comment

robertgshaw2-redhat commented Feb 20, 2025 • edited Loading

robertgshaw2-redhat commented Feb 20, 2025

mergify bot commented Feb 25, 2025

markmc commented Feb 25, 2025 • edited Loading

markmc commented Feb 25, 2025

robertgshaw2-redhat commented Feb 26, 2025

robertgshaw2-redhat left a comment

Choose a reason for hiding this comment

markmc commented Feb 12, 2025 •

edited by github-actions bot

Loading

robertgshaw2-redhat left a comment •

edited

Loading

robertgshaw2-redhat commented Feb 20, 2025 •

edited

Loading

markmc commented Feb 25, 2025 •

edited

Loading