Skip to content

Commit

Permalink
[V0][Metrics] Deprecate some KV/prefix cache metrics
Browse files Browse the repository at this point in the history
vllm:cpu_cache_usage_perc and vllm:cpu_prefix_cache_hit_rate will
no longer be relevant in V1 since we no longer implement KV cache
offloading. So these metrics should be considered deprecated.

And as agreed in #12592, we have added prefix_cache_queries and
prefix_cache_hits counters to replace the prefix_cache_hit_rate
gauge as it allows the interval over which the hit rate is
calculated to be controlled in a Prometheus query like:

```
rate(prefix_cache_queries[5m]) / rate(prefix_cache_hits[5m])
```

In theory, we could ease the transition be implementing the
old hit rate metric in V1 and the new queries/hits metrics
in V0, but it's probably not worthwhile unless we learn the
hit rate metric is heavily used by V0 users.

Signed-off-by: Mark McLoughlin <[email protected]>
  • Loading branch information
markmc committed Mar 3, 2025
1 parent e584b85 commit e89b9fe
Showing 1 changed file with 19 additions and 4 deletions.
23 changes: 19 additions & 4 deletions vllm/engine/metrics.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,26 +79,41 @@ def __init__(self, labelnames: List[str], vllm_config: VllmConfig):
documentation="Number of requests swapped to CPU.",
labelnames=labelnames,
multiprocess_mode="sum")

# KV Cache Usage in %
self.gauge_gpu_cache_usage = self._gauge_cls(
name="vllm:gpu_cache_usage_perc",
documentation="GPU KV-cache usage. 1 means 100 percent usage.",
labelnames=labelnames,
multiprocess_mode="sum")

# Deprecated in 0.8 - KV cache offloading is not used in V1
# TODO: in 0.9, only enable if show_hidden_metrics=True
self.gauge_cpu_cache_usage = self._gauge_cls(
name="vllm:cpu_cache_usage_perc",
documentation="CPU KV-cache usage. 1 means 100 percent usage.",
documentation=(
"CPU KV-cache usage. 1 means 100 percent usage. "
"DEPRECATED: KV cache offloading is not used in V1"),
labelnames=labelnames,
multiprocess_mode="sum")
# Prefix caching block hit rate

# Deprecated in 0.8 - KV cache offloading is not used in V1
# TODO: in 0.9, only enable if show_hidden_metrics=True
self.gauge_cpu_prefix_cache_hit_rate = self._gauge_cls(
name="vllm:cpu_prefix_cache_hit_rate",
documentation="CPU prefix cache block hit rate.",
documentation=(
"CPU prefix cache block hit rate. "
"DEPRECATED: KV cache offloading is not used in V1"),
labelnames=labelnames,
multiprocess_mode="sum")

# Deprecated in 0.8 - replaced by queries+hits counters in V1
# TODO: in 0.9, only enable if show_hidden_metrics=True
self.gauge_gpu_prefix_cache_hit_rate = self._gauge_cls(
name="vllm:gpu_prefix_cache_hit_rate",
documentation="GPU prefix cache block hit rate.",
documentation=("GPU prefix cache block hit rate. "
"DEPRECATED: use vllm:gpu_prefix_cache_queries and "
"vllm:gpu_prefix_cache_queries in V1"),
labelnames=labelnames,
multiprocess_mode="sum")

Expand Down

0 comments on commit e89b9fe

Please sign in to comment.