@@ -5,17 +5,17 @@ Ensure the v1 LLM Engine exposes a superset of the metrics available in v0.
55## Objectives
66
77- Achieve parity of metrics between v0 and v1.
8- - The priority use case is accessing these metrics via Prometheus as this is what we expect to be used in production environments.
9- - Logging support - i.e. printing metrics to the info log - is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
8+ - The priority use case is accessing these metrics via Prometheus, as this is what we expect to be used in production environments.
9+ - Logging support (i.e. printing metrics to the info log) is provided for more ad-hoc testing, debugging, development, and exploratory use cases.
1010
1111## Background
1212
1313Metrics in vLLM can be categorized as follows:
1414
15- 1 . Server-level metrics: these are global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
16- 2 . Request-level metrics: these are metrics that track the characteristics - e.g. size and timing - of individual requests. These are typically exposed as Histograms in Prometheus, and are often the SLO that an SRE monitoring vLLM will be tracking.
15+ 1. Server-level metrics: Global metrics that track the state and performance of the LLM engine. These are typically exposed as Gauges or Counters in Prometheus.
16+ 2. Request-level metrics: Metrics that track the characteristics (e.g. size and timing) of individual requests. These are typically exposed as Histograms in Prometheus and are often the SLOs that an SRE monitoring vLLM will be tracking.
1717
18- The mental model is that the "Server -level Metrics" explain why the "Request -level Metrics" are what they are .
18+ The mental model is that server-level metrics help explain the values of request-level metrics.
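
To make the two categories above concrete, here is a minimal sketch (not vLLM's actual implementation) of how they map onto `prometheus_client` metric types; the label values are illustrative:

```python
from prometheus_client import Counter, Gauge

# Server-level metrics: global engine state, exposed as Gauges or Counters.
num_requests_running = Gauge(
    "vllm:num_requests_running",
    "Number of requests currently running on GPU.",
    labelnames=["model_name"])
prompt_tokens = Counter(
    "vllm:prompt_tokens",  # exposed as vllm:prompt_tokens_total
    "Number of prefill tokens processed.",
    labelnames=["model_name"])

num_requests_running.labels(model_name="my-model").set(3)
prompt_tokens.labels(model_name="my-model").inc(128)

# Request-level metrics (request size, queue time, latency, ...) are instead
# declared as Histograms - see the TTFT example later in this document.
```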
1919
2020### v0 Metrics
2121
@@ -65,20 +65,20 @@ vLLM also provides [a reference example](../../examples/online_serving/prometheu
6565
6666The subset of metrics exposed in the Grafana dashboard gives us an indication of which metrics are especially important:
6767
68- - ` vllm:e2e_request_latency_seconds_bucket ` - End to end request latency measured in seconds
69- - ` vllm:prompt_tokens_total ` - Prompt Tokens
70- - ` vllm:generation_tokens_total ` - Generation Tokens
71- - ` vllm:time_per_output_token_seconds ` - Inter token latency (Time Per Output Token, TPOT) in second .
68+ - ` vllm:e2e_request_latency_seconds_bucket ` - End-to-end request latency measured in seconds.
69+ - ` vllm:prompt_tokens_total ` - Prompt tokens.
70+ - ` vllm:generation_tokens_total ` - Generation tokens.
71+ - ` vllm:time_per_output_token_seconds ` - Inter-token latency (Time Per Output Token, TPOT) in seconds.
7272- ` vllm:time_to_first_token_seconds ` - Time to First Token (TTFT) latency in seconds.
73- - ` vllm:num_requests_running ` (also, ` _swapped ` and ` _waiting ` ) - Number of requests in RUNNING, WAITING, and SWAPPED state
73+ - ` vllm:num_requests_running ` (also, ` _swapped ` and ` _waiting ` ) - Number of requests in the RUNNING, WAITING, and SWAPPED states.
7474- ` vllm:gpu_cache_usage_perc ` - Percentage of used cache blocks by vLLM.
75- - ` vllm:request_prompt_tokens ` - Request prompt length
76- - ` vllm:request_generation_tokens ` - request generation length
77- - ` vllm:request_success_total ` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached
78- - ` vllm:request_queue_time_seconds ` - Queue Time
79- - ` vllm:request_prefill_time_seconds ` - Requests Prefill Time
80- - ` vllm:request_decode_time_seconds ` - Requests Decode Time
81- - ` vllm:request_max_num_generation_tokens ` - Max Generation Token in Sequence Group
75+ - ` vllm:request_prompt_tokens ` - Request prompt length.
76+ - ` vllm:request_generation_tokens ` - Request generation length.
77+ - ` vllm:request_success_total ` - Number of finished requests by their finish reason: either an EOS token was generated or the max sequence length was reached.
78+ - ` vllm:request_queue_time_seconds ` - Queue time.
79+ - ` vllm:request_prefill_time_seconds ` - Request prefill time.
80+ - ` vllm:request_decode_time_seconds ` - Request decode time.
81+ - ` vllm:request_max_num_generation_tokens ` - Max generation tokens in a sequence group.
8282
8383See [ the PR which added this Dashboard] ( gh-pr:2316 ) for interesting and useful background on the choices made here.
8484
@@ -103,7 +103,7 @@ In v0, metrics are collected in the engine core process and we use multi-process
103103
104104### Built in Python/Process Metrics
105105
106- The following metrics are supported by default by ` prometheus_client ` , but the are not exposed with multiprocess mode is used:
106+ The following metrics are supported by default by ` prometheus_client ` , but they are not exposed when multi-process mode is used:
107107
108108- ` python_gc_objects_collected_total `
109109- ` python_gc_objects_uncollectable_total `
@@ -158,6 +158,7 @@ In v1, we wish to move computation and overhead out of the engine core
158158process to minimize the time between each forward pass.
159159
160160The overall idea of the V1 EngineCore design is:
161+
161162- EngineCore is the inner loop. Performance is most critical here
162163- AsyncLLM is the outer loop. This is overlapped with GPU execution
163164 (ideally), so this is where any "overheads" should be if
@@ -178,7 +179,7 @@ time" (`time.time()`) to calculate intervals as the former is
178179unaffected by system clock changes (e.g. from NTP).
179180
180181It's also important to note that monotonic clocks differ between
181- processes - each process has its own reference. point. So it is
182+ processes - each process has its own reference point. So it is
182183meaningless to compare monotonic timestamps from different processes.
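
For example, a minimal sketch of this interval calculation, using only the standard library:

```python
import time

# Both timestamps come from the same monotonic clock in the same process.
# time.monotonic() is unaffected by system clock changes (e.g. from NTP),
# but its absolute value is only meaningful within this one process.
arrival_time = time.monotonic()
time.sleep(0.1)  # stand-in for queueing and prefill
first_token_time = time.monotonic()

ttft = first_token_time - arrival_time  # a valid interval

# By contrast, subtracting a monotonic timestamp taken in another process,
# or mixing time.time() with time.monotonic(), yields a meaningless value.
```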
183184
184185Therefore, in order to calculate an interval, we must compare two
@@ -343,14 +344,15 @@ vllm:time_to_first_token_seconds_bucket{le="0.1",model_name="meta-llama/Llama-3.
343344vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B-Instruct"} 140.0
344345```
345346
346- Note - the choice of histogram buckets to be most useful to users
347- across a broad set of use cases is not straightforward and will
348- require refinement over time.
347+ !!! note
348+ The choice of histogram buckets to be most useful to users
349+ across a broad set of use cases is not straightforward and will
350+ require refinement over time.
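
As a sketch of where that choice is encoded, a histogram's buckets are fixed at declaration time with `prometheus_client`; the boundaries below are purely illustrative:

```python
from prometheus_client import Histogram

# Bucket boundaries are illustrative - finding values that work across very
# different models, hardware, and workloads is the hard part.
time_to_first_token = Histogram(
    "vllm:time_to_first_token_seconds",
    "Histogram of time to first token in seconds.",
    labelnames=["model_name"],
    buckets=[0.001, 0.005, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1,
             0.25, 0.5, 0.75, 1.0, 2.5, 5.0, 7.5, 10.0])

time_to_first_token.labels(
    model_name="meta-llama/Llama-3.1-8B-Instruct").observe(0.032)
```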
349351
350352### Cache Config Info
351353
352- ` prometheus_client ` has support for [ Info
353- metrics] ( https://prometheus.github.io/client_python/instrumenting/info/ )
354+ ` prometheus_client ` has support for
355+ [ Info metrics] ( https://prometheus.github.io/client_python/instrumenting/info/ )
354356which are equivalent to a ` Gauge ` whose value is permanently set to 1,
355357but expose interesting key/value pair information via labels. This is
356358used for information about an instance that does not change - so it
@@ -363,14 +365,11 @@ We use this concept for the `vllm:cache_config_info` metric:
363365# HELP vllm:cache_config_info Information of the LLMEngine CacheConfig
364366# TYPE vllm:cache_config_info gauge
365367vllm:cache_config_info{block_size="16",cache_dtype="auto",calculate_kv_scales="False",cpu_offload_gb="0",enable_prefix_caching="False",gpu_memory_utilization="0.9",...} 1.0
366-
367368```
368369
369- However, ` prometheus_client ` has [ never supported Info metrics in
370- multiprocessing
371- mode] ( https://github.com/prometheus/client_python/pull/300 ) - for
372- [ unclear
373- reasons] ( gh-pr:7279#discussion_r1710417152 ) . We
370+ However, ` prometheus_client ` has
371+ [ never supported Info metrics in multiprocessing mode] ( https://github.com/prometheus/client_python/pull/300 ) -
372+ for [ unclear reasons] ( gh-pr:7279#discussion_r1710417152 ) . We
374373simply use a ` Gauge ` metric set to 1 and
375374` multiprocess_mode="mostrecent" ` instead.
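
A minimal sketch of that workaround, assuming a recent `prometheus_client` (one that accepts `multiprocess_mode="mostrecent"`) and showing only a subset of the config labels:

```python
from prometheus_client import Gauge

# A Gauge permanently set to 1, with the static config exposed as labels.
cache_config_info = Gauge(
    "vllm:cache_config_info",
    "Information of the LLMEngine CacheConfig",
    labelnames=["block_size", "cache_dtype", "gpu_memory_utilization"],
    multiprocess_mode="mostrecent")

cache_config_info.labels(
    block_size="16",
    cache_dtype="auto",
    gpu_memory_utilization="0.9").set(1)
```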
376375
@@ -395,11 +394,9 @@ distinguish between per-adapter counts. This should be revisited.
395394Note that ` multiprocess_mode="livemostrecent" ` is used - the most
396395recent metric is used, but only from currently running processes.
397396
398- This was added in
399- < gh-pr:9477 > and there is
400- [ at least one known
401- user] ( https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54 ) . If
402- we revisit this design and deprecate the old metric, we should reduce
397+ This was added in < gh-pr:9477 > and there is
398+ [ at least one known user] ( https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/54 ) .
399+ If we revisit this design and deprecate the old metric, we should reduce
403400the need for a significant deprecation period by making the change in
404401v0 also and asking this project to move to the new metric.
405402
@@ -442,23 +439,20 @@ suddenly (from their perspective) when it is removed, even if there is
442439an equivalent metric for them to use.
443440
444441As an example, see how ` vllm:avg_prompt_throughput_toks_per_s ` was
445- [ deprecated] ( gh-pr:2764 ) (with a
446- comment in the code),
447- [ removed] ( gh-pr:12383 ) , and then
448- [ noticed by a
449- user] ( gh-issue:13218 ) .
442+ [ deprecated] ( gh-pr:2764 ) (with a comment in the code),
443+ [ removed] ( gh-pr:12383 ) , and then [ noticed by a user] ( gh-issue:13218 ) .
450444
451445In general:
452446
453- 1 ) We should be cautious about deprecating metrics, especially since
447+ 1. We should be cautious about deprecating metrics, especially since
454448 it can be hard to predict the user impact.
455- 2 ) We should include a prominent deprecation notice in the help string
449+ 2. We should include a prominent deprecation notice in the help string
456450 that is included in the `/metrics` output.
457- 3 ) We should list deprecated metrics in user-facing documentation and
451+ 3. We should list deprecated metrics in user-facing documentation and
458452 release notes.
459- 4 ) We should consider hiding deprecated metrics behind a CLI argument
460- in order to give administrators [ an escape
461- hatch] ( https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics )
453+ 4. We should consider hiding deprecated metrics behind a CLI argument
454+ in order to give administrators
455+ [ an escape hatch] ( https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/#show-hidden-metrics )
462456 for some time before deleting them.
463457
464458See the [ deprecation policy] ( ../../contributing/deprecation_policy.md ) for
@@ -474,15 +468,15 @@ removed.
474468The ` vllm:time_in_queue_requests ` Histogram metric was added by
475469< gh-pr:9659 > and its calculation is:
476470
477- ```
471+ ``` python
477472 self.metrics.first_scheduled_time = now
478473 self.metrics.time_in_queue = now - self.metrics.arrival_time
480474```
481475
482476Two weeks later, < gh-pr:4464 > added ` vllm:request_queue_time_seconds ` leaving
483477us with:
484478
485- ```
479+ ``` python
486480if seq_group.is_finished():
487481 if (seq_group.metrics.first_scheduled_time is not None and
488482 seq_group.metrics.first_token_time is not None ):
@@ -517,8 +511,7 @@ cache to complete other requests), we swap kv cache blocks out to CPU
517511memory. This is also known as "KV cache offloading" and is configured
518512with ` --swap-space ` and ` --preemption-mode ` .
519513
520- In v0, [ vLLM has long supported beam
521- search] ( gh-issue:6226 ) . The
514+ In v0, [ vLLM has long supported beam search] ( gh-issue:6226 ) . The
522515SequenceGroup encapsulated the idea of N Sequences which
523516all shared the same prompt kv blocks. This enabled KV cache block
524517sharing between requests, and copy-on-write to do branching. CPU
@@ -530,9 +523,8 @@ option than CPU swapping since blocks can be evicted slowly on demand
530523and the part of the prompt that was evicted can be recomputed.
531524
532525SequenceGroup was removed in V1, although a replacement will be
533- required for "parallel sampling" (` n>1 ` ). [ Beam search was moved out of
534- the core (in
535- V0)] ( gh-issue:8306 ) . There was a
526+ required for "parallel sampling" (` n>1 ` ).
527+ [ Beam search was moved out of the core (in V0)] ( gh-issue:8306 ) . There was a
536528lot of complex code for a very uncommon feature.
537529
538530In V1, with prefix caching being better (zero overhead) and therefore
@@ -547,18 +539,18 @@ Some v0 metrics are only relevant in the context of "parallel
547539sampling". This is where the ` n ` parameter in a request is used to
548540request multiple completions from the same prompt.
549541
550- As part of adding parallel sampling support in < gh-pr:10980 > we should
542+ As part of adding parallel sampling support in < gh-pr:10980 >, we should
551543also add these metrics.
552544
553545- ` vllm:request_params_n ` (Histogram)
554546
555- Observes the value of the 'n' parameter of every finished request.
547+ Observes the value of the 'n' parameter of every finished request.
556548
557549- ` vllm:request_max_num_generation_tokens ` (Histogram)
558550
559- Observes the maximum output length of all sequences in every finished
560- sequence group. In the absence of parallel sampling, this is
561- equivalent to ` vllm:request_generation_tokens ` .
551+ Observes the maximum output length of all sequences in every finished
552+ sequence group. In the absence of parallel sampling, this is
553+ equivalent to ` vllm:request_generation_tokens ` .
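
As a sketch of how these two histograms might be populated, assuming a hypothetical per-request hook that receives the request's `n` and the output length of each of its sequences:

```python
from prometheus_client import Histogram

request_params_n = Histogram(
    "vllm:request_params_n",
    "Histogram of the n request parameter.",
    labelnames=["model_name"])
request_max_num_generation_tokens = Histogram(
    "vllm:request_max_num_generation_tokens",
    "Histogram of maximum number of requested generation tokens.",
    labelnames=["model_name"])

def observe_finished_request(model_name: str, n: int, output_lens: list[int]) -> None:
    """Hypothetical hook, called once per finished request."""
    request_params_n.labels(model_name=model_name).observe(n)
    # Max over all n completions; with n=1 this equals the per-request
    # vllm:request_generation_tokens observation.
    request_max_num_generation_tokens.labels(
        model_name=model_name).observe(max(output_lens))
```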
562554
563555### Speculative Decoding
564556
@@ -576,26 +568,23 @@ There is a PR under review (<gh-pr:12193>) to add "prompt lookup (ngram)"
577569speculative decoding to v1. Other techniques will follow. We should
577569revisit the v0 metrics in this context.
578570
579- Note - we should probably expose acceptance rate as separate accepted
580- and draft counters, like we do for prefix caching hit rate. Efficiency
581- likely also needs similar treatment.
571+ !!! note
572+ We should probably expose acceptance rate as separate accepted
573+ and draft counters, like we do for prefix caching hit rate. Efficiency
574+ likely also needs similar treatment.
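
A sketch of the counter-based approach suggested above (metric names are illustrative): expose the raw counts and let the acceptance rate be derived at query time, e.g. by dividing the two rates in PromQL.

```python
from prometheus_client import Counter

accepted_tokens = Counter(
    "vllm:spec_decode_num_accepted_tokens",
    "Number of draft tokens accepted by the target model.",
    labelnames=["model_name"])
draft_tokens = Counter(
    "vllm:spec_decode_num_draft_tokens",
    "Number of draft tokens proposed by the drafting method.",
    labelnames=["model_name"])

# Per engine step, with hypothetical step-level stats:
accepted_tokens.labels(model_name="my-model").inc(3)
draft_tokens.labels(model_name="my-model").inc(5)

# Acceptance rate can then be computed at query time, for example:
#   rate(vllm:spec_decode_num_accepted_tokens_total[5m])
#     / rate(vllm:spec_decode_num_draft_tokens_total[5m])
```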
582575
583576### Autoscaling and Load-balancing
584577
585578A common use case for our metrics is to support automated scaling of
586579vLLM instances.
587580
588- For related discussion from the [ Kubernetes Serving Working
589- Group] ( https://github.com/kubernetes/community/tree/master/wg-serving ) ,
581+ For related discussion from the
582+ [ Kubernetes Serving Working Group] ( https://github.com/kubernetes/community/tree/master/wg-serving ) ,
590583see:
591584
592- - [ Standardizing Large Model Server Metrics in
593- Kubernetes] ( https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk )
594- - [ Benchmarking LLM Workloads for Performance Evaluation and
595- Autoscaling in
596- Kubernetes] ( https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ )
597- - [ Inference
598- Perf] ( https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf )
585+ - [ Standardizing Large Model Server Metrics in Kubernetes] ( https://docs.google.com/document/d/1SpSp1E6moa4HSrJnS4x3NpLuj88sMXr2tbofKlzTZpk )
586+ - [ Benchmarking LLM Workloads for Performance Evaluation and Autoscaling in Kubernetes] ( https://docs.google.com/document/d/1k4Q4X14hW4vftElIuYGDu5KDe2LtV1XammoG-Xi3bbQ )
587+ - [ Inference Perf] ( https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/013-inference-perf )
599588- < gh-issue:5041 > and < gh-pr:12726 > .
600589
601590This is a non-trivial topic. Consider this comment from Rob:
@@ -619,19 +608,16 @@ should judge an instance as approaching saturation:
619608
620609Our approach to naming metrics probably deserves to be revisited:
621610
622- 1 . The use of colons in metric names seems contrary to [ "colons are
623- reserved for user defined recording
624- rules"] ( https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels )
611+ 1. The use of colons in metric names seems contrary to
612+ [ "colons are reserved for user defined recording rules"] ( https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels ) .
6256132. Most of our metrics follow the convention of ending with units, but
626614 not all do.
6276153. Some of our metric names end with ` _total ` :
628616
629- ```
630- If there is a suffix of `_total` on the metric name, it will be removed. When
631- exposing the time series for counter, a `_total` suffix will be added. This is
632- for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
633- requires the `_total` suffix.
634- ```
617+ If there is a suffix of ` _total ` on the metric name, it will be removed. When
618+ exposing the time series for counter, a ` _total ` suffix will be added. This is
619+ for compatibility between OpenMetrics and the Prometheus text format, as OpenMetrics
620+ requires the ` _total ` suffix.
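
To illustrate the behaviour quoted above, with `prometheus_client` a Counter declared without the suffix is still exposed with `_total` appended (a small sketch, not vLLM code):

```python
from prometheus_client import Counter, generate_latest

request_success = Counter(
    "vllm:request_success",
    "Count of successfully processed requests.")
request_success.inc()

# The text exposition includes, among other series:
#   vllm:request_success_total 1.0
print(generate_latest().decode())
```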
635621
636622### Adding More Metrics
637623
@@ -642,8 +628,7 @@ There is no shortage of ideas for new metrics:
642628- Proposals arising from specific use cases, like the Kubernetes
643629 auto-scaling topic above
644630- Proposals that might arise out of standardisation efforts like
645- [ OpenTelemetry Semantic Conventions for Gen
646- AI] ( https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai ) .
631+ [ OpenTelemetry Semantic Conventions for Gen AI] ( https://github.com/open-telemetry/semantic-conventions/tree/main/docs/gen-ai ) .
647632
648633We should be cautious in our approach to adding new metrics. While
649634metrics are often relatively straightforward to add:
@@ -668,18 +653,14 @@ fall under the more general heading of "Observability".
668653v0 has support for OpenTelemetry tracing:
669654
670655- Added by < gh-pr:4687 >
671- - Configured with ` --oltp-traces-endpoint ` and
672- ` --collect-detailed-traces `
673- - [ OpenTelemetry blog
674- post] ( https://opentelemetry.io/blog/2024/llm-observability/ )
656+ - Configured with ` --otlp-traces-endpoint ` and ` --collect-detailed-traces `
657+ - [ OpenTelemetry blog post] ( https://opentelemetry.io/blog/2024/llm-observability/ )
675658- [ User-facing docs] ( ../../examples/online_serving/opentelemetry.md )
676- - [ Blog
677- post] ( https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f )
678- - [ IBM product
679- docs] ( https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview )
659+ - [ Blog post] ( https://medium.com/@ronen.schaffer/follow-the-trail-supercharging-vllm-with-opentelemetry-distributed-tracing-aa655229b46f )
660+ - [ IBM product docs] ( https://www.ibm.com/docs/en/instana-observability/current?topic=mgaa-monitoring-large-language-models-llms-vllm-public-preview )
680661
681- OpenTelemetry has a [ Gen AI Working
682- Group] ( https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md ) .
662+ OpenTelemetry has a
663+ [ Gen AI Working Group] ( https://github.com/open-telemetry/community/blob/main/projects/gen-ai.md ) .
683664
684665Since metrics is a big enough topic on its own, we are going to tackle
685666the topic of tracing in v1 separately.
@@ -698,7 +679,7 @@ These metrics are only enabled when OpenTelemetry tracing is enabled
698679and if ` --collect-detailed-traces=all/model/worker ` is used. The
699680documentation for this option states:
700681
701- > collect detailed traces for the specified " modules. This involves
682+ > collect detailed traces for the specified modules. This involves
702683> use of possibly costly and or blocking operations and hence might
703684> have a performance impact.
704685