
Commit 9fa5450 (1 parent: 02a22cb)

feat: add HTTP queue metrics for NIM frontend request tracking (#2914)
Signed-off-by: Keiven Chang <[email protected]>

File tree: 7 files changed, +451 −147 lines
deploy/metrics/README.md

Lines changed: 110 additions & 2 deletions
@@ -36,7 +36,7 @@ As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build container
 
 ### Available Metrics
 
-#### Component Metrics
+#### Backend Component Metrics
 
 The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:

@@ -47,6 +47,19 @@ The core Dynamo backend system automatically exposes metrics with the `dynamo_co
 - `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
 - `dynamo_component_system_uptime_seconds`: DistributedRuntime uptime (gauge)
 
+#### KV Router Statistics (kvstats)
+
+KV router statistics are automatically exposed by LLM workers and KV router components with the `dynamo_component_kvstats_*` prefix. These metrics provide insight into GPU memory usage and cache efficiency:
+
+- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
+- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
+- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a fraction (0.0-1.0) (gauge)
+- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a fraction (0.0-1.0) (gauge)
+
+These metrics are published by:
+- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
+- **KV Router**: The KV router component aggregates and exposes these metrics for load-balancing decisions
+
 #### Specialized Component Metrics
 
 Some components expose additional metrics specific to their functionality:
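As an aside, the kvstats gauges added above are plain Prometheus gauges, so they can be consumed with any text-format scraper. A minimal Python sketch (the sample scrape output below is invented for illustration, not taken from the repo) that extracts them and derives cache usage from the block counts:

```python
import re

def parse_gauges(metrics_text):
    """Parse simple `name{labels} value` lines from Prometheus text format."""
    gauges = {}
    for line in metrics_text.splitlines():
        m = re.match(
            r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$',
            line.strip(),
        )
        if m:
            gauges[m.group(1)] = float(m.group(2))
    return gauges

# Hypothetical /metrics excerpt from a worker
sample = """
dynamo_component_kvstats_active_blocks{component="worker"} 300
dynamo_component_kvstats_total_blocks{component="worker"} 1000
dynamo_component_kvstats_gpu_cache_usage_percent{component="worker"} 0.3
"""

g = parse_gauges(sample)
# Cache usage can be cross-checked against the block counts
usage = g["dynamo_component_kvstats_active_blocks"] / g["dynamo_component_kvstats_total_blocks"]
print(usage)  # 0.3
```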
@@ -57,14 +70,80 @@ Some components expose additional metrics specific to their functionality:
 
 When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name:
 
-- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
+- `dynamo_frontend_inflight_requests_total`: Inflight requests (gauge)
+- `dynamo_frontend_queued_requests_total`: Number of requests in the HTTP processing queue (gauge)
 - `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
 - `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram)
 - `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram)
 - `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram)
 - `dynamo_frontend_requests_total`: Total LLM requests (counter)
 - `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)
 
+**Note**: `dynamo_frontend_inflight_requests_total` tracks a request from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests_total` tracks it from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is therefore a subset of inflight time.
+
+#### Request Processing Flow
+
+This section explains the distinction between the two key metrics used to track request processing:
+
+1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished
+2. **HTTP Queue**: Tracks requests from HTTP handler start until first token generation begins (including prefill time)
+
+**Example Request Flow:**
+```
+curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
+  "model": "Qwen/Qwen3-0.6B",
+  "prompt": "Hello, let us talk about LLMs",
+  "stream": false,
+  "max_tokens": 1000
+}'
+```
+
+**Timeline:**
+```
+Timeline: 0, 1, ...
+Client ────> Frontend:8000 ──────────> Dynamo component/backend (vLLM, SGLang, TRT)
+│request start    │received
+│                 ├──> start prefill ──> first token ──> last token
+│                 │        │                  │              │
+├── actual HTTP queue¹ ────┘                  │              │
+│   (not impl)                                │              │
+├── implemented HTTP queue ───────────────────┘              │
+└────────────────────────── Inflight ────────────────────────┘
+```
+
+**Concurrency Example:**
+Suppose the backend allows 3 concurrent requests and 10 clients are continuously hitting the frontend:
+- All 10 requests are counted as inflight (from start until the complete response)
+- 7 requests sit in the HTTP queue most of the time
+- 3 requests are actively processed (between first token and last token)
+
+**Testing Setup:**
+Try launching a frontend and a Mocker backend that allows 3 concurrent requests:
+```bash
+$ python -m dynamo.frontend --http-port 8000
+$ python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --max-num-seqs 3
+# Launch your 10 concurrent clients here
+# Then check the queued_requests_total and inflight_requests_total metrics from the frontend:
+$ curl -s localhost:8000/metrics | grep -v '^#' | grep -E 'queue|inflight'
+dynamo_frontend_queued_requests_total{model="qwen/qwen3-0.6b"} 7
+dynamo_frontend_inflight_requests_total{model="qwen/qwen3-0.6b"} 10
+```
+
+**Real setup using vLLM (instead of Mocker):**
+```bash
+$ python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+    --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
+```
+
+**Key Differences:**
+- **Inflight**: Measures total request lifetime, including processing time
+- **HTTP Queue**: Measures queuing time before processing begins (including prefill time)
+- **HTTP Queue ≤ Inflight** (HTTP queue is a subset of inflight time)
+
+¹ **TODO**: Implement the "actual" HTTP queue metric that tracks from request start until first token generation begins, rather than the current implementation, which tracks until the first token is received by the frontend
+
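The 7-queued / 3-active split in the concurrency example can be sketched as a toy model of how the two gauges move as guards are created and dropped (class and method names here are illustrative, not the actual Rust types):

```python
class FrontendGauges:
    """Toy model of the inflight and HTTP-queue gauges described above."""

    def __init__(self):
        self.inflight = 0
        self.queued = 0

    def request_start(self):
        # Both gauges increment when the HTTP handler starts.
        self.inflight += 1
        self.queued += 1

    def first_token(self):
        # The HTTP-queue guard is dropped once first token generation begins.
        self.queued -= 1

    def request_done(self):
        # The inflight guard is dropped when the full response has been sent.
        self.inflight -= 1

gauges = FrontendGauges()
for _ in range(10):   # 10 clients hit the frontend
    gauges.request_start()
for _ in range(3):    # backend admits only 3 concurrent requests
    gauges.first_token()

print(gauges.queued, gauges.inflight)  # 7 10
```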
 ### Required Files
 
 The following configuration files should be present in this directory:
@@ -76,6 +155,35 @@ The following configuration files should be present in this directory:
 - [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
 - [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): This file, which is being phased out, contains the Grafana dashboard configuration for LLM-specific metrics. It requires an additional `metrics` component to operate concurrently. A new version is under development.
 
+### Metric Name Constants
+
+The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized Prometheus metric name constants and sanitization utilities for the Dynamo metrics system. This module ensures consistency across all components and prevents metric name duplication.
+
+#### Key Features
+
+- **Centralized Constants**: All Prometheus metric names are defined as constants to avoid duplication and typos
+- **Automatic Sanitization**: Functions to sanitize metric and label names according to Prometheus naming rules
+- **Component Organization**: Metric names are organized by component (frontend, work_handler, nats_client, etc.)
+- **Validation Arrays**: Arrays of metric names for iteration and validation purposes
+
+#### Metric Name Prefixes
+
+- `dynamo_component_*`: Core component metrics (requests, latency, bytes, etc.)
+- `dynamo_frontend_*`: Frontend service metrics (LLM HTTP service)
+- `nats_client_*`: NATS client connection and message metrics
+- `nats_service_*`: NATS service statistics metrics
+- `kvstats_*`: KV cache statistics from LLM workers
+
+#### Sanitization Functions
+
+The module provides functions to ensure metric and label names comply with Prometheus naming conventions:
+
+- `sanitize_prometheus_name()`: Sanitizes metric names (allows colons and `__`)
+- `sanitize_prometheus_label()`: Sanitizes label names (no colons, no `__` prefix)
+- `build_component_metric_name()`: Builds full component metric names with proper prefixing
+
+This centralized approach ensures all Dynamo components use consistent, valid Prometheus metric names without manual coordination.
+
 ## Getting Started
 
 ### Prerequisites
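To illustrate the naming rules those sanitization helpers enforce, here is a rough Python sketch. The real implementation lives in `prometheus_names.rs`; these regexes only approximate the Prometheus naming rules (metric names may contain `[a-zA-Z0-9_:]`, label names may contain `[a-zA-Z0-9_]` and must not start with `__`):

```python
import re

def sanitize_prometheus_name(name):
    """Approximate metric-name sanitization: colons allowed, other invalid chars -> '_'."""
    name = re.sub(r'[^a-zA-Z0-9_:]', '_', name)
    if re.match(r'^[0-9]', name):
        name = '_' + name  # names must not start with a digit
    return name

def sanitize_prometheus_label(label):
    """Approximate label sanitization: no colons, and no reserved '__' prefix."""
    label = re.sub(r'[^a-zA-Z0-9_]', '_', label)
    if re.match(r'^[0-9]', label):
        label = '_' + label
    while label.startswith('__'):  # '__' prefix is reserved by Prometheus
        label = label[1:]
    return label

print(sanitize_prometheus_name("http.requests:total"))  # http_requests:total
print(sanitize_prometheus_label("model-name"))          # model_name
```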

lib/llm/src/discovery/model_manager.rs

Lines changed: 9 additions & 0 deletions
@@ -254,6 +254,15 @@ impl ModelManager {
             .and_then(|config| config.tool_call_parser.clone())
             .map(|parser| parser.to_string())
     }
+
+    /// Creates parsing options with tool call parser and reasoning parser for the specified model.
+    /// Currently the reasoning parser is not implemented (returns None).
+    pub fn get_parsing_options(&self, model: &str) -> crate::protocols::openai::ParsingOptions {
+        let tool_call_parser = self.get_model_tool_call_parser(model);
+        let reasoning_parser = None; // TODO: Implement reasoning parser
+
+        crate::protocols::openai::ParsingOptions::new(tool_call_parser, reasoning_parser)
+    }
 }
 
 pub struct ModelEngines<E> {

lib/llm/src/grpc/service/kserve.rs

Lines changed: 3 additions & 3 deletions
@@ -21,7 +21,7 @@ use tokio::task::JoinHandle;
 use tokio_stream::{Stream, StreamExt};
 use tokio_util::sync::CancellationToken;
 
-use crate::grpc::service::openai::{completion_response_stream, get_parsing_options};
+use crate::grpc::service::openai::completion_response_stream;
 use tonic::{Request, Response, Status, transport::Server};
 
 use crate::protocols::openai::completions::{
@@ -207,7 +207,7 @@ impl GrpcInferenceService for KserveService {
         }
 
         let model = completion_request.inner.model.clone();
-        let parsing_options = get_parsing_options(self.state.manager(), &model);
+        let parsing_options = self.state.manager.get_parsing_options(&model);
 
         let stream = completion_response_stream(self.state_clone(), completion_request).await?;
 
@@ -277,7 +277,7 @@ impl GrpcInferenceService for KserveService {
         }
 
         let model = completion_request.inner.model.clone();
-        let parsing_options = get_parsing_options(state.manager(), &model);
+        let parsing_options = state.manager.get_parsing_options(&model);
 
         let streaming = completion_request.inner.stream.unwrap_or(false);
lib/llm/src/grpc/service/openai.rs

Lines changed: 14 additions & 26 deletions
@@ -9,10 +9,8 @@ use dynamo_runtime::{
 use futures::{Stream, StreamExt, stream};
 use std::sync::Arc;
 
-use crate::discovery::ModelManager;
-use crate::protocols::openai::{
-    ParsingOptions,
-    completions::{NvCreateCompletionRequest, NvCreateCompletionResponse},
+use crate::protocols::openai::completions::{
+    NvCreateCompletionRequest, NvCreateCompletionResponse,
 };
 use crate::types::Annotated;
 
@@ -21,9 +19,8 @@ use super::kserve;
 // [gluo NOTE] These are common utilities that should be shared between frontends
 use crate::http::service::{
     disconnect::{ConnectionHandle, create_connection_monitor},
-    metrics::{Endpoint, ResponseMetricCollector},
+    metrics::{Endpoint, InflightGuard, process_response_and_observe_metrics},
 };
-use crate::{http::service::metrics::InflightGuard, preprocessor::LLMMetricAnnotation};
 
 use tonic::Status;
 
@@ -72,6 +69,8 @@ pub async fn completion_response_stream(
         .get_completions_engine(model)
         .map_err(|_| Status::not_found("model not found"))?;
 
+    let http_queue_guard = state.metrics_clone().create_http_queue_guard(model);
+
     let inflight_guard =
         state
             .metrics_clone()
@@ -112,9 +111,15 @@ pub async fn completion_response_stream(
     // apply any annotations to the front of the stream
     let stream = stream::iter(annotations).chain(stream);
 
-    // Tap on the stream to collect response metrics
+    // Tap on the stream to collect response metrics and handle http_queue_guard
+    let mut http_queue_guard = Some(http_queue_guard);
     let stream = stream.inspect(move |response| {
-        process_metrics_only(response, &mut response_collector);
+        // Calls observe_response() on each token - drops http_queue_guard on first token
+        process_response_and_observe_metrics(
+            response,
+            &mut response_collector,
+            &mut http_queue_guard,
+        );
     });
 
     let stream = grpc_monitor_for_disconnects(stream, ctx, inflight_guard, stream_handle);
@@ -166,18 +171,8 @@ pub fn grpc_monitor_for_disconnects<T>(
         }
     }
 
-fn process_metrics_only<T>(
-    annotated: &Annotated<T>,
-    response_collector: &mut ResponseMetricCollector,
-) {
-    // update metrics
-    if let Ok(Some(metrics)) = LLMMetricAnnotation::from_annotation(annotated) {
-        response_collector.observe_current_osl(metrics.output_tokens);
-        response_collector.observe_response(metrics.input_tokens, metrics.chunk_tokens);
-    }
-}
-
 /// Get the request ID from a primary source, or lastly create a new one if not present
+// TODO: A similar function exists in lib/llm/src/http/service/openai.rs but with a different signature and more complex logic (distributed tracing, headers)
 fn get_or_create_request_id(primary: Option<&str>) -> String {
     // Try to get the request ID from the primary source
     if let Some(primary) = primary
@@ -190,10 +185,3 @@ fn get_or_create_request_id(primary: Option<&str>) -> String {
     let uuid = uuid::Uuid::new_v4();
     uuid.to_string()
 }
-
-pub fn get_parsing_options(manager: &ModelManager, model: &str) -> ParsingOptions {
-    let tool_call_parser = manager.get_model_tool_call_parser(model);
-    let reasoning_parser = None; // TODO: Implement reasoning parser
-
-    ParsingOptions::new(tool_call_parser, reasoning_parser)
-}
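The `Option`-wrapped guard handling in the diff above — drop the HTTP-queue guard exactly once, when the first stream item arrives — is a general pattern. A hypothetical Python analogue (names invented to mirror the Rust `http_queue_guard` flow, not part of the repo):

```python
def observe_with_queue_guard(stream, on_first_token):
    """Yield items unchanged, firing on_first_token exactly once when the
    first item arrives (analogous to dropping the HTTP-queue guard)."""
    guard_pending = True
    for item in stream:
        if guard_pending:
            on_first_token()   # guard "dropped": queued gauge decrements here
            guard_pending = False
        yield item

events = []
tokens = list(observe_with_queue_guard(
    iter(["t1", "t2", "t3"]),
    lambda: events.append("dequeued"),
))
print(tokens, events)  # ['t1', 't2', 't3'] ['dequeued']
```

In the Rust version the same effect falls out of ownership: `Option::take()` inside the `inspect` closure moves the guard out on the first response, and its `Drop` impl decrements the gauge.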
