
Commit 9fa5450 (1 parent: 02a22cb)

feat: add HTTP queue metrics for NIM frontend request tracking (#2914)
Signed-off-by: Keiven Chang <[email protected]>

File tree: 7 files changed, +451 −147 lines
deploy/metrics/README.md

Lines changed: 110 additions & 2 deletions
@@ -36,7 +36,7 @@ As of Q2 2025, Dynamo HTTP Frontend metrics are exposed when you build container
 
 ### Available Metrics
 
-#### Component Metrics
+#### Backend Component Metrics
 
 The core Dynamo backend system automatically exposes metrics with the `dynamo_component_*` prefix for all components that use the `DistributedRuntime` framework:

@@ -47,6 +47,19 @@ The core Dynamo backend system automatically exposes metrics with the `dynamo_co
 - `dynamo_component_response_bytes_total`: Total bytes sent in responses (counter)
 - `dynamo_component_system_uptime_seconds`: DistributedRuntime uptime (gauge)
 
+#### KV Router Statistics (kvstats)
+
+KV router statistics are automatically exposed by LLM workers and KV router components with the `dynamo_component_kvstats_*` prefix. These metrics provide insight into GPU memory usage and cache efficiency:
+
+- `dynamo_component_kvstats_active_blocks`: Number of active KV cache blocks currently in use (gauge)
+- `dynamo_component_kvstats_total_blocks`: Total number of KV cache blocks available (gauge)
+- `dynamo_component_kvstats_gpu_cache_usage_percent`: GPU cache usage as a fraction (0.0-1.0) (gauge)
+- `dynamo_component_kvstats_gpu_prefix_cache_hit_rate`: GPU prefix cache hit rate as a fraction (0.0-1.0) (gauge)
+
+These metrics are published by:
+- **LLM Workers**: vLLM and TRT-LLM backends publish these metrics through their respective publishers
+- **KV Router**: The KV router component aggregates and exposes these metrics for load-balancing decisions
+
 #### Specialized Component Metrics
 
 Some components expose additional metrics specific to their functionality:
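As an aside, the kvstats gauges added above are plain Prometheus gauges, so they can be consumed with any text-format scraper. A minimal Python sketch (the sample scrape output below is invented for illustration, not taken from the repo) that extracts them and derives cache usage from the block counts:

```python
import re

def parse_gauges(metrics_text):
    """Parse simple `name{labels} value` lines from Prometheus text format."""
    gauges = {}
    for line in metrics_text.splitlines():
        m = re.match(
            r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$',
            line.strip(),
        )
        if m:
            gauges[m.group(1)] = float(m.group(2))
    return gauges

# Hypothetical /metrics excerpt from a worker
sample = """
dynamo_component_kvstats_active_blocks{component="worker"} 300
dynamo_component_kvstats_total_blocks{component="worker"} 1000
dynamo_component_kvstats_gpu_cache_usage_percent{component="worker"} 0.3
"""

g = parse_gauges(sample)
# Cache usage can be cross-checked against the block counts
usage = g["dynamo_component_kvstats_active_blocks"] / g["dynamo_component_kvstats_total_blocks"]
print(usage)  # 0.3
```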
@@ -57,14 +70,80 @@ Some components expose additional metrics specific to their functionality:
 
 When using Dynamo HTTP Frontend (`--framework VLLM` or `--framework TRTLLM`), these metrics are automatically exposed with the `dynamo_frontend_*` prefix and include `model` labels containing the model name:
 
-- `dynamo_frontend_inflight_requests`: Inflight requests (gauge)
+- `dynamo_frontend_inflight_requests_total`: Inflight requests (gauge)
+- `dynamo_frontend_queued_requests_total`: Number of requests in the HTTP processing queue (gauge)
 - `dynamo_frontend_input_sequence_tokens`: Input sequence length (histogram)
 - `dynamo_frontend_inter_token_latency_seconds`: Inter-token latency (histogram)
 - `dynamo_frontend_output_sequence_tokens`: Output sequence length (histogram)
 - `dynamo_frontend_request_duration_seconds`: LLM request duration (histogram)
 - `dynamo_frontend_requests_total`: Total LLM requests (counter)
 - `dynamo_frontend_time_to_first_token_seconds`: Time to first token (histogram)
 
+**Note**: `dynamo_frontend_inflight_requests_total` tracks a request from HTTP handler start until the complete response is finished, while `dynamo_frontend_queued_requests_total` tracks it from HTTP handler start until first token generation begins (including prefill time). HTTP queue time is therefore a subset of inflight time.
+
+#### Request Processing Flow
+
+This section explains the distinction between the two key metrics used to track request processing:
+
+1. **Inflight**: Tracks requests from HTTP handler start until the complete response is finished
+2. **HTTP Queue**: Tracks requests from HTTP handler start until first token generation begins (including prefill time)
+
+**Example Request Flow:**
+```
+curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
+  "model": "Qwen/Qwen3-0.6B",
+  "prompt": "Hello, let us talk about LLMs",
+  "stream": false,
+  "max_tokens": 1000
+}'
+```
+
+**Timeline:**
+```
+Timeline: 0, 1, ...
+Client ────> Frontend:8000 ──────────> Dynamo component/backend (vLLM, SGLang, TRT)
+│request start    │received
+│                 ├──> start prefill ──> first token ──> last token
+│                 │        │                  │              │
+├── actual HTTP queue¹ ────┘                  │              │
+│   (not impl)                                │              │
+├── implemented HTTP queue ───────────────────┘              │
+└────────────────────────── Inflight ────────────────────────┘
+```
+
+**Concurrency Example:**
+Suppose the backend allows 3 concurrent requests and 10 clients are continuously hitting the frontend:
+- All 10 requests are counted as inflight (from start until the complete response)
+- 7 requests sit in the HTTP queue most of the time
+- 3 requests are actively processed (between first token and last token)
+
+**Testing Setup:**
+Try launching a frontend and a Mocker backend that allows 3 concurrent requests:
+```bash
+$ python -m dynamo.frontend --http-port 8000
+$ python -m dynamo.mocker --model-path Qwen/Qwen3-0.6B --max-num-seqs 3
+# Launch your 10 concurrent clients here
+# Then check the queued_requests_total and inflight_requests_total metrics from the frontend:
+$ curl -s localhost:8000/metrics | grep -v '^#' | grep -E 'queue|inflight'
+dynamo_frontend_queued_requests_total{model="qwen/qwen3-0.6b"} 7
+dynamo_frontend_inflight_requests_total{model="qwen/qwen3-0.6b"} 10
+```
+
+**Real setup using vLLM (instead of Mocker):**
+```bash
+$ python -m dynamo.vllm --model Qwen/Qwen3-0.6B \
+    --enforce-eager --no-enable-prefix-caching --max-num-seqs 3
+```
+
+**Key Differences:**
+- **Inflight**: Measures total request lifetime, including processing time
+- **HTTP Queue**: Measures queuing time before processing begins (including prefill time)
+- **HTTP Queue ≤ Inflight** (HTTP queue is a subset of inflight time)
+
+¹ **TODO**: Implement the "actual" HTTP queue metric that tracks from request start until first token generation begins, rather than the current implementation, which tracks until the first token is received by the frontend
+
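The 7-queued / 3-active split in the concurrency example can be sketched as a toy model of how the two gauges move as guards are created and dropped (class and method names here are illustrative, not the actual Rust types):

```python
class FrontendGauges:
    """Toy model of the inflight and HTTP-queue gauges described above."""

    def __init__(self):
        self.inflight = 0
        self.queued = 0

    def request_start(self):
        # Both gauges increment when the HTTP handler starts.
        self.inflight += 1
        self.queued += 1

    def first_token(self):
        # The HTTP-queue guard is dropped once first token generation begins.
        self.queued -= 1

    def request_done(self):
        # The inflight guard is dropped when the full response has been sent.
        self.inflight -= 1

gauges = FrontendGauges()
for _ in range(10):   # 10 clients hit the frontend
    gauges.request_start()
for _ in range(3):    # backend admits only 3 concurrent requests
    gauges.first_token()

print(gauges.queued, gauges.inflight)  # 7 10
```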
 ### Required Files
 
 The following configuration files should be present in this directory:
@@ -76,6 +155,35 @@ The following configuration files should be present in this directory:
 - [grafana_dashboards/grafana-dcgm-metrics.json](./grafana_dashboards/grafana-dcgm-metrics.json): Contains Grafana dashboard configuration for DCGM GPU metrics
 - [grafana_dashboards/grafana-llm-metrics.json](./grafana_dashboards/grafana-llm-metrics.json): This file, which is being phased out, contains the Grafana dashboard configuration for LLM-specific metrics. It requires an additional `metrics` component to operate concurrently. A new version is under development.
 
+### Metric Name Constants
+
+The [prometheus_names.rs](../../lib/runtime/src/metrics/prometheus_names.rs) module provides centralized Prometheus metric name constants and sanitization utilities for the Dynamo metrics system. This module ensures consistency across all components and prevents metric name duplication.
+
+#### Key Features
+
+- **Centralized Constants**: All Prometheus metric names are defined as constants to avoid duplication and typos
+- **Automatic Sanitization**: Functions to sanitize metric and label names according to Prometheus naming rules
+- **Component Organization**: Metric names are organized by component (frontend, work_handler, nats_client, etc.)
+- **Validation Arrays**: Arrays of metric names for iteration and validation purposes
+
+#### Metric Name Prefixes
+
+- `dynamo_component_*`: Core component metrics (requests, latency, bytes, etc.)
+- `dynamo_frontend_*`: Frontend service metrics (LLM HTTP service)
+- `nats_client_*`: NATS client connection and message metrics
+- `nats_service_*`: NATS service statistics metrics
+- `kvstats_*`: KV cache statistics from LLM workers
+
+#### Sanitization Functions
+
+The module provides functions to ensure metric and label names comply with Prometheus naming conventions:
+
+- `sanitize_prometheus_name()`: Sanitizes metric names (allows colons and `__`)
+- `sanitize_prometheus_label()`: Sanitizes label names (no colons, no `__` prefix)
+- `build_component_metric_name()`: Builds full component metric names with proper prefixing
+
+This centralized approach ensures all Dynamo components use consistent, valid Prometheus metric names without manual coordination.
+
 ## Getting Started
 
 ### Prerequisites
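To illustrate the naming rules those sanitization helpers enforce, here is a rough Python sketch. The real implementation lives in `prometheus_names.rs`; these regexes only approximate the Prometheus naming rules (metric names may contain `[a-zA-Z0-9_:]`, label names may contain `[a-zA-Z0-9_]` and must not start with `__`):

```python
import re

def sanitize_prometheus_name(name):
    """Approximate metric-name sanitization: colons allowed, other invalid chars -> '_'."""
    name = re.sub(r'[^a-zA-Z0-9_:]', '_', name)
    if re.match(r'^[0-9]', name):
        name = '_' + name  # names must not start with a digit
    return name

def sanitize_prometheus_label(label):
    """Approximate label sanitization: no colons, and no reserved '__' prefix."""
    label = re.sub(r'[^a-zA-Z0-9_]', '_', label)
    if re.match(r'^[0-9]', label):
        label = '_' + label
    while label.startswith('__'):  # '__' prefix is reserved by Prometheus
        label = label[1:]
    return label

print(sanitize_prometheus_name("http.requests:total"))  # http_requests:total
print(sanitize_prometheus_label("model-name"))          # model_name
```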

lib/llm/src/discovery/model_manager.rs

Lines changed: 9 additions & 0 deletions
@@ -254,6 +254,15 @@ impl ModelManager {
             .and_then(|config| config.tool_call_parser.clone())
             .map(|parser| parser.to_string())
     }
+
+    /// Creates parsing options with tool call parser and reasoning parser for the specified model.
+    /// Currently the reasoning parser is not implemented (returns None).
+    pub fn get_parsing_options(&self, model: &str) -> crate::protocols::openai::ParsingOptions {
+        let tool_call_parser = self.get_model_tool_call_parser(model);
+        let reasoning_parser = None; // TODO: Implement reasoning parser
+
+        crate::protocols::openai::ParsingOptions::new(tool_call_parser, reasoning_parser)
+    }
 }
 
 pub struct ModelEngines<E> {

lib/llm/src/grpc/service/kserve.rs

Lines changed: 3 additions & 3 deletions
@@ -21,7 +21,7 @@ use tokio::task::JoinHandle;
 use tokio_stream::{Stream, StreamExt};
 use tokio_util::sync::CancellationToken;
 
-use crate::grpc::service::openai::{completion_response_stream, get_parsing_options};
+use crate::grpc::service::openai::completion_response_stream;
 use tonic::{Request, Response, Status, transport::Server};
 
 use crate::protocols::openai::completions::{
@@ -207,7 +207,7 @@ impl GrpcInferenceService for KserveService {
         }
 
         let model = completion_request.inner.model.clone();
-        let parsing_options = get_parsing_options(self.state.manager(), &model);
+        let parsing_options = self.state.manager.get_parsing_options(&model);
 
         let stream = completion_response_stream(self.state_clone(), completion_request).await?;
 
@@ -277,7 +277,7 @@ impl GrpcInferenceService for KserveService {
         }
 
         let model = completion_request.inner.model.clone();
-        let parsing_options = get_parsing_options(state.manager(), &model);
+        let parsing_options = state.manager.get_parsing_options(&model);
 
         let streaming = completion_request.inner.stream.unwrap_or(false);
lib/llm/src/grpc/service/openai.rs

Lines changed: 14 additions & 26 deletions
@@ -9,10 +9,8 @@ use dynamo_runtime::{
 use futures::{Stream, StreamExt, stream};
 use std::sync::Arc;
 
-use crate::discovery::ModelManager;
-use crate::protocols::openai::{
-    ParsingOptions,
-    completions::{NvCreateCompletionRequest, NvCreateCompletionResponse},
+use crate::protocols::openai::completions::{
+    NvCreateCompletionRequest, NvCreateCompletionResponse,
 };
 use crate::types::Annotated;
 
@@ -21,9 +19,8 @@ use super::kserve;
 // [gluo NOTE] These are common utilities that should be shared between frontends
 use crate::http::service::{
     disconnect::{ConnectionHandle, create_connection_monitor},
-    metrics::{Endpoint, ResponseMetricCollector},
+    metrics::{Endpoint, InflightGuard, process_response_and_observe_metrics},
 };
-use crate::{http::service::metrics::InflightGuard, preprocessor::LLMMetricAnnotation};
 
 use tonic::Status;
 
@@ -72,6 +69,8 @@ pub async fn completion_response_stream(
         .get_completions_engine(model)
         .map_err(|_| Status::not_found("model not found"))?;
 
+    let http_queue_guard = state.metrics_clone().create_http_queue_guard(model);
+
     let inflight_guard =
         state
             .metrics_clone()
@@ -112,9 +111,15 @@ pub async fn completion_response_stream(
     // apply any annotations to the front of the stream
     let stream = stream::iter(annotations).chain(stream);
 
-    // Tap on the stream to collect response metrics
+    // Tap on the stream to collect response metrics and handle http_queue_guard
+    let mut http_queue_guard = Some(http_queue_guard);
     let stream = stream.inspect(move |response| {
-        process_metrics_only(response, &mut response_collector);
+        // Calls observe_response() on each token - drops http_queue_guard on first token
+        process_response_and_observe_metrics(
+            response,
+            &mut response_collector,
+            &mut http_queue_guard,
+        );
     });
 
     let stream = grpc_monitor_for_disconnects(stream, ctx, inflight_guard, stream_handle);
@@ -166,18 +171,8 @@ pub fn grpc_monitor_for_disconnects<T>(
         }
     }
 
-fn process_metrics_only<T>(
-    annotated: &Annotated<T>,
-    response_collector: &mut ResponseMetricCollector,
-) {
-    // update metrics
-    if let Ok(Some(metrics)) = LLMMetricAnnotation::from_annotation(annotated) {
-        response_collector.observe_current_osl(metrics.output_tokens);
-        response_collector.observe_response(metrics.input_tokens, metrics.chunk_tokens);
-    }
-}
-
 /// Get the request ID from a primary source, or lastly create a new one if not present
+// TODO: A similar function exists in lib/llm/src/http/service/openai.rs but with a different signature and more complex logic (distributed tracing, headers)
 fn get_or_create_request_id(primary: Option<&str>) -> String {
     // Try to get the request ID from the primary source
     if let Some(primary) = primary
@@ -190,10 +185,3 @@ fn get_or_create_request_id(primary: Option<&str>) -> String {
     let uuid = uuid::Uuid::new_v4();
     uuid.to_string()
 }
-
-pub fn get_parsing_options(manager: &ModelManager, model: &str) -> ParsingOptions {
-    let tool_call_parser = manager.get_model_tool_call_parser(model);
-    let reasoning_parser = None; // TODO: Implement reasoning parser
-
-    ParsingOptions::new(tool_call_parser, reasoning_parser)
-}
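The `Option`-wrapped guard handling in the diff above — drop the HTTP-queue guard exactly once, when the first stream item arrives — is a general pattern. A hypothetical Python analogue (names invented to mirror the Rust `http_queue_guard` flow, not part of the repo):

```python
def observe_with_queue_guard(stream, on_first_token):
    """Yield items unchanged, firing on_first_token exactly once when the
    first item arrives (analogous to dropping the HTTP-queue guard)."""
    guard_pending = True
    for item in stream:
        if guard_pending:
            on_first_token()   # guard "dropped": queued gauge decrements here
            guard_pending = False
        yield item

events = []
tokens = list(observe_with_queue_guard(
    iter(["t1", "t2", "t3"]),
    lambda: events.append("dequeued"),
))
print(tokens, events)  # ['t1', 't2', 't3'] ['dequeued']
```

In the Rust version the same effect falls out of ownership: `Option::take()` inside the `inspect` closure moves the guard out on the first response, and its `Drop` impl decrements the gauge.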
