llm-d · Gregory-Pereira · Mar 4, 2026 · Feb 12, 2026 · Feb 12, 2026 · Feb 26, 2026
diff --git a/README.md b/README.md
@@ -79,6 +79,11 @@ For more details of architecture see the [project proposal](./docs/proposals/llm
 
 See the [accelerator docs](./docs/accelerators/README.md) for points of contact for more details about the accelerators, networks, and configurations tested and our [roadmap](https://github.com/llm-d/llm-d/issues/146) for what is coming next.
 
+## 🔍 Observability
+
+- [Monitoring & Metrics](./docs/monitoring/README.md) - Prometheus, Grafana dashboards, and PromQL queries
+- [Distributed Tracing](./docs/monitoring/tracing/README.md) - OpenTelemetry tracing across vLLM, routing proxy, and EPP
+
 ## 📦 Releases
 
 Our [guides](./guides/README.md) are living docs and kept current. For details about the Helm charts and component releases, visit our [GitHub Releases page](https://github.com/llm-d/llm-d/releases) to review release notes.

diff --git a/docs/monitoring/README.md b/docs/monitoring/README.md
@@ -113,12 +113,25 @@ EPP provides additional metrics for request routing, scheduling latency, and plu
 
 EPP metrics include request rates, error rates, scheduling latency, and plugin processing times, providing insights into the inference routing and scheduling performance.
 
+## Distributed Tracing
+
+llm-d supports OpenTelemetry distributed tracing across vLLM, the routing proxy, and the EPP/inference scheduler. See [Distributed Tracing](./tracing/README.md) for setup instructions, or use the [install-otel-collector-jaeger.sh](./scripts/install-otel-collector-jaeger.sh) script to deploy an OTel Collector and Jaeger backend with one command.
+
 ## Dashboards
 
 Grafana dashboard raw JSON files can be imported manually into a Grafana UI. Here is a current list of community dashboards:
 
 - [llm-d vLLM Overview dashboard](./grafana/dashboards/llm-d-vllm-overview.json)
-  - vLLM metrics
+  - General vLLM metrics overview for monitoring llm-d inference servers
+- [llm-d Failure & Saturation Indicators dashboard](./grafana/dashboards/llm-d-failure-saturation-dashboard.json)
+  - Key failure and saturation indicators for identifying system issues and capacity constraints
+- [llm-d Diagnostic Drill-Down dashboard](./grafana/dashboards/llm-d-diagnostic-drilldown-dashboard.json)
+  - Detailed diagnostic metrics for investigating performance issues
+- [llm-d Performance dashboard](./grafana/dashboards/llm-performance-kv-cache.json)
+  - Performance metrics including KV cache utilization
+- [P/D Coordinator Metrics dashboard](./grafana/dashboards/pd-coordinator-metrics.json)
+  - Prefill/Decode disaggregation performance metrics
+  - Shows vLLM E2E latency, prefill duration, decode duration, and phase breakdown
 - [inference-gateway dashboard v1.0.1](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/v1.0.1/tools/dashboards/inference_gateway.json)
   - EPP metrics
 - [GKE managed inference gateway dashboard](https://cloud.google.com/kubernetes-engine/docs/how-to/customize-gke-inference-gateway-configurations#inference-gateway-dashboard)
@@ -133,7 +146,8 @@ For specific PromQL queries to monitor LLM-D deployments, see:
 
 To populate metrics (especially error metrics) for testing and monitoring validation:
 
-- [Load Generation Script](./scripts/generate-load-llmd.sh) - Sends both valid and malformed requests to generate metrics
+- [Traffic Generation Script](./scripts/generate-traffic-basic.sh) - Sends both valid and malformed requests to generate metrics
+- [P/D Traffic Generator](./scripts/generate-traffic-pd.sh) - Sends concurrent requests to produce traces that exercise P/D disaggregation
 
 ## Troubleshooting
 

diff --git a/docs/monitoring/example-promQL-queries.md b/docs/monitoring/example-promQL-queries.md
@@ -1,7 +1,7 @@
 # Example PromQL Queries for LLM-D Monitoring
 
 This document provides PromQL queries for monitoring LLM-D deployments using Prometheus metrics.
-The provided [load generation script](./scripts/generate-load-llmd.sh) will populate error metrics for testing.
+The provided [traffic generation script](./scripts/generate-traffic-basic.sh) will populate error metrics for testing.
 
 ## Tier 1: Immediate Failure & Saturation Indicators
 
@@ -41,8 +41,8 @@ The provided [load generation script](./scripts/generate-load-llmd.sh) will popu
 | ----------- | ------------ |
 | **Request Distribution** (QPS per instance) | `sum by(pod) (rate(inference_objective_request_total{target_model!=""}[5m]))` |
 | **Token Distribution** | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` |
-| **Idle GPU Time** | `1 - avg by(pod) (rate(vllm:iteration_tokens_total_count[5m]) > 0)` |
-| **Routing Decision Latency** | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m])))` |
+| **Idle GPU Time** | `1 - clamp_max(rate(vllm:iteration_tokens_total_count[5m]), 1)` |
+| **Routing Decision Latency** | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_plugin_duration_seconds_bucket[5m])))` |
 
 ### Path C: Prefix Caching
 
@@ -99,22 +99,22 @@ The provided [load generation script](./scripts/generate-load-llmd.sh) will popu
 ### Error Metrics
 
 - Error metrics (`*_error_total`) only appear after the first error occurs
-- Use the provided [load generation script](./scripts/generate-load-llmd.sh) to populate error metrics for testing
+- Use the provided [traffic generation script](./scripts/generate-traffic-basic.sh) to populate error metrics for testing
 
 ## Missing Metrics (Require Additional Instrumentation)
 
 The following metrics from community-gathered monitoring requirements are not currently available and would need custom instrumentation:
 
 ### Path C: Prefix Caching
 
-- **Cache Eviction Rate**: No metrics track when cache entries are evicted due to memory pressure
 - **Prefix Cache Memory Usage (Absolute)**: Only percentage utilization is available
+- **Cache Eviction Rate**: No direct eviction-rate metric is available. As a partial substitute, KV cache residency metrics are exposed when `--kv-cache-metrics-enabled` is set: `vllm:kv_block_lifetime_seconds`, `vllm:kv_block_idle_before_evict_seconds`, `vllm:kv_block_reuse_gap_seconds`
 
 ### Path D: P/D Disaggregation
 
 - **KV Cache Transfer Times**: No metrics track the latency of transferring KV cache between prefill and decode workers
 
 ### Workarounds
 
-- **Cache Pressure Detection**: Monitor trends in `vllm:prefix_cache_hits` / `vllm:prefix_cache_queries` - declining hit rates may indicate cache evictions
+- **Cache Pressure Detection**: Monitor the trend of the windowed hit rate, `rate(vllm:prefix_cache_hits_total[5m]) / rate(vllm:prefix_cache_queries_total[5m])`. A declining hit rate may indicate cache evictions
 - **Transfer Bottlenecks**: Monitor overall latency spikes during P/D operations as an indirect indicator
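
Note on the Cache Pressure Detection workaround: `vllm:prefix_cache_hits_total` and `vllm:prefix_cache_queries_total` are cumulative counters, so dividing their raw values gives a lifetime average that can hide a recent drop. The sketch below illustrates the difference using made-up scrape values. The `CounterSample` type and `windowed_hit_rate` helper are hypothetical names used only for illustration; they are not part of llm-d or vLLM.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterSample:
    """One scrape of the two cumulative prefix-cache counters (illustrative values)."""
    timestamp: float  # seconds since the first scrape
    hits: float       # value of vllm:prefix_cache_hits_total
    queries: float    # value of vllm:prefix_cache_queries_total


def windowed_hit_rate(older: CounterSample, newer: CounterSample) -> Optional[float]:
    """Hit rate over the interval between two scrapes.

    Returns None when there was no query traffic in the window or when a
    counter went backwards (for example after a pod restart).
    """
    delta_hits = newer.hits - older.hits
    delta_queries = newer.queries - older.queries
    if delta_queries <= 0 or delta_hits < 0:
        return None
    return delta_hits / delta_queries


# Hypothetical scrapes taken 5 minutes apart.
samples = [
    CounterSample(timestamp=0, hits=0, queries=0),
    CounterSample(timestamp=300, hits=900, queries=1000),
    CounterSample(timestamp=600, hits=1500, queries=2000),
]

# Per-window hit rates, analogous to a rate()-based ratio in PromQL.
window_rates = [windowed_hit_rate(a, b) for a, b in zip(samples, samples[1:])]

# Lifetime ratio of the raw counters, which smooths over the recent decline.
lifetime_rate = samples[-1].hits / samples[-1].queries

print("per-window hit rates:", window_rates)
print("lifetime hit rate:", lifetime_rate)
```

With these sample values the second window's hit rate is noticeably lower than the first, while the lifetime ratio sits in between. That is why a windowed ratio is the better signal for spotting cache pressure.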