Merged
5 changes: 5 additions & 0 deletions README.md
@@ -79,6 +79,11 @@ For more details of architecture see the [project proposal](./docs/proposals/llm

See the [accelerator docs](./docs/accelerators/README.md) for points of contact for more details about the accelerators, networks, and configurations tested and our [roadmap](https://github.com/llm-d/llm-d/issues/146) for what is coming next.

## 🔍 Observability

- [Monitoring & Metrics](./docs/monitoring/README.md) - Prometheus, Grafana dashboards, and PromQL queries
- [Distributed Tracing](./docs/monitoring/tracing/README.md) - OpenTelemetry tracing across vLLM, routing proxy, and EPP

## 📦 Releases

Our [guides](./guides/README.md) are living docs and kept current. For details about the Helm charts and component releases, visit our [GitHub Releases page](https://github.com/llm-d/llm-d/releases) to review release notes.
18 changes: 16 additions & 2 deletions docs/monitoring/README.md
@@ -113,12 +113,25 @@ EPP provides additional metrics for request routing, scheduling latency, and plu

EPP metrics include request rates, error rates, scheduling latency, and plugin processing times, providing insights into the inference routing and scheduling performance.

## Distributed Tracing

llm-d supports OpenTelemetry distributed tracing across vLLM, the routing proxy, and the EPP/inference scheduler. See [Distributed Tracing](./tracing/README.md) for setup instructions, or use the [install-otel-collector-jaeger.sh](./scripts/install-otel-collector-jaeger.sh) script to deploy an OTel Collector and Jaeger backend with one command.

## Dashboards

Grafana dashboard raw JSON files can be imported manually into a Grafana UI. Here is a current list of community dashboards:

- [llm-d vLLM Overview dashboard](./grafana/dashboards/llm-d-vllm-overview.json)
  - vLLM metrics
  - General vLLM metrics overview for monitoring llm-d inference servers
- [llm-d Failure & Saturation Indicators dashboard](./grafana/dashboards/llm-d-failure-saturation-dashboard.json)
  - Key failure and saturation indicators for identifying system issues and capacity constraints
- [llm-d Diagnostic Drill-Down dashboard](./grafana/dashboards/llm-d-diagnostic-drilldown-dashboard.json)
  - Detailed diagnostic metrics for investigating performance issues
- [llm-d Performance Dashboard](./grafana/dashboards/llm-performance-kv-cache.json)
  - Performance metrics including KV cache utilization
- [P/D Coordinator Metrics dashboard](./grafana/dashboards/pd-coordinator-metrics.json)
  - Prefill/Decode disaggregation performance metrics
  - Shows vLLM E2E latency, prefill duration, decode duration, and phase breakdown
- [inference-gateway dashboard v1.0.1](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/v1.0.1/tools/dashboards/inference_gateway.json)
  - EPP metrics
- [GKE managed inference gateway dashboard](https://cloud.google.com/kubernetes-engine/docs/how-to/customize-gke-inference-gateway-configurations#inference-gateway-dashboard)
@@ -133,7 +146,8 @@ For specific PromQL queries to monitor LLM-D deployments, see:

To populate metrics (especially error metrics) for testing and monitoring validation:

- [Load Generation Script](./scripts/generate-load-llmd.sh) - Sends both valid and malformed requests to generate metrics
- [Traffic Generation Script](./scripts/generate-traffic-basic.sh) - Sends both valid and malformed requests to generate metrics
- [P/D Traffic Generator](./scripts/generate-traffic-pd.sh) - Concurrent traffic optimized for P/D disaggregation tracing

## Troubleshooting

12 changes: 6 additions & 6 deletions docs/monitoring/example-promQL-queries.md
@@ -1,7 +1,7 @@
# Example PromQL Queries for LLM-D Monitoring

This document provides PromQL queries for monitoring LLM-D deployments using Prometheus metrics.
The provided [load generation script](./scripts/generate-load-llmd.sh) will populate error metrics for testing.
The provided [load generation script](./scripts/generate-traffic-basic.sh) will populate error metrics for testing.

## Tier 1: Immediate Failure & Saturation Indicators

@@ -41,8 +41,8 @@ The provided [load generation script](./scripts/generate-load-llmd.sh) will popu
| ----------- | ------------ |
| **Request Distribution** (QPS per instance) | `sum by(pod) (rate(inference_objective_request_total{target_model!=""}[5m]))` |
| **Token Distribution** | `sum by(pod) (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))` |
| **Idle GPU Time** | `1 - avg by(pod) (rate(vllm:iteration_tokens_total_count[5m]) > 0)` |
| **Routing Decision Latency** | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_scheduler_plugin_duration_seconds_bucket[5m])))` |
| **Idle GPU Time** | `1 - clamp_max(rate(vllm:iteration_tokens_total_count[5m]), 1)` |
| **Routing Decision Latency** | `histogram_quantile(0.99, sum by(le) (rate(inference_extension_plugin_duration_seconds_bucket[5m])))` |
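
To make the revised **Idle GPU Time** expression concrete, the following minimal Python sketch reproduces the arithmetic of `1 - clamp_max(rate(...), 1)` for a single series. The function name and the sample rates are hypothetical and chosen only for illustration; the only thing taken from the query above is its shape.

```python
def idle_fraction(iterations_per_second: float) -> float:
    """Mirror `1 - clamp_max(rate(...), 1)` for one series (illustrative only)."""
    clamped = min(iterations_per_second, 1.0)  # clamp_max(v, 1): cap the rate at 1
    return 1.0 - clamped


# Hypothetical per-pod iteration rates, not measured values.
for rate in (0.0, 0.25, 3.0):
    print(rate, idle_fraction(rate))
```

Under this reading, any series running at one or more iterations per second reports zero idle time, so the expression behaves as a coarse idle indicator rather than a precise utilization measure.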

### Path C: Prefix Caching

@@ -99,22 +99,22 @@ The provided [load generation script](./scripts/generate-load-llmd.sh) will popu
### Error Metrics

- Error metrics (`*_error_total`) only appear after the first error occurs
- Use the provided [load generation script](./scripts/generate-load-llmd.sh) to populate error metrics for testing
- Use the provided [load generation script](./scripts/generate-traffic-basic.sh) to populate error metrics for testing

## Missing Metrics (Require Additional Instrumentation)

The following metrics from community-gathered monitoring requirements are not currently available and would need custom instrumentation:

### Path C: Prefix Caching

- **Cache Eviction Rate**: No metrics track when cache entries are evicted due to memory pressure
- **Prefix Cache Memory Usage (Absolute)**: Only percentage utilization is available
- **Cache Eviction Rate**: KV cache residency metrics are available when `--kv-cache-metrics-enabled` is set: `vllm:kv_block_lifetime_seconds`, `vllm:kv_block_idle_before_evict_seconds`, `vllm:kv_block_reuse_gap_seconds`

### Path D: P/D Disaggregation

- **KV Cache Transfer Times**: No metrics track the latency of transferring KV cache between prefill and decode workers

### Workarounds

- **Cache Pressure Detection**: Monitor trends in `vllm:prefix_cache_hits` / `vllm:prefix_cache_queries` - declining hit rates may indicate cache evictions
- **Cache Pressure Detection**: Monitor trends in `vllm:prefix_cache_hits_total` / `vllm:prefix_cache_queries_total` - declining hit rates may indicate cache evictions
- **Transfer Bottlenecks**: Monitor overall latency spikes during P/D operations as an indirect indicator
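
The cache pressure workaround amounts to tracking a ratio of counter increases over time. A minimal Python sketch of that calculation follows, assuming you have already obtained the per-window increases of `vllm:prefix_cache_hits_total` and `vllm:prefix_cache_queries_total`. The function name and the sample numbers are hypothetical.

```python
from typing import Optional


def prefix_cache_hit_rate(hits_delta: float, queries_delta: float) -> Optional[float]:
    """Hit rate for one window, from the increases of the two counters."""
    if queries_delta <= 0:
        return None  # no queries in the window, so the ratio is undefined
    return hits_delta / queries_delta


# Hypothetical (hits, queries) increases for three consecutive windows.
windows = [(900.0, 1000.0), (700.0, 1000.0), (450.0, 1000.0)]
rates = [prefix_cache_hit_rate(h, q) for h, q in windows]

# A steadily falling hit rate is the indirect eviction signal described above.
declining = all(later < earlier for earlier, later in zip(rates, rates[1:]))
print(rates, declining)
```

A falling hit rate can also come from a change in traffic mix, so treat a decline as a prompt to investigate rather than proof of eviction.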