Note
Other AI code review bot(s) detected
CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough
Decouples telemetry tracing enablement from Prometheus, moves descheduler tracing options to a top-level cmdOptions block, standardizes OTLP host:port endpoint formatting, adds a telemetry-gateway HelmRelease and load‑balancing exporter, enables JSON logging, expands k8sattributes extraction, and raises Prometheus tracing sampling to 1.0.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant W as Instrumented Workload
participant C as Alloy Collector (node/agent)
participant G as Telemetry Gateway (Alloy HelmRelease)
participant T as Tempo Distributor
Note over W,C: OTLP gRPC export uses formatted host:port
W->>C: Export spans (OTLP gRPC)
C->>C: k8sattributes extraction, JSON logging, batch processing
C->>G: OTLP gRPC (load‑balancing resolver, zstd, tls.insecure)
G->>G: Tail sampling & batch processing (sampling_fraction=1.0)
G->>T: Export traces to tempo distributor:4317 (OTLP)
Note right of T: ServiceMonitor exposes metrics if Prometheus enabled
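Tail sampling only works if every span of a trace reaches the same gateway instance, which is why the collector sends spans through a load-balancing exporter in the diagram above. The following is a minimal sketch of that idea, not the exporter's actual algorithm: it uses a plain hash-and-modulo choice, and the function name `pick_gateway` and the gateway addresses are hypothetical.

```python
import hashlib


def pick_gateway(trace_id: str, gateways: list[str]) -> str:
    """Map a trace ID to one gateway so all spans of a trace land together.

    Simplified illustration: a stable hash of the trace ID selects the
    gateway. This is an assumption for demonstration, not the real
    load-balancing exporter implementation.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(gateways)
    return gateways[index]


# Hypothetical gateway endpoints in host:port form.
GATEWAYS = ["gateway-0:4317", "gateway-1:4317", "gateway-2:4317"]

if __name__ == "__main__":
    # Spans that share a trace ID always resolve to the same gateway.
    print(pick_gateway("4bf92f3577b34da6a3ce929d0e0e4736", GATEWAYS))
```

With plain modulo hashing, changing the number of gateways remaps most trace IDs, so a production setup would want a more stable scheme.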
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pull Request Overview
This PR implements tail sampling for distributed tracing by introducing a telemetry gateway. The gateway filters traces based on errors, latency (>200ms), and random sampling (0.1%), while upstream clients now sample 100% of traces.
- Introduces a new Alloy-based telemetry gateway with tail sampling policies
- Updates all tracing clients to use 100% sampling and route through the gateway
- Removes the requirement for Prometheus to be enabled when tracing is enabled
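The sampling rules described in this overview can be sketched as a small decision function. This is an illustration under assumptions, not the gateway's configuration: it treats the three rules as independent keep conditions, approximates trace latency by the longest span, and all names (`Span`, `keep_trace`, the constants) are hypothetical.

```python
import random
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 200       # ">200ms" from the PR description
RANDOM_KEEP_PROBABILITY = 0.001  # "0.1%" from the PR description


@dataclass
class Span:
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, rng=random.random):
    """Decide whether a completed trace is stored.

    Assumption: a trace is kept if any one of the three rules matches.
    """
    # Rule 1: traces containing an error are always kept.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: slow traces are kept (latency approximated by the longest span).
    if max(span.duration_ms for span in spans) > LATENCY_THRESHOLD_MS:
        return True
    # Rule 3: everything else is kept with a small random probability.
    return rng() < RANDOM_KEEP_PROBABILITY
```

Because the decision needs the whole trace, clients have to send every span upstream, which is why the PR sets client-side sampling to 100%.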
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| charts/base-cluster/templates/monitoring/alloy-gateway.yaml | New telemetry gateway deployment with tail sampling configuration |
| charts/base-cluster/templates/monitoring/alloy-collector.yaml | Routes traces through load-balanced gateway instead of directly to Tempo |
| charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml | Increases sampling fraction from 0.1 to 1.0 |
| charts/base-cluster/templates/kyverno/kyverno.yaml | Removes Prometheus dependency from telemetry enablement |
| charts/base-cluster/templates/ingress/traefik.yaml | Adds explicit gRPC enablement for OTLP endpoint |
| charts/base-cluster/templates/ingress/nginx.yaml | Removes Prometheus dependency from telemetry enablement |
| charts/base-cluster/templates/descheduler/descheduler.yaml | Migrates from deschedulerPolicy.tracing to cmdOptions for OTLP configuration |
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
charts/base-cluster/templates/kyverno/kyverno.yaml (1)
70-77: Fix Helm template: `and` needs at least two args.
`and` in Go templates requires ≥2 operands; calling `and $telemetryConf.enabled` raises `wrong number of args for and` at render time. Assign the value directly instead.
- {{- $telemetryEnabled := and $telemetryConf.enabled -}}
+ {{- $telemetryEnabled := $telemetryConf.enabled -}}
charts/base-cluster/templates/ingress/nginx.yaml (1)
32-39: Remove the single-operand `and` to avoid template failure.
`{{- if and $telemetryConf.enabled }}` triggers `wrong number of args for and` during Helm rendering. Use the flag directly.
- {{- if and $telemetryConf.enabled }}
+ {{- if $telemetryConf.enabled }}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- charts/base-cluster/templates/descheduler/descheduler.yaml (1 hunks)
- charts/base-cluster/templates/ingress/nginx.yaml (1 hunks)
- charts/base-cluster/templates/ingress/traefik.yaml (1 hunks)
- charts/base-cluster/templates/kyverno/kyverno.yaml (1 hunks)
- charts/base-cluster/templates/monitoring/alloy-collector.yaml (3 hunks)
- charts/base-cluster/templates/monitoring/alloy-gateway.yaml (1 hunks)
- charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
charts/base-cluster/templates/monitoring/alloy-gateway.yaml
[error] 1-1: syntax error: expected the node content, but found '-'
(syntax)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: check licenses
- GitHub Check: lint helm chart (base-cluster)
8b44a8e to b6606b5
That way, traces with errors are always stored, as are traces longer than 200 ms; all other traces are kept with a random chance of 0.1%. Adjust your clients to sample 100% of the spans/traces.
b6606b5 to 74da747
🤖 I have created a release *beep* *boop*

---

## [10.1.0](base-cluster-v10.0.3...base-cluster-v10.1.0) (2025-11-28)

### Features

* **base-cluster/logging:** enable automatic resizing ([#1785](#1785)) ([167e5e0](167e5e0))
* **base-cluster/tracing:** add gateway to enable tail sampling ([#1736](#1736)) ([7c1bd9a](7c1bd9a))

### Bug Fixes

* **base-cluster/backup:** fix secret creation for velero ([#1816](#1816)) ([04a8ca0](04a8ca0))
* **base-cluster/monitoring:** alertmanager condition ([#1781](#1781)) ([b6abed0](b6abed0))

### Miscellaneous Chores

* **base-cluster/dependencies:** update common docker tag to v1.6.0 ([#1796](#1796)) ([f1d8f05](f1d8f05))
* **base-cluster/dependencies:** update docker.io/curlimages/curl docker tag to v8.17.0 ([#1797](#1797)) ([86362fe](86362fe))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker tag to v2.7.3 ([#1798](#1798)) ([f7b42d1](f7b42d1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker tag to v2.7.4 ([#1818](#1818)) ([6a318a1](6a318a1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker tag to v2.7.5 ([#1823](#1823)) ([bcd266e](bcd266e))
* **base-cluster/dependencies:** update docker.io/grafana/grafana-image-renderer docker tag to v3.12.9 ([#1639](#1639)) ([e99101a](e99101a))
* **base-cluster/dependencies:** update helm release alloy to v1.2.1 ([#1771](#1771)) ([87df788](87df788))
* **base-cluster/dependencies:** update helm release alloy to v1.4.0 ([#1799](#1799)) ([9bc1aaa](9bc1aaa))
* **base-cluster/dependencies:** update helm release descheduler to v0.34.0 ([#1800](#1800)) ([33f9a53](33f9a53))
* **base-cluster/dependencies:** update helm release external-dns to v1.19.0 ([#1801](#1801)) ([c1f24a4](c1f24a4))
* **base-cluster/dependencies:** update helm release kube-prometheus-stack to v75.15.2 ([#1772](#1772)) ([0cc66b2](0cc66b2))
* **base-cluster/dependencies:** update helm release kube-prometheus-stack to v75.18.1 ([#1802](#1802)) ([b096b58](b096b58))
* **base-cluster/dependencies:** update helm release loki to v6.46.0 ([#1727](#1727)) ([ec1b906](ec1b906))
* **base-cluster/dependencies:** update helm release metrics-server to v3.13.0 ([#1805](#1805)) ([6ba8633](6ba8633))
* **base-cluster/dependencies:** update helm release oauth2-proxy to v7.14.2 ([#1635](#1635)) ([d88c7c0](d88c7c0))
* **base-cluster/dependencies:** update helm release oauth2-proxy to v7.18.0 ([#1806](#1806)) ([636a585](636a585))
* **base-cluster/dependencies:** update helm release reflector to v9.1.39 ([#1790](#1790)) ([5b032af](5b032af))
* **base-cluster/dependencies:** update helm release reflector to v9.1.40 ([#1819](#1819)) ([da8be9d](da8be9d))
* **base-cluster/dependencies:** update helm release tempo-distributed to v1.48.1 ([#1791](#1791)) ([d00ac00](d00ac00))
* **base-cluster/dependencies:** update helm release tempo-distributed to v1.56.2 ([#1807](#1807)) ([80c67d1](80c67d1))
* **base-cluster/dependencies:** update helm release tetragon to v1.6.0 ([#1808](#1808)) ([c3fb92d](c3fb92d))
* **base-cluster/dependencies:** update helm release trivy-operator to v0.31.0 ([#1809](#1809)) ([59976f6](59976f6))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>