
feat(base-cluster/tracing): add gateway to enable tail sampling #1736

Merged
cwrau merged 1 commit into main from feat/base-cluster/add-telemetry-gateway-for-tail-sampling
Nov 20, 2025
Conversation

Member

@cwrau cwrau commented Oct 14, 2025

That way, traces with errors are always stored, as are traces longer than 200 ms and a random 0.1% of the rest.

Adjust your clients to simply sample 100% of the spans/traces.
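The policy set described above can be sketched as a Grafana Alloy `otelcol.processor.tail_sampling` block. Policy names and the downstream exporter reference are illustrative, not the chart's exact configuration:

```alloy
// Keep error traces, slow traces (>200ms), and a 0.1% random sample.
otelcol.processor.tail_sampling "gateway" {
  policy {
    name = "errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "slow-traces"
    type = "latency"
    latency {
      threshold_ms = 200
    }
  }

  policy {
    name = "sample-random"
    type = "probabilistic"
    probabilistic {
      // Range is 0-100, so 0.1 means 0.1%, not 10%.
      sampling_percentage = 0.1
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```

Tail sampling only works if the gateway sees whole traces, which is why clients should export everything and let the gateway decide.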

Summary by CodeRabbit

  • New Features

    • Added a telemetry gateway for distributed tracing with tail-sampling policies and OTLP endpoints.
    • Enabled JSON-formatted logs in the telemetry collector.
    • Added a load‑balanced OTLP exporter with Kubernetes service discovery and zstd compression.
  • Refactor

    • Simplified telemetry enablement across components (no longer tied to Prometheus).
    • Standardized OTLP endpoint handling to explicit host:port.
    • Moved descheduler tracing options to a unified telemetry command-options block.
    • Enriched Kubernetes metadata on traces and increased default trace sampling to 100%.

Copilot AI review requested due to automatic review settings October 14, 2025 09:27
@cwrau cwrau enabled auto-merge October 14, 2025 09:27

coderabbitai Bot commented Oct 14, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Decouples telemetry tracing enablement from Prometheus, moves descheduler tracing options to a top-level cmdOptions block, standardizes OTLP host:port endpoint formatting, adds a telemetry-gateway HelmRelease and load‑balancing exporter, enables JSON logging, expands k8sattributes extraction, and raises Prometheus tracing sampling to 1.0.

Changes

Telemetry condition simplification
  Files: charts/base-cluster/templates/descheduler/descheduler.yaml, charts/base-cluster/templates/ingress/nginx.yaml, charts/base-cluster/templates/kyverno/kyverno.yaml
  Summary: Removed the dependency on monitoring.prometheus.enabled for rendering telemetry/tracing blocks; tracing is now gated solely by telemetryConf.enabled. Descheduler telemetry fields moved into a new top-level cmdOptions subsection.

Traefik OTLP endpoint formatting
  Files: charts/base-cluster/templates/ingress/traefik.yaml
  Summary: OTLP gRPC endpoint changed from a direct endpoint ref to a formatted host:port string (quoted) with explicit grpc.enabled: true; insecure handling remains gated by telemetry config.

Alloy collector pipeline overhaul
  Files: charts/base-cluster/templates/monitoring/alloy-collector.yaml
  Summary: Added JSON logging; added a tracing block (sampling_fraction=1.0); expanded k8sattributes metadata extraction; replaced the distinct tempo exporter with a loadbalancing.gateway exporter using a Kubernetes resolver targeting telemetry-gateway, OTLP with zstd compression, and tls.insecure.

New telemetry gateway HelmRelease
  Files: charts/base-cluster/templates/monitoring/alloy-gateway.yaml
  Summary: Added a HelmRelease telemetry-gateway (namespace monitoring) deploying a Grafana Alloy collector with an OTLP receiver, tail/batch sampling policies, a Tempo exporter endpoint, extra OTLP/metrics ports, autoscaling, and a ServiceMonitor; rendering is gated by both monitoring.prometheus.enabled and monitoring.tracing.enabled.

Prometheus tracing sampling
  Files: charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml
  Summary: Increased tracingConfig.clientType.grpc.samplingFraction from 0.1 to 1.0.
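The collector's load-balancing exporter change roughly corresponds to an Alloy block like the one below. The service name and namespace are assumptions based on the summary above, not the chart's exact values:

```alloy
otelcol.exporter.loadbalancing "gateway" {
  resolver {
    kubernetes {
      // Discover gateway pod endpoints via the Kubernetes API so that
      // all spans of one trace consistently reach the same gateway pod.
      service = "telemetry-gateway.monitoring"
    }
  }

  protocol {
    otlp {
      client {
        compression = "zstd"
        tls {
          insecure = true
        }
      }
    }
  }
}
```

Consistent per-trace routing matters here: tail sampling can only evaluate a trace's error status and latency if every span of that trace lands on the same gateway instance.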

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant W as Instrumented Workload
  participant C as Alloy Collector (node/agent)
  participant G as Telemetry Gateway (Alloy HelmRelease)
  participant T as Tempo Distributor

  Note over W,C: OTLP gRPC export uses formatted host:port
  W->>C: Export spans (OTLP gRPC)
  C->>C: k8sattributes extraction, JSON logging, batch processing
  C->>G: OTLP gRPC (load-balancing resolver, zstd, tls.insecure)
  G->>G: Tail sampling & batch processing (sampling_fraction=1.0)
  G->>T: Export traces to tempo distributor:4317 (OTLP)
  Note right of T: ServiceMonitor exposes metrics if Prometheus enabled
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tasches
  • marvinWolff
  • teutonet-bot

Poem

A rabbit scurries, code in paw,
Traces hop along the raw.
Gateway hums and spans take flight,
JSON logs glow through the night.
Zstd crunches, sampling beams — carrots for observability dreams! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "feat(base-cluster/tracing): add gateway to enable tail sampling" directly corresponds to the primary change in this pull request: the introduction of a new telemetry-gateway HelmRelease with tail sampling capabilities (the new alloy-gateway.yaml file). The title is concise, specific, and clearly communicates the main feature, in line with the PR objectives of adding a gateway with policies for error traces, long traces, and random sampling. The changeset also refactors telemetry configuration across other components (descheduler, nginx, traefik, kyverno, collector), but these are supporting changes that enable the core gateway feature rather than the main point of the PR.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6606b5 and 74da747.

📒 Files selected for processing (7)
  • charts/base-cluster/templates/descheduler/descheduler.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/nginx.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/traefik.yaml (1 hunks)
  • charts/base-cluster/templates/kyverno/kyverno.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml (3 hunks)
  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • charts/base-cluster/templates/kyverno/kyverno.yaml
  • charts/base-cluster/templates/ingress/nginx.yaml
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
charts/base-cluster/templates/monitoring/alloy-gateway.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: check licenses
  • GitHub Check: lint helm chart (base-cluster)
🔇 Additional comments (10)
charts/base-cluster/templates/ingress/traefik.yaml (1)

80-91: LGTM!

The telemetry endpoint construction and OTLP gRPC configuration looks correct. The host:port formatting with proper int64 type conversion is sound.
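The host:port construction noted here typically looks something like the following in a Helm template. Variable names are illustrative, not the chart's exact code:

```yaml
tracing:
  otlp:
    grpc:
      enabled: true
      # Explicit host:port string; int64 conversion avoids scientific
      # notation when the port value comes from chart values.
      endpoint: {{ printf "%s:%d" $telemetryConf.host (int64 $telemetryConf.port) | quote }}
      insecure: {{ $telemetryConf.insecure }}
```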

charts/base-cluster/templates/monitoring/alloy-collector.yaml (4)

44-46: LGTM!

JSON logging format is consistent with broader observability improvements across ingress components.


196-199: LGTM!

The tracing block with 100% sampling fraction correctly implements the PR objective of instructing the collector to sample 100% of spans, allowing downstream tail sampling to make decisions.


201-215: Comprehensive metadata extraction enhancement.

The expanded metadata extraction (replicaset, statefulset, daemonset, cronjob, job, pod.start_time, container details) significantly improves trace context and debuggability. Well done.


243-265: k8sattributes processor authentication is not broken—remove this review comment.

The k8sattributes processor defaults to "serviceAccount" authentication when auth_type is omitted, using the Pod's service account token to access the Kubernetes API. The removal of the explicit auth_type = "serviceAccount" line does not introduce a failure; the processor will authenticate correctly by default. The current configuration (line 201 onward) will function as intended for pod discovery and resource attribute enrichment.

Likely an incorrect or invalid review comment.
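For reference, a collector-side `otelcol.processor.k8sattributes` block relying on the default service-account auth and the expanded metadata set might look like this. The metadata keys are taken from the summary above; the block is a sketch, not the chart's exact configuration:

```alloy
otelcol.processor.k8sattributes "default" {
  // auth_type defaults to "serviceAccount" when omitted, so the
  // processor uses the pod's service account token for API access.
  extract {
    metadata = [
      "k8s.namespace.name",
      "k8s.pod.name",
      "k8s.pod.start_time",
      "k8s.replicaset.name",
      "k8s.statefulset.name",
      "k8s.daemonset.name",
      "k8s.cronjob.name",
      "k8s.job.name",
      "k8s.container.name",
    ]
  }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}
```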

charts/base-cluster/templates/monitoring/alloy-gateway.yaml (4)

1-1: LGTM!

The conditional gating correctly uses and with 2+ arguments (unlike the earlier descheduler issue). Both prometheus.enabled and tracing.enabled conditions are required.
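The dual gating described here can be expressed as follows; the values paths are assumed from the walkthrough and the Flux apiVersion is illustrative:

```yaml
{{- if and .Values.monitoring.prometheus.enabled .Values.monitoring.tracing.enabled }}
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: telemetry-gateway
  namespace: monitoring
# spec omitted in this sketch
{{- end }}
```

Unlike a single-operand `and $x` (which fails at render time), `and` with two operands is valid and only renders the release when both flags are set.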


44-91: Tail sampling policies correctly implement PR objectives.

The three sampling policies (error traces, 200ms latency threshold, probabilistic 0.1%) align well with stated requirements and ensure critical traces are retained while reducing overall volume.


84-84: Clarify sampling_percentage semantics in comments.

The sampling_percentage = 0.1 value is ambiguous without documentation. Based on PR description mentioning "0.1% random sampling," confirm this is the correct interpretation:

  • If sampling_percentage expects a percentage value (0–100 range): 0.1 = 0.1% ✓ correct
  • If sampling_percentage expects a fraction (0–1 range): 0.1 = 10% ✗ incorrect

Add a clarifying comment above line 84 to document the intended semantics:

             policy {
               name = "sample-random"
               type = "probabilistic"
               probabilistic {
+                # sampling_percentage = 0.1 means 0.1%, not 10% (percentage range is 0-100)
                 sampling_percentage = 0.1
               }
             }

Also verify against the Alloy/OpenTelemetry documentation for the tail sampling processor to confirm the value range interpretation.


99-107: LGTM!

The OTLP Tempo exporter endpoint and TLS insecure configuration are appropriate for in-cluster communication. The endpoint targets the Grafana Tempo distributor service correctly.

charts/base-cluster/templates/descheduler/descheduler.yaml (1)

41-41: Consider passing global context for consistency with traefik, though autodiscovery will still function.

Line 41 doesn't pass the "global" .Values.global parameter that traefik.yaml uses (line 80). Unlike the review comment implies, the template won't malfunction—it has a fallback (line 44 in _telemetry.tpl uses | default (dict)) that triggers autodiscovery mode. However, this means descheduler relies on service autodiscovery rather than explicit telemetry config from .Values.global.telemetry.otlp. If explicit configuration is needed (e.g., for non-standard collector endpoints), add the parameter:

-    {{- $telemetryConf := include "common.telemetry.conf" (dict "protocol" "otlp") | fromYaml }}
+    {{- $telemetryConf := include "common.telemetry.conf" (dict "protocol" "otlp" "global" .Values.global) | fromYaml }}

Note: Other components (prometheus, kyverno, nginx) omit global and use autodiscovery, so this is not a blocking issue.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Pull Request Overview

This PR implements tail sampling for distributed tracing by introducing a telemetry gateway. The gateway filters traces based on errors, latency (>200ms), and random sampling (0.1%), while upstream clients now sample 100% of traces.

  • Introduces a new Alloy-based telemetry gateway with tail sampling policies
  • Updates all tracing clients to use 100% sampling and route through the gateway
  • Removes the requirement for Prometheus to be enabled when tracing is enabled

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml: New telemetry gateway deployment with tail sampling configuration
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml: Routes traces through the load-balanced gateway instead of directly to Tempo
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml: Increases sampling fraction from 0.1 to 1.0
  • charts/base-cluster/templates/kyverno/kyverno.yaml: Removes Prometheus dependency from telemetry enablement
  • charts/base-cluster/templates/ingress/traefik.yaml: Adds explicit gRPC enablement for the OTLP endpoint
  • charts/base-cluster/templates/ingress/nginx.yaml: Removes Prometheus dependency from telemetry enablement
  • charts/base-cluster/templates/descheduler/descheduler.yaml: Migrates from deschedulerPolicy.tracing to cmdOptions for OTLP configuration

Tip: Customize your code reviews with copilot-instructions.md.

Comment thread charts/base-cluster/templates/monitoring/alloy-gateway.yaml
Comment thread charts/base-cluster/templates/monitoring/alloy-collector.yaml

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
charts/base-cluster/templates/kyverno/kyverno.yaml (1)

70-77: Fix Helm template: and needs at least two args.

and in Go templates requires ≥2 operands; calling and $telemetryConf.enabled raises wrong number of args for and at render time. Assign the value directly instead.

-      {{- $telemetryEnabled := and $telemetryConf.enabled -}}
+      {{- $telemetryEnabled := $telemetryConf.enabled -}}
charts/base-cluster/templates/ingress/nginx.yaml (1)

32-39: Remove the single-operand and to avoid template failure.

{{- if and $telemetryConf.enabled }} triggers wrong number of args for and during Helm rendering. Use the flag directly.

-      {{- if and $telemetryConf.enabled }}
+      {{- if $telemetryConf.enabled }}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51a42a2 and 8b44a8e.

📒 Files selected for processing (7)
  • charts/base-cluster/templates/descheduler/descheduler.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/nginx.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/traefik.yaml (1 hunks)
  • charts/base-cluster/templates/kyverno/kyverno.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml (3 hunks)
  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
charts/base-cluster/templates/monitoring/alloy-gateway.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: check licenses
  • GitHub Check: lint helm chart (base-cluster)

Comment thread charts/base-cluster/templates/descheduler/descheduler.yaml
@cwrau cwrau force-pushed the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch from 8b44a8e to b6606b5 on October 22, 2025 09:18
@cwrau cwrau force-pushed the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch from b6606b5 to 74da747 on October 24, 2025 09:54
@cwrau cwrau added this pull request to the merge queue Nov 20, 2025
Merged via the queue into main with commit 7c1bd9a Nov 20, 2025
31 of 32 checks passed
@cwrau cwrau deleted the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch November 20, 2025 11:47
github-merge-queue Bot pushed a commit that referenced this pull request Nov 28, 2025
🤖 I have created a release *beep* *boop*
---


## [10.1.0](base-cluster-v10.0.3...base-cluster-v10.1.0) (2025-11-28)


### Features

* **base-cluster/logging:** enable automatic resizing
([#1785](#1785))
([167e5e0](167e5e0))
* **base-cluster/tracing:** add gateway to enable tail sampling
([#1736](#1736))
([7c1bd9a](7c1bd9a))


### Bug Fixes

* **base-cluster/backup:** fix secret creation for velero
([#1816](#1816))
([04a8ca0](04a8ca0))
* **base-cluster/monitoring:** alertmanager condition
([#1781](#1781))
([b6abed0](b6abed0))


### Miscellaneous Chores

* **base-cluster/dependencies:** update common docker tag to v1.6.0
([#1796](#1796))
([f1d8f05](f1d8f05))
* **base-cluster/dependencies:** update docker.io/curlimages/curl docker
tag to v8.17.0
([#1797](#1797))
([86362fe](86362fe))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.3
([#1798](#1798))
([f7b42d1](f7b42d1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.4
([#1818](#1818))
([6a318a1](6a318a1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.5
([#1823](#1823))
([bcd266e](bcd266e))
* **base-cluster/dependencies:** update
docker.io/grafana/grafana-image-renderer docker tag to v3.12.9
([#1639](#1639))
([e99101a](e99101a))
* **base-cluster/dependencies:** update helm release alloy to v1.2.1
([#1771](#1771))
([87df788](87df788))
* **base-cluster/dependencies:** update helm release alloy to v1.4.0
([#1799](#1799))
([9bc1aaa](9bc1aaa))
* **base-cluster/dependencies:** update helm release descheduler to
v0.34.0
([#1800](#1800))
([33f9a53](33f9a53))
* **base-cluster/dependencies:** update helm release external-dns to
v1.19.0
([#1801](#1801))
([c1f24a4](c1f24a4))
* **base-cluster/dependencies:** update helm release
kube-prometheus-stack to v75.15.2
([#1772](#1772))
([0cc66b2](0cc66b2))
* **base-cluster/dependencies:** update helm release
kube-prometheus-stack to v75.18.1
([#1802](#1802))
([b096b58](b096b58))
* **base-cluster/dependencies:** update helm release loki to v6.46.0
([#1727](#1727))
([ec1b906](ec1b906))
* **base-cluster/dependencies:** update helm release metrics-server to
v3.13.0
([#1805](#1805))
([6ba8633](6ba8633))
* **base-cluster/dependencies:** update helm release oauth2-proxy to
v7.14.2
([#1635](#1635))
([d88c7c0](d88c7c0))
* **base-cluster/dependencies:** update helm release oauth2-proxy to
v7.18.0
([#1806](#1806))
([636a585](636a585))
* **base-cluster/dependencies:** update helm release reflector to
v9.1.39
([#1790](#1790))
([5b032af](5b032af))
* **base-cluster/dependencies:** update helm release reflector to
v9.1.40
([#1819](#1819))
([da8be9d](da8be9d))
* **base-cluster/dependencies:** update helm release tempo-distributed
to v1.48.1
([#1791](#1791))
([d00ac00](d00ac00))
* **base-cluster/dependencies:** update helm release tempo-distributed
to v1.56.2
([#1807](#1807))
([80c67d1](80c67d1))
* **base-cluster/dependencies:** update helm release tetragon to v1.6.0
([#1808](#1808))
([c3fb92d](c3fb92d))
* **base-cluster/dependencies:** update helm release trivy-operator to
v0.31.0
([#1809](#1809))
([59976f6](59976f6))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>