
feat(base-cluster/tracing): add gateway to enable tail sampling #1736

Merged
cwrau merged 1 commit into main from feat/base-cluster/add-telemetry-gateway-for-tail-sampling
Nov 20, 2025
Conversation

Member

@cwrau cwrau commented Oct 14, 2025

That way, traces with errors are always stored, as are traces longer than 200 ms and a random 0.1% of the rest.

Adjust your clients to simply sample 100% of the spans/traces.
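The policy set described above can be sketched as a Grafana Alloy `otelcol.processor.tail_sampling` block. Policy names and the downstream exporter reference are illustrative, not the chart's exact configuration:

```alloy
// Keep error traces, slow traces (>200ms), and a 0.1% random sample.
otelcol.processor.tail_sampling "gateway" {
  policy {
    name = "errors"
    type = "status_code"
    status_code {
      status_codes = ["ERROR"]
    }
  }

  policy {
    name = "slow-traces"
    type = "latency"
    latency {
      threshold_ms = 200
    }
  }

  policy {
    name = "sample-random"
    type = "probabilistic"
    probabilistic {
      // Range is 0-100, so 0.1 means 0.1%, not 10%.
      sampling_percentage = 0.1
    }
  }

  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}
```

Tail sampling only works if the gateway sees whole traces, which is why clients should export everything and let the gateway decide.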

Summary by CodeRabbit

  • New Features

    • Added a telemetry gateway for distributed tracing with tail-sampling policies and OTLP endpoints.
    • Enabled JSON-formatted logs in the telemetry collector.
    • Added a load‑balanced OTLP exporter with Kubernetes service discovery and zstd compression.
  • Refactor

    • Simplified telemetry enablement across components (no longer tied to Prometheus).
    • Standardized OTLP endpoint handling to explicit host:port.
    • Moved descheduler tracing options to a unified telemetry command-options block.
    • Enriched Kubernetes metadata on traces and increased default trace sampling to 100%.

Copilot AI review requested due to automatic review settings October 14, 2025 09:27
@cwrau cwrau enabled auto-merge October 14, 2025 09:27

coderabbitai Bot commented Oct 14, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Decouples telemetry tracing enablement from Prometheus, moves descheduler tracing options to a top-level cmdOptions block, standardizes OTLP host:port endpoint formatting, adds a telemetry-gateway HelmRelease and load‑balancing exporter, enables JSON logging, expands k8sattributes extraction, and raises Prometheus tracing sampling to 1.0.

Changes

Telemetry condition simplification
  Files: charts/base-cluster/templates/descheduler/descheduler.yaml, charts/base-cluster/templates/ingress/nginx.yaml, charts/base-cluster/templates/kyverno/kyverno.yaml
  Summary: Removed the dependency on monitoring.prometheus.enabled for rendering telemetry/tracing blocks; tracing is now gated solely by telemetryConf.enabled. Descheduler telemetry fields moved into a new top-level cmdOptions subsection.

Traefik OTLP endpoint formatting
  Files: charts/base-cluster/templates/ingress/traefik.yaml
  Summary: OTLP gRPC endpoint changed from a direct endpoint ref to a formatted host:port string (quoted) with explicit grpc.enabled: true; insecure handling remains gated by telemetry config.

Alloy collector pipeline overhaul
  Files: charts/base-cluster/templates/monitoring/alloy-collector.yaml
  Summary: Added JSON logging; added a tracing block (sampling_fraction=1.0); expanded k8sattributes metadata extraction; replaced the distinct tempo exporter with a loadbalancing.gateway exporter using a Kubernetes resolver targeting telemetry-gateway, OTLP with zstd compression, and tls.insecure.

New telemetry gateway HelmRelease
  Files: charts/base-cluster/templates/monitoring/alloy-gateway.yaml
  Summary: Added a HelmRelease telemetry-gateway (namespace monitoring) deploying a Grafana Alloy collector with an OTLP receiver, tail/batch sampling policies, a Tempo exporter endpoint, extra OTLP/metrics ports, autoscaling, and a ServiceMonitor; rendering is gated by both monitoring.prometheus.enabled and monitoring.tracing.enabled.

Prometheus tracing sampling
  Files: charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml
  Summary: Increased tracingConfig.clientType.grpc.samplingFraction from 0.1 to 1.0.
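The collector's load-balancing exporter change roughly corresponds to an Alloy block like the one below. The service name and namespace are assumptions based on the summary above, not the chart's exact values:

```alloy
otelcol.exporter.loadbalancing "gateway" {
  resolver {
    kubernetes {
      // Discover gateway pod endpoints via the Kubernetes API so that
      // all spans of one trace consistently reach the same gateway pod.
      service = "telemetry-gateway.monitoring"
    }
  }

  protocol {
    otlp {
      client {
        compression = "zstd"
        tls {
          insecure = true
        }
      }
    }
  }
}
```

Consistent per-trace routing matters here: tail sampling can only evaluate a trace's error status and latency if every span of that trace lands on the same gateway instance.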

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  participant W as Instrumented Workload
  participant C as Alloy Collector (node/agent)
  participant G as Telemetry Gateway (Alloy HelmRelease)
  participant T as Tempo Distributor

  Note over W,C: OTLP gRPC export uses formatted host:port
  W->>C: Export spans (OTLP gRPC)
  C->>C: k8sattributes extraction, JSON logging, batch processing
  C->>G: OTLP gRPC (load-balancing resolver, zstd, tls.insecure)
  G->>G: Tail sampling & batch processing (sampling_fraction=1.0)
  G->>T: Export traces to tempo distributor:4317 (OTLP)
  Note right of T: ServiceMonitor exposes metrics if Prometheus enabled
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • tasches
  • marvinWolff
  • teutonet-bot

Poem

A rabbit scurries, code in paw,
Traces hop along the raw.
Gateway hums and spans take flight,
JSON logs glow through the night.
Zstd crunches, sampling beams — carrots for observability dreams! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "feat(base-cluster/tracing): add gateway to enable tail sampling" directly corresponds to the primary change in this pull request: the introduction of a new telemetry-gateway HelmRelease with tail sampling capabilities (the new alloy-gateway.yaml file). The title is concise, specific, and clearly communicates the main feature, in line with the PR objectives of adding a gateway with policies for error traces, long traces, and random sampling. The changeset also refactors telemetry configuration across other components (descheduler, nginx, traefik, kyverno, collector), but these are supporting changes that enable the core gateway feature rather than the main point of the PR.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6606b5 and 74da747.

📒 Files selected for processing (7)
  • charts/base-cluster/templates/descheduler/descheduler.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/nginx.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/traefik.yaml (1 hunks)
  • charts/base-cluster/templates/kyverno/kyverno.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml (3 hunks)
  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • charts/base-cluster/templates/kyverno/kyverno.yaml
  • charts/base-cluster/templates/ingress/nginx.yaml
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml
🧰 Additional context used
🪛 YAMLlint (1.37.1)
charts/base-cluster/templates/monitoring/alloy-gateway.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: check licenses
  • GitHub Check: lint helm chart (base-cluster)
🔇 Additional comments (10)
charts/base-cluster/templates/ingress/traefik.yaml (1)

80-91: LGTM!

The telemetry endpoint construction and OTLP gRPC configuration looks correct. The host:port formatting with proper int64 type conversion is sound.
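The host:port construction noted here typically looks something like the following in a Helm template. Variable names are illustrative, not the chart's exact code:

```yaml
tracing:
  otlp:
    grpc:
      enabled: true
      # Explicit host:port string; int64 conversion avoids scientific
      # notation when the port value comes from chart values.
      endpoint: {{ printf "%s:%d" $telemetryConf.host (int64 $telemetryConf.port) | quote }}
      insecure: {{ $telemetryConf.insecure }}
```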

charts/base-cluster/templates/monitoring/alloy-collector.yaml (4)

44-46: LGTM!

JSON logging format is consistent with broader observability improvements across ingress components.


196-199: LGTM!

The tracing block with 100% sampling fraction correctly implements the PR objective of instructing the collector to sample 100% of spans, allowing downstream tail sampling to make decisions.


201-215: Comprehensive metadata extraction enhancement.

The expanded metadata extraction (replicaset, statefulset, daemonset, cronjob, job, pod.start_time, container details) significantly improves trace context and debuggability. Well done.


243-265: k8sattributes processor authentication is not broken—remove this review comment.

The k8sattributes processor defaults to "serviceAccount" authentication when auth_type is omitted, using the Pod's service account token to access the Kubernetes API. The removal of the explicit auth_type = "serviceAccount" line does not introduce a failure; the processor will authenticate correctly by default. The current configuration (line 201 onward) will function as intended for pod discovery and resource attribute enrichment.

Likely an incorrect or invalid review comment.
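For reference, a collector-side `otelcol.processor.k8sattributes` block relying on the default service-account auth and the expanded metadata set might look like this. The metadata keys are taken from the summary above; the block is a sketch, not the chart's exact configuration:

```alloy
otelcol.processor.k8sattributes "default" {
  // auth_type defaults to "serviceAccount" when omitted, so the
  // processor uses the pod's service account token for API access.
  extract {
    metadata = [
      "k8s.namespace.name",
      "k8s.pod.name",
      "k8s.pod.start_time",
      "k8s.replicaset.name",
      "k8s.statefulset.name",
      "k8s.daemonset.name",
      "k8s.cronjob.name",
      "k8s.job.name",
      "k8s.container.name",
    ]
  }

  output {
    traces = [otelcol.processor.batch.default.input]
  }
}
```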

charts/base-cluster/templates/monitoring/alloy-gateway.yaml (4)

1-1: LGTM!

The conditional gating correctly uses and with 2+ arguments (unlike the earlier descheduler issue). Both prometheus.enabled and tracing.enabled conditions are required.
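The dual gating described here can be expressed as follows; the values paths are assumed from the walkthrough and the Flux apiVersion is illustrative:

```yaml
{{- if and .Values.monitoring.prometheus.enabled .Values.monitoring.tracing.enabled }}
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: telemetry-gateway
  namespace: monitoring
# spec omitted in this sketch
{{- end }}
```

Unlike a single-operand `and $x` (which fails at render time), `and` with two operands is valid and only renders the release when both flags are set.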


44-91: Tail sampling policies correctly implement PR objectives.

The three sampling policies (error traces, 200ms latency threshold, probabilistic 0.1%) align well with stated requirements and ensure critical traces are retained while reducing overall volume.


84-84: Clarify sampling_percentage semantics in comments.

The sampling_percentage = 0.1 value is ambiguous without documentation. Based on PR description mentioning "0.1% random sampling," confirm this is the correct interpretation:

  • If sampling_percentage expects a percentage value (0–100 range): 0.1 = 0.1% ✓ correct
  • If sampling_percentage expects a fraction (0–1 range): 0.1 = 10% ✗ incorrect

Add a clarifying comment above line 84 to document the intended semantics:

             policy {
               name = "sample-random"
               type = "probabilistic"
               probabilistic {
+                # sampling_percentage = 0.1 means 0.1%, not 10% (percentage range is 0-100)
                 sampling_percentage = 0.1
               }
             }

Also verify against the Alloy/OpenTelemetry documentation for the tail sampling processor to confirm the value range interpretation.


99-107: LGTM!

The OTLP Tempo exporter endpoint and TLS insecure configuration are appropriate for in-cluster communication. The endpoint targets the Grafana Tempo distributor service correctly.

charts/base-cluster/templates/descheduler/descheduler.yaml (1)

41-41: Consider passing global context for consistency with traefik, though autodiscovery will still function.

Line 41 doesn't pass the "global" .Values.global parameter that traefik.yaml uses (line 80). Unlike the review comment implies, the template won't malfunction—it has a fallback (line 44 in _telemetry.tpl uses | default (dict)) that triggers autodiscovery mode. However, this means descheduler relies on service autodiscovery rather than explicit telemetry config from .Values.global.telemetry.otlp. If explicit configuration is needed (e.g., for non-standard collector endpoints), add the parameter:

-    {{- $telemetryConf := include "common.telemetry.conf" (dict "protocol" "otlp") | fromYaml }}
+    {{- $telemetryConf := include "common.telemetry.conf" (dict "protocol" "otlp" "global" .Values.global) | fromYaml }}

Note: Other components (prometheus, kyverno, nginx) omit global and use autodiscovery, so this is not a blocking issue.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI left a comment

Pull Request Overview

This PR implements tail sampling for distributed tracing by introducing a telemetry gateway. The gateway filters traces based on errors, latency (>200ms), and random sampling (0.1%), while upstream clients now sample 100% of traces.

  • Introduces a new Alloy-based telemetry gateway with tail sampling policies
  • Updates all tracing clients to use 100% sampling and route through the gateway
  • Removes the requirement for Prometheus to be enabled when tracing is enabled

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml: New telemetry gateway deployment with tail sampling configuration
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml: Routes traces through the load-balanced gateway instead of directly to Tempo
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml: Increases sampling fraction from 0.1 to 1.0
  • charts/base-cluster/templates/kyverno/kyverno.yaml: Removes Prometheus dependency from telemetry enablement
  • charts/base-cluster/templates/ingress/traefik.yaml: Adds explicit gRPC enablement for the OTLP endpoint
  • charts/base-cluster/templates/ingress/nginx.yaml: Removes Prometheus dependency from telemetry enablement
  • charts/base-cluster/templates/descheduler/descheduler.yaml: Migrates from deschedulerPolicy.tracing to cmdOptions for OTLP configuration

Tip: Customize your code reviews with copilot-instructions.md.

Comment thread charts/base-cluster/templates/monitoring/alloy-gateway.yaml
Comment thread charts/base-cluster/templates/monitoring/alloy-collector.yaml

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
charts/base-cluster/templates/kyverno/kyverno.yaml (1)

70-77: Fix Helm template: and needs at least two args.

and in Go templates requires ≥2 operands; calling and $telemetryConf.enabled raises wrong number of args for and at render time. Assign the value directly instead.

-      {{- $telemetryEnabled := and $telemetryConf.enabled -}}
+      {{- $telemetryEnabled := $telemetryConf.enabled -}}
charts/base-cluster/templates/ingress/nginx.yaml (1)

32-39: Remove the single-operand and to avoid template failure.

{{- if and $telemetryConf.enabled }} triggers wrong number of args for and during Helm rendering. Use the flag directly.

-      {{- if and $telemetryConf.enabled }}
+      {{- if $telemetryConf.enabled }}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 51a42a2 and 8b44a8e.

📒 Files selected for processing (7)
  • charts/base-cluster/templates/descheduler/descheduler.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/nginx.yaml (1 hunks)
  • charts/base-cluster/templates/ingress/traefik.yaml (1 hunks)
  • charts/base-cluster/templates/kyverno/kyverno.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/alloy-collector.yaml (3 hunks)
  • charts/base-cluster/templates/monitoring/alloy-gateway.yaml (1 hunks)
  • charts/base-cluster/templates/monitoring/kube-prometheus-stack/_prometheus_config.yaml (1 hunks)
🧰 Additional context used
🪛 YAMLlint (1.37.1)
charts/base-cluster/templates/monitoring/alloy-gateway.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: check licenses
  • GitHub Check: lint helm chart (base-cluster)

Comment thread charts/base-cluster/templates/descheduler/descheduler.yaml
@cwrau cwrau force-pushed the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch from 8b44a8e to b6606b5 on October 22, 2025 09:18
@cwrau cwrau force-pushed the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch from b6606b5 to 74da747 on October 24, 2025 09:54
@cwrau cwrau added this pull request to the merge queue Nov 20, 2025
Merged via the queue into main with commit 7c1bd9a Nov 20, 2025
31 of 32 checks passed
@cwrau cwrau deleted the feat/base-cluster/add-telemetry-gateway-for-tail-sampling branch November 20, 2025 11:47
github-merge-queue Bot pushed a commit that referenced this pull request Nov 28, 2025
🤖 I have created a release *beep* *boop*
---


## [10.1.0](base-cluster-v10.0.3...base-cluster-v10.1.0) (2025-11-28)


### Features

* **base-cluster/logging:** enable automatic resizing
([#1785](#1785))
([167e5e0](167e5e0))
* **base-cluster/tracing:** add gateway to enable tail sampling
([#1736](#1736))
([7c1bd9a](7c1bd9a))


### Bug Fixes

* **base-cluster/backup:** fix secret creation for velero
([#1816](#1816))
([04a8ca0](04a8ca0))
* **base-cluster/monitoring:** alertmanager condition
([#1781](#1781))
([b6abed0](b6abed0))


### Miscellaneous Chores

* **base-cluster/dependencies:** update common docker tag to v1.6.0
([#1796](#1796))
([f1d8f05](f1d8f05))
* **base-cluster/dependencies:** update docker.io/curlimages/curl docker
tag to v8.17.0
([#1797](#1797))
([86362fe](86362fe))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.3
([#1798](#1798))
([f7b42d1](f7b42d1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.4
([#1818](#1818))
([6a318a1](6a318a1))
* **base-cluster/dependencies:** update docker.io/fluxcd/flux-cli docker
tag to v2.7.5
([#1823](#1823))
([bcd266e](bcd266e))
* **base-cluster/dependencies:** update
docker.io/grafana/grafana-image-renderer docker tag to v3.12.9
([#1639](#1639))
([e99101a](e99101a))
* **base-cluster/dependencies:** update helm release alloy to v1.2.1
([#1771](#1771))
([87df788](87df788))
* **base-cluster/dependencies:** update helm release alloy to v1.4.0
([#1799](#1799))
([9bc1aaa](9bc1aaa))
* **base-cluster/dependencies:** update helm release descheduler to
v0.34.0
([#1800](#1800))
([33f9a53](33f9a53))
* **base-cluster/dependencies:** update helm release external-dns to
v1.19.0
([#1801](#1801))
([c1f24a4](c1f24a4))
* **base-cluster/dependencies:** update helm release
kube-prometheus-stack to v75.15.2
([#1772](#1772))
([0cc66b2](0cc66b2))
* **base-cluster/dependencies:** update helm release
kube-prometheus-stack to v75.18.1
([#1802](#1802))
([b096b58](b096b58))
* **base-cluster/dependencies:** update helm release loki to v6.46.0
([#1727](#1727))
([ec1b906](ec1b906))
* **base-cluster/dependencies:** update helm release metrics-server to
v3.13.0
([#1805](#1805))
([6ba8633](6ba8633))
* **base-cluster/dependencies:** update helm release oauth2-proxy to
v7.14.2
([#1635](#1635))
([d88c7c0](d88c7c0))
* **base-cluster/dependencies:** update helm release oauth2-proxy to
v7.18.0
([#1806](#1806))
([636a585](636a585))
* **base-cluster/dependencies:** update helm release reflector to
v9.1.39
([#1790](#1790))
([5b032af](5b032af))
* **base-cluster/dependencies:** update helm release reflector to
v9.1.40
([#1819](#1819))
([da8be9d](da8be9d))
* **base-cluster/dependencies:** update helm release tempo-distributed
to v1.48.1
([#1791](#1791))
([d00ac00](d00ac00))
* **base-cluster/dependencies:** update helm release tempo-distributed
to v1.56.2
([#1807](#1807))
([80c67d1](80c67d1))
* **base-cluster/dependencies:** update helm release tetragon to v1.6.0
([#1808](#1808))
([c3fb92d](c3fb92d))
* **base-cluster/dependencies:** update helm release trivy-operator to
v0.31.0
([#1809](#1809))
([59976f6](59976f6))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions <41898282+github-actions[bot]@users.noreply.github.com>