Fix stale goroutine label, validate UTF-8 on label strings #1454

Open
gnurizen wants to merge 1 commit into open-telemetry:main from parca-dev:fix-stale-go-label
gnurizen commented May 22, 2026

Summary

This regressed in 40dd9d9 ("ebpf: simplify get_pristine_per_cpu_record" #1091), which removed the explicit zeroing of trace->custom_labels.labels on the assumption that the slots were only read after a successful write that fully populated them. The slots are populated via bpf_probe_read_user, which only writes the bytes it reads, so a key/value shorter than one previously written to the same per-CPU slot inherits trailing bytes from the prior trace and produces a corrupted label.

  • eBPF side: write a single NUL after each bpf_probe_read_user.
  • Go side: validate label keys and values as UTF-8, which OTLP/pprof require. Strictness is deliberately asymmetric:
    • Keys are strict. Any invalid byte (including a single split-rune continuation at the end) drops the whole label. A corrupted key would silently group unrelated samples under a garbage name, which is worse than dropping.
    • Values are lenient. Fixed-width eBPF buffers can clip a multi-byte rune in half at the buffer boundary; we salvage the longest valid UTF-8 prefix rather than discard the label, so a clipped request_id or customer_name still arrives with everything up to the broken rune intact.
  • Two new metrics (IDGoLabelsDroppedInvalidName, IDGoLabelsDroppedInvalidValue) count labels dropped for each reason.
  • comm is left as best-effort: kernel-supplied, almost always ASCII, no useful fallback if it isn't.
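The asymmetric validation described above can be sketched roughly as follows. This is an illustration rather than the code in this PR: `validKey` and `salvageValue` are hypothetical names, and the sketch leaves open what happens to a value whose salvaged prefix is empty (the PR's value-drop metric suggests some values are still dropped, but the exact rule is not stated here).

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validKey is strict: a single invalid byte anywhere rejects the key,
// so the caller drops the whole label.
// (Hypothetical helper name, not taken from the PR.)
func validKey(key string) bool {
	return utf8.ValidString(key)
}

// salvageValue is lenient: it returns the longest prefix of value that is
// valid UTF-8, cutting at the first invalid byte.
// (Hypothetical helper name, not taken from the PR.)
func salvageValue(value string) string {
	for i := 0; i < len(value); {
		r, size := utf8.DecodeRuneInString(value[i:])
		// RuneError with size 1 signals an invalid encoding; a genuine
		// U+FFFD in the input decodes with size 3 and is kept.
		if r == utf8.RuneError && size == 1 {
			return value[:i]
		}
		i += size
	}
	return value
}

func main() {
	clipped := "caf\xc3" // "café" cut in the middle of the two-byte 'é'
	fmt.Println(validKey(clipped))            // false
	fmt.Printf("%q\n", salvageValue(clipped)) // "caf"
}
```

Decoding rune by rune keeps the salvage to a single pass and handles invalid bytes in the middle of a value as well as a clipped rune at the end.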

Supersedes #1453.

gnurizen force-pushed the fix-stale-go-label branch 2 times, most recently from 19372b1 to 108290a, May 22, 2026 12:38

This regressed in 40dd9d9 ("ebpf: simplify
get_pristine_per_cpu_record" open-telemetry#1091), which removed an explicit zeroing of
trace->custom_labels.labels on the assumption that the slots were only
read after a successful write that fully populated them. The slots are
populated via bpf_probe_read_user, which only writes the bytes it reads,
so a key/value shorter than one previously written to the same per-CPU
slot inherits the trailing bytes from the prior trace and produces a
corrupted label.

The eBPF side now writes a single NUL after each bpf_probe_read_user;
userspace stops at the first NUL, so one terminator is sufficient. On
the Go side, label keys and values are validated as UTF-8 (required by
OTLP/pprof), with deliberately asymmetric strictness:

  - Keys are strict: any invalid byte (including a single split-rune
    continuation at the end) drops the whole label. A corrupted key
    would silently group unrelated samples under a garbage name.
  - Values are lenient: on fixed-width truncation that splits a
    multi-byte rune we salvage the longest valid UTF-8 prefix rather
    than drop the label, so a clipped request_id or customer_name still
    arrives with everything up to the broken rune intact.

Two metrics count labels dropped for each reason. Comm is left as
best-effort: it's kernel-supplied, almost always ASCII, and we don't
have a useful fallback if it isn't.