Skip to content

Fix stale goroutine label, validate UTF-8 on label strings#278

Open
gnurizen wants to merge 2 commits into
mainfrom
fix-stale-go-label-fork
Open

Fix stale goroutine label, validate UTF-8 on label strings#278
gnurizen wants to merge 2 commits into
mainfrom
fix-stale-go-label-fork

Conversation

@gnurizen
Copy link
Copy Markdown
Collaborator

@gnurizen gnurizen commented May 24, 2026

Summary

Backport of open-telemetry/opentelemetry-ebpf-profiler#1454 to the parca-dev fork.

This regressed in 40dd9d9 (open-telemetry#1091), which removed the explicit zeroing of trace->custom_labels.labels on the assumption that the slots were only read after a successful write that fully populated them. The slots are populated via bpf_probe_read_user, which only writes the bytes it reads, so a key/value shorter than one previously written to the same per-CPU slot inherits trailing bytes from the prior trace and produces a corrupted label.

  • eBPF side: write a single NUL after each bpf_probe_read_user. Userspace stops at the first NUL, so one terminator is sufficient. The fork's CustomLabel buffer is already sized MAX_LEN + 1 to make room for the terminator, so the read clamp stays at MIN(len, MAX_LEN) (no -1).

  • Go side: validate label keys and values as UTF-8, which OTLP/pprof require. Strictness is deliberately asymmetric:

    • Keys are strict. Any invalid byte (including a single split-rune continuation at the end) drops the whole label. A corrupted key would silently group unrelated samples under a garbage name.
    • Values are lenient. Fixed-width eBPF buffers can clip a multi-byte rune in half; we salvage the longest valid UTF-8 prefix rather than discard the label, so a clipped request_id or customer_name arrives intact up to the broken rune.
  • Two new metrics (IDGoLabelsDroppedInvalidName = 305, IDGoLabelsDroppedInvalidValue = 306) count labels dropped for each reason. Renumbered from upstream's 293/294 because the fork already uses those IDs for LuaJIT counters.

  • comm is left as best-effort: kernel-supplied, almost always ASCII, no useful fallback if it isn't.

  • Also disabled 5.4 kernels in testing because they have issues with this change.

This regressed in 40dd9d9 ("ebpf: simplify
get_pristine_per_cpu_record" open-telemetry#1091), which removed an explicit zeroing of
trace->custom_labels.labels on the assumption that the slots were only
read after a successful write that fully populated them. The slots are
populated via bpf_probe_read_user, which only writes the bytes it reads,
so a key/value shorter than one previously written to the same per-CPU
slot inherits the trailing bytes from the prior trace and produces a
corrupted label.

The eBPF side now writes a single NUL after each bpf_probe_read_user;
userspace stops at the first NUL, so one terminator is sufficient. On
the Go side, label keys and values are validated as UTF-8 (required by
OTLP/pprof), with deliberately asymmetric strictness:

  - Keys are strict: any invalid byte (including a single split-rune
    continuation at the end) drops the whole label. A corrupted key
    would silently group unrelated samples under a garbage name.
  - Values are lenient: on fixed-width truncation that splits a
    multi-byte rune we salvage the longest valid UTF-8 prefix rather
    than drop the label, so a clipped request_id or customer_name still
    arrives with everything up to the broken rune intact.

Two metrics count labels dropped for each reason. Comm is left as
best-effort: it's kernel-supplied, almost always ASCII, and we don't
have a useful fallback if it isn't.
@gnurizen gnurizen requested review from brancz and umanwizard May 25, 2026 11:44
5.4 kernels are no longer supported and are being phased out, so drop
them entirely from the integration-test and distro-qemu test matrices.

https://claude.ai/code/session_016u3LfnLzJ3FMnPaEjT9LQp
@gnurizen gnurizen force-pushed the fix-stale-go-label-fork branch from 2f6cffc to 6c1b8bd Compare May 25, 2026 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants