Skip to content

feat(collector): add allow_startup_failure config option#1233

Merged
christos68k merged 7 commits into
open-telemetry:mainfrom
rushabh-exe:rushabh-exe/allow-startup-failure
Mar 31, 2026
Merged

feat(collector): add allow_startup_failure config option#1233
christos68k merged 7 commits into
open-telemetry:mainfrom
rushabh-exe:rushabh-exe/allow-startup-failure

Conversation

@rushabh-exe
Copy link
Copy Markdown
Contributor

Fixes #1214

Description

Adds a new allow_startup_failure boolean config option to the profiler
receiver. When set to true, errors during Start() are logged but not
returned to the OTel Collector — so it keeps running without profiling
instead of failing entirely.

The Problem

eBPF profiler startup can fail for reasons outside the user's control
(old kernel, missing permissions, containerized environments). Right now
that failure takes down the entire collector, including logs and metrics
which are usually more critical than profiling.

Solution

Added allow_startup_failure to the config. When true, Start() catches
the error, logs it, and returns nil so the collector continues normally.

Also extracted a small profilerController interface to decouple the
internal Controller from the concrete type — this allowed adding the first
unit tests for the collector/internal package.

Testing

Added three unit tests:

  • startup failure with flag true → returns nil, logs error
  • startup failure with flag false → propagates error
  • successful startup → always returns nil

@rushabh-exe rushabh-exe requested review from a team as code owners March 6, 2026 08:41
@linux-foundation-easycla
Copy link
Copy Markdown

linux-foundation-easycla Bot commented Mar 6, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@christos68k christos68k requested a review from dmathieu March 6, 2026 08:55
Comment thread collector/internal/controller.go Outdated
Comment thread collector/internal/controller.go Outdated
Comment thread collector/internal/controller.go Outdated
Comment thread collector/config/config.go Outdated
@dmathieu
Copy link
Copy Markdown
Member

dmathieu commented Mar 9, 2026

While this looks good to me, I'm not sure you should be doing profiling in the same collector as traces, metrics and profiles.
ebpf-profiler needs to run in an agent collector (daemonset), while the recommended pattern for all other signals is to have a cluster-wide collector (deployment).

Since this pattern isn't really what we've been advising for, I would hide this behavior behind a feature gate.

@christos68k
Copy link
Copy Markdown
Member

christos68k commented Mar 9, 2026

While this looks good to me, I'm not sure you should be doing profiling in the same collector as traces, metrics and profiles. ebpf-profiler needs to run in an agent collector (daemonset), while the recommended pattern for all other signals is to have a cluster-wide collector (deployment).

Since this pattern isn't really what we've been advising for, I would hide this behavior behind a feature gate.

Shouldn't feature gate be something that the downstream implementors decide about / put together depending on the requirements of the downstream collector? Essentially, this upstream PR makes this bit functionality available but is not opinionated in how it should be used.

@dmathieu
Copy link
Copy Markdown
Member

dmathieu commented Mar 9, 2026

Yes, but this assumes profiles can fail to start (and never start), which would be fine with purely a log entry. That's probably not the behavior you want when you're using a collector dedicated to doing profiling.
In that configuration, you likely want your collector to actually receive profiles, or alert you with an actual failure.

@dmathieu
Copy link
Copy Markdown
Member

dmathieu commented Mar 9, 2026

There is a precedent for feature gates that enable specific opinionated features.
See service.AllowNoPipelines for example: https://github.com/open-telemetry/opentelemetry-collector/blob/main/service/metadata.yaml#L192-L196

@rushabh-exe
Copy link
Copy Markdown
Contributor Author

Any update on the direction here?

@christos68k
Copy link
Copy Markdown
Member

christos68k commented Mar 16, 2026

Any update on the direction here?

Let's switch this to error_mode (see @rogercoll's comment here)

@rushabh-exe rushabh-exe force-pushed the rushabh-exe/allow-startup-failure branch from ac0804d to 3aa9282 Compare March 16, 2026 20:31
@rushabh-exe
Copy link
Copy Markdown
Contributor Author

Updated, switched to ErrorMode with ignore and propagate values following the OTTL pattern!

Comment thread collector/config/config.go
Comment thread collector/internal/controller.go
Comment thread collector/config/config.go
@dmathieu
Copy link
Copy Markdown
Member

Could you add tests?

@rushabh-exe
Copy link
Copy Markdown
Contributor Author

Sure, I can add tests. Should I cover just config validation (empty ErrorMode defaults to propagate, valid modes pass, invalid returns error) or also the controller Start() behaviour with error modes?

@dmathieu
Copy link
Copy Markdown
Member

Ideally, both.

@rushabh-exe
Copy link
Copy Markdown
Contributor Author

Added config validation tests. For controllerStart() tests I'd need to bring back the profilerController interface to mock startup failures, but @christos68k previously mentioned removing it since it was only used in one spot. Should I still add it back or is there another way you'd prefer?

@rushabh-exe rushabh-exe force-pushed the rushabh-exe/allow-startup-failure branch from e444720 to a71bc14 Compare March 23, 2026 18:13
@rushabh-exe
Copy link
Copy Markdown
Contributor Author

Moved the ErrorMode default to defaultConfig() in factory.go, that's where all the other defaults live. Empty error_mode is now a validation error instead of silently defaulting. Also added a Start() integration test using WithReporterFactory

@rushabh-exe rushabh-exe requested a review from christos68k March 23, 2026 18:28
Comment thread collector/start_test.go Outdated
Comment thread collector/start_test.go Outdated
Comment thread collector/start_test.go
Add allow_startup_failure bool to the collector config. When true,
startup errors from the profiler are logged but not returned to the
OTel Collector, so it keeps running without profiling.

Also extracted a profilerController interface to make the package testable
and added unit tests for the new behavior.

Closes open-telemetry#1214
Replace the bool allow_startup_failure config field with a string
ErrorMode type following the OTTL pattern. Supported values are
propagate (default) and ignore.
@rushabh-exe rushabh-exe force-pushed the rushabh-exe/allow-startup-failure branch from a71bc14 to e835a71 Compare March 27, 2026 18:05
@rushabh-exe rushabh-exe requested a review from christos68k March 27, 2026 18:18
@christos68k christos68k merged commit 68ccfcf into open-telemetry:main Mar 31, 2026
31 of 32 checks passed
gnurizen pushed a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request May 4, 2026
gnurizen pushed a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request May 5, 2026
gnurizen pushed a commit to parca-dev/opentelemetry-ebpf-profiler that referenced this pull request May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add configuration option to allow start-up failures

5 participants