[otelbeat] fix status reporting when running beatreceivers by VihasMakwana · Pull Request #44528 · elastic/beats

VihasMakwana · 2025-05-28T10:29:51Z

Proposed commit message

If a metricset/filebeat runner is degraded, it reports its status to its parent module by calling UpdateStatus. If there are multiple runners under a module, each call overwrites the previous one, so the module's status keeps flipping between states.
To fix this, create a helper struct that stores the last reported status for each runner and add a new method that calculates the module's status from its child runners.

This PR also adds a wrapper that converts beats status to collector status.
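As a rough illustration of what such a conversion wrapper might look like, here is a minimal sketch. The `BeatStatus` and `CollectorStatus` types are local stand-ins, not the real Beats or collector types. Only the Degraded to StatusRecoverableError mapping is visible in the output of this PR; every other case is an assumption made for the example.

```go
package main

import "fmt"

// BeatStatus is a hypothetical stand-in for the Beats status enum.
type BeatStatus int

const (
	Unknown BeatStatus = iota
	Starting
	Running
	Degraded
	Failed
	Stopping
	Stopped
)

// CollectorStatus is a hypothetical stand-in for the collector's component status.
type CollectorStatus string

const (
	StatusNone             CollectorStatus = "StatusNone"
	StatusStarting         CollectorStatus = "StatusStarting"
	StatusOK               CollectorStatus = "StatusOK"
	StatusRecoverableError CollectorStatus = "StatusRecoverableError"
	StatusPermanentError   CollectorStatus = "StatusPermanentError"
	StatusStopping         CollectorStatus = "StatusStopping"
	StatusStopped          CollectorStatus = "StatusStopped"
)

// toCollectorStatus maps a Beats status to a collector status.
// Degraded -> StatusRecoverableError matches the output shown in this PR;
// the remaining cases are assumptions for illustration only.
func toCollectorStatus(s BeatStatus) CollectorStatus {
	switch s {
	case Starting:
		return StatusStarting
	case Running:
		return StatusOK
	case Degraded:
		return StatusRecoverableError
	case Failed:
		return StatusPermanentError
	case Stopping:
		return StatusStopping
	case Stopped:
		return StatusStopped
	default:
		return StatusNone
	}
}

func main() {
	fmt.Println(toCollectorStatus(Degraded)) // prints "StatusRecoverableError"
}
```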

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Related Beat receivers do not correctly report status back to the Elastic Agent elastic-agent#8210

Screenshots

Output

Here's output of running two streams (degraded) together:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   ├─ pipeline:logs/_agent-component/filestream-default
   │  ├─ status: StatusRecoverableError [error while running harvester: cannot read from file source: /var/log/elasticAgent-install-20240625_133733.log]
   │  ├─ exporter:elasticsearch/_agent-component/default
   │  │  └─ status: StatusOK
   │  └─ receiver:filebeatreceiver/_agent-component/filestream-default
   │     └─ status: StatusRecoverableError [error while running harvester: cannot read from file source: /var/log/elasticAgent-install-20240625_133733.log]
   └─ pipeline:logs/_agent-component/system/metrics-default
      ├─ status: StatusRecoverableError [Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 607 processes, most likely a "permission denied" error. Enable debug logging to determine the exact cause.]
      ├─ exporter:elasticsearch/_agent-component/default
      │  └─ status: StatusOK
      └─ receiver:metricbeatreceiver/_agent-component/system/metrics-default
         └─ status: StatusRecoverableError [Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 607 processes, most likely a "permission denied" error. Enable debug logging to determine the exact cause.]

Testing

Check out this PR locally
Go to elastic-agent and follow this guide to test local beats changes
Package agent with mage package
Follow steps on Beat receivers do not correctly report status back to the Elastic Agent elastic-agent#8210 to install agent and verify the status

Closes elastic/elastic-agent#8210

github-actions · 2025-05-28T10:30:04Z

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2025-05-28T10:30:31Z

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @VihasMakwana? 🙏.
For such, you'll need to label your PR with:

The upcoming major version of the Elastic Stack
The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

mergify · 2025-05-29T08:49:42Z

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-status-reporting-metricsets upstream/fix-status-reporting-metricsets
git merge upstream/main
git push upstream fix-status-reporting-metricsets

elasticmachine · 2025-05-29T10:57:19Z

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

VihasMakwana · 2025-06-06T06:23:30Z

@swiatekm @mauri870 this should be good now. I've added testing. Please take a look when you're around 😄

mauri870

Thanks!

cmacknz · 2025-06-06T17:28:10Z

I don't see a test in here proving that Beat receivers now report status correctly, please add one.

cmacknz · 2025-06-09T20:06:26Z

+	r.parent.UpdateStatus(calcState, calcMsg)
+}
+
+func (r *reporter) calculateState() (Status, string) {


Now that you've pointed me at the pre-existing implementation for inputs+streams in

beats/x-pack/libbeat/management/unit.go

Line 208 in 4309305

func (u *agentUnit) calcState() (status.Status, string) {

Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?

The manager already has an UpdateStatus method meant to report status, and there is now an OtelManager that can do OTel specific things, why do we need to add group status reporters in metricbeat and filebeat without going through the existing manager interface?

The previous one is tied to the concept of units, which we don't have anymore?

It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.

In the previous control protocol implementation, the status reporter was injected when the manager reloaded inputs:

beats/x-pack/libbeat/management/managerV2.go

Line 817 in 4309305

in.StatusReporter = unit.GetReporterForStreamByIndex(idx)

This also defined the reporting resolution of status reporters, there was one per unit which mapped to one per input.

In the otel world there would naturally be one status reporter per receiver, and we would ideally have one per beat receiver input one day.

Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls but I am suspecting the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go) which are not beat receiver specific.

> Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?

I'll give you an example why we need the resolution logic. Consider the following config

inputs:
  - module: system
    metricsets:
      - process
  - module: system
    metricsets:
      - cpu
  - module: system
    metricsets:
      - memory

In metricbeat, we create one Runner per module. The status reporter for runner is set here

beats/metricbeat/mb/module/runner_group.go

Lines 44 to 50 in ac45d48

func (rg *runnerGroup) SetStatusReporter(reporter status.StatusReporter) {
	for _, runner := range rg.runners {
		if runnerWithStatus, ok := runner.(status.WithStatusReporter); ok {
			runnerWithStatus.SetStatusReporter(reporter)
		}
	}
}

and it accepts any implementation we pass. We just need to adhere to the status.StatusReporter interface

beats/libbeat/management/status/status.go

Lines 46 to 49 in ac45d48

type StatusReporter interface {
	// UpdateStatus updates the status of the unit.
	UpdateStatus(status Status, msg string)
}

If we set OTel's status reporter directly, without any group reporter, we face the following problem:

Conflicting statuses:

When multiple modules are running, each one can independently update the shared reporter's status. This can lead to misleading results. For example:

If one module with the process metricset is DEGRADED, it updates the reporter's status, which is expected.

But if another module is HEALTHY, it also updates the reporter's status to HEALTHY, overwriting the previous DEGRADED state.

To avoid this race condition, we need a centralised place to aggregate the statuses of each module and calculate the status of the entire beat.
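To make the aggregation idea concrete, here is a minimal sketch, not the code in this PR. The `groupReporter` type, its `update` method, and the reduced `Status` enum are hypothetical names for illustration. It assumes the simplest rule, "the worst child status wins".

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a reduced, hypothetical status enum ordered from best to worst.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// StatusReporter mirrors the shape of the interface quoted above.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

type runnerState struct {
	status Status
	msg    string
}

// groupReporter remembers the last status of every runner and reports
// only the aggregate to its parent. All names here are illustrative.
type groupReporter struct {
	mu     sync.Mutex
	parent StatusReporter
	states map[string]runnerState
}

func newGroupReporter(parent StatusReporter) *groupReporter {
	return &groupReporter{parent: parent, states: make(map[string]runnerState)}
}

// update records one runner's status, then reports the aggregate upward.
func (g *groupReporter) update(runnerID string, s Status, msg string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.states[runnerID] = runnerState{status: s, msg: msg}
	agg, aggMsg := g.calculateState()
	g.parent.UpdateStatus(agg, aggMsg)
}

// calculateState returns the worst recorded status. The caller must hold g.mu.
func (g *groupReporter) calculateState() (Status, string) {
	worst, msg := Running, ""
	for _, st := range g.states {
		if st.status > worst {
			worst, msg = st.status, st.msg
		}
	}
	return worst, msg
}

// lastStatus is a test parent that keeps only the most recent update.
type lastStatus struct {
	status Status
	msg    string
}

func (l *lastStatus) UpdateStatus(s Status, msg string) { l.status, l.msg = s, msg }

func main() {
	parent := &lastStatus{}
	g := newGroupReporter(parent)
	g.update("process", Degraded, "permission denied")
	g.update("cpu", Running, "") // a healthy runner no longer hides the degraded one
	fmt.Println(parent.status == Degraded, parent.msg) // prints "true permission denied"
}
```

With this shape the parent only returns to a healthy state once the degraded runner itself reports a healthy status.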

> Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls but I am suspecting the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go) which are not beat receiver specific.

That should be the case in an ideal world, but receivers can report status via component.Host, which is only accessible during receiver startup. Unfortunately, there's currently no way to configure a status reporter at the time of receiver creation 😢

I mentioned this when I originally opened the PR, but I guess it got lost in the history.
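The ordering constraint can be sketched as follows. This uses local stand-in types rather than the collector's real API: `Host`, `hostReporter`, `receiver`, and `setStatus` are hypothetical names, and the only point is that nothing can be reported before `Start` hands over the host.

```go
package main

import "fmt"

// Host is a hypothetical stand-in for the host a receiver is given on start.
type Host interface {
	Report(status string)
}

// StatusReporter is a simplified reporter interface for this sketch.
type StatusReporter interface {
	UpdateStatus(status string)
}

// hostReporter forwards status updates to the host.
type hostReporter struct{ host Host }

func (h hostReporter) UpdateStatus(status string) { h.host.Report(status) }

// receiver has no host, and therefore no reporter, at construction time.
type receiver struct {
	reporter StatusReporter // nil until Start is called
}

func newReceiver() *receiver { return &receiver{} }

// Start is the first point where the host is available, so the reporter
// can only be wired up here.
func (r *receiver) Start(host Host) {
	r.reporter = hostReporter{host: host}
}

// setStatus forwards the status and reports whether it was delivered.
func (r *receiver) setStatus(s string) bool {
	if r.reporter == nil {
		return false // no host yet, nothing to report to
	}
	r.reporter.UpdateStatus(s)
	return true
}

// printHost records the last status it received.
type printHost struct{ last string }

func (p *printHost) Report(s string) { p.last = s }

func main() {
	h := &printHost{}
	r := newReceiver()
	before := r.setStatus("degraded") // dropped: Start has not run
	r.Start(h)
	after := r.setStatus("degraded")
	fmt.Println(before, after, h.last) // prints "false true degraded"
}
```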

> It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.

This is on my TODO list. I'll work on it as a follow-up.

Thanks, going to summarize what I mentioned in the project meeting today to make sure it is captured here:

We need to make sure status reporting works for all of the Beats (heartbeat, osquerybeat, etc) and ideally we want to be able to do this without going into each individual Beat's concept of a module or an input. If we can find a way to centralize the injection of the group status reporter in libbeat our life is easier. In the control protocol, the Beat starts up with no inputs configured and then on reload the status reporters are injected.

beats/x-pack/libbeat/management/managerV2.go

Line 817 in 4309305

in.StatusReporter = unit.GetReporterForStreamByIndex(idx)

We need to make the end state grouping clearer and make sure the way we've wired this up can preserve it. Today each individual input in an elastic-agent.yml has an independent status, and streams under that input are rolled up into that of the parent input. We need to preserve this with Beats receivers, but it is not entirely the Beats' responsibility to do this. It also might not be possible with the way otel status reporting works today; we'd ideally want a concept of a sub-component (so that multiple inputs in a single receiver can have independent state).

To use a configuration example let's imagine someone split the cpu, memory, network, and filesystem system metricsets across two inputs:

inputs:
  # This system/metrics module reports state independently of any other in the same configuration.
  - type: system/metrics
    id: cpu-memory
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of overall parent module.
      - metricsets:
          - cpu
        data_stream.dataset: system.cpu
      - metricsets:
          - memory
        data_stream.dataset: system.memory
  # This system/metrics module reports state independently of the one above.
  - type: system/metrics
    id: network-filesystem
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of overall parent module.
      - metricsets:
          - network
        data_stream.dataset: system.network
      - metricsets:
          - filesystem
        data_stream.dataset: system.filesystem

I think to get this to work correctly Elastic Agent would have to orchestrate both of those two inputs into separate system/metrics receivers (CC @leehinman and @swiatekm keep me honest) without doing anything else. Otherwise, we may need to do something outside of straight collector component status reporting to preserve the way this works today.

If we one day gained the ability to report sub-component status they wouldn't need to be separate receivers.
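The grouping described here, where streams roll up into their parent input while inputs stay independent, could be expressed roughly like this. The `rollUp` function and the reduced `Status` enum are hypothetical; the input IDs and dataset names are taken from the configuration example above.

```go
package main

import "fmt"

// Status is a reduced, hypothetical status enum ordered from best to worst.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// rollUp returns, for each input ID, the worst status among its streams.
// Inputs never influence each other.
func rollUp(streams map[string]map[string]Status) map[string]Status {
	out := make(map[string]Status, len(streams))
	for inputID, byStream := range streams {
		worst := Running
		for _, s := range byStream {
			if s > worst {
				worst = s
			}
		}
		out[inputID] = worst
	}
	return out
}

func main() {
	got := rollUp(map[string]map[string]Status{
		"cpu-memory":         {"system.cpu": Running, "system.memory": Degraded},
		"network-filesystem": {"system.network": Running, "system.filesystem": Running},
	})
	// The degraded memory stream affects only its own input.
	fmt.Println(got["cpu-memory"] == Degraded, got["network-filesystem"] == Running) // prints "true true"
}
```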

VihasMakwana · 2025-06-10T09:07:48Z

+func NewGroupStatusReporter(parent status.StatusReporter) RunnerReporter {
+	// If the parent is a "fallbackManager", we're operating in standard standalone mode,
+	// so setting a group reporter isn't necessary.
+	if _, ok := parent.(*fallbackManager); ok || parent == nil {


@cmacknz I've made a slight adjustment here.

When we run standalone beats, we create a fallbackManager. Now, we return a no-op status reporter in that case.
This prevents mixing the two use cases.
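A simplified sketch of that guard is shown below. It uses local stand-in types: `noopReporter`, `forwardingReporter`, and `recorder` are hypothetical, the constructor returns a plain `StatusReporter` rather than the `RunnerReporter` in the diff, and the forwarding branch stands in for the real group reporter.

```go
package main

import "fmt"

// Status is a reduced, hypothetical status enum.
type Status int

const (
	Running Status = iota
	Degraded
)

// StatusReporter mirrors the shape of the interface used in this PR.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// fallbackManager stands in for the manager used by standalone Beats.
type fallbackManager struct{}

func (*fallbackManager) UpdateStatus(Status, string) {}

// noopReporter silently discards every update.
type noopReporter struct{}

func (noopReporter) UpdateStatus(Status, string) {}

// forwardingReporter is a placeholder for the real group reporter.
type forwardingReporter struct{ parent StatusReporter }

func (f forwardingReporter) UpdateStatus(s Status, msg string) { f.parent.UpdateStatus(s, msg) }

// newGroupStatusReporter returns a no-op reporter in standalone mode
// (nil parent or fallbackManager) and a forwarding reporter otherwise.
func newGroupStatusReporter(parent StatusReporter) StatusReporter {
	if parent == nil {
		return noopReporter{}
	}
	if _, ok := parent.(*fallbackManager); ok {
		return noopReporter{}
	}
	return forwardingReporter{parent: parent}
}

// recorder counts the updates it receives.
type recorder struct{ calls int }

func (r *recorder) UpdateStatus(Status, string) { r.calls++ }

func main() {
	_, standalone := newGroupStatusReporter(&fallbackManager{}).(noopReporter)
	rec := &recorder{}
	newGroupStatusReporter(rec).UpdateStatus(Degraded, "example")
	fmt.Println(standalone, rec.calls) // prints "true 1"
}
```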

VihasMakwana · 2025-06-13T08:25:03Z

Closing in favour of #44782

fix: fix status reporting for metricsets

c6bd143

VihasMakwana self-assigned this May 28, 2025

VihasMakwana added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 28, 2025

botelastic Bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 28, 2025

VihasMakwana added 3 commits May 28, 2025 16:01

remove if

a4ead4d

fix map initialization

5b5a218

revert wrapper changes

c7ab288

VihasMakwana force-pushed the fix-status-reporting-metricsets branch from 2060f71 to c7ab288 Compare May 29, 2025 05:20

VihasMakwana and others added 3 commits May 29, 2025 11:04

fix status reporting

8b17fb0

Merge branch 'main' into fix-status-reporting-metricsets

1f18cd6

notice

48b8e12

VihasMakwana added the backport-9.0 Automated backport to the 9.0 branch label May 29, 2025

nit

47c529b

VihasMakwana marked this pull request as ready for review May 29, 2025 10:57

VihasMakwana requested review from a team as code owners May 29, 2025 10:57

VihasMakwana requested review from belimawr and mauri870 May 29, 2025 10:57

Merge branch 'main' into fix-status-reporting-metricsets

025f2c7

VihasMakwana marked this pull request as draft May 29, 2025 11:04

VihasMakwana mentioned this pull request May 29, 2025

Beat receivers do not correctly report status back to the Elastic Agent elastic/elastic-agent#8210

Closed

VihasMakwana added 2 commits May 29, 2025 19:21

add it for filebeat

ba19557

use hash instead of runner.String()

308ccaf

VihasMakwana changed the title ~~[metricbeat] fix status reporting for metricsets~~ [otelbeat] fix status reporting when running beatreceivers May 29, 2025

VihasMakwana requested a review from mauri870 June 6, 2025 06:20

Merge branch 'main' into fix-status-reporting-metricsets

5d7f818

linter happy

31b99f3

swiatekm approved these changes Jun 6, 2025

mauri870 approved these changes Jun 6, 2025

cmacknz reviewed Jun 6, 2025

Comment thread libbeat/management/status/group.go Outdated

cmacknz reviewed Jun 6, 2025

Comment thread libbeat/management/status/group.go

cmacknz reviewed Jun 6, 2025

Comment thread libbeat/management/status/group.go Outdated

cmacknz reviewed Jun 6, 2025

Comment thread metricbeat/beater/metricbeat.go

add test case

a93f275

VihasMakwana commented Jun 6, 2025

Comment thread libbeat/otelbeat/oteltest/oteltest.go

comments

dfd1d24

VihasMakwana requested a review from cmacknz June 6, 2025 19:46

Merge branch 'main' into fix-status-reporting-metricsets

e2649a4

cmacknz reviewed Jun 9, 2025

Comment thread libbeat/management/status/group_test.go Outdated

VihasMakwana added 2 commits June 10, 2025 13:48

test improvements

8e33dfb

Merge branch 'fix-status-reporting-metricsets' of github.com:VihasMak…

d64de30

…wana/beats into fix-status-reporting-metricsets

VihasMakwana force-pushed the fix-status-reporting-metricsets branch from aa89373 to d64de30 Compare June 10, 2025 08:36

VihasMakwana added 2 commits June 10, 2025 14:30

minor change

77a7911

comment

c274941

VihasMakwana commented Jun 10, 2025

gci

3f8c1ef

VihasMakwana closed this Jun 13, 2025

VihasMakwana mentioned this pull request Jun 16, 2025

[beatreceiver] - Add status reporting #44782

Merged

6 tasks

This was referenced Jun 26, 2025

[8.19](backport #44782) [beatreceiver] - Add status reporting #45045

Merged

[9.0](backport #44782) [beatreceiver] - Add status reporting #45046

Merged
