[otelbeat] fix status reporting when running beatreceivers#44528

Closed
VihasMakwana wants to merge 39 commits into
elastic:mainfrom
VihasMakwana:fix-status-reporting-metricsets

@VihasMakwana
Contributor

@VihasMakwana VihasMakwana commented May 28, 2025

Proposed commit message

If a metricset/filebeat runner is degraded, it reports its status to its parent module by calling UpdateStatus. If there are multiple runners under a module, the module's status keeps flipping as each runner overwrites the previous report.
To fix this, create a helper struct that stores the last reported status for each runner and add a new method that calculates the module's status from its child runners.

This PR also adds a wrapper that converts beats status to collector status.
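
As a rough illustration of what such a wrapper could look like, here is a minimal sketch. All types below are local stand-ins, not the real Beats or collector types. The Degraded to StatusRecoverableError mapping follows the status output shown in this description; the Failed mapping and the `collectorSink` callback are assumptions made only for this example.

```go
package main

import "fmt"

// BeatStatus is a reduced stand-in for the Beats status enum (hypothetical).
type BeatStatus int

const (
	Running BeatStatus = iota
	Degraded
	Failed
)

// CollectorStatus is a stand-in for the collector's component status.
// The names follow the values printed in the status output.
type CollectorStatus string

const (
	StatusOK               CollectorStatus = "StatusOK"
	StatusRecoverableError CollectorStatus = "StatusRecoverableError"
	StatusPermanentError   CollectorStatus = "StatusPermanentError"
)

// toCollector translates a Beats status into a collector status.
// Degraded -> StatusRecoverableError matches the output in this PR;
// the Failed mapping is an assumption made for this sketch.
func toCollector(s BeatStatus) CollectorStatus {
	switch s {
	case Degraded:
		return StatusRecoverableError
	case Failed:
		return StatusPermanentError
	default:
		return StatusOK
	}
}

// collectorSink is a hypothetical callback standing in for whatever
// actually reports status to the collector.
type collectorSink func(status CollectorStatus, msg string)

// UpdateStatus lets a collectorSink be used wherever Beats expects a
// reporter with an UpdateStatus(status, msg) method.
func (sink collectorSink) UpdateStatus(s BeatStatus, msg string) {
	sink(toCollector(s), msg)
}

func main() {
	report := collectorSink(func(status CollectorStatus, msg string) {
		fmt.Printf("%s [%s]\n", status, msg)
	})
	report.UpdateStatus(Degraded, "error while running harvester")
}
```

The point of the adapter shape is that Beats code keeps calling UpdateStatus and never needs to know about collector status values.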

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Screenshots

Screenshot 2025-05-29 at 8 14 07 PM

Output

Here's the output of running two degraded streams together:

┌─ fleet
│  └─ status: (STOPPED) Not enrolled into Fleet
└─ elastic-agent
   ├─ status: (DEGRADED) 1 or more components/units in a degraded state
   ├─ pipeline:logs/_agent-component/filestream-default
   │  ├─ status: StatusRecoverableError [error while running harvester: cannot read from file source: /var/log/elasticAgent-install-20240625_133733.log]
   │  ├─ exporter:elasticsearch/_agent-component/default
   │  │  └─ status: StatusOK
   │  └─ receiver:filebeatreceiver/_agent-component/filestream-default
   │     └─ status: StatusRecoverableError [error while running harvester: cannot read from file source: /var/log/elasticAgent-install-20240625_133733.log]
   └─ pipeline:logs/_agent-component/system/metrics-default
      ├─ status: StatusRecoverableError [Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 607 processes, most likely a "permission denied" error. Enable debug logging to determine the exact cause.]
      ├─ exporter:elasticsearch/_agent-component/default
      │  └─ status: StatusOK
      └─ receiver:metricbeatreceiver/_agent-component/system/metrics-default
         └─ status: StatusRecoverableError [Error fetching data for metricset system.process: error fetching process list: non fatal error; reporting partial metrics: error fetching PID metrics for 607 processes, most likely a "permission denied" error. Enable debug logging to determine the exact cause.]

Testing

  1. Check out this PR locally
  2. Go to elastic-agent and follow this guide to test local beats changes
  3. Package the agent with mage package
  4. Follow the steps in elastic-agent#8210 (Beat receivers do not correctly report status back to the Elastic Agent) to install the agent and verify the status

Closes elastic/elastic-agent#8210

@VihasMakwana VihasMakwana self-assigned this May 28, 2025
@VihasMakwana VihasMakwana added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 28, 2025
@botelastic botelastic Bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 28, 2025
@github-actions
Contributor

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify
Contributor

mergify Bot commented May 28, 2025

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @VihasMakwana? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@VihasMakwana VihasMakwana force-pushed the fix-status-reporting-metricsets branch from 2060f71 to c7ab288 Compare May 29, 2025 05:20
@VihasMakwana VihasMakwana added the backport-9.0 Automated backport to the 9.0 branch label May 29, 2025
@mergify
Contributor

mergify Bot commented May 29, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix-status-reporting-metricsets upstream/fix-status-reporting-metricsets
git merge upstream/main
git push upstream fix-status-reporting-metricsets

@VihasMakwana VihasMakwana marked this pull request as ready for review May 29, 2025 10:57
@VihasMakwana VihasMakwana requested review from a team as code owners May 29, 2025 10:57
@VihasMakwana VihasMakwana requested review from belimawr and mauri870 May 29, 2025 10:57
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@VihasMakwana VihasMakwana changed the title from "[metricbeat] fix status reporting for metricsets" to "[otelbeat] fix status reporting when running beatreceivers" May 29, 2025
@VihasMakwana VihasMakwana requested a review from mauri870 June 6, 2025 06:20
@VihasMakwana
Contributor Author

@swiatekm @mauri870 this should be good now. I've added testing. Please take a look when you're around 😄

Member

@mauri870 mauri870 left a comment

Thanks!

Comment thread libbeat/management/status/group.go Outdated
Comment thread libbeat/management/status/group.go
Comment thread libbeat/management/status/group.go Outdated
Comment thread metricbeat/beater/metricbeat.go
@cmacknz
Member

cmacknz commented Jun 6, 2025

I don't see a test in here proving that Beat receivers now report status correctly, please add one.

Comment thread libbeat/otelbeat/oteltest/oteltest.go
@VihasMakwana VihasMakwana requested a review from cmacknz June 6, 2025 19:46
Comment thread libbeat/management/status/group.go Outdated
r.parent.UpdateStatus(calcState, calcMsg)
}

func (r *reporter) calculateState() (Status, string) {
Member

Now that you've pointed me at the pre-existing implementation for inputs+streams in

func (u *agentUnit) calcState() (status.Status, string) {

Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?

The manager already has an UpdateStatus method meant to report status, and there is now an OtelManager that can do OTel-specific things. Why do we need to add group status reporters in metricbeat and filebeat without going through the existing manager interface?

Member

The previous one is tied to the concept of units, which we don't have anymore?

It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.

Member

In the previous control protocol implementation, the status reporter was injected when the manager reloaded inputs:

in.StatusReporter = unit.GetReporterForStreamByIndex(idx)

This also defined the reporting resolution of status reporters: there was one per unit, which mapped to one per input.

In the otel world there would naturally be one status reporter per receiver, and we would ideally have one per beat receiver input one day.

Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls, but I suspect the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go), which are not beat receiver specific.

Contributor Author

@VihasMakwana VihasMakwana Jun 10, 2025

> Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?

I'll give you an example of why we need the resolution logic. Consider the following config:

inputs:
- module: system
  metricsets:
    - process
- module: system
  metricsets:
    - cpu
- module: system
  metricsets:
    - memory

In metricbeat, we create one Runner per module. The status reporter for each runner is set here:

func (rg *runnerGroup) SetStatusReporter(reporter status.StatusReporter) {
	for _, runner := range rg.runners {
		if runnerWithStatus, ok := runner.(status.WithStatusReporter); ok {
			runnerWithStatus.SetStatusReporter(reporter)
		}
	}
}

and it accepts any implementation we pass. We just need to adhere to the status.StatusReporter interface:

type StatusReporter interface {
	// UpdateStatus updates the status of the unit.
	UpdateStatus(status Status, msg string)
}

If we set the OTel status reporter directly (without any group reporter), we face the following problems:

  1. Conflicting statuses:
    • When multiple modules are running, each one can independently update the shared reporter's status. This can lead to misleading results. For example:
      • If one module with the process metricset is DEGRADED, it updates the reporter's status, which is as expected.
      • But if another module is HEALTHY, it also updates the reporter's status to HEALTHY, overwriting the previous DEGRADED state.
  2. To avoid this race condition, we need a centralised place to aggregate the statuses of each module and calculate the status of the entire beat.
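
To make the aggregation idea concrete, here is a minimal sketch of a group reporter. It is not the code from group.go: `Status`, `groupReporter`, `forChild`, and `recorder` are simplified, hypothetical names, and "worst status wins" is an assumed aggregation rule chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Status is a simplified stand-in for the Beats status type (hypothetical).
// Higher values are treated as more severe in this sketch.
type Status int

const (
	Running Status = iota
	Degraded
	Failed
)

// StatusReporter mirrors the single-method interface quoted above.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

type childState struct {
	status Status
	msg    string
}

// groupReporter remembers the last status of every child runner and
// forwards only the aggregated status to the parent.
type groupReporter struct {
	mu     sync.Mutex
	parent StatusReporter
	states map[string]childState
}

func newGroupReporter(parent StatusReporter) *groupReporter {
	return &groupReporter{parent: parent, states: map[string]childState{}}
}

// forChild returns a reporter scoped to a single runner.
func (g *groupReporter) forChild(id string) StatusReporter {
	return childReporter{group: g, id: id}
}

type childReporter struct {
	group *groupReporter
	id    string
}

// UpdateStatus records this child's status, then reports the aggregate.
func (c childReporter) UpdateStatus(s Status, msg string) {
	g := c.group
	g.mu.Lock()
	defer g.mu.Unlock()
	g.states[c.id] = childState{status: s, msg: msg}
	agg, aggMsg := g.calculate()
	g.parent.UpdateStatus(agg, aggMsg)
}

// calculate returns the most severe child status. The caller holds g.mu.
// If several children share the worst status, which message wins is
// unspecified because map iteration order is random.
func (g *groupReporter) calculate() (Status, string) {
	worst, msg := Running, ""
	for _, st := range g.states {
		if st.status > worst {
			worst, msg = st.status, st.msg
		}
	}
	return worst, msg
}

// recorder is a test double that keeps the last status it received.
type recorder struct {
	last Status
	msg  string
}

func (r *recorder) UpdateStatus(s Status, msg string) { r.last, r.msg = s, msg }

func main() {
	parent := &recorder{}
	group := newGroupReporter(parent)
	group.forChild("process").UpdateStatus(Degraded, "permission denied")
	group.forChild("cpu").UpdateStatus(Running, "")
	// The healthy cpu runner no longer overwrites the degraded state.
	fmt.Println(parent.last == Degraded, parent.msg)
}
```

With this shape, the parent only returns to a healthy state once every child that reported a problem has reported healthy again.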

> Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls but I am suspecting the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go) which are not beat receiver specific.

That should be the case in an ideal world, but receivers can report status via component.Host, which is only accessible during receiver startup. Unfortunately, there's currently no way to configure a status reporter at the time of receiver creation 😢

I mentioned this when I originally opened the PR, but I guess it got lost in the history.
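
The timing constraint can be illustrated with a small sketch. This is not what the PR implements; it only shows one conceivable way to create a reporter together with the receiver and attach it to the host later. `Host` here is a local, hypothetical stand-in reduced to one method, not the collector's component.Host, and `lateBoundReporter` and `Bind` are invented names.

```go
package main

import (
	"fmt"
	"sync"
)

// Host is a hypothetical stand-in for the collector host that a receiver
// only receives when it starts.
type Host interface {
	ReportStatus(status, msg string)
}

type update struct{ status, msg string }

// lateBoundReporter can be created together with the receiver and wired
// to the host later. Until then it keeps only the most recent update.
type lateBoundReporter struct {
	mu      sync.Mutex
	host    Host
	pending *update
}

// UpdateStatus forwards to the host if one is bound, otherwise buffers.
func (r *lateBoundReporter) UpdateStatus(status, msg string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.host == nil {
		r.pending = &update{status: status, msg: msg}
		return
	}
	r.host.ReportStatus(status, msg)
}

// Bind is meant to be called from the receiver's start path, once the
// host exists. It flushes a buffered update, if any.
func (r *lateBoundReporter) Bind(h Host) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.host = h
	if r.pending != nil {
		h.ReportStatus(r.pending.status, r.pending.msg)
		r.pending = nil
	}
}

// memoryHost records everything it is told, for demonstration.
type memoryHost struct{ seen []string }

func (m *memoryHost) ReportStatus(status, msg string) {
	m.seen = append(m.seen, status+": "+msg)
}

func main() {
	rep := &lateBoundReporter{}
	rep.UpdateStatus("DEGRADED", "reported before start") // buffered
	host := &memoryHost{}
	rep.Bind(host) // flushes the buffered update
	rep.UpdateStatus("HEALTHY", "reported after start")
	fmt.Println(host.seen)
}
```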

> It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.

This is on my TODO list. I'll work on it as a follow-up.

Member

@cmacknz cmacknz Jun 10, 2025

Thanks, going to summarize what I mentioned in the project meeting today to make sure it is captured here:

  1. We need to make sure status reporting works for all of the Beats (heartbeat, osquerybeat, etc) and ideally we want to be able to do this without going into each individual Beat's concept of a module or an input. If we can find a way to centralize the injection of the group status reporter in libbeat our life is easier. In the control protocol, the Beat starts up with no inputs configured and then on reload the status reporters are injected.

    in.StatusReporter = unit.GetReporterForStreamByIndex(idx)

  2. We need to make the end state grouping clearer and make sure the way we've wired this up can preserve it. Today each individual input in an elastic-agent.yml has an independent status, and streams under that input are rolled up into that of the parent input. We need to preserve this with Beats receivers, but it is not entirely the Beats' responsibility to do this. It also might not be possible with the way otel status reporting works today; we'd ideally want a concept of a sub-component (so that multiple inputs in a single receiver can have independent state).

To use a configuration example let's imagine someone split the cpu, memory, network, and filesystem system metricsets across two inputs:

inputs:
  # This system/metrics module reports state independently of any other in the same configuration.
  - type: system/metrics
    id: cpu-memory
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of overall parent module.
      - metricsets:
        - cpu
        data_stream.dataset: system.cpu
      - metricsets:
        - memory
        data_stream.dataset: system.memory
  # This system/metrics module reports state independently of the one above.
  - type: system/metrics
    id: network-filesystem
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of overall parent module.
      - metricsets:
        - network
        data_stream.dataset: system.network
      - metricsets:
        - filesystem
        data_stream.dataset: system.filesystem

I think to get this to work correctly Elastic Agent would have to orchestrate both of those two inputs into separate system/metrics receivers (CC @leehinman and @swiatekm keep me honest) without doing anything else. Otherwise, we may need to do something outside of straight collector component status reporting to preserve the way this works today.

If we one day gained the ability to report sub-component status they wouldn't need to be separate receivers.

Comment thread libbeat/management/status/group_test.go Outdated
@VihasMakwana VihasMakwana force-pushed the fix-status-reporting-metricsets branch from aa89373 to d64de30 Compare June 10, 2025 08:36
func NewGroupStatusReporter(parent status.StatusReporter) RunnerReporter {
// If the parent is a "fallbackManager", we're operating in standard standalone mode,
// so setting a group reporter isn't necessary.
if _, ok := parent.(*fallbackManager); ok || parent == nil {
Contributor Author

@cmacknz I've made a slight adjustment here.

When we run standalone beats, we create a fallbackManager. Now, we return a no-op status reporter in that case.
This prevents mixing the two use cases.
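
A minimal sketch of that guard, assuming simplified local types: `fallbackManager` below is an empty stand-in (its real definition is not shown in this thread), and `noopReporter`, `forwardingReporter`, `newReporter`, and `counter` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Status is a simplified stand-in for the Beats status type (hypothetical).
type Status int

const (
	Running Status = iota
	Degraded
)

// StatusReporter mirrors the single-method reporter interface.
type StatusReporter interface {
	UpdateStatus(status Status, msg string)
}

// fallbackManager stands in for the manager used by standalone Beats.
type fallbackManager struct{}

func (*fallbackManager) UpdateStatus(Status, string) {}

// noopReporter silently drops every update.
type noopReporter struct{}

func (noopReporter) UpdateStatus(Status, string) {}

// forwardingReporter passes updates through to its parent unchanged.
type forwardingReporter struct{ parent StatusReporter }

func (f forwardingReporter) UpdateStatus(s Status, msg string) {
	f.parent.UpdateStatus(s, msg)
}

// newReporter returns a no-op reporter in standalone mode (nil parent or
// fallbackManager) and a forwarding reporter otherwise.
func newReporter(parent StatusReporter) StatusReporter {
	// The comma-ok assertion is safe on a nil interface; ok is false.
	if _, ok := parent.(*fallbackManager); ok || parent == nil {
		return noopReporter{}
	}
	return forwardingReporter{parent: parent}
}

// counter counts how many updates reach it.
type counter struct{ calls int }

func (c *counter) UpdateStatus(Status, string) { c.calls++ }

func main() {
	managed := &counter{}
	newReporter(managed).UpdateStatus(Degraded, "x")            // forwarded
	newReporter(&fallbackManager{}).UpdateStatus(Degraded, "x") // dropped
	newReporter(nil).UpdateStatus(Degraded, "x")                // dropped
	fmt.Println(managed.calls)
}
```

Returning a no-op value instead of nil means callers can invoke UpdateStatus unconditionally without nil checks.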

@VihasMakwana
Contributor Author

Closing in favour of #44782

Labels

backport-9.0 Automated backport to the 9.0 branch enhancement Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Development

Successfully merging this pull request may close these issues.

Beat receivers do not correctly report status back to the Elastic Agent

7 participants