[otelbeat] fix status reporting when running beatreceivers#44528
VihasMakwana wants to merge 39 commits into
Conversation
This pull request does not have a backport label. To fix this, add the backport labels for the needed branches.
This pull request is now in conflicts. Could you fix it? 🙏
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

I don't see a test in here proving that Beat receivers now report status correctly; please add one.
r.parent.UpdateStatus(calcState, calcMsg)
}
func (r *reporter) calculateState() (Status, string) {
Now that you've pointed me at the pre-existing implementation for inputs+streams in
beats/x-pack/libbeat/management/unit.go
Line 208 in 4309305
Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?
The manager already has an UpdateStatus method meant to report status, and there is now an OtelManager that can do OTel specific things, why do we need to add group status reporters in metricbeat and filebeat without going through the existing manager interface?
The previous one is tied to the concept of units, which we don't have anymore?
It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.
In the previous control protocol implementation, the status reporter was injected when the manager reloaded inputs:
beats/x-pack/libbeat/management/managerV2.go
Line 817 in 4309305
This also defined the reporting resolution of status reporters: there was one per unit, which mapped to one per input.
In the otel world there would naturally be one status reporter per receiver, and we would ideally have one per beat receiver input one day.
Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls, but I suspect the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go), which are not beat receiver specific.
Why does this also need to exist? I can see what it is doing but it's not clear why you had to add this separate group reporter?
I'll give you an example why we need the resolution logic. Consider the following config
inputs:
  - module: system
    metricsets:
      - process
  - module: system
    metricsets:
      - cpu
  - module: system
    metricsets:
      - memory

In metricbeat, we create one Runner per module. The status reporter for a runner is set here:
beats/metricbeat/mb/module/runner_group.go
Lines 44 to 50 in ac45d48
status.Reporter interface: beats/libbeat/management/status/status.go
Lines 46 to 49 in ac45d48
If we directly set the OTel status reporter as-is (without any group reporter), then we face the following problems:
- Conflicting statuses:
  - When multiple modules are running, each one can independently update the shared reporter's status. This can lead to misleading results. For example:
    - If one module with the process metricset is DEGRADED, it updates the reporter's status, which is as expected.
    - But if another module is HEALTHY, it would also update the reporter's status as HEALTHY, overwriting the previous DEGRADED state.
- To avoid this race condition, we need a centralised place to aggregate the statuses of each module and calculate the status of the entire beat.
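The aggregation described above can be sketched as a worst-status-wins reduction over the last status reported by each runner. This is an illustrative sketch, not the actual code from the PR; the type and function names are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

// Status mirrors the idea of beats status levels; higher is worse.
type Status int

const (
	Healthy Status = iota
	Degraded
	Failed
)

// groupReporter stores the last status reported by each runner and
// aggregates them into a single status for the parent module.
type groupReporter struct {
	mu       sync.Mutex
	statuses map[string]Status
}

func newGroupReporter() *groupReporter {
	return &groupReporter{statuses: make(map[string]Status)}
}

// UpdateStatus records the latest status for one runner.
func (g *groupReporter) UpdateStatus(runner string, s Status) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.statuses[runner] = s
}

// calculateState returns the worst status across all runners, so a
// DEGRADED runner cannot be masked by a later HEALTHY report from another.
func (g *groupReporter) calculateState() Status {
	g.mu.Lock()
	defer g.mu.Unlock()
	worst := Healthy
	for _, s := range g.statuses {
		if s > worst {
			worst = s
		}
	}
	return worst
}

func main() {
	g := newGroupReporter()
	g.UpdateStatus("system/process", Degraded)
	g.UpdateStatus("system/cpu", Healthy)
	// The HEALTHY report from system/cpu no longer overwrites DEGRADED.
	fmt.Println(g.calculateState() == Degraded)
}
```

With a shared flat reporter, the second `UpdateStatus` call would have flipped the overall status back to HEALTHY; the per-runner map is what prevents that.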
Why can't the status reporter be injected when the receiver is created? I think you probably still need this code to toggle the receiver status based on the input UpdateStatus calls but I am suspecting the group reporters are being created in the wrong places (directly in metricbeat.go and crawler.go) which are not beat receiver specific.
That should be the case in an ideal world, but receivers can report status via component.Host, which is only accessible during receiver startup. Unfortunately, there's currently no way to configure a status reporter at the time of receiver creation 😢
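To make the component.Host constraint concrete, here is a minimal sketch of why the reporter can only be wired up in Start. The types are local stand-ins, not the actual collector API (the real Host lives in go.opentelemetry.io/collector/component):

```go
package main

import "fmt"

// Host is a local stand-in for the collector's component.Host.
type Host interface {
	ReportStatus(event string)
}

type loggingHost struct{}

func (loggingHost) ReportStatus(event string) { fmt.Println("status:", event) }

// beatReceiver sketches the lifecycle problem: the Host is only handed
// to the receiver in Start, not at creation time.
type beatReceiver struct {
	report func(event string) // nil until Start is called
}

// newBeatReceiver runs at collector build time; no Host exists yet,
// so there is nothing to inject a status reporter from.
func newBeatReceiver() *beatReceiver {
	return &beatReceiver{}
}

// Start receives the Host, which is the first point where a status
// reporter can be wired up and passed down to the beat's group reporter.
func (r *beatReceiver) Start(host Host) {
	r.report = host.ReportStatus
	r.report("StatusOK")
}

func main() {
	r := newBeatReceiver()
	fmt.Println(r.report == nil) // no reporter exists before Start
	r.Start(loggingHost{})
}
```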
I mentioned this when I originally opened the PR, but I guess it got lost in the history.
It still feels like it would be best if there was a way to account for this through the manager interface so that everything is centralized into a single interface.
This is on my TODO list. I'll work on it as a follow-up.
Thanks, going to summarize what I mentioned in the project meeting today to make sure it is captured here:
- We need to make sure status reporting works for all of the Beats (heartbeat, osquerybeat, etc.) and ideally we want to be able to do this without going into each individual Beat's concept of a module or an input. If we can find a way to centralize the injection of the group status reporter in libbeat, our life is easier. In the control protocol, the Beat starts up with no inputs configured, and then on reload the status reporters are injected.
beats/x-pack/libbeat/management/managerV2.go
Line 817 in 4309305
- We need to make the end state grouping clearer and make sure the way we've wired this up can preserve it. Today each individual input in an elastic-agent.yml has an independent status, and streams under that input are rolled up into that of the parent input. We need to preserve this with Beats receivers, but it is not entirely the Beats' responsibility to do this. It also might not be possible with the way OTel status reporting works today; we'd ideally want a concept of a sub-component (so that multiple inputs in a single receiver can have independent state).
To use a configuration example let's imagine someone split the cpu, memory, network, and filesystem system metricsets across two inputs:
inputs:
  # This system/metrics module reports state independently of any other in the same configuration.
  - type: system/metrics
    id: cpu-memory
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of the overall parent module.
      - metricsets:
          - cpu
        data_stream.dataset: system.cpu
      - metricsets:
          - memory
        data_stream.dataset: system.memory
  # This system/metrics module reports state independently of the one above.
  - type: system/metrics
    id: network-filesystem
    use_output: default
    streams:
      # The status of each metricset is aggregated into the state of the overall parent module.
      - metricsets:
          - network
        data_stream.dataset: system.network
      - metricsets:
          - filesystem
        data_stream.dataset: system.filesystem

I think to get this to work correctly, Elastic Agent would have to orchestrate those two inputs into separate system/metrics receivers (CC @leehinman and @swiatekm, keep me honest) without doing anything else. Otherwise, we may need to do something outside of straight collector component status reporting to preserve the way this works today.
If we one day gained the ability to report sub-component status they wouldn't need to be separate receivers.
func NewGroupStatusReporter(parent status.StatusReporter) RunnerReporter {
	// If the parent is a "fallbackManager", we're operating in standard standalone mode,
	// so setting a group reporter isn't necessary.
	if _, ok := parent.(*fallbackManager); ok || parent == nil {
@cmacknz I've made a slight adjustment here.
When we run standalone beats, we create a fallbackManager. Now, we return a no-op status reporter in that case.
This prevents mixing the two use cases.
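The standalone guard can be sketched as follows. This is illustrative only: the interfaces are simplified stand-ins for beats' status.StatusReporter, and the real function wraps the parent in a group reporter rather than returning it directly:

```go
package main

import "fmt"

// StatusReporter is a simplified stand-in for beats' status.StatusReporter.
type StatusReporter interface {
	UpdateStatus(status int, msg string)
}

// fallbackManager stands in for the manager beats use in standalone mode.
type fallbackManager struct{}

func (f *fallbackManager) UpdateStatus(status int, msg string) {}

// noopReporter discards all status updates; it is returned when there is
// no agent/OTel parent to report to.
type noopReporter struct{}

func (noopReporter) UpdateStatus(status int, msg string) {}

// NewGroupStatusReporter returns a no-op reporter in standalone mode,
// preventing the standalone and beat-receiver use cases from mixing.
func NewGroupStatusReporter(parent StatusReporter) StatusReporter {
	if _, ok := parent.(*fallbackManager); ok || parent == nil {
		return noopReporter{}
	}
	return parent // the real code wraps parent in a group reporter here
}

func main() {
	r := NewGroupStatusReporter(&fallbackManager{})
	_, isNoop := r.(noopReporter)
	fmt.Println(isNoop)
}
```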
Closing in favour of #44782
Proposed commit message
If a metricset/filebeat runner is degraded, it reports its status to its parent module by calling UpdateStatus. If there are multiple runners under a module, the status will keep flipping between different modes. To fix this, create a helper struct that stores the last reported status for each runner and add a new method to calculate the module's status based on its child runners.
This PR also adds a wrapper that converts beats status to collector status.
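Such a wrapper can be sketched as a simple mapping function. The collector-side names are modeled loosely on the OTel collector's component status events but defined locally here, and the exact mapping is an assumption, not necessarily the one in the PR:

```go
package main

import "fmt"

// Beat-side status levels (simplified).
type BeatStatus int

const (
	Running BeatStatus = iota
	Degraded
	Failed
)

// Collector-side statuses, defined locally so the sketch is self-contained.
type CollectorStatus string

const (
	StatusOK               CollectorStatus = "StatusOK"
	StatusRecoverableError CollectorStatus = "StatusRecoverableError"
	StatusPermanentError   CollectorStatus = "StatusPermanentError"
)

// toCollectorStatus converts an aggregated beat status into the event a
// receiver would report to the collector via its host.
func toCollectorStatus(s BeatStatus) CollectorStatus {
	switch s {
	case Degraded:
		return StatusRecoverableError
	case Failed:
		return StatusPermanentError
	default:
		return StatusOK
	}
}

func main() {
	fmt.Println(toCollectorStatus(Degraded))
}
```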
Checklist
CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.
Related issues
Screenshots
Output
Here's the output of running two streams (degraded) together:
Testing
- elastic-agent and follow this guide to test local beats changes
- mage package

Closes elastic/elastic-agent#8210