@swiatekm
Contributor

Proposed commit message

Fix Pod and container resource limit metrics missing intermittently.

This is a bug very recently introduced by the refactor in #41216. Metadata watchers are not just responsible for updating metadata, but also Node and container metrics. Only updating the latter eagerly when metadata is requested leads to races, where the values may be missing depending on the order in which metrics are fetched.

This fix decouples metrics calculation from metadata calculation. Metrics now have their own handlers attached to the watcher, and are completely detached from metadata enrichers. I don't like the resulting architecture that much, as it concentrates a lot of logic in the watcher. But it is an improvement over the status quo, and I'd like to fix this bug promptly before we release it to users.
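
To illustrate the decoupling described above, here is a minimal Go sketch. It is not the code from this PR: `Container`, `MetricsRepo`, `Watcher`, and all of their methods are hypothetical names made up for the example, and the real watcher is assumed to be driven by Kubernetes API events rather than a manual `Notify` call. The point it shows is that the metrics store is updated by a handler that runs on every watcher event, so a reader never depends on a metadata lookup having happened first.

```go
package main

import (
	"fmt"
	"sync"
)

// Container is a hypothetical, trimmed-down view of a container spec.
type Container struct {
	ID       string
	CPULimit float64 // CPU limit in cores
}

// MetricsRepo is a hypothetical thread-safe store of container CPU limits.
type MetricsRepo struct {
	mu     sync.RWMutex
	limits map[string]float64
}

// NewMetricsRepo returns an empty store.
func NewMetricsRepo() *MetricsRepo {
	return &MetricsRepo{limits: make(map[string]float64)}
}

// Set records the CPU limit for a container ID.
func (r *MetricsRepo) Set(id string, limit float64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.limits[id] = limit
}

// Get returns the stored limit and whether one was found.
func (r *MetricsRepo) Get(id string) (float64, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	limit, ok := r.limits[id]
	return limit, ok
}

// Watcher is a hypothetical stand-in for a resource watcher that fans
// each event out to every registered handler.
type Watcher struct {
	mu       sync.Mutex
	handlers []func(Container)
}

// AddHandler registers a callback that runs on every event.
func (w *Watcher) AddHandler(h func(Container)) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.handlers = append(w.handlers, h)
}

// Notify simulates an add/update event arriving at the watcher.
func (w *Watcher) Notify(c Container) {
	// Copy the handler list so handlers run without holding the lock.
	w.mu.Lock()
	handlers := append([]func(Container){}, w.handlers...)
	w.mu.Unlock()
	for _, h := range handlers {
		h(c)
	}
}

func main() {
	repo := NewMetricsRepo()
	w := &Watcher{}

	// The metrics handler is attached to the watcher directly. No metadata
	// enricher is involved in keeping the limits up to date.
	w.AddHandler(func(c Container) { repo.Set(c.ID, c.CPULimit) })

	w.Notify(Container{ID: "pod-a/app", CPULimit: 0.5})

	limit, ok := repo.Get("pod-a/app")
	fmt.Println(limit, ok)
}
```

In this sketch the limit is available as soon as the event has been processed, regardless of which metricset reads it first or whether any metadata was ever requested.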

The bug was quite difficult to catch in E2E tests, as it could take some time to appear. I've tested this change much more carefully, and haven't seen any issues after hours of running it in my test cluster.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works

How to test this PR locally

The simplest way is to install elastic-agent in standalone mode and check the default Kubernetes dashboard.

Related issues

@swiatekm swiatekm added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels Oct 25, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Oct 25, 2024
@mergify
Contributor

mergify bot commented Oct 25, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @swiatekm? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

@mergify
Contributor

mergify bot commented Oct 25, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Oct 25, 2024
@swiatekm swiatekm added backport-8.15 Automated backport to the 8.15 branch with mergify backport-8.16 Automated backport with mergify labels Oct 25, 2024
@swiatekm swiatekm marked this pull request as ready for review October 25, 2024 13:01
@swiatekm swiatekm requested a review from a team as a code owner October 25, 2024 13:01
@swiatekm swiatekm requested review from constanca-m and gizas October 25, 2024 13:01
@elasticmachine
Contributor

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@swiatekm
Contributor Author

swiatekm commented Oct 25, 2024

I've no idea what the problem is with linting on darwin. It looks like a build error, but we don't build metricbeat on macOS, so it's difficult to diagnose. I can reproduce it locally by running golangci-lint with GOOS=darwin CGO_ENABLED=1, but the error it reports is plainly incorrect.

EDIT: Just added an exception, similar to #33649 .

@swiatekm swiatekm force-pushed the fix-metricbeat-container-metrics branch from b4248b4 to 17cb914 Compare October 28, 2024 10:46
@swiatekm swiatekm requested a review from a team as a code owner October 28, 2024 10:46
@swiatekm swiatekm requested review from VihasMakwana and faec October 28, 2024 10:47
@pierrehilbert pierrehilbert requested review from mauri870 and removed request for faec October 28, 2024 11:43
Member

@mauri870 mauri870 left a comment

The code looks good overall, but I don't have deep knowledge of the integration. It would be helpful to get a review from another developer.

@MichaelKatsoulis
Contributor

@swiatekm The changes look good to me. I also tested them, and everything seems to be working as expected.
Maybe you could update the enrichers.md file as well, so that it describes the updated process.

@swiatekm
Contributor Author

@MichaelKatsoulis I'll update the documentation in a follow-up, I don't want to hold this PR up.

@swiatekm swiatekm merged commit e7cc6fc into main Oct 30, 2024
@swiatekm swiatekm deleted the fix-metricbeat-container-metrics branch October 30, 2024 15:16
mergify bot pushed a commit that referenced this pull request Oct 30, 2024
…41453)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)
swiatekm added a commit that referenced this pull request Oct 30, 2024
…41453) (#41484)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
pierrehilbert pushed a commit that referenced this pull request Oct 30, 2024
…41453) (#41483)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
pierrehilbert pushed a commit that referenced this pull request Oct 30, 2024
…41453) (#41485)

* Fix Pod and container resource limit metrics missing intermittently

* Add another exception to typecheck linter

(cherry picked from commit e7cc6fc)

Co-authored-by: Mikołaj Świątek <[email protected]>
@khushijain21 khushijain21 mentioned this pull request Jun 23, 2025

Labels

backport-8.x Automated backport to the 8.x branch with mergify backport-8.15 Automated backport to the 8.15 branch with mergify backport-8.16 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team

Development

Successfully merging this pull request may close these issues.

Pod and container resource limit metrics missing intermittently

5 participants