@pzl
Member

@pzl pzl commented Dec 10, 2025

What is the problem this PR solves?

Improves Elasticsearch performance when uploading large files.

How does this PR solve the problem?

Forcing a refresh is somewhat costly, and doing so on every chunk upload was unnecessary. For large files with many chunks, this could overwhelm Elasticsearch. The chunk records only need to be searchable at the very end, so a single refresh call is now made at the finalization step instead of on every chunk.

Note that the final refresh call is still necessary. It was observed that the finalization step was not always able to locate the most-recently-uploaded chunks immediately prior to the finish api call, and validation would fail.
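
To illustrate the shape of the change, here is a minimal sketch in Go. It is not the fleet-server code: `fakeIndex`, `uploadChunks`, and `finalize` are hypothetical names, and the in-memory index only imitates the property that documents are not searchable until a refresh happens.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeIndex is a hypothetical in-memory stand-in for an Elasticsearch index.
// Documents land in pending and only become searchable after Refresh.
type fakeIndex struct {
	pending    int // indexed but not yet visible to search
	searchable int // visible to search
	refreshes  int // number of refresh calls issued
}

// IndexChunk writes one chunk without forcing a refresh.
func (f *fakeIndex) IndexChunk() { f.pending++ }

// Refresh makes every pending document searchable.
func (f *fakeIndex) Refresh() {
	f.searchable += f.pending
	f.pending = 0
	f.refreshes++
}

// CountChunks returns how many chunks a search would find right now.
func (f *fakeIndex) CountChunks() int { return f.searchable }

var errMissingChunks = errors.New("upload incomplete: chunks missing")

// uploadChunks writes n chunks and never refreshes.
func uploadChunks(idx *fakeIndex, n int) {
	for i := 0; i < n; i++ {
		idx.IndexChunk()
	}
}

// finalize issues a single refresh, then validates that every expected
// chunk is searchable.
func finalize(idx *fakeIndex, expected int) error {
	idx.Refresh()
	if got := idx.CountChunks(); got != expected {
		return fmt.Errorf("%w: found %d of %d", errMissingChunks, got, expected)
	}
	return nil
}
```

In this sketch, uploading 3,000 chunks costs zero refreshes and finalization costs exactly one, while dropping the `Refresh` call in `finalize` makes validation fail because no chunk is visible yet.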

How to test this PR locally

Either use the "get agent diagnostics" zip function, or use Elastic Defend: go to the responder and run a "get-file" command to exercise this path.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have added an entry in ./changelog/fragments using the changelog tool

@pzl pzl requested a review from a team as a code owner December 10, 2025 20:13
@pzl pzl requested review from pchila and ycombinator December 10, 2025 20:13
@mergify
Contributor

mergify bot commented Dec 10, 2025

This pull request does not have a backport label. Could you fix it @pzl? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-./d./d is the label to automatically backport to the 8./d branch. /d is the digit
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

@pzl
Member Author

pzl commented Dec 10, 2025

It doesn't seem like I have access to add/edit labels on this?

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label Dec 11, 2025
@cmacknz
Member

cmacknz commented Dec 15, 2025

Note that the final refresh call is still necessary. It was observed that the finalization step was not always able to locate the most-recently-uploaded chunks immediately prior to the finish api call, and validation would fail.

Is there any way to get this into a test? There is nothing preventing us from regressing on the fix in this PR in the future right now.

Even calls against mocks to ensure a refresh request happens after the whole file is received would do it.

@pzl
Member Author

pzl commented Dec 15, 2025

Is there any way to get this into a test? There is nothing preventing us from regressing on the fix in this PR in the future right now.

Even calls against mocks to ensure a refresh request happens after the whole file is received would do it.

I think the reason here is that there isn't a behavioral change, but an implementation detail change. Testing implementation this narrowly would be fairly brittle, and would fail if modified at all (even if it maintains valid behavior).

How the function validates that all chunks are present is an internal detail. It could alternatively call search, count chunks, and wait and retry up to n times for a natural refresh to occur. That would be just as valid, but the test would fail. I'm not sure what a test for "uses some strategy for making sure very-recent documents are indexed for search" would look like without being unnecessarily brittle.
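
For reference, the alternative strategy mentioned above (search, count, wait, retry) could be sketched in Go roughly as follows. This is only an illustration and not something the PR implements: `countFunc`, `waitForChunks`, and the attempt and delay parameters are all hypothetical.

```go
package main

import (
	"errors"
	"time"
)

// countFunc is a hypothetical callback that reports how many chunks a
// search currently finds.
type countFunc func() (int, error)

var errChunksNotVisible = errors.New("not all chunks became searchable")

// waitForChunks polls count up to attempts times, sleeping delay between
// tries, and returns nil as soon as at least expected chunks are visible.
// It relies on the index refreshing on its own schedule rather than
// forcing a refresh.
func waitForChunks(count countFunc, expected, attempts int, delay time.Duration) error {
	for i := 0; i < attempts; i++ {
		n, err := count()
		if err != nil {
			return err
		}
		if n >= expected {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return errChunksNotVisible
}
```

A finalization step built this way would satisfy the same behavior (all chunks found before validation) without ever calling refresh, which is why a test that asserts on the refresh call specifically is tied to the implementation.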

I do grant, "we found a bug, we should add tests to make sure we don't do it again by accident" is right, but "calling refresh 3,000 times in rapid succession" on an index was the bug.

@cmacknz
Member

cmacknz commented Dec 15, 2025

Testing implementation this narrowly would be fairly brittle, and would fail if modified at all

IMO this is the point, that the next person to touch this in 6 months or whatever is pointed to this call and any comments around it should they happen to change it. I wouldn't want to test everything this way but a brittle test is fine in this specific scenario.

@pzl
Member Author

pzl commented Dec 15, 2025

got it: this is more of a "here be dragons, don't break it unless you know the ramifications," situation. Accepting some brittle tests to draw even more attention that changes to this need to be very intentional. Guarding against inadvertent changes that upend the learnings and testing here

I can add some comments and tests to that effect, sure
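
A deliberately brittle check of the kind discussed in this thread might look something like the Go sketch below. It is not the test that was added to the PR: `chunkStore`, `recordingStore`, `completeUpload`, and `checkRefreshOrder` are hypothetical names used only to show the idea of recording calls against a mock and asserting that exactly one refresh happens, after the last chunk write and before the search.

```go
package main

import (
	"errors"
	"fmt"
)

// chunkStore is a hypothetical interface over the calls the upload path makes.
type chunkStore interface {
	IndexChunk(id int) error
	Refresh() error
	SearchChunks() (int, error)
}

// recordingStore is a mock that records the order of calls it receives.
type recordingStore struct {
	calls  []string
	chunks int
}

func (r *recordingStore) IndexChunk(id int) error {
	r.calls = append(r.calls, "index")
	r.chunks++
	return nil
}

func (r *recordingStore) Refresh() error {
	r.calls = append(r.calls, "refresh")
	return nil
}

func (r *recordingStore) SearchChunks() (int, error) {
	r.calls = append(r.calls, "search")
	return r.chunks, nil
}

// completeUpload is a hypothetical finalization step: refresh once, then search.
func completeUpload(s chunkStore) (int, error) {
	if err := s.Refresh(); err != nil {
		return 0, err
	}
	return s.SearchChunks()
}

// checkRefreshOrder returns an error unless calls contains exactly one
// "refresh", placed after every "index" and before the first "search".
func checkRefreshOrder(calls []string) error {
	refreshes, refreshAt, lastIndex, firstSearch := 0, -1, -1, -1
	for i, c := range calls {
		switch c {
		case "refresh":
			refreshes++
			refreshAt = i
		case "index":
			lastIndex = i
		case "search":
			if firstSearch == -1 {
				firstSearch = i
			}
		}
	}
	if refreshes != 1 {
		return fmt.Errorf("want exactly 1 refresh, got %d", refreshes)
	}
	if refreshAt < lastIndex {
		return errors.New("refresh happened before the last chunk was indexed")
	}
	if firstSearch != -1 && firstSearch < refreshAt {
		return errors.New("search happened before the refresh")
	}
	return nil
}
```

The check fails both when the refresh is removed and when a per-chunk refresh is reintroduced, which points whoever changes this path back to the reasoning in this PR.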

Member

@cmacknz cmacknz left a comment

Thanks for the tests and additional comments, there is a nice paper trail to follow for use of refresh now.

@pzl
Member Author

pzl commented Dec 16, 2025

Buildkite test this

@pzl pzl merged commit 22195e3 into elastic:main Dec 16, 2025
10 checks passed
@ebeahan
Member

ebeahan commented Dec 16, 2025

@pzl i saw you mention you didn't have the ability to add labels - do we need to backport this to any other branches?

@pzl
Member Author

pzl commented Dec 16, 2025

@pzl i saw you mention you didn't have the ability to add labels - do we need to backport this to any other branches?

No, 9.3 and-up is good

(and label permissions have since been fixed, thank you!)

Labels

Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team
