-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Improving statsByShard performance when the number of shards is very large #130857
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
44 commits
Select commit
Hold shift + click to select a range
168c6c5
First attempt at improving statsByShard performance
masseyke 4a42e34
Update docs/changelog/130857.yaml
masseyke f381bb9
reducing memory usage
masseyke 72ba4f5
Merge branch 'fix/statsByShard-performance' of github.com:masseyke/el…
masseyke 41bdbea
removing unused method
masseyke 212186a
minor cleanup
masseyke 16971df
minor cleanup
masseyke c8fc29e
fixing IndicesServiceTests
masseyke 4455f70
removing all uses of N^2 code
masseyke 4b69095
Merge branch 'main' into fix/statsByShard-performance
masseyke 34659a8
fixing test
masseyke 1d77b9d
moving code into reusable methods
masseyke 10fac5b
cleanup
masseyke 21ac4f4
moving code back
masseyke 3873deb
adding unit tests
masseyke 4871c7c
[CI] Auto commit changes from spotless
b95efb5
Merge branch 'main' into fix/statsByShard-performance
joegallo 3c430a7
Merge branch 'main' into fix/statsByShard-performance
masseyke 9f7c36e
Merge branch 'main' into fix/statsByShard-performance
masseyke 9414544
making sure it works with changes from #135298
masseyke d3d25d6
fixed a (bad) typo
masseyke ef99d55
merging main
masseyke 94c4ab7
minor cleanup
masseyke bf71afc
fixing description
masseyke 1255796
Update docs/changelog/130857.yaml
masseyke 1d2cbd7
improving comments
masseyke 5431f7f
Merge branch 'fix/statsByShard-performance' of github.com:masseyke/el…
masseyke 441574f
fixing a bad record field rename
masseyke d380eb5
Rely on the ShardId rather than the IndexShard
joegallo 641b940
This can be a final long
joegallo aae6e7f
Fix a typo
joegallo 556db6a
Fix a typo
joegallo ce81caa
Remove an unused constructor
joegallo be92c46
Tidy up some Intellij gripes
joegallo a2542ed
Use computeIfAbsent for this
joegallo 7bf19d3
pre-computing a map of shared ram for all shards
masseyke 57d6da8
Merge branch 'main' into fix/statsByShard-performance
joegallo ec57e0e
Only map the non-zero values
joegallo 4a34321
Externalize the null check
joegallo a3ac3aa
Make this method instance rather than static
joegallo ccd7d2b
Rename this method
joegallo 81bfdbd
Add a javadoc, as promised
joegallo cfb6d99
Merge branch 'main' into fix/statsByShard-performance
joegallo 123743f
Merge branch 'main' into fix/statsByShard-performance
masseyke File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| pr: 130857 | ||
| summary: Improving statsByShard performance when the number of shards is very large | ||
| area: Stats | ||
| type: bug | ||
| issues: | ||
| - 97222 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I like where this is going, but I'm curious if we can afford a fairly large map allocation here.
I don't love the pattern here of a pulling out the
CacheTotalsas kindof a disembodied magic thing and then using it in the loop.But if we can afford a (large) map allocation here, then we could have just one method that returns a
Map<ShardId, Long>that does the looping internally and just does thecacheTotalsandsharedRamcalculations internally.I don't think that's a big change to code, and it probably still gives us a pretty nice speedup in terms of Big O.
But it does cost us an allocation.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coincidentally, my first pass at this did this. But I was kind of worried about trading off an O(N^2) performance problem for an O(N) memory problem. It would probably be fine? But it also might be the thing that pushes us into an OOME when we're trying to calculate the stats for a big cluster. So I had also not liked the magic token, but came down on the side of it being preferable to possibly causing an OOME.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm showing my ignorance by asking this, but for the O(N^2) operation and possible O(N) map allocation, are we running that on the coordinating node for the entire cluster, or are we running it on each node within the cluster in terms of the shards that they're hosting? I imagine it must be the latter. Assuming so (but we should confirm that!), I'm inclined to think that an O(N) (N=shards on this node) map allocation of
ShardIdtoLongis probably fine -- we'd probably be paying like 32 bytes per entry in the map? 36?Basically, even if a cluster has a ton of shards, those shards are split up among a reasonable number of nodes, so we're probably looking at a worst case scenario of an order of magnitude fewer shards per node than the shards per cluster.
It's possible I'm thinking about this wrong, though (and of course all this is predicated on the notion that the calculation we're doing is on all the nodes of the cluster and not on the coordinating node).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've also been assuming that N is just shards on this node (I can't remember if I've checked that assumption). We probably tend to have the most shards per node on serverless nodes, and I believe that's currently capped at 15,000 (although we're able to push that number up as we get rid of problems like this one). So if your math is right, that's probably less than 1mb per node, which isn't bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++, if we're right about these assumptions, then it's probably a drop in the bucket compared to the memory associated with just having 15,000 shards in memory on the node. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I just pushed a change that does that. I might look more closely at test coverage tomorrow.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a slight optimization on top of that (in ec57e0e) so that we only store the non-zero values in the map -- that way the map is O(N) on the number of shards that have representation in the query cache rather than being O(N) on the number of shards that are on the node.