Add shard heap usage to ClusterInfo #139557

Merged
elasticsearchmachine merged 16 commits into elastic:main from DiannaHohensee:2025/12/11/ES-12882-simulate-heap
Feb 24, 2026

Conversation

@DiannaHohensee
Contributor

@DiannaHohensee DiannaHohensee commented Dec 15, 2025

Adds a map of per-shard heap usage estimates to ClusterInfo, and adds
a method to EstimatedHeapUsageCollector for collecting them.
InternalClusterInfoService calls that method to supply the new shard
heap values.
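
As a rough illustration of the shape described above, the sketch below models a per-shard heap usage map and a collector that supplies it. Only ShardAndIndexHeapUsage and its two accessors appear in this PR's test snippets; the ShardKey stand-in for ShardId, the ShardHeapCollector interface, the collectShardHeapUsage method name, ClusterInfoSketch, and the non-negative validation are all assumptions made for the example, not the actual Elasticsearch API.

```java
import java.util.Map;

final class ShardHeapUsageSketch {
    /** Hypothetical stand-in for Elasticsearch's ShardId. */
    record ShardKey(String index, int shard) {}

    /** Name and accessors follow the PR's test snippets; the validation is an assumption. */
    record ShardAndIndexHeapUsage(long shardHeapUsageBytes, long indexHeapUsageBytes) {
        ShardAndIndexHeapUsage {
            if (shardHeapUsageBytes < 0 || indexHeapUsageBytes < 0) {
                throw new IllegalArgumentException("heap usage must be non-negative");
            }
        }
    }

    /** Hypothetical collector interface; the real method name and signature may differ. */
    interface ShardHeapCollector {
        Map<ShardKey, ShardAndIndexHeapUsage> collectShardHeapUsage();
    }

    /** Hypothetical, heavily simplified stand-in for ClusterInfo. */
    record ClusterInfoSketch(Map<ShardKey, ShardAndIndexHeapUsage> estimatedShardHeapUsages) {
        ClusterInfoSketch {
            // Defensive copy so the published snapshot is immutable.
            estimatedShardHeapUsages = Map.copyOf(estimatedShardHeapUsages);
        }
    }

    /** Plays the role of the info service: asks the collector, then publishes a snapshot. */
    static ClusterInfoSketch refresh(ShardHeapCollector collector) {
        return new ClusterInfoSketch(collector.collectShardHeapUsage());
    }
}
```

A caller could pass a lambda such as `() -> Map.of(key, usage)` to `refresh` and read the values back from `estimatedShardHeapUsages()`.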

This is a step towards simulating heap usage changes when a shard
moves during shard allocation computation.
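
To make that goal concrete, here is a minimal sketch of what simulating a shard move could look like, assuming each node's estimated heap usage is tracked as a plain byte count and a move simply shifts the shard's estimate from the source node to the target node. The class, method, and node names are hypothetical. This PR does not include the simulation itself, and the real allocation code may treat index-level heap differently.

```java
import java.util.HashMap;
import java.util.Map;

final class HeapMoveSimulationSketch {
    /**
     * Returns a copy of nodeHeapBytes with shardHeapBytes shifted from the
     * source node to the target node. The input map is not modified.
     */
    static Map<String, Long> simulateMove(
        Map<String, Long> nodeHeapBytes,
        String sourceNode,
        String targetNode,
        long shardHeapBytes
    ) {
        if (shardHeapBytes < 0) {
            throw new IllegalArgumentException("shard heap usage must be non-negative");
        }
        Map<String, Long> simulated = new HashMap<>(nodeHeapBytes);
        // Assumption: estimates are approximate, so clamp at zero rather than go negative.
        long sourceHeap = simulated.getOrDefault(sourceNode, 0L);
        simulated.put(sourceNode, Math.max(0L, sourceHeap - shardHeapBytes));
        // Add the shard's estimate to the target, creating the entry if it is absent.
        simulated.merge(targetNode, shardHeapBytes, Long::sum);
        return simulated;
    }
}
```

Returning a new map keeps the original estimates intact, which suits a computation that may evaluate several candidate moves before choosing one.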

Relates ES-12882

@DiannaHohensee DiannaHohensee self-assigned this Dec 15, 2025
@elasticsearchmachine elasticsearchmachine added serverless-linked, v9.3.0, v9.4.0 and removed v9.3.0 labels Dec 15, 2025
@DiannaHohensee DiannaHohensee force-pushed the 2025/12/11/ES-12882-simulate-heap branch 2 times, most recently from bc01ce5 to 1f71b0f December 18, 2025 23:10
@DiannaHohensee DiannaHohensee force-pushed the 2025/12/11/ES-12882-simulate-heap branch from 1f71b0f to 9da594e December 19, 2025 20:23
@DiannaHohensee DiannaHohensee marked this pull request as ready for review December 19, 2025 23:20
@elasticsearchmachine elasticsearchmachine added the needs:triage label Dec 19, 2025
@DiannaHohensee DiannaHohensee added :Distributed/Allocation, Team:Distributed Coordination (obsolete), >non-issue and removed needs:triage labels Dec 19, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@DiannaHohensee DiannaHohensee removed the serverless-linked label Dec 19, 2025
Contributor

@nicktindall nicktindall left a comment

LGTM

for (var entry : estimatedShardHeapUsages.entrySet()) {
    assertThat(entry.getValue().shardHeapUsageBytes(), greaterThanOrEqualTo(0L));
    assertThat(entry.getValue().indexHeapUsageBytes(), greaterThanOrEqualTo(0L));
}
Contributor

We should only have to put a single document in each index and we'd get a non-zero value here, wouldn't we? (Because of the mapping size in bytes, at least.)

Then we could assert greaterThan(0L) which seems stronger?

Contributor Author

I'm not certain what you mean. The heap values are randomly generated for this test, anywhere from 0 to Long.MAX_VALUE. So even if some writes were done, that would not guarantee a non-zero value?

Stateful doesn't have real metrics for heap usage, that's only in serverless.

int numEntries = randomIntBetween(0, 128);
Map<ShardId, ShardAndIndexHeapUsage> shardHeapUsageBuilder = new HashMap<>(numEntries);
for (int i = 0; i < numEntries; i++) {
    shardHeapUsageBuilder.put(randomShardId(), new ShardAndIndexHeapUsage(randomLong(), randomLong()));
Contributor

Should these be randomNonNegativeLong()? I think a negative value would be unexpected here, right?

Contributor Author

Yes, thanks for catching this. Fixed 👍
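
For illustration, a minimal sketch of populating such a test map with non-negative values using only the JDK. The nonNegativeLong helper below is a stand-in, not the Elasticsearch test framework's randomNonNegativeLong(); it clears the sign bit so every result falls in the range 0 to Long.MAX_VALUE inclusive. String keys replace ShardId, and the class and method names are hypothetical, to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

final class NonNegativeHeapFixtureSketch {
    /** Same name and accessors as in the snippet above; redeclared here for self-containment. */
    record ShardAndIndexHeapUsage(long shardHeapUsageBytes, long indexHeapUsageBytes) {}

    /** Clears the sign bit, so the result is always in [0, Long.MAX_VALUE]. */
    static long nonNegativeLong(Random random) {
        return random.nextLong() & Long.MAX_VALUE;
    }

    /** Builds numEntries map entries with non-negative heap values. */
    static Map<String, ShardAndIndexHeapUsage> randomShardHeapUsages(Random random, int numEntries) {
        Map<String, ShardAndIndexHeapUsage> usages = new HashMap<>();
        for (int i = 0; i < numEntries; i++) {
            // Hypothetical key format standing in for a random ShardId.
            usages.put(
                "index-" + i + "[0]",
                new ShardAndIndexHeapUsage(nonNegativeLong(random), nonNegativeLong(random))
            );
        }
        return usages;
    }
}
```

Because zero remains a possible value under this scheme, greaterThanOrEqualTo(0L) is the assertion that matches the generated data.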

Contributor Author

@DiannaHohensee DiannaHohensee left a comment

Updated to non-negative longs, and added a few verify calls to check that the cluster info service calls the new collection method (lots of mocks). 1ff495e

@DiannaHohensee DiannaHohensee added the auto-merge-without-approval label Jan 15, 2026
@DiannaHohensee DiannaHohensee removed the auto-merge-without-approval label Jan 15, 2026
@DiannaHohensee
Contributor Author

Oh, this needs to wait for the serverless changes because there's a change to an interface that serverless implements. Doh.

@elasticsearchmachine elasticsearchmachine added the Team:Distributed label Feb 20, 2026
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DiannaHohensee DiannaHohensee added serverless-linked, auto-merge-without-approval and removed Team:Distributed Coordination (obsolete) labels Feb 23, 2026
@elasticsearchmachine elasticsearchmachine merged commit f4482ba into elastic:main Feb 24, 2026
35 checks passed
@DiannaHohensee DiannaHohensee deleted the 2025/12/11/ES-12882-simulate-heap branch February 24, 2026 02:15
szybia added a commit to szybia/elasticsearch that referenced this pull request Feb 24, 2026
…on-sliced-reindex

* upstream/main:
  Update docs for v9.3.1 release (elastic#142887)
  Update docs for v9.2.6 release (elastic#142888)
  Improves visibility of vector index options and inference configuration (elastic#141653)
  Disable CAE in microsoft-graph-authz plugin (elastic#142848)
  Small improvements to `GetSnapshotsIT#testAllFeatures` (elastic#142825)
  Fix IndexSettingsTests synthetic ID tests (elastic#142654)
  [Test] Unmute tests of SnapshotShutdownIT (elastic#142921)
  Fixing metrics_info.json kibana definition file name (elastic#142813)
  [Packaging] Disable glibc 2.43 malloc huge pages in Wolfi images (elastic#142894)
  Mute org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsTSDBSyntheticIdIntegTests testSearchableSnapshot elastic#142918
  Add shard heap usage to ClusterInfo (elastic#139557)
  ESQL: Load script fields row-by-row (elastic#142807)
  ESQL: Consolidate doc values memory tracking (elastic#142816)
  ES-14124  Create Index Count Limit User documentation Page (elastic#142570)
  Add a es819 codec test to verify tryRead returns null if may contain duplicates (elastic#142409)
  Support arithmetic operations for dense_vectors: scalar version (elastic#141060)
  [Transform] Allow project_routing (elastic#142421)
  Refactor query rewrite async actions for knn and sparse_vector queries (elastic#142889)
  Do not mark bulk indexing requests as retried after primary relocations (elastic#142157)
sidosera pushed a commit to sidosera/elasticsearch that referenced this pull request Feb 24, 2026
Adds a map of shard heap usages to the ClusterInfo, as well as adding an
interface for collection thereof to the EstimatedHeapUsageCollector,
which the InternalClusterInfoService uses to supply the new shard heap
values.

This is a step towards simulating heap usage changes when a shard moves
during shard allocation computation.

Relates ES-12882

Labels

auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!)
:Distributed/Allocation (All issues relating to the decision making around placing a shard, both master logic & on the nodes)
>non-issue
serverless-linked (Added by automation, don't add manually)
Team:Distributed (Meta label for distributed team.)
v9.4.0

3 participants