Skip to content

Add metric for shard snapshot queue time#143658

Merged
elasticsearchmachine merged 9 commits intoelastic:mainfrom
ywangd:ES-14292-snapshot-queue-time-metrics
Mar 12, 2026
Merged

Add metric for shard snapshot queue time#143658
elasticsearchmachine merged 9 commits intoelastic:mainfrom
ywangd:ES-14292-snapshot-queue-time-metrics

Conversation

@ywangd
Copy link
Copy Markdown
Member

@ywangd ywangd commented Mar 5, 2026

Add a new histogram metric
es.repositories.snapshots.shards.queue_time.histogram that reports how long each shard snapshot spent waiting in queues before its actual operation began. The creation time is stored as a new field. Queue time is computed when the shard is dequeued in BlobStoreRepository.doSnapshotShard. This new field is an internal stat and not exposed to end users, i.e. not available in IndexShardSnapshotStatus.Copy.

Closes ES-14292

Add a new histogram metric
es.repositories.snapshots.shards.queue_time.histogram that
reports how long each shard snapshot spent waiting in queues
before its actual operation began. The creation time is stored
as a negated value in the existing startTimeMillis field of
IndexShardSnapshotStatus, which is then overwritten by
moveToStarted. This lets startTimeMillis > 0 reliably indicate
that the snapshot has started, replacing the previous != 0
check. Queue time is computed when the shard is dequeued in
BlobStoreRepository.doSnapshotShard.

Closes ES-14292
@ywangd ywangd requested a review from nicktindall March 5, 2026 06:55
@ywangd ywangd added >non-issue :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v9.4.0 labels Mar 5, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 5, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Comment on lines +337 to +343
/**
* @param creationTimeMillis the time this status was created, used to compute queue time until the snapshot starts.
* Stored as a negative value in {@code startTimeMillis} so that {@code startTimeMillis > 0}
* reliably indicates that {@link #moveToStarted} has been called.
*/
public static IndexShardSnapshotStatus newInitializing(ShardGeneration generation, long creationTimeMillis) {
return new IndexShardSnapshotStatus(Stage.INIT, -creationTimeMillis, 0L, 0, 0, 0, 0, 0, 0, null, generation, "initializing");
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This avoids adding a new field to just track the queue time. Please let me know if you'd rather prefer a separate field.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a new field, I think it's preferable from a maintainability perspective

Comment on lines +320 to +324
startTimeMillis,
startTimeMillis < 0 ? 0 : startTimeMillis,
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Copy ends up being visible to the users and can go across the wire. So maintaining the field being non-negative seems right. It really is an implementation detail.

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Mar 6, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 6, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🏷️ Required labels (at least one) (2)
  • Team:Delivery
  • Team:Search - Inference

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 503731d8-8a43-4dbc-aa35-6b4959eb9b07

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I think we should have a separate field. I assume that has BWC implications? but still I think it's worthwhile for readability and perhaps it'll become useful for more one day

Comment on lines +337 to +343
/**
* @param creationTimeMillis the time this status was created, used to compute queue time until the snapshot starts.
* Stored as a negative value in {@code startTimeMillis} so that {@code startTimeMillis > 0}
* reliably indicates that {@link #moveToStarted} has been called.
*/
public static IndexShardSnapshotStatus newInitializing(ShardGeneration generation, long creationTimeMillis) {
return new IndexShardSnapshotStatus(Stage.INIT, -creationTimeMillis, 0L, 0, 0, 0, 0, 0, 0, null, generation, "initializing");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add a new field, I think it's preferable from a maintainability perspective

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Mar 11, 2026

Yeah no problem. I updated with a separate field in 2951f9c

@ywangd ywangd requested a review from nicktindall March 11, 2026 06:04
Copy link
Copy Markdown
Contributor

@nicktindall nicktindall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, is zero a good default to use for creationTimeMillis in newDone? I know we won't use the creation time value for any metrics but I don't know what newDone statuses are used for (just wondering if startTime might make more sense?)

@ywangd
Copy link
Copy Markdown
Member Author

ywangd commented Mar 11, 2026

is zero a good default to use for creationTimeMillis in newDone?

newDone creates a Copy used for transport actions. We don't want the new field to be visible to users. It's an internal tracking and also avoids any BWC by not showing it.

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Mar 12, 2026
@elasticsearchmachine elasticsearchmachine merged commit 5152000 into elastic:main Mar 12, 2026
36 checks passed
@ywangd ywangd deleted the ES-14292-snapshot-queue-time-metrics branch March 12, 2026 03:22
michalborek pushed a commit to michalborek/elasticsearch that referenced this pull request Mar 23, 2026
Add a new histogram metric
`es.repositories.snapshots.shards.queue_time.histogram` that reports how
long each shard snapshot spent waiting in queues before its actual
operation began. The creation time is stored as a new field. Queue time
is computed when the shard is dequeued in
`BlobStoreRepository.doSnapshotShard`. This new field is an internal
stat and not exposed to end users, i.e. not available in
`IndexShardSnapshotStatus.Copy`.

Closes ES-14292
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Meta label for distributed team. v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants