Improve lost-increment message in repo analysis by DaveCTurner · Pull Request #131200 · elastic/elasticsearch

DaveCTurner · 2025-07-14T11:28:45Z

Today repository analysis may fail with a message like the following:

[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]

This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that: we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
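
For readers who want to see how an increment can be lost without any individual write failing, here is a minimal sketch. It is not Elasticsearch code: `NonAtomicRegister`, `brokenInterleaving` and `atomicInterleaving` are hypothetical names, and the interleaving is replayed sequentially so the outcome is deterministic. Two clients each report a successful increment, yet the non-atomic register ends up holding 1 rather than 2.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration only; none of these names come from Elasticsearch.
class LostIncrementSketch {

    /** A register whose compare-and-exchange is (incorrectly) two separate steps. */
    static final class NonAtomicRegister {
        private long value;

        long read() { return value; }
        boolean matches(long expected) { return value == expected; }
        void write(long newValue) { value = newValue; }
    }

    /** Returns {successful increments, final register value} for a bad interleaving. */
    static long[] brokenInterleaving() {
        NonAtomicRegister register = new NonAtomicRegister();
        long seenByA = register.read();               // client A reads 0
        long seenByB = register.read();               // client B reads 0
        boolean aPassed = register.matches(seenByA);  // A's compare step passes
        boolean bPassed = register.matches(seenByB);  // B's compare step also passes
        long successes = 0;
        if (aPassed) { register.write(seenByA + 1); successes++; } // A writes 1
        if (bPassed) { register.write(seenByB + 1); successes++; } // B overwrites with 1
        return new long[] { successes, register.read() };
    }

    /** The same two clients, using a genuinely atomic compare-and-set. */
    static long[] atomicInterleaving() {
        AtomicLong register = new AtomicLong();
        long seenByA = register.get();
        long seenByB = register.get();
        long successes = 0;
        if (register.compareAndSet(seenByA, seenByA + 1)) successes++;  // A wins: 0 -> 1
        while (register.compareAndSet(seenByB, seenByB + 1) == false) { // B's stale compare fails,
            seenByB = register.get();                                   // so B re-reads and retries
        }
        successes++;
        return new long[] { successes, register.get() };
    }

    public static void main(String[] args) {
        long[] broken = brokenInterleaving();
        long[] atomic = atomicInterleaving();
        System.out.println("non-atomic: " + broken[0] + " successes, value " + broken[1]);
        System.out.println("atomic:     " + atomic[0] + " successes, value " + atomic[1]);
    }
}
```

In the non-atomic case every write "worked", so hunting for a single lost write would be misleading. The mismatch only appears when the count of successful increments is compared with the final value.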

This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:

[test-repo] Successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
indicate an atomicity failure amongst concurrent
compare-and-exchange operations on registers in this repository.
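
As a reading aid for the byte strings in these messages: the values shown are consistent with the count encoded as an 8-byte big-endian integer, so `0a` is 10 and `09` is 9. The sketch below only illustrates that correspondence. `RegisterBytesSketch`, `toHex` and `fromHex` are hypothetical helpers, and no claim is made that Elasticsearch encodes or renders the register this way internally.

```java
import java.nio.ByteBuffer;

// Hypothetical helpers for reading the hex dumps in the messages above.
class RegisterBytesSketch {

    /** Renders a long as 8 big-endian bytes in space-separated lowercase hex. */
    static String toHex(long value) {
        // ByteBuffer uses big-endian byte order by default.
        byte[] bytes = ByteBuffer.allocate(Long.BYTES).putLong(value).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Parses a space-separated hex dump of up to 8 bytes back into a long. */
    static long fromHex(String hex) {
        return Long.parseUnsignedLong(hex.replace(" ", ""), 16);
    }

    public static void main(String[] args) {
        System.out.println(toHex(10));
        System.out.println(fromHex("00 00 00 00 00 00 00 09"));
    }
}
```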

elasticsearchmachine · 2025-07-14T11:29:09Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-07-14T11:29:10Z

Hi @DaveCTurner, I've created a changelog YAML for you.

ywangd

I have one minor question

ywangd · 2025-07-15T02:56:23Z

...n/java/org/elasticsearch/repositories/blobstore/testkit/analyze/RepositoryAnalyzeAction.java

+                                                        value is [%s], but reading its value with [%s] unexpectedly yielded [%s]. This \
+                                                        anomaly may indicate an atomicity failure amongst concurrent compare-and-exchange \
+                                                        operations on registers in this repository.""",
+                                                    expectedFinalRegisterValue,


This is not the number of increments, right? Do you mean to pass it down from the caller of finalRegisterValueVerifier?

DaveCTurner · 2025-07-15

It is the number of increments, yes: it counts the number of successful responses to a ContendedRegisterAnalyzeAction (each of which does one increment operation). The caller of finalRegisterValueVerifier doesn't know this yet, because this is called before any of the actions have run.

We could simplify this because we always either increment this value or fail the whole analysis (skipping these checks). IIRC that wasn't always the case in some earlier draft and this never got cleaned up.

ywangd · 2025-07-15

OK, I see: the content doubles as the count as well, since it starts from 0 and is incremented by 1 each time. Thanks for explaining.

DaveCTurner · 2025-07-15

I opened #131274
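
The point in this thread, that the verifier is built before any increment has run and so has to look up the expected value later, can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in RepositoryAnalyzeAction: `DeferredExpectationSketch`, `onIncrementSucceeded` and `finalValueVerifier` are hypothetical names, and the failure message is only loosely modelled on the new wording.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongFunction;

// Hypothetical sketch; not the structure of the real RepositoryAnalyzeAction.
class DeferredExpectationSketch {

    // Counts successful increment responses; this doubles as the expected register value.
    private final AtomicLong successfulIncrements = new AtomicLong();

    /** Called once per successful increment response. */
    void onIncrementSucceeded() {
        successfulIncrements.incrementAndGet();
    }

    /**
     * Built before any increments run, so it must not capture the count eagerly.
     * Instead it reads the counter when the final value is checked.
     * Returns an empty Optional on success, or a failure message.
     */
    LongFunction<Optional<String>> finalValueVerifier(String registerName) {
        return actualValue -> {
            long expected = successfulIncrements.get();
            if (actualValue == expected) {
                return Optional.empty();
            }
            return Optional.of(String.format(
                "completed all [%d] increments of register [%s] so its expected value is [%d], "
                    + "but reading its value yielded [%d]",
                expected, registerName, expected, actualValue));
        };
    }

    public static void main(String[] args) {
        DeferredExpectationSketch sketch = new DeferredExpectationSketch();
        LongFunction<Optional<String>> verifier = sketch.finalValueVerifier("test-register");
        for (int i = 0; i < 10; i++) {
            sketch.onIncrementSucceeded();
        }
        System.out.println(verifier.apply(9).orElse("ok"));
    }
}
```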

ywangd

LGTM

DaveCTurner requested a review from ywangd July 14, 2025 11:28

DaveCTurner added the >enhancement, :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), and v9.2.0 labels Jul 14, 2025

elasticsearchmachine added the Team:Distributed Coordination (obsolete) label (Meta label for Distributed Coordination team. Obsolete. Please do not use.) Jul 14, 2025

Update docs/changelog/131200.yaml

f14e1f6

DaveCTurner added 4 commits July 14, 2025 12:57

Add another sentence

7c4cabe

Merge branch 'main' into 2025/07/14/repo-analysis-lost-increment-message

fcb30dc

Synchronization is unnecessary now, only one register op happens here

2c561dc

Merge branch 'main' into 2025/07/14/repo-analysis-lost-increment-message

4edcae9

ywangd reviewed Jul 15, 2025

ywangd approved these changes Jul 15, 2025

DaveCTurner merged commit 6f55796 into elastic:main Jul 15, 2025
33 checks passed

DaveCTurner deleted the 2025/07/14/repo-analysis-lost-increment-message branch July 15, 2025 07:32

DaveCTurner merged 6 commits into elastic:main from DaveCTurner:2025/07/14/repo-analysis-lost-increment-message
