Improve lost-increment message in repo analysis#131200
Improve lost-increment message in repo analysis#131200DaveCTurner merged 6 commits intoelastic:mainfrom
Conversation
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
|
Hi @DaveCTurner, I've created a changelog YAML for you. |
| value is [%s], but reading its value with [%s] unexpectedly yielded [%s]. This \ | ||
| anomaly may indicate an atomicity failure amongst concurrent compare-and-exchange \ | ||
| operations on registers in this repository.""", | ||
| expectedFinalRegisterValue, |
There was a problem hiding this comment.
This is not the number of increments, right? Do you mean to pass it down from the caller of finalRegisterValueVerifier?
There was a problem hiding this comment.
It is the number of increments, yes: it counts the number of successful responses to a ContendedRegisterAnalyzeAction (each of which does one increment operation). The caller of finalRegisterValueVerifier doesn't know this yet, because this is called before any of the actions have run.
We could simplify this because we always either increment this value or fail the whole analysis (skipping these checks). IIRC that wasn't always the case in some earlier draft and this never got cleaned up.
There was a problem hiding this comment.
OK I see the content is doubled as the count as well since it starts from 0 and increment 1 each time. Thanks for explaining.
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] Successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
indicate an atomicity failure amongst concurrent
compare-and-exchange operations on registers in this repository.
Today repository analysis may fail with a message like the following:
[test-repo] register [test-register-contended-F_NNXHrSSDGveoeyj1skwg]
should have value [10] but instead had value
[OptionalBytesReference[00 00 00 00 00 00 00 09]]
This is confusing because one might interpret `should have value [10]`
as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write:
[test-repo] Successfully completed all [10] atomic increments of
register [test-register-contended-F_NNXHrSSDGveoeyj1skwg] so its
expected value is [OptionalBytesReference[00 00 00 00 00 00 00 0a]],
but reading its value with [getRegister] unexpectedly yielded
[OptionalBytesReference[00 00 00 00 00 00 00 09]]. This anomaly may
indicate an atomicity failure amongst concurrent
compare-and-exchange operations on registers in this repository.
Today repository analysis may fail with a message like the following:
This is confusing because one might interpret
should have value [10]as an indication that Elasticsearch definitely wrote this value to the
register, leaving you trying to work out how that particular write was
lost. In fact it can be more subtle than that, we only believe the
register blob should have this value because we know we completed 10
supposedly-atomic increment operations, and the failure could instead be
that these operations are not as atomic as they need to be and that one
or more of the increments was lost.
This commit makes the message more verbose, clarifying that this failure
could be an atomicity problem rather than a simple lost write: