Watch dropping an event when compacting on delete #18089
cc @MadhavJivrajani @siyuanfoundation @fuweid Can someone take a look and double-check my findings?
/assign
I re-ran the report on the latest version of #17833 and it found that linearizability was broken. Looks like a false positive due to some bug.
OK, I encountered another case. Failpoint used:
Report: 1717749910517706368.tar.gz. Found when adding compaction to operations: serathius@5959110. Logs showing the broken watch guarantee due to a missing event:
Screenshot showing a transaction for revision 141 and the fact that key "key6" was present, so the delete really happened. Watch response showing the missing delete event. Complete log:
@fuweid I think this is related to #17780. See the following logs:
Looks like we lost the first delete after restoring from the "last compact revision".
Yes. Let me try to reproduce it locally. I will update later.
I managed to reproduce it without a multi-op TXN, just a single delete request. This extends the impact to Kubernetes. Again, the common thread is that etcd misses a watch event even though it restores the kvstore at a revision before the delete. In the last repro:
etcd bootstrap log:
Something similar reproduced on v3.4, however this time it was long after the crash, and it properly flagged the broken resumable guarantee. Which just means it delivered an incorrect event on a newly opened watch that provided a revision.
For this one, the tombstone of
Correct. :( Currently, the watcher isn't able to send out all the possible delete events.
Did the client side receive an ErrCompacted? If yes, then it should be expected behaviour.
No. If the client had received one, it would be recorded in the report.
@fuweid, I don't understand the statement. A distributed system needs to uphold its guarantees even if a node is down. Events in the WAL used for replay should match what users observe. For watch, the client should observe either the full history or get ErrCompacted, as @ahrtr mentioned.
One thing I noticed: the WatchChan is not broken during etcd downtime, so the downtime is short enough that the etcd client code retries the request transparently. There might be a bug there too. There is still a possibility of a bug in recording client responses; it is a little more complicated to record the history of a watch than of KV requests.
etcd/tests/robustness/client/client.go Lines 268 to 304 in 8a0054f
@fuweid Please feel free to ping me on Slack if you want a quick discussion on this issue.
It's expected behaviour and by design, as long as etcdserver doesn't return any error (e.g. ErrCompacted).
I added logs to ensure that the issue is not in the robustness test itself, and got: They clearly show that revision 125 was not delivered to the client.
Where did you add the log? I also suggest providing a clear summary of the exact steps you took.
Code #18145
Disable robustness test detection of #18089 to allow detecting other issues
Signed-off-by: Marek Siarkowicz <[email protected]>
Revert "Disable robustness test detection of #18089 to allow detecting other issues"
This reverts commit 4fe227c.
Signed-off-by: Wei Fu <[email protected]>
Awesome work @fuweid !
Let's close this issue and create a dedicated issue for robustness testing.
1) Use the SleepBeforeSendWatchResponse failpoint to simulate a slow watch.
2) Decrease the compact period from 200ms to 100ms to increase the probability of compacting on a delete.
3) Introduce a new traffic pattern of 50/50 Put and Delete.
With these three changes the `make test-robustness-issue18089` command can reproduce issue 18089.
Signed-off-by: Jiayin Mao <[email protected]>
Bug report criteria
What happened?
When testing #17833 I encountered watch breaking the reliable guarantee.
Report included
1716928418720891814.zip
What did you expect to happen?
Watch should not break its guarantees.
How can we reproduce it (as minimally and precisely as possible)?
TODO
Anything else we need to know?
No response
Etcd version (please run commands below)
v3.5
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
Relevant log output