bug(core): Fixed infinite loop in CommitToDisk #8614
Conversation
worker/server_state.go (Outdated)

@@ -48,7 +48,7 @@ const (
 	`client_key=; sasl-mechanism=PLAIN;`
 	LimitDefaults = `mutations=allow; query-edge=1000000; normalize-node=10000; ` +
 		`mutations-nquad=1000000; disallow-drop=false; query-timeout=0ms; txn-abort-after=5m; ` +
-		` max-retries=-1;max-pending-queries=10000`
+		` max-retries=3;max-pending-queries=10000`
Giving up on retries after only 3 attempts seems like it could have a larger impact. Looking at x.go::commitOrAbort() and x.go::RetryUntilSuccess(), it looks like with this setting we will try 3 times, 10ms apart.
Will this have a big impact if there is a short-term issue like a pod going down or some kind of stuck transaction? E.g. I notice that CommitToDisk() waits on the cache.Lock() mutex. I'm worried that contention there could cause transactions to be aborted.
Perhaps more retries with exponential backoff would be safer, but I am not sure.
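For reference, roughly what that fixed-interval retry behaviour looks like. This is an illustrative sketch only, not the actual x package code, and the name/signature here are assumptions:

```go
package x

import "time"

// retryUntilSuccess is a sketch of a fixed-interval retry loop. With
// max-retries=3 and a 10ms wait, a transient failure gets at most
// roughly 30ms of grace before the caller gives up.
func retryUntilSuccess(maxRetries int, waitAfterFailure time.Duration, fn func() error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if waitAfterFailure > 0 {
			time.Sleep(waitAfterFailure)
		}
	}
	return err
}
```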
+1 on the backoff approach. 3x10ms might not be enough time for network hiccups.
There doesn't seem to be a backoff func in the x package, but one could easily be added.
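Something along these lines, for example. This is a hypothetical helper (it does not exist in the x package today), so the name, signature, and constants are all illustrative:

```go
package x

import "time"

// retryWithBackoff doubles the wait between attempts, capped at maxWait,
// so a short outage (pod restart, lock contention) gets more total grace
// time than a fixed 3x10ms schedule would allow.
func retryWithBackoff(maxRetries int, baseWait, maxWait time.Duration, fn func() error) error {
	wait := baseWait
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
	return err
}
```

With something like `retryWithBackoff(10, 10*time.Millisecond, 2*time.Second, commitFn)`, a transient failure would get a couple of seconds of cumulative grace instead of ~30ms.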
I happened to just see this comment in oracle.go, which suggests we need to retry aggressively in some cases. I'm not sure whether it's related, though, or whether oracle::proposeAndWait() calls the disk write with retries at all.
// NOTE: It is important that we continue retrying proposeTxn until we succeed. This should
// happen, irrespective of what the user context timeout might be. We check for it before
// reaching this stage, but now that we're here, we have to ensure that the commit proposal goes
// through. Otherwise, we should block here forever. If we don't do this, we'll see txn
// violations in Jepsen, because we'll send out a MaxAssigned higher than a commit, which would
// cause newer txns to see older data.
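Illustratively, the pattern that comment describes looks roughly like the sketch below. The function name and the 10ms delay are assumptions, not the actual dgraph code:

```go
package worker

import "time"

// proposeTxnForever keeps retrying the commit proposal even if the
// caller's context has already expired, because giving up here could let
// MaxAssigned advance past an uncommitted transaction and expose older
// data to newer transactions.
func proposeTxnForever(propose func() error) {
	for {
		if err := propose(); err == nil {
			return
		}
		// Deliberately no maximum attempt count: blocking here forever is
		// preferred over breaking transaction guarantees.
		time.Sleep(10 * time.Millisecond)
	}
}
```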
Whenever an error happens in CommitToDisk, we retry the function according to x.config.MaxRetries. However, the value was set to -1, which causes the function to get stuck in an infinite loop. This leaves dgraph unable to commit the transaction or to move forward to newer queries.
The data handled by CommitToDisk must actually be pushed to disk: if an error is silently ignored, different alphas in a group can end up with different data, leading to data loss. To avoid such issues, we now panic if CommitToDisk still fails after 10 retries. Once the alpha is restarted, and if the underlying issue is resolved, the alpha will start working again. This way, alphas won't fail silently and we will know whenever such an issue occurs.
Fixes: https://github.com/dgraph-io/projects/issues/85
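A minimal sketch of the retry-then-panic behaviour described above. The wrapper, names, and delay are illustrative, not the actual code from this PR:

```go
package worker

import (
	"fmt"
	"time"
)

// maxCommitRetries bounds how long we keep retrying a failed disk commit.
const maxCommitRetries = 10

// commitToDiskWithRetries retries a commit a bounded number of times and
// then panics, so the alpha restarts instead of silently diverging from
// the rest of its group.
func commitToDiskWithRetries(commit func() error) {
	var err error
	for i := 0; i < maxCommitRetries; i++ {
		if err = commit(); err == nil {
			return
		}
		time.Sleep(10 * time.Millisecond)
	}
	// Failing loudly is safer than retrying forever or skipping the write:
	// a restarted alpha can recover, divergent group data cannot.
	panic(fmt.Sprintf("CommitToDisk failed after %d retries: %v", maxCommitRetries, err))
}
```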