
bug(core): Fixed infinite loop in CommitToDisk #8614

Merged
merged 1 commit into main from harshil/big_key_fix on Feb 2, 2023

Conversation


@harshil-goel harshil-goel commented Jan 18, 2023

Whenever an error happens in CommitToDisk, we retry the function according to x.config.MaxRetries. However, the value was set to -1, which causes the function to get stuck in an infinite loop. This leaves dgraph unable to commit the transaction or to move forward to newer queries.

The data handled by CommitToDisk must actually be pushed to disk. If an error persists, different Alphas in a group can end up with different data, leading to data loss. To avoid such issues, we panic if we are unable to CommitToDisk after 10 retries. Once the Alpha is restarted, and if the underlying issue is fixed, the Alpha starts working again. This way, Alphas won't fail silently and we will know if an issue is occurring.

Fixes: https://github.com/dgraph-io/projects/issues/85
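To make the failure mode concrete, here is a minimal Go sketch of a RetryUntilSuccess-style retry loop; the helper name, signature, and the placement of the panic are assumptions for illustration, not the exact code in the x package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntilSuccess is a hypothetical stand-in for an x.RetryUntilSuccess-style
// helper: retry fn until it succeeds or the retry budget runs out. With a
// budget of -1 the counter never reaches zero, so a persistent error spins the
// loop forever -- the infinite loop described above. The fix is a positive
// default (and a panic once the budget is exhausted) so the Alpha fails loudly
// instead of hanging.
func retryUntilSuccess(maxRetries int, wait time.Duration, fn func() error) error {
	var err error
	for retry := maxRetries; retry != 0; retry-- {
		if err = fn(); err == nil {
			return nil
		}
		if wait > 0 {
			time.Sleep(wait)
		}
	}
	return err
}

func main() {
	attempts := 0
	failing := func() error {
		attempts++
		return errors.New("simulated CommitToDisk failure")
	}

	// Bounded budget: stops after 3 attempts, so the caller can panic and restart.
	err := retryUntilSuccess(3, 10*time.Millisecond, failing)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)

	// retryUntilSuccess(-1, 10*time.Millisecond, failing) would never return
	// while the error persists.
}
```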


CLAassistant commented Jan 18, 2023

CLA assistant check
All committers have signed the CLA.

@all-seeing-code all-seeing-code self-requested a review January 18, 2023 16:14
skrdgraph previously approved these changes Jan 18, 2023
x/x.go
@@ -48,7 +48,7 @@ const (
 `client_key=; sasl-mechanism=PLAIN;`
 LimitDefaults = `mutations=allow; query-edge=1000000; normalize-node=10000; ` +
 `mutations-nquad=1000000; disallow-drop=false; query-timeout=0ms; txn-abort-after=5m; ` +
-` max-retries=-1;max-pending-queries=10000`
+` max-retries=3;max-pending-queries=10000`

It seems like a fairly large impact that we now give up on retries after 3 tries. Looking at x.go::commitOrAbort() and x.go::RetryUntilSuccess(), it looks like we will try 3 times, 10ms apart, with this setting.

Will this have a big impact if there is a short-term issue, like a pod going down or some kind of stuck transaction? E.g. I notice that CommitToDisk() waits for the cache.Lock() mutex. I'm worried that some contention will cause transactions to be aborted.

Perhaps more retries with exponential backoff would be safer, but I am not sure.


+1 on the backoff approach. 3x10ms might not be enough time for network hiccups.

There doesn't seem to be a backoff func in the x package, but one could easily be added.
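To make the suggestion concrete, here is a minimal sketch of what such a backoff helper could look like; retryWithBackoff, its signature, and the parameters are hypothetical, not part of the current x package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff is a hypothetical helper along the lines suggested above:
// retry fn up to maxRetries times, doubling the wait between attempts so that
// short-lived contention (a pod restart, a held mutex) has more time to clear
// than a fixed 10ms pause would allow.
func retryWithBackoff(maxRetries int, baseWait time.Duration, fn func() error) error {
	var err error
	wait := baseWait
	for i := 0; i < maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < maxRetries-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 4 {
			return errors.New("temporary contention")
		}
		return nil
	}

	// Five attempts with a 10ms base wait allow up to 10+20+40+80 = 150ms of
	// backoff before giving up, versus roughly 30ms total with three fixed
	// 10ms waits.
	if err := retryWithBackoff(5, 10*time.Millisecond, flaky); err != nil {
		fmt.Println("gave up:", err)
	} else {
		fmt.Printf("succeeded after %d attempts\n", calls)
	}
}
```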

@damonfeldman damonfeldman commented Jan 20, 2023
I happened to see this comment in oracle.go, which suggests we need to retry aggressively in some cases, but I'm not sure whether it is related, or whether oracle.go::proposeAndWait() calls the disk write with retry.

// NOTE: It is important that we continue retrying proposeTxn until we succeed. This should
// happen, irrespective of what the user context timeout might be. We check for it before
// reaching this stage, but now that we're here, we have to ensure that the commit proposal goes
// through. Otherwise, we should block here forever. If we don't do this, we'll see txn
// violations in Jepsen, because we'll send out a MaxAssigned higher than a commit, which would
// cause newer txns to see older data.
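
For contrast, here is a minimal sketch of the "retry until success" pattern that comment describes, where the proposal deliberately ignores the caller's deadline; proposeOnce and the error text are hypothetical stand-ins, not the actual oracle.go code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// proposeUntilSuccess illustrates the pattern described in the oracle.go
// comment: once a commit proposal reaches this stage it is retried forever,
// on a background context, so a caller timeout can no longer abandon the
// commit. proposeOnce is a hypothetical stand-in for the real proposal call.
func proposeUntilSuccess(proposeOnce func(ctx context.Context) error) {
	ctx := context.Background() // detach from any user deadline
	for attempt := 1; ; attempt++ {
		err := proposeOnce(ctx)
		if err == nil {
			return
		}
		fmt.Printf("proposal attempt %d failed: %v; retrying\n", attempt, err)
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	calls := 0
	proposeUntilSuccess(func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("leader unavailable")
		}
		return nil
	})
	fmt.Println("commit proposal applied after", calls, "attempts")
}
```

This is the opposite trade-off from CommitToDisk above: the commit proposal must never be dropped, whereas CommitToDisk now prefers to panic the Alpha after a bounded number of attempts.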

@mangalaman93 mangalaman93 force-pushed the aman/65k branch 5 times, most recently from 7c1883e to 5885e5d on January 25, 2023 17:55
@mangalaman93 mangalaman93 force-pushed the aman/65k branch 3 times, most recently from bb0be52 to 397bac0 on January 31, 2023 09:33
Base automatically changed from aman/65k to main January 31, 2023 16:50

coveralls commented Feb 1, 2023

Coverage Status

Changes Unknown when pulling 73e4a1b on harshil/big_key_fix into main.

mangalaman93 previously approved these changes Feb 2, 2023
@harshil-goel harshil-goel merged commit f5f49da into main Feb 2, 2023
@harshil-goel harshil-goel deleted the harshil/big_key_fix branch February 2, 2023 14:01
all-seeing-code pushed a commit that referenced this pull request Feb 8, 2023
all-seeing-code pushed a commit that referenced this pull request Feb 8, 2023