
bug(core): Fixed infinite loop in CommitToDisk #8614

Merged
merged 1 commit into main from harshil/big_key_fix on Feb 2, 2023

Conversation


@harshil-goel harshil-goel commented Jan 18, 2023

Whenever an error happens in CommitToDisk, we retry the function according to x.config.MaxRetries. However, the value was set to -1, which causes the function to get stuck in an infinite loop. This leaves dgraph unable to commit the transaction or to move forward to newer queries.

The data handled by CommitToDisk must actually be pushed to disk. If an error persists, different Alphas in a group can end up with different data, leading to data loss. To avoid such issues, we panic if we are unable to CommitToDisk after 10 retries. Once the Alpha is restarted, and if the underlying issue is fixed, the Alpha starts working again. This way, Alphas won't fail silently and we will know if an issue is occurring.

Fixes: https://github.com/dgraph-io/projects/issues/85
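To make the failure mode concrete, here is a minimal Go sketch of a RetryUntilSuccess-style retry loop; the helper name, signature, and the placement of the panic are assumptions for illustration, not the exact code in the x package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryUntilSuccess is a hypothetical stand-in for an x.RetryUntilSuccess-style
// helper: retry fn until it succeeds or the retry budget runs out. With a
// budget of -1 the counter never reaches zero, so a persistent error spins the
// loop forever -- the infinite loop described above. The fix is a positive
// default (and a panic once the budget is exhausted) so the Alpha fails loudly
// instead of hanging.
func retryUntilSuccess(maxRetries int, wait time.Duration, fn func() error) error {
	var err error
	for retry := maxRetries; retry != 0; retry-- {
		if err = fn(); err == nil {
			return nil
		}
		if wait > 0 {
			time.Sleep(wait)
		}
	}
	return err
}

func main() {
	attempts := 0
	failing := func() error {
		attempts++
		return errors.New("simulated CommitToDisk failure")
	}

	// Bounded budget: stops after 3 attempts, so the caller can panic and restart.
	err := retryUntilSuccess(3, 10*time.Millisecond, failing)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)

	// retryUntilSuccess(-1, 10*time.Millisecond, failing) would never return
	// while the error persists.
}
```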


CLAassistant commented Jan 18, 2023

CLA assistant check
All committers have signed the CLA.

@all-seeing-code all-seeing-code self-requested a review January 18, 2023 16:14
skrdgraph previously approved these changes Jan 18, 2023
x/x.go
@@ -48,7 +48,7 @@ const (
 `client_key=; sasl-mechanism=PLAIN;`
 LimitDefaults = `mutations=allow; query-edge=1000000; normalize-node=10000; ` +
 `mutations-nquad=1000000; disallow-drop=false; query-timeout=0ms; txn-abort-after=5m; ` +
-` max-retries=-1;max-pending-queries=10000`
+` max-retries=3;max-pending-queries=10000`

It seems like a fairly large impact that we now give up on retries after 3 tries. Looking at x.go::commitOrAbort() and x.go::RetryUntilSuccess(), it looks like we will try 3 times, 10ms apart, with this setting.

Will this have a big impact if there is a short-term issue, like a pod going down or some kind of stuck transaction? E.g. I notice that CommitToDisk() waits for the cache.Lock() mutex. I'm worried that some contention will cause transactions to be aborted.

Perhaps more retries with exponential backoff would be safer, but I am not sure.


+1 on the backoff approach. 3x10ms might not be enough time for network hiccups.

There doesn't seem to be a backoff func in the x package, but one could easily be added.
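To make the suggestion concrete, here is a minimal sketch of what such a backoff helper could look like; retryWithBackoff, its signature, and the parameters are hypothetical, not part of the current x package.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff is a hypothetical helper along the lines suggested above:
// retry fn up to maxRetries times, doubling the wait between attempts so that
// short-lived contention (a pod restart, a held mutex) has more time to clear
// than a fixed 10ms pause would allow.
func retryWithBackoff(maxRetries int, baseWait time.Duration, fn func() error) error {
	var err error
	wait := baseWait
	for i := 0; i < maxRetries; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < maxRetries-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 4 {
			return errors.New("temporary contention")
		}
		return nil
	}

	// Five attempts with a 10ms base wait allow up to 10+20+40+80 = 150ms of
	// backoff before giving up, versus roughly 30ms total with three fixed
	// 10ms waits.
	if err := retryWithBackoff(5, 10*time.Millisecond, flaky); err != nil {
		fmt.Println("gave up:", err)
	} else {
		fmt.Printf("succeeded after %d attempts\n", calls)
	}
}
```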

@damonfeldman damonfeldman commented Jan 20, 2023
I happened to see this comment in oracle.go, which suggests we need to retry aggressively in some cases, but I'm not sure whether it is related, or whether oracle.go::proposeAndWait() calls the disk write with retry.

// NOTE: It is important that we continue retrying proposeTxn until we succeed. This should
// happen, irrespective of what the user context timeout might be. We check for it before
// reaching this stage, but now that we're here, we have to ensure that the commit proposal goes
// through. Otherwise, we should block here forever. If we don't do this, we'll see txn
// violations in Jepsen, because we'll send out a MaxAssigned higher than a commit, which would
// cause newer txns to see older data.
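
For contrast, here is a minimal sketch of the "retry until success" pattern that comment describes, where the proposal deliberately ignores the caller's deadline; proposeOnce and the error text are hypothetical stand-ins, not the actual oracle.go code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// proposeUntilSuccess illustrates the pattern described in the oracle.go
// comment: once a commit proposal reaches this stage it is retried forever,
// on a background context, so a caller timeout can no longer abandon the
// commit. proposeOnce is a hypothetical stand-in for the real proposal call.
func proposeUntilSuccess(proposeOnce func(ctx context.Context) error) {
	ctx := context.Background() // detach from any user deadline
	for attempt := 1; ; attempt++ {
		err := proposeOnce(ctx)
		if err == nil {
			return
		}
		fmt.Printf("proposal attempt %d failed: %v; retrying\n", attempt, err)
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	calls := 0
	proposeUntilSuccess(func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("leader unavailable")
		}
		return nil
	})
	fmt.Println("commit proposal applied after", calls, "attempts")
}
```

This is the opposite trade-off from CommitToDisk above: the commit proposal must never be dropped, whereas CommitToDisk now prefers to panic the Alpha after a bounded number of attempts.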

@mangalaman93 mangalaman93 force-pushed the aman/65k branch 5 times, most recently from 7c1883e to 5885e5d on January 25, 2023 17:55
@mangalaman93 mangalaman93 force-pushed the aman/65k branch 3 times, most recently from bb0be52 to 397bac0 on January 31, 2023 09:33
Base automatically changed from aman/65k to main January 31, 2023 16:50

coveralls commented Feb 1, 2023

Coverage Status

Changes Unknown when pulling 73e4a1b on harshil/big_key_fix into main.

mangalaman93 previously approved these changes Feb 2, 2023
@harshil-goel harshil-goel merged commit f5f49da into main Feb 2, 2023
@harshil-goel harshil-goel deleted the harshil/big_key_fix branch February 2, 2023 14:01
all-seeing-code pushed a commit that referenced this pull request Feb 8, 2023
all-seeing-code pushed a commit that referenced this pull request Feb 8, 2023