Roll up exceeded value size and cluster hung #4733
Is there anybody who could help me?
Hey @JimWen, I am checking this.
Thank you @ashish-goswami
The data is user behavior in the app, like click/chat/signin/signup etc. So the schema is like type Action{ … and the graph is like … So, for a single user, there can be many from-action pairs, i.e. many entries for a single subject, predicate combination.
I got nothing, it just hung. I checked the log again; the following may be useful. Alpha 1 prints endless log lines like:
The log says "Value exceeded", so is the KV here stored as key (subject, predicate combination) and value (many entries)?
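As an aside on the numbers involved: the limit in the error log further down, 1073741823 bytes, is 2^30 − 1, just under 1 GiB, while the rolled-up value for a single key was about 1.44 GB. A tiny, hypothetical Go sketch (the two constants are taken from the log; nothing here is Dgraph's actual code) makes the mismatch concrete:

```go
package main

import "fmt"

// maxValueSize mirrors the limit seen in the error log:
// 1073741823 bytes = 2^30 - 1, i.e. just under 1 GiB.
const maxValueSize = 1<<30 - 1

func main() {
	fmt.Println(maxValueSize) // 1073741823, the number in the log
	// The rolled-up value for this (subject, predicate) key was ~1.44 GB,
	// so the write is rejected:
	fmt.Println(1439997117 > maxValueSize) // true
}
```

This is why a single key holding every entry for one subject–predicate combination eventually fails once enough entries accumulate.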
@danielmai : Let's test the posting list split for Jepsen and re-introduce it, and release a patch fix.
@JimWen : You can revert back to an earlier version, that'd help get your cluster back up and running.
Dgraph v1.2.1 patch fix is out with posting list splits enabled. You can upgrade your cluster to this release. Large posting lists will be split during rollups. https://github.com/dgraph-io/dgraph/releases/v1.2.1
Thank you very much, I will give it a try @danielmai
I'll close this issue as the patch fix v1.2.1 is released. Please reopen if you're still seeing this issue. |
The problem is still there. I'm wondering what the value that exceeded the limit actually is, and what the split in the fix release actually does.
@JimWen Did you do an export/import of your v1.2.0 database into a new v1.2.1 database? When you load the data into a new v1.2.1 Dgraph instance the posting lists would be split when rollups happen. You can configure how often snapshots (and rollups) happen via the
@danielmai Thank you. Considering the dataset is very big, I just re-bulk-loaded the whole data; export/import may be very slow in this situation?
Bad news: when I export data there is also a value-exceeded error, and it is the same alpha. So maybe we can say this problem is caused by bulk load, since bulk load does not do any split. The log is as follows:
I0211 11:50:13.031266 228800 log.go:34] Export Created batch of size: 203 MB in 15.240465535s.
I also encountered the same issue and have updated to the latest version, but the issue has not been resolved.
Hey @FelixHolmes, we'll verify this from our end and reply on this thread.
If the list is too big and is not split before writing it to disk, we could end up in the situation described in #4733. So the posting lists should be split before writing them to disk.
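The splitting idea described above can be sketched as follows. This is a hedged illustration, not Dgraph's implementation: splitList, the byte-slice postings, and the size threshold are all hypothetical stand-ins for the real posting-list encoding.

```go
package main

import "fmt"

// splitList breaks one oversized posting list into parts that each stay
// under a maximum on-disk value size, so no single part can trip the
// value-size limit during a rollup.
func splitList(postings [][]byte, maxPartSize int) [][][]byte {
	var parts [][][]byte
	var cur [][]byte
	size := 0
	for _, p := range postings {
		// Start a new part when adding this posting would exceed the cap.
		if size+len(p) > maxPartSize && len(cur) > 0 {
			parts = append(parts, cur)
			cur, size = nil, 0
		}
		cur = append(cur, p)
		size += len(p)
	}
	if len(cur) > 0 {
		parts = append(parts, cur)
	}
	return parts
}

func main() {
	// Ten 40-byte postings with a 100-byte cap: two postings per part.
	postings := make([][]byte, 10)
	for i := range postings {
		postings[i] = make([]byte, 40)
	}
	parts := splitList(postings, 100)
	fmt.Println(len(parts)) // 5
}
```

Each part would then be stored under its own key, which is the essence of the "posting list splits" enabled in v1.2.1.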
We are experiencing a similar issue: one of the Alpha nodes was restarted (not sure if this was OOM or something else), but since then it cannot sync with the cluster; see the log entries below.
It runs under k8s orchestration, so we assume that k8s restarts the node as it becomes unresponsive. Any idea why it can't sync with the rest of the Dgraph cluster nodes? And what else can we do, apart from exporting the data and setting up a new cluster from scratch?
I ran the following commands; if you would like to see their output, please let me know.
When I performed the following read-only query against that broken node,
I got this response.
Does it mean that the data is corrupted on disk?
What version of Dgraph are you using?
Have you tried reproducing the issue with the latest release?
No, it seems the code is the same.
What is the hardware spec (RAM, OS)?
128 GB RAM & 1.8 TB SSD
Linux version 3.10.0-1062.9.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019
Steps to reproduce the issue (command/config used to run Dgraph).
Expected behaviour and actual result.
At first the cluster works fine; after about 30 minutes, no mutation or query can be applied, and it seems the whole cluster is hung.
alpha 3 log
I0205 17:05:30.041988 258429 log.go:34] Rolling up Time elapsed: 01m43s, bytes sent: 3.1 GB, speed: 30 MB/sec
I0205 17:05:31.042005 258429 log.go:34] Rolling up Time elapsed: 01m44s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:32.041998 258429 log.go:34] Rolling up Time elapsed: 01m45s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:33.042004 258429 log.go:34] Rolling up Time elapsed: 01m46s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:34.041991 258429 log.go:34] Rolling up Time elapsed: 01m47s, bytes sent: 3.1 GB, speed: 29 MB/sec
I0205 17:05:35.041990 258429 log.go:34] Rolling up Time elapsed: 01m48s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:36.041991 258429 log.go:34] Rolling up Time elapsed: 01m49s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:37.041989 258429 log.go:34] Rolling up Time elapsed: 01m50s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:38.048405 258429 log.go:34] Rolling up Time elapsed: 01m51s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:39.052383 258429 log.go:34] Rolling up Time elapsed: 01m52s, bytes sent: 3.1 GB, speed: 27 MB/sec
I0205 17:05:40.042005 258429 log.go:34] Rolling up Time elapsed: 01m53s, bytes sent: 3.1 GB, speed: 27 MB/sec
E0205 17:06:55.871480 258429 draft.go:442] Error while rolling up lists at 18061: Value with size 1439997117 exceeded 1073741823 limit. Value:
00000000 0a b7 b9 d2 ae 05 12 13 08 83 80 80 80 80 80 80 |................|
00000010 80 10 12 05 00 00 00 00 00 18 01 12 13 08 84 80 |................|
00000020 80 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 |................|
00000030 12 13 08 87 80 80 80 80 80 80 80 10 12 05 00 00 |................|
00000040 00 00 00 18 01 12 13 08 88 80 80 80 80 80 80 80 |................|
00000050 10 12 05 00 00 00 00 00 18 01 12 13 08 8b 80 80 |................|
00000060 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 12 |................|
00000070 13 08 92 80 80 80 80 80 80 80 10 12 05 00 00 00 |................|
00000080 00 00 18 01 12 13 08 95 80 80 80 80 80 80 80 10 |................|
alpha 1 log
I0205 17:04:21.971401 7770 log.go:34] Rolling up Created batch of size: 941 kB in 75.65334ms.
I0205 17:04:22.103731 7770 log.go:34] Rolling up Created batch of size: 929 kB in 76.538732ms.
I0205 17:04:22.171771 7770 log.go:34] Rolling up Created batch of size: 925 kB in 66.394744ms.
I0205 17:04:22.257937 7770 log.go:34] Rolling up Created batch of size: 927 kB in 66.859031ms.
I0205 17:04:22.334075 7770 log.go:34] Rolling up Created batch of size: 932 kB in 74.420747ms.
I0205 17:04:22.417214 7770 log.go:34] Rolling up Created batch of size: 929 kB in 68.814851ms.
I0205 17:04:23.277135 7770 log.go:34] Rolling up Time elapsed: 37s, bytes sent: 328 MB, speed: 8.9 MB/sec
I0205 17:04:23.601378 7770 log.go:34] Rolling up Time elapsed: 38s, bytes sent: 328 MB, speed: 8.6 MB/sec
I0205 17:04:24.601421 7770 log.go:34] Rolling up Time elapsed: 39s, bytes sent: 328 MB, speed: 8.4 MB/sec
I0205 17:04:25.403778 7770 log.go:34] Rolling up Created batch of size: 884 kB in 59.479535ms.
I0205 17:04:25.601506 7770 log.go:34] Rolling up Time elapsed: 40s, bytes sent: 329 MB, speed: 8.2 MB/sec
I0205 17:22:45.611681 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:23:45.620155 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:24:45.623038 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:25:45.627988 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:26:45.633200 7770 draft.go:1353] Skipping snapshot at index: 17145. Insufficient discard entries: 1. MinPendingStartTs: 23689
I0205 17:26:45.633257 7770 draft.go:1194] Found 1 old transactions. Acting to abort them.
I0205 17:26:45.633781 7770 draft.go:1155] TryAbort 1 txns with start ts. Error:
I0205 17:26:45.633795 7770 draft.go:1178] TryAbort selectively proposing only aborted txns: txns:<start_ts:23689 >
I0205 17:26:45.634236 7770 draft.go:1197] Done abortOldTransactions for 1 txns. Error:
I0205 17:29:45.819500 7770 draft.go:403] Creating snapshot at index: 20373. ReadTs: 26306.