Roll up exceeded value size and alpha crash #4752

Closed
JimWen opened this issue Feb 10, 2020 · 7 comments

Comments

@JimWen

JimWen commented Feb 10, 2020

What version of Dgraph are you using?

  • Dgraph version : v1.2.1
  • Dgraph SHA-256 : 3f18ff84570b2944f4d75f6f508d55d902715c7ca2310799cc2991064eb046f8
  • Commit SHA-1 : ddcda92
  • Commit timestamp : 2020-02-06 15:31:05 -0800
  • Branch : HEAD
  • Go version : go1.13.5

Have you tried reproducing the issue with the latest release?

yes

What is the hardware spec (RAM, OS)?

128 GB RAM & 1.8 TB SSD

Linux version 3.10.0-1062.9.1.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Fri Dec 6 15:49:49 UTC 2019

Steps to reproduce the issue (command/config used to run Dgraph).

Same as issue #4733 with Dgraph.

Expected behaviour and actual result.

The exceeded-value error is still there. The cluster no longer hangs, but one Alpha crashes after about 30 minutes. The log is as follows:

2020/02/10 16:46:34 Key not found
github.com/dgraph-io/badger/v2.init
        /tmp/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/errors.go:36
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5222
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.doInit
        /usr/local/go/src/runtime/proc.go:5217
runtime.main
        /usr/local/go/src/runtime/proc.go:190
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1357

github.com/dgraph-io/dgraph/x.Check
        /tmp/go/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/posting.(*List).rollup
        /tmp/go/src/github.com/dgraph-io/dgraph/posting/list.go:834
github.com/dgraph-io/dgraph/posting.(*List).Rollup
        /tmp/go/src/github.com/dgraph-io/dgraph/posting/list.go:710
github.com/dgraph-io/dgraph/worker.(*node).rollupLists.func3
        /tmp/go/src/github.com/dgraph-io/dgraph/worker/draft.go:1081
github.com/dgraph-io/badger/v2.(*Stream).produceKVs.func1
        /tmp/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/stream.go:191
github.com/dgraph-io/badger/v2.(*Stream).produceKVs
        /tmp/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/stream.go:234
github.com/dgraph-io/badger/v2.(*Stream).Orchestrate.func1
        /tmp/go/pkg/mod/github.com/dgraph-io/badger/[email protected]/stream.go:337
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1357

Something important

I'm wondering what the value that exceeded the limit actually is, and what the released split fix actually does.

The edges here grow very quickly, about 1 billion per day. When I checked the source code, I found that rollups run at a fixed interval of 5 minutes. Is that reasonable here?
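(For context, a minimal sketch of what a fixed-interval rollup trigger looks like in general; the 5-minute interval and the doRollup callback are assumptions for illustration, not Dgraph's actual code.)

```go
package main

import (
	"log"
	"time"
)

// rollupLoop is a hypothetical sketch of a rollup driven by a fixed ticker.
// The 5-minute interval and the doRollup callback are illustrative only.
func rollupLoop(stop <-chan struct{}, doRollup func() error) {
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// A fixed interval means the amount of data rolled up per tick
			// grows with the write rate; nothing here adapts to list size.
			if err := doRollup(); err != nil {
				log.Printf("rollup failed: %v", err)
			}
		case <-stop:
			return
		}
	}
}

func main() {
	stop := make(chan struct{})
	go rollupLoop(stop, func() error {
		log.Println("rolling up posting lists")
		return nil
	})
	time.Sleep(2 * time.Second) // keep the example alive briefly
	close(stop)
}
```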

@JimWen
Author

JimWen commented Feb 11, 2020

Today I tried skipping the bulk load and just importing data through the API, and it works fine. So, is this problem caused by the bulk-loaded data?

@martinmr
Contributor

Thanks for the update. It looks like the list is too long and becomes too big to be rolled up. I am not sure where the error is coming from, but I assume it happens when writing data. The fact that the issue goes away when you don't use the bulk loader supports this theory, because splits are not done when loading data with the bulk loader.

Just for reference: splitting a posting list means writing it as multiple key-value pairs in Badger. It's meant to prevent the kind of issues that arise when a posting list becomes too big.
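(To illustrate the idea only, here is a minimal sketch, not Dgraph's actual split code, of writing one logical value as several Badger key-value pairs, each part stored under the base key plus a part-number suffix. The key layout, part size, and function names are assumptions for this sketch.)

```go
package main

import (
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

// writeSplit stores one logical value as several parts, each under
// baseKey plus an 8-byte big-endian part number. Illustrative only.
func writeSplit(db *badger.DB, baseKey, value []byte, partSize int) error {
	return db.Update(func(txn *badger.Txn) error {
		for i, part := 0, 0; i < len(value); part++ {
			end := i + partSize
			if end > len(value) {
				end = len(value)
			}
			key := make([]byte, len(baseKey)+8)
			copy(key, baseKey)
			binary.BigEndian.PutUint64(key[len(baseKey):], uint64(part))
			if err := txn.Set(key, value[i:end]); err != nil {
				return err
			}
			i = end
		}
		return nil
	})
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-split-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a 3 MB value as 1 MB parts instead of one oversized entry.
	big := make([]byte, 3<<20)
	if err := writeSplit(db, []byte("posting/0x01"), big, 1<<20); err != nil {
		log.Fatal(err)
	}
}
```

Reading the list back would then mean iterating the parts in key order and concatenating them.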

@danielmai
Contributor

@JimWen Can we get a copy of your dataset to reproduce this issue? Feel free to email me privately (see my GitHub profile email).

@JimWen
Author

JimWen commented Feb 11, 2020

Thank you @martinmr.
Yes, that's true. After bulk loading there are about 20 billion edges, and no exception happened before I started importing data through the API. I also noticed something strange:

1. When importing data through the API after bulk loading, the log is as follows:

I0205 17:05:38.048405 258429 log.go:34] Rolling up Time elapsed: 01m51s, bytes sent: 3.1 GB, speed: 28 MB/sec
I0205 17:05:39.052383 258429 log.go:34] Rolling up Time elapsed: 01m52s, bytes sent: 3.1 GB, speed: 27 MB/sec
I0205 17:05:40.042005 258429 log.go:34] Rolling up Time elapsed: 01m53s, bytes sent: 3.1 GB, speed: 27 MB/sec
E0205 17:06:55.871480 258429 draft.go:442] Error while rolling up lists at 18061: Value with size 1439997117 exceeded 1073741823 limit. Value:
00000000 0a b7 b9 d2 ae 05 12 13 08 83 80 80 80 80 80 80 |................|
00000010 80 10 12 05 00 00 00 00 00 18 01 12 13 08 84 80 |................|
00000020 80 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 |................|
00000030 12 13 08 87 80 80 80 80 80 80 80 10 12 05 00 00 |................|
00000040 00 00 00 18 01 12 13 08 88 80 80 80 80 80 80 80 |................|
00000050 10 12 05 00 00 00 00 00 18 01 12 13 08 8b 80 80 |................|
00000060 80 80 80 80 80 10 12 05 00 00 00 00 00 18 01 12 |................|
00000070 13 08 92 80 80 80 80 80 80 80 10 12 05 00 00 00 |................|
00000080 00 00 18 01 12 13 08 95 80 80 80 80 80 80 80 10 |................|

2. When importing data through the API only, the log is as follows:

I0211 08:55:50.262319 26603 log.go:34] Rolling up Created batch of size: 4.3 MB in 267.724618ms.
I0211 08:55:50.542791 26603 log.go:34] Rolling up Created batch of size: 4.5 MB in 273.896934ms.
I0211 08:55:50.786033 26603 log.go:34] Rolling up Created batch of size: 4.3 MB in 239.172574ms.
I0211 08:55:50.805872 26603 log.go:34] Rolling up Time elapsed: 08s, bytes sent: 104 MB, speed: 13 MB/sec
I0211 08:55:51.208664 26603 log.go:34] Rolling up Created batch of size: 12 MB in 66.96367ms.
I0211 08:55:51.208682 26603 log.go:34] Rolling up Sent 1125937 keys
I0211 08:55:51.226869 26603 draft.go:1102] Rolled up 1125895 keys. Done
I0211 08:55:51.226887 26603 draft.go:445] List rollup at Ts 852355: OK.

It seems that the rollup in case 1 sends far more bytes than in case 2, so the limit is exceeded. Why does this happen?
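(Putting the two numbers from the error above side by side as a quick back-of-the-envelope check; the only inputs are the sizes printed in the log, and reading the limit as 2^30 − 1 bytes is my own interpretation.)

```go
package main

import "fmt"

func main() {
	const limit = 1<<30 - 1   // 1073741823, the limit printed in the error log
	const failed = 1439997117 // size of the rolled-up value that failed

	fmt.Printf("limit: %d bytes (~%.2f GiB)\n", limit, float64(limit)/(1<<30))
	fmt.Printf("value: %d bytes (~%.2f GiB)\n", failed, float64(failed)/(1<<30))
	fmt.Printf("over the limit by %.1f%%\n", 100*float64(failed-limit)/float64(limit))
}
```

So the rolled-up value after the bulk load is roughly a third bigger than the largest value a single rollup write is allowed to produce.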

@JimWen
Author

JimWen commented Feb 11, 2020

@JimWen Can we get a copy of your dataset to reproduce this issue? Feel free to email me privately (see my GitHub profile email).

Sorry, I'm afraid the data can't be provided; it is confidential company data.

@JimWen
Author

JimWen commented Feb 17, 2020

I have switched to using the live loader to load the initial data, and everything has worked fine until now. Why not support splitting in the bulk loader? It seems that the bulk loader causes a lot of problems with huge datasets, while the only problem with the live loader is that it is too slow. @martinmr @danielmai

@martinmr
Contributor

@JimWen Yes, the bulk loader should be able to handle big lists. I'll open a feature request.

Regarding the rollup size difference between the bulk loader and the live loader, it makes sense. You start your cluster after running the bulk loader, so all of your data is already loaded. With the live loader, the data is still being loaded as it's rolled up, so the lists are smaller.
