
Alpha segfaults #4288

Closed
dhagrow opened this issue Nov 18, 2019 · 14 comments
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention.

Comments

@dhagrow

dhagrow commented Nov 18, 2019

What version of Dgraph are you using?

Master branch (717710a)

Have you tried reproducing the issue with the latest release?

Yes, v1.1.0 and master

What is the hardware spec (RAM, OS)?

DigitalOcean droplets
4 core Intel Xeon
8GB RAM
Ubuntu 18.04.3

Steps to reproduce the issue (command/config used to run Dgraph).

./dgraph alpha --zero=xx:5080 --my=xx:7080 -w ~/data/dgraph/w/ -p ~/data/dgraph/p/

  • binary compiled on Fedora 30, then copied over to VMs

Expected behaviour and actual result.

Many segfaults occurred over the course of several days, each after several hours of runtime, on two separate DO VMs.

One tracelog began with this:

fatal error: fault [signal SIGSEGV: segmentation violation code=0x1 addr=0xc030000000 pc=0x11ef2b1]
goroutine 1281391 [running]:
runtime.throw(0x175d0cb, 0x5)
        /usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc00023bb88 sp=0xc00023bb58 pc=0x9d0fd2
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:397 +0x401 fp=0xc00023bbb8 sp=0xc00023bb88 pc=0x9e7151
github.com/dgraph-io/dgraph/vendor/github.com/dgryski/go-groupvarint.Decode4(0xc00023bc30, 0x4, 0x4, 0xc02ffffff0, 0x9, 0x10)
        /tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgryski/go-groupvarint/decode_amd64.s:11 +0x11 fp=0xc00023bbc0 sp=0xc00023bbb8 pc=0x11ef2b1
github.com/dgraph-io/dgraph/codec.(*Decoder).unpackBlock(0xc015a0b1d0, 0xc01dd16bc0, 0x2, 0x8)
        /tmp/go/src/github.com/dgraph-io/dgraph/codec/codec.go:145 +0x226 fp=0xc00023bc68 sp=0xc00023bbc0 pc=0x11efaa6
github.com/dgraph-io/dgraph/codec.(*Decoder).LinearSeek(0xc015a0b1d0, 0xb5d9, 0x8, 0xc003444458, 0x4a0)
        /tmp/go/src/github.com/dgraph-io/dgraph/codec/codec.go:250 +0x62 fp=0xc00023bc98 sp=0xc00023bc68 pc=0x11f00b2
github.com/dgraph-io/dgraph/algo.IntersectCompressedWithLinJump(0xc015a0b1d0, 0xc00343a000, 0x192b, 0x192b, 0xc00023bd30)
        /tmp/go/src/github.com/dgraph-io/dgraph/algo/uidlist.go:79 +0xcd fp=0xc00023bcf8 sp=0xc00023bc98 pc=0x11f0fdd
github.com/dgraph-io/dgraph/algo.IntersectCompressedWith(0xc01db56b80, 0x0, 0xc01db568c0, 0xc01db56bc0)
        /tmp/go/src/github.com/dgraph-io/dgraph/algo/uidlist.go:64 +0x134 fp=0xc00023bd58 sp=0xc00023bcf8 pc=0x11f0e64
github.com/dgraph-io/dgraph/posting.(*List).Uids(0xc002905bc0, 0x16e36, 0x0, 0xc01db568c0, 0x176abfe, 0x11, 0x0)
        /tmp/go/src/github.com/dgraph-io/dgraph/posting/list.go:903 +0x186 fp=0xc00023bde0 sp=0xc00023bd58 pc=0x11fea96
github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings.func1(0x0, 0x1, 0xc01176bc80, 0xc01de35900)
        /tmp/go/src/github.com/dgraph-io/dgraph/worker/task.go:674 +0x11ff fp=0xc00023bf80 sp=0xc00023bde0 pc=0x1375a2f
github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings.func2(0xc002905a40, 0xc01e$637c0, 0x0, 0x1)
        /tmp/go/src/github.com/dgraph-io/dgraph/worker/task.go:691 +0x3a fp=0xc00023bfc0 sp=0xc00023bf80 pc=0x1375e8a
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc00023bfc8 sp=0xc00023bfc0 pc=0xa010d1
created by github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings
        /tmp/go/src/github.com/dgraph-io/dgraph/worker/task.go:690 +0x3b3

Other tracelogs were too long to capture from the beginning.
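Reading the trace: Decode4 receives a byte slice with 9 valid bytes and capacity 16 whose allocation ends at 0xc030000000, the exact faulting address, so the decoder appears to read past the end of the encoded data when a group sits at the edge of a mapped region. The sketch below illustrates that pattern with a defensive wrapper; it assumes only the public Decode4(dst []uint32, data []byte) signature of github.com/dgryski/go-groupvarint, and the padDecode4 helper, the 17-byte headroom, and the hand-encoded block are illustrative assumptions, not the fix that actually landed in Dgraph.

package main

import (
	"fmt"

	"github.com/dgryski/go-groupvarint"
)

// padDecode4 is an illustrative wrapper, not Dgraph's actual fix: Decode4's
// assembly uses wide loads that can touch bytes beyond the end of a short
// input slice, so a short tail is first copied into a scratch buffer with
// extra headroom before decoding.
func padDecode4(dst []uint32, data []byte) {
	const headroom = 17 // assumed safe bound: 1 control byte + up to 16 data bytes
	if len(data) >= headroom {
		groupvarint.Decode4(dst, data)
		return
	}
	var scratch [headroom]byte
	copy(scratch[:], data)
	groupvarint.Decode4(dst, scratch[:])
}

func main() {
	// Hand-encoded group: control byte 0x00 (four 1-byte values) followed by 1, 2, 3, 4.
	block := []byte{0x00, 1, 2, 3, 4}
	out := make([]uint32, 4)
	padDecode4(out, block)
	fmt.Println(out) // expected: [1 2 3 4]
}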

Possibly related: both VMs logged many of these warnings:

W1113 03:35:13.841935       1 draft.go:958] Raft.Ready took too long to process: Timer Total: 204ms. Breakdown: [{sync 204ms} {disk 0s} {proposals 0s} {advance 0s}] Num entries: 1. MustSync: true

Initial discussion here: https://discuss.dgraph.io/t/slow-syncs-and-segfault/5417

Time permitting, I can try to reproduce with mock data, but I can't say when I might get the chance.

@shekarm

shekarm commented Nov 18, 2019 via email

@danielmai danielmai added area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. labels Nov 19, 2019
@dhagrow
Author

dhagrow commented Nov 21, 2019

Still haven't had a chance to reproduce with mock data, but I have managed to capture a full log, which I have attached.
crash.tar.gz

@shekarm

shekarm commented Nov 21, 2019

Thank you. I will have an engineer look at this issue promptly.

@animesh2049
Contributor

Hi @dhagrow
This issue has already been fixed by commit 2823be2. Can you check whether it is still reproducible on the current master?

@dhagrow
Author

dhagrow commented Nov 22, 2019

Hi @animesh2049,
You can see in my initial report that the issue was reproduced on the master branch (717710a).

@animesh2049
Contributor

Hi @dhagrow
The crash log that you have shared seems different from the one pasted above in the issue. The log above panics in go-groupvarint.Decode4:

fatal error: fault [signal SIGSEGV: segmentation violation code=0x1 addr=0xc030000000 pc=0x11ef2b1]
goroutine 1281391 [running]:
runtime.throw(0x175d0cb, 0x5)
/usr/local/go/src/runtime/panic.go:617 +0x72 fp=0xc00023bb88 sp=0xc00023bb58 pc=0x9d0fd2
runtime.sigpanic()
/usr/local/go/src/runtime/signal_unix.go:397 +0x401 fp=0xc00023bbb8 sp=0xc00023bb88 pc=0x9e7151
github.com/dgraph-io/dgraph/vendor/github.com/dgryski/go-groupvarint.Decode4(0xc00023bc30, 0x4, 0x4, 0xc02ffffff0, 0x9, 0x10)
/tmp/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgryski/go-groupvarint/decode_amd64.s:11 +0x11 fp=0xc00023bbc0 sp=0xc00023bbb8 pc=0x11ef2b1
github.com/dgraph-io/dgraph/codec.(*Decoder).unpackBlock(0xc015a0b1d0, 0xc01dd16bc0, 0x2, 0x8)

but the log that you have shared (crash.tar.gz) faults in badger.(*header).Decode:

runtime.throw(0x17277a8, 0x5)
/usr/lib/golang/src/runtime/panic.go:617 +0x72 fp=0xc01a7428c8 sp=0xc01a742898 pc=0x937f52
runtime.sigpanic()
/usr/lib/golang/src/runtime/signal_unix.go:397 +0x401 fp=0xc01a7428f8 sp=0xc01a7428c8 pc=0x94e0d1
github.com/dgraph-io/badger.(*header).Decode(0xc01a7429a0, 0x7fc6ff030e7f, 0x957, 0x40fcf17f, 0x7fc6ff030e7f)
/home/miguel/go/pkg/mod/github.com/dgraph-io/[email protected]/structs.go:82 +0x35 fp=0xc01a742940 sp=0xc01a7428f8 pc=0x11612d5
github.com/dgraph-io/badger.(*valueLog).Read(0xc000057a08, 0x95700000018, 0x3f030e7f, 0xc0191e41a0, 0x0, 0xc02c688000, 0xc01a742a78, 0x1013be3, 0xc02c6860a0, 0x1)
/home/miguel/go/pkg/mod/github.com/dgraph-io/[email protected]/value.go:1166 +0xf1 fp=0xc01a742a00 sp=0xc01a742940 pc=0x116d451

@jarifibrahim can you please take a look at it?

@dhagrow
Author

dhagrow commented Nov 22, 2019

@animesh2049 yes, you're right. The first crash happened when I was using 1.1.0, so it likely shows the crash fixed by 2823be2. The log in crash.tar.gz happened while running from the master branch. Sorry, I assumed they represented the same issue.

@jarifibrahim
Contributor

runtime.throw(0x17277a8, 0x5)
/usr/lib/golang/src/runtime/panic.go:617 +0x72 fp=0xc01a7428c8 sp=0xc01a742898 pc=0x937f52
runtime.sigpanic()
/usr/lib/golang/src/runtime/signal_unix.go:397 +0x401 fp=0xc01a7428f8 sp=0xc01a7428c8 pc=0x94e0d1
github.com/dgraph-io/badger.(*header).Decode(0xc01a7429a0, 0x7fc6ff030e7f, 0x957, 0x40fcf17f, 0x7fc6ff030e7f)
/home/miguel/go/pkg/mod/github.com/dgraph-io/[email protected]/structs.go:82 +0x35 fp=0xc01a742940 sp=0xc01a7428f8 pc=0x11612d5
github.com/dgraph-io/badger.(*valueLog).Read(0xc000057a08, 0x95700000018, 0x3f030e7f, 0xc0191e41a0, 0x0, 0xc02c688000, 0xc01a742a78, 0x1013be3, 0xc02c6860a0, 0x1)
/home/miguel/go/pkg/mod/github.com/dgraph-io/[email protected]/value.go:1166 +0xf1 fp=0xc01a742a00 sp=0xc01a742940 pc=0x116d451

The above stack trace seems related to dgraph-io/badger#1131.

@animesh2049
Contributor

I am assuming the go-groupvarint issue is now fixed, so I am closing this issue for now.
@dhagrow, feel free to reopen the issue if you encounter the go-groupvarint crash again.

@dhagrow
Author

dhagrow commented Nov 25, 2019

@animesh2049 I'm sure the go-groupvarint issue was fixed by 1.1.0, but I see no indication that the badger.(*header).Decode issue has been fixed. @jarifibrahim has pointed out that the bug may be in the badger code base, but this is a segfault that still affects both 1.1.0 and the latest master of dgraph.

Unless that is being tracked somewhere else, I don't understand why this issue should be closed.

@jarifibrahim
Contributor

Hey @dhagrow, I've created dgraph-io/badger#1136 to track the segfault issue in badger.

@manishrjain manishrjain reopened this Nov 26, 2019
@manishrjain
Contributor

Agreed. This issue shouldn't be closed until we've fixed both segfaults.

@manishrjain manishrjain added the priority/P0 Critical issue that requires immediate attention. label Nov 26, 2019
@tuesmiddt

Hey @dhagrow

Are you still getting the same badger.(*header).Decode error on the latest master? Your original report was based on 717710a. As of that commit, Dgraph was using badger version cbdef65095c7. Commit dgraph-io/badger@1621eca might have fixed your issue. The latest Dgraph master uses badger v2.0.0 and includes that fix.
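A quick way to confirm which Badger version a given Dgraph checkout resolves to (assuming a Go-modules build, which the /go/pkg/mod paths in the crash log suggest) is the standard module query, run from the repository root:

go list -m all | grep dgraph-io/badger

The reported version or pseudo-version can then be compared against dgraph-io/badger@1621eca.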

@dhagrow
Author

dhagrow commented Dec 2, 2019

Hi @tuesmiddt. I have been running on a7dec1e since Saturday, and am happy to report that there have been no crashes. It does look as if the issue is fixed. Thank you.
