Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dgraph should handle shutdown gracefully #3873

Closed
animesh2049 opened this issue Aug 23, 2019 · 7 comments · Fixed by #5138 or #5137
Closed

Dgraph should handle shutdown gracefully #3873

animesh2049 opened this issue Aug 23, 2019 · 7 comments · Fixed by #5138 or #5137
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it.
Milestone

Comments

@animesh2049
Copy link
Contributor

What version of Go are you using (go version)?

v1.1.0-rc1

$ go version
go1.12.5

Does this issue reproduce with the latest master?

Yes

What are the hardware specifications of the machine (RAM, OS, Disk)?

General

What did you do?

I ran 6 node cluster (dgraph), flock, flock client earlyoom.

What did you see instead?

Panic.

I0823 09:50:13.229249       1 run.go:549] Caught Ctrl-C. Terminating now (this may take a few seconds)...
E0823 09:50:13.229523       1 run.go:348] Stopped taking more http(s) requests. Err: accept tcp [::]:8182: use of closed network connection
E0823 09:50:13.229845       1 run.go:329] GRPC listener canceled: accept tcp [::]:9182: use of closed network connection
I0823 09:50:13.231313       1 run.go:568] GRPC and HTTP stopped.
I0823 09:50:13.231369       1 worker.go:97] Stopping group...
I0823 09:50:13.231517       1 worker.go:100] Stopping node...
E0823 09:50:13.231588       1 groups.go:748] Unable to sync memberships. Error: rpc error: code = Canceled desc = context canceled. State: <nil>
I0823 09:50:13.231607       1 draft.go:712] Stopping node.Run
I0823 09:50:13.233993       1 draft.go:756] Raft node done.
I0823 09:50:13.235020       1 worker.go:103] Stopping worker server...
I0823 09:50:13.241186       1 run.go:571] Server shutdown. Bye!
panic: runtime error: invalid memory address or nil pointer dereference
        panic: Unclosed iterator at time of Txn.Discard.
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1189f87]

goroutine 36565957 [running]:
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).Discard(0xc0dc0ab980)
        /home/flock/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/txn.go:446 +0xde
panic(0x15b19e0, 0x216b2d0)
        /usr/local/go/src/runtime/panic.go:522 +0x1b5
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl.(*Skiplist).IncrRef(...)
        /home/flock/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/skl/skl.go:86
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*DB).getMemTables(0xc00009fc00, 0x0, 0x0, 0x0, 0x0)
        /home/flock/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/db.go:496 +0xd7
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).NewIterator(0xc0dc0ab980, 0x0, 0x64, 0x100, 0xc038d56d60, 0x13, 0x13, 0x1, 0x0)
        /home/flock/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/iterator.go:409 +0x75
github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger.(*Txn).NewKeyIterator(0xc0dc0ab980, 0xc038d56d60, 0x13, 0x13, 0x0, 0x64, 0x100, 0xc038d56d60, 0x13, 0x13, ...)
        /home/flock/go/src/github.com/dgraph-io/dgraph/vendor/github.com/dgraph-io/badger/iterator.go:439 +0xa6
github.com/dgraph-io/dgraph/posting.getNew(0xc038d56d60, 0x13, 0x13, 0xc00009fc00, 0x0, 0x0, 0x0)
        /home/flock/go/src/github.com/dgraph-io/dgraph/posting/mvcc.go:214 +0x148
github.com/dgraph-io/dgraph/posting.(*LocalCache).getInternal(0xc0f32ad340, 0xc038d56d60, 0x13, 0x13, 0x1, 0x13, 0x0, 0x0)
        /home/flock/go/src/github.com/dgraph-io/dgraph/posting/lists.go:236 +0xde
github.com/dgraph-io/dgraph/posting.(*LocalCache).Get(...)
        /home/flock/go/src/github.com/dgraph-io/dgraph/posting/lists.go:260
github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings.func1(0x2c9d0, 0x3054c, 0xc034920480, 0x9bedc6)
        /home/flock/go/src/github.com/dgraph-io/dgraph/worker/task.go:575 +0x23b
github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings.func2(0xc0f79b87e0, 0xc0f79ad540, 0x2c9d0, 0x3054c)
        /home/flock/go/src/github.com/dgraph-io/dgraph/worker/task.go:683 +0x3a
created by github.com/dgraph-io/dgraph/worker.(*queryState).handleUidPostings
        /home/flock/go/src/github.com/dgraph-io/dgraph/worker/task.go:682 +0x3b0
@jarifibrahim jarifibrahim changed the title Dgraph crashes after pressing ctlr-c becasue of badger Dgraph crashes after pressing ctlr-c because of badger Aug 23, 2019
@jarifibrahim jarifibrahim changed the title Dgraph crashes after pressing ctlr-c because of badger Dgraph crashes after pressing ctrl-c because of badger Aug 23, 2019
@jarifibrahim
Copy link
Contributor

I saw the same error yesterday while running badger.

@jarifibrahim
Copy link
Contributor

jarifibrahim commented Aug 26, 2019

In case of the failure that I observed, the error was thrown because DB was being closed when a transaction was active.

It's possible that Dgraph too is trying to close DB on CTRL-C while a transaction was still active.

@jarifibrahim jarifibrahim changed the title Dgraph crashes after pressing ctrl-c because of badger Dgraph should handle shutdown gracefully Aug 27, 2019
@jarifibrahim jarifibrahim transferred this issue from dgraph-io/badger Aug 27, 2019
@jarifibrahim
Copy link
Contributor

I've moved the issue to dgraph from badger. The error was caused in dgraph because dgraph tried to close badger when a transaction was active.

@campoy campoy added area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it. labels Sep 3, 2019
@campoy campoy modified the milestones: Dgraph v1.1, Dgraph v1.2 Sep 3, 2019
@campoy
Copy link
Contributor

campoy commented Sep 3, 2019

This is not a new issue, requires some big changes - moving to next release.

@campoy campoy modified the milestones: Dgraph v1.2, Dgraph v1.1.1 Sep 13, 2019
@campoy campoy added the status/needs-attention This issue needs more eyes on it, more investigation might be required before accepting/rejecting it label Sep 13, 2019
@jarifibrahim
Copy link
Contributor

Potentially fixed by #4356

@manishrjain
Copy link
Contributor

Let's test with 20.03.0-beta. By adding a long sleep here, triggering that code and calling Ctrl+C. I think that goroutine does get shutdown gracefully, not sure why we're seeing this above issue.

https://github.com/dgraph-io/dgraph/blob/0e3c4b0d20aa5ec479ac2dc8f461999a03fe838d/worker/mutation.go#L124-L126

@lgalatin lgalatin removed the status/needs-attention This issue needs more eyes on it, more investigation might be required before accepting/rejecting it label Mar 25, 2020
@jarifibrahim
Copy link
Contributor

jarifibrahim commented Apr 8, 2020

I see 3 possible reasons for alpha crash while shutting down. In all the cases, we close badger before the operation completes. Dgraph could crash at the following places

  1. The executor.processMutationCh doesn't wait for all the goroutines to stop https://github.com/dgraph-io/dgraph/blob/285024cfb9925b0589831618153409e1f31f84fb/worker/background_mutation.go#L97 This should be fixed by Shutdown executor goroutines #5130
  2. The schema.Load(...) in reindexing https://github.com/dgraph-io/dgraph/blob/0e3c4b0d20aa5ec479ac2dc8f461999a03fe838d/worker/mutation.go#L124-L126
    The issue can be easily seen by adding a 10-second sleep inside schema.Load. Alpha doesn't wait for these goroutines to finish. We access badger inside load. This should be fixed by Alpha Close: Wait for indexing to complete #5137
  3. The calculate() function in handleValuePosting function https://github.com/dgraph-io/dgraph/blob/285024cfb9925b0589831618153409e1f31f84fb/worker/task.go#L493-L501
    We would return from the function one first occurrence of the err without waiting for the all goroutines to finish. We access badger in the cache.Get() call in calculate . This should be fixed by Use errGroup in handleValuePostings #5138

jarifibrahim pushed a commit that referenced this issue Apr 13, 2020
The background reindexing starts indexing in the background but we
do not wait for the indexing to complete in case of error while shutting
down alpha. This PR fixes that issue by waiting for the background
indexing to complete before we shutdown alpha. 

The issue can be easily seen by adding a long sleep in the
background indexing and trying to shutdown alpha. It can be seen
that Alpha doesn't wait for the indexing to complete while shutting
down.

Fixes #3873
jarifibrahim pushed a commit that referenced this issue Jun 8, 2020
The background reindexing starts indexing in the background but we
do not wait for the indexing to complete in case of error while shutting
down alpha. This PR fixes that issue by waiting for the background
indexing to complete before we shutdown alpha. 

The issue can be easily seen by adding a long sleep in the
background indexing and trying to shutdown alpha. It can be seen
that Alpha doesn't wait for the indexing to complete while shutting
down.

Fixes #3873

(cherry picked from commit dc75424)
jarifibrahim pushed a commit that referenced this issue Jun 9, 2020
The background reindexing starts indexing in the background but we
do not wait for the indexing to complete in case of error while shutting
down alpha. This PR fixes that issue by waiting for the background
indexing to complete before we shutdown alpha. 

The issue can be easily seen by adding a long sleep in the
background indexing and trying to shutdown alpha. It can be seen
that Alpha doesn't wait for the indexing to complete while shutting
down.

Fixes #3873

(cherry picked from commit dc75424)
dna2github pushed a commit to dna2fork/dgraph that referenced this issue Jul 18, 2020
The background reindexing starts indexing in the background but we
do not wait for the indexing to complete in case of error while shutting
down alpha. This PR fixes that issue by waiting for the background
indexing to complete before we shutdown alpha. 

The issue can be easily seen by adding a long sleep in the
background indexing and trying to shutdown alpha. It can be seen
that Alpha doesn't wait for the indexing to complete while shutting
down.

Fixes hypermodeinc#3873
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. kind/bug Something is broken. priority/P0 Critical issue that requires immediate attention. status/accepted We accept to investigate/work on it.
5 participants