Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Breaking flag] fix(Backups): new badger SuperFlag, NumGoroutines option solves OOM crashes #7387

Merged
merged 21 commits into from
Feb 22, 2021

Conversation

karlmcguire
Copy link
Contributor

@karlmcguire karlmcguire commented Feb 2, 2021

(This PR is for DGRAPH-2953, paired with dgraph-io/badger#1656 which adds the Badger options.)

On master, Alpha nodes would crash during a backup on a machine with low memory (<6gb in my tests). This was because the default number of goroutines was hardcoded as 16 which is used to create local threads and used to overwrite Stream.numGo.

The solution is to add a --badger "goroutines=8; ..." SuperFlag to Alpha and anything else that might want to set Badger options. So far I've also added it to bulk/run.go, but that might be too much for this PR. The command line flags for Badger should be the same across each tool, for consistency.

The question now is how we should calculate defaults. The default goroutines value of 8 seems to be working fine on a 4gb container, but should we lessen it for machines with less memory?


This change is Reviewable

@karlmcguire karlmcguire added area/tools Issues related to maintenance tools and CLI. area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. area/enterprise/backup Related to binary backups labels Feb 2, 2021
@github-actions github-actions bot added the area/bulk-loader Issues related to bulk loading. label Feb 2, 2021
@jarifibrahim jarifibrahim changed the title fix(Backups): new badger SuperFlag, NumGoroutines option solves OOM crashes [Breaking] fix(Backups): new badger SuperFlag, NumGoroutines option solves OOM crashes Feb 3, 2021
go.mod Outdated Show resolved Hide resolved
dgraph/cmd/alpha/run.go Outdated Show resolved Hide resolved
compression=[none, zstd:level, snappy] specifies the compression algorithm and the compression
level (if applicable) for the postings directory. "none" would disable compression, while
"zstd:1" would set zstd compression at level 1.
cache_mb=N total size of cache (in MB) per shard in the reducer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's one more cache_mb flags in the x/flags.go file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have the posting list cache. We can move both the badger caches inside the --badger flag and keep the posting list size flag outside.

Something like dgraph --badger="cache_mb=100; cache_percentage=50,50" --posting-cache-mb=50

worker/backup_processor.go Outdated Show resolved Hide resolved
@jarifibrahim jarifibrahim changed the title [Breaking] fix(Backups): new badger SuperFlag, NumGoroutines option solves OOM crashes [Breaking flag] fix(Backups): new badger SuperFlag, NumGoroutines option solves OOM crashes Feb 3, 2021
" while zstd:1 would set zstd compression at level 1.")
flag.String("badger", worker.BadgerDefaults,
`Various badger options.
goroutines=N provides the number of goroutines to use in badger.Stream.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this goroutines flag in Alpha for creating local threads during backups, and Badger uses the goroutines value as the default Stream.numGo.

flag.String("badger", worker.BadgerDefaults,
`Various badger options.
goroutines=N provides the number of goroutines to use in badger.Stream.
compression=[none, zstd:level, snappy] specifies the compression algorithm and the compression
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this compression flag in Alpha for setting the compression type of Badger.

" (FORMAT: BlockCacheSize, IndexCacheSize).")
flag.String("badger", worker.BadgerDefaults,
`Various badger options.
compression=[none, zstd:level, snappy] specifies the compression algorithm and the compression
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this compression flag in Bulk for setting the compression type of Badger (similar to how Alpha uses it).

compression=[none, zstd:level, snappy] specifies the compression algorithm and the compression
level (if applicable) for the postings directory. "none" would disable compression, while
"zstd:1" would set zstd compression at level 1.
cache_mb=N total size of cache (in MB) per shard in the reducer.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this cache_mb flag in Bulk for setting the Badger cache size. The default value is set in worker/server_state.go.

level (if applicable) for the postings directory. "none" would disable compression, while
"zstd:1" would set zstd compression at level 1.
cache_mb=N total size of cache (in MB) per shard in the reducer.
cache_percentage=N cache percentages summing up to 100 for various caches.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need this cache_percentage flag in Bulk for setting the BlockCacheSize and IndexCacheSize.

worker/backup_processor.go Show resolved Hide resolved
@@ -30,6 +30,10 @@ import (
"github.com/golang/glog"
)

const (
BadgerDefaults = "goroutines=8; compression=snappy; cache_mb=64; cache_percentage=70,30"
Copy link
Contributor Author

@karlmcguire karlmcguire Feb 3, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these defaults aren't used by Alpha, which is kind of confusing...

We use goroutines and compression in Alpha.

We use compression, cache_mb, and cache_percentage in Bulk.

It doesn't make sense to split up the BadgerDefaults across packages, but it might be worth adding a comment explaining which packages use which options.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 5 files reviewed, 14 unresolved discussions (waiting on @karlmcguire and @vvbalaji-dgraph)


dgraph/cmd/alpha/run.go, line 609 at r2 (raw file):

	walCache := (cachePercent[3] * (totalCache << 20)) / 100

	badger := x.NewSuperFlag(Alpha.Conf.GetString("badger")).MergeAndCheckDefault(worker.BadgerDefaults)

100 chars.


dgraph/cmd/bulk/run.go, line 120 at r2 (raw file):

	flag.String("badger", worker.BadgerDefaults,
		`Various badger options.
	compression=[none, zstd:level, snappy] specifies the compression algorithm and the compression

This string can be in one place. Also, we should be using Badger options directly. Whatever options Badger already has, should do directly to Badger.


dgraph/cmd/bulk/run.go, line 132 at r2 (raw file):

func run() {
	badger := x.NewSuperFlag(Bulk.Conf.GetString("badger")).MergeAndCheckDefault(worker.BadgerDefaults)

100 chars.


worker/server_state.go, line 34 at r2 (raw file):

Previously, karlmcguire (Karl McGuire) wrote…

Some of these defaults aren't used by Alpha, which is kind of confusing...

We use goroutines and compression in Alpha.

We use compression, cache_mb, and cache_percentage in Bulk.

It doesn't make sense to split up the BadgerDefaults across packages, but it might be worth adding a comment explaining which packages use which options.

Let's not change the default value of cache.

@karlmcguire
Copy link
Contributor Author

karlmcguire commented Feb 16, 2021

This string can be in one place.

Where? We have Badger flags for Alpha and Bulk, both using different flags. Alpha used to expose compression and goroutine, Bulk used to expose compression, cache_mb, and cache_percentage. If we get rid of cache_mb and cache_percentage then we could probably just use the same compression and goroutine flag string for Alpha and Bulk.

Also, we should be using Badger options directly. Whatever options Badger already has, should do directly to Badger.

I'm not sure what this means. Are you saying to keep this rather than using badger.FromSuperFlag?

Let's not change the default value of cache.

We're currently using cache_mb and cache_percentage in Bulk here for the block and index caches. Should we not expose these flags?

@karlmcguire karlmcguire merged commit e2c1dae into master Feb 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/bulk-loader Issues related to bulk loading. area/crash Dgraph issues that cause an operation to fail, or the whole server to crash. area/enterprise/backup Related to binary backups area/tools Issues related to maintenance tools and CLI.
Development

Successfully merging this pull request may close these issues.

3 participants