
fix(backup): use StreamWriter instead of KVLoader during backup restore #8510

Merged
merged 1 commit into main from aman/cp on Jan 18, 2023

Conversation

mangalaman93
Member

@mangalaman93 mangalaman93 commented Dec 13, 2022

cherry-pick PR #7753

This commit is a major rewrite of the online restore code. It used to use badger's KVLoader. It now uses StreamWriter instead, which is much faster for the writes performed during a restore.

@mangalaman93 mangalaman93 added the slash-to-main PRs which bring slash branch on par with main. label Dec 13, 2022
@mangalaman93 mangalaman93 self-assigned this Dec 13, 2022
@CLAassistant

CLAassistant commented Dec 13, 2022

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions github-actions bot added area/bulk-loader Issues related to bulk loading. area/enterprise Related to proprietary features area/graphql Issues related to GraphQL support on Dgraph. area/live-loader Issues related to live loading. area/testing Testing related issues labels Dec 13, 2022
@all-seeing-code
Contributor

Do we want to merge all these in one go, or can we split them up into smaller cherry-picks?

@mangalaman93
Member Author

No, I was initially thinking of merging them one by one, but I realized these changes are better merged together. The challenge is that they involve a lot of refactoring, including file renames, and it becomes very difficult to make sense of them one commit at a time.

ee/acl/acl_test.go Outdated Show resolved Hide resolved
@skrdgraph
Contributor

Nit: could we align the title to our existing format, fix(<area>): <title>?

@mangalaman93 mangalaman93 changed the title cherry-pick PR https://github.com/dgraph-io/dgraph/pull/7753 fix(backup): cherry-pick PR https://github.com/dgraph-io/dgraph/pull/7753 Dec 28, 2022
@coveralls

coveralls commented Dec 28, 2022

Coverage Status

Coverage: 66.783% (+0.4%) from 66.343% when pulling dd950f7 on aman/cp into f89eeef on main.

@all-seeing-code
Contributor

all-seeing-code commented Dec 29, 2022

A couple of comments -

  1. Code coverage seems to drop by 0.4%; can we check why that happens and whether we can add a test scenario to cover it?
  2. The change speeds up backup and restore. This should be verified on a large dataset (21 million/ldbcsf10) to get some idea of the performance gain due to StreamWriter.

@mangalaman93
Member Author

mangalaman93 commented Jan 1, 2023

  • Code coverage seems to drop by 0.4%, can we check why that happens and if we can add a scenario in tests to cover it?

Most of the normal cases are already covered. I have a few more tests in mind, and Siddhesh has a PR adding more tests. I want to unblock the rest of the changes for the slash alignment, and I will work on adding tests in parallel.

  • The change speeds up the back-up and restore. This should be verified on a large dataset (21 million/ldbcsf10) to have some idea on performance gain due to StreamWriter

I think StreamWriter is inherently faster than our existing approach. Even if it is not faster here, it improves the performance of subsequent writes. Let's talk more about this in tomorrow's meeting.

ee/backup/run.go Outdated Show resolved Hide resolved
Base automatically changed from aman/sensitive to main January 3, 2023 04:11
@mangalaman93 mangalaman93 changed the title fix(backup): cherry-pick PR https://github.com/dgraph-io/dgraph/pull/7753 fix(backup): use StreamWriter instead of KVLoader during backup restore Jan 3, 2023
@mangalaman93 mangalaman93 force-pushed the aman/cp branch 2 times, most recently from 27861d7 to c51a385 Compare January 3, 2023 14:37
Contributor

@billprovince billprovince left a comment


still looking over ...

dgraph/cmd/bulk/reduce.go Show resolved Hide resolved
dgraph/cmd/increment/increment.go Show resolved Hide resolved
dgraph/cmd/increment/increment.go Outdated Show resolved Hide resolved
dgraph/cmd/increment/increment.go Show resolved Hide resolved
dgraph/main.go Show resolved Hide resolved
@mangalaman93
Member Author

This is how the map and reduce phases are implemented in this PR: we create MAP files, each of a limited size, and write sorted data into them; we may end up creating many such files. We then take all of these MAP files, read part of the data from each, sort the combined data, and use StreamWriter to write it into the pstore badger. At the beginning of each MAP file we store partition keys, which are just intermediate keys among the entries stored in that file. When we read data during reduce, we read in chunks delimited by these partition keys, i.e., from one partition key to the next. I am not sure there is value in having these partition keys; maybe we can live without them.

@all-seeing-code
Contributor

This is how the map reduce phases are implemented in this PR: we create MAP files each of a limited size and write sorted data into it. We may end up creating many such files. Then we take all of these MAP files and read part of the data from each file, sort all of this data and then use streamwriter to write the sorted data into pstore badger. We store some sort of partition keys in the MAP file in the beginning of the file. The partition keys are just intermediate keys among the entries that we store in the map file. When we read data during reduce, we read in the chunks of these partition keys, meaning from one partition key to the next partition key. I am not sure if there is a value in having these partition keys. Maybe, we can live without them.

A few questions as I try to wade through the change:

we create MAP files each of a limited size

  • Is this size configurable or hard-coded?
  • Can you point to where this size is picked up from?
  • Is it correct to assume that the individual map file size decides how many such map files are generated?

We store some sort of partition keys in the MAP file in the beginning of the file. The partition keys are just intermediate keys among the entries that we store in the map file. When we read data during reduce, we read in the chunks of these partition keys, meaning from one partition key to the next partition key.

  • Is this because we want to use a buffered reader that reads a certain amount into memory, so we process that and then move on to a different chunk?

@mangalaman93
Member Author

Few questions as I try to wade through the change:

we create MAP files each of a limited size

* Is this size configurable or hard-coded?

Hard-coded.

* Can you point to where this size is picked up from?

In restore_reduce.go; it is set to 2 GB.

* Is this correct to assume that the individual `Map file` size would decide how many such map files are generated

Correct.

We store some sort of partition keys in the MAP file in the beginning of the file. The partition keys are just intermediate keys among the entries that we store in the map file. When we read data during reduce, we read in the chunks of these partition keys, meaning from one partition key to the next partition key.

* Is this because we want to make use of some buffered reader which can read certain amount in memory, we process that and then move on to a different chunk?

Correct. We need to sort the data across all the map files, while each individual map file is already sorted. We read data up to the partition key from each map file, sort that data, and then write it to badger.

Contributor

@harshil-goel harshil-goel left a comment


LGTM. Minor nitpicks here and there. I haven't really looked that deeply into the algorithm yet.

ee/backup/run.go Show resolved Hide resolved
ee/backup/run.go Show resolved Hide resolved
ee/backup/run.go Show resolved Hide resolved
type predicateSet map[string]struct{}

// Manifest records backup details, these are values used during restore.
// Since is the timestamp from which the next incremental backup should start (it's set
Contributor


Can we do something like "SinceTs is the timestamp"?

Member Author


What do you mean?

worker/restore_reduce.go Show resolved Hide resolved
worker/restore_reduce.go Show resolved Hide resolved
worker/restore_map.go Show resolved Hide resolved
@mangalaman93 mangalaman93 force-pushed the aman/cp branch 2 times, most recently from 2ac2673 to 3307aab Compare January 17, 2023 19:18
dgraph/cmd/increment/increment.go Show resolved Hide resolved
dgraph/main.go Show resolved Hide resolved
"/health": true,
"/state": true,
"/probe/graphql": true,
Contributor


Does this require a doc update in the audit log section that this endpoint will not be audited?
I think it does require a doc update:
https://dgraph.io/docs/enterprise-features/audit-logs/

Member Author


I will make a note of it. Thanks

worker/backup.go Outdated Show resolved Hide resolved
ee/backup/run.go Outdated Show resolved Hide resolved
worker/backup_ee.go Show resolved Hide resolved
@all-seeing-code
Contributor

It changes export as well. I think we can mention that in the description.

This commit is a major rewrite of the backup and online restore
code. It used to use badger's KVLoader. Now it uses StreamWriter
instead, which is much faster for writes.

cherry-pick PR #7753

The following commits are cherry-picked (in reverse order):
 * opt(restore): Sort the buffer before spinning the writeToDisk goroutine (#7984) (#7996)
 * fix(backup): Fix full backup request (#7932) (#7933)
 * fix: fixing graphql schema update when the data is restored +
        skipping /probe/graphql from audit (#7925)
 * fix(restore): return nil if there is error (#7899)
 * Don't ban namespace in export_backup
 * reset the kv.StreamId before sending to stream writer (#7833) (#7837)
 * fix(restore): Bump uid and namespace after restore (#7790) (#7800)
 * fix(ee): GetKeys should return an error (#7713) (#7797)
 * fix(backup): Free the UidPack after use (#7786)
 * fix(export-backup): Fix double free in export backup (#7780) (#7783)
 * fix(lsbackup): Fix profiler in lsBackup (#7729)
 * Bring back "perf(Backup): Improve backup performance (#7601)"
 * Opt(Backup): Make backups faster (#7680)
 * Fix s3 backup copy (#7669)
 * [BREAKING] Opt(Restore): Optimize Restore's new map-reduce based design (#7666)
 * Perf(restore): Implement map-reduce based restore (#7664)
 * feat(backup): Merge backup refactoring
 * Revert "perf(Backup): Improve backup performance (#7601)"
Labels
area/bulk-loader Issues related to bulk loading. area/enterprise Related to proprietary features area/graphql Issues related to GraphQL support on Dgraph. area/live-loader Issues related to live loading. area/testing Testing related issues slash-to-main PRs which bring slash branch on par with main.
9 participants