Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic online restore. #5095

Merged
merged 11 commits into from
Apr 13, 2020
Merged

Basic online restore. #5095

merged 11 commits into from
Apr 13, 2020

Conversation

martinmr
Copy link
Contributor

@martinmr martinmr commented Apr 3, 2020

This change allows Dgraph to restore a backup. This PR provides basic
functionality that allows us to restore a backup to an empty cluster.

There's still work that needs to be done to allow a restore to work on
a cluster with existing data. Tablets need to be moved to the right
group for this part to work correctly.


This change is Reviewable

Docs Preview: Dgraph Preview

This change allows Dgraph to restore a backup. This PR provides basic
functionality that allows us to restore a backup to an empty cluster.

There's still work that needs to be done to allow a restore to work on
a cluster with existing data. Tablets need to be moved to the right
group for this part to work correctly.
@martinmr martinmr requested review from manishrjain and a team as code owners April 3, 2020 17:47
worker/online_restore_ee.go Outdated Show resolved Hide resolved
worker/online_restore_ee.go Show resolved Hide resolved
worker/online_restore_ee.go Show resolved Hide resolved
worker/online_restore_ee.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @golangcibot and @manishrjain)


worker/online_restore_ee.go, line 36 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of UpdateMembershipState is not checked (from errcheck)

Done.


worker/online_restore_ee.go, line 145 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

ineffectual assignment to err (from ineffassign)

Done.


worker/online_restore_ee.go, line 146 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

ineffectual assignment to err (from ineffassign)

Done.


worker/online_restore_ee.go, line 160 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

printf: Errorf format %s has arg req.GroupId of wrong type uint32 (from govet)

Done.

Copy link
Contributor

@manishrjain manishrjain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Remember to add all the TODOs -- and also do address all those quickly after this. Don't let them hang around for too long.

Reviewed 20 of 22 files at r1, 2 of 2 files at r2.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot and @martinmr)


graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):

	"""
	Start restoring a binary backup.  See : https://docs.dgraph.io/enterprise-features/#binary-backups

100 chars.


graphql/admin/restore.go, line 47 at r2 (raw file):

func (rr *restoreResolver) Rewrite(
	m schema.Mutation) (*gql.GraphQuery, []*dgoapi.Mutation, error) {
	glog.Info("Got restore request")

remove, or expand?


protos/pb.proto, line 284 at r2 (raw file):

	repeated KV kv = 4;
	MembershipState state = 5;
	string clean_predicate = 6;	 // Delete the predicate which was moved to other group.

Avoid unnecessary changes.


worker/online_restore_ee.go, line 120 at r2 (raw file):

		return errors.Errorf("nil restore request")
	}

Add a TODO for few things.

  1. Add this as an operation, so rollups and others would stop.
  2. Switch to draining mode.

Ensure defer handle both on exit from this func.

Another TODO is to ensure that all the groups have received this command and ack that they have proposed and "found" it come back via Raft. If you return after found, the user can determine if the restore is done or not. That they can achieve by asking for status/health endpoint via GraphQL. It would tell them which operations are active. Once, the restore op is done, it would be removed from that list (this should be documented).

Another TODO is to allow the user to specify which backup series to restore from.
TODO: Allow a way to retrieve the list of backup series via graphql (JIRA ticket).


worker/online_restore_ee.go, line 186 at r2 (raw file):

	// from being replayed.
	if err := groups().Node.proposeSnapshot(1); err != nil {
		return err

errors.Wrapf(...)


worker/online_restore_ee.go, line 196 at r2 (raw file):

	res := LoadBackup(req.Location, req.BackupId,
		func(r io.Reader, groupId int, preds predicateSet) (uint64, error) {
			gzReader, err := gzip.NewReader(r)

TODO: Fix for encrypted backups.


worker/online_restore_ee.go, line 203 at r2 (raw file):

			maxUid, err := loadFromBackup(pstore, gzReader, req.RestoreTs, preds)
			if err != nil {
				return 0, err

wrapf


worker/online_restore_ee.go, line 213 at r2 (raw file):

			}
			zc := pb.NewZeroClient(pl.Get())
			_, err = zc.AssignUids(ctx, &pb.Num{Val: maxUid})

if _, err := ; err != nil { ... } Also, do the wrapf.


worker/online_restore_ee.go, line 222 at r2 (raw file):

			return maxUid, nil
		})
	return res.Err

wrapf


worker/restore.go, line 84 at r2 (raw file):

// If ts is greater than zero, the key-value pairs will be written with that timestamp.
// Otherwise, the original value is used.
func loadFromBackup(db *badger.DB, r io.Reader, restoreTs uint64, preds predicateSet) (

No need for restoreTs.


worker/restore.go, line 188 at r2 (raw file):

			case posting.BitSchemaPosting:
				// Schema and type keys should always have a timestamp equal to 1.
				kv.Version = 1

remove.


worker/s3_handler.go, line 326 at r2 (raw file):

// load any backup objects found.
// Returns nil and the maximum Since value on success, error otherwise.
func (h *s3Handler) Load(uri *url.URL, backupId string, fn loadFn) LoadResult {

Does LoadResult have to be public?

Copy link
Contributor Author

@martinmr martinmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 14 of 22 files reviewed, 16 unresolved discussions (waiting on @golangcibot and @manishrjain)


graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

100 chars.

Done.


graphql/admin/restore.go, line 47 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

remove, or expand?

Done.


protos/pb.proto, line 284 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Avoid unnecessary changes.

Done.


worker/online_restore_ee.go, line 120 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Add a TODO for few things.

  1. Add this as an operation, so rollups and others would stop.
  2. Switch to draining mode.

Ensure defer handle both on exit from this func.

Another TODO is to ensure that all the groups have received this command and ack that they have proposed and "found" it come back via Raft. If you return after found, the user can determine if the restore is done or not. That they can achieve by asking for status/health endpoint via GraphQL. It would tell them which operations are active. Once, the restore op is done, it would be removed from that list (this should be documented).

Another TODO is to allow the user to specify which backup series to restore from.
TODO: Allow a way to retrieve the list of backup series via graphql (JIRA ticket).

Done. Added the todos and opened JIRA tickets for them.


worker/online_restore_ee.go, line 186 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

errors.Wrapf(...)

Done.


worker/online_restore_ee.go, line 196 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

TODO: Fix for encrypted backups.

Done. Added a TODO.


worker/online_restore_ee.go, line 203 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

wrapf

Done.


worker/online_restore_ee.go, line 213 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

if _, err := ; err != nil { ... } Also, do the wrapf.

Done.


worker/online_restore_ee.go, line 222 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

wrapf

Done.


worker/restore.go, line 84 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

No need for restoreTs.

I think this is needed. For example, let's assume restoreTs=10. If the backup has keys with versions > 10, those values will not be seen by queries until the timestamp of the cluster reaches the timestamp of the values.

We either have to do this or advance the timestamp of the cluster to the max timestamp seen in the backup. I've added a TODO here as well to look further into it.


worker/restore.go, line 188 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

remove.

Done.


worker/s3_handler.go, line 326 at r2 (raw file):

Previously, manishrjain (Manish R Jain) wrote…

Does LoadResult have to be public?

No. I've added a todo to do the refactor in a later PR.

@martinmr martinmr merged commit de1cc8a into master Apr 13, 2020
@martinmr martinmr deleted the martinmr/online-restore branch April 13, 2020 22:26
dna2github pushed a commit to dna2fork/dgraph that referenced this pull request Jul 18, 2020
This change allows Dgraph to restore a backup. This PR provides basic
functionality that allows us to restore a backup to an empty cluster.

There's still work that needs to be done to allow a restore to work on
a cluster with existing data. Tablets need to be moved to the right
group for this part to work correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants