Basic online restore. #5095

martinmr · 2020-04-03T17:47:23Z

This change allows Dgraph to restore a backup. This PR provides basic
functionality that allows us to restore a backup to an empty cluster.

There's still work that needs to be done to allow a restore to work on
a cluster with existing data. Tablets need to be moved to the right
group for this part to work correctly.

This change is

Docs Preview:

This change allows Dgraph to restore a backup. This PR provides basic functionality that allows us to restore a backup to an empty cluster. There's still work that needs to be done to allow a restore to work on a cluster with existing data. Tablets need to be moved to the right group for this part to work correctly.

worker/online_restore_ee.go

martinmr

Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @golangcibot and @manishrjain)

worker/online_restore_ee.go, line 36 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

Error return value of UpdateMembershipState is not checked (from errcheck)

Done.

worker/online_restore_ee.go, line 145 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

ineffectual assignment to err (from ineffassign)

Done.

worker/online_restore_ee.go, line 146 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

ineffectual assignment to err (from ineffassign)

Done.

worker/online_restore_ee.go, line 160 at r1 (raw file):

Previously, golangcibot (Bot from GolangCI) wrote…

printf: Errorf format %s has arg req.GroupId of wrong type uint32 (from govet)

Done.

manishrjain

Remember to add all the TODOs -- and also do address all those quickly after this. Don't let them hang around for too long.

Reviewed 20 of 22 files at r1, 2 of 2 files at r2.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot and @martinmr)

graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):

	"""
	Start restoring a binary backup.  See : https://docs.dgraph.io/enterprise-features/#binary-backups

100 chars.

graphql/admin/restore.go, line 47 at r2 (raw file):

func (rr *restoreResolver) Rewrite(
	m schema.Mutation) (*gql.GraphQuery, []*dgoapi.Mutation, error) {
	glog.Info("Got restore request")

remove, or expand?

protos/pb.proto, line 284 at r2 (raw file):

	repeated KV kv = 4;
	MembershipState state = 5;
	string clean_predicate = 6;	 // Delete the predicate which was moved to other group.

Avoid unnecessary changes.

worker/online_restore_ee.go, line 120 at r2 (raw file):

		return errors.Errorf("nil restore request")
	}

Add a TODO for few things.

Add this as an operation, so rollups and others would stop.
Switch to draining mode.

Ensure defer handle both on exit from this func.

Another TODO is to ensure that all the groups have received this command and ack that they have proposed and "found" it come back via Raft. If you return after found, the user can determine if the restore is done or not. That they can achieve by asking for status/health endpoint via GraphQL. It would tell them which operations are active. Once, the restore op is done, it would be removed from that list (this should be documented).

Another TODO is to allow the user to specify which backup series to restore from.
TODO: Allow a way to retrieve the list of backup series via graphql (JIRA ticket).

worker/online_restore_ee.go, line 186 at r2 (raw file):

	// from being replayed.
	if err := groups().Node.proposeSnapshot(1); err != nil {
		return err

errors.Wrapf(...)

worker/online_restore_ee.go, line 196 at r2 (raw file):

	res := LoadBackup(req.Location, req.BackupId,
		func(r io.Reader, groupId int, preds predicateSet) (uint64, error) {
			gzReader, err := gzip.NewReader(r)

TODO: Fix for encrypted backups.

worker/online_restore_ee.go, line 203 at r2 (raw file):

			maxUid, err := loadFromBackup(pstore, gzReader, req.RestoreTs, preds)
			if err != nil {
				return 0, err

wrapf

worker/online_restore_ee.go, line 213 at r2 (raw file):

			}
			zc := pb.NewZeroClient(pl.Get())
			_, err = zc.AssignUids(ctx, &pb.Num{Val: maxUid})

if _, err := ; err != nil { ... } Also, do the wrapf.

worker/online_restore_ee.go, line 222 at r2 (raw file):

			return maxUid, nil
		})
	return res.Err

wrapf

worker/restore.go, line 84 at r2 (raw file):

// If ts is greater than zero, the key-value pairs will be written with that timestamp.
// Otherwise, the original value is used.
func loadFromBackup(db *badger.DB, r io.Reader, restoreTs uint64, preds predicateSet) (

No need for restoreTs.

worker/restore.go, line 188 at r2 (raw file):

			case posting.BitSchemaPosting:
				// Schema and type keys should always have a timestamp equal to 1.
				kv.Version = 1

remove.

worker/s3_handler.go, line 326 at r2 (raw file):

// load any backup objects found.
// Returns nil and the maximum Since value on success, error otherwise.
func (h *s3Handler) Load(uri *url.URL, backupId string, fn loadFn) LoadResult {

Does LoadResult have to be public?

…tore

martinmr

Reviewable status: 14 of 22 files reviewed, 16 unresolved discussions (waiting on @golangcibot and @manishrjain)

graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):