-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic online restore. #5095
Basic online restore. #5095
Conversation
This change allows Dgraph to restore a backup. This PR provides basic functionality that allows us to restore a backup to an empty cluster. There's still work that needs to be done to allow a restore to work on a cluster with existing data. Tablets need to be moved to the right group for this part to work correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 22 files reviewed, 4 unresolved discussions (waiting on @golangcibot and @manishrjain)
worker/online_restore_ee.go, line 36 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
Error return value of
UpdateMembershipState
is not checked (fromerrcheck
)
Done.
worker/online_restore_ee.go, line 145 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
ineffectual assignment to
err
(fromineffassign
)
Done.
worker/online_restore_ee.go, line 146 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
ineffectual assignment to
err
(fromineffassign
)
Done.
worker/online_restore_ee.go, line 160 at r1 (raw file):
Previously, golangcibot (Bot from GolangCI) wrote…
printf: Errorf format %s has arg req.GroupId of wrong type uint32 (from
govet
)
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember to add all the TODOs -- and also do address all those quickly after this. Don't let them hang around for too long.
Reviewed 20 of 22 files at r1, 2 of 2 files at r2.
Reviewable status: all files reviewed, 16 unresolved discussions (waiting on @golangcibot and @martinmr)
graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):
""" Start restoring a binary backup. See : https://docs.dgraph.io/enterprise-features/#binary-backups
100 chars.
graphql/admin/restore.go, line 47 at r2 (raw file):
func (rr *restoreResolver) Rewrite( m schema.Mutation) (*gql.GraphQuery, []*dgoapi.Mutation, error) { glog.Info("Got restore request")
remove, or expand?
protos/pb.proto, line 284 at r2 (raw file):
repeated KV kv = 4; MembershipState state = 5; string clean_predicate = 6; // Delete the predicate which was moved to other group.
Avoid unnecessary changes.
worker/online_restore_ee.go, line 120 at r2 (raw file):
return errors.Errorf("nil restore request") }
Add a TODO for few things.
- Add this as an operation, so rollups and others would stop.
- Switch to draining mode.
Ensure defer handle both on exit from this func.
Another TODO is to ensure that all the groups have received this command and ack that they have proposed and "found" it come back via Raft. If you return after found, the user can determine if the restore is done or not. That they can achieve by asking for status/health endpoint via GraphQL. It would tell them which operations are active. Once, the restore op is done, it would be removed from that list (this should be documented).
Another TODO is to allow the user to specify which backup series to restore from.
TODO: Allow a way to retrieve the list of backup series via graphql (JIRA ticket).
worker/online_restore_ee.go, line 186 at r2 (raw file):
// from being replayed. if err := groups().Node.proposeSnapshot(1); err != nil { return err
errors.Wrapf(...)
worker/online_restore_ee.go, line 196 at r2 (raw file):
res := LoadBackup(req.Location, req.BackupId, func(r io.Reader, groupId int, preds predicateSet) (uint64, error) { gzReader, err := gzip.NewReader(r)
TODO: Fix for encrypted backups.
worker/online_restore_ee.go, line 203 at r2 (raw file):
maxUid, err := loadFromBackup(pstore, gzReader, req.RestoreTs, preds) if err != nil { return 0, err
wrapf
worker/online_restore_ee.go, line 213 at r2 (raw file):
} zc := pb.NewZeroClient(pl.Get()) _, err = zc.AssignUids(ctx, &pb.Num{Val: maxUid})
if _, err := ; err != nil { ... }
Also, do the wrapf.
worker/online_restore_ee.go, line 222 at r2 (raw file):
return maxUid, nil }) return res.Err
wrapf
worker/restore.go, line 84 at r2 (raw file):
// If ts is greater than zero, the key-value pairs will be written with that timestamp. // Otherwise, the original value is used. func loadFromBackup(db *badger.DB, r io.Reader, restoreTs uint64, preds predicateSet) (
No need for restoreTs.
worker/restore.go, line 188 at r2 (raw file):
case posting.BitSchemaPosting: // Schema and type keys should always have a timestamp equal to 1. kv.Version = 1
remove.
worker/s3_handler.go, line 326 at r2 (raw file):
// load any backup objects found. // Returns nil and the maximum Since value on success, error otherwise. func (h *s3Handler) Load(uri *url.URL, backupId string, fn loadFn) LoadResult {
Does LoadResult have to be public?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 14 of 22 files reviewed, 16 unresolved discussions (waiting on @golangcibot and @manishrjain)
graphql/admin/endpoints_ee.go, line 287 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
100 chars.
Done.
graphql/admin/restore.go, line 47 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
remove, or expand?
Done.
protos/pb.proto, line 284 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Avoid unnecessary changes.
Done.
worker/online_restore_ee.go, line 120 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add a TODO for few things.
- Add this as an operation, so rollups and others would stop.
- Switch to draining mode.
Ensure defer handle both on exit from this func.
Another TODO is to ensure that all the groups have received this command and ack that they have proposed and "found" it come back via Raft. If you return after found, the user can determine if the restore is done or not. That they can achieve by asking for status/health endpoint via GraphQL. It would tell them which operations are active. Once, the restore op is done, it would be removed from that list (this should be documented).
Another TODO is to allow the user to specify which backup series to restore from.
TODO: Allow a way to retrieve the list of backup series via graphql (JIRA ticket).
Done. Added the todos and opened JIRA tickets for them.
worker/online_restore_ee.go, line 186 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
errors.Wrapf(...)
Done.
worker/online_restore_ee.go, line 196 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
TODO: Fix for encrypted backups.
Done. Added a TODO.
worker/online_restore_ee.go, line 203 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
wrapf
Done.
worker/online_restore_ee.go, line 213 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
if _, err := ; err != nil { ... }
Also, do the wrapf.
Done.
worker/online_restore_ee.go, line 222 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
wrapf
Done.
worker/restore.go, line 84 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
No need for restoreTs.
I think this is needed. For example, let's assume restoreTs=10. If the backup has keys with versions > 10, those values will not be seen by queries until the timestamp of the cluster reaches the timestamp of the values.
We either have to do this or advance the timestamp of the cluster to the max timestamp seen in the backup. I've added a TODO here as well to look further into it.
worker/restore.go, line 188 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
remove.
Done.
worker/s3_handler.go, line 326 at r2 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Does LoadResult have to be public?
No. I've added a todo to do the refactor in a later PR.
This change allows Dgraph to restore a backup. This PR provides basic functionality that allows us to restore a backup to an empty cluster. There's still work that needs to be done to allow a restore to work on a cluster with existing data. Tablets need to be moved to the right group for this part to work correctly.
This change allows Dgraph to restore a backup. This PR provides basic
functionality that allows us to restore a backup to an empty cluster.
There's still work that needs to be done to allow a restore to work on
a cluster with existing data. Tablets need to be moved to the right
group for this part to work correctly.
This change is
Docs Preview: