
restic integration with Ark #508

Merged: 2 commits merged from real-restic into master on Jun 6, 2018

Conversation

@skriss (Contributor) commented May 18, 2018:

Fixes #19

Reviewers: I wouldn't bother looking through individual commits, just the overall diff. I'll probably squash them all.

TODO:

  • update unit tests
  • update Makefile / associated build artifacts
  • finalize maintenance schedule
  • run backups/restores of multiple volumes within a pod concurrently

@ncdc (Contributor) left a comment:

Did a quick first pass. Will look in more detail at the restic repository manager and below tomorrow.

return false
}

fmt.Printf("Found %s", doneFile)
Contributor:

Do you need to break here?

Contributor (Author):

no, since we need to make it through this loop for every child that's a directory. the Printf is for information only.

os.Setenv("AZURE_ACCOUNT_NAME", os.Getenv("AZURE_STORAGE_ACCOUNT_ID"))
os.Setenv("AZURE_ACCOUNT_KEY", os.Getenv("AZURE_STORAGE_KEY"))

resticArgs := append([]string{"-p=" + passwordFile}, os.Args[1:]...)
Contributor:

If there are longhand forms for the flags, it might be clearer to use those.

Contributor (Author):

good idea.
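
For reference, a minimal sketch of the longhand form (restic's long flag for -p is --password-file; the surrounding variables come from the snippet above):

resticArgs := append([]string{"--password-file=" + passwordFile}, os.Args[1:]...)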

checkError(resticCmd.Run(), "ERROR running restic command")
}

func repoName(args []string) string {
Contributor:

doc/example pls

// otherwise.
func checkError(err error, msg string) {
if err != nil {
fmt.Fprintf(os.Stderr, msg+": %s\n", err)
Contributor:

We usually %v errors

// checkError prints an error message and the error's text to stderr and
// exits with a status code of 1 if the error is not nil, or is a no-op
// otherwise.
func checkError(err error, msg string) {
Contributor:

Since you're printing the msg and then the error, I think it would make sense to reverse the order of the arguments.
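
A hedged sketch of checkError with the arguments reversed (and the error formatted with %v, per the earlier note); illustrative only, not the final implementation:

// checkError prints msg and the error to stderr and exits with status 1
// if err is non-nil; otherwise it is a no-op.
func checkError(msg string, err error) {
	if err != nil {
		fmt.Fprintf(os.Stderr, "%s: %v\n", msg, err)
		os.Exit(1)
	}
}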

// for each volume to backup:
for _, volumeName := range volumesToBackup {
wg.Add(1)
go func(volumeName string) {
Contributor:

Could this be declared standalone instead of as an anonymous function?
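
As a rough sketch, the closure could become a named method; the receiver and parameter names below are assumptions based on the surrounding snippets:

// backupVolume is a hypothetical extraction of the per-volume closure.
func (br *backupperRestorer) backupVolume(pod *corev1api.Pod, volumeName string, wg *sync.WaitGroup) {
	defer wg.Done()
	// ... per-volume backup logic that currently lives in the anonymous function ...
}

// call site:
for _, volumeName := range volumesToBackup {
	wg.Add(1)
	go br.backupVolume(pod, volumeName, &wg)
}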

cmd := backupCommand(br.metadataManager.RepoPrefix(), pod.Namespace, string(pod.UID), volumeDir, snapshotTags)

// exec it in the daemonset pod
if err := br.daemonSetExecutor.Exec(pod.Spec.NodeName, cmd, 10*time.Minute, log); err != nil {
Contributor:

10 minutes?

Contributor (Author):

we probably need to make this tuneable.

Contributor:

I know we're switching to a CRD instead of using exec, but yes, we'll need this to be configurable from the start.

Contributor:

Either per backup, globally, or both (use the backup's value but fall back to the global one if unset).
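
A minimal sketch of that fallback behavior (field and function names are hypothetical):

// podVolumeTimeout returns the backup's own timeout if set, otherwise the
// server-wide default. PodVolumeOperationTimeout is a hypothetical field.
func podVolumeTimeout(backup *arkv1api.Backup, globalDefault time.Duration) time.Duration {
	if backup.Spec.PodVolumeOperationTimeout.Duration > 0 {
		return backup.Spec.PodVolumeOperationTimeout.Duration
	}
	return globalDefault
}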

// save the snapshot's ID in the pod annotation
lock.Lock()
SetPodSnapshotAnnotation(pod, volumeName, snapshotID)
backupSnapshots = append(backupSnapshots, fmt.Sprintf("%s/%s", pod.Namespace, snapshotID))
Contributor:

Instead of using a lock, could we somehow use channel(s) to coordinate retrieving the snapshot id?
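
A sketch of the channel-based approach, assuming a small result type and a hypothetical runBackup helper; the goroutines send results instead of locking shared state, and the caller applies the annotations:

type volumeResult struct {
	volumeName string
	snapshotID string
}

results := make(chan volumeResult, len(volumesToBackup))

for _, volumeName := range volumesToBackup {
	go func(volumeName string) {
		snapshotID := runBackup(volumeName) // hypothetical helper wrapping the restic exec
		results <- volumeResult{volumeName: volumeName, snapshotID: snapshotID}
	}(volumeName)
}

// collect results in the calling goroutine, so no lock is needed
for range volumesToBackup {
	res := <-results
	SetPodSnapshotAnnotation(pod, res.volumeName, res.snapshotID)
	backupSnapshots = append(backupSnapshots, fmt.Sprintf("%s/%s", pod.Namespace, res.snapshotID))
}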

}

func (dse *defaultDaemonSetExecutor) getDaemonSetPod(node string) (*corev1api.Pod, error) {
// TODO we may want to cache this list
Contributor:

We can start the k8s shared informer factory beforehand and use the pod informer
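
A hedged sketch of the informer-based lookup using client-go's informers and apimachinery's labels packages (the client, stop channel, namespace, and pod label are assumptions):

// Build the factory once at startup and reuse the pod lister's cache
// instead of listing pods from the API server on every call.
factory := informers.NewSharedInformerFactory(kubeClient, 0)
podLister := factory.Core().V1().Pods().Lister()
factory.Start(stopCh)
factory.WaitForCacheSync(stopCh)

// later, when resolving the daemonset pod for a node (label is hypothetical):
pods, err := podLister.Pods(arkNamespace).List(labels.Set{"name": "restic"}.AsSelector())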

}

// strip the trailing '/' if it exists
if prefix[len(prefix)-1] == '/' {
Contributor:

strings.TrimSuffix
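
In other words, the trailing-slash check above could collapse to a single call, e.g.:

prefix = strings.TrimSuffix(prefix, "/")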

"k8s.io/client-go/rest"
)

func main() {
Contributor (Author):

@ncdc I was considering trying to get rid of this binary. Since it's used in the daemonset, that would require the Ark server to create a temp creds file in the appropriate daemonset pod. Do you think that's a good idea?

@skriss (Contributor, Author) commented May 23, 2018:

@ncdc any issue with me squashing most/all of the commits down to 1?

go func() {
interval := time.Hour

<-time.After(interval)
Contributor:

I don't think we need to have an initial one hour delay. If you wanted to be more confident that maintenance is run hourly (and especially not every time you start the server, if you're rapidly starting/quitting/restarting), we could have some CRD that keeps track of overall restic status, including the last maintenance time.

@@ -314,6 +348,7 @@ var defaultResourcePriorities = []string{
"configmaps",
"serviceaccounts",
"limitranges",
"pods",
Contributor:

We should comment why the list is in this order: namespaces first because everything that isn't cluster-scoped lives in them, PVs because we need to get the volumes before PVCs or pods, ..., pods because we might be restoring restic backups, and we need to try to restore the pods before allowing the controllers (deployments, replicasets, statefulsets, etc) to spin up their own pods

}

// ensure a repo exists for the pod's namespace
exists, err := br.metadataManager.RepositoryExists(pod.Namespace)
Contributor:

Would it be worth it to combine RepositoryExists and InitRepo as EnsureRepo?
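
A sketch of what the combined method could look like, assuming RepositoryExists and InitRepo both live on the repositoryManager shown in the surrounding snippets:

// EnsureRepo checks whether a repo already exists for the given name and
// initializes one if it does not.
func (rm *repositoryManager) EnsureRepo(name string) error {
	exists, err := rm.RepositoryExists(name)
	if err != nil {
		return err
	}
	if exists {
		return nil
	}
	return rm.InitRepo(name)
}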

return errors.WithStack(err)
}

if err := ib.resticBackupper.BackupPodVolumes(ib.backup, pod, log); err != nil {
Contributor:

Since we've called restic.GetVolumesToBackup above, maybe we should pass that data into BackupPodVolumes so we don't have to do it twice?

RestorePodVolumes(restore *arkv1api.Restore, pod *corev1api.Pod, log logrus.FieldLogger) error
}

func (br *backupperRestorer) BackupPodVolumes(backup *arkv1api.Backup, pod *corev1api.Pod, log logrus.FieldLogger) error {
Contributor:

How would you feel about this function not mutating either the backup or the pod, and instead returning the appropriate information so that another function can deal with the mutations? It should make it easier to unit test where we call BackupPodVolumes (in item_backupper) - we'd be able to test the changes coming back from a mocked resticBackupper.

}

func (rm *repositoryManager) getAllRepos() ([]string, error) {
prefixes, err := rm.objectStore.ListCommonPrefixes(rm.bucket, "/")
Contributor:

If the resticLocation that came from the config was bucket/dir, that goes into rm.bucket - will that work here having both the bucket and the directory in a single string? Or do we need to call ListCommonPrefixes(bucketName, dir+"/")?

return errors.WithStack(err)
}

if err := ib.resticBackupper.BackupPodVolumes(ib.backup, pod, log); err != nil {
Contributor:

Note, this invocation ends up listing all the common prefixes (i.e. repos) in resticLocation in object storage (ensuring the restic repo exists for the namespace) for each pod. Are you concerned about how frequently we'll be running that same query for a given namespace? I know the cost is fairly minimal, but there is still a cost associated with listing from object storage. We may want something more efficient and/or a cache of the data.

}

func (rm *repositoryManager) InitRepo(name string) error {
resticCreds, err := rm.secretsClient.Get(credsSecret, metav1.GetOptions{})
Contributor:

The plan was to require an admin to precreate manually per namespace, wasn't it?

Contributor:

We should discuss the UI/UX. We could have something like:

# generates an encryption key, creates a secret for it, maybe creates a new ResticRepo, ark server sees it's New and runs restic init?
ark restic init --namespace foo

# same as above, but uses a precreated encryption key instead of generating
ark restic init --namespace foo --encryption-key /path/to/file

Contributor:

If we were to add a ResticRepo CRD, we could use that to manage its status (needs init, already done, etc). Not sure if we need this or how it would impact backup replication. Would probably have to be tied to a BackupTarget when we get there.

}

defer func() {
file.Close()
Contributor:

Need better error handling around close/remove errors.
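
One possible shape, as a sketch (assumes a logrus FieldLogger named log and that the temp file is also removed in the same deferred function):

defer func() {
	if err := file.Close(); err != nil {
		log.WithError(err).Errorf("error closing %s", file.Name())
	}
	if err := os.Remove(file.Name()); err != nil {
		log.WithError(err).Errorf("error removing %s", file.Name())
	}
}()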

initContainer.VolumeMounts = append(initContainer.VolumeMounts, mount)
}

if len(pod.Spec.InitContainers) == 0 || pod.Spec.InitContainers[0].Name != "restic-wait" {
Contributor:

If there are no init containers, list=initContainer

If the first init container isn't restic-wait, set list=initContainer, rest...

If the first init container is restic-wait, overwrite it with initContainer

?
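
A sketch of that branching, purely illustrative (type names follow the corev1api alias used elsewhere in the diff):

switch {
case len(pod.Spec.InitContainers) == 0:
	// no init containers yet: restic-wait becomes the whole list
	pod.Spec.InitContainers = []corev1api.Container{initContainer}
case pod.Spec.InitContainers[0].Name != "restic-wait":
	// prepend restic-wait ahead of the existing init containers
	pod.Spec.InitContainers = append([]corev1api.Container{initContainer}, pod.Spec.InitContainers...)
default:
	// first init container is already restic-wait: overwrite it
	pod.Spec.InitContainers[0] = initContainer
}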

@skriss skriss force-pushed the real-restic branch 3 times, most recently from dcd1f05 to 95b82e4 on May 25, 2018 at 04:24
@skriss (Contributor, Author) commented May 25, 2018:

@ncdc switched restores to using a CRD, and did some cleanup with the backup CRD/controller.

Still to do:

  • change backupper_restorer to not be global (and it should be easy to split in half now)
  • support manual repo key creation
  • some lower-level cleanup

case <-timeout:
errs = append(errs, errors.New("timed out waiting for all PodVolumeBackups to complete"))
break ForLoop
case res := <-br.backupResults[backup.UID]:
Contributor:

res, ok and check if !ok (channel closed)
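
i.e., something along these lines (sketch only):

case res, ok := <-br.backupResults[backup.UID]:
	if !ok {
		// channel was closed; stop waiting for further results
		break ForLoop
	}
	// ... handle res ...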

},
)

// TODO do I need to use an actual context here?
Contributor:

Either a context or a stop channel passed in to the constructor. Use that single context/channel when running the informers and waiting for cache syncs.

br.metadataManager.RLock(pod.Namespace)
defer br.metadataManager.RUnlock(pod.Namespace)

// TODO should we return here, or continue with what we can?
Contributor:

As discussed briefly on slack, continue with what we can. We need a solid UX around error reporting (probably after this PR).

defer delete(br.backupResults, backup.UID)

for _, volumeName := range volumesToBackup {
br.metadataManager.RLock(pod.Namespace)
Contributor:

It's a bit scary needing to expose a lock in another component and lock/unlock it outside said component. Is there an alternative?

},
},
Labels: map[string]string{
"ark.heptio.com/backup": backup.Name,
Contributor:

Should include uid too. We have constants for backup name and uid labels that I added for delete backup requests: BackupNameLabel and BackupUIDLabel. They're in delete_backup_request.go but we should probably move them to a labels.go file.

"ns": pod.Namespace,
"volume": volumeName,
},
RepoPrefix: br.metadataManager.RepoPrefix(),
Contributor:

Let's discuss if we need this in the future (when we have backup targets)

br.metadataManager.RLock(pod.Namespace)
defer br.metadataManager.RUnlock(pod.Namespace)

// TODO should we return here, or continue with what we can?
Contributor:

continue

},
},
Labels: map[string]string{
"ark.heptio.com/restore": restore.Name,
Contributor:

make it a constant in pkg/apis/... and also use uid

// Ark stores restic backups of pod volumes, specified either as "bucket" or
// "bucket/prefix". This bucket must be different than the `Bucket` field.
// Optional.
ResticLocation string `json:"resticLocation"`
Contributor:

Given that restic has its own set of supported storage backends, upon further reflection, I think we should split this out into a separate struct. Otherwise it implies that we/restic support all possible ObjectStores Ark supports (including any/all 3rd party plugins).

Let's have fields for provider name, bucket, prefix, and an open config map.
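
A rough sketch of the split-out struct (field names are hypothetical and would still need to be wired into the Config type):

// ResticLocation describes where restic stores pod volume backups,
// independently of Ark's regular ObjectStore configuration.
type ResticLocation struct {
	// Provider is the restic-supported storage backend.
	Provider string `json:"provider"`
	// Bucket is the bucket (or container) restic writes to.
	Bucket string `json:"bucket"`
	// Prefix is an optional directory within the bucket.
	Prefix string `json:"prefix,omitempty"`
	// Config holds provider-specific settings.
	Config map[string]string `json:"config,omitempty"`
}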


resultChan := make(chan *arkv1api.PodVolumeBackup)

informer := arkv1informers.NewFilteredPodVolumeBackupInformer(
Contributor:

We will probably want to have 1 informer running per backup, and this does 1 informer per backup+pod. We could add a new struct that would be per-backup, and create a new one at the beginning of a backup.

case res := <-resultChan:
switch res.Status.Phase {
case arkv1api.PodVolumeBackupPhaseCompleted:
SetPodSnapshotAnnotation(pod, res.Spec.Volume, res.Status.SnapshotID)
Contributor:

I still think I'd like to see this function return the pod/snapshot associations, and have another function modify that pod. It hopefully will make testing easier.

},
}

// TODO should we return here, or continue with what we can?
Contributor:

record the error & continue

UpdateFunc: func(_, obj interface{}) {
pvb := obj.(*arkv1api.PodVolumeBackup)

if pvb.Status.Phase == arkv1api.PodVolumeBackupPhaseCompleted || pvb.Status.Phase == arkv1api.PodVolumeBackupPhaseFailed {
Contributor:

Can this function be run in parallel for 2 pods for the same backup? We'll likely need to match PVBs for this backup's name/UID.
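
A sketch of the suggested filtering inside the update handler, assuming the PodVolumeBackups are labeled with the backup UID (the label constant name is borrowed from the earlier BackupUIDLabel comment; its package placement is an assumption):

UpdateFunc: func(_, obj interface{}) {
	pvb := obj.(*arkv1api.PodVolumeBackup)

	// ignore PodVolumeBackups that belong to a different backup
	if pvb.Labels[arkv1api.BackupUIDLabel] != string(backup.UID) {
		return
	}

	if pvb.Status.Phase == arkv1api.PodVolumeBackupPhaseCompleted || pvb.Status.Phase == arkv1api.PodVolumeBackupPhaseFailed {
		// ... deliver the result for this backup ...
	}
},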

UpdateFunc: func(_, obj interface{}) {
pvr := obj.(*arkv1api.PodVolumeRestore)

if pvr.Status.Phase == arkv1api.PodVolumeRestorePhaseCompleted || pvr.Status.Phase == arkv1api.PodVolumeRestorePhaseFailed {
Contributor:

Can this function be run in parallel for 2 pods for the same restore? We'll likely need to match PVRs for this restore's name/UID.


resultChan := make(chan *arkv1api.PodVolumeRestore)

informer := arkv1informers.NewFilteredPodVolumeRestoreInformer(
Contributor:

Same comment as above re 1 informer per restore instead of 1 per restore+pod.

@skriss skriss force-pushed the real-restic branch 3 times, most recently from 1e92200 to 67cd96f on May 30, 2018 at 22:54
@@ -0,0 +1,23 @@
# Copyright 2017 the Heptio Ark contributors.
Contributor:

Should be 2018.

@@ -0,0 +1,61 @@
package main
Contributor:

Add license header please

@@ -29,11 +29,12 @@ spec:
serviceAccountName: ark
containers:
- name: ark
image: gcr.io/heptio-images/ark:latest
image: gcr.io/steve-heptio/ark:restic
Contributor:

Undo

command:
- /ark
args:
- server
- --log-level=debug
Contributor:

Undo?

# See the License for the specific language governing permissions and
# limitations under the License.

---
Contributor:

Technically not needed - it's only used to separate between previous item and next item

}

r.resultsLock.Lock()
r.results[resultsKey(pod.Namespace, pod.Name)] = make(chan *arkv1api.PodVolumeRestore)
Contributor:

Save channel as var and use below in for/select loop

repoManager *repositoryManager
informer cache.SharedIndexInformer
results map[string]chan *arkv1api.PodVolumeRestore
resultsLock sync.Mutex
Contributor:

Reorganize so everything locked by resultsLock is grouped below it

r.repoManager.repoLocker.Lock(pod.Namespace, false)
defer r.repoManager.repoLocker.Unlock(pod.Namespace, false)

volumeRestore := &arkv1api.PodVolumeRestore{
Contributor:

helper func

)

for volume, snapshot := range volumesToRestore {
r.repoManager.repoLocker.Lock(pod.Namespace, false)
Contributor:

lock once per namespace, outside of this func

@@ -0,0 +1,26 @@
# Copyright 2018 the Heptio Ark contributors.
Contributor:

This file is no longer used?

@ncdc (Contributor) commented Jun 4, 2018:

Also need to remove the restic-init-container/restic-init-container binary executable from the branch

@ncdc (Contributor) commented Jun 4, 2018 via email

@skriss skriss force-pushed the real-restic branch 2 times, most recently from 363b3de to 64e0ca4 on June 4, 2018 at 21:45
@skriss (Contributor, Author) commented Jun 4, 2018:

@ncdc I believe I've addressed all your material comments, with a handful of minor things to do prior to 0.9. I also pulled your CLI commit into this PR. When you've had a chance to go through the latest code review commits, I'm going to squash this whole thing down to just 2 commits, one for vendor/ updates and one for the rest.

@skriss (Contributor, Author) commented Jun 4, 2018:

We'll need to update CI to use the new make all-containers and make all-push targets.

cancelFunc context.CancelFunc
}

func NewGenericServer(
@ncdc (Contributor) commented Jun 5, 2018:

I am not sure this is helpful. There is not enough commonality between the daemonset command and the server command.

DS:

  • ✅ sets controllers
  • ❌ doesn't set preStartControllers
  • ✅ sets postStartControllers
  • ❌ doesn't set cleanup

Server:

  • ❌ doesn't set controllers
  • ✅ sets preStartControllers
  • ✅ sets postStartControllers
  • ✅ sets cleanup

If we decide to keep this, it would look better using functional options, so the constructor would be reduced to

func NewGenericServer(logger logrus.FieldLogger, options ...Option)

The common functionality used by both DS and Server is essentially:

  • handling shutdown signals
  • running controllers (although the registration is handled differently between DS and Server)

Pre/PostStartControllers and Cleanup are not used by both.

I don't think it's a burden to have 2 independent Run() implementations each invoke handleShutdownSignals.

We could consider creating a "controller executor" like this:

type ControllerRegistration struct {
	controller controller.Interface
	workers    int
}

type ControllerExecutor struct {
	controllers []ControllerRegistration
}

func (ce *ControllerExecutor) Execute(ctx context.Context) {
	var wg sync.WaitGroup
	for _, c := range ce.controllers {
		wg.Add(1)
		go func(c ControllerRegistration) {
			defer wg.Done()
			c.controller.Run(ctx, c.workers)
		}(c)
	}
	// wait for all controllers to finish before returning
	wg.Wait()
}

Both the DS and Server structs could use this. I think I'd prefer an approach of this nature as opposed to the genericServer.

Contributor (Author):

OK. I'll back this out.

@skriss (Contributor, Author) commented Jun 6, 2018:

Updated to just create common funcs for getting loggers, handling signals. Can do any other refactoring later.

@ncdc (Contributor) commented Jun 6, 2018:

Thanks. Squash away.

@skriss skriss changed the title from "[WIP] restic integration with Ark" to "restic integration with Ark" on Jun 6, 2018
@skriss (Contributor, Author) commented Jun 6, 2018:

squashed and green!

@ncdc ncdc merged commit ed2d7b4 into vmware-tanzu:master Jun 6, 2018
@skriss skriss deleted the real-restic branch June 8, 2018 22:11