Conversation

@nprokopic nprokopic commented Sep 2, 2025

What issue type does this pull request address? (keep at least one, remove the others)
/kind enhancement

What does this pull request do? Which issues does it resolve? (use resolves #<issue_number> if possible)
resolves ENG-8530 and ENG-8531

Please provide a short message that should be published in the vcluster release notes
Add vcluster snapshot create command that creates snapshots asynchronously.

What else do we need to know?
vcluster snapshot create creates a snapshot request, which is then reconciled by a vCluster snapshot controller that creates snapshots in the background.

  • A snapshot request consists of two resources:
    • ConfigMap: the vCluster snapshot controller reconciles ConfigMaps with the special vcluster.loft.sh/snapshot-request label (ATM the ConfigMap is used mostly …)
    • Secret: used to temporarily store the snapshot options (including credentials) so that the vCluster snapshot controller can access the snapshot location. The snapshot request Secret is deleted as soon as the snapshot request has been processed (either completed successfully or failed).
  • The vcluster snapshot create command sends a POST request to the vCluster /vcluster/snapshots route.
  • The /vcluster/snapshots route handler creates the snapshot request resources:
    • For a vCluster with shared nodes, the snapshot request resources are created in the host cluster.
    • For a vCluster with private nodes, the snapshot request resources are created in the virtual cluster.

The old vcluster snapshot command has been deprecated, but it continues to work as before.

The e2e snapshot tests have been updated to test both the old CLI-based vcluster snapshot command and the new controller-based vcluster snapshot create command.

  • The latest platform release uses the old vcluster snapshot command, so it's important to keep running the tests for the old command until the platform is updated to use the new vcluster snapshot create command.
  • The test specs are identical for the CLI-based and the controller-based snapshot commands.
  • The test for vcluster snapshot create creates the snapshot in the container at the path container:///snapshot-data/snapshot.tar.gz. The snapshot container path is backed by a PVC (created in the CI) that is mounted into the vCluster container, so that the restore pod can use it when restoring. The /data/... container path could not be used, because some e2e tests (e.g. the HA setup) use an emptyDir for /data, which is not accessible from a separate restore pod.

@nprokopic nprokopic force-pushed the feature/snapshot-controller/etcd-snapshot branch from 75a63d3 to dc28fef on September 3, 2025 10:45
@nprokopic nprokopic marked this pull request as ready for review September 9, 2025 23:17
@cbron cbron requested a review from jjaferson September 10, 2025 19:34
Comment on lines 101 to 134
responseBody := &bytes.Buffer{}
_, err = responseBody.ReadFrom(response.Body)
if err != nil {
	return fmt.Errorf("failed to read response body: %w", err)
}
snapshotRequestResultJSON := responseBody.Bytes()
var snapshotRequestResult snapshot.Request
err = json.Unmarshal(snapshotRequestResultJSON, &snapshotRequestResult)
if err != nil {
	return fmt.Errorf("failed to unmarshal snapshot request result: %w", err)
}
Contributor

nit: You can accomplish the body reading and unmarshalling with fewer intermediate steps. Something like

var snapshotRequestResult snapshot.Request
if err = json.NewDecoder(response.Body).Decode(&snapshotRequestResult); err != nil {
  return fmt.Errorf("failed to unmarshal snapshot request result: %w", err)
}

Contributor Author

Thanks, will check this out!


// create the snapshot request Secret and ConfigMap
// - for shared nodes, create resources in the host cluster in the vCluster namespace
// - for private nodes, create resources in the virtual cluster in the kube-system namespace
Contributor

For private nodes can we not use the vCluster namespace?

Contributor Author

For private nodes, the snapshot request resources are created inside the virtual cluster, so theoretically we can create them in any namespace. I would save them in a namespace like vcluster-system, but AFAIK we don't have that, so I went with kube-system.

For shared nodes OTOH, the snapshot request resources are created inside the host cluster, so the vCluster namespace makes the most sense.

P.S. Theoretically we could also use the vCluster namespace on the host for private nodes as well, but during the design phase (multiple meetings) we decided to use the virtual cluster for private nodes and the host cluster for shared nodes.

ObjectMeta: metav1.ObjectMeta{
	Namespace: vClusterNamespace,
	Labels: map[string]string{
		RequestLabel: "",
Contributor

Wouldn't it be good to have the vcluster name in the label value, in case we want to filter down snapshot requests from a specific vCluster?

Contributor Author

You can filter them by vcluster namespace, e.g.

kubectl get configmaps -n $VCLUSTER_NAME -l vcluster.loft.sh/snapshot-request
NAME                     DATA   AGE
snapshot-request-2jvb9   1      15h
snapshot-request-6cxg8   1      15s
snapshot-request-bv7pm   1      14h
snapshot-request-c8dbt   1      18s
snapshot-request-clqkf   1      19h
snapshot-request-ftnhd   1      18h
snapshot-request-n6rc9   1      15h
snapshot-request-vm5dg   1      14h

We should also add a new command for listing the snapshots (and in-progress snapshot requests), e.g. vcluster snapshot list, so checking the snapshots will be even easier.

Contributor Author

we can add the vcluster label in any case

Contributor

Also we should also add a new command for listing the snapshots

That is the reason I was asking, but I guess many snapshots can happen in parallel, and it would also help to see the history of snapshots that have been taken.

localPort := RandomPort()
errorChan := make(chan error)
go func() {
errorChan <- portforward.StartPortForwardingWithRestart(ctx, kubeConfig, "127.0.0.1", podName, vCluster.Namespace, strconv.Itoa(localPort), "8443", make(chan struct{}), portForwardingOptions.StdOut, portForwardingOptions.StdErr, log)
Contributor

What would happen with the port forwarding if the command is interrupted? Would this process get cancelled?

Contributor Author

@nprokopic nprokopic Sep 11, 2025

Not sure TBH, I have just used the existing function (moved it here so it can be reused) for getting the virtual cluster config, so I didn't dig into it to see how it works in detail.

Contributor

I see, I was wondering because we have an issue where the vcluster was being paused and not resumed due to the restore command being interrupted https://linear.app/loft/issue/ENG-8545/bug-vcluster-restore-leaves-pod-scaled-down-if-interrupted-preventing

Contributor Author

Here specifically, if you interrupt the CLI command while it's trying to get the kubeconfig for accessing the virtual cluster, then the snapshot request shouldn't even be sent, because the CLI first has to connect in order to get access to the virtual cluster API server.

@nprokopic nprokopic force-pushed the feature/snapshot-controller/etcd-snapshot branch from 550cdb2 to 35ca0c4 on September 11, 2025 11:15
hidalgopl
hidalgopl previously approved these changes Sep 11, 2025
Contributor

@jjaferson jjaferson left a comment

Left a few suggestions for improvement in the e2e tests

})
// check deployment is deleted
Eventually(func() error {
_, err := f.VClusterClient.CoreV1().Secrets(testNamespaceName).List(f.Context, metav1.ListOptions{
Contributor

I guess we should check if the deployment was deleted here

if container.State.Running == nil || !container.Ready {
return fmt.Errorf("pod %s container %s is not running", pod.Name, container.Name)
}
if err != nil {
Contributor

Should we check for not found?

Containers: []corev1.Container{
{
Name: "example-container",
Image: "nginx:1.25.0",
Contributor

Nit: move this into a constant

AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
Resources: corev1.VolumeResourceRequirements{
Requests: corev1.ResourceList{
corev1.ResourceStorage: resource.MustParse("5Gi"),
Contributor

Nit: move this to a constant

return fmt.Errorf("pod %s has no container status", pod.Name)
}
// check configmap is deleted
Eventually(func() error {
Contributor

Maybe we can move this into a generic function, e.g. verifyResourceDeleted(resourceType) error.
