[SDK] Snapshot users' workspace into distributed TrainJob workload #48

@andreyvelich

What you would like to be added?

As we discussed earlier (kubeflow/trainer#2324 (comment)), we want to design an approach for snapshotting a user's workspace into a TrainJob (e.g. a distributed ML workload).
To achieve this, we plan to generate a unique TrainJob ID on the client side before submitting the job to the Kubernetes control plane.

During the KubeCon 2024 demo, we showed how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In that demo, we uploaded Python source files to S3 and then loaded them into the TrainJob using initContainers.
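
The flow from the demo could look roughly like the sketch below. It is not the demo's actual code: the function names, the `snapshots/<job-id>/workspace.tar.gz` key layout, the `workspace` volume, and the choice of the `amazon/aws-cli` image are assumptions for illustration. The upload itself is left out because it needs an S3 client and credentials.

```python
import io
import tarfile
from pathlib import Path


def snapshot_workspace(workspace_dir: str, suffixes=(".py",)) -> bytes:
    """Pack matching files under workspace_dir into an in-memory tar.gz archive."""
    root = Path(workspace_dir)
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in sorted(root.rglob("*")):
            if path.is_file() and path.suffix in suffixes:
                # Store paths relative to the workspace root.
                archive.add(str(path), arcname=path.relative_to(root).as_posix())
    return buffer.getvalue()


def snapshot_key(job_id: str) -> str:
    """Object key for the snapshot (hypothetical layout keyed by TrainJob ID)."""
    return f"snapshots/{job_id}/workspace.tar.gz"


def init_container_spec(bucket: str, key: str) -> dict:
    """initContainer that downloads the snapshot into a shared volume.

    Assumes a volume named 'workspace' is defined on the Pod and also
    mounted by the training container; unpacking is left to that container.
    """
    return {
        "name": "load-workspace",
        "image": "amazon/aws-cli",  # image entrypoint is the `aws` CLI
        "args": ["s3", "cp", f"s3://{bucket}/{key}", "/workspace/workspace.tar.gz"],
        "volumeMounts": [{"name": "workspace", "mountPath": "/workspace"}],
    }
```

The archive bytes would be uploaded under `snapshot_key(job_id)` before the TrainJob is submitted, and the returned dict would be added to the Pod template's `initContainers`.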

However, we can also consider other approaches, for instance:

  • Using a distributed cache.
  • Using kubectl cp.
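
For the kubectl cp option, a rough sketch of how the SDK might assemble the command is shown below. The helper name `kubectl_cp_command` and the pod name in the usage line are hypothetical; the sketch only builds the argument list and does not run it. Note that `kubectl cp` relies on a `tar` binary being present in the target container and on the Pod already running.

```python
import shlex


def kubectl_cp_command(local_path, pod, dest_path, namespace="default", container=None):
    """Build a `kubectl cp` invocation that copies local_path into a Pod."""
    # Remote side uses the <namespace>/<pod>:<path> form.
    command = ["kubectl", "cp", local_path, f"{namespace}/{pod}:{dest_path}"]
    if container:
        # -c selects a container when the Pod has more than one.
        command += ["-c", container]
    return command


if __name__ == "__main__":
    # Hypothetical pod name; a real one would come from the TrainJob's Pods.
    cmd = kubectl_cp_command("./workspace", "my-trainjob-node-0", "/workspace")
    print(shlex.join(cmd))
```

The resulting list could be passed to `subprocess.run` once per training Pod, which is one reason this approach may scale less well than an initContainer for large jobs.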

Why is this needed?

This should streamline the experience of data scientists working with the Kubeflow Training Python SDK.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.