[SDK] Snapshot users' workspace into distributed TrainJob workload

### What you would like to be added?

As we discussed earlier, we want to design an approach to snapshot a user's workspace into a TrainJob (e.g. a distributed ML workload): https://github.com/kubeflow/training-operator/pull/2324#discussion_r1862719941.
To achieve this, we plan to generate a unique TrainJob ID on the client side before submitting the TrainJob to the Kubernetes control plane.

During the KubeCon 2024 demo, we showed how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In this demo, we pushed Python code files to S3 and then loaded them into the TrainJob using initContainers.
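
As a rough illustration of the "push Python files" step, the sketch below packs a workspace into a single archive using only the standard library. The helper name `snapshot_workspace` and the `*.py` filter are assumptions, not what the demo actually ran. Uploading the archive to S3 and unpacking it in an initContainer are deliberately left out.

```python
import tarfile
from pathlib import Path


def snapshot_workspace(
    workspace: str, archive_path: str, pattern: str = "*.py"
) -> list[str]:
    """Pack matching files into a gzip tarball (hypothetical helper).

    Returns the archived paths, relative to the workspace root.
    """
    root = Path(workspace)
    # Collect files recursively; sorting keeps the archive order stable.
    files = sorted(p for p in root.rglob(pattern) if p.is_file())
    names = [p.relative_to(root).as_posix() for p in files]
    with tarfile.open(archive_path, "w:gz") as tar:
        for path, name in zip(files, names):
            # Store relative names so the archive unpacks cleanly
            # into any target directory inside the training pod.
            tar.add(path, arcname=name)
    return names
```

The resulting archive could then be uploaded under a key derived from the TrainJob ID, so that each job's initContainer knows which snapshot to fetch.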

However, we can consider various approaches, for instance:
- Using a distributed cache.
- Using `kubectl cp`.




### Why is this needed?

This should streamline the user experience for data scientists working with the Kubeflow Training Python SDK.
 

### Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
