Open
Description
What you would like to be added?
As we discussed earlier, we want to design an approach to snapshot a user's workspace into a TrainJob (e.g. a distributed ML workload): kubeflow/trainer#2324 (comment).
To achieve this, we plan to generate a unique TrainJob ID before submitting it to the Kubernetes control plane.
During the KubeCon 2024 demo, we demonstrated how workspace snapshotting might work: https://youtu.be/Lgy4ir1AhYw?t=458.
In this demo, we pushed the Python code files to S3 and then loaded them into the TrainJob using initContainers.
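The flow from the demo could be sketched roughly as follows. This is not the code used in the demo: the function names, the image `example.com/snapshot-loader:latest` (assumed to ship the AWS CLI and `tar`), the volume name `workspace`, and the mount path are all hypothetical. The upload itself is left out; it could be done with any S3 client.

```python
import io
import tarfile
from pathlib import Path


def archive_workspace(workspace: str) -> bytes:
    """Pack every .py file under `workspace` into an in-memory tar.gz."""
    root = Path(workspace)
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for path in sorted(root.rglob("*.py")):
            # Store paths relative to the workspace root.
            tar.add(path, arcname=path.relative_to(root).as_posix())
    return buf.getvalue()


def snapshot_init_container(bucket: str, key: str, mount_path: str = "/workspace") -> dict:
    """Build an initContainer spec that downloads and unpacks the snapshot.

    The image name is a placeholder; it is assumed to contain `aws` and `tar`.
    """
    archive = "/tmp/snapshot.tar.gz"
    script = (
        f"aws s3 cp s3://{bucket}/{key} {archive} && "
        f"tar -xzf {archive} -C {mount_path}"
    )
    return {
        "name": "workspace-snapshot",
        "image": "example.com/snapshot-loader:latest",  # hypothetical image
        "command": ["sh", "-c"],
        "args": [script],
        # The training container would mount the same volume to see the files.
        "volumeMounts": [{"name": "workspace", "mountPath": mount_path}],
    }
```

The bytes returned by `archive_workspace` would be uploaded under a key that includes the TrainJob ID, and the resulting spec would be added to the `initContainers` of the training Pod template.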
However, we can consider various approaches, for instance:
- Using distributed cache.
- Using `kubectl cp`.
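For comparison, the `kubectl cp` option might look like the sketch below. It only builds the command line; the helper name and the example Pod, namespace, and container names are hypothetical. Note that `kubectl cp` copies into a running container and requires `tar` to be present in it, so unlike the initContainer approach the copy would have to happen after the Pod has started.

```python
import subprocess
from typing import List, Optional


def kubectl_cp_command(
    local_path: str,
    namespace: str,
    pod: str,
    remote_path: str,
    container: Optional[str] = None,
) -> List[str]:
    """Build the argv for copying a local path into a Pod with `kubectl cp`."""
    cmd = ["kubectl", "cp", local_path, f"{namespace}/{pod}:{remote_path}"]
    if container is not None:
        # -c selects the target container in a multi-container Pod.
        cmd += ["-c", container]
    return cmd


def copy_workspace(local_path: str, namespace: str, pod: str, remote_path: str) -> None:
    """Run the copy; raises CalledProcessError if kubectl exits non-zero."""
    subprocess.run(
        kubectl_cp_command(local_path, namespace, pod, remote_path),
        check=True,
    )
```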
Why is this needed?
This should streamline the user experience for data scientists working with the Kubeflow Training Python SDK.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.