feat: KEP-2 Local Execution Mode Proposal #34
Merged
Commits (9), all by szaher:
- 8378fbc KEP-2: Local Execution Mode Proposal
- e59b17f Updated proposal
- b64c955 update proposal to use training backends
- 336af96 add constraint on resource limitation for local mode
- 9e1ac06 Move proposals into docs
- c2e51bf Use ExecutionBackends instead of TrainingBackends
- ef4a4a7 update docs and graphs
- 93816d7 Merge branch 'local-exec-proposal' of github.com:szaher/sdk into loca…
- f54795e update graphs
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # KEP-2: Trainer Local Execution | ||
|
|
||
| ## Summary | ||
|
|
||
| This KEP proposes a local execution mode for the Kubeflow Trainer SDK, allowing machine learning (ML) engineers to test and experiment with their models locally before submitting them to Kubernetes-based infrastructure. ML engineers will be able to run training jobs in isolated environments created with subprocesses, Docker, or other container runtimes, reducing the cost and time spent running experiments on expensive cloud resources. Local execution enables rapid prototyping, debugging, and validation of training jobs. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Currently, Kubeflow’s Trainer SDK requires jobs to be executed on a Kubernetes cluster. This setup can incur significant costs and delays, especially for early-stage model experiments. ML engineers often want to experiment locally before scaling their models to a full cloud-based infrastructure. The proposed local execution mode will let engineers quickly test their models in isolated containers, or in virtualenvs via subprocess, for a faster and more efficient workflow. | ||
|
|
||
| ### Goals | ||
| - Allow users to run training jobs on their local machines using container runtimes or subprocess. | ||
| - Implement a Local Trainer Client that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes. | ||
| - Provide an implementation that supports PyTorch, with the potential to extend to other ML frameworks or runtimes. | ||
| - Ensure compatibility with existing Kubeflow Trainer SDK features and user interfaces. | ||
|
|
||
| ### Non-Goals | ||
| - Full support for distributed training in the first phase of implementation. | ||
| - Support for all ML frameworks or runtime environments in the initial proof-of-concept. | ||
| - Major changes to the Kubeflow Trainer SDK architecture. | ||
|
|
||
| ## Proposal | ||
|
|
||
| The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup without requiring Kubernetes. | ||
|
|
||
|  | ||
|
|
||
| ### User Stories (Optional) | ||
|
|
||
| #### Story 1 | ||
| As an ML engineer, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster. | ||
|
|
||
| #### Story 2 | ||
| As an ML engineer, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment. | ||
|
|
||
| ### Notes/Constraints/Caveats | ||
| - The local execution mode will work only with Podman, Docker, and subprocess. | ||
| - The subprocess implementation will be restricted to a single node. | ||
| - The local execution mode will initially support only the PyTorch runtime. | ||
|
|
||
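To make the subprocess constraint concrete, here is a minimal sketch of how a single-node subprocess backend might behave. The helper name `run_subprocess_job` and its parameters are hypothetical and are not part of the proposal or the Kubeflow SDK; the sketch only assumes that training code is run in a child Python interpreter and that multi-node requests are rejected.

```python
import subprocess
import sys
import textwrap


def run_subprocess_job(train_source: str, num_nodes: int = 1, timeout: float = 60.0) -> str:
    """Run training code in a child Python interpreter and return its stdout.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # The proposal restricts the subprocess implementation to a single node.
    if num_nodes != 1:
        raise ValueError("subprocess execution supports a single node only")

    result = subprocess.run(
        [sys.executable, "-c", textwrap.dedent(train_source)],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str
        timeout=timeout,      # stop waiting if the job runs too long
        check=True,           # raise CalledProcessError if the code exits non-zero
    )
    return result.stdout


if __name__ == "__main__":
    print(run_subprocess_job("print('training step done')"))
```

A real backend would presumably also create a virtualenv and install runtime dependencies before launching the job; that part is omitted here.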
| ### Risks and Mitigations | ||
| - **Risk**: Compatibility issues with non-Docker container runtimes. | ||
| - **Mitigation**: Initially restrict support to Podman/Docker and evaluate alternatives for future phases. | ||
| - **Risk**: Potential conflicts between local and Kubernetes execution modes. | ||
| - **Mitigation**: Ensure that the local trainer client is implemented with the exact same interface as the current TrainerClient to enable users to switch between both seamlessly. | ||
|
|
||
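The second mitigation relies on both clients exposing the same interface. The sketch below illustrates the idea with a structural type; `TrainerLike`, `FakeLocalClient`, and the method names `train` and `get_job_logs` are assumptions chosen for illustration, not the actual `TrainerClient` signatures.

```python
from typing import Dict, List, Protocol


class TrainerLike(Protocol):
    """Hypothetical shared surface; method names are assumptions for illustration."""

    def train(self, command: List[str]) -> str: ...

    def get_job_logs(self, job_id: str) -> str: ...


class FakeLocalClient:
    """Stand-in for a local client; keeps jobs in memory instead of starting containers."""

    def __init__(self) -> None:
        self._jobs: Dict[str, List[str]] = {}

    def train(self, command: List[str]) -> str:
        job_id = f"local-{len(self._jobs) + 1}"
        self._jobs[job_id] = command
        return job_id

    def get_job_logs(self, job_id: str) -> str:
        return "would run: " + " ".join(self._jobs[job_id])


def submit(client: TrainerLike, command: List[str]) -> str:
    """User code depends only on the shared interface, not on the backend."""
    job_id = client.train(command)
    return client.get_job_logs(job_id)


if __name__ == "__main__":
    print(submit(FakeLocalClient(), ["python", "train.py"]))
```

Because `submit` is written against the protocol, swapping in a Kubernetes-backed client with the same methods would require no change to user code.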
| ## Design Details | ||
|
||
|
|
||
| The local execution mode will be implemented using a new `LocalTrainerClient`, which will allow users to execute training jobs using containers. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization. | ||
|
||
|
|
||
| - **LocalTrainerClient** will expose an interface similar to the existing `TrainerClient`. | ||
| - The Podman/Docker client will connect to a local container environment, create shared volumes, and initialize datasets and models as needed. | ||
| - The **DockerJobClient** will manage Docker containers, networks, and volumes using runtime definitions specified by the user. | ||
| - The **PodmanJobClient** will manage Podman containers, networks, and volumes using runtime definitions specified by the user. | ||
|
||
| - Containers will be labeled with job IDs, making it possible to track job status and logs. | ||
|
|
||
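Labeling containers with a job ID, as described in the list above, could look roughly like the following. The sketch only builds command lines and does not start anything. The label key `trainer.example/job-id` and the function names are hypothetical, while `run --detach --label`, `ps --all --quiet --filter label=...`, and `logs` are existing Docker and Podman CLI options.

```python
from typing import List

JOB_LABEL = "trainer.example/job-id"  # hypothetical label key, not defined by the KEP
SUPPORTED_RUNTIMES = ("docker", "podman")


def _check_runtime(runtime: str) -> None:
    if runtime not in SUPPORTED_RUNTIMES:
        raise ValueError(f"unsupported container runtime: {runtime}")


def run_command(runtime: str, image: str, job_id: str, args: List[str]) -> List[str]:
    """Command that starts a detached container labeled with the job ID."""
    _check_runtime(runtime)
    return [runtime, "run", "--detach", "--label", f"{JOB_LABEL}={job_id}", image, *args]


def list_command(runtime: str, job_id: str) -> List[str]:
    """Command that prints the IDs of all containers belonging to a job."""
    _check_runtime(runtime)
    return [runtime, "ps", "--all", "--quiet", "--filter", f"label={JOB_LABEL}={job_id}"]


def logs_command(runtime: str, container_id: str) -> List[str]:
    """Command that prints the logs of one container."""
    _check_runtime(runtime)
    return [runtime, "logs", container_id]


if __name__ == "__main__":
    print(" ".join(run_command("podman", "my-training-image", "job-1", ["python", "train.py"])))
    print(" ".join(list_command("podman", "job-1")))
```

Each list can be passed to `subprocess.run` as-is; filtering by label is what lets status and log lookups find every container that belongs to a job.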
|  | ||
|
||
|
|
||
| ### Test Plan | ||
|
|
||
| - **Unit Tests**: Ensure that the `LocalTrainerClient` and `JobClient` have complete unit test coverage, especially for container management, dataset initialization, and job tracking. | ||
| - **E2E Tests**: Conduct end-to-end tests to validate the local execution mode, ensuring that jobs can be initialized, executed, and tracked correctly within Podman/Docker containers. | ||
|
|
||
| ### Graduation Criteria | ||
| - The feature will move to the `beta` stage once it supports multi-node training with PyTorch as the default runtime and works seamlessly with local environments. | ||
| - Full support for multi-worker configurations and additional ML frameworks will be considered for the `stable` release. | ||
|
|
||
| ## Implementation History | ||
| - **KEP Creation**: April 2025 | ||
| - **Implementation Start**: April 2025 | ||
| ## Drawbacks | ||
|
|
||
| - The initial implementation will be limited to single-worker training jobs, which may restrict users who need multi-node support. | ||
| - The local execution mode will initially only support Podman/Docker and may require additional configurations for other container runtimes in the future. | ||
|
|
||
| ## Alternatives | ||
| - **Full Kubernetes Execution**: Continue to run all jobs on Kubernetes clusters, though this comes with higher costs and longer development cycles for ML engineers. | ||