
Commit 1290f5d

feat: KEP-2 Local Execution Mode Proposal (#34)
* KEP-2: Local Execution Mode Proposal
* Updated proposal
* add apple containers
* fix typo in Subprocess
* add API consistency to the design details
* update proposal to use training backends
* add constraint on resource limitation for local mode
* Move proposals into docs
* Use ExecutionBackends instead of TrainingBackends
* update docs and graphs
* update graphs

Signed-off-by: Saad Zaher <[email protected]>
1 parent d5c60f5 commit 1290f5d

File tree: 3 files changed (+1009, −0 lines)

# KEP-2: Trainer Local Execution

## Summary

This KEP proposes the introduction of a local execution mode for the Kubeflow Trainer SDK, allowing AI Practitioners to test and experiment with their models locally before submitting them to a Kubernetes-based infrastructure.
The feature will enable AI Practitioners to use a local subprocess, Podman, Docker, or other container runtimes to create isolated environments for training jobs, reducing the cost and time spent running experiments on expensive cloud resources.
This local execution mode will allow for rapid prototyping, debugging, and validation of training jobs.

## Motivation

Currently, Kubeflow’s Trainer SDK requires jobs to be executed on a Kubernetes cluster.
This setup can incur significant costs and time delays, especially for model experiments that are in the early stages.
AI Practitioners often want to experiment locally before scaling their models to a full cloud-based infrastructure.
The proposed local execution mode will allow engineers to quickly test their models in isolated containers or in virtualenvs via a subprocess, facilitating a faster and more efficient workflow.

### Goals
- Allow users to run training jobs on their local machines using container runtimes or a local subprocess.
- Rework the current Kubeflow Trainer SDK to implement Execution Backends, with the Kubernetes backend as the default.
- Implement Local Execution Backends that integrate seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
- Provide an implementation that supports PyTorch, with the potential to extend to other ML frameworks or runtimes.
- Ensure compatibility with existing Kubeflow Trainer SDK features and user interfaces.

### Non-Goals
- Full support for distributed training in the first phase of implementation.
- Support for all ML frameworks or runtime environments in the initial proof-of-concept.
- Major changes to the Kubeflow Trainer SDK architecture.

## Proposal

The local execution mode will allow users to run training jobs in a container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.

![Architecture Diagram](high-level-arch.svg)

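To make the intended experience concrete, the sketch below shows roughly how an AI Practitioner might point the existing `TrainerClient` at a local backend instead of a Kubernetes cluster. The backend-selection keyword and the `CustomTrainer` wrapper shown here are illustrative assumptions, not a committed API; the actual surface is discussed in the Design Details section.

```python
# Illustrative sketch only -- the backend selection mechanism is not final.
from kubeflow.trainer import CustomTrainer, TrainerClient  # CustomTrainer assumed available in the SDK


def train_func():
    # Ordinary PyTorch training code; unchanged between local and cluster runs.
    ...


# Today: the client always submits jobs to a Kubernetes cluster.
cluster_client = TrainerClient()

# Proposed: the same client surface, backed by a local execution backend.
# The `backend` keyword is a hypothetical placeholder for whatever selection
# mechanism the final design adopts (argument, config object, or context).
local_client = TrainerClient(backend="local")

job_id = local_client.train(trainer=CustomTrainer(func=train_func))
print(local_client.get_job_logs(job_id))
```

Either way, the training function itself stays the same; only the execution backend changes.
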
### User Stories (Optional)

#### Story 1

As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

#### Story 2

As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

### Notes/Constraints/Caveats
- Local execution mode will first support a Subprocess backend, with future plans to explore Podman, Docker, and Apple Container.
- The subprocess implementation will be restricted to single-node training.
- The local execution mode will initially support only the PyTorch runtime.
- Resource limits on memory, CPU, and GPU are not fully supported locally and may remain unsupported where the execution backend does not expose APIs to enforce them (see the sketch below).

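As a rough illustration of that last constraint, a container-based backend can pass best-effort limits only where the runtime's API exposes them; the snippet below uses the `docker` Python SDK's documented `mem_limit` and `nano_cpus` options, while a plain subprocess backend has no comparable knobs, so limits would simply be ignored there. The image and command are placeholders.

```python
# Best-effort resource limits for a container-based backend (illustrative).
import docker

client = docker.from_env()

container = client.containers.run(
    image="pytorch/pytorch:latest",       # placeholder training image
    command=["python", "/workspace/train.py"],
    detach=True,
    mem_limit="8g",                        # cap memory at 8 GiB
    nano_cpus=2_000_000_000,               # 2 CPUs, in units of 1e-9 CPUs
    # GPU access would additionally need device_requests, where supported.
)
print(container.status)
```
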
### Risks and Mitigations
- **Risk**: Compatibility issues with non-Docker container runtimes.
- **Mitigation**: Initially restrict support to Podman/Docker and evaluate alternatives for future phases.
- **Risk**: Potential conflicts between local and Kubernetes execution modes.
- **Mitigation**: Ensure that the local execution backends implement the exact same interface as the Kubernetes backend, so that users can switch between the two seamlessly.

## Design Details

The local execution mode will be implemented through new `LocalProcessBackend`, `PodmanBackend`, and `DockerBackend` classes, which will allow users to execute training jobs using container or virtual-environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

- Different execution backends will need to implement the same interface from the `ExecutionBackend` abstract class so that `TrainerClient` can initialize and load the backend (see the sketch below).
- The Podman/Docker client will connect to a local container environment, create shared volumes, and initialize datasets and models as needed.
- The **DockerBackend** will manage Docker containers, networks, and volumes using runtime definitions specified by the user.
- The **PodmanBackend** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
- Containers will be labeled with job IDs, making it possible to track job status and logs (illustrated after the workflow diagram below).
- An abstract interface will maintain API consistency across the different clients and backends.

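A minimal sketch of what that shared abstraction could look like is shown below. The method names are illustrative placeholders that loosely mirror existing `TrainerClient` operations; the final interface will be settled during implementation.

```python
# Illustrative sketch of the ExecutionBackend contract; method names are
# placeholders and not a finalized interface.
from abc import ABC, abstractmethod
from typing import Iterator, List


class ExecutionBackend(ABC):
    """Contract every backend (Kubernetes, subprocess, Podman, Docker) must
    satisfy so TrainerClient can load and use any of them interchangeably."""

    @abstractmethod
    def train(self, trainer, runtime) -> str:
        """Start a training job and return its job ID."""

    @abstractmethod
    def list_jobs(self) -> List[str]:
        """Return the IDs of all known jobs."""

    @abstractmethod
    def get_job_logs(self, job_id: str) -> Iterator[str]:
        """Stream the logs of the given job."""

    @abstractmethod
    def delete_job(self, job_id: str) -> None:
        """Stop the job and clean up its containers, volumes, and networks."""


class LocalProcessBackend(ExecutionBackend):
    """Single-node backend that would run the training function in a subprocess."""
    # Implementation omitted in this sketch.
```
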
![Detailed Workflow](detailed-workflow.svg)
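
The job-ID labeling mentioned above could work roughly as follows for the Docker backend, using the `docker` Python SDK; the label key and job ID shown are hypothetical placeholders.

```python
# Illustrative: tag containers with a job-ID label so status and logs can be
# looked up later. The label key below is a placeholder, not a decided name.
import docker

JOB_LABEL = "trainer.kubeflow.org/job-id"  # hypothetical label key

client = docker.from_env()

# Launch a worker container tagged with the job ID.
client.containers.run(
    image="pytorch/pytorch:latest",
    command=["python", "/workspace/train.py"],
    detach=True,
    labels={JOB_LABEL: "job-1234"},
)

# Later: find every container belonging to the job and report its state.
for c in client.containers.list(all=True, filters={"label": f"{JOB_LABEL}=job-1234"}):
    print(c.name, c.status)
    print(c.logs(tail=10).decode())
```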

### Test Plan

- **Unit Tests**: Ensure that different execution backends have complete unit test coverage, especially for container management, dataset initialization, and job tracking.
- **E2E Tests**: Conduct end-to-end tests to validate the local execution mode, ensuring that jobs can be initialized, executed, and tracked correctly within Podman/Docker containers.

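As an example of the kind of unit test intended for backend interface coverage (a sketch only, reusing the hypothetical `ExecutionBackend` and `LocalProcessBackend` names from the design sketch above, plus an assumed `dummy_trainer` test helper):

```python
# Sketch of a unit test asserting that a local backend honours the shared
# ExecutionBackend contract; all names follow the hypothetical sketch above.
# (Imports of the hypothetical backend classes and test helper are omitted.)
def test_local_backend_implements_execution_backend_contract():
    backend = LocalProcessBackend()
    assert isinstance(backend, ExecutionBackend)

    job_id = backend.train(trainer=dummy_trainer(), runtime="torch")
    assert job_id in backend.list_jobs()

    backend.delete_job(job_id)
    assert job_id not in backend.list_jobs()
```
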
### Graduation Criteria
- The feature will move to the `beta` stage once it supports multi-node training with the PyTorch framework as the default runtime and works seamlessly in local environments.
- Full support for multi-worker configurations and additional ML frameworks will be considered for the `stable` release.

## Implementation History
- **KEP Creation**: April 2025
- **Implementation Start**: April 2025

## Drawbacks

- The initial implementation will be limited to single-worker training jobs, which may restrict users who need multi-node support.
- The local execution mode will initially only support Subprocess and may require additional configurations for Podman/Docker container runtimes in the future.

## Alternatives
- **Full Kubernetes Execution**: Continue to require users to run all jobs on Kubernetes clusters, though this comes with higher costs and longer development cycles for ML engineers.
