Commit 9b3700a

Rename ExecutionBackend to RuntimeBackend
Signed-off-by: Andrey Velichkevich <[email protected]>
1 parent 0239bd4 commit 9b3700a

6 files changed, +20 -10 lines changed

docs/proposals/2-trainer-local-execution/README.md

Lines changed: 12 additions & 2 deletions
@@ -14,38 +14,44 @@ AI Practitioners often want to experiment locally before scaling their models to
 The proposed local execution mode will allow engineers to quickly test their models in isolated containers or virtualenvs via subprocess, facilitating a faster and more efficient workflow.

 ### Goals
+
 - Allow users to run training jobs on their local machines using container runtimes or subprocess.
 - Rework current Kubeflow Trainer SDK to implement Execution Backends with Kubernetes Backend as default.
 - Implement Local Execution Backends that integrates seamlessly with the Kubeflow SDK, supporting both single-node and multi-node training processes.
 - Provide an implementation that supports PyTorch, with the potential to extend to other ML frameworks or runtimes.
 - Ensure compatibility with existing Kubeflow Trainer SDK features and user interfaces.

 ### Non-Goals
+
 - Full support for distributed training in the first phase of implementation.
 - Support for all ML frameworks or runtime environments in the initial proof-of-concept.
 - Major changes to the Kubeflow Trainer SDK architecture.

 ## Proposal

-The local execution mode will allow users to run training jobs in container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.
+The local execution mode will allow users to run training jobs in container runtime environment on their local machines, mimicking the larger Kubeflow setup but without requiring Kubernetes.

 ![Architecture Diagram](high-level-arch.svg)

 ### User Stories (Optional)

 #### Story 1
+
 As an AI Practitioner, I want to run my model locally using Podman/Docker containers so that I can test my training job without incurring the costs of running a Kubernetes cluster.

 #### Story 2
+
 As an AI Practitioner, I want to initialize datasets and models within Podman/Docker containers, so that I can streamline my local training environment.

 ### Notes/Constraints/Caveats
+
 - Local execution mode will first support Subprocess, with future plans to explore Podman, Docker, and Apple Container.
 - The subprocess implementation will be restricted to single node.
 - The local execution mode will support only pytorch runtime initially.
 - Resource limitations on memory, cpu and gpu is not fully supported locally and might not be supported if the execution backend doesn't expose apis to support it.

 ### Risks and Mitigations
+
 - **Risk**: Compatibility issues with non-Docker container runtimes.
 - **Mitigation**: Initially restrict support to Podman/Docker and evaluate alternatives for future phases.
 - **Risk**: Potential conflicts between local and Kubernetes execution modes.
@@ -55,7 +61,7 @@ As an AI Practitioner, I want to initialize datasets and models within Podman/Do

 The local execution mode will be implemented using a new `LocalProcessBackend`, `PodmanBackend`, `DockerBackend` which will allow users to execute training jobs using containers and virtual environment isolation. The client will utilize container runtime capabilities to create isolated environments, including volumes and networks, to manage the training lifecycle. It will also allow for easy dataset and model initialization.

-- Different execution backends will need to implement the same interface from the `ExecutionBackend` abstract class so `TrainerClient` can initialize and load the backend.
+- Different execution backends will need to implement the same interface from the `RuntimeBackend` abstract class so `TrainerClient` can initialize and load the backend.
 - The Podman/Docker client will connect to a local container environment, create shared volumes, and initialize datasets and models as needed.
 - The **DockerBackend** will manage Docker containers, networks, and volumes using runtime definitions specified by the user.
 - The **PodmanBackend** will manage Podman containers, networks, and volumes using runtime definitions specified by the user.
@@ -70,16 +76,20 @@ The local execution mode will be implemented using a new `LocalProcessBackend`,
 - **E2E Tests**: Conduct end-to-end tests to validate the local execution mode, ensuring that jobs can be initialized, executed, and tracked correctly within Podman/Docker containers.

 ### Graduation Criteria
+
 - The feature will move to the `beta` stage once it supports multi-node training with pytorch framework as default runtime and works seamlessly with local environments.
 - Full support for multi-worker configurations and additional ML frameworks will be considered for the `stable` release.

 ## Implementation History
+
 - **KEP Creation**: April 2025
 - **Implementation Start**: April 2025
+
 ## Drawbacks

 - The initial implementation will be limited to single-worker training jobs, which may restrict users who need multi-node support.
 - The local execution mode will initially only support Subprocess and may require additional configurations for Podman/Docker container runtimes in the future.

 ## Alternatives
+
 - **Full Kubernetes Execution**: Enable users to always run jobs on Kubernetes clusters, though this comes with higher costs and longer development cycles for ML engineers.

kubeflow/optimizer/backends/base.py

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
 from kubeflow.trainer.types.types import TrainJobTemplate


-class ExecutionBackend(abc.ABC):
+class RuntimeBackend(abc.ABC):
     @abc.abstractmethod
     def optimize(
         self,

kubeflow/optimizer/backends/kubernetes/backend.py

Lines changed: 2 additions & 2 deletions
@@ -25,7 +25,7 @@
 import kubeflow.common.constants as common_constants
 from kubeflow.common.types import KubernetesBackendConfig
 import kubeflow.common.utils as common_utils
-from kubeflow.optimizer.backends.base import ExecutionBackend
+from kubeflow.optimizer.backends.base import RuntimeBackend
 from kubeflow.optimizer.backends.kubernetes import utils
 from kubeflow.optimizer.constants import constants
 from kubeflow.optimizer.types.algorithm_types import RandomSearch
@@ -43,7 +43,7 @@
 logger = logging.getLogger(__name__)


-class KubernetesBackend(ExecutionBackend):
+class KubernetesBackend(RuntimeBackend):
     def __init__(
         self,
         cfg: KubernetesBackendConfig,

kubeflow/trainer/backends/base.py

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@
 from kubeflow.trainer.types import types


-class ExecutionBackend(abc.ABC):
+class RuntimeBackend(abc.ABC):
     @abc.abstractmethod
     def list_runtimes(self) -> list[types.Runtime]:
         raise NotImplementedError()

kubeflow/trainer/backends/kubernetes/backend.py

Lines changed: 2 additions & 2 deletions
@@ -29,15 +29,15 @@
 import kubeflow.common.constants as common_constants
 import kubeflow.common.types as common_types
 import kubeflow.common.utils as common_utils
-from kubeflow.trainer.backends.base import ExecutionBackend
+from kubeflow.trainer.backends.base import RuntimeBackend
 import kubeflow.trainer.backends.kubernetes.utils as utils
 from kubeflow.trainer.constants import constants
 from kubeflow.trainer.types import types

 logger = logging.getLogger(__name__)


-class KubernetesBackend(ExecutionBackend):
+class KubernetesBackend(RuntimeBackend):
     def __init__(
         self,
         cfg: common_types.KubernetesBackendConfig,

kubeflow/trainer/backends/localprocess/backend.py

Lines changed: 2 additions & 2 deletions
@@ -20,7 +20,7 @@
 from typing import Optional, Union
 import uuid

-from kubeflow.trainer.backends.base import ExecutionBackend
+from kubeflow.trainer.backends.base import RuntimeBackend
 from kubeflow.trainer.backends.localprocess import utils as local_utils
 from kubeflow.trainer.backends.localprocess.constants import local_runtimes
 from kubeflow.trainer.backends.localprocess.job import LocalJob
@@ -35,7 +35,7 @@
 logger = logging.getLogger(__name__)


-class LocalProcessBackend(ExecutionBackend):
+class LocalProcessBackend(RuntimeBackend):
     def __init__(
         self,
         cfg: LocalProcessBackendConfig,
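
The proposal text in this commit says every backend implements the same interface from the `RuntimeBackend` abstract class so that `TrainerClient` can initialize and load it. A minimal, self-contained sketch of that pattern follows. It does not reproduce the SDK: only the `RuntimeBackend` name and the `list_runtimes` signature come from the diff, while the `Runtime` dataclass, `InMemoryBackend`, `load_backend`, and the runtime name are hypothetical stand-ins chosen for illustration.

```python
import abc
from dataclasses import dataclass


@dataclass
class Runtime:
    """Hypothetical stand-in for kubeflow.trainer.types.types.Runtime."""

    name: str


class RuntimeBackend(abc.ABC):
    """Shared interface; only list_runtimes is visible in the diff above."""

    @abc.abstractmethod
    def list_runtimes(self) -> list[Runtime]:
        raise NotImplementedError()


class InMemoryBackend(RuntimeBackend):
    """Hypothetical backend that serves a fixed list of runtimes."""

    def __init__(self, runtime_names: list[str]):
        self._runtimes = [Runtime(name=n) for n in runtime_names]

    def list_runtimes(self) -> list[Runtime]:
        # Return a copy so callers cannot mutate the backend's state.
        return list(self._runtimes)


def load_backend(backend: RuntimeBackend) -> RuntimeBackend:
    """Hypothetical client-side check: accept only RuntimeBackend subclasses."""
    if not isinstance(backend, RuntimeBackend):
        raise TypeError(f"expected a RuntimeBackend, got {type(backend).__name__}")
    return backend


backend = load_backend(InMemoryBackend(["example-runtime"]))
print([r.name for r in backend.list_runtimes()])
```

Because `list_runtimes` is declared with `@abc.abstractmethod`, Python refuses to instantiate `RuntimeBackend` itself or any subclass that leaves the method unimplemented, so a client can rely on the method being present.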
