39 changes: 39 additions & 0 deletions README.md
@@ -71,6 +71,45 @@ TrainerClient().wait_for_job_status(job_id)
print("\n".join(TrainerClient().get_job_logs(name=job_id)))
```

## Local Development

The Kubeflow SDK supports local development, so you can test and iterate on your models without a Kubernetes cluster.

### Execution Backends

Choose the backend that fits your development workflow:

| Backend | Description | Use Case |
|---------|-------------|----------|
| **KubernetesBackend** | Run jobs on a Kubernetes cluster | Production, multi-node distributed training |
| **ContainerBackend** | Auto-detects Docker or Podman | Local development with container isolation |
| **LocalProcessBackend** | Run as local Python subprocesses | Quick prototyping, debugging |

### Local Container Execution

The **ContainerBackend** automatically detects and uses either Docker or Podman:

```bash
# Install with Docker support (quotes keep shells such as zsh from expanding the brackets)
pip install "kubeflow[docker]"

# Or install with Podman support
pip install "kubeflow[podman]"
```

```python
from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

def train_fn():
    # Placeholder training function; replace with your own training code
    print("Training step complete")

# Auto-detects Docker or Podman
config = ContainerBackendConfig()
client = TrainerClient(backend_config=config)

# Your training runs in isolated containers
job_id = client.train(trainer=CustomTrainer(func=train_fn))
```

For detailed configuration options and platform-specific setup (macOS, Linux), see the [ContainerBackend documentation](kubeflow/trainer/backends/container/README.md).

## Supported Kubeflow Projects

| Project | Status | Version Support | Description |
2 changes: 2 additions & 0 deletions kubeflow/trainer/__init__.py
@@ -15,6 +15,7 @@

 # Import the Kubeflow Trainer client.
 from kubeflow.trainer.api.trainer_client import TrainerClient  # noqa: F401
+from kubeflow.trainer.backends.container.types import ContainerBackendConfig

 # import backends and its associated configs
 from kubeflow.trainer.backends.kubernetes.types import KubernetesBackendConfig
@@ -58,5 +59,6 @@
     "TrainerClient",
     "TrainerType",
     "LocalProcessBackendConfig",
+    "ContainerBackendConfig",
     "KubernetesBackendConfig",
 ]
17 changes: 13 additions & 4 deletions kubeflow/trainer/api/trainer_client.py
@@ -16,6 +16,8 @@
 import logging
 from typing import Optional, Union

+from kubeflow.trainer.backends.container.backend import ContainerBackend
+from kubeflow.trainer.backends.container.types import ContainerBackendConfig
 from kubeflow.trainer.backends.kubernetes.backend import KubernetesBackend
 from kubeflow.trainer.backends.kubernetes.types import KubernetesBackendConfig
 from kubeflow.trainer.backends.localprocess.backend import (
@@ -31,14 +33,19 @@
 class TrainerClient:
     def __init__(
         self,
-        backend_config: Union[KubernetesBackendConfig, LocalProcessBackendConfig] = None,
+        backend_config: Union[
+            KubernetesBackendConfig,
+            LocalProcessBackendConfig,
+            ContainerBackendConfig,
+        ] = None,
     ):
         """Initialize a Kubeflow Trainer client.

         Args:
-            backend_config: Backend configuration. Either KubernetesBackendConfig or
-                LocalProcessBackendConfig, or None to use the backend's
-                default config class. Defaults to KubernetesBackendConfig.
+            backend_config: Backend configuration. Either KubernetesBackendConfig,
+                LocalProcessBackendConfig, ContainerBackendConfig,
+                or None to use the backend's default config class.
+                Defaults to KubernetesBackendConfig.

Raises:
ValueError: Invalid backend configuration.
@@ -52,6 +59,8 @@ def __init__(
             self.backend = KubernetesBackend(backend_config)
         elif isinstance(backend_config, LocalProcessBackendConfig):
             self.backend = LocalProcessBackend(backend_config)
+        elif isinstance(backend_config, ContainerBackendConfig):
+            self.backend = ContainerBackend(backend_config)
         else:
             raise ValueError(f"Invalid backend config '{backend_config}'")

162 changes: 162 additions & 0 deletions kubeflow/trainer/backends/container/README.md
@@ -0,0 +1,162 @@
# ContainerBackend

`ContainerBackend` is the unified container backend for Kubeflow Trainer. It automatically detects and uses either Docker or Podman.

## Overview

This backend provides a single, unified interface for container-based training execution, automatically detecting which container runtime is available on your system.

The implementation uses the **adapter pattern** to abstract away differences between Docker and Podman APIs, providing clean separation between runtime detection logic and container operations.
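
As a rough illustration of the adapter pattern described here, the sketch below defines a small interface and two stand-in adapters. The names `ContainerClientAdapter`, `FakeDockerAdapter`, `FakePodmanAdapter`, `ping`, `run`, and `launch` are hypothetical and chosen only for this example; the backend's real adapter classes and methods may differ.

```python
from typing import List, Protocol


class ContainerClientAdapter(Protocol):
    """Hypothetical interface that every runtime adapter would satisfy."""

    def ping(self) -> bool: ...

    def run(self, image: str, command: List[str]) -> str: ...


class FakeDockerAdapter:
    """Stand-in for a Docker-backed adapter (no real Docker calls)."""

    def ping(self) -> bool:
        return True

    def run(self, image: str, command: List[str]) -> str:
        return f"docker:{image}:{' '.join(command)}"


class FakePodmanAdapter:
    """Stand-in for a Podman-backed adapter (no real Podman calls)."""

    def ping(self) -> bool:
        return True

    def run(self, image: str, command: List[str]) -> str:
        return f"podman:{image}:{' '.join(command)}"


def launch(adapter: ContainerClientAdapter, image: str, command: List[str]) -> str:
    """Backend-side code depends only on the interface, not on the runtime."""
    if not adapter.ping():
        raise RuntimeError("container runtime is not reachable")
    return adapter.run(image, command)
```

Because `launch` only sees the interface, the same call works with either adapter, which is the separation between runtime detection and container operations that the paragraph above refers to.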

## Usage

### Basic usage (auto-detection)

```python
from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# Auto-detects Docker or Podman
config = ContainerBackendConfig()
client = TrainerClient(backend_config=config)
```

### Force specific runtime

```python
from kubeflow.trainer import TrainerClient, ContainerBackendConfig

# Force Docker
config = ContainerBackendConfig(runtime="docker")
client = TrainerClient(backend_config=config)

# Force Podman
config = ContainerBackendConfig(runtime="podman")
client = TrainerClient(backend_config=config)
```

### Configuration options

```python
from kubeflow.trainer import ContainerBackendConfig

config = ContainerBackendConfig(
    # Optional: force specific runtime ("docker" or "podman")
    runtime=None,

    # Optional: explicit image override
    image="my-custom-image:latest",

    # Image pull policy: "IfNotPresent", "Always", or "Never"
    pull_policy="IfNotPresent",

    # Auto-remove containers and networks on job deletion
    auto_remove=True,

    # GPU support (varies by runtime)
    gpus=None,

    # Environment variables for all containers
    env={"MY_VAR": "value"},

    # Container daemon URL override (may be needed for Colima/Podman Machine on macOS)
    container_host=None,

    # Base directory for job workspaces
    workdir_base=None,
)
```
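
To make the three `pull_policy` values concrete, here is a minimal sketch of the decision they imply. It assumes the values carry the same meaning as Kubernetes `imagePullPolicy`; the function `should_pull` is a hypothetical helper written for this example, not part of the backend's API.

```python
def should_pull(pull_policy: str, image_present: bool) -> bool:
    """Decide whether to pull an image, assuming Kubernetes-style semantics."""
    if pull_policy == "Always":
        # Pull on every run, even if the image is already cached locally.
        return True
    if pull_policy == "Never":
        # Rely on the local image only; never contact a registry.
        return False
    if pull_policy == "IfNotPresent":
        # Pull only when the image is missing locally.
        return not image_present
    raise ValueError(f"Unknown pull policy: {pull_policy!r}")
```

Under this reading, `"Never"` with a missing local image would leave the job unable to start, so `"IfNotPresent"` is the safer choice for local iteration.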

### macOS-specific configuration

On macOS, you may need to specify `container_host` depending on your container runtime:

**Docker with Colima:**
```python
import os

from kubeflow.trainer import ContainerBackendConfig

config = ContainerBackendConfig(
    container_host=f"unix://{os.path.expanduser('~')}/.colima/default/docker.sock"
)
```

**Podman Machine:**
```python
import os

from kubeflow.trainer import ContainerBackendConfig

config = ContainerBackendConfig(
    container_host=f"unix://{os.path.expanduser('~')}/.local/share/containers/podman/machine/podman.sock"
)
```

**Docker Desktop:**
```python
# Usually works without specifying container_host
config = ContainerBackendConfig()
```

Alternatively, set environment variables before running:
```bash
# For Colima
export DOCKER_HOST="unix://$HOME/.colima/default/docker.sock"

# For Podman Machine
export CONTAINER_HOST="unix://$HOME/.local/share/containers/podman/machine/podman.sock"
```
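
The sketch below shows one plausible way the two mechanisms could combine: an explicit `container_host` wins, and the runtime's environment variable is the fallback. This precedence is an assumption made for illustration, and `resolve_container_host` is a hypothetical helper, not a function exposed by the backend.

```python
import os
from typing import Mapping, Optional


def resolve_container_host(
    container_host: Optional[str],
    runtime: str,
    environ: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Pick the daemon URL: explicit config first, then the environment."""
    if container_host:
        return container_host
    # DOCKER_HOST is read by Docker clients, CONTAINER_HOST by Podman clients.
    var_name = "DOCKER_HOST" if runtime == "docker" else "CONTAINER_HOST"
    # Returning None lets the client library fall back to its default socket.
    return environ.get(var_name) or None
```

Passing `environ` explicitly makes the helper easy to exercise without touching the real process environment.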

### How it works

The backend initialization follows this logic:

1. If `runtime` is specified in config, use that runtime exclusively
2. Otherwise, try to initialize Docker client adapter
3. If Docker fails, try to initialize Podman client adapter
4. If both fail, raise a RuntimeError

If you don't have Docker or Podman installed, use `LocalProcessBackendConfig` instead, which runs training as local subprocesses.

All container operations are delegated to the adapter, eliminating code duplication.
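
The four initialization steps can be sketched as a small selection function. This is a minimal illustration, not the backend's actual code: `select_adapter` and the `factories` mapping are hypothetical names, and each factory stands in for whatever constructs and connects a Docker or Podman client adapter.

```python
from typing import Callable, Dict, Optional


def select_adapter(
    runtime: Optional[str],
    factories: Dict[str, Callable[[], object]],
) -> object:
    """Return the first adapter that initializes successfully."""
    # Step 1: an explicit runtime restricts the search to that runtime only.
    # Steps 2-3: otherwise try Docker first, then Podman.
    candidates = [runtime] if runtime else ["docker", "podman"]
    errors = {}
    for name in candidates:
        try:
            return factories[name]()
        except Exception as exc:  # a runtime that is missing or down fails here
            errors[name] = exc
    # Step 4: nothing could be initialized.
    raise RuntimeError(f"No container runtime available: {errors}")
```

With `runtime=None` and a failing Docker factory, the function falls through to Podman; with `runtime="docker"` the same failure surfaces as a `RuntimeError` because no fallback is attempted.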

## Installation

Install with Docker support (the quotes keep shells such as zsh from expanding the brackets):
```bash
pip install "kubeflow[docker]"
```

Install with Podman support:
```bash
pip install "kubeflow[podman]"
```

Install with both:
```bash
pip install "kubeflow[docker,podman]"
```
```

## Example: Training Job

```python
from kubeflow.trainer import TrainerClient, ContainerBackendConfig, CustomTrainer

# Define your training function
def train():
    import torch
    print(f"Training with PyTorch {torch.__version__}")
    # Your training code here

# Create trainer
trainer = CustomTrainer(
    func=train,
    packages_to_install=["torch"],
)

# Initialize client (auto-detects runtime)
config = ContainerBackendConfig()
client = TrainerClient(backend_config=config)

# Run training
job_name = client.train(trainer=trainer)
print(f"Training job started: {job_name}")

# Get logs
for log in client.get_job_logs(job_name, follow=True):
    print(log, end='')
```

## See also

- [Example notebook](TBA) - Complete working example to be added
18 changes: 18 additions & 0 deletions kubeflow/trainer/backends/container/__init__.py
@@ -0,0 +1,18 @@
# Copyright 2025 The Kubeflow Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from kubeflow.trainer.backends.container.backend import ContainerBackend
from kubeflow.trainer.backends.container.types import ContainerBackendConfig

__all__ = ["ContainerBackend", "ContainerBackendConfig"]