13 changes: 13 additions & 0 deletions README.md
@@ -102,6 +102,19 @@ You can find the tutorial series below:
- [Part 2](examples/hello-world/hello_experiments.ipynb).
- [Part 3](examples/hello-world/hello_scripts.py).

#### Lepton Executor Examples

The Lepton executor examples demonstrate end-to-end distributed training workflows on Lepton clusters:

- **Distributed Training**: Multi-node, multi-GPU setups with automatic scaling
- **Secure Secret Management**: Use workspace secrets instead of hardcoded tokens
- **Storage Integration**: Remote storage mounting and data management
- **Container Orchestration**: Advanced environment setup and dependency management
- **Production Workflows**: End-to-end ML training pipelines

You can find the Lepton examples here:
- [Lepton Distributed Training Examples](examples/lepton/)

## Contribute to NeMo Run
Please see the [contribution guide](./CONTRIBUTING.md) to contribute to NeMo Run.

117 changes: 117 additions & 0 deletions examples/lepton/README.md
@@ -0,0 +1,117 @@
# Lepton Executor Examples

This directory contains examples demonstrating how to use the `LeptonExecutor` for distributed machine learning workflows on Lepton clusters.

## Examples

### 🚀 finetune.py

An end-to-end example showing how to use the `LeptonExecutor` for **distributed NeMo model fine-tuning**, including secure secret management, remote storage, and custom environment setup.
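
At a high level, the script wires together three pieces: an executor (where the job runs), a recipe (what to train), and an experiment (how it is launched). Here is a minimal sketch of that flow — every API call below appears in `finetune.py`, but the literal values are placeholders you must adapt to your workspace:

```python
import nemo_run as run
from nemo.collections import llm

# Where the job runs - placeholder values, adjust to your workspace
executor = run.LeptonExecutor(
    container_image="nvcr.io/nvidia/nemo:25.04",
    nemo_run_dir="/nemo-workspace",
    nodes=1,
    gpus_per_node=1,
    nprocs_per_node=1,
    resource_shape="gpu.1xh200",
    node_group="your-node-group-name",
)

# What to train - a LoRA fine-tuning recipe
recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
    model_name="meta-llama/Llama-3.2-3B",
    num_nodes=1,
    num_gpus_per_node=1,
    peft_scheme="lora",
)

# How it is launched and monitored
with run.Experiment("lepton-demo", executor=executor) as exp:
    exp.add(recipe, tail_logs=True)
    exp.run(detach=False)
```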

#### Usage Examples

**Basic Fine-tuning:**
```bash
# Single-node, single-GPU setup
python finetune.py

# The example will:
# 1. Create a LeptonExecutor with comprehensive configuration
# 2. Set up NeMo fine-tuning recipe with LoRA
# 3. Launch distributed training with monitoring
# 4. Handle resource management and cleanup
```

**Distributed Training:**
```python
# Multi-node setup (modify in the script)
nodes = 4
gpus_per_node = 8
# Will automatically configure FSDP2 strategy for 32 total GPUs
```

#### Configuration Guide

**Resource Configuration:**
```python
# Adjust these based on your Lepton workspace
resource_shape="gpu.8xh100-80gb" # GPU type and count
node_group="your-node-group-name" # Your Lepton node group
```

**Storage Setup:**
```python
mounts=[{
    "from": "node-nfs:your-storage",         # Storage source
    "path": "/path/to/your/remote/storage",  # Remote path
    "mount_path": "/nemo-workspace",         # Container mount point
}]
```

**Secret Management:**

For sensitive data like API tokens:
```python
# NOT RECOMMENDED - Hardcoded secrets
env_vars={
    "HF_TOKEN": "hf_your_actual_token_here",  # Exposed in code!
}

# RECOMMENDED - Secure secret references
env_vars={
    "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
    "WANDB_API_KEY": {"value_from": {"secret_name_ref": "WANDB_API_KEY_secret"}},
    # Regular env vars can still be set directly
    "NCCL_DEBUG": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
}
```

#### Prerequisites

**1. Lepton Workspace Setup:**
- Node groups configured with appropriate GPUs
- Shared storage mounted and accessible
- Container registry access for NeMo images

**2. Optional Secrets (for enhanced security):**
```bash
# Create these secrets in your Lepton workspace
HUGGING_FACE_HUB_TOKEN_read # For HuggingFace model access
WANDB_API_KEY_secret # For experiment tracking
```
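
How you create these depends on your tooling; with the `lep` CLI the commands are typically along the following lines (the exact flags are an assumption here — check `lep secret --help` for your installed `leptonai` version):

```bash
# Assumed CLI invocation - verify against your leptonai CLI version
lep secret create -n HUGGING_FACE_HUB_TOKEN_read -v hf_your_token
lep secret create -n WANDB_API_KEY_secret -v your_wandb_key
```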

**3. Resource Requirements:**
- GPU nodes (H100, A100, V100, etc.)
- Sufficient shared storage space
- Network connectivity for container pulls

#### Advanced Features

**Custom Pre-launch Commands:**
```python
pre_launch_commands=[
    "echo '🚀 Starting setup...'",
    "nvidia-smi",                                # Check GPU status
    "df -h",                                     # Check disk space
    "python3 -m pip install 'datasets>=4.0.0'",  # Install dependencies
    "export CUSTOM_VAR=value",                   # Set environment
]
```

**Training Strategy Selection:**
```python
# Automatic strategy selection for single node
if nodes == 1:
    recipe.trainer.strategy = "auto"

# FSDP2 for multi-node distributed training
else:
    recipe.trainer.strategy = run.Config(
        nl.FSDP2Strategy,
        data_parallel_size=nodes * gpus_per_node,
        tensor_parallel_size=1,
    )
```
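
The two parallel sizes multiply up to the world size, so in this pure data-parallel setup `data_parallel_size` is simply the total GPU count. A quick sanity check of the arithmetic:

```python
nodes, gpus_per_node = 4, 8
world_size = nodes * gpus_per_node   # 32 GPUs in total
data_parallel_size = world_size      # pure data parallelism: one rank per GPU
tensor_parallel_size = 1             # no tensor splitting
assert data_parallel_size * tensor_parallel_size == world_size
```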

For more details on Lepton cluster management and configuration, refer to the Lepton documentation.
165 changes: 165 additions & 0 deletions examples/lepton/finetune.py
@@ -0,0 +1,165 @@
#!/usr/bin/env python3
"""
NeMo Fine-tuning with Lepton Executor

This example demonstrates how to use the LeptonExecutor for distributed NeMo model
fine-tuning, including secret management, remote storage mounts, and custom environment setup.

Prerequisites:
- Lepton workspace with proper node groups and GPU resources
- Secrets configured in your Lepton workspace (optional but recommended)
- Shared storage accessible to your Lepton cluster
- NeMo container image available

This example serves as a template for production ML workflows on Lepton clusters.
"""

from nemo.collections import llm
import nemo_run as run
from nemo import lightning as nl


def nemo_lepton_executor(nodes: int, devices: int, container_image: str):
"""
Create a LeptonExecutor with secret handling capabilities.

Args:
nodes: Number of nodes for distributed training
devices: Number of GPUs per node
container_image: Docker container image to use

Returns:
Configured LeptonExecutor with secret support
"""

    return run.LeptonExecutor(
        # Required parameters
        container_image=container_image,
        nemo_run_dir="/nemo-workspace",  # Directory for NeMo Run files on remote storage
        # Lepton compute configuration
        nodes=nodes,
        gpus_per_node=devices,
        nprocs_per_node=devices,  # Number of processes per node (usually = gpus_per_node)
        # Lepton workspace configuration - REQUIRED for actual usage
        resource_shape="gpu.1xh200",  # Specify GPU type/count - adjust as needed
        node_group="your-node-group-name",  # Specify your node group - must exist in workspace
        # Remote storage mounts (using correct mount structure)
        mounts=[
            {
                "from": "node-nfs:your-shared-storage",
                "path": "/path/to/your/remote/storage",  # Remote storage path
                "mount_path": "/nemo-workspace",  # Mount path in container
            }
        ],
        # Environment variables - SECURE SECRET HANDLING
        env_vars={
            # SECRET REFERENCES (recommended for sensitive data)
            # These reference secrets stored securely in your Lepton workspace
            "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
            "WANDB_API_KEY": {
                "value_from": {"secret_name_ref": "WANDB_API_KEY_secret"}
            },  # Optional
            # 📋 REGULAR ENVIRONMENT VARIABLES
            # Non-sensitive configuration can be set directly
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "INFO",
            "CUDA_LAUNCH_BLOCKING": "1",
            "TOKENIZERS_PARALLELISM": "false",
        },
        # Shared memory size for inter-process communication
        shared_memory_size=65536,
        # Custom commands to run before launching the training
        pre_launch_commands=[
            "echo '🚀 Starting NeMo fine-tuning with Lepton secrets...'",
            "nvidia-smi",
            "df -h",
            "python3 -m pip install 'datasets>=4.0.0'",
            "python3 -m pip install 'transformers>=4.40.0'",
        ],
    )


def create_finetune_recipe(nodes: int, gpus_per_node: int):
"""
Create a NeMo fine-tuning recipe with LoRA.

Args:
nodes: Number of nodes for distributed training
gpus_per_node: Number of GPUs per node

Returns:
Configured NeMo recipe for fine-tuning
"""

recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
model_name="meta-llama/Llama-3.2-3B", # Model to fine-tune
dir="/nemo-workspace/llama3.2_3b_lepton", # Use nemo-workspace mount path
name="llama3_lora_lepton",
num_nodes=nodes,
num_gpus_per_node=gpus_per_node,
peft_scheme="lora", # Parameter-Efficient Fine-Tuning with LoRA
max_steps=100, # Adjust based on your needs
)

# LoRA configuration
recipe.peft.target_modules = ["linear_qkv", "linear_proj", "linear_fc1", "*_proj"]
recipe.peft.dim = 16
recipe.peft.alpha = 32

# Strategy configuration for distributed training
if nodes == 1:
recipe.trainer.strategy = "auto" # Let Lightning choose the best strategy
else:
recipe.trainer.strategy = run.Config(
nl.FSDP2Strategy, data_parallel_size=nodes * gpus_per_node, tensor_parallel_size=1
)

return recipe


if __name__ == "__main__":
    # Configuration
    nodes = 1  # Start with single node for testing
    gpus_per_node = 1

    # Create the fine-tuning recipe
    recipe = create_finetune_recipe(nodes, gpus_per_node)

    # Create the executor with secret handling
    executor = nemo_lepton_executor(
        nodes=nodes,
        devices=gpus_per_node,
        container_image="nvcr.io/nvidia/nemo:25.04",  # Use appropriate NeMo container
    )

    # Optional: Check executor capabilities
    print("🔍 Executor Information:")
    print(f"📋 Nodes: {executor.nnodes()}")
    print(f"📋 Processes per node: {executor.nproc_per_node()}")

    # Check macro support
    macro_values = executor.macro_values()
    print(f"📋 Macro values support: {macro_values is not None}")

    try:
        # Create and run the experiment
        with run.Experiment(
            "lepton-nemo-secrets-demo", executor=executor, log_level="DEBUG"
        ) as exp:
            # Add the fine-tuning task
            task_id = exp.add(recipe, tail_logs=True, name="llama3_lora_with_secrets")

            # Execute the experiment
            print("Starting fine-tuning experiment with secure secret handling...")
            exp.run(detach=False, tail_logs=True, sequential=True)

        print("Experiment completed successfully!")

    except Exception as e:
        print(f"\nError occurred: {type(e).__name__}")
        print(f"Message: {e}")
        print("\n💡 Common issues to check:")
        print("  - Ensure your Lepton workspace has the required secrets configured")
        print("  - Verify node_group and resource_shape match your workspace")
        print("  - Check that mount paths are correct and accessible")
        print("  - Confirm container image is available and compatible")
12 changes: 11 additions & 1 deletion nemo_run/core/execution/lepton.py
@@ -17,6 +17,7 @@
from leptonai.api.v1.types.dedicated_node_group import DedicatedNodeGroup
from leptonai.api.v1.types.deployment import (
    EnvVar,
    EnvValue,
    LeptonContainer,
    Mount,
)
@@ -232,7 +233,16 @@ def create_lepton_job(self, name: str):
"""
client = APIClient()

envs = [EnvVar(name=key, value=value) for key, value in self.env_vars.items()]
# Process environment variables - handle both regular values and secret references
envs = []
for key, value in self.env_vars.items():
if isinstance(value, dict) and "value_from" in value:
# Handle secret reference
secret_name_ref = value["value_from"]["secret_name_ref"]
envs.append(EnvVar(name=key, value_from=EnvValue(secret_name_ref=secret_name_ref)))
else:
# Handle regular environment variable
envs.append(EnvVar(name=key, value=str(value)))
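        # For example, env_vars={"HF_TOKEN": {"value_from": {"secret_name_ref": "HF_READ"}},
        #                        "NCCL_DEBUG": "INFO"} (illustrative values) yields
        # [EnvVar(name="HF_TOKEN", value_from=EnvValue(secret_name_ref="HF_READ")),
        #  EnvVar(name="NCCL_DEBUG", value="INFO")]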

        cmd = ["/bin/bash", "-c", f"bash {self.lepton_job_dir}/launch_script.sh"]
