Commit 9ffd6d2
feat: Add Lepton secret handling and comprehensive distributed training support
- Add secure environment variable handling with Lepton workspace secrets
- Support mixed environment variables (secrets + regular values)
- Implement EnvValue import and processing in LeptonExecutor
- Create detailed Lepton examples with distributed training workflows
- Update documentation with security best practices and usage guides

Key changes:

* LeptonExecutor now supports {'value_from': {'secret_name_ref': 'SECRET_NAME'}} syntax
1 parent f104fe6 commit 9ffd6d2

File tree

5 files changed: +536 −1 lines changed

README.md

Lines changed: 13 additions & 0 deletions

@@ -102,6 +102,19 @@ You can find the tutorial series below:

- [Part 2](examples/hello-world/hello_experiments.ipynb).
- [Part 3](examples/hello-world/hello_scripts.py).

#### Lepton Executor Examples

The Lepton executor examples demonstrate comprehensive distributed training workflows on Lepton clusters:

- **Distributed Training**: Multi-node, multi-GPU setups with automatic scaling
- **Secure Secret Management**: Use workspace secrets instead of hardcoded tokens
- **Storage Integration**: Remote storage mounting and data management
- **Container Orchestration**: Advanced environment setup and dependency management
- **Production Workflows**: End-to-end ML training pipelines

You can find the Lepton examples here:

- [Lepton Distributed Training Examples](examples/lepton/)

## Contribute to NeMo Run

Please see the [contribution guide](./CONTRIBUTING.md) to contribute to NeMo Run.
examples/lepton/README.md

Lines changed: 117 additions & 0 deletions

@@ -0,0 +1,117 @@

# Lepton Executor Examples

This directory contains examples demonstrating how to use the `LeptonExecutor` for distributed machine learning workflows on Lepton clusters.

## Examples

### 🚀 finetune.py

A comprehensive example showing how to use the `LeptonExecutor` for **distributed NeMo model fine-tuning** with advanced features, including secure secret management, remote storage, and custom environment setup.

#### Usage Examples

**Basic Fine-tuning:**

```bash
# Single-node, single-GPU setup
python finetune.py

# The example will:
# 1. Create a LeptonExecutor with comprehensive configuration
# 2. Set up a NeMo fine-tuning recipe with LoRA
# 3. Launch distributed training with monitoring
# 4. Handle resource management and cleanup
```

**Distributed Training:**

```python
# Multi-node setup (modify in the script)
nodes = 4
gpus_per_node = 8
# Will automatically configure the FSDP2 strategy for 32 total GPUs
```

#### Configuration Guide

**Resource Configuration:**

```python
# Adjust these based on your Lepton workspace
resource_shape="gpu.8xh100-80gb"   # GPU type and count
node_group="your-node-group-name"  # Your Lepton node group
```

**Storage Setup:**

```python
mounts=[{
    "from": "node-nfs:your-storage",         # Storage source
    "path": "/path/to/your/remote/storage",  # Remote path
    "mount_path": "/nemo-workspace",         # Container mount point
}]
```

**Secret Management:**

For sensitive data like API tokens:

```python
# NOT RECOMMENDED - hardcoded secrets
env_vars={
    "HF_TOKEN": "hf_your_actual_token_here",  # Exposed in code!
}

# RECOMMENDED - secure secret references
env_vars={
    "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
    "WANDB_API_KEY": {"value_from": {"secret_name_ref": "WANDB_API_KEY_secret"}},
    # Regular env vars can still be set directly
    "NCCL_DEBUG": "INFO",
    "TORCH_DISTRIBUTED_DEBUG": "INFO",
}
```
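Because a value may be either a plain string or a secret-reference dict, it can help to sanity-check the shapes before launching. The `validate_env_vars` helper below is a hypothetical illustration, not part of the NeMo Run or Lepton API:

```python
def validate_env_vars(env_vars: dict) -> None:
    """Check that each value is a plain string or a well-formed secret reference."""
    for name, value in env_vars.items():
        if isinstance(value, dict):
            # Secret references must follow {'value_from': {'secret_name_ref': ...}}
            ref = value.get("value_from", {}).get("secret_name_ref")
            if not isinstance(ref, str) or not ref:
                raise ValueError(f"{name}: malformed secret reference {value!r}")
        elif not isinstance(value, str):
            raise ValueError(
                f"{name}: expected str or secret reference, got {type(value).__name__}"
            )


# Passes silently: one secret reference, one plain value
validate_env_vars({
    "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
    "NCCL_DEBUG": "INFO",
})
```

Running this once at the top of a launch script turns a silent misconfiguration into an immediate, descriptive error.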
69+
70+
#### Prerequisites
71+
72+
**1. Lepton Workspace Setup:**
73+
- Node groups configured with appropriate GPUs
74+
- Shared storage mounted and accessible
75+
- Container registry access for NeMo images
76+
77+
**2. Optional Secrets (for enhanced security):**
78+
```bash
79+
# Create these secrets in your Lepton workspace
80+
HUGGING_FACE_HUB_TOKEN_read # For HuggingFace model access
81+
WANDB_API_KEY_secret # For experiment tracking
82+
```
83+
84+
**3. Resource Requirements:**
85+
- GPU nodes (H100, A100, V100, etc.)
86+
- Sufficient shared storage space
87+
- Network connectivity for container pulls
88+
89+
#### Advanced Features
90+
91+
**Custom Pre-launch Commands:**
92+
```python
93+
pre_launch_commands=[
94+
"echo '🚀 Starting setup...'",
95+
"nvidia-smi", # Check GPU status
96+
"df -h", # Check disk space
97+
"python3 -m pip install 'datasets>=4.0.0'", # Install dependencies
98+
"export CUSTOM_VAR=value", # Set environment
99+
]
100+
```
101+
102+
**Training Strategy Selection:**
103+
```python
104+
# Automatic strategy selection for single node
105+
if nodes == 1:
106+
recipe.trainer.strategy = "auto"
107+
108+
# FSDP2 for multi-node distributed training
109+
else:
110+
recipe.trainer.strategy = run.Config(
111+
nl.FSDP2Strategy,
112+
data_parallel_size=nodes * gpus_per_node,
113+
tensor_parallel_size=1
114+
)
115+
```
116+
117+
For more details on Lepton cluster management and configuration, refer to the Lepton documentation.
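The strategy branch above reduces to a small pure function. The sketch below uses a hypothetical `pick_strategy` helper that returns a plain tuple in place of a `run.Config`, just to make the sizing arithmetic explicit:

```python
def pick_strategy(nodes: int, gpus_per_node: int):
    """Mirror the snippet's logic: 'auto' on one node, FSDP2 sized to the world size otherwise."""
    if nodes == 1:
        return ("auto", None)
    world_size = nodes * gpus_per_node  # total number of GPU workers
    return ("fsdp2", {"data_parallel_size": world_size, "tensor_parallel_size": 1})


print(pick_strategy(1, 8))  # single node -> automatic strategy selection
print(pick_strategy(4, 8))  # 4 nodes x 8 GPUs -> FSDP2 over 32 data-parallel ranks
```

With `tensor_parallel_size=1`, every GPU holds a full (sharded-by-FSDP) replica along the data-parallel axis, so the data-parallel size equals the total GPU count.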

examples/lepton/finetune.py

Lines changed: 165 additions & 0 deletions

@@ -0,0 +1,165 @@

```python
#!/usr/bin/env python3
"""
NeMo Fine-tuning with Lepton Executor

This comprehensive example demonstrates how to use the LeptonExecutor for distributed
NeMo model fine-tuning with various advanced features.

Prerequisites:
- Lepton workspace with proper node groups and GPU resources
- Secrets configured in your Lepton workspace (optional but recommended)
- Shared storage accessible to your Lepton cluster
- NeMo container image available

This example serves as a template for production ML workflows on Lepton clusters.
"""

import nemo_run as run
from nemo import lightning as nl
from nemo.collections import llm


def nemo_lepton_executor(nodes: int, devices: int, container_image: str):
    """
    Create a LeptonExecutor with secret handling capabilities.

    Args:
        nodes: Number of nodes for distributed training
        devices: Number of GPUs per node
        container_image: Docker container image to use

    Returns:
        Configured LeptonExecutor with secret support
    """
    return run.LeptonExecutor(
        # Required parameters
        container_image=container_image,
        nemo_run_dir="/nemo-workspace",  # Directory for NeMo Run files on remote storage
        # Lepton compute configuration
        nodes=nodes,
        gpus_per_node=devices,
        nprocs_per_node=devices,  # Number of processes per node (usually = gpus_per_node)
        # Lepton workspace configuration - REQUIRED for actual usage
        resource_shape="gpu.1xh200",  # GPU type/count - adjust as needed
        node_group="your-node-group-name",  # Must exist in your workspace
        # Remote storage mounts (using the correct mount structure)
        mounts=[
            {
                "from": "node-nfs:your-shared-storage",
                "path": "/path/to/your/remote/storage",  # Remote storage path
                "mount_path": "/nemo-workspace",  # Mount path in container
            }
        ],
        # Environment variables - secure secret handling
        env_vars={
            # Secret references (recommended for sensitive data):
            # these point at secrets stored securely in your Lepton workspace
            "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
            "WANDB_API_KEY": {
                "value_from": {"secret_name_ref": "WANDB_API_KEY_secret"}
            },  # Optional
            # 📋 Regular environment variables:
            # non-sensitive configuration can be set directly
            "NCCL_DEBUG": "INFO",
            "TORCH_DISTRIBUTED_DEBUG": "INFO",
            "CUDA_LAUNCH_BLOCKING": "1",
            "TOKENIZERS_PARALLELISM": "false",
        },
        # Shared memory size for inter-process communication
        shared_memory_size=65536,
        # Custom commands to run before launching the training
        pre_launch_commands=[
            "echo '🚀 Starting NeMo fine-tuning with Lepton secrets...'",
            "nvidia-smi",
            "df -h",
            "python3 -m pip install 'datasets>=4.0.0'",
            "python3 -m pip install 'transformers>=4.40.0'",
        ],
    )


def create_finetune_recipe(nodes: int, gpus_per_node: int):
    """
    Create a NeMo fine-tuning recipe with LoRA.

    Args:
        nodes: Number of nodes for distributed training
        gpus_per_node: Number of GPUs per node

    Returns:
        Configured NeMo recipe for fine-tuning
    """
    recipe = llm.hf_auto_model_for_causal_lm.finetune_recipe(
        model_name="meta-llama/Llama-3.2-3B",  # Model to fine-tune
        dir="/nemo-workspace/llama3.2_3b_lepton",  # Use nemo-workspace mount path
        name="llama3_lora_lepton",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
        peft_scheme="lora",  # Parameter-Efficient Fine-Tuning with LoRA
        max_steps=100,  # Adjust based on your needs
    )

    # LoRA configuration
    recipe.peft.target_modules = ["linear_qkv", "linear_proj", "linear_fc1", "*_proj"]
    recipe.peft.dim = 16
    recipe.peft.alpha = 32

    # Strategy configuration for distributed training
    if nodes == 1:
        recipe.trainer.strategy = "auto"  # Let Lightning choose the best strategy
    else:
        recipe.trainer.strategy = run.Config(
            nl.FSDP2Strategy, data_parallel_size=nodes * gpus_per_node, tensor_parallel_size=1
        )

    return recipe


if __name__ == "__main__":
    # Configuration
    nodes = 1  # Start with a single node for testing
    gpus_per_node = 1

    # Create the fine-tuning recipe
    recipe = create_finetune_recipe(nodes, gpus_per_node)

    # Create the executor with secret handling
    executor = nemo_lepton_executor(
        nodes=nodes,
        devices=gpus_per_node,
        container_image="nvcr.io/nvidia/nemo:25.04",  # Use an appropriate NeMo container
    )

    # Optional: check executor capabilities
    print("🔍 Executor Information:")
    print(f"📋 Nodes: {executor.nnodes()}")
    print(f"📋 Processes per node: {executor.nproc_per_node()}")

    # Check macro support
    macro_values = executor.macro_values()
    print(f"📋 Macro values support: {macro_values is not None}")

    try:
        # Create and run the experiment
        with run.Experiment(
            "lepton-nemo-secrets-demo", executor=executor, log_level="DEBUG"
        ) as exp:
            # Add the fine-tuning task
            task_id = exp.add(recipe, tail_logs=True, name="llama3_lora_with_secrets")

            # Execute the experiment
            print("Starting fine-tuning experiment with secure secret handling...")
            exp.run(detach=False, tail_logs=True, sequential=True)

            print("Experiment completed successfully!")

    except Exception as e:
        print(f"\nError occurred: {type(e).__name__}")
        print(f"Message: {e}")
        print("\n💡 Common issues to check:")
        print("  - Ensure your Lepton workspace has the required secrets configured")
        print("  - Verify node_group and resource_shape match your workspace")
        print("  - Check that mount paths are correct and accessible")
        print("  - Confirm the container image is available and compatible")
```

nemo_run/core/execution/lepton.py

Lines changed: 11 additions & 1 deletion

```diff
@@ -17,6 +17,7 @@
 from leptonai.api.v1.types.dedicated_node_group import DedicatedNodeGroup
 from leptonai.api.v1.types.deployment import (
     EnvVar,
+    EnvValue,
     LeptonContainer,
     Mount,
 )
@@ -232,7 +233,16 @@ def create_lepton_job(self, name: str):
         """
         client = APIClient()

-        envs = [EnvVar(name=key, value=value) for key, value in self.env_vars.items()]
+        # Process environment variables - handle both regular values and secret references
+        envs = []
+        for key, value in self.env_vars.items():
+            if isinstance(value, dict) and "value_from" in value:
+                # Handle secret reference
+                secret_name_ref = value["value_from"]["secret_name_ref"]
+                envs.append(EnvVar(name=key, value_from=EnvValue(secret_name_ref=secret_name_ref)))
+            else:
+                # Handle regular environment variable
+                envs.append(EnvVar(name=key, value=str(value)))

         cmd = ["/bin/bash", "-c", f"bash {self.lepton_job_dir}/launch_script.sh"]
```
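The new branching logic can be exercised in isolation. The sketch below uses stand-in `EnvVar`/`EnvValue` dataclasses (the real classes come from `leptonai.api.v1.types.deployment`) and a `build_envs` function mirroring the loop added above:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvValue:
    """Stand-in for leptonai's EnvValue: a reference to a workspace secret."""
    secret_name_ref: str


@dataclass
class EnvVar:
    """Stand-in for leptonai's EnvVar: a literal value or a secret reference."""
    name: str
    value: Optional[str] = None
    value_from: Optional[EnvValue] = None


def build_envs(env_vars: dict) -> list:
    """Mirror the new LeptonExecutor logic: dicts with 'value_from' become secret refs."""
    envs = []
    for key, value in env_vars.items():
        if isinstance(value, dict) and "value_from" in value:
            # Secret reference: resolved by Lepton at runtime, never stored in code
            ref = value["value_from"]["secret_name_ref"]
            envs.append(EnvVar(name=key, value_from=EnvValue(secret_name_ref=ref)))
        else:
            # Regular environment variable, coerced to a string
            envs.append(EnvVar(name=key, value=str(value)))
    return envs


envs = build_envs({
    "HF_TOKEN": {"value_from": {"secret_name_ref": "HUGGING_FACE_HUB_TOKEN_read"}},
    "NCCL_DEBUG": "INFO",
})
# envs[0] carries a value_from reference; envs[1] carries a literal value
```

Note the `str(value)` coercion on the fallback branch: the old one-liner passed values through unchanged, so non-string values (e.g. ints) would previously have reached `EnvVar` unconverted.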
