diff --git a/docs/reference/README.md b/docs/reference/README.md
index dfc737b5..a55f107d 100644
--- a/docs/reference/README.md
+++ b/docs/reference/README.md
@@ -22,7 +22,8 @@ Technical reference documentation for submission scripts, workflow templates, en
 | [Script Reference](scripts.md) | Submission script inventory, CLI arguments, and configuration | Available |
 | [Script Examples](scripts-examples.md) | Detailed examples for training, inference, and pipeline submission | Available |
 | [Copilot Artifacts](copilot-artifacts.md) | Agents, instructions, prompts, and skills inventory | Available |
-| Workflow Templates | AzureML and OSMO workflow YAML templates and configuration | Coming soon |
+| [Workflow Templates (AzureML)](workflow-templates-azureml.md) | AzureML workflow YAML templates and configuration | Available |
+| [Workflow Templates (OSMO)](workflow-templates-osmo.md) | OSMO workflow YAML templates and configuration | Available |
 | Terraform Variables | Terraform input variables and outputs reference | Coming soon |
 | Environment Variables | Environment variables for training and deployment | Coming soon |

diff --git a/docs/reference/workflow-templates-azureml.md b/docs/reference/workflow-templates-azureml.md
new file mode 100644
index 00000000..de82dee9
--- /dev/null
+++ b/docs/reference/workflow-templates-azureml.md
@@ -0,0 +1,75 @@
+---
+title: Workflow Templates (AzureML)
+description: Canonical AzureML workflow template reference for training and evaluation jobs.
+author: Microsoft Robotics-AI Team
+ms.date: 2026-04-01
+ms.topic: reference
+keywords:
+ - azureml
+ - workflows
+ - templates
+ - training
+ - evaluation
+---
+
+Canonical AzureML workflow templates for RL and LeRobot training and evaluation.
+Template names, defaults, and paths in this page are derived from the YAML files
+in `training/` and `evaluation/`.
+
+## Template Inventory
+
+| Template | Purpose | Source YAML path | Typical submit path |
+| --- | --- | --- | --- |
+| `train.yaml` | IsaacLab RL training job structure | `training/rl/workflows/azureml/train.yaml` | `training/rl/scripts/submit-azureml-training.sh` |
+| `lerobot-train.yaml` | LeRobot behavioral cloning training job structure | `training/il/workflows/azureml/lerobot-train.yaml` | `training/il/scripts/submit-azureml-lerobot-training.sh` |
+| `validate.yaml` | IsaacLab policy validation against registered models | `evaluation/sil/workflows/azureml/validate.yaml` | `evaluation/sil/scripts/submit-azureml-validation.sh` |
+| `lerobot-eval.yaml` | LeRobot policy evaluation and optional model registration | `evaluation/sil/workflows/azureml/lerobot-eval.yaml` | `evaluation/sil/scripts/submit-azureml-lerobot-eval.sh` |
+
+## train.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | Structural template for IsaacLab RL training submissions in AzureML. |
+| Source YAML path | `training/rl/workflows/azureml/train.yaml` |
+| Primary parameters and overrides | `inputs.task` (`Isaac-Velocity-Rough-Anymal-C-v0`), `inputs.num_envs` (`"2048"`), `inputs.max_iterations` (`"600"`), `inputs.checkpoint_mode` (`from-scratch`), `inputs.checkpoint_uri` (`none`), `inputs.register_checkpoint` (`none`), `inputs.run_azure_smoke_test` (`"false"`). |
+| Typical submit path | `training/rl/scripts/submit-azureml-training.sh` |
+| Usage notes | Keep template values as structural defaults. The submit script sets runtime command, compute, and Azure context. |
+
+## lerobot-train.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | Structural template for LeRobot ACT or Diffusion training on AzureML. |
+| Source YAML path | `training/il/workflows/azureml/lerobot-train.yaml` |
+| Primary parameters and overrides | `inputs.dataset_repo_id` (`none`), `inputs.policy_type` (`act`), `inputs.job_name` (`lerobot-act-training`), `inputs.output_dir` (`/workspace/outputs/train`), `inputs.training_steps` (`none`), `inputs.batch_size` (`none`), `inputs.eval_freq` (`none`), `inputs.save_freq` (`"5000"`), `inputs.register_checkpoint` (`none`). |
+| Typical submit path | `training/il/scripts/submit-azureml-lerobot-training.sh` |
+| Usage notes | Use script flags for policy source and hyperparameters. Secrets such as HuggingFace and WANDB values are injected at submission time. |
+
+## validate.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | Structural template for IsaacLab validation jobs against registered models. |
+| Source YAML path | `evaluation/sil/workflows/azureml/validate.yaml` |
+| Primary parameters and overrides | `inputs.trained_model.path` (`azureml:placeholder:1`), `inputs.task` (`auto`), `inputs.framework` (`auto`), `inputs.eval_episodes` (`100`), `inputs.num_envs` (`64`), `inputs.success_threshold` (`-1.0`). |
+| Typical submit path | `evaluation/sil/scripts/submit-azureml-validation.sh` |
+| Usage notes | The script resolves model metadata and passes overrides with `--set`. The template intentionally uses sentinel defaults (`auto`, placeholder paths). |
+
+## lerobot-eval.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | Structural template for LeRobot evaluation and optional model registration on AzureML. |
+| Source YAML path | `evaluation/sil/workflows/azureml/lerobot-eval.yaml` |
+| Primary parameters and overrides | `inputs.policy_repo_id` (`none`), `inputs.policy_type` (`act`), `inputs.dataset_repo_id` (`none`), `inputs.eval_episodes` (`"10"`), `inputs.eval_batch_size` (`"10"`), `inputs.record_video` (`"false"`), `inputs.mlflow_enable` (`"false"`), `inputs.register_model` (`none`), `inputs.blob_storage_container` (`datasets`). |
+| Typical submit path | `evaluation/sil/scripts/submit-azureml-lerobot-eval.sh` |
+| Usage notes | This template is the canonical AzureML LeRobot evaluation reference. |
+
+## Usage Notes
+
+| Topic | Guidance |
+| --- | --- |
+| Source of truth | Use YAML files in `training/` and `evaluation/` for template names, keys, and defaults. |
+| Override pattern | Treat templates as structure-first; submission scripts provide runtime command and environment-specific values. |
+| Azure context | Set `subscription_id`, `resource_group`, and `workspace_name` through script options or environment variables. |
+| Related reference | See [Reference index](README.md) for adjacent script and artifact guides. |
diff --git a/docs/reference/workflow-templates-osmo.md b/docs/reference/workflow-templates-osmo.md
new file mode 100644
index 00000000..c3bf48ba
--- /dev/null
+++ b/docs/reference/workflow-templates-osmo.md
@@ -0,0 +1,86 @@
+---
+title: Workflow Templates (OSMO)
+description: Canonical OSMO workflow template reference for training and evaluation jobs.
+author: Microsoft Robotics-AI Team
+ms.date: 2026-04-01
+ms.topic: reference
+keywords:
+ - osmo
+ - workflows
+ - templates
+ - training
+ - evaluation
+---
+
+Canonical OSMO workflow templates for RL and LeRobot training and evaluation.
+Template names in this page are based on current YAML files and exclude stale
+legacy naming.
+
+## Template Inventory
+
+| Template | Purpose | Source YAML path | Typical submit path |
+| --- | --- | --- | --- |
+| `train.yaml` | IsaacLab RL training with inline payload archive | `training/rl/workflows/osmo/train.yaml` | `training/rl/scripts/submit-osmo-training.sh` |
+| `train-dataset.yaml` | IsaacLab RL training with dataset folder injection | `training/il/workflows/osmo/train-dataset.yaml` | `training/il/scripts/submit-osmo-dataset-training.sh` |
+| `lerobot-train.yaml` | LeRobot ACT or Diffusion training workflow | `training/il/workflows/osmo/lerobot-train.yaml` | `training/il/scripts/submit-osmo-lerobot-training.sh` |
+| `eval.yaml` | IsaacLab checkpoint evaluation workflow | `evaluation/sil/workflows/osmo/eval.yaml` | `evaluation/sil/scripts/submit-osmo-eval.sh` |
+| `lerobot-eval.yaml` | LeRobot policy evaluation workflow | `evaluation/sil/workflows/osmo/lerobot-eval.yaml` | `evaluation/sil/scripts/submit-osmo-lerobot-eval.sh` |
+
+## train.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | OSMO RL training using a base64-encoded runtime payload. |
+| Source YAML path | `training/rl/workflows/osmo/train.yaml` |
+| Primary parameters and overrides | `default-values.task` (`Isaac-Velocity-Rough-Anymal-C-v0`), `default-values.num_envs` (`"2048"`), `default-values.max_iterations` (empty), `default-values.checkpoint_mode` (`from-scratch`), `default-values.training_backend` (`skrl`), `default-values.gpu` (`"1"`), `default-values.cpu` (`"30"`). |
+| Typical submit path | `training/rl/scripts/submit-osmo-training.sh` |
+| Usage notes | Use for RL training when shipping the runtime payload inline. Script flags typically override task, resources, and checkpoint behavior. |
+
+## train-dataset.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | OSMO RL training that mounts training code from an uploaded dataset path. |
+| Source YAML path | `training/il/workflows/osmo/train-dataset.yaml` |
+| Primary parameters and overrides | `default-values.dataset_bucket` (`training`), `default-values.dataset_name` (`training-code`), `default-values.task` (`Isaac-Velocity-Rough-Anymal-C-v0`), `default-values.num_envs` (`"2048"`), `default-values.checkpoint_mode` (`from-scratch`), `default-values.training_backend` (`skrl`). |
+| Typical submit path | `training/il/scripts/submit-osmo-dataset-training.sh` |
+| Usage notes | Use when payload size or reuse favors dataset-based delivery. The script stages and uploads training sources before submission. |
+
+## lerobot-train.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | OSMO LeRobot training with optional Azure Blob dataset source and checkpoint registration. |
+| Source YAML path | `training/il/workflows/osmo/lerobot-train.yaml` |
+| Primary parameters and overrides | `default-values.policy_type` (`act`), `default-values.dataset_repo_id` (empty), `default-values.training_steps` (`"100000"`), `default-values.batch_size` (`"32"`), `default-values.learning_rate` (`"1e-4"`), `default-values.save_freq` (`"5000"`), `default-values.storage_container` (`datasets`), `default-values.register_checkpoint` (empty). |
+| Typical submit path | `training/il/scripts/submit-osmo-lerobot-training.sh` |
+| Usage notes | Supports HuggingFace and blob-backed datasets. Keep policy type and data source aligned with script flags to avoid mixed-source configuration. |
+
+## eval.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | OSMO IsaacLab checkpoint evaluation for policy export and validation. |
+| Source YAML path | `evaluation/sil/workflows/osmo/eval.yaml` |
+| Primary parameters and overrides | `default-values.task` (`Isaac-Ant-v0`), `default-values.num_envs` (`"4"`), `default-values.max_steps` (`"500"`), `default-values.video_length` (`"200"`), `default-values.checkpoint_uri` (empty), `default-values.inference_format` (`both`). |
+| Typical submit path | `evaluation/sil/scripts/submit-osmo-eval.sh` |
+| Usage notes | Requires checkpoint URI at submission. Use `inference_format` to control ONNX/JIT export behavior for downstream use. |
+
+## lerobot-eval.yaml
+
+| Field | Details |
+| --- | --- |
+| Purpose | OSMO LeRobot evaluation for HuggingFace or AzureML model sources, with optional registration. |
+| Source YAML path | `evaluation/sil/workflows/osmo/lerobot-eval.yaml` |
+| Primary parameters and overrides | `default-values.policy_repo_id` (empty), `default-values.policy_type` (`act`), `default-values.dataset_repo_id` (empty), `default-values.eval_episodes` (`"10"`), `default-values.eval_batch_size` (`"10"`), `default-values.record_video` (`"false"`), `default-values.mlflow_enable` (`"false"`), `default-values.register_model` (empty), `default-values.blob_storage_container` (`datasets`). |
+| Typical submit path | `evaluation/sil/scripts/submit-osmo-lerobot-eval.sh` |
+| Usage notes | This is the canonical LeRobot OSMO evaluation template. |
+
+## Usage Notes
+
+| Topic | Guidance |
+| --- | --- |
+| Source of truth | Use YAML files under `training/` and `evaluation/` as the canonical inventory. |
+| Submission flow | Submit through the companion scripts listed above to resolve defaults from CLI, env vars, and Terraform outputs. |
+| Runtime packaging | RL workflows use inline payload or dataset injection; choose based on payload size and reuse needs. |
+| Related reference | See [Reference index](README.md) for adjacent script and artifact guides. |
diff --git a/workflows/README.md b/workflows/README.md
index 02acb711..2324df19 100644
--- a/workflows/README.md
+++ b/workflows/README.md
@@ -1,312 +1,28 @@
 ---
 title: Workflows
-description: AzureML and OSMO workflow templates for robotics training and validation jobs
+description: Pointer index for workflow template reference documentation.
 author: Edge AI Team
-ms.date: 2026-03-20
+ms.date: 2026-04-01
 ms.topic: reference
 ---

-Workflow templates for submitting robotics training and validation jobs to Azure infrastructure.
+Workflow template details are maintained in reference documentation.
+Use this page as a pointer index.

-## 📁 Directory Structure
+## 📚 Reference Pages

-```text
-workflows/
-├── README.md
-├── azureml/
-│ ├── README.md
-│ ├── train.yaml # Training job specification
-│ ├── lerobot-train.yaml # LeRobot behavioral cloning (AzureML)
-│ └── validate.yaml # Validation job specification
-└── osmo/
- ├── README.md
- ├── train.yaml # OSMO training (base64 payload)
- ├── train-dataset.yaml # OSMO training (dataset folder upload)
- ├── lerobot-train.yaml # LeRobot behavioral cloning training
- ├── lerobot-infer.yaml # LeRobot inference/evaluation
- └── infer.yaml # OSMO inference workflow
-```
+* [Workflow Templates (AzureML)](../docs/reference/workflow-templates-azureml.md)
+* [Workflow Templates (OSMO)](../docs/reference/workflow-templates-osmo.md)

-## ⚖️ Platform Comparison
+## 🧭 Directory Pointers

-| Feature | AzureML | OSMO |
-|---------------|---------------------------|--------------------------|
-| Orchestration | Azure ML Job Service | OSMO Workflow Engine |
-| Scheduling | Azure ML Compute | KAI Scheduler / Volcano |
-| Multi-node | Azure ML distributed jobs | OSMO workflow DAGs |
-| Checkpointing | MLflow integration | MLflow + custom handlers |
-| Monitoring | Azure ML Studio | OSMO UI Dashboard |
+| Path | Purpose |
+|-------------------------|-------------------------------------------|
+| `workflows/azureml/` | AzureML workflow pointer documentation |
+| `workflows/osmo/` | OSMO workflow pointer documentation |

-## 🚀 Quick Start
+## 🔗 Canonical Sources

-### AzureML Workflows
-
-```bash
-# Training job
-training/rl/scripts/submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0
-
-# LeRobot behavioral cloning (AzureML)
-training/il/scripts/submit-azureml-lerobot-training.sh -d lerobot/aloha_sim_insertion_human
-
-# Validation job (model name derived from task by default)
-evaluation/sil/scripts/submit-azureml-validation.sh --task Isaac-Velocity-Rough-Anymal-C-v0
-```
-
-### OSMO Workflows
-
-```bash
-# Base64 payload (< 1MB training code)
-training/rl/scripts/submit-osmo-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0
-
-# Dataset folder upload (unlimited size, versioned)
-training/il/scripts/submit-osmo-dataset-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0
-
-# LeRobot behavioral cloning (HuggingFace datasets)
-training/il/scripts/submit-osmo-lerobot-training.sh -d lerobot/aloha_sim_insertion_human
-
-# LeRobot inference/evaluation
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh --policy-repo-id user/trained-policy
-
-# End-to-end pipeline: train → evaluate → register
-training/pipelines/run-lerobot-pipeline.sh \
- -d lerobot/aloha_sim_insertion_human \
- --policy-repo-id user/my-policy \
- -r my-model
-```
-
-## 💾 OSMO Dataset Workflow
-
-The `train-dataset.yaml` template uploads `training/rl/` as a versioned OSMO dataset instead of base64-encoding it inline.
-
-| Aspect | train.yaml | train-dataset.yaml |
-|----------------|------------------------|-----------------------|
-| Payload method | Base64-encoded archive | Dataset folder upload |
-| Size limit | ~1MB | Unlimited |
-| Versioning | None | Automatic |
-| Reusability | Per-run | Across runs |
-
-### Dataset Submission
-
-```bash
-# Default configuration
-training/il/scripts/submit-osmo-dataset-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0
-
-# Custom dataset configuration
-training/il/scripts/submit-osmo-dataset-training.sh \
- --dataset-bucket custom-bucket \
- --dataset-name my-training-v1 \
- --task Isaac-Velocity-Rough-Anymal-C-v0
-```
-
-### Dataset Parameters
-
-| Parameter | Default | Description |
-|--------------------|-----------------|-------------------------------|
-| `--dataset-bucket` | `training` | OSMO bucket for training code |
-| `--dataset-name` | `training-code` | Dataset name (auto-versioned) |
-| `--training-path` | `training/rl` | Local folder to upload |
-
-The training folder mounts at `/data//training` inside the container.
-
-## 🤖 LeRobot Behavioral Cloning Workflow
-
-The `lerobot-train.yaml` workflow trains behavioral cloning policies using the LeRobot framework. It supports ACT and Diffusion policy architectures with HuggingFace Hub datasets.
-
-### LeRobot Features
-
-| Feature | Description |
-|-----------------|-------------------------------------------------------|
-| Policy types | ACT, Diffusion |
-| Dataset source | HuggingFace Hub (e.g., `lerobot/aloha_sim_insertion`) |
-| Logging | Azure MLflow |
-| Checkpoints | Automatic save + Azure ML registration |
-| Runtime install | LeRobot installed via `uv pip` (no source packaging) |
-
-### LeRobot Parameters
-
-| Parameter | Default | Description |
-|-------------------------|------------|--------------------------------------|
-| `--dataset-repo-id` | (required) | HuggingFace dataset repository ID |
-| `--policy-type` | `act` | Policy: `act`, `diffusion` |
-| `--mlflow-enable` | disabled | Azure ML MLflow logging |
-| `--register-checkpoint` | (none) | Model name for Azure ML registration |
-
-### LeRobot Examples
-
-```bash
-# ACT training with WANDB
-training/il/scripts/submit-osmo-lerobot-training.sh \
- -d lerobot/aloha_sim_insertion_human
-
-# Diffusion policy with MLflow and model registration
-training/il/scripts/submit-osmo-lerobot-training.sh \
- -d user/custom-dataset \
- -p diffusion \
- --mlflow-enable \
- -r my-diffusion-model
-```
-
-## LeRobot Inference Workflow
-
-The `lerobot-infer.yaml` workflow evaluates trained LeRobot policies from HuggingFace Hub. Downloads policy checkpoints, runs evaluation, and optionally registers models to Azure ML.
-
-### Inference Features
-
-| Feature | Description |
-|--------------------|-------------------------------------------|
-| Policy source | HuggingFace Hub repositories |
-| Policy types | ACT, Diffusion |
-| Model registration | Optional Azure ML model registration |
-| Evaluation | Configurable episode count and batch size |
-
-### Inference Parameters
-
-| Parameter | Default | Description |
-|--------------------|------------|--------------------------------------|
-| `--policy-repo-id` | (required) | HuggingFace policy repository |
-| `--policy-type` | `act` | Policy: `act`, `diffusion` |
-| `--eval-episodes` | `10` | Number of evaluation episodes |
-| `--register-model` | (none) | Model name for Azure ML registration |
-
-### Inference Examples
-
-```bash
-# Evaluate trained policy
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
- --policy-repo-id user/trained-act-policy
-
-# Evaluate with model registration
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
- --policy-repo-id user/trained-act-policy \
- -r my-evaluated-model \
- --eval-episodes 50
-```
-
-## 🔮 OSMO Inference Workflow
-
-The inference workflow exports trained checkpoints to deployment-ready formats (ONNX, TorchScript) and validates them in simulation.
-
-### Supported Model Formats
-
-| Format | Extension | Use Case |
-|-------------|-----------|--------------------------------------------|
-| ONNX | `.onnx` | Cross-platform deployment, ONNX Runtime |
-| TorchScript | `.pt` | PyTorch-native deployment, JIT compilation |
-| Both | — | Export and validate both formats (default) |
-
-### Checkpoint URI Formats
-
-The workflow accepts checkpoints from multiple sources:
-
-| Source | URI Format | Example |
-|--------------|--------------------------------------------------------------|-------------------------------------------------------------------------------|
-| MLflow run | `runs://` | `runs:/b906b426-078e-4539-b907-aecb3121a76d/checkpoints/final/model_99.pt` |
-| MLflow model | `models://` | `models:/anymal-rough-terrain/1` |
-| Azure Blob | `https://.blob.core.windows.net//` | `https://stosmorbt3dev001.blob.core.windows.net/azureml/checkpoints/model.pt` |
-| HTTP(S) | Direct URL | `https://example.com/models/policy.pt` |
-
-### Basic Usage
-
-```bash
-evaluation/sil/scripts/submit-osmo-eval.sh \
- --checkpoint-uri "runs:/abc123/checkpoints/final/model_999.pt" \
- --task Isaac-Ant-v0
-```
-
-### OSMO Inference Parameters
-
-| Parameter | Default | Description |
-|--------------------|----------------|----------------------------|
-| `--checkpoint-uri` | (required) | URI to training checkpoint |
-| `--task` | `Isaac-Ant-v0` | Isaac Lab task name |
-| `--format` | `both` | `onnx`, `jit`, or `both` |
-| `--num-envs` | `4` | Number of environments |
-| `--max-steps` | `500` | Maximum inference steps |
-| `--video-length` | `200` | Video recording length |
-
-### Examples
-
-```bash
-# ONNX-only inference with custom parameters
-evaluation/sil/scripts/submit-osmo-eval.sh \
- --checkpoint-uri "models:/my-model/1" \
- --task Isaac-Velocity-Rough-Anymal-C-v0 \
- --format onnx \
- --num-envs 8 \
- --max-steps 1000 \
- --video-length 300
-
-# TorchScript-only inference
-evaluation/sil/scripts/submit-osmo-eval.sh \
- --checkpoint-uri "runs:/abc123/checkpoints/final/model_99.pt" \
- --task Isaac-Ant-v0 \
- --format jit
-
-# With explicit Azure context
-evaluation/sil/scripts/submit-osmo-eval.sh \
- --checkpoint-uri "runs:/abc123/checkpoints/model_999.pt" \
- --task Isaac-Ant-v0 \
- --azure-subscription-id "00000000-0000-0000-0000-000000000000" \
- --azure-resource-group "rg-robotics" \
- --azure-workspace-name "aml-robotics"
-```
-
-### Locating Checkpoints from Training Runs
-
-Training workflows upload checkpoints to Azure ML as MLflow artifacts. To find checkpoint URIs from completed training runs:
-
-```bash
-# List recent OSMO workflows
-osmo workflow list
-
-# View logs from a completed training run
-osmo workflow logs isaaclab-inline-training-55 | grep -E "checkpoint|\.pt|mlflow"
-```
-
-Training logs display the MLflow run ID and artifact paths:
-
-```text
-INFO | MLflow tracking configured: experiment=isaaclab-rsl-rl-Isaac-Velocity-Rough-Anymal-C-v0
-INFO | Found final model: /workspace/isaaclab/logs/rsl_rl/anymal_c_rough/2026-02-03_15-37-23/model_99.pt
-View run at: .../runs/b906b426-078e-4539-b907-aecb3121a76d
-```
-
-Construct the checkpoint URI from the run ID and artifact path:
-
-```text
-runs:/b906b426-078e-4539-b907-aecb3121a76d/checkpoints/final/model_99.pt
-```
-
-### Workflow Outputs
-
-The inference workflow produces:
-
-| Artifact | Description |
-|-----------------------------|-------------------------------------------|
-| `exported/policy.onnx` | ONNX-exported policy model |
-| `exported/policy.pt` | TorchScript-exported policy model |
-| `metrics/onnx_metrics.json` | ONNX inference performance metrics |
-| `metrics/jit_metrics.json` | TorchScript inference performance metrics |
-| `videos/onnx_play/` | ONNX inference video recordings |
-| `videos/jit_play/` | TorchScript inference video recordings |
-
-## 📋 Prerequisites
-
-| Requirement | Setup |
-|-------------------------------|-----------------------------|
-| Infrastructure deployed | `infrastructure/terraform/` |
-| Setup scripts completed | `infrastructure/setup/` |
-| Azure CLI authenticated | `az login` |
-| OSMO CLI (for OSMO workflows) | Installed and configured |
-
-## ⚙️ Configuration
-
-Scripts resolve values in order:
-
-| Precedence | Source | Example |
-|-------------|-----------------------|----------------------------------|
-| 1 (highest) | CLI arguments | `--resource-group rg-custom` |
-| 2 | Environment variables | `AZURE_RESOURCE_GROUP=rg-custom` |
-| 3 (default) | Terraform outputs | `infrastructure/terraform/` |
-
-See individual workflow READMEs for detailed configuration options.
+Canonical workflow YAML files are under `training/*/workflows/` and
+`evaluation/sil/workflows/`. Use submission scripts under `training/*/scripts/`
+and `evaluation/sil/scripts/` for execution.
diff --git a/workflows/azureml/README.md b/workflows/azureml/README.md
index 7eefcb5b..92017105 100644
--- a/workflows/azureml/README.md
+++ b/workflows/azureml/README.md
@@ -1,120 +1,29 @@
 ---
 title: AzureML Workflows
-description: Azure Machine Learning job templates for robotics training and validation
+description: Pointer page for AzureML workflow template documentation.
 author: Edge AI Team
-ms.date: 2026-03-20
+ms.date: 2026-04-01
 ms.topic: reference
 ---

-Azure Machine Learning job templates for Isaac Lab training and validation workloads.
+AzureML workflow template details are documented in reference pages.
+Use this README as a concise pointer.

-## 📜 Available Templates
+## 📚 Reference Pages

-| Template | Purpose | Submission Script |
-|------------------------------------------------------------------------------|---------------------------------------|----------------------------------------------------------|
-| [train.yaml](../../training/rl/workflows/azureml/train.yaml) | Training jobs with checkpoint support | `training/rl/scripts/submit-azureml-training.sh` |
-| [validate.yaml](../../evaluation/sil/workflows/azureml/validate.yaml) | Policy validation and inference | `evaluation/sil/scripts/submit-azureml-validation.sh` |
-| [lerobot-train.yaml](../../training/il/workflows/azureml/lerobot-train.yaml) | LeRobot behavioral cloning training | `training/il/scripts/submit-azureml-lerobot-training.sh` |
+* [Workflow Templates (AzureML)](../../docs/reference/workflow-templates-azureml.md)
+* [Workflow Templates (OSMO)](../../docs/reference/workflow-templates-osmo.md)

-## 🏋️ Training Job (`train.yaml`)
+## 🧭 Canonical Scope

-Submits Isaac Lab reinforcement learning training to AKS GPU nodes via Azure ML.
+This folder maps to AzureML template sources under
+`training/*/workflows/azureml/` and `evaluation/sil/workflows/azureml/`.

-### Key Parameters
+## 🔗 Submission Paths

-| Input | Description | Default |
-|-------------------|---------------------------------|------------------------------------|
-| `mode` | Execution mode | `train` |
-| `checkpoint_mode` | Checkpoint loading strategy | `from-scratch` |
-| `task` | Isaac Lab task name | `Isaac-Velocity-Rough-Anymal-C-v0` |
-| `num_envs` | Number of parallel environments | `4096` |
-| `headless` | Run without rendering | `true` |
-| `max_iterations` | Training iterations | `4500` |
+Use script-based submission from:

-### Training Usage
-
-```bash
-# Default configuration from Terraform outputs
-training/rl/scripts/submit-azureml-training.sh
-
-# Override specific parameters
-training/rl/scripts/submit-azureml-training.sh \
- --resource-group rg-custom \
- --workspace-name mlw-custom
-```
-
-## ✅ Validation Job (`validate.yaml`)
-
-Runs trained policy validation and generates inference metrics.
-
-### Validation Parameters
-
-| Input | Description | Default |
-|-------------------|-----------------------------|------------------------------------|
-| `mode` | Execution mode | `play` |
-| `checkpoint_mode` | Must use trained checkpoint | `from-trained` |
-| `task` | Isaac Lab task name | `Isaac-Velocity-Rough-Anymal-C-v0` |
-| `num_envs` | Environments for validation | `1024` |
-
-### Validation Usage
-
-```bash
-# Default configuration
-evaluation/sil/scripts/submit-azureml-validation.sh
-
-# With custom checkpoint
-evaluation/sil/scripts/submit-azureml-validation.sh \
- --checkpoint-path "azureml://datastores/checkpoints/paths/model.pt"
-```
-
-## ⚙️ Environment Variables
-
-All scripts support environment variable configuration:
-
-| Variable | Description |
-|--------------------------|-------------------------|
-| `AZURE_SUBSCRIPTION_ID` | Azure subscription ID |
-| `AZURE_RESOURCE_GROUP` | Resource group name |
-| `AZUREML_WORKSPACE_NAME` | Azure ML workspace name |
-| `AZUREML_COMPUTE` | Compute target name |
-
-## 📋 Prerequisites
-
-1. Azure ML extension installed on AKS cluster
-2. Kubernetes compute target attached to workspace
-3. GPU instance types configured in cluster
-
-## 🤖 LeRobot Training Job (`lerobot-train.yaml`)
-
-Submits LeRobot behavioral cloning training (ACT/Diffusion policies) to Azure ML. Installs LeRobot dynamically in the container and trains from HuggingFace Hub datasets.
-
-### LeRobot Parameters
-
-| Input | Description | Default |
-|-------------------|--------------------------------|-------------------------------------------------|
-| `dataset_repo_id` | HuggingFace dataset repository | (required) |
-| `policy_type` | Policy architecture | `act` |
-| `job_name` | Job identifier | `lerobot-act-training` |
-| `image` | Container image | `pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime` |
-| `wandb_enable` | Enable WANDB logging | `true` |
-| `save_freq` | Checkpoint save frequency | `5000` |
-
-### LeRobot Usage
-
-```bash
-# ACT policy training
-training/il/scripts/submit-azureml-lerobot-training.sh \
- -d lerobot/aloha_sim_insertion_human
-
-# Diffusion policy with model registration
-training/il/scripts/submit-azureml-lerobot-training.sh \
- -d user/custom-dataset \
- -p diffusion \
- -r my-diffusion-model \
- --stream
-```
-
-
-*🤖 Crafted with precision by ✨Copilot following brilliant human instruction,
-then carefully refined by our team of discerning human reviewers.*
-
+* `training/rl/scripts/submit-azureml-training.sh`
+* `training/il/scripts/submit-azureml-lerobot-training.sh`
+* `evaluation/sil/scripts/submit-azureml-validation.sh`
+* `evaluation/sil/scripts/submit-azureml-lerobot-eval.sh`
diff --git a/workflows/osmo/README.md b/workflows/osmo/README.md
index e016dab5..cf287ca0 100644
--- a/workflows/osmo/README.md
+++ b/workflows/osmo/README.md
@@ -1,301 +1,30 @@
 ---
 title: OSMO Workflows
-description: NVIDIA OSMO workflow templates for distributed robotics training
+description: Pointer page for OSMO workflow template documentation.
 author: Edge AI Team
-ms.date: 2026-03-20
+ms.date: 2026-04-01
 ms.topic: reference
 ---

-NVIDIA OSMO workflow templates for distributed Isaac Lab training on Azure Kubernetes Service.
+OSMO workflow template details are documented in reference pages.
+Use this README as a concise pointer.

-## 📜 Available Templates
+## 📚 Reference Pages

-| Template | Purpose | Submission Script |
-|----------------------------------------------------------------------------|---------------------------------------|-------------------------------------------------------|
-| [train.yaml](../../training/rl/workflows/osmo/train.yaml) | Distributed training (base64 inline) | `training/rl/scripts/submit-osmo-training.sh` |
-| [train-dataset.yaml](../../training/il/workflows/osmo/train-dataset.yaml) | Distributed training (dataset upload) | `training/il/scripts/submit-osmo-dataset-training.sh` |
-| [lerobot-train.yaml](../../training/il/workflows/osmo/lerobot-train.yaml) | LeRobot behavioral cloning | `training/il/scripts/submit-osmo-lerobot-training.sh` |
-| [lerobot-eval.yaml](../../evaluation/sil/workflows/osmo/lerobot-eval.yaml) | LeRobot inference/evaluation | `evaluation/sil/scripts/submit-osmo-lerobot-eval.sh` |
+* [Workflow Templates (OSMO)](../../docs/reference/workflow-templates-osmo.md)
+* [Workflow Templates (AzureML)](../../docs/reference/workflow-templates-azureml.md)

-## ⚖️ Workflow Comparison
+## 🧭 Canonical Scope

-| Aspect | train.yaml | train-dataset.yaml |
-|-------------|------------------------|-----------------------|
-| Payload | Base64-encoded archive | Dataset folder upload |
-| Size limit | ~1MB | Unlimited |
-| Versioning | None | Automatic |
-| Reusability | Per-run | Across runs |
-| Setup | None | Bucket configured |
+This folder maps to OSMO template sources under `training/*/workflows/osmo/`
+and `evaluation/sil/workflows/osmo/`.
-## 🏋️ Training Workflow (`train.yaml`)
+## 🔗 Submission Paths
-Submits Isaac Lab distributed training through OSMO's workflow orchestration engine.
+Use script-based submission from:
-### Training Features
-
-* Multi-GPU distributed training coordination
-* KAI Scheduler / Volcano integration
-* Automatic checkpointing and recovery
-* OSMO UI monitoring dashboard
-
-### Workflow Parameters
-
-Parameters are passed as key=value pairs through the submission script:
-
-| Parameter | Description |
-|-------------------------|-----------------------|
-| `azure_subscription_id` | Azure subscription ID |
-| `azure_resource_group` | Resource group name |
-| `azure_workspace_name` | ML workspace name |
-| `task` | Isaac Lab task name |
-| `num_envs` | Parallel environments |
-| `max_iterations` | Training iterations |
-
-### Usage
-
-```bash
-# Default configuration from Terraform outputs
-training/rl/scripts/submit-osmo-training.sh
-
-# Override parameters
-training/rl/scripts/submit-osmo-training.sh \
-  --azure-subscription-id "your-subscription-id" \
-  --azure-resource-group "rg-custom"
-```
-
-## 💾 Dataset Training Workflow (`train-dataset.yaml`)
-
-Submits Isaac Lab training using OSMO dataset folder injection instead of base64-encoded archives.
-
-### Dataset Features
-
-* Dataset versioning and reusability
-* No payload size limits
-* Training folder mounted at `/data//training`
-* All features from `train.yaml`
-
-### Dataset Parameters
-
-| Parameter | Default | Description |
-|----------------------|-----------------|----------------------------------------------|
-| `dataset_bucket` | `training` | OSMO bucket for training code |
-| `dataset_name` | `training-code` | Dataset name in bucket |
-| `training_localpath` | (required) | Local path to training/ relative to workflow |
-
-### Dataset Usage
-
-```bash
-# Default configuration
-training/il/scripts/submit-osmo-dataset-training.sh
-
-# Custom dataset bucket
-training/il/scripts/submit-osmo-dataset-training.sh \
-  --dataset-bucket custom-bucket \
-  --dataset-name my-training-code
-```
-
-## 🤖 LeRobot Training Workflow (`lerobot-train.yaml`)
-
-Submits LeRobot behavioral cloning training for ACT and Diffusion policy architectures. Uses HuggingFace Hub datasets and installs LeRobot dynamically at runtime via `uv pip install`.
-
-### LeRobot Features
-
-* ACT and Diffusion policy architectures
-* HuggingFace Hub dataset integration
-* Azure MLflow logging backend
-* Automatic checkpoint registration to Azure ML
-* No source payload packaging required
-
-### LeRobot Parameters
-
-| Parameter | Default | Description |
-|-------------------|-------------------------------------------------|--------------------------------------------|
-| `dataset_repo_id` | (required) | HuggingFace dataset (e.g., `user/dataset`) |
-| `policy_type` | `act` | Policy architecture: `act`, `diffusion` |
-| `job_name` | `lerobot-act-training` | Unique job identifier |
-| `image` | `pytorch/pytorch:2.4.1-cuda12.4-cudnn9-runtime` | Container image |
-| `training_steps` | (LeRobot default) | Total training iterations |
-| `batch_size` | (LeRobot default) | Training batch size |
-| `save_freq` | `5000` | Checkpoint save frequency |
-| `wandb_enable` | `true` | Enable WANDB logging |
-| `mlflow_enable` | `false` | Enable Azure ML MLflow logging |
-
-### LeRobot Usage
-
-```bash
-# ACT training with WANDB logging
-training/il/scripts/submit-osmo-lerobot-training.sh \
-  -d lerobot/aloha_sim_insertion_human
-
-# Diffusion policy with MLflow logging
-training/il/scripts/submit-osmo-lerobot-training.sh \
-  -d user/custom-dataset \
-  -p diffusion \
-  --mlflow-enable \
-  -r my-diffusion-model
-
-# Fine-tune from existing policy
-training/il/scripts/submit-osmo-lerobot-training.sh \
-  -d user/dataset \
-  --policy-repo-id user/pretrained-act \
-  --training-steps 50000
-```
-
-### Credential Configuration
-
-The workflow uses OSMO credential injection for HuggingFace and WANDB authentication:
-
-```bash
-# Set HuggingFace token (required for private datasets)
-osmo credential set hf_token --generic --value "hf_..."
-
-# Set WANDB API key (required when wandb_enable=true)
-osmo credential set wandb_api_key --generic --value "..."
-```
-
-## 📦 LeRobot Dataset Training Workflow (`lerobot-train-dataset.yaml`)
-
-Trains LeRobot policies using OSMO dataset mounts instead of HuggingFace Hub downloads. Supports Azure Blob Storage datasets uploaded via OSMO's dataset bucket system.
-
-### Dataset Training Features
-
-* OSMO dataset versioning and reuse across runs
-* Azure Blob Storage integration via `azure://` URLs
-* Falls back to HuggingFace Hub if no dataset mount is available
-* All features from `lerobot-train.yaml`
-
-### Dataset Training Parameters
-
-| Parameter | Default | Description |
-|---------------------|--------------------|----------------------------------------|
-| `dataset_bucket` | `lerobot-datasets` | OSMO bucket for training data |
-| `dataset_name` | `training-data` | Dataset name in bucket |
-| `dataset_localpath` | (required) | Local path to dataset relative to YAML |
-
-### Dataset Training Usage
-
-```bash
-# Train with local dataset uploaded via OSMO
-training/il/scripts/submit-osmo-lerobot-training.sh \
-  -w workflows/osmo/lerobot-train-dataset.yaml \
-  -d user/fallback-dataset \
-  --dataset-bucket my-bucket \
-  --dataset-name my-lerobot-data
-```
-
-## 🔬 LeRobot Inference Workflow (`lerobot-infer.yaml`)
-
-Evaluates trained LeRobot policies from HuggingFace Hub repositories. Downloads the policy checkpoint, runs evaluation, and optionally registers the model to Azure ML.
-
-### Inference Features
-
-* Policy download from HuggingFace Hub
-* Model artifact extraction and validation
-* Optional Azure ML model registration
-* ACT and Diffusion policy support
-
-### Inference Parameters
-
-| Parameter | Default | Description |
-|-------------------|------------|-----------------------------------------|
-| `policy_repo_id` | (required) | HuggingFace policy repository |
-| `policy_type` | `act` | Policy architecture: `act`, `diffusion` |
-| `eval_episodes` | `10` | Number of evaluation episodes |
-| `eval_batch_size` | `10` | Evaluation batch size |
-| `register_model` | (none) | Model name for Azure ML registration |
-| `record_video` | `false` | Record evaluation videos |
-
-### Inference Usage
-
-```bash
-# Evaluate a trained policy
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
-  --policy-repo-id user/trained-act-policy
-
-# Evaluate with Azure ML model registration
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
-  --policy-repo-id user/trained-act-policy \
-  -r my-evaluated-model
-
-# Diffusion policy with more episodes
-evaluation/sil/scripts/submit-osmo-lerobot-eval.sh \
-  --policy-repo-id user/diffusion-policy \
-  -p diffusion \
-  --eval-episodes 50
-```
-
-## ⚙️ Environment Variables
-
-| Variable | Description |
-|-------------------------|-----------------------------------------|
-| `AZURE_SUBSCRIPTION_ID` | Azure subscription ID |
-| `AZURE_RESOURCE_GROUP` | Resource group name |
-| `WORKFLOW_TEMPLATE` | Path to workflow template |
-| `OSMO_CONFIG_DIR` | OSMO configuration directory |
-| `OSMO_DATASET_BUCKET` | Dataset bucket name (default: training) |
-| `OSMO_DATASET_NAME` | Dataset name (default: training-code) |
-
-## 📋 Prerequisites
-
-1. OSMO control plane deployed (`03-deploy-osmo-control-plane.sh`)
-2. OSMO backend operator installed (`04-deploy-osmo-backend.sh`)
-3. Storage configured for checkpoints
-4. OSMO CLI installed and authenticated (see [Accessing OSMO](#-accessing-osmo))
-
-## 🔌 Accessing OSMO
-
-OSMO services are deployed to the `osmo-control-plane` namespace. Access method depends on your network configuration.
-
-### Via VPN (Default Private Cluster)
-
-When connected to VPN, OSMO is accessible via the internal load balancer:
-
-| Service | URL |
-|--------------|-----------------------|
-| UI Dashboard | `http://10.0.5.7` |
-| API Service | `http://10.0.5.7/api` |
-
-```bash
-osmo login http://10.0.5.7 --method=dev --username=testuser
-osmo info
-```
-
-> [!NOTE]
-> Verify the internal load balancer IP with: `kubectl get svc -n azureml azureml-nginx-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}'`
-
-### Via Port-Forward (Public Cluster without VPN)
-
-If `should_enable_private_aks_cluster = false` and not using VPN:
-
-| Service | Port-Forward Command | Local URL |
-|--------------|-----------------------------------------------------------------------|-------------------------|
-| UI Dashboard | `kubectl port-forward svc/osmo-ui 3000:80 -n osmo-control-plane` | `http://localhost:3000` |
-| API Service | `kubectl port-forward svc/osmo-service 9000:80 -n osmo-control-plane` | `http://localhost:9000` |
-| Router | `kubectl port-forward svc/osmo-router 8080:80 -n osmo-control-plane` | `http://localhost:8080` |
-
-```bash
-# Start port-forward in background (or separate terminal)
-kubectl port-forward svc/osmo-service 9000:80 -n osmo-control-plane &
-
-# Login to OSMO (dev mode for local access)
-osmo login http://localhost:9000 --method=dev --username=testuser
-
-# Verify connection
-osmo info
-osmo backend list
-```
-
-> [!NOTE]
-> When accessing OSMO through port-forwarding, `osmo workflow exec` and `osmo workflow port-forward` commands are not supported. These require the router service to be accessible via ingress.
-
-## 📺 Monitoring
-
-Access the OSMO UI dashboard:
-
-* **VPN**: Open `http://10.0.5.7` in your browser
-* **Port-forward**: Run `kubectl port-forward svc/osmo-ui 3000:80 -n osmo-control-plane` then open `http://localhost:3000`
-
-
-*🤖 Crafted with precision by ✨Copilot following brilliant human instruction,
-then carefully refined by our team of discerning human reviewers.*
-
+* `training/rl/scripts/submit-osmo-training.sh`
+* `training/il/scripts/submit-osmo-dataset-training.sh`
+* `training/il/scripts/submit-osmo-lerobot-training.sh`
+* `evaluation/sil/scripts/submit-osmo-eval.sh`
+* `evaluation/sil/scripts/submit-osmo-lerobot-eval.sh`