
fix(training): validate AzureML and OSMO RL submissions end to end#372

Merged
katriendg merged 5 commits into
microsoft:mainfrom
fbeltrao:fix/320-azure-rl-submission
Apr 1, 2026

Conversation

@fbeltrao
Contributor

Description

As reported in #320, RL training is not working in Azure ML. One way to test it is to ask an AI assistant to run tests and validate metrics, checkpoints, and uploaded files. This PR proposes an end-to-end suite that does this automatically. Treat it as a conversation starter.

This PR includes the fix for RL training job submission in AML and end-to-end tests for Azure ML and OSMO. The end-to-end tests check that the job completes successfully, that metrics are written, and that the right code is uploaded to AML. They use command-line tools (az, terraform, osmo) to validate the work.
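
For readers who have not opened the test code, the wait-for-completion step that appears in the log below (poll every 30s for up to 30 minutes) could be sketched roughly as follows. This is not the code from this PR: `wait_for_completion`, `get_status`, and the set of failure states are hypothetical names and assumptions used only for illustration. Only the "Completed"/"COMPLETED" status, the 30-second interval, and the 30-minute limit come from the log.

```python
import time
from typing import Callable

# "Completed" (AzureML) and "COMPLETED" (OSMO) appear in the PR log.
# The failure states below are assumptions for this sketch.
TERMINAL_OK = {"completed"}
TERMINAL_BAD = {"failed", "canceled", "cancelled"}


def wait_for_completion(
    get_status: Callable[[], str],
    timeout_s: float = 30 * 60,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a terminal state is reached or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        normalized = status.lower()
        if normalized in TERMINAL_OK:
            return status
        if normalized in TERMINAL_BAD:
            raise RuntimeError(f"job ended with status={status}")
        if clock() >= deadline:
            raise TimeoutError(f"job still {status} after {timeout_s}s")
        sleep(poll_s)
```

Injecting `sleep` and `clock` keeps the helper testable without waiting in real time; in an actual test, `get_status` would wrap whichever CLI call reports the job state.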

This is the log of running an end-to-end test:

tests/e2e/test_e2e_training.py::test_aml_rl_training_e2e [e2e] [2026-03-30 13:12:15]: Starting AzureML RL e2e test
[e2e] [2026-03-30 13:12:15]: Submitting AzureML training job for task=Isaac-Velocity-Rough-Anymal-C-v0, num_envs=64, max_iterations=10, experiment=rl-training-e2e-aml-1774876335-50fe546b
[e2e] [2026-03-30 13:12:32]: Submitted AzureML job name=silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:12:32]: Waiting for AzureML job silly_glass_wnlrp7sl2z to start
[e2e] [2026-03-30 13:12:32]: Waiting for AzureML job silly_glass_wnlrp7sl2z to start for up to 15 minutes (poll every 30s)
[e2e] [2026-03-30 13:12:36]: Observed status=Preparing
[e2e] [2026-03-30 13:12:36]: Reached AzureML job silly_glass_wnlrp7sl2z to start with status=Preparing
[e2e] [2026-03-30 13:12:36]: Waiting for AzureML job silly_glass_wnlrp7sl2z to complete
[e2e] [2026-03-30 13:12:36]: Waiting for AzureML job silly_glass_wnlrp7sl2z to complete for up to 30 minutes (poll every 30s)
[e2e] [2026-03-30 13:12:41]: Completion poll status=Queued
[e2e] [2026-03-30 13:15:03]: Completion poll status=Running
[e2e] [2026-03-30 13:16:48]: Completion poll status=Completed
[e2e] [2026-03-30 13:16:48]: Reached AzureML job silly_glass_wnlrp7sl2z to complete with status=Completed
[e2e] [2026-03-30 13:16:48]: AzureML job silly_glass_wnlrp7sl2z completed successfully
[e2e] [2026-03-30 13:16:48]: Validating AzureML uploaded code snapshot
[e2e] [2026-03-30 13:16:48]: Inspecting uploaded code snapshot for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:16:56]: Code snapshot validation passed for AzureML job silly_glass_wnlrp7sl2z; top-level entries=.amlignore, README.md, __init__.py, examples, il, packaging, pipelines, rl, setup, specifications, stream.py, tests, utils, vla
[e2e] [2026-03-30 13:16:56]: Validating AzureML MLflow tracking
Class DeploymentTemplateOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
[e2e] [2026-03-30 13:17:05]: AzureML MLflow tracking passed: metrics=[Learning / Learning rate=8.77914951989026e-05, Loss / Policy loss=-0.024657187331467868, Loss / Value loss=0.06572330687195063] params=[algorithm=PPO, distributed=False, num_envs=None]
[e2e] [2026-03-30 13:17:05]: Validating AzureML checkpoint output
[e2e] [2026-03-30 13:17:05]: Checking checkpoint output for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:17:09]: Checkpoint output is present for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:17:09]: AzureML RL e2e test finished successfully
PASSED[e2e] [2026-03-30 13:17:09]: Skipping cancel for AzureML job silly_glass_wnlrp7sl2z; terminal status=Completed

tests/e2e/test_e2e_training.py::test_osmo_rl_training_e2e [e2e] [2026-03-30 13:17:12]: Starting OSMO RL e2e test
[e2e] [2026-03-30 13:17:12]: Submitting OSMO workflow for task=Isaac-Velocity-Rough-Anymal-C-v0, num_envs=64, max_iterations=10, experiment=isaaclab-Isaac-Velocity-Rough-Anymal-C-v0, correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83
[e2e] [2026-03-30 13:17:18]: Submitted OSMO workflow id=isaaclab-inline-training-29, name=isaaclab-inline-training-29
[e2e] [2026-03-30 13:17:18]: Waiting for OSMO workflow isaaclab-inline-training-29 to start
[e2e] [2026-03-30 13:17:18]: Waiting for OSMO workflow isaaclab-inline-training-29 to start for up to 15 minutes (poll every 30s)
[e2e] [2026-03-30 13:17:19]: Observed status=PENDING
[e2e] [2026-03-30 13:17:51]: Observed status=RUNNING
[e2e] [2026-03-30 13:17:51]: Reached OSMO workflow isaaclab-inline-training-29 to start with status=RUNNING
[e2e] [2026-03-30 13:17:51]: Waiting for OSMO workflow isaaclab-inline-training-29 to complete
[e2e] [2026-03-30 13:17:51]: Waiting for OSMO workflow isaaclab-inline-training-29 to complete for up to 30 minutes (poll every 30s)
[e2e] [2026-03-30 13:17:52]: Completion poll status=RUNNING
[e2e] [2026-03-30 13:21:37]: Completion poll status=COMPLETED
[e2e] [2026-03-30 13:21:37]: Reached OSMO workflow isaaclab-inline-training-29 to complete with status=COMPLETED
[e2e] [2026-03-30 13:21:37]: OSMO workflow isaaclab-inline-training-29 completed successfully
[e2e] [2026-03-30 13:21:37]: Validating OSMO MLflow tracking
[e2e] [2026-03-30 13:21:40]: Resolved MLflow run via exact correlation_id tag: correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83, run_id=2ebae539-c04b-44c6-bc65-98e0a7e1cdf0
[e2e] [2026-03-30 13:21:41]: OSMO MLflow tracking passed: run_id=2ebae539-c04b-44c6-bc65-98e0a7e1cdf0 metrics=[Learning / Learning rate=8.77914951989026e-05, Loss / Policy loss=-0.020312698557972908, Loss / Value loss=0.124498013779521] params=[algorithm=PPO, distributed=False, num_envs=None] tags=[correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83, training_orchestrator=osmo]
[e2e] [2026-03-30 13:21:41]: Validating OSMO workflow task success
[e2e] [2026-03-30 13:21:41]: Verified OSMO task isaac-training succeeded with exit_code=0 on pod=62f94435d0354c13-d9a5279319484a3c
[e2e] [2026-03-30 13:21:41]: OSMO RL e2e test finished successfully
PASSED[e2e] [2026-03-30 13:21:41]: Skipping cancel for OSMO workflow isaaclab-inline-training-29; terminal status=COMPLETED
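
The OSMO log above resolves the MLflow run "via exact correlation_id tag". A minimal sketch of that lookup, assuming the run is tagged `correlation_id` as the log shows, might look like the following. `correlation_filter` and `resolve_run` are hypothetical helpers, not functions from this PR. `search_runs` is expected to behave like `mlflow.tracking.MlflowClient.search_runs`, which accepts `experiment_ids` and `filter_string`.

```python
from typing import Any, Callable, Sequence


def correlation_filter(correlation_id: str) -> str:
    """Build an MLflow search filter that matches one exact tag value."""
    if "'" in correlation_id:
        # Keep the sketch simple: refuse values that would need escaping.
        raise ValueError("correlation_id must not contain single quotes")
    return f"tags.correlation_id = '{correlation_id}'"


def resolve_run(
    search_runs: Callable[..., Sequence[Any]],
    experiment_ids: Sequence[str],
    correlation_id: str,
) -> Any:
    """Return the single run tagged with correlation_id, or raise LookupError."""
    runs = search_runs(
        experiment_ids=list(experiment_ids),
        filter_string=correlation_filter(correlation_id),
    )
    if len(runs) != 1:
        raise LookupError(
            f"expected exactly 1 run for correlation_id={correlation_id}, "
            f"found {len(runs)}"
        )
    return runs[0]
```

Requiring exactly one match is a design choice for this sketch: a unique correlation ID per submission means zero or multiple hits both indicate a problem worth failing on.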

Partially addresses #320.

Not included in this PR:

  • Broader coverage across multiple tasks, algorithms, or longer-running training scenarios
  • CI automation or scheduled environment-backed execution for the new e2e suite
  • Equivalent validation for non-RL training flows or other submission entry points

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Notes:

  • This PR adds environment-backed e2e coverage, but the commit does not by itself prove those workflows were executed successfully in a merge-ready environment.
  • The new tests intentionally skip when Azure CLI authentication, Azure ML workspace access, OSMO CLI access, Kubernetes context, or GPU nodes are unavailable.
  • uv run pytest --collect-only -m e2e tests/e2e/test_e2e_training.py succeeded in the current workspace.
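
As an illustration of the skip behavior noted above, a prerequisite check could be sketched like this. It covers only the presence of the CLI tools named in the description (az, terraform, osmo); authentication, workspace access, Kubernetes context, and GPU availability would need their own checks. `skip_reason` and `REQUIRED_TOOLS` are hypothetical names, not taken from this PR.

```python
import shutil
from typing import Callable, Iterable, Optional

# CLIs named in the PR description. Treating all three as hard
# prerequisites for every test is an assumption of this sketch.
REQUIRED_TOOLS = ("az", "terraform", "osmo")


def skip_reason(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a human-readable skip reason, or None if every tool is on PATH."""
    missing = [tool for tool in tools if which(tool) is None]
    if missing:
        return "missing CLI tools: " + ", ".join(missing)
    return None
```

A pytest fixture could call `skip_reason()` and pass any non-`None` result to `pytest.skip(...)`, so the suite reports a skip rather than a failure on machines without the environment.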

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test:

Checklist

@fbeltrao fbeltrao requested a review from a team as a code owner March 30, 2026 13:24
@fbeltrao fbeltrao force-pushed the fix/320-azure-rl-submission branch from 1214aec to 1e00b61 Compare March 30, 2026 13:28
@codecov-commenter

codecov-commenter commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.58%. Comparing base (dfb55cd) to head (65b09c0).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #372   +/-   ##
=======================================
  Coverage   43.58%   43.58%           
=======================================
  Files         242      242           
  Lines       14840    14840           
  Branches     1855     1855           
=======================================
  Hits         6468     6468           
  Misses       8082     8082           
  Partials      290      290           
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| pester | 79.87% <ø> (ø) | |
| pytest | 6.89% <ø> (ø) | Carriedforward from ba66ed9 |
| pytest-dataviewer | 61.98% <ø> (ø) | |
| vitest | 50.72% <ø> (ø) | |

*This pull request uses carry forward flags. Click here to find out more.


Comment thread tests/e2e/conftest.py Fixed
Comment thread tests/e2e/conftest.py Fixed
Comment thread tests/e2e/conftest.py Fixed
Collaborator

@katriendg katriendg left a comment


Thanks a lot @fbeltrao for discovering, filing the issue and your PR with the fix!
Two minor comments. Approving as you'll see them & resolve the conversation before we merge.

I also confirmed that we have another similar issue, filed as #377 to track separately.

Comment thread tests/e2e/test_e2e_training.py
Comment thread training/rl/scripts/submit-azureml-training.sh
…ing workflows

- add common utilities for command execution and JSON parsing
- create MLflow tracking assertions for AzureML and OSMO
- implement submission and validation functions for OSMO training
- add fixtures for AzureML workspace and compute target setup
- create end-to-end test cases for AzureML and OSMO training

🔍 - Generated by Copilot
…ymlink

- add logic to recreate top-level training path for job container
- ensure compatibility with Python imports inside the job container

🔧 - Generated by Copilot
@fbeltrao fbeltrao force-pushed the fix/320-azure-rl-submission branch from 9cca371 to d0ddd5f Compare March 31, 2026 14:25
@katriendg katriendg merged commit 49904d3 into microsoft:main Apr 1, 2026
28 checks passed
WilliamBerryiii pushed a commit that referenced this pull request Apr 8, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.6.0](v0.5.0...v0.6.0)
(2026-04-08)


### ✨ Features

* **build:** add terraform-docs generation pipeline
([#378](#378))
([78e90d0](78e90d0))
* **infrastructure:** enable optional AML diagnostic logs
([#400](#400))
([58dd8db](58dd8db))
* **scripts:** consolidate scripts library paths and enhance dataviewer
([#383](#383))
([176d9c9](176d9c9))


### 🐛 Bug Fixes

* **build:** remediate CVEs, enforce equality pinning, repair Dependabot
config
([#391](#391))
([0c29148](0c29148))
* **infrastructure:** add Storage File Data Privileged Contributor role
for ML identity
([#380](#380))
([378f7ed](378f7ed))
* **infrastructure:** replace hardcoded NAT Gateway availability zones
with variable
([#356](#356))
([a1397bd](a1397bd))
* **infrastructure:** resolve TFLint violations and enable hard-fail
([#376](#376))
([dfb55cd](dfb55cd))
* **scripts:** add dot-source guard to Invoke-MsDateFreshnessCheck.ps1
([#397](#397))
([f6f22c3](f6f22c3))
* **training:** validate AzureML and OSMO RL submissions end to end
([#372](#372))
([49904d3](49904d3))


### 📚 Documentation

* **infrastructure:** add terraform-docs tooling and improve developer
experience
([#365](#365))
([a0fb03a](a0fb03a))
* **reference:** centralize workflow template docs and convert workflow
READMEs to pointer index
([#379](#379))
([68097e4](68097e4))


### 🔧 Miscellaneous

* **deps-dev:** bump the npm_and_yarn group across 1 directory with 2
updates
([#374](#374))
([d848c8b](d848c8b))
* **deps-dev:** bump vite from 6.4.1 to 6.4.2 in
/data-management/viewer/frontend in the npm_and_yarn group across 1
directory
([#395](#395))
([6ec7f19](6ec7f19))
* **deps:** bump the github-actions group across 1 directory with 7
updates
([#370](#370))
([4d1b951](4d1b951))
* **deps:** bump the uv group across 2 directories with 1 update
([#373](#373))
([ba66ed9](ba66ed9))


### 🔒 Security

* **deps-dev:** bump brace-expansion from 1.1.12 to 1.1.13 in
/docs/docusaurus in the npm_and_yarn group across 1 directory
([#389](#389))
([27129d9](27129d9))
* **deps-dev:** bump the npm_and_yarn group across 2 directories with 2
updates
([#363](#363))
([aeae624](aeae624))
* **deps-dev:** bump the python-dependencies group with 5 updates
([#403](#403))
([bb85560](bb85560))
* **deps:** bump cryptography from 46.0.5 to 46.0.6 in /training/rl
([#367](#367))
([a82dd68](a82dd68))
* **deps:** bump the inference-dependencies group in /evaluation with 2
updates
([#401](#401))
([c88d253](c88d253))
* **deps:** bump the pip group across 4 directories with 2 updates
([#411](#411))
([1230fe0](1230fe0))
* **deps:** bump the training-dependencies group across 1 directory with
67 updates
([#375](#375))
([8e05172](8e05172))
* **deps:** bump the uv group across 2 directories with 1 update
([#382](#382))
([b6c7aea](b6c7aea))
* **deps:** update marshmallow requirement from <4.3.0,>=3.5 to >=3.5,<4.4.0 in /evaluation in the inference-dependencies group
([#393](#393))
([599c7eb](599c7eb))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
WilliamBerryiii added a commit that referenced this pull request Apr 15, 2026
…ference (#387)

## Summary

Fixes the SIL AzureML validation submission script which was completely
broken — any attempt to submit a validation job would fail with exit
code 127 (`No such file or directory`).

This applies the same class of fixes from PR #372 (RL training) to the
evaluation submission script.

Closes #377

## Changes

### `evaluation/sil/scripts/submit-azureml-validation.sh`

| Bug | Before | After |
|---|---|---|
| Code snapshot scope | `code_path="$REPO_ROOT"` (entire repo uploaded) | `code_path="$REPO_ROOT/evaluation"` (scoped to evaluation/) |
| Command path | `bash training/scripts/validate.sh` (does not exist) | `bash evaluation/sil/validate.sh` with symlink shim |
| Job template path | `workflows/azureml/validate.yaml` (wrong location) | `evaluation/sil/workflows/azureml/validate.yaml` |
| Directory validation | Checks for `training/` directory | Checks for `sil/` directory |

The symlink shim (`if [ ! -e evaluation ]; then ln -s . evaluation; fi`)
mirrors the pattern from #372 — it ensures `validate.sh` relative path
resolution (`TRAINING_DIR`, `SRC_DIR`, `PYTHONPATH`) and the `python -m
evaluation.sil.policy_evaluation` import both work correctly inside the
AML job container.
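
To show why the shim makes the prefixed path resolve, here is a small Python sketch that reproduces the effect of `ln -s . evaluation` in a temporary directory. It assumes a Linux-like filesystem where symlinks are permitted. `ensure_self_symlink` and `demo` are hypothetical names, and the placeholder `validate.sh` content is invented for the demonstration.

```python
import os
import tempfile
from pathlib import Path


def ensure_self_symlink(root: Path, name: str) -> Path:
    """Create root/name -> . unless something already exists at that path."""
    link = root / name
    # Path.exists() follows symlinks, like `[ -e ... ]` in the shell shim.
    if not link.exists():
        os.symlink(".", link, target_is_directory=True)
    return link


def demo() -> bool:
    """Check that evaluation/sil/validate.sh resolves once the shim is in place."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for the snapshot root, which holds the *contents* of evaluation/.
        root = Path(tmp)
        (root / "sil").mkdir()
        (root / "sil" / "validate.sh").write_text("#!/usr/bin/env bash\n")

        ensure_self_symlink(root, "evaluation")
        return (root / "evaluation" / "sil" / "validate.sh").is_file()
```

Because the code snapshot is scoped to `evaluation/`, the job container sees `sil/` at its root; the self-referencing link lets paths and imports that still start with `evaluation` resolve to the same files.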

### `evaluation/.amlignore` (new)

Added `.amlignore` matching the existing `training/.amlignore` pattern
to exclude bytecode, caches, virtual environments, and IDE files from
the AML code snapshot.

---------

Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: Chris Montazer <17170709+rezatnoMsirhC@users.noreply.github.com>
Co-authored-by: Bill Berry <WilliamBerryiii@users.noreply.github.com>

6 participants