fix(training): validate AzureML and OSMO RL submissions end to end by fbeltrao · Pull Request #372 · microsoft/physical-ai-toolchain

fbeltrao · 2026-03-30T13:24:01Z

Description

As reported in #320, RL training is not working in Azure ML. To test it, one could ask an AI assistant to run tests and validate metrics, checkpoints, and uploaded files. This PR proposes an end-to-end suite that does so automatically. Treat it as a conversation starter.

This PR includes the fix to RL training job submission in AML and end-to-end tests for Azure ML and OSMO. The end-to-end tests check that the job completes successfully, that metrics are written, and that the right code is uploaded to AML. They use command-line tools (az, terraform, osmo) to validate the work.
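As a rough illustration of the "wait for the job, then validate" flow, here is a minimal polling sketch. It is not the code from this PR: the function names, the failure status names (`Failed`, `Canceled`), and the default timeouts are assumptions chosen to mirror the 30-minute / 30-second polling seen in the test log. The `az ml job show` call assumes the Azure CLI `ml` extension is installed and authenticated.

```python
import subprocess
import time
from typing import Callable, Iterable


def az_job_status(job_name: str, resource_group: str, workspace: str) -> str:
    """Read a job's status through the Azure CLI (needs the `ml` extension)."""
    result = subprocess.run(
        ["az", "ml", "job", "show",
         "--name", job_name,
         "--resource-group", resource_group,
         "--workspace-name", workspace,
         "--query", "status", "--output", "tsv"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_status(
    get_status: Callable[[], str],
    targets: Iterable[str],
    failures: Iterable[str] = ("Failed", "Canceled"),  # assumed failure names
    timeout_s: float = 30 * 60,
    poll_s: float = 30,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until a target status is seen, a failure occurs, or time runs out."""
    targets, failures = set(targets), set(failures)
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in targets:
            return status
        if status in failures:
            raise RuntimeError(f"job reached failure status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(f"timed out; last status was {status!r}")
        sleep(poll_s)
```

Passing the status reader, `sleep`, and `clock` as parameters keeps the loop testable without touching Azure; a real test would pass something like `lambda: az_job_status(name, rg, ws)`.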

This is the log of running an end-to-end test:

tests/e2e/test_e2e_training.py::test_aml_rl_training_e2e [e2e] [2026-03-30 13:12:15]: Starting AzureML RL e2e test
[e2e] [2026-03-30 13:12:15]: Submitting AzureML training job for task=Isaac-Velocity-Rough-Anymal-C-v0, num_envs=64, max_iterations=10, experiment=rl-training-e2e-aml-1774876335-50fe546b
[e2e] [2026-03-30 13:12:32]: Submitted AzureML job name=silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:12:32]: Waiting for AzureML job silly_glass_wnlrp7sl2z to start
[e2e] [2026-03-30 13:12:32]: Waiting for AzureML job silly_glass_wnlrp7sl2z to start for up to 15 minutes (poll every 30s)
[e2e] [2026-03-30 13:12:36]: Observed status=Preparing
[e2e] [2026-03-30 13:12:36]: Reached AzureML job silly_glass_wnlrp7sl2z to start with status=Preparing
[e2e] [2026-03-30 13:12:36]: Waiting for AzureML job silly_glass_wnlrp7sl2z to complete
[e2e] [2026-03-30 13:12:36]: Waiting for AzureML job silly_glass_wnlrp7sl2z to complete for up to 30 minutes (poll every 30s)
[e2e] [2026-03-30 13:12:41]: Completion poll status=Queued
[e2e] [2026-03-30 13:15:03]: Completion poll status=Running
[e2e] [2026-03-30 13:16:48]: Completion poll status=Completed
[e2e] [2026-03-30 13:16:48]: Reached AzureML job silly_glass_wnlrp7sl2z to complete with status=Completed
[e2e] [2026-03-30 13:16:48]: AzureML job silly_glass_wnlrp7sl2z completed successfully
[e2e] [2026-03-30 13:16:48]: Validating AzureML uploaded code snapshot
[e2e] [2026-03-30 13:16:48]: Inspecting uploaded code snapshot for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:16:56]: Code snapshot validation passed for AzureML job silly_glass_wnlrp7sl2z; top-level entries=.amlignore, README.md, __init__.py, examples, il, packaging, pipelines, rl, setup, specifications, stream.py, tests, utils, vla
[e2e] [2026-03-30 13:16:56]: Validating AzureML MLflow tracking
Class DeploymentTemplateOperations: This is an experimental class, and may change at any time. Please see https://aka.ms/azuremlexperimental for more information.
[e2e] [2026-03-30 13:17:05]: AzureML MLflow tracking passed: metrics=[Learning / Learning rate=8.77914951989026e-05, Loss / Policy loss=-0.024657187331467868, Loss / Value loss=0.06572330687195063] params=[algorithm=PPO, distributed=False, num_envs=None]
[e2e] [2026-03-30 13:17:05]: Validating AzureML checkpoint output
[e2e] [2026-03-30 13:17:05]: Checking checkpoint output for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:17:09]: Checkpoint output is present for AzureML job silly_glass_wnlrp7sl2z
[e2e] [2026-03-30 13:17:09]: AzureML RL e2e test finished successfully
PASSED[e2e] [2026-03-30 13:17:09]: Skipping cancel for AzureML job silly_glass_wnlrp7sl2z; terminal status=Completed

tests/e2e/test_e2e_training.py::test_osmo_rl_training_e2e [e2e] [2026-03-30 13:17:12]: Starting OSMO RL e2e test
[e2e] [2026-03-30 13:17:12]: Submitting OSMO workflow for task=Isaac-Velocity-Rough-Anymal-C-v0, num_envs=64, max_iterations=10, experiment=isaaclab-Isaac-Velocity-Rough-Anymal-C-v0, correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83
[e2e] [2026-03-30 13:17:18]: Submitted OSMO workflow id=isaaclab-inline-training-29, name=isaaclab-inline-training-29
[e2e] [2026-03-30 13:17:18]: Waiting for OSMO workflow isaaclab-inline-training-29 to start
[e2e] [2026-03-30 13:17:18]: Waiting for OSMO workflow isaaclab-inline-training-29 to start for up to 15 minutes (poll every 30s)
[e2e] [2026-03-30 13:17:19]: Observed status=PENDING
[e2e] [2026-03-30 13:17:51]: Observed status=RUNNING
[e2e] [2026-03-30 13:17:51]: Reached OSMO workflow isaaclab-inline-training-29 to start with status=RUNNING
[e2e] [2026-03-30 13:17:51]: Waiting for OSMO workflow isaaclab-inline-training-29 to complete
[e2e] [2026-03-30 13:17:51]: Waiting for OSMO workflow isaaclab-inline-training-29 to complete for up to 30 minutes (poll every 30s)
[e2e] [2026-03-30 13:17:52]: Completion poll status=RUNNING
[e2e] [2026-03-30 13:21:37]: Completion poll status=COMPLETED
[e2e] [2026-03-30 13:21:37]: Reached OSMO workflow isaaclab-inline-training-29 to complete with status=COMPLETED
[e2e] [2026-03-30 13:21:37]: OSMO workflow isaaclab-inline-training-29 completed successfully
[e2e] [2026-03-30 13:21:37]: Validating OSMO MLflow tracking
[e2e] [2026-03-30 13:21:40]: Resolved MLflow run via exact correlation_id tag: correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83, run_id=2ebae539-c04b-44c6-bc65-98e0a7e1cdf0
[e2e] [2026-03-30 13:21:41]: OSMO MLflow tracking passed: run_id=2ebae539-c04b-44c6-bc65-98e0a7e1cdf0 metrics=[Learning / Learning rate=8.77914951989026e-05, Loss / Policy loss=-0.020312698557972908, Loss / Value loss=0.124498013779521] params=[algorithm=PPO, distributed=False, num_envs=None] tags=[correlation_id=osmo-rl-e2e-d2c0de15e7224d8ea403dd40546e3a83, training_orchestrator=osmo]
[e2e] [2026-03-30 13:21:41]: Validating OSMO workflow task success
[e2e] [2026-03-30 13:21:41]: Verified OSMO task isaac-training succeeded with exit_code=0 on pod=62f94435d0354c13-d9a5279319484a3c
[e2e] [2026-03-30 13:21:41]: OSMO RL e2e test finished successfully
PASSED[e2e] [2026-03-30 13:21:41]: Skipping cancel for OSMO workflow isaaclab-inline-training-29; terminal status=COMPLETED
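
The MLflow checks in the log above (resolving a run by its exact `correlation_id` tag, then confirming metrics were written) could be sketched roughly as follows. The helper names are hypothetical and not taken from this PR; only the tag name and metric names come from the log output.

```python
from typing import Iterable, Mapping


def correlation_filter(correlation_id: str) -> str:
    """Build an MLflow search filter that matches one exact tag value."""
    if "'" in correlation_id:
        # Keep the filter string well-formed; IDs in the log are plain hex/dashes.
        raise ValueError("correlation_id must not contain single quotes")
    return f"tags.correlation_id = '{correlation_id}'"


def missing_metrics(metrics: Mapping[str, float], required: Iterable[str]) -> list[str]:
    """Return the required metric names that the run did not log."""
    return [name for name in required if name not in metrics]
```

With MLflow installed, the filter string would typically be passed as `filter_string=` to `mlflow.search_runs(...)`, and the metrics mapping would come from the resolved run; an empty list from `missing_metrics` means the tracking check passes.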

Partially addresses #320.

Not included in this PR:

Broader coverage across multiple tasks, algorithms, or longer-running training scenarios
CI automation or scheduled environment-backed execution for the new e2e suite
Equivalent validation for non-RL training flows or other submission entry points

Type of Change

🐛 Bug fix (non-breaking change fixing an issue)
✨ New feature (non-breaking change adding functionality)
💥 Breaking change (fix or feature causing existing functionality to change)
📚 Documentation update
🏗️ Infrastructure change (Terraform/IaC)
♻️ Refactoring (no functional changes)

Component(s) Affected

infrastructure/terraform/prerequisites/ - Azure subscription setup
infrastructure/terraform/ - Terraform infrastructure
infrastructure/setup/ - OSMO control plane / Helm
workflows/ - Training and evaluation workflows
training/ - Training pipelines and scripts
docs/ - Documentation

Testing Performed

Terraform plan reviewed (no unexpected changes)
Terraform apply tested in dev environment
Training scripts tested locally with Isaac Sim
OSMO workflow submitted successfully
Smoke tests passed (smoke_test_azure.py)

Notes:

This PR adds environment-backed e2e coverage, but the commit does not by itself prove those workflows were executed successfully in a merge-ready environment.
The new tests intentionally skip when Azure CLI authentication, Azure ML workspace access, OSMO CLI access, Kubernetes context, or GPU nodes are unavailable.
uv run pytest --collect-only -m e2e tests/e2e/test_e2e_training.py succeeded in the current workspace.
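
The skip behaviour described in the notes above can be approximated with a small prerequisite check. This is a sketch under assumptions, not the fixture code from `tests/e2e/conftest.py`: the helper name `skip_reason` is hypothetical, and it only covers "is the CLI on PATH", not authentication, workspace access, or GPU node availability.

```python
import shutil
from typing import Callable, Iterable, Optional


def skip_reason(
    commands: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a human-readable skip reason, or None if every CLI tool is found."""
    missing = [cmd for cmd in commands if which(cmd) is None]
    if missing:
        return "missing CLI tools: " + ", ".join(missing)
    return None
```

Inside a pytest fixture, one might call `reason = skip_reason(["az", "terraform", "osmo"])` and then `pytest.skip(reason)` when it is not `None`, so the suite reports a skip instead of a failure on machines without the environment.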

Documentation Impact

No documentation changes needed
Documentation updated in this PR
Documentation issue filed

Bug Fix Checklist

Complete this section for bug fix PRs. Skip for other contribution types.

Linked to issue being fixed
Regression test included, OR
Justification for no regression test:

Checklist

My code follows the project conventions
Commit messages follow conventional commit format
I have performed a self-review
Documentation impact assessed above
No new linting warnings introduced

codecov-commenter · 2026-03-30T13:29:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.58%. Comparing base (dfb55cd) to head (65b09c0).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #372   +/-   ##
=======================================
  Coverage   43.58%   43.58%           
=======================================
  Files         242      242           
  Lines       14840    14840           
  Branches     1855     1855           
=======================================
  Hits         6468     6468           
  Misses       8082     8082           
  Partials      290      290

Flag	Coverage Δ	*Carryforward flag
pester	`79.87% <ø> (ø)`
pytest	`6.89% <ø> (ø)`	Carriedforward from ba66ed9
pytest-dataviewer	`61.98% <ø> (ø)`
vitest	`50.72% <ø> (ø)`

*This pull request uses carry forward flags.


katriendg

Thanks a lot @fbeltrao for discovering and filing the issue, and for your PR with the fix!
Two minor comments. Approving now, since you'll see them and can resolve the conversations before we merge.

I also noted that we have another similar issue, filed as #377 to track separately.

…ing workflows - add common utilities for command execution and JSON parsing - create MLflow tracking assertions for AzureML and OSMO - implement submission and validation functions for OSMO training - add fixtures for AzureML workspace and compute target setup - create end-to-end test cases for AzureML and OSMO training 🔍 - Generated by Copilot

🤖 I have created a release *beep* *boop*

---

## [0.6.0](v0.5.0...v0.6.0) (2026-04-08)

### ✨ Features

* **build:** add terraform-docs generation pipeline ([#378](#378)) ([78e90d0](78e90d0))
* **infrastructure:** enable optional AML diagnostic logs ([#400](#400)) ([58dd8db](58dd8db))
* **scripts:** consolidate scripts library paths and enhance dataviewer ([#383](#383)) ([176d9c9](176d9c9))

### 🐛 Bug Fixes

* **build:** remediate CVEs, enforce equality pinning, repair Dependabot config ([#391](#391)) ([0c29148](0c29148))
* **infrastructure:** add Storage File Data Privileged Contributor role for ML identity ([#380](#380)) ([378f7ed](378f7ed))
* **infrastructure:** replace hardcoded NAT Gateway availability zones with variable ([#356](#356)) ([a1397bd](a1397bd))
* **infrastructure:** resolve TFLint violations and enable hard-fail ([#376](#376)) ([dfb55cd](dfb55cd))
* **scripts:** add dot-source guard to Invoke-MsDateFreshnessCheck.ps1 ([#397](#397)) ([f6f22c3](f6f22c3))
* **training:** validate AzureML and OSMO RL submissions end to end ([#372](#372)) ([49904d3](49904d3))

### 📚 Documentation

* **infrastructure:** add terraform-docs tooling and improve developer experience ([#365](#365)) ([a0fb03a](a0fb03a))
* **reference:** centralize workflow template docs and convert workflow READMEs to pointer index ([#379](#379)) ([68097e4](68097e4))

### 🔧 Miscellaneous

* **deps-dev:** bump the npm_and_yarn group across 1 directory with 2 updates ([#374](#374)) ([d848c8b](d848c8b))
* **deps-dev:** bump vite from 6.4.1 to 6.4.2 in /data-management/viewer/frontend in the npm_and_yarn group across 1 directory ([#395](#395)) ([6ec7f19](6ec7f19))
* **deps:** bump the github-actions group across 1 directory with 7 updates ([#370](#370)) ([4d1b951](4d1b951))
* **deps:** bump the uv group across 2 directories with 1 update ([#373](#373)) ([ba66ed9](ba66ed9))

### 🔒 Security

* **deps-dev:** bump brace-expansion from 1.1.12 to 1.1.13 in /docs/docusaurus in the npm_and_yarn group across 1 directory ([#389](#389)) ([27129d9](27129d9))
* **deps-dev:** bump the npm_and_yarn group across 2 directories with 2 updates ([#363](#363)) ([aeae624](aeae624))
* **deps-dev:** bump the python-dependencies group with 5 updates ([#403](#403)) ([bb85560](bb85560))
* **deps:** bump cryptography from 46.0.5 to 46.0.6 in /training/rl ([#367](#367)) ([a82dd68](a82dd68))
* **deps:** bump the inference-dependencies group in /evaluation with 2 updates ([#401](#401)) ([c88d253](c88d253))
* **deps:** bump the pip group across 4 directories with 2 updates ([#411](#411)) ([1230fe0](1230fe0))
* **deps:** bump the training-dependencies group across 1 directory with 67 updates ([#375](#375)) ([8e05172](8e05172))
* **deps:** bump the uv group across 2 directories with 1 update ([#382](#382)) ([b6c7aea](b6c7aea))
* **deps:** update marshmallow requirement from <4.3.0,>=3.5 to >=3.5,<4.4.0 in /evaluation in the inference-dependencies group ([#393](#393)) ([599c7eb](599c7eb))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

…ference (#387)

## Summary

Fixes the SIL AzureML validation submission script, which was completely broken: any attempt to submit a validation job would fail with exit code 127 (`No such file or directory`). This applies the same class of fixes from PR #372 (RL training) to the evaluation submission script.

Closes #377

## Changes

### `evaluation/sil/scripts/submit-azureml-validation.sh`

| Bug | Before | After |
|---|---|---|
| Code snapshot scope | `code_path="$REPO_ROOT"` (entire repo uploaded) | `code_path="$REPO_ROOT/evaluation"` (scoped to evaluation/) |
| Command path | `bash training/scripts/validate.sh` (does not exist) | `bash evaluation/sil/validate.sh` with symlink shim |
| Job template path | `workflows/azureml/validate.yaml` (wrong location) | `evaluation/sil/workflows/azureml/validate.yaml` |
| Directory validation | Checks for `training/` directory | Checks for `sil/` directory |

The symlink shim (`if [ ! -e evaluation ]; then ln -s . evaluation; fi`) mirrors the pattern from #372: it ensures that `validate.sh` relative path resolution (`TRAINING_DIR`, `SRC_DIR`, `PYTHONPATH`) and the `python -m evaluation.sil.policy_evaluation` import both work correctly inside the AML job container.

### `evaluation/.amlignore` (new)

Added `.amlignore` matching the existing `training/.amlignore` pattern to exclude bytecode, caches, virtual environments, and IDE files from the AML code snapshot.

---------

Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: Chris Montazer <17170709+rezatnoMsirhC@users.noreply.github.com>
Co-authored-by: Bill Berry <WilliamBerryiii@users.noreply.github.com>
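
To see why the symlink shim quoted in the commit message above (`ln -s . evaluation`) makes a path such as `evaluation/sil/validate.sh` resolve when only the contents of `evaluation/` were uploaded, here is a small self-contained demonstration. It is an illustrative sketch, not code from the repository: the function names are hypothetical and the directory layout is reduced to a single placeholder file. It assumes a POSIX-like system where creating symlinks is permitted.

```python
import os
import tempfile
from pathlib import Path


def ensure_self_symlink(snapshot_root: Path, name: str) -> None:
    """Python equivalent of: if [ ! -e NAME ]; then ln -s . NAME; fi"""
    link = snapshot_root / name
    if not link.exists():
        # The link points at the snapshot root itself, so NAME/x resolves to ./x.
        os.symlink(".", link)


def demo() -> bool:
    """Build a fake snapshot root and check the prefixed path resolves."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # The snapshot holds the *contents* of evaluation/, so sil/ is top-level.
        (root / "sil").mkdir()
        (root / "sil" / "validate.sh").write_text("#!/bin/bash\n")
        ensure_self_symlink(root, "evaluation")
        # Calling it again is a no-op because the link already exists.
        ensure_self_symlink(root, "evaluation")
        return (root / "evaluation" / "sil" / "validate.sh").is_file()
```

Calling `demo()` returns `True` once the shim is in place; without the `ensure_self_symlink` call, the `evaluation/`-prefixed path would not exist in the snapshot root.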

fbeltrao requested a review from a team as a code owner March 30, 2026 13:24

fbeltrao force-pushed the fix/320-azure-rl-submission branch from 1214aec to 1e00b61 Compare March 30, 2026 13:28

github-advanced-security AI found potential problems Mar 30, 2026

Comment thread tests/e2e/conftest.py Fixed

Comment thread tests/e2e/conftest.py Fixed

Comment thread tests/e2e/conftest.py Fixed

katriendg mentioned this pull request Mar 31, 2026

fix(evaluation): SIL AzureML validation submission has broken code path and script reference #377

Closed

katriendg approved these changes Mar 31, 2026

Comment thread tests/e2e/test_e2e_training.py

Comment thread training/rl/scripts/submit-azureml-training.sh

fbeltrao added 4 commits March 31, 2026 13:31

refactor(tests): update import statements for consistency in test files

0aacc03

🔧 - Generated by Copilot

refactor(training): remove training_orchestrator tag from MLflow trac…

6be1465

…king 🔧 - Generated by Copilot

feat(scripts): update Azure ML submission script to create training s…

d0ddd5f

…ymlink - add logic to recreate top-level training path for job container - ensure compatibility with Python imports inside the job container 🔧 - Generated by Copilot

fbeltrao force-pushed the fix/320-azure-rl-submission branch from 9cca371 to d0ddd5f Compare March 31, 2026 14:25

rezatnoMsirhC approved these changes Mar 31, 2026

Merge branch 'main' into fix/320-azure-rl-submission

65b09c0

katriendg merged commit 49904d3 into microsoft:main Apr 1, 2026
28 checks passed

physical-ai-toolchain-release Bot mentioned this pull request Mar 31, 2026

chore(main): release 0.6.0 #364

Merged

kgmwang1 mentioned this pull request Apr 3, 2026

fix(evaluation): scope SIL AzureML validation code path and script reference #387

Merged

katriendg merged 5 commits into microsoft:main from fbeltrao:fix/320-azure-rl-submission


6 participants
