fix(training): validate AzureML and OSMO RL submissions end to end #372
Merged
1214aec to 1e00b61
Codecov Report

✅ All modified and coverable lines are covered by tests.

Coverage Diff

| | main | #372 |
|---|---|---|
| Coverage | 43.58% | 43.58% |
| Files | 242 | 242 |
| Lines | 14840 | 14840 |
| Branches | 1855 | 1855 |
| Hits | 6468 | 6468 |
| Misses | 8082 | 8082 |
| Partials | 290 | 290 |

This pull request uses carry forward flags.
katriendg approved these changes on Mar 31, 2026
…ing workflows

- add common utilities for command execution and JSON parsing
- create MLflow tracking assertions for AzureML and OSMO
- implement submission and validation functions for OSMO training
- add fixtures for AzureML workspace and compute target setup
- create end-to-end test cases for AzureML and OSMO training

🔍 - Generated by Copilot
🔧 - Generated by Copilot
…king 🔧 - Generated by Copilot
…ymlink

- add logic to recreate top-level training path for job container
- ensure compatibility with Python imports inside the job container

🔧 - Generated by Copilot
9cca371 to d0ddd5f
rezatnoMsirhC approved these changes on Mar 31, 2026
WilliamBerryiii pushed a commit that referenced this pull request on Apr 8, 2026
🤖 I have created a release *beep* *boop*

---

## [0.6.0](v0.5.0...v0.6.0) (2026-04-08)

### ✨ Features

* **build:** add terraform-docs generation pipeline ([#378](#378)) ([78e90d0](78e90d0))
* **infrastructure:** enable optional AML diagnostic logs ([#400](#400)) ([58dd8db](58dd8db))
* **scripts:** consolidate scripts library paths and enhance dataviewer ([#383](#383)) ([176d9c9](176d9c9))

### 🐛 Bug Fixes

* **build:** remediate CVEs, enforce equality pinning, repair Dependabot config ([#391](#391)) ([0c29148](0c29148))
* **infrastructure:** add Storage File Data Privileged Contributor role for ML identity ([#380](#380)) ([378f7ed](378f7ed))
* **infrastructure:** replace hardcoded NAT Gateway availability zones with variable ([#356](#356)) ([a1397bd](a1397bd))
* **infrastructure:** resolve TFLint violations and enable hard-fail ([#376](#376)) ([dfb55cd](dfb55cd))
* **scripts:** add dot-source guard to Invoke-MsDateFreshnessCheck.ps1 ([#397](#397)) ([f6f22c3](f6f22c3))
* **training:** validate AzureML and OSMO RL submissions end to end ([#372](#372)) ([49904d3](49904d3))

### 📚 Documentation

* **infrastructure:** add terraform-docs tooling and improve developer experience ([#365](#365)) ([a0fb03a](a0fb03a))
* **reference:** centralize workflow template docs and convert workflow READMEs to pointer index ([#379](#379)) ([68097e4](68097e4))

### 🔧 Miscellaneous

* **deps-dev:** bump the npm_and_yarn group across 1 directory with 2 updates ([#374](#374)) ([d848c8b](d848c8b))
* **deps-dev:** bump vite from 6.4.1 to 6.4.2 in /data-management/viewer/frontend in the npm_and_yarn group across 1 directory ([#395](#395)) ([6ec7f19](6ec7f19))
* **deps:** bump the github-actions group across 1 directory with 7 updates ([#370](#370)) ([4d1b951](4d1b951))
* **deps:** bump the uv group across 2 directories with 1 update ([#373](#373)) ([ba66ed9](ba66ed9))

### 🔒 Security

* **deps-dev:** bump brace-expansion from 1.1.12 to 1.1.13 in /docs/docusaurus in the npm_and_yarn group across 1 directory ([#389](#389)) ([27129d9](27129d9))
* **deps-dev:** bump the npm_and_yarn group across 2 directories with 2 updates ([#363](#363)) ([aeae624](aeae624))
* **deps-dev:** bump the python-dependencies group with 5 updates ([#403](#403)) ([bb85560](bb85560))
* **deps:** bump cryptography from 46.0.5 to 46.0.6 in /training/rl ([#367](#367)) ([a82dd68](a82dd68))
* **deps:** bump the inference-dependencies group in /evaluation with 2 updates ([#401](#401)) ([c88d253](c88d253))
* **deps:** bump the pip group across 4 directories with 2 updates ([#411](#411)) ([1230fe0](1230fe0))
* **deps:** bump the training-dependencies group across 1 directory with 67 updates ([#375](#375)) ([8e05172](8e05172))
* **deps:** bump the uv group across 2 directories with 1 update ([#382](#382)) ([b6c7aea](b6c7aea))
* **deps:** update marshmallow requirement from <4.3.0,>=3.5 to >=3.5,<4.4.0 in /evaluation in the inference-dependencies group ([#393](#393)) ([599c7eb](599c7eb))

---

This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
WilliamBerryiii added a commit that referenced this pull request on Apr 15, 2026
…ference (#387)

## Summary

Fixes the SIL AzureML validation submission script, which was completely broken: any attempt to submit a validation job would fail with exit code 127 (`No such file or directory`). This applies the same class of fixes from PR #372 (RL training) to the evaluation submission script.

Closes #377

## Changes

### `evaluation/sil/scripts/submit-azureml-validation.sh`

| Bug | Before | After |
|---|---|---|
| Code snapshot scope | `code_path="$REPO_ROOT"` (entire repo uploaded) | `code_path="$REPO_ROOT/evaluation"` (scoped to evaluation/) |
| Command path | `bash training/scripts/validate.sh` (does not exist) | `bash evaluation/sil/validate.sh` with symlink shim |
| Job template path | `workflows/azureml/validate.yaml` (wrong location) | `evaluation/sil/workflows/azureml/validate.yaml` |
| Directory validation | Checks for `training/` directory | Checks for `sil/` directory |

The symlink shim (`if [ ! -e evaluation ]; then ln -s . evaluation; fi`) mirrors the pattern from #372. It ensures that `validate.sh` relative path resolution (`TRAINING_DIR`, `SRC_DIR`, `PYTHONPATH`) and the `python -m evaluation.sil.policy_evaluation` import both work correctly inside the AML job container.

### `evaluation/.amlignore` (new)

Added `.amlignore` matching the existing `training/.amlignore` pattern to exclude bytecode, caches, virtual environments, and IDE files from the AML code snapshot.

---------

Co-authored-by: GitHub Copilot <copilot@github.com>
Co-authored-by: Chris Montazer <17170709+rezatnoMsirhC@users.noreply.github.com>
Co-authored-by: Bill Berry <WilliamBerryiii@users.noreply.github.com>
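To illustrate why the symlink shim in the #387 commit message matters, here is a minimal sketch that reproduces the situation in a temporary directory. It is not the repository's code: the helper names (`run_module`, `build_snapshot`, `add_shim`) and the one-line stub module are hypothetical, and it assumes a POSIX filesystem with symlink support and a layout without `__init__.py` files (so Python treats the directories as namespace packages). When the snapshot is scoped to `evaluation/`, the `sil/` directory sits at the snapshot root, so the dotted name `evaluation.sil.policy_evaluation` only resolves once `evaluation` points back at the root.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_module(snapshot_root: Path, module: str) -> int:
    """Run `python -m <module>` with snapshot_root as the working directory.

    Returns the interpreter's exit code (0 means the module was importable and ran).
    """
    result = subprocess.run(
        [sys.executable, "-m", module],
        cwd=snapshot_root,
        capture_output=True,
        text=True,
    )
    return result.returncode


def build_snapshot(root: Path) -> None:
    """Mimic a code snapshot scoped to evaluation/: sil/ sits directly at the root.

    The module body is a placeholder stub, not the real policy evaluation code.
    """
    (root / "sil").mkdir()
    (root / "sil" / "policy_evaluation.py").write_text("print('ok')\n")


def add_shim(root: Path) -> None:
    """Python equivalent of: if [ ! -e evaluation ]; then ln -s . evaluation; fi"""
    link = root / "evaluation"
    if not link.exists():
        # Relative target ".", so evaluation/ resolves to the snapshot root itself.
        link.symlink_to(".", target_is_directory=True)
```

Calling `build_snapshot` on an empty directory and then `run_module(root, "evaluation.sil.policy_evaluation")` should fail before `add_shim(root)` is called and succeed afterwards, which is the behavior the commit message describes for the AML job container.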
Description
As reported in #320, RL training is not working in Azure ML. One way to test it is to ask an AI assistant to run the tests and validate metrics, checkpoints, and uploaded files. This PR proposes an end-to-end suite that does this automatically. Treat it as a conversation starter.
This PR includes the fix for RL training job submission in AML and end-to-end tests for Azure ML and OSMO. The end-to-end tests check that the job completes successfully, that metrics are written, and that the right code is uploaded to AML. They use command-line tools (az, terraform, osmo) to validate the work.
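As a rough sketch of the approach described above (driving command-line tools and asserting on their output), a test helper might look like the following. This is an illustration, not the code in this PR: `run_json` and `job_completed` are hypothetical names, and treating `"Completed"` as the success status of an Azure ML job is an assumption to verify against the actual CLI output.

```python
import json
import subprocess
import sys
from typing import Any, Mapping, Sequence


def run_json(cmd: Sequence[str], timeout: float = 60.0) -> Any:
    """Run a command and parse its stdout as JSON.

    Raises RuntimeError with the captured stderr if the command exits non-zero.
    """
    result = subprocess.run(
        list(cmd), capture_output=True, text=True, timeout=timeout
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)


def job_completed(job: Mapping[str, Any]) -> bool:
    """Return True if the job payload reports success.

    Assumption: the payload has a top-level "status" field whose success
    value is "Completed".
    """
    return job.get("status") == "Completed"


# Hypothetical usage inside an end-to-end test (requires a logged-in az CLI):
#   job = run_json(["az", "ml", "job", "show", "--name", job_name,
#                   "--resource-group", rg, "--workspace-name", ws,
#                   "--output", "json"])
#   assert job_completed(job)
```

Keeping command execution and JSON parsing in one helper means each assertion (job status, metrics, uploaded code) can stay a one-line check on the parsed payload.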
This is the log of running an end-to-end test:
Partially addresses #320.
Not included in this PR:
Type of Change
Component(s) Affected
- `infrastructure/terraform/prerequisites/` - Azure subscription setup
- `infrastructure/terraform/` - Terraform infrastructure
- `infrastructure/setup/` - OSMO control plane / Helm
- `workflows/` - Training and evaluation workflows
- `training/` - Training pipelines and scripts
- `docs/` - Documentation

Testing Performed

- `plan` reviewed (no unexpected changes)
- `apply` tested in dev environment
- … (`smoke_test_azure.py`)

Notes:

- `uv run pytest --collect-only -m e2e tests/e2e/test_e2e_training.py` succeeded in the current workspace.

Documentation Impact
Bug Fix Checklist
Complete this section for bug fix PRs. Skip for other contribution types.
Checklist