feat(infrastructure): upgrade OSMO to chart 1.2.1 / image 6.2 with secure auth and skrl 2.0.0 compatibility#492
Conversation
…en auth - upgrade chart 1.0.1→1.2.1, image 6.0.0→6.2 with defaultAdmin + scoped backend token - fix service_base_url to use in-cluster ingress URL for workflow sidecar connectivity - pin OSMO CLI to 6.2.10 with SHA256 verification in devcontainer - fix training dependency pins (marshmallow, numpy, pydantic-core) and use --no-deps install - fix empty --max_iterations causing workflow argparse failure 🚀 - Generated by Copilot
…ations - read admin password from Azure Key Vault - update service token generation logic - sync osmo-admin-password with Key Vault secrets - adjust storage connection string handling - update OSMO image tag to 6.2 🔒 - Generated by Copilot
- change agent update method from _update to update - modify logging wrapper to accept keyword arguments - update dependencies in requirements and pyproject files 🔧 - Generated by Copilot
Dependency ReviewThe following issues were found:
Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. License Issuestraining/rl/pyproject.toml
training/rl/requirements.txt
OpenSSF ScorecardScorecard details
Scanned Files
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #492 +/- ##
=======================================
Coverage 65.19% 65.19%
=======================================
Files 254 254
Lines 15804 15804
Branches 2118 2070 -48
=======================================
Hits 10303 10303
Misses 5212 5212
Partials 289 289
🚀 New features to boost your workflow:
|
bindsi
left a comment
There was a problem hiding this comment.
Review item RI-1: Stale docstring references and return type annotation mismatch after skrl 2.0.0 migration.
bindsi
left a comment
There was a problem hiding this comment.
RI-2: Unexplained positional argument in agent.act() call after skrl 2.0.0 migration.
bindsi
left a comment
There was a problem hiding this comment.
RI-3 through RI-5 (RI-4 dropped — postgresql.tf firewall rules are from the already-merged PR #477, not this PR).
RI-6 (PR description): The "OSMO CLI Pinning" section states:
Replaced unpinned
curl | bashinstallation with versioned download of OSMO CLI 6.2.10 verified via SHA256 checksum in devcontainer.json
But the devcontainer.json diff in this PR only shows a tflint install hardening (from the already-merged #465). The OSMO CLI version is pinned in defaults.conf via OSMO_CLI_VERSION, which is a different mechanism. Consider correcting the description to avoid confusion during future audits.
bindsi
left a comment
There was a problem hiding this comment.
Great work...the comments to optimise are more less low prio findings
|
|
…ic schema - replace manual table drops with a single command to drop and recreate the public schema - improve cleanup efficiency and reduce maintenance overhead 🔧 - Generated by Copilot
Thanks for the review @liupeirong - update the script: replaced the hand-maintained 25-table list with a |
- Update _load_skrl test to assert enable_training_mode(enabled=False, apply_to_models=True) - Source switched to skrl 2.0 API in #492; test on this branch was stale 🤖 Generated by Copilot
🤖 I have created a release *beep* *boop* --- ## [0.8.0](v0.7.4...v0.8.0) (2026-05-08) ### ⚠ BREAKING CHANGES * **dataviewer:** bump frontend stack to React 19, Vite 8, Tailwind v4, MSAL 5, ESLint 10 ([#524](#524)) ### ✨ Features * **agents:** add automated validation for high-risk Dependabot bumps ([#574](#574)) ([8c3686a](8c3686a)), closes [#573](#573) * **data:** add camera selector to annotation workspace and fix AV1 frame extraction ([#591](#591)) ([c809d2f](c809d2f)) * **data:** seed dataviewer frontend test foundation and per-section codecov flags ([#594](#594)) ([c06c4e3](c06c4e3)) * **dataviewer:** add OWASP security middleware stack ([#439](#439)) ([239edb9](239edb9)) * **infrastructure:** add conversion pipeline Terraform module ([#542](#542)) ([244531e](244531e)) * **infrastructure:** upgrade OSMO to chart 1.2.1 / image 6.2 with secure auth and skrl 2.0.0 compatibility ([#492](#492)) ([edfd7a5](edfd7a5)) * **pipeline:** add ACSA setup for ROS2 bag sync to Blob ([#451](#451)) ([c271a54](c271a54)) * **workflows:** add advisory Dependabot PR reviewer agentic workflow ([#498](#498)) ([d4bb140](d4bb140)) * **workflows:** trigger AW Dependabot PR reviewer after PR Validation ([#580](#580)) ([7ab3d16](7ab3d16)) ### 🐛 Bug Fixes * **ci:** correct stale version comment for actions/create-github-app-token ([#506](#506)) ([b2e9a54](b2e9a54)) * **ci:** restore data-pipeline and training broken tests by domain folder restructure ([#547](#547)) ([06d8472](06d8472)) * **docs:** update remaining stale 'Coming soon' labels in docs/README.md ([#507](#507)) ([02439d6](02439d6)) * **docs:** update stale coming soon label for Training section ([#472](#472)) ([46db49b](46db49b)) * **evaluation:** scope SIL AzureML validation code path and script reference ([#387](#387)) ([9f138a9](9f138a9)) * **infrastructure:** OSMO workflow execution, PostgreSQL public access, and quickstart corrections ([#477](#477)) ([9ed2da6](9ed2da6)) * **scripts:** exclude CHANGELOG.md from changed-files msdate check ([#644](#644)) ([8133bdc](8133bdc)) * **workflows:** allow dependabot[bot] to activate AW Dependabot PR Review ([#586](#586)) ([39dc022](39dc022)) * **workflows:** correct branches filter on AW Dependabot PR Review workflow_run trigger ([#584](#584)) ([fe06b52](fe06b52)) * **workflows:** normalize validate.yaml placeholder env/compute values ([#510](#510)) ([340ff44](340ff44)) * **workflows:** recompile aw-dependabot-pr-review lock file ([#576](#576)) ([d77c167](d77c167)) * **workflows:** switch AW Dependabot PR Review to pull_request_target ([#589](#589)) ([3f1edd1](3f1edd1)) ### 📚 Documentation * **docs:** Fix deployment guide links ([#614](#614)) ([0070b04](0070b04)) * document dependency-pinning-artifacts directory purpose ([#508](#508)) ([50e0010](50e0010)) ### 📦 Build System * **training:** standardize on Python 3.12 across manifests, containers, and runtime scripts ([#541](#541)) ([7ad014a](7ad014a)) ### 🔧 Operations * **build:** add Copilot cloud agent setup-steps workflow ([#593](#593)) ([c912668](c912668)) ### 🔧 Miscellaneous * **build:** exclude auto-generated CHANGELOG.md from cspell and seed dictionary ([#582](#582)) ([de1dd57](de1dd57)) * **build:** redesign codecov flags and split pytest CI per component ([#520](#520)) ([357e745](357e745)) * **dataviewer:** bump frontend stack to React 19, Vite 8, Tailwind v4, MSAL 5, ESLint 10 ([#524](#524)) ([50f8ad4](50f8ad4)) * **dataviewer:** repoint stale src/dataviewer references to data-management/viewer ([#504](#504)) ([88fa1b4](88fa1b4)), closes [#503](#503) * **deps-dev:** bump basic-ftp from 5.3.0 to 5.3.1 ([#618](#618)) ([ca10f2a](ca10f2a)) * **deps-dev:** bump globals from 15.15.0 to 17.5.0 in /data-management/viewer/frontend ([#527](#527)) ([0e0b2ae](0e0b2ae)) * **deps-dev:** bump ip-address from 10.1.0 to 10.2.0 ([#616](#616)) ([816c9cf](816c9cf)) * **deps-dev:** bump lint-staged from 16.4.0 to 17.0.2 in the root-npm-dependencies group across 1 directory ([#626](#626)) ([0e2f293](0e2f293)) * **deps-dev:** bump pydantic from 2.13.3 to 2.13.4 in the python-dependencies group across 1 directory ([#629](#629)) ([c24f1c1](c24f1c1)) * **deps-dev:** bump the python-dependencies group across 1 directory with 2 updates ([#514](#514)) ([8410f4b](8410f4b)) * **deps:** bump azure-core from 1.39.0 to 1.40.0 in /evaluation in the inference-dependencies group across 1 directory ([#597](#597)) ([6141db4](6141db4)) * **deps:** bump cryptography from 46.0.6 to 46.0.7 in /data-management/viewer ([#424](#424)) ([5fb6d58](5fb6d58)) * **deps:** bump cryptography from 46.0.6 to 46.0.7 in /data-management/viewer/backend ([#423](#423)) ([b516ad5](b516ad5)) * **deps:** bump lucide-react from 0.469.0 to 1.8.0 in /data-management/viewer/frontend ([#528](#528)) ([1bdfc1e](1bdfc1e)) * **deps:** bump nginx from `8aa63af` to `5616878` in /data-management/viewer/frontend ([#511](#511)) ([9e7e20e](9e7e20e)) * **deps:** bump nginx from 1.27-alpine to 1.29-alpine in /data-management/viewer/frontend ([#484](#484)) ([0e5c3dd](0e5c3dd)) * **deps:** bump node from `435f353` to `e49fd70` in /data-management/viewer/frontend ([#560](#560)) ([2884649](2884649)) * **deps:** bump react-is from 18.3.1 to 19.2.5 in /data-management/viewer/frontend ([#530](#530)) ([d51318c](d51318c)) * **deps:** bump tensordict from 0.11.0 to 0.12.1 in /evaluation in the inference-dependencies group across 1 directory ([#456](#456)) ([b24e733](b24e733)) * **deps:** bump the dataviewer-backend-dependencies group across 1 directory with 2 updates ([#531](#531)) ([171a1da](171a1da)) * **deps:** bump the dataviewer-backend-dependencies group across 1 directory with 5 updates ([#516](#516)) ([4f9a577](4f9a577)) * **deps:** bump the dataviewer-backend-dependencies group across 1 directory with 5 updates ([#602](#602)) ([6c27ab5](6c27ab5)) * **deps:** bump the dataviewer-dependencies group across 1 directory with 2 updates ([#529](#529)) ([8646971](8646971)) * **deps:** bump the dataviewer-dependencies group across 1 directory with 3 updates ([#601](#601)) ([d28fb50](d28fb50)) * **deps:** bump the dataviewer-dependencies group across 1 directory with 3 updates ([#632](#632)) ([4ca5f3e](4ca5f3e)) * **deps:** bump the dataviewer-dependencies group across 1 directory with 5 updates ([#515](#515)) ([109ee81](109ee81)) * **deps:** bump the dataviewer-frontend-patch-minor group across 1 directory with 6 updates ([#630](#630)) ([04d5dfd](04d5dfd)) * **deps:** bump the dataviewer-frontend-patch-minor group across 1 directory with 9 updates ([#563](#563)) ([c08f450](c08f450)) * **deps:** bump the docusaurus-dependencies group across 1 directory with 4 updates ([#627](#627)) ([f5825fc](f5825fc)) * **deps:** bump the docusaurus-dependencies group across 1 directory with 6 updates ([#599](#599)) ([b859344](b859344)) * **deps:** bump the github-actions group across 1 directory with 4 updates ([#459](#459)) ([2609c52](2609c52)) * **deps:** bump the github-actions group across 1 directory with 4 updates ([#517](#517)) ([f54bf5d](f54bf5d)) * **deps:** bump the inference-dependencies group across 1 directory with 11 updates ([#562](#562)) ([087f53a](087f53a)) * **deps:** bump the inference-dependencies group across 1 directory with 2 updates ([#628](#628)) ([4a3be47](4a3be47)) * **deps:** bump the pip group across 2 directories with 1 update ([#494](#494)) ([a14b6b0](a14b6b0)) * **docs:** update stale Python 3.11 references to 3.12 ([#575](#575)) ([6f85c95](6f85c95)) * **scripts:** remove redundant SC1091 disables in OSMO deploy scripts ([#509](#509)) ([ae1cb82](ae1cb82)) ### 🔒 Security * **build:** pin dependencies and hash-verify downloads ([#465](#465)) ([0289f49](0289f49)) * **build:** remediate dependency security advisories ([#479](#479)) ([7196d6d](7196d6d)) * **deps-dev:** bump basic-ftp from 5.2.1 to 5.2.2 ([#454](#454)) ([cb158f1](cb158f1)) * **deps-dev:** bump basic-ftp from 5.2.2 to 5.3.0 ([#495](#495)) ([e983b8b](e983b8b)) * **deps-dev:** bump hypothesis from 6.152.3 to 6.152.4 in the python-dependencies group ([#598](#598)) ([83384d2](83384d2)) * **deps-dev:** bump markdownlint-cli2 from 0.22.0 to 0.22.1 in the root-npm-dependencies group ([#559](#559)) ([32bde35](32bde35)) * **deps-dev:** bump picomatch from 2.3.1 to 2.3.2 in /docs/docusaurus ([#455](#455)) ([66f86ca](66f86ca)) * **deps-dev:** bump postcss from 8.5.10 to 8.5.12 in /data-management/viewer/frontend ([#569](#569)) ([a652dba](a652dba)) * **deps-dev:** bump the python-dependencies group with 2 updates ([#457](#457)) ([749d231](749d231)) * **deps-dev:** bump the python-dependencies group with 2 updates ([#485](#485)) ([71b44fd](71b44fd)) * **deps-dev:** bump the python-dependencies group with 3 updates ([#564](#564)) ([9fc52fd](9fc52fd)) * **deps-dev:** bump typescript from 6.0.2 to 6.0.3 in /docs/docusaurus in the docusaurus-dependencies group ([#513](#513)) ([5694dbc](5694dbc)) * **deps:** bump azureml/openmpi4.1.0-ubuntu22.04 from 20260303.v5 to 20260409.v4 in /evaluation/sil/docker ([#480](#480)) ([25d4df8](25d4df8)) * **deps:** bump cryptography from 46.0.6 to 46.0.7 in /evaluation in the uv group across 1 directory ([#538](#538)) ([92c5b2e](92c5b2e)) * **deps:** bump diffusers from 0.35.2 to 0.38.0 in /training/il/lerobot ([#638](#638)) ([6261d19](6261d19)) * **deps:** bump follow-redirects from 1.15.11 to 1.16.0 in /docs/docusaurus ([#469](#469)) ([0458908](0458908)) * **deps:** bump gitpython and mako for lerobot IL training ([#623](#623)) ([9f8022b](9f8022b)) * **deps:** bump node from 24.14.1-slim to 25.9.0-slim in /data-management/viewer/frontend ([#482](#482)) ([1532d09](1532d09)) * **deps:** bump packaging from 26.0 to 26.1 in /evaluation in the inference-dependencies group ([#483](#483)) ([f4afb6c](f4afb6c)) * **deps:** bump pillow from 12.1.1 to 12.2.0 ([#467](#467)) ([39fb663](39fb663)) * **deps:** bump python from 3.11-slim to 3.14-slim in /data-management/viewer/backend ([#481](#481)) ([7af9dfc](7af9dfc)) * **deps:** bump the dataviewer-backend-dependencies group across 1 directory with 15 updates ([#428](#428)) ([e4446a2](e4446a2)) * **deps:** bump the dataviewer-backend-dependencies group in /data-management/viewer/backend with 4 updates ([#487](#487)) ([0f57c5b](0f57c5b)) * **deps:** bump the dataviewer-backend-dependencies group in /data-management/viewer/backend with 8 updates ([#566](#566)) ([d6e7869](d6e7869)) * **deps:** bump the dataviewer-dependencies group across 1 directory with 5 updates ([#464](#464)) ([24c208d](24c208d)) * **deps:** bump the dataviewer-dependencies group in /data-management/viewer with 2 updates ([#486](#486)) ([90149f3](90149f3)) * **deps:** bump the dataviewer-dependencies group in /data-management/viewer with 6 updates ([#565](#565)) ([f0bb36b](f0bb36b)) * **deps:** bump the dataviewer-frontend-patch-minor group across 1 directory with 10 updates ([#613](#613)) ([e481f83](e481f83)) * **deps:** bump the github-actions group across 1 directory with 4 updates ([#534](#534)) ([5478ab6](5478ab6)) * **deps:** bump the github-actions group with 2 updates ([#488](#488)) ([4e6ce98](4e6ce98)) * **deps:** bump the github-actions group with 3 updates ([#567](#567)) ([48c38dc](48c38dc)) * **deps:** bump the github-actions group with 3 updates ([#634](#634)) ([00cfb49](00cfb49)) * **deps:** bump the github-actions group with 6 updates ([#603](#603)) ([73eb79a](73eb79a)) * **deps:** bump the training-dependencies group across 1 directory with 23 updates ([#463](#463)) ([d5a8656](d5a8656)) * **deps:** bump yaml from 2.8.2 to 2.8.3 in /data-management/viewer/frontend ([#453](#453)) ([10449df](10449df)) * pytest harness, dependabot advisories, and OSSF Scorecard remediations ([#501](#501)) ([e8756e8](e8756e8)) * **scripts:** pin and hash-verify all shell script downloads ([#468](#468)) ([0c2bb9c](0c2bb9c)) --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). --------- Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Description
Upgraded the OSMO platform from chart 1.0.1 (image 6.0.0) to chart 1.2.1 (image 6.2) and replaced the broken dev-mode authentication with token-based auth backed by a Key Vault credential lifecycle. The previous deployment had three compounding failures — backend CrashLoopBackOff from invalid
loginMethod, 404 on all auth APIs, and silently ignoreddefaultAdminvalues — making the platform inoperable.This PR also absorbed the skrl 2.0.0 breaking API change (private
_update→ publicupdatewith keyword-only arguments) and corrected several training dependency pins that caused import failures in GPU containers. End-to-end training validated on both OSMO and AzureML.Closes #478
Type of Change
Component(s) Affected
infrastructure/terraform/prerequisites/- Azure subscription setupinfrastructure/terraform/- Terraform infrastructureinfrastructure/setup/- OSMO control plane / Helmworkflows/- Training and evaluation workflowstraining/- Training pipelines and scriptsdocs/- DocumentationTesting Performed
planreviewed (no unexpected changes)applytested in dev environmentsmoke_test_azure.py)Validated Training Runs
isaaclab-inline-training-23isaaclab-inline-training-24redactedValidation Commands Run
ruff check training/ evaluation/— passedpytest training/ evaluation/— 43 passed, 92.5% coverageDocumentation Impact
Bug Fix Checklist
Changes
OSMO 6.2 Platform Upgrade
The Helm chart and image versions were bumped across all four releases (control plane, backend operator, router, UI). The control plane values introduced a
defaultAdminblock — NVIDIA's bootstrap mechanism for non-IdP deployments — and removedauth.enabled: false, which was disabling all user/token CRUD endpoints.defaultAdminconfiguration and Entra IdP placeholder comments in osmo-control-plane.yamlAuthentication and Credential Management
random_password.osmo_admin(43-char alphanumeric, URL-safe for Bearer tokens) andazapi_resourcefor Key Vault storage in security.tfosmo-admin-passwordto both CSI SecretProviderClass manifests for automatic K8s secret syncosmo_login_and_setupin common.sh to use--method token --token-filewith process substitutioncurlPOST toosmo token setCLI with 90-day expiry (was 1 year) and--from-filesecret creation to avoid credential exposure in process argsservice_base_urlself-healing in script 04 — detects when script 03 couldn't set it (private cluster, no port-forward) and applies the in-cluster ingress URL so workflow pod sidecars can reach the control planeskrl 2.0.0 API Migration
The upstream skrl library promoted the agent update method from a private to a public interface with keyword-only arguments.
_update(timestep, timesteps)toupdate(*, timestep, timesteps)set_running_mode("eval")toenable_training_mode(enabled=False, apply_to_models=True)and added required second positional argument toact()Training Dependencies
>=1.26.0,<2.0.0in pyproject.toml — Isaac Sim 4.x bundles numpy 1.x-compiled native extensions, and numpy 2.x changes the C ABIazure-ai-mlrequiresmarshmallow>=3.5,<4.0.0SystemErroron importuv pip compilein train.sh withuv pip install --no-depsfrom the pre-locked requirements.txt, removing non-deterministic resolution during GPU-billed training startup--max_iterationsconditional in train.yaml to prevent empty-string argparse failuresNotes
osmo workflow logsreturns 500 whenshared_access_key_enabled = falseon the storage account. The OSMO service internally callsBlobServiceClient.from_connection_string()which does not supportDefaultAzureCredential. Upstream fix tracked in NVIDIA/OSMO PR #865. Workaround:kubectl logsfor live output,az storage blob downloadfor completed runs. Tracked in fix: OSMO workflow log retrieval fails with Azure workload identity #491endpointfield in dataset-config-access-keys.template.jsonChecklist