Skip to content

fix(infrastructure): OSMO workflow execution, PostgreSQL public access, and quickstart corrections#477

Merged
WilliamBerryiii merged 4 commits into
mainfrom
bug/osmo-setup-quickstart
Apr 16, 2026
Merged

fix(infrastructure): OSMO workflow execution, PostgreSQL public access, and quickstart corrections#477
WilliamBerryiii merged 4 commits into
mainfrom
bug/osmo-setup-quickstart

Conversation

@bindsi
Copy link
Copy Markdown
Member

@bindsi bindsi commented Apr 15, 2026

Fixed end-to-end OSMO workflow execution by resolving the empty service_base_url that prevented osmo-ctrl sidecars from connecting to the logger WebSocket. Added PostgreSQL public network access support driven by should_enable_public_network_access, and corrected the quickstart guide with accurate variables, paths, namespaces, and deployment steps validated against a live cluster.

Closes #476
Closes #470

Type of Change

  • 🐛 Bug fix (non-breaking change fixing an issue)
  • ✨ New feature (non-breaking change adding functionality)
  • 💥 Breaking change (fix or feature causing existing functionality to change)
  • 📚 Documentation update
  • 🏗️ Infrastructure change (Terraform/IaC)
  • ♻️ Refactoring (no functional changes)

Component(s) Affected

  • infrastructure/terraform/prerequisites/ - Azure subscription setup
  • infrastructure/terraform/ - Terraform infrastructure
  • infrastructure/setup/ - OSMO control plane / Helm
  • workflows/ - Training and evaluation workflows
  • training/ - Training pipelines and scripts
  • docs/ - Documentation

Testing Performed

  • Terraform plan reviewed (no unexpected changes)
  • Terraform apply tested in dev environment
  • Training scripts tested locally with Isaac Sim
  • OSMO workflow submitted successfully
  • Smoke tests passed (smoke_test_azure.py)

Validated against live cluster aks-osmo-dev-001 with OSMO v6.0.0 in westus2. Workflow hello-osmo-7 completed successfully after the service_base_url fix. PostgreSQL connectivity verified from AKS pods with public access enabled. All Terraform, shellcheck, and markdownlint validations pass.

Documentation Impact

  • No documentation changes needed
  • Documentation updated in this PR
  • Documentation issue filed

Bug Fix Checklist

  • Linked to issue being fixed
  • Regression test included, OR
  • Justification for no regression test: Infrastructure and deploy script changes validated via live cluster deployment and terraform validate. Quickstart corrections verified by cross-referencing against actual source files.

Description

OSMO Service URL and Workflow Execution

Workflow pods failed silently because the service_base_url in the OSMO SERVICE config was empty, leaving the osmo-ctrl sidecar's -host argument blank and preventing WebSocket connections to the logger service.

  • Added detect_ingress_base_url() to scripts/lib/common.sh — returns the AzureML nginx ingress controller in-cluster FQDN (http://azureml-ingress-nginx-controller.azureml.svc.cluster.local) for osmo-ctrl sidecar routing
  • Updated 03-deploy-osmo-control-plane.sh to set SERVICE_BASE_URL from the ingress FQDN instead of the external LB IP, ensuring workflow pods route /api/logger, /api/auth, and /api paths correctly
  • Added validate_service_url_reachable() to scripts/lib/common.sh — probes the service URL with a 5-second timeout and prints actionable kubectl port-forward guidance when unreachable
  • Wired validate_service_url_reachable into both 03-deploy-osmo-control-plane.sh and 04-deploy-osmo-backend.sh
  • Split cluster_service_url from CLI URL in 04-deploy-osmo-backend.sh so --service-url http://localhost:8080 works for port-forwarded CLI calls while Helm uses the in-cluster agent_service_url for pods

PostgreSQL Public Network Access

  • Changed public_network_access_enabled in infrastructure/terraform/modules/platform/postgresql.tf from hardcoded false to var.should_enable_public_network_access, aligning with other resources (Storage, Redis, Key Vault, ACR)
  • Added AllowAzureServices firewall rule (0.0.0.0/0.0.0.0) when public access is enabled, allowing AKS pods to reach PostgreSQL without private endpoints
  • Added aks_egress firewall rule scoped to the NAT Gateway public IP when !pe_enabled && nat_gw, for tighter production configurations
  • Documented should_enable_public_network_access behavior in both terraform.tfvars and terraform.tfvars.example

Quickstart Guide Corrections

Rewrote docs/getting-started/quickstart.md with corrections validated against the actual codebase:

  • Replaced fictional Terraform variables (project_name, enable_*) with actual inputs (resource_prefix, should_*)
  • Fixed Terraform output access from -raw resource_group_name to -json resource_group | jq -r '.name'
  • Corrected training script path from scripts/submit-osmo-training.sh to training/rl/scripts/submit-osmo-training.sh
  • Changed training pod namespace from osmo-control-plane to osmo-workflows
  • Updated default GPU SKU to Standard_NV36ads_A10_v5 (A10 Spot)
  • Added Step 5: Set NGC API Key and --config-preview dry-run commands
  • Added resource_prefix naming constraints warning and Helm cleanup before terraform destroy
  • Added devcontainer port-forward guidance for scripts 03 and 04

Troubleshooting and Operations Docs

  • Added two entries to docs/operations/troubleshooting.md: "workflow completes but status stays RUNNING" (empty service_base_url) and "UI shows server IP not found" (FQDN vs browser resolution)
  • Corrected internal LB IP from 10.0.5.7 to 10.0.5.6 in docs/infrastructure/cluster-setup-advanced.md and fixed the service name reference
  • Added [!IMPORTANT] callout explaining service_base_url VPN vs non-VPN behavior

Table Formatting

Ran npm run format:tables across 6 documentation files — column-width normalization only, no content changes.

Related Issues

Closes #476
Closes #470
Subsumes PR #471 by @katriendg

Notes

  • The AllowAzureServices PostgreSQL firewall rule is intentionally broad for development scenarios. For production, disable it and rely on the NAT Gateway egress rule or private endpoints.
  • The OSMO backend values file (osmo-backend-operator.yaml) was updated from placeholder values to concrete 6.0.0 image tag and in-cluster agent URL. These are overridden by Helm --set-string flags at deploy time.

Checklist

- Refactor tables in README for better readability
- Update blob storage structure documentation for consistency
- Revise quickstart guide to reflect new steps and configurations
- Enhance troubleshooting documentation with detailed resolutions
- Improve clarity in operational guides and service configurations

Signed-off-by: Marcel Bindseil <marcelbindseil@gmail.com>
@bindsi bindsi requested a review from a team as a code owner April 15, 2026 12:37
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 15, 2026

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 94c25a9.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 65.18%. Comparing base (46db49b) to head (94c25a9).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #477   +/-   ##
=======================================
  Coverage   65.18%   65.18%           
=======================================
  Files         253      253           
  Lines       15541    15541           
  Branches     2118     2118           
=======================================
  Hits        10130    10130           
  Misses       5122     5122           
  Partials      289      289           
Flag Coverage Δ
pester 82.24% <ø> (ø)
pytest 92.40% <ø> (ø)
pytest-dataviewer 65.12% <ø> (ø)
pytest-fuzz 1.56% <ø> (ø)
vitest 51.77% <ø> (ø)
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Collaborator

@katriendg katriendg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks like everything is there for an initial fix!

Comment thread docs/getting-started/quickstart.md Outdated
@WilliamBerryiii WilliamBerryiii merged commit 9ed2da6 into main Apr 16, 2026
32 checks passed
@WilliamBerryiii WilliamBerryiii deleted the bug/osmo-setup-quickstart branch April 16, 2026 01:06
WilliamBerryiii pushed a commit that referenced this pull request May 8, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.8.0](v0.7.4...v0.8.0)
(2026-05-08)


### ⚠ BREAKING CHANGES

* **dataviewer:** bump frontend stack to React 19, Vite 8, Tailwind v4,
MSAL 5, ESLint 10
([#524](#524))

### ✨ Features

* **agents:** add automated validation for high-risk Dependabot bumps
([#574](#574))
([8c3686a](8c3686a)),
closes
[#573](#573)
* **data:** add camera selector to annotation workspace and fix AV1
frame extraction
([#591](#591))
([c809d2f](c809d2f))
* **data:** seed dataviewer frontend test foundation and per-section
codecov flags
([#594](#594))
([c06c4e3](c06c4e3))
* **dataviewer:** add OWASP security middleware stack
([#439](#439))
([239edb9](239edb9))
* **infrastructure:** add conversion pipeline Terraform module
([#542](#542))
([244531e](244531e))
* **infrastructure:** upgrade OSMO to chart 1.2.1 / image 6.2 with
secure auth and skrl 2.0.0 compatibility
([#492](#492))
([edfd7a5](edfd7a5))
* **pipeline:** add ACSA setup for ROS2 bag sync to Blob
([#451](#451))
([c271a54](c271a54))
* **workflows:** add advisory Dependabot PR reviewer agentic workflow
([#498](#498))
([d4bb140](d4bb140))
* **workflows:** trigger AW Dependabot PR reviewer after PR Validation
([#580](#580))
([7ab3d16](7ab3d16))


### 🐛 Bug Fixes

* **ci:** correct stale version comment for
actions/create-github-app-token
([#506](#506))
([b2e9a54](b2e9a54))
* **ci:** restore data-pipeline and training broken tests by domain
folder restructure
([#547](#547))
([06d8472](06d8472))
* **docs:** update remaining stale 'Coming soon' labels in
docs/README.md
([#507](#507))
([02439d6](02439d6))
* **docs:** update stale coming soon label for Training section
([#472](#472))
([46db49b](46db49b))
* **evaluation:** scope SIL AzureML validation code path and script
reference
([#387](#387))
([9f138a9](9f138a9))
* **infrastructure:** OSMO workflow execution, PostgreSQL public access,
and quickstart corrections
([#477](#477))
([9ed2da6](9ed2da6))
* **scripts:** exclude CHANGELOG.md from changed-files msdate check
([#644](#644))
([8133bdc](8133bdc))
* **workflows:** allow dependabot[bot] to activate AW Dependabot PR
Review
([#586](#586))
([39dc022](39dc022))
* **workflows:** correct branches filter on AW Dependabot PR Review
workflow_run trigger
([#584](#584))
([fe06b52](fe06b52))
* **workflows:** normalize validate.yaml placeholder env/compute values
([#510](#510))
([340ff44](340ff44))
* **workflows:** recompile aw-dependabot-pr-review lock file
([#576](#576))
([d77c167](d77c167))
* **workflows:** switch AW Dependabot PR Review to pull_request_target
([#589](#589))
([3f1edd1](3f1edd1))


### 📚 Documentation

* **docs:** Fix deployment guide links
([#614](#614))
([0070b04](0070b04))
* document dependency-pinning-artifacts directory purpose
([#508](#508))
([50e0010](50e0010))


### 📦 Build System

* **training:** standardize on Python 3.12 across manifests, containers,
and runtime scripts
([#541](#541))
([7ad014a](7ad014a))


### 🔧 Operations

* **build:** add Copilot cloud agent setup-steps workflow
([#593](#593))
([c912668](c912668))


### 🔧 Miscellaneous

* **build:** exclude auto-generated CHANGELOG.md from cspell and seed
dictionary
([#582](#582))
([de1dd57](de1dd57))
* **build:** redesign codecov flags and split pytest CI per component
([#520](#520))
([357e745](357e745))
* **dataviewer:** bump frontend stack to React 19, Vite 8, Tailwind v4,
MSAL 5, ESLint 10
([#524](#524))
([50f8ad4](50f8ad4))
* **dataviewer:** repoint stale src/dataviewer references to
data-management/viewer
([#504](#504))
([88fa1b4](88fa1b4)),
closes
[#503](#503)
* **deps-dev:** bump basic-ftp from 5.3.0 to 5.3.1
([#618](#618))
([ca10f2a](ca10f2a))
* **deps-dev:** bump globals from 15.15.0 to 17.5.0 in
/data-management/viewer/frontend
([#527](#527))
([0e0b2ae](0e0b2ae))
* **deps-dev:** bump ip-address from 10.1.0 to 10.2.0
([#616](#616))
([816c9cf](816c9cf))
* **deps-dev:** bump lint-staged from 16.4.0 to 17.0.2 in the
root-npm-dependencies group across 1 directory
([#626](#626))
([0e2f293](0e2f293))
* **deps-dev:** bump pydantic from 2.13.3 to 2.13.4 in the
python-dependencies group across 1 directory
([#629](#629))
([c24f1c1](c24f1c1))
* **deps-dev:** bump the python-dependencies group across 1 directory
with 2 updates
([#514](#514))
([8410f4b](8410f4b))
* **deps:** bump azure-core from 1.39.0 to 1.40.0 in /evaluation in the
inference-dependencies group across 1 directory
([#597](#597))
([6141db4](6141db4))
* **deps:** bump cryptography from 46.0.6 to 46.0.7 in
/data-management/viewer
([#424](#424))
([5fb6d58](5fb6d58))
* **deps:** bump cryptography from 46.0.6 to 46.0.7 in
/data-management/viewer/backend
([#423](#423))
([b516ad5](b516ad5))
* **deps:** bump lucide-react from 0.469.0 to 1.8.0 in
/data-management/viewer/frontend
([#528](#528))
([1bdfc1e](1bdfc1e))
* **deps:** bump nginx from `8aa63af` to `5616878` in
/data-management/viewer/frontend
([#511](#511))
([9e7e20e](9e7e20e))
* **deps:** bump nginx from 1.27-alpine to 1.29-alpine in
/data-management/viewer/frontend
([#484](#484))
([0e5c3dd](0e5c3dd))
* **deps:** bump node from `435f353` to `e49fd70` in
/data-management/viewer/frontend
([#560](#560))
([2884649](2884649))
* **deps:** bump react-is from 18.3.1 to 19.2.5 in
/data-management/viewer/frontend
([#530](#530))
([d51318c](d51318c))
* **deps:** bump tensordict from 0.11.0 to 0.12.1 in /evaluation in the
inference-dependencies group across 1 directory
([#456](#456))
([b24e733](b24e733))
* **deps:** bump the dataviewer-backend-dependencies group across 1
directory with 2 updates
([#531](#531))
([171a1da](171a1da))
* **deps:** bump the dataviewer-backend-dependencies group across 1
directory with 5 updates
([#516](#516))
([4f9a577](4f9a577))
* **deps:** bump the dataviewer-backend-dependencies group across 1
directory with 5 updates
([#602](#602))
([6c27ab5](6c27ab5))
* **deps:** bump the dataviewer-dependencies group across 1 directory
with 2 updates
([#529](#529))
([8646971](8646971))
* **deps:** bump the dataviewer-dependencies group across 1 directory
with 3 updates
([#601](#601))
([d28fb50](d28fb50))
* **deps:** bump the dataviewer-dependencies group across 1 directory
with 3 updates
([#632](#632))
([4ca5f3e](4ca5f3e))
* **deps:** bump the dataviewer-dependencies group across 1 directory
with 5 updates
([#515](#515))
([109ee81](109ee81))
* **deps:** bump the dataviewer-frontend-patch-minor group across 1
directory with 6 updates
([#630](#630))
([04d5dfd](04d5dfd))
* **deps:** bump the dataviewer-frontend-patch-minor group across 1
directory with 9 updates
([#563](#563))
([c08f450](c08f450))
* **deps:** bump the docusaurus-dependencies group across 1 directory
with 4 updates
([#627](#627))
([f5825fc](f5825fc))
* **deps:** bump the docusaurus-dependencies group across 1 directory
with 6 updates
([#599](#599))
([b859344](b859344))
* **deps:** bump the github-actions group across 1 directory with 4
updates
([#459](#459))
([2609c52](2609c52))
* **deps:** bump the github-actions group across 1 directory with 4
updates
([#517](#517))
([f54bf5d](f54bf5d))
* **deps:** bump the inference-dependencies group across 1 directory
with 11 updates
([#562](#562))
([087f53a](087f53a))
* **deps:** bump the inference-dependencies group across 1 directory
with 2 updates
([#628](#628))
([4a3be47](4a3be47))
* **deps:** bump the pip group across 2 directories with 1 update
([#494](#494))
([a14b6b0](a14b6b0))
* **docs:** update stale Python 3.11 references to 3.12
([#575](#575))
([6f85c95](6f85c95))
* **scripts:** remove redundant SC1091 disables in OSMO deploy scripts
([#509](#509))
([ae1cb82](ae1cb82))


### 🔒 Security

* **build:** pin dependencies and hash-verify downloads
([#465](#465))
([0289f49](0289f49))
* **build:** remediate dependency security advisories
([#479](#479))
([7196d6d](7196d6d))
* **deps-dev:** bump basic-ftp from 5.2.1 to 5.2.2
([#454](#454))
([cb158f1](cb158f1))
* **deps-dev:** bump basic-ftp from 5.2.2 to 5.3.0
([#495](#495))
([e983b8b](e983b8b))
* **deps-dev:** bump hypothesis from 6.152.3 to 6.152.4 in the
python-dependencies group
([#598](#598))
([83384d2](83384d2))
* **deps-dev:** bump markdownlint-cli2 from 0.22.0 to 0.22.1 in the
root-npm-dependencies group
([#559](#559))
([32bde35](32bde35))
* **deps-dev:** bump picomatch from 2.3.1 to 2.3.2 in /docs/docusaurus
([#455](#455))
([66f86ca](66f86ca))
* **deps-dev:** bump postcss from 8.5.10 to 8.5.12 in
/data-management/viewer/frontend
([#569](#569))
([a652dba](a652dba))
* **deps-dev:** bump the python-dependencies group with 2 updates
([#457](#457))
([749d231](749d231))
* **deps-dev:** bump the python-dependencies group with 2 updates
([#485](#485))
([71b44fd](71b44fd))
* **deps-dev:** bump the python-dependencies group with 3 updates
([#564](#564))
([9fc52fd](9fc52fd))
* **deps-dev:** bump typescript from 6.0.2 to 6.0.3 in /docs/docusaurus
in the docusaurus-dependencies group
([#513](#513))
([5694dbc](5694dbc))
* **deps:** bump azureml/openmpi4.1.0-ubuntu22.04 from 20260303.v5 to
20260409.v4 in /evaluation/sil/docker
([#480](#480))
([25d4df8](25d4df8))
* **deps:** bump cryptography from 46.0.6 to 46.0.7 in /evaluation in
the uv group across 1 directory
([#538](#538))
([92c5b2e](92c5b2e))
* **deps:** bump diffusers from 0.35.2 to 0.38.0 in /training/il/lerobot
([#638](#638))
([6261d19](6261d19))
* **deps:** bump follow-redirects from 1.15.11 to 1.16.0 in
/docs/docusaurus
([#469](#469))
([0458908](0458908))
* **deps:** bump gitpython and mako for lerobot IL training
([#623](#623))
([9f8022b](9f8022b))
* **deps:** bump node from 24.14.1-slim to 25.9.0-slim in
/data-management/viewer/frontend
([#482](#482))
([1532d09](1532d09))
* **deps:** bump packaging from 26.0 to 26.1 in /evaluation in the
inference-dependencies group
([#483](#483))
([f4afb6c](f4afb6c))
* **deps:** bump pillow from 12.1.1 to 12.2.0
([#467](#467))
([39fb663](39fb663))
* **deps:** bump python from 3.11-slim to 3.14-slim in
/data-management/viewer/backend
([#481](#481))
([7af9dfc](7af9dfc))
* **deps:** bump the dataviewer-backend-dependencies group across 1
directory with 15 updates
([#428](#428))
([e4446a2](e4446a2))
* **deps:** bump the dataviewer-backend-dependencies group in
/data-management/viewer/backend with 4 updates
([#487](#487))
([0f57c5b](0f57c5b))
* **deps:** bump the dataviewer-backend-dependencies group in
/data-management/viewer/backend with 8 updates
([#566](#566))
([d6e7869](d6e7869))
* **deps:** bump the dataviewer-dependencies group across 1 directory
with 5 updates
([#464](#464))
([24c208d](24c208d))
* **deps:** bump the dataviewer-dependencies group in
/data-management/viewer with 2 updates
([#486](#486))
([90149f3](90149f3))
* **deps:** bump the dataviewer-dependencies group in
/data-management/viewer with 6 updates
([#565](#565))
([f0bb36b](f0bb36b))
* **deps:** bump the dataviewer-frontend-patch-minor group across 1
directory with 10 updates
([#613](#613))
([e481f83](e481f83))
* **deps:** bump the github-actions group across 1 directory with 4
updates
([#534](#534))
([5478ab6](5478ab6))
* **deps:** bump the github-actions group with 2 updates
([#488](#488))
([4e6ce98](4e6ce98))
* **deps:** bump the github-actions group with 3 updates
([#567](#567))
([48c38dc](48c38dc))
* **deps:** bump the github-actions group with 3 updates
([#634](#634))
([00cfb49](00cfb49))
* **deps:** bump the github-actions group with 6 updates
([#603](#603))
([73eb79a](73eb79a))
* **deps:** bump the training-dependencies group across 1 directory with
23 updates
([#463](#463))
([d5a8656](d5a8656))
* **deps:** bump yaml from 2.8.2 to 2.8.3 in
/data-management/viewer/frontend
([#453](#453))
([10449df](10449df))
* pytest harness, dependabot advisories, and OSSF Scorecard remediations
([#501](#501))
([e8756e8](e8756e8))
* **scripts:** pin and hash-verify all shell script downloads
([#468](#468))
([0c2bb9c](0c2bb9c))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

6 participants