Merged
Changes from 17 commits
25 commits
db7c86c
feat: cross cutting migration
agreaves-ms Mar 14, 2026
0252532
feat: continue cross cutting migration
agreaves-ms Mar 14, 2026
1ab93b8
feat: deploy migration
agreaves-ms Mar 15, 2026
f7cdc84
feat: deploy migration continued
agreaves-ms Mar 15, 2026
c991d0c
feat: data management migration
agreaves-ms Mar 15, 2026
a34f67f
feat: data management migration continue
agreaves-ms Mar 15, 2026
3d8e8c4
feat: migrate training to domains
agreaves-ms Mar 15, 2026
8aa4faf
feat: continue training migration
agreaves-ms Mar 15, 2026
4180dd6
feat: data pipeline migration
agreaves-ms Mar 15, 2026
251470b
feat: data pipeline migration continued
agreaves-ms Mar 15, 2026
9bf442f
feat: evaluation migration
agreaves-ms Mar 15, 2026
0e4b66d
feat: evaluation migration continued
agreaves-ms Mar 15, 2026
0f80acc
feat: add docs and landing places for SDG and fleet deployment
agreaves-ms Mar 15, 2026
ac8e06e
fix: update stale path references post-architecture-reorg
agreaves-ms Mar 15, 2026
d05e802
feat: additional clean up and fixes for the migration
agreaves-ms Mar 15, 2026
d03bacf
fix(repo): update import paths and references for architecture migration
katriendg Mar 16, 2026
a364e47
fix(pipeline): update ruff lint commands to check all directories
katriendg Mar 16, 2026
08ee6b2
fix(deps): patch PyJWT and flatted security vulnerabilities
katriendg Mar 16, 2026
38fef19
fix(training): update pyjwt version to address security vulnerabilities
katriendg Mar 16, 2026
b60fb95
fix(docs): update ms.date in workflow README files to reflect current…
katriendg Mar 16, 2026
a41fce5
refactor(linting): streamline repository root path detection in linti…
katriendg Mar 16, 2026
a9ab3c3
fix(docs): update ms.date in frontmatter files to reflect current date
katriendg Mar 16, 2026
430c09f
style(infrastructure): standardize table formatting in specifications…
katriendg Mar 16, 2026
b14143f
fix(pipeline): update default value for soft-fail input in link langu…
katriendg Mar 16, 2026
c1ab1c3
fix(build): update default value for soft-fail input in link language…
katriendg Mar 16, 2026
2 changes: 1 addition & 1 deletion .cspell.json
@@ -28,7 +28,7 @@
"*megalinter_file_names_cspell.txt",
"**/.terraform/**",
"**/.terraform.lock.hcl",
"**/scripts/tests/Fixtures/**"
"**/shared/ci/tests/Fixtures/**"
],
"dictionaryDefinitions": [
{
69 changes: 36 additions & 33 deletions .cspell/azure-services.txt
@@ -1,68 +1,71 @@
AADSSH
ACMRT
ACSA
AKRI
AMPLS
Dependabot
Entra
Eventhouse
MLflow
RTXPRO
acrpull
acsa
activedirectory
adal
adls
adms
amlarc
azmk
azurevpnconfig
adminconsent
adms
afds
agentpool
agentsvc
agic
aiops
akri
AKRI
AMPLS
amlarc
apim
appi
apparmor
appi
appinsight
appinsights
arcbox
arcgis
arck
arcsight
armttk
azacsnap
azapi
azcmagent
azcopy
azdo
azmk
azmon
azsk
azureactivedirectory
azuread
azurecli
acrpull
acsa
ACSA
ACMRT
agentpool
agentsvc
arck
azapi
azmon
azurecontainer
azuredevops
azuremonitor
azurelinux
azureml
azuremonitor
azurestack
azurevpnconfig
containerapp
cosmosdb
Dependabot
deviceregistry
eastu
eastus
Entra
entra
Eventhouse
functionalization
keyvault
MLflow
onedrive
onenote
powerbi
rtxprogpu
servicebus
sharepoint
southeastasia
wasbs
westus
azurecontainer
azurestack
activedirectory
azuread
functionalization
azurelinux
servicebus
eastu
AADSSH
rtxprogpu
RTXPRO
azureactivedirectory

eventhub
9 changes: 9 additions & 0 deletions .cspell/general-technical.txt
@@ -443,6 +443,7 @@ flexbox
Flink
fluentbit
fluentd
fluxcd
flyout
flyouts
fontcolor
@@ -508,6 +509,7 @@ heredoc
heredocs
hexadecimal
hids
hil
highlevel
hipaa
hive
@@ -647,6 +649,7 @@ kingsway
kingswaysoft
koalaman
kpis
kql
kreps
kube
kubeconfig
@@ -1150,6 +1153,7 @@ scom
scoped
scrum
scsp
sdg
sdk
sdks
sdlc
@@ -1176,6 +1180,7 @@ sigalrm
signalr
signingkey
sigstore
sil
signup
Silverlight
Simbolic
@@ -1964,3 +1969,7 @@ rvfc
srcs
tobytes
ultrafast

kolmogorov
mqtt
smirnov
2 changes: 2 additions & 0 deletions .cspell/industry-acronyms.txt
@@ -76,3 +76,5 @@ Cnet
cnet
URDF
MJCF

cusum
6 changes: 3 additions & 3 deletions .env.local.example
@@ -1,6 +1,6 @@
# Local environment overrides for developer workstations (not committed to git)
# Copy this file to .env.local and customize values for your setup.
# Variables defined here override defaults in deploy/002-setup/defaults.conf.
# Variables defined here override defaults in infrastructure/setup/defaults.conf.

# Path to the OSMO repository clone (required for --use-local-osmo flag).
# The osmo-dev.sh script builds and runs the CLI from source via Bazel.
@@ -44,8 +44,8 @@
# HELM_REPO_KAI=https://nvidia.github.io/k8s-device-scheduler/
# HELM_REPO_OSMO=https://helm.ngc.nvidia.com/nvidia/osmo

# Default Terraform Directory (relative to deploy/002-setup)
# DEFAULT_TF_DIR=../001-iac
# Default Terraform Directory (relative to infrastructure/setup)
# DEFAULT_TF_DIR=../terraform

# AzureML Extension Configuration
# AZUREML_EXTENSION_NAME=aml-extension
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/00-general.md
@@ -45,7 +45,7 @@ Provide details about your issue:

<!--
Include any relevant information:
- Related files or components (e.g., deploy/001-iac, src/training)
- Related files or components (e.g., infrastructure/terraform/, training/)
- Links to documentation you've consulted
- Screenshots or diagrams if helpful
-->
10 changes: 5 additions & 5 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -19,11 +19,11 @@ Closes #
## Component(s) Affected
<!-- Mark all that apply -->

- [ ] `deploy/000-prerequisites` - Azure subscription setup
- [ ] `deploy/001-iac` - Terraform infrastructure
- [ ] `deploy/002-setup` - OSMO control plane / Helm
- [ ] `deploy/004-workflow` - Training workflows
- [ ] `src/training` - Python training scripts
- [ ] `infrastructure/terraform/prerequisites/` - Azure subscription setup
- [ ] `infrastructure/terraform/` - Terraform infrastructure
- [ ] `infrastructure/setup/` - OSMO control plane / Helm
- [ ] `workflows/` - Training and evaluation workflows
- [ ] `training/` - Training pipelines and scripts
- [ ] `docs/` - Documentation

## Testing Performed
8 changes: 4 additions & 4 deletions .github/agents/dataviewer-developer.agent.md
@@ -30,7 +30,7 @@ Start the dataviewer app, optionally configuring the dataset path.

If the user provides a dataset path:

1. Read `src/dataviewer/backend/.env`.
1. Read `data-management/viewer/backend/.env`.
2. Replace the `HMI_DATA_PATH=` line with the absolute path to the user's dataset directory.
3. Confirm the update.

@@ -41,7 +41,7 @@ If no path is provided, use the existing `HMI_DATA_PATH` value.
1. Run `start.sh` in the background terminal with configured ports:

```bash
cd src/dataviewer && BACKEND_PORT=${backendPort} FRONTEND_PORT=${frontendPort} ./start.sh
cd data-management/viewer && BACKEND_PORT=${backendPort} FRONTEND_PORT=${frontendPort} ./start.sh
```

Use default ports (8000/5173) when no overrides are specified.
@@ -198,15 +198,15 @@ Follow these codebase conventions:

**Backend (Python/FastAPI):**

- Source code in `src/dataviewer/backend/src/api/`
- Source code in `data-management/viewer/backend/src/api/`
- New endpoints go in `routers/` (REST) or `routes/` (specialized)
- Models in `models/`, services in `services/`
- Register new routers in `main.py`
- Use ruff for linting (line-length 120, target py311)

**Frontend (React/TypeScript):**

- Source code in `src/dataviewer/frontend/src/`
- Source code in `data-management/viewer/frontend/src/`
- Components organized by feature in `components/`
- API calls in `api/`, hooks in `hooks/`, stores in `stores/`
- Types in `types/`
60 changes: 31 additions & 29 deletions .github/copilot-instructions.md
@@ -31,18 +31,21 @@ Conventions, domain knowledge, and non-obvious patterns for agents working in th

| Directory | Purpose |
| --- | --- |
| `deploy/000-prerequisites/` | Azure subscription setup, provider registration |
| `deploy/001-iac/` | Terraform infrastructure (AKS, networking, storage, identity) |
| `deploy/001-iac/vpn/` | Point-to-site VPN for private cluster access |
| `deploy/002-setup/` | Post-deploy shell scripts (Helm charts, AzureML, OSMO) |
| `src/training/` | Python training package (built as wheel via hatchling) |
| `scripts/` | AzureML and OSMO job submission scripts |
| `workflows/` | Job definition templates (AzureML YAML, OSMO workflow YAML) |
| `infrastructure/terraform/prerequisites/` | Azure subscription setup, provider registration |
| `infrastructure/terraform/` | Terraform infrastructure (AKS, networking, storage, identity) |
| `infrastructure/terraform/vpn/` | Point-to-site VPN for private cluster access |
| `infrastructure/setup/` | Post-deploy shell scripts (Helm charts, AzureML, OSMO) |
| `training/rl/` | RL training package (SKRL, RSL-RL, Isaac Lab) |
| `training/il/` | IL training package (LeRobot ACT/Diffusion) |
| `evaluation/sil/` | Software-in-the-loop evaluation scripts and workflows |
| `data-management/viewer/` | Dataset analysis tool (FastAPI backend + React frontend) |
| `data-pipeline/capture/` | Recording configuration and data capture |
| `shared/lib/` | Cross-domain shared shell libraries (canonical location) |
| `external/IsaacLab/` | NVIDIA IsaacLab (cloned for IntelliSense only, not built locally) |
| `docs/contributing/` | Architecture, roadmap, style guides, contribution workflow |

* Version: managed by release-please across `pyproject.toml` and `package.json`
* Python: >=3.11, managed by `uv` (not pip); `hatchling` builds `src/training` into wheel
* Python: >=3.11, managed by `uv` (not pip); `hatchling` builds `training/rl` into wheel
* Linting: `npm run lint:md` (markdownlint-cli2), `npm run spell-check` (cspell), `npm run lint:yaml` (yaml-lint)

## Terraform Conventions
@@ -63,8 +66,8 @@ Conventions, domain knowledge, and non-obvious patterns for agents working in th
Detailed template and structure in `.github/instructions/shell-scripts.instructions.md`.

* Two Terraform output libraries exist (do NOT mix them):
* `deploy/002-setup/lib/common.sh`: dot-path accessors (`tf_get`, `tf_require`) for deploy scripts
* `scripts/lib/terraform-outputs.sh`: jq-path accessor (`get_output`) for submission scripts
* `shared/lib/common.sh`: dot-path accessors (`tf_get`, `tf_require`) for deploy and submission scripts
* `shared/lib/terraform-outputs.sh`: jq-path accessor (`get_output`) for submission scripts (symlinked at `scripts/lib/terraform-outputs.sh`)
* `.env.local` load order: `common.sh` loads `.env.local` BEFORE `defaults.conf`; override defaults via `${VAR:-default}` pattern
* Idempotent K8s operations: `kubectl create --dry-run=client -o yaml | kubectl apply -f -`
* Every script supports `--config-preview` (print configuration and exit without changes)
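
The `.env.local` load order in the hunk above can be illustrated with a minimal sketch. It assumes `defaults.conf` assigns each variable with the `${VAR:-default}` pattern. The two files written here are throwaway stand-ins, not the repository's actual `.env.local` or `defaults.conf`; the variable names and default values are borrowed from `.env.local.example`, and `../my-terraform` is a hypothetical override.

```bash
#!/bin/sh
# Sketch only: shows why loading overrides BEFORE defaults lets the override win.
set -eu

# Start from a clean slate so the demo does not depend on the caller's environment.
unset DEFAULT_TF_DIR AZUREML_EXTENSION_NAME

workdir="$(mktemp -d)"

# Hypothetical stand-in for .env.local: overrides a single variable.
cat > "${workdir}/.env.local" <<'EOF'
DEFAULT_TF_DIR=../my-terraform
EOF

# Hypothetical stand-in for defaults.conf: each assignment keeps a value
# that is already set and falls back to the default otherwise.
cat > "${workdir}/defaults.conf" <<'EOF'
DEFAULT_TF_DIR="${DEFAULT_TF_DIR:-../terraform}"
AZUREML_EXTENSION_NAME="${AZUREML_EXTENSION_NAME:-aml-extension}"
EOF

# Overrides first, defaults second.
. "${workdir}/.env.local"
. "${workdir}/defaults.conf"

echo "DEFAULT_TF_DIR=${DEFAULT_TF_DIR}"
echo "AZUREML_EXTENSION_NAME=${AZUREML_EXTENSION_NAME}"

rm -rf "${workdir}"
```

Reversing the two `.` lines would still work for plain assignments, but the `${VAR:-default}` form is what makes the defaults file safe to load last: it never clobbers a value the developer has already set.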
@@ -88,17 +91,17 @@ Four ordered deployment steps:

| Step | Directory | Description |
| --- | --- | --- |
| 1 | `deploy/000-prerequisites/` | Azure subscription init, provider registration |
| 2 | `deploy/001-iac/` | Terraform infrastructure (AKS, networking, storage, identity) |
| 3 | `deploy/001-iac/vpn/` | Point-to-site VPN (required for private clusters before any kubectl) |
| 4 | `deploy/002-setup/` | Helm charts, AzureML extension, OSMO control plane and backend |
| 1 | `infrastructure/terraform/prerequisites/` | Azure subscription init, provider registration |
| 2 | `infrastructure/terraform/` | Terraform infrastructure (AKS, networking, storage, identity) |
| 3 | `infrastructure/terraform/vpn/` | Point-to-site VPN (required for private clusters before any kubectl) |
| 4 | `infrastructure/setup/` | Helm charts, AzureML extension, OSMO control plane and backend |

* Default is private AKS — VPN step (3) is REQUIRED before any kubectl or Helm commands unless `should_enable_public_access = true`
* Three network modes: Full Private (default), Hybrid, Full Public
* Always run `source deploy/000-prerequisites/az-sub-init.sh` before any `terraform` or deploy script commands
* Always run `source infrastructure/terraform/prerequisites/az-sub-init.sh` before any `terraform` or deploy script commands
* Exports `ARM_SUBSCRIPTION_ID` and validates Azure CLI authentication
* If the user has not done `az login`, the script requires interactive input
* Deploy scripts (`002-setup/`) must run in numeric order (01 → 02 → 03 → 04)
* Deploy scripts (`infrastructure/setup/`) must run in numeric order (01 → 02 → 03 → 04)
* Each deploy script is idempotent and safe to re-run

## OSMO Platform
@@ -131,7 +134,7 @@ AzureML runs on Arc-connected AKS clusters via the AzureML Kubernetes extension.
* Job YAML schema: `$schema: .../commandJob.schema.json`
* No empty strings in YAML values — use sentinel values (`auto`, `none`, `placeholder`)
* Submit with runtime overrides: `az ml job create --file <yaml> --set "display_name=..." --set "environment_variables.KEY=value"`
* Code snapshot: `src/` directory uploaded to AzureML; exclusions controlled by `src/.amlignore`
* Code snapshot: each domain's workflow directory uploaded to AzureML via `code: .` relative path
* Identity chain: Terraform-created managed identity → federated credentials → K8s service accounts (`azureml:default`, `azureml:training`)
* Model validation mode: `mode: download` (NOT `ro_mount`) — workaround for workload identity auth failures in `data-capability` sidecar
* Multi-node: Volcano scheduler installed by AzureML extension when `installVolcano: true`
@@ -178,8 +181,8 @@ Run `npm install` (or `npm ci`) before any `npm run` lint commands. `shellcheck`
| `*.sh` | `shellcheck <file>` |
| `*.ps1` | `npm run lint:ps` |
| `*.yml` (GitHub Actions) | `npm run lint:yaml` |
| `src/dataviewer/frontend/**` | `cd src/dataviewer/frontend && npm run validate` (type-check + lint + test) |
| `src/dataviewer/backend/**` | `cd src/dataviewer/backend && pytest` and `ruff check src/` |
| `data-management/viewer/frontend/**` | `cd data-management/viewer/frontend && npm run validate` (type-check + lint + test) |
| `data-management/viewer/backend/**` | `cd data-management/viewer/backend && pytest` and `ruff check src/` |
| Any file | `npm run spell-check` |

### Linting
@@ -193,23 +196,22 @@ Run `npm install` (or `npm ci`) before any `npm run` lint commands. `shellcheck`

Terraform validation is per-directory — each deployment directory has its own provider configuration and state:

* `terraform fmt -check -recursive deploy/` — formatting compliance (recursive across all directories)
* `terraform fmt -check -recursive infrastructure/terraform/` — formatting compliance (recursive across all directories)
* `terraform validate` — run inside each deployment directory individually:
* `deploy/001-iac/`
* `deploy/001-iac/vpn/`
* `deploy/001-iac/dns/`
* `deploy/001-iac/automation/`
* `terraform plan -var-file=terraform.tfvars` — validates configuration against provider APIs (requires `source deploy/000-prerequisites/az-sub-init.sh` first)
* `infrastructure/terraform/`
* `infrastructure/terraform/vpn/`
* `infrastructure/terraform/dns/`
* `infrastructure/terraform/automation/`
* `terraform plan -var-file=terraform.tfvars` — validates configuration against provider APIs (requires `source infrastructure/terraform/prerequisites/az-sub-init.sh` first)

### Shell Scripts

* `shellcheck deploy/**/*.sh scripts/**/*.sh` — static analysis for deploy and submission scripts
* Requires zsh or bash with `shopt -s globstar`; alternatively use `find deploy scripts -name '*.sh' -exec shellcheck {} +`
* Deploy scripts (`deploy/002-setup/`) support `--config-preview` — prints configuration and exits without making changes; use for dry-run validation after modifying any deploy script
* `shellcheck infrastructure/setup/*.sh training/**/*.sh evaluation/**/*.sh` — static analysis for deploy and submission scripts
* Deploy scripts (`infrastructure/setup/`) support `--config-preview` — prints configuration and exits without making changes; use for dry-run validation after modifying any deploy script
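
The `**` patterns in the shellcheck command above depend on shell settings: without `globstar`, bash treats `**` like `*`, so `training/**/*.sh` matches only scripts exactly one directory below `training/`. A `find`-based listing avoids that dependency. The sketch below is illustrative only: `list_shell_scripts` is a hypothetical helper, and the demo tree with `submit.sh` and `run.sh` is made up for the example.

```bash
#!/bin/sh
# Sketch only: collect *.sh files at any depth without relying on globstar.
set -eu

# Hypothetical helper: prints every *.sh file under the given directories, sorted.
list_shell_scripts() {
  find "$@" -type f -name '*.sh' -print | sort
}

# Build a throwaway tree that mimics nested domain directories.
demo_root="$(mktemp -d)"
mkdir -p "${demo_root}/training/rl/scripts" "${demo_root}/evaluation/sil"
touch "${demo_root}/training/rl/scripts/submit.sh" \
      "${demo_root}/evaluation/sil/run.sh" \
      "${demo_root}/training/notes.md"

# Scripts nested two levels deep are found; non-shell files are skipped.
found="$(cd "${demo_root}" && list_shell_scripts training evaluation)"
echo "${found}"

rm -rf "${demo_root}"
```

In the repository itself, the same idea can feed shellcheck directly, for example `find infrastructure/setup training evaluation -name '*.sh' -exec shellcheck {} +`, which mirrors the `find` form used in the earlier version of this section.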

### Pester Tests

* `npm run test:ps` — runs Pester tests in `scripts/tests/` covering linting helpers and security checks
* `npm run test:ps` — runs Pester tests in `shared/ci/tests/` covering linting helpers and security checks

## Contributing References
