feat(infrastructure): add optional ADLS Gen2 data lake storage account #398
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #398 +/- ##
=======================================
Coverage 64.40% 64.40%
=======================================
Files 251 251
Lines 15433 15433
Branches 2060 2060
=======================================
Hits 9939 9939
Misses 5206 5206
Partials 288 288
katriendg
left a comment
Thank you @jjottar for this contribution!
I've left one comment in the review, and two small requests:
- We have added Terraform docs generation (still to be documented though, so not something you knew): could you run `npm run docs:generate:tf` locally to update the `TERRAFORM.md` file(s) before we merge?
- We typically document variables in the file `infrastructure/terraform/terraform.tfvars.example`, could you add this new one there as well?
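To illustrate the second request, an entry in `terraform.tfvars.example` might look roughly like the fragment below. This is only a sketch: the variable name and its default come from this PR, while the comment wording is an assumption and not the text that was eventually merged.

```hcl
# Optional dedicated ADLS Gen2 data lake storage account (hierarchical namespace
# enabled), kept separate from the AzureML workspace storage.
# Comment wording is illustrative; only the variable name and default come from the PR.
should_create_data_lake_storage = false
```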
katriendg
left a comment
Thank you @jjottar.
Docs generation all looking good.
There is just one thing: the existing policy now applies to the ADLS HNS account, which has containers for the blobs. I believe this may need a final update before we can merge. I left an inline comment on this.
Hello @katriendg, thanks for following up. Somehow the new inline comment doesn't show up for me. Could it be that it's not yet posted?
katriendg
left a comment
So sorry @jjottar! Not sure how that happened. It seems I had two windows open...
Found the right window, this was the comment (should be there now).
I also noticed that in the meantime we merged another PR which regenerates the docs. To resolve the conflict, you can simply rebase and regenerate the docs; your version will then contain the incoming changes as well as your updates.
- add data lake storage account with HNS behind `should_create_data_lake_storage` flag
- add `datasets` and `models` containers with lifecycle policies
- add `storage_dfs` private DNS zone and data lake private endpoints
- add Storage Blob Data Contributor role assignments for ML, OSMO, user, and dataviewer identities
- update blob storage architecture docs for two-account layout
🗄️ - Generated by Copilot

- add conditional `azurerm_storage_management_policy.main` (active when data lake off)
- add `should_create_data_lake_storage` to `terraform.tfvars.example`

…npm run docs:generate:tf

…tainer name and regenerate docs
- update data lake lifecycle `prefix_match` to include container name (`datasets/raw/`, `datasets/converted/`, `datasets/reports/`)
- regenerate `TERRAFORM.md` after rebase on upstream main

- add `evaluation` container for reports and evaluation outputs
- update lifecycle prefix from `datasets/reports/` to `evaluation/reports/`
- update docs and tests for new container structure
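The last two commits revolve around lifecycle `prefix_match` values that start with the container name. A minimal sketch of what such rules could look like with the `azurerm_storage_management_policy` resource follows. The prefixes `datasets/raw/` and `evaluation/reports/` come from the commit messages; the rule names, the retention periods, and the `data_lake_storage_account_id` input are hypothetical and not taken from the PR.

```hcl
# Hypothetical input so the fragment stands alone; the PR's module presumably
# references the data lake storage account resource directly instead.
variable "data_lake_storage_account_id" {
  type = string
}

resource "azurerm_storage_management_policy" "data_lake" {
  storage_account_id = var.data_lake_storage_account_id

  rule {
    name    = "expire-raw-datasets" # hypothetical rule name
    enabled = true

    filters {
      blob_types = ["blockBlob"]
      # A lifecycle prefix starts with the container name, then the folder path.
      prefix_match = ["datasets/raw/"]
    }

    actions {
      base_blob {
        delete_after_days_since_modification_greater_than = 30 # hypothetical retention
      }
    }
  }

  rule {
    name    = "cool-evaluation-reports" # hypothetical rule name
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["evaluation/reports/"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than = 90 # hypothetical
      }
    }
  }
}
```

Without the container name, a prefix such as `raw/` would be read as targeting a container called `raw`, so the rule would match nothing in `datasets`.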
44ad760 to ebf534a
katriendg
left a comment
Looks good, thank you for the work on this.
🤖 I have created a release *beep* *boop*
---
## [0.7.0](v0.6.1...v0.7.0) (2026-04-09)
### ✨ Features
* **build:** add hve-core release pipeline with dependency SBOM and signing artifacts ([#420](#420)) ([2ff839a](2ff839a))
* **build:** enforce strict warnings across all linters ([#392](#392)) ([b75e217](b75e217))
* **evaluation:** add fuzz testing infrastructure and property-based tests ([#416](#416)) ([d97d42c](d97d42c))
* **infrastructure:** add optional ADLS Gen2 data lake storage account ([#398](#398)) ([3bb9012](3bb9012))
* **settings:** add HVE Core extension to workspace and devcontainer recommendations ([#226](#226)) ([f0735d8](f0735d8))
### 🐛 Bug Fixes
* **docs:** fix broken links, harden Docusaurus config, and integrate CI workflow ([#430](#430)) ([ea99997](ea99997))
* **scripts:** join shellcheck version output before -match to populate $Matches ([#432](#432)) ([8768e76](8768e76))
* **scripts:** map unmapped ShellCheck severity levels and harden version parsing ([#434](#434)) ([1e95a17](1e95a17))
* **scripts:** resolve ShellCheck SC2034 and enable source-path resolution ([#443](#443)) ([04438ea](04438ea))
### 🔧 Miscellaneous
* **deps-dev:** bump basic-ftp from 5.2.0 to 5.2.1 ([#429](#429)) ([438660a](438660a))
* **deps:** bump cryptography from 46.0.6 to 46.0.7 ([#425](#425)) ([2366647](2366647))
---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).
---------
Co-authored-by: physical-ai-toolchain-release[bot] <267194360+physical-ai-toolchain-release[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Pull Request
Description
Add an optional dedicated ADLS Gen2 storage account with hierarchical namespace (HNS) for domain data (datasets, model checkpoints, evaluation reports), separate from the existing AzureML workspace storage. The data lake is gated behind `should_create_data_lake_storage` (default: `false`) and follows existing patterns for naming, networking, RBAC, and lifecycle policies.
Closes #385
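As a rough illustration of the gating described above, the flag and the HNS-enabled account could be wired together as in the sketch below. The variable name, its default, the resource address `azurerm_storage_account.data_lake`, `is_hns_enabled = true`, and the `stdl` prefix come from this PR. The account name, replication type, helper variables, and output shape are assumptions made only to keep the fragment self-contained.

```hcl
variable "should_create_data_lake_storage" {
  type        = bool
  default     = false
  description = "Whether to create the dedicated ADLS Gen2 data lake storage account."
}

# Hypothetical inputs; the real module derives these from its own naming and
# resource group conventions.
variable "resource_group_name" {
  type = string
}

variable "location" {
  type = string
}

resource "azurerm_storage_account" "data_lake" {
  # Created only when the flag is set; other resources address it as data_lake[0].
  count = var.should_create_data_lake_storage ? 1 : 0

  name                     = "stdlexampledev001" # hypothetical name using the stdl prefix
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_kind             = "StorageV2"
  account_tier             = "Standard"
  account_replication_type = "LRS" # assumption; the PR does not state the replication type
  is_hns_enabled           = true  # hierarchical namespace, which makes this ADLS Gen2
}

output "data_lake_storage_account" {
  # Null when the data lake is disabled; the object shape here is an assumption.
  value = var.should_create_data_lake_storage ? {
    id   = azurerm_storage_account.data_lake[0].id
    name = azurerm_storage_account.data_lake[0].name
  } : null
}
```

With the default of `false`, `count` evaluates to 0, so existing deployments plan no new storage resources until the flag is turned on.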
Type of Change
Component(s) Affected
- `infrastructure/terraform/prerequisites/` - Azure subscription setup
- `infrastructure/terraform/` - Terraform infrastructure
- `infrastructure/setup/` - OSMO control plane / Helm
- `workflows/` - Training and evaluation workflows
- `training/` - Training pipelines and scripts
- `docs/` - Documentation
Testing Performed
- `plan` reviewed (no unexpected changes)
- `apply` tested in dev environment
- `smoke_test_azure.py`)
Terraform Plan (with `should_create_data_lake_storage = true`)
Plan: 12 to add, 2 to change, 1 to destroy.
- `module.platform.azurerm_storage_account.data_lake[0]`
- `module.platform.azurerm_storage_container.datasets[0]`
- `module.platform.azurerm_storage_container.models[0]`
- `module.platform.azurerm_storage_container.evaluation[0]`
- `module.platform.azurerm_storage_management_policy.data_lake[0]`
- `module.platform.azurerm_private_dns_zone.core["storage_dfs"]`
- `module.platform.azurerm_private_dns_zone_virtual_network_link.core["storage_dfs"]`
- `module.platform.azurerm_private_endpoint.data_lake_blob[0]`
- `module.platform.azurerm_private_endpoint.data_lake_dfs[0]`
- `module.platform.azurerm_role_assignment.user_data_lake_blob[0]`
- `module.platform.azurerm_role_assignment.ml_data_lake_blob[0]`
- `module.platform.azurerm_role_assignment.osmo_data_lake_blob[0]`
- `module.platform.azurerm_key_vault.main` (in-place, pre-existing drift)
- `module.platform.azurerm_storage_account.main` (in-place, pre-existing drift)
- `module.platform.azurerm_storage_management_policy.main`
The 2 in-place updates and the destroy are expected:
`count = 0` when data lake is enabled). Existing deployments without the data lake retain their lifecycle rules.
Terraform Apply
All 12 data lake resources created successfully in `rg-roboticsch-dev-001` (`switzerlandnorth`):
- `stdlroboticschdev001` (HNS enabled)
- `datasets` (private)
- `models` (private)
- `evaluation` (private)
- `privatelink.dfs.core.windows.net`
- `pe-datalake-blob-roboticsch-dev-001`
- `pe-datalake-dfs-roboticsch-dev-001`
- `stdl*`
- `stdl*`
- `stdl*`
Terraform Test
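For readers unfamiliar with Terraform's native test framework, the test runs named later in this description (`data_lake_disabled_by_default`, `data_lake_enabled`) might be written roughly as below. This is a hedged sketch, not the PR's test code: the run names, the variable, and the resource address come from the PR, while the assertions and messages are assumptions. The two runs are shown together for brevity, and provider configuration (or a mock provider) plus the module's other required variables are omitted.

```hcl
# Hypothetical contents of a *.tftest.hcl file placed in the platform module's
# tests directory. Provider setup and other required variables are left out.

run "data_lake_disabled_by_default" {
  command = plan

  assert {
    # With the default flag value, no data lake account should be planned.
    condition     = length(azurerm_storage_account.data_lake) == 0
    error_message = "Data lake storage account should not be created unless the flag is set."
  }
}

run "data_lake_enabled" {
  command = plan

  variables {
    should_create_data_lake_storage = true
  }

  assert {
    condition     = length(azurerm_storage_account.data_lake) == 1
    error_message = "Exactly one data lake storage account should be planned when the flag is set."
  }

  assert {
    condition     = azurerm_storage_account.data_lake[0].is_hns_enabled
    error_message = "Data lake storage account should have hierarchical namespace enabled."
  }
}
```

Using `command = plan` keeps the runs from creating real Azure resources, which suits checks on conditional resource counts.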
Lint & Validation
- `npm run lint:tf`
- `npm run lint:tf:validate`
- `npm run spell-check`
- `npm run lint:md`
What Changed
Platform Module (`infrastructure/terraform/modules/platform/`)
- `storage.tf`: New `azurerm_storage_account.data_lake` with `is_hns_enabled = true`, `datasets`, `models`, and `evaluation` containers, data lake lifecycle policy, blob and DFS private endpoints. ML storage lifecycle policy gated with `count = var.should_create_data_lake_storage ? 0 : 1` to avoid regression for existing deployments, and legacy fallback lifecycle prefixes corrected to target `ml-workspace/...` paths when the data lake is disabled.
- `main.tf`: Added `storage_dfs = "privatelink.dfs.core.windows.net"` to `base_dns_zones` (7 base zones, up from 6).
- `variables.tf`: New `should_create_data_lake_storage` variable (bool, default `false`).
- `role-assignments.tf`: Added Storage Blob Data Contributor on data lake for current user, ML identity, and OSMO identity. All gated on the data lake flag.
- `outputs.tf`: New `data_lake_storage_account` and `data_lake_storage_account_access` outputs (null when disabled).
Dataviewer Module (`infrastructure/terraform/modules/dataviewer/`)
- `variables.deps.tf`: New optional `data_lake_storage_account` input (nullable) on the reusable dataviewer Terraform module.
- `role-assignments.tf`: Conditional Storage Blob Data Contributor on data lake for dataviewer identity when a caller passes the optional data lake dependency.
Root Module (`infrastructure/terraform/`)
- `variables.tf`: New `should_create_data_lake_storage` root variable.
- `main.tf`: Pass `should_create_data_lake_storage` to platform module.
- `outputs.tf`: New `data_lake_storage_account` root output.
- `terraform.tfvars.example`: Added `should_create_data_lake_storage` with documentation.
Tests (`infrastructure/terraform/modules/platform/tests/`)
- `dns-zones.tftest.hcl`: Updated zone counts (6→7 base zones).
- `security.tftest.hcl`: Added `data_lake_security` and `data_lake_disabled_by_default` test runs.
- `conditionals.tftest.hcl`: Added `data_lake_enabled` and `data_lake_disabled` test runs.
Documentation
- `docs/cloud/blob-storage-structure.md`: Rewritten for the two-account architecture (ML workspace storage vs data lake storage), with new container/folder structure and updated lifecycle policy references.
- `.cspell/general-technical.txt`: Added `stdl` (data lake naming prefix).
Documentation Impact
Checklist