diff --git a/.cspell/general-technical.txt b/.cspell/general-technical.txt index 657630eb..96fc0265 100644 --- a/.cspell/general-technical.txt +++ b/.cspell/general-technical.txt @@ -1259,6 +1259,7 @@ stackexchange stackoverflow statefulness stderr +stdl stdout stig stonith diff --git a/docs/cloud/blob-storage-structure.md b/docs/cloud/blob-storage-structure.md index a34780b5..7b1418b4 100644 --- a/docs/cloud/blob-storage-structure.md +++ b/docs/cloud/blob-storage-structure.md @@ -1,19 +1,33 @@ -# Azure Blob Storage Folder Structure +# Azure Storage Architecture -Standardized folder structure for robotics data stored in Azure Blob Storage, including raw ROS bags, converted datasets, validation reports, and model checkpoints. +Storage architecture for robotics data across two Azure Storage accounts: an ML workspace storage account for AzureML system data, and an optional ADLS Gen2 data lake for domain data (datasets, model checkpoints). + +## Storage Accounts + +| Account | Naming Pattern | Type | Purpose | Terraform Variable | +|--------------------------|-----------------------------------------------|----------------------------|----------------------------------------------|-----------------------------------| +| ML Workspace Storage | `st{prefix}{env}{instance}` | Standard Blob (StorageV2) | AzureML system data (logs, snapshots, files) | Always created | +| Data Lake Storage (Gen2) | `stdl{prefix}{env}{instance}` | ADLS Gen2 (HNS enabled) | Domain data (datasets, model checkpoints) | `should_create_data_lake_storage` | + +The data lake uses hierarchical namespace (HNS), which provides atomic directory renames required by checkpoint libraries and POSIX-style ACLs for fine-grained access control. 
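
As a rough sketch of how the optional data lake account might be switched on, the fragment below sets `should_create_data_lake_storage` in a tfvars file. The `resource_prefix`, `environment`, `instance`, and `location` values are placeholders chosen for illustration only. It also assumes the module applies the naming patterns from the table without further normalization.

```hcl
# Illustrative terraform.tfvars fragment; all values below are placeholders.
resource_prefix = "rob"     # hypothetical prefix
environment     = "dev"     # one of: dev, test, prod
instance        = "001"     # matches the documented default
location        = "eastus2" # hypothetical region

# Defaults to false, so the data lake account is only created when opted in.
should_create_data_lake_storage = true

# With the placeholder values above, the documented naming patterns would give:
#   ML workspace storage: st{prefix}{env}{instance}   -> strobdev001
#   Data lake storage:    stdl{prefix}{env}{instance} -> stdlrobdev001
```

Leaving the flag at its default keeps the deployment to the single ML workspace storage account.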
## Container and Folder Organization -**Default Container:** `ml-workspace` +### ML Workspace Storage -### Folder Structure +**Container:** `ml-workspace` -| Folder Prefix | Purpose | Lifecycle Policy | Example Path | -|----------------|-------------------------------------|-----------------------------|-------------------------------------------------------| -| `raw/` | Raw ROS bag files from edge devices | Auto-delete after 30 days | `raw/robot-01/2026-03-05/episode-001.mcap` | -| `converted/` | LeRobot datasets in v0.3.x format | Tier to cool after 90 days | `converted/pick-place-v1/meta/info.json` | -| `reports/` | Validation reports and metrics | Cool (30d) → Archive (180d) | `reports/pick-place-v1/2026-03-05/eval_results.json` | -| `checkpoints/` | Model checkpoints | Retained indefinitely (Hot) | `checkpoints/act-policy/20260305_143022_step_1000.pt` | +Used internally by AzureML for run metadata, code snapshots, and workspace file shares. Not intended for user data. + +### Data Lake Storage + +| Container | Subfolder | Purpose | Lifecycle Policy | +|--------------|----------------------|-------------------------------------|-----------------------------| +| `datasets` | `raw/` | Raw ROS bag files from edge devices | Auto-delete after 30 days | +| `datasets` | `converted/` | LeRobot datasets in v0.3.x format | Tier to cool after 90 days | +| `models` | `base-models/` | Pre-trained foundation model weights | Retained indefinitely (Hot) | +| `models` | `checkpoints/` | Training checkpoint outputs | Retained indefinitely (Hot) | +| `evaluation` | `reports/` | Validation reports and metrics | Cool (30d) → Archive (180d) | ## Naming Conventions @@ -45,8 +59,11 @@ Standardized folder structure for robotics data stored in Azure Blob Storage, in ## Path Patterns +All domain data paths below are relative to the data lake storage account. 
+ ### Raw ROS Bags +**Container:** `datasets` **Pattern:** `raw/{device-id}/{YYYY-MM-DD}/{filename}.mcap` **Examples:** @@ -59,6 +76,7 @@ raw/mobile-manipulator-03/2026-03-01/navigation-001.mcap ### Converted LeRobot Datasets +**Container:** `datasets` **Pattern:** `converted/{dataset-id}/meta/info.json` **Structure:** @@ -87,6 +105,7 @@ converted/pick-place-v1/videos/observation.image/chunk-000/episode_0000.mp4 ### Validation Reports +**Container:** `evaluation` **Pattern:** `reports/{dataset-id}/{YYYY-MM-DD}/{filename}.json` **Examples:** @@ -99,6 +118,7 @@ reports/navigation-v2/2026-03-04/mse_results.json ### Model Checkpoints +**Container:** `models` **Pattern:** `checkpoints/{model-name}/{timestamp}_step_{N}.{ext}` **Examples:** @@ -111,23 +131,22 @@ checkpoints/velocity-anymal/20260301_120000.onnx ## Lifecycle Management Policies -Lifecycle policies automatically manage blob storage costs by tiering and deleting data based on age. +Lifecycle policies on the data lake storage account automatically manage storage costs by tiering and deleting data based on age. 
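
For orientation, a lifecycle rule set along these lines could be expressed with the `azurerm_storage_management_policy` resource. This is a minimal sketch, not the module's actual implementation, which is not shown here. The resource label and the `data_lake_storage_account_id` variable are hypothetical. The rule names and day counts mirror the documented defaults, and each `prefix_match` entry is assumed to begin with the container name.

```hcl
# Hypothetical input: resource ID of the data lake storage account.
variable "data_lake_storage_account_id" {
  type        = string
  description = "Resource ID of the ADLS Gen2 account that holds domain data."
}

# Sketch only: the resource label "data_lake" is illustrative.
resource "azurerm_storage_management_policy" "data_lake" {
  storage_account_id = var.data_lake_storage_account_id

  # Delete raw ROS bags 30 days after last modification.
  rule {
    name    = "delete-raw-bags"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["datasets/raw/"]
    }

    actions {
      base_blob {
        delete_after_days_since_modification_greater_than = 30
      }
    }
  }

  # Move converted datasets to the cool tier after 90 days.
  rule {
    name    = "tier-converted-datasets-to-cool"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["datasets/converted/"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than = 90
      }
    }
  }

  # Move reports to cool after 30 days, then to archive after 180 days.
  rule {
    name    = "tier-reports-to-cool-then-archive"
    enabled = true

    filters {
      blob_types   = ["blockBlob"]
      prefix_match = ["evaluation/reports/"]
    }

    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 180
      }
    }
  }
}
```

No rule targets the `models` container in this sketch, which matches the documented intent to retain base models and checkpoints indefinitely in the hot tier.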
### Policy Details -| Folder Prefix | Action | Timing | Configurable | -|----------------|-----------------|-----------------------|-------------------------------------------| -| `raw/` | Delete | After 30 days | Yes (`raw_bags_retention_days`) | -| `converted/` | Tier to Cool | After 90 days | Yes (`converted_datasets_cool_tier_days`) | -| `reports/` | Tier to Cool | After 30 days | Yes (`reports_cool_tier_days`) | -| `reports/` | Tier to Archive | After 180 days | Yes (`reports_archive_tier_days`) | -| `checkpoints/` | None | Retained indefinitely | N/A | +| Folder Prefix | Container | Action | Timing | Configurable | +|----------------|--------------|-----------------|---------------------- |-------------------------------------------| +| `raw/` | `datasets` | Delete | After 30 days | Yes (`raw_bags_retention_days`) | +| `converted/` | `datasets` | Tier to Cool | After 90 days | Yes (`converted_datasets_cool_tier_days`) | +| `reports/` | `evaluation` | Tier to Cool | After 30 days | Yes (`reports_cool_tier_days`) | +| `reports/` | `evaluation` | Tier to Archive | After 180 days | Yes (`reports_archive_tier_days`) | ### Configuration -Lifecycle policies are defined in Terraform variables: +Lifecycle policies are configured in the root Terraform deployment: -**File:** `deploy/001-iac/terraform.tfvars` +**Example File:** `infrastructure/terraform/terraform.tfvars.example` ```hcl should_enable_raw_bags_lifecycle_policy = true @@ -242,7 +261,7 @@ No migration needed. **Azure Portal:** -1. Navigate to Storage Account (e.g., `st`) +1. Navigate to Data Lake Storage Account (e.g., `stdl`) 2. Settings → Lifecycle management 3. Verify rules: `delete-raw-bags`, `tier-converted-datasets-to-cool`, `tier-reports-to-cool-then-archive` @@ -250,7 +269,7 @@ No migration needed. 
```bash az storage account management-policy show \ - --account-name st \ + --account-name stdl \ --resource-group rg--- ``` @@ -290,6 +309,7 @@ az storage account management-policy show \ ## References +- [Azure Data Lake Storage Gen2](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-introduction) - [Azure Blob Storage Lifecycle Management](https://learn.microsoft.com/azure/storage/blobs/lifecycle-management-overview) - [Terraform azurerm_storage_management_policy](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy) - [LeRobot Dataset Format v0.3.x](https://github.com/huggingface/lerobot/tree/main/src/lerobot/datasets) diff --git a/infrastructure/terraform/TERRAFORM.md b/infrastructure/terraform/TERRAFORM.md index 6a99e122..7b9ee141 100644 --- a/infrastructure/terraform/TERRAFORM.md +++ b/infrastructure/terraform/TERRAFORM.md @@ -2,7 +2,7 @@ title: Robotics Blueprint description: Deploys robotics infrastructure with NVIDIA GPU support, KAI Scheduler, and optional Azure Machine Learning integration. author: Microsoft Robotics-AI Team -ms.date: 2026-03-25 +ms.date: 2026-04-08 ms.topic: reference --- @@ -11,7 +11,6 @@ Deploys robotics infrastructure with NVIDIA GPU support, KAI Scheduler, and optional Azure Machine Learning integration. 
Architecture: - - Platform Module: Shared services (networking, security, observability, ACR, storage, ML workspace) - SiL Module: AKS cluster with GPU node pools and ML extension integration @@ -52,54 +51,64 @@ Architecture: ## Inputs -| Name | Description | Type | Default | Required | -|---------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| -| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | -| location | Location for all resources in this module | `string` | n/a | yes | -| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | -| aml\_compute\_config | AzureML managed compute cluster configuration including VM size, priority, scaling, and optional subnet placement | ```object({ vm_size = string vm_priority = string min_node_count = number max_node_count = number scale_down_after_idle = optional(string, "PT5M") 
cluster_name = optional(string, "gpu-cluster") subnet_id = optional(string) })``` | ```{ "cluster_name": "gpu-cluster", "max_node_count": 1, "min_node_count": 0, "scale_down_after_idle": "PT5M", "subnet_id": null, "vm_priority": "LowPriority", "vm_size": "Standard_NC4as_T4_v3" }``` | no | -| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | -| node\_pools | Additional node pools for the AKS cluster. Map key is used as the node pool name. Note: Pod subnets are not used with Azure CNI Overlay mode | ```map(object({ node_count = optional(number, null) vm_size = string subnet_address_prefixes = list(string) node_taints = optional(list(string), []) node_labels = optional(map(string), {}) should_enable_auto_scaling = optional(bool, false) min_count = optional(number, null) max_count = optional(number, null) priority = optional(string, "Regular") zones = optional(list(string), null) eviction_policy = optional(string, "Deallocate") gpu_driver = optional(string, null) }))``` | ```{ "gpu": { "eviction_policy": "Delete", "gpu_driver": "Install", "max_count": 1, "min_count": 1, "node_taints": [ "nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority = spot:NoSchedule" ], "priority": "Spot", "should_enable_auto_scaling": true, "subnet_address_prefixes": [ "10.0.7.0/24" ], "vm_size": "Standard_NV36ads_A10_v5", "zones": [] } }``` | no | -| osmo\_config | OSMO configuration including workload identity settings | ```object({ should_enable_identity = bool should_federate_identity = bool control_plane_namespace = string operator_namespace = string workflows_namespace = string })``` | ```{ "control_plane_namespace": "osmo-control-plane", "operator_namespace": "osmo-operator", "should_enable_identity": true, "should_federate_identity": true, "workflows_namespace": "osmo-workflows" }``` | no | -| postgresql\_databases | Map of databases to create with collation and charset | ```map(object({ collation = string charset = string 
}))``` | ```{ "osmo": { "charset": "utf8", "collation": "en_US.utf8" } }``` | no | -| postgresql\_high\_availability | PostgreSQL high availability configuration. Set should\_enable=false to deploy without HA | ```object({ should_enable = bool standby_availability_zone = optional(string) })``` | ```{ "should_enable": false, "standby_availability_zone": null }``` | no | -| postgresql\_location | Location for PostgreSQL Flexible Server. Defaults to the main location. Set to a different region when PostgreSQL provisioning is restricted in the primary location | `string` | `null` | no | -| postgresql\_sku\_name | SKU name for PostgreSQL server | `string` | `"GP_Standard_D2s_v3"` | no | -| postgresql\_storage\_mb | Storage size in megabytes for PostgreSQL | `number` | `32768` | no | -| postgresql\_version | PostgreSQL server version | `string` | `"16"` | no | -| postgresql\_zone | Primary availability zone for PostgreSQL. Set to null for Azure auto-selection | `string` | `null` | no | -| redis\_clustering\_policy | Clustering policy for Redis cache (OSSCluster or EnterpriseCluster). EnterpriseCluster recommended for clients that don't support Redis Cluster MOVED redirects | `string` | `"EnterpriseCluster"` | no | -| redis\_sku\_name | SKU name for Azure Managed Redis cache. 
Format: {Tier}\_{Size} (e.g., Balanced\_B10, Memory\_M20, Compute\_X10) | `string` | `"Balanced_B10"` | no | -| resource\_group\_name | Existing resource group name containing foundational and ML resources (Otherwise 'rg-{resource\_prefix}-{environment}-{instance}') | `string` | `null` | no | -| should\_add\_current\_user\_key\_vault\_admin | Whether to add the current user as Key Vault Secrets Officer | `bool` | `true` | no | -| should\_add\_current\_user\_storage\_blob | Whether to add the current user as Storage Blob Data Contributor | `bool` | `true` | no | -| should\_create\_resource\_group | Whether to create the resource group for the robotics infrastructure | `bool` | `true` | no | -| should\_deploy\_aml\_compute | Whether to deploy an AzureML managed compute cluster for GPU workloads | `bool` | `false` | no | -| should\_enable\_aml\_diagnostic\_logs | Whether to enable AML workspace diagnostic logs in Log Analytics | `bool` | `false` | no | -| should\_deploy\_ampls | Whether to deploy Azure Monitor Private Link Scope and its private endpoint | `bool` | `true` | no | -| should\_deploy\_dce | Whether to deploy Data Collection Endpoint for observability | `bool` | `true` | no | -| should\_deploy\_grafana | Whether to deploy Azure Managed Grafana dashboard | `bool` | `true` | no | -| should\_deploy\_monitor\_workspace | Whether to deploy Azure Monitor Workspace for Prometheus metrics | `bool` | `true` | no | -| should\_deploy\_postgresql | Whether to deploy PostgreSQL Flexible Server component | `bool` | `true` | no | -| should\_deploy\_redis | Whether to deploy Azure Managed Redis component | `bool` | `true` | no | -| should\_enable\_microsoft\_defender | Whether to enable Microsoft Defender for Containers on the AKS cluster | `bool` | `false` | no | -| should\_enable\_nat\_gateway | Whether to deploy NAT Gateway for explicit outbound connectivity. 
When true, subnets use NAT Gateway; when false, subnets use Azure default outbound access | `bool` | `true` | no | -| should\_enable\_private\_aks\_cluster | Whether the AKS cluster API endpoint is private. When true, requires VPN for kubectl access. Can be set independently from should\_enable\_private\_endpoint to allow private Azure services with a public AKS control plane. | `bool` | `true` | no | -| should\_enable\_private\_endpoint | Whether to enable private endpoints across resources for secure connectivity | `bool` | `true` | no | -| should\_enable\_public\_network\_access | Whether to enable public network access to the Azure ML workspace | `bool` | `true` | no | -| should\_enable\_purge\_protection | Whether to enable purge protection on Key Vault. Set to false for dev/test to allow easy cleanup. WARNING: Once enabled, purge protection cannot be disabled | `bool` | `false` | no | -| should\_enable\_redis\_high\_availability | Enable high availability for Redis. Increases cost but provides zone redundancy | `bool` | `false` | no | -| should\_enable\_system\_node\_pool\_auto\_scaling | Enable auto-scaling for the AKS system node pool | `bool` | `false` | no | -| should\_include\_aks\_dns\_zone | Whether to include the AKS private DNS zone in core DNS zones | `bool` | `true` | no | -| subnet\_address\_prefixes\_aks | Address prefixes for the AKS subnet | `list(string)` | ```[ "10.0.5.0/24" ]``` | no | -| subnet\_address\_prefixes\_aks\_pod | Address prefixes for the AKS pod subnet | `list(string)` | ```[ "10.0.6.0/24" ]``` | no | -| system\_node\_pool\_max\_count | Maximum node count for AKS system node pool when auto-scaling is enabled (0-1000) | `number` | `null` | no | -| system\_node\_pool\_min\_count | Minimum node count for AKS system node pool when auto-scaling is enabled (0-1000) | `number` | `null` | no | -| system\_node\_pool\_node\_count | Number of nodes for the AKS system node pool | `number` | `1` | no | -| system\_node\_pool\_vm\_size | VM 
size for the AKS system node pool | `string` | `"Standard_D8ds_v5"` | no | -| system\_node\_pool\_zones | Availability zones for AKS system node pool. Set to null or empty for regional deployment (no zone constraint) | `list(string)` | `null` | no | -| tags | Tags to apply to all resources | `map(string)` | `{}` | no | -| virtual\_network\_config | Configuration for the virtual network including address space and subnet prefixes. PE subnet prefix is required when private endpoints are enabled. Resolver subnet enables DNS resolution for VPN clients and on-premises networks | ```object({ address_space = string subnet_address_prefix = string subnet_address_prefix_pe = optional(string, "10.0.2.0/24") subnet_address_prefix_resolver = optional(string, "10.0.9.0/28") })``` | ```{ "address_space": "10.0.0.0/16", "subnet_address_prefix": "10.0.1.0/24", "subnet_address_prefix_pe": "10.0.2.0/24", "subnet_address_prefix_resolver": "10.0.9.0/28" }``` | no | +| Name | Description | Type | Default | Required | 
+|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | +| location | Location for all resources in this module | `string` | n/a | yes | +| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | +| aml\_compute\_config | AzureML managed compute cluster configuration including VM size, priority, scaling, and optional subnet placement | ```object({ vm_size = string vm_priority = string min_node_count = number max_node_count = number scale_down_after_idle = optional(string, "PT5M") cluster_name = optional(string, "gpu-cluster") subnet_id = optional(string) })``` | ```{ "cluster_name": "gpu-cluster", "max_node_count": 1, "min_node_count": 0, "scale_down_after_idle": "PT5M", "subnet_id": null, "vm_priority": "LowPriority", "vm_size": "Standard_NC4as_T4_v3" }``` | 
no | +| converted\_datasets\_cool\_tier\_days | Number of days before tiering converted datasets to cool storage. Set to -1 to disable tiering | `number` | `90` | no | +| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | +| nat\_gateway\_zones | Availability zones for NAT Gateway and its public IP. Set to ["1"] in regions with AZ support. Leave empty for regions without AZ support (e.g. westus) | `list(string)` | ```[ "1" ]``` | no | +| node\_pools | Additional node pools for the AKS cluster. Map key is used as the node pool name. Note: Pod subnets are not used with Azure CNI Overlay mode | ```map(object({ node_count = optional(number, null) vm_size = string subnet_address_prefixes = list(string) node_taints = optional(list(string), []) node_labels = optional(map(string), {}) should_enable_auto_scaling = optional(bool, false) min_count = optional(number, null) max_count = optional(number, null) priority = optional(string, "Regular") zones = optional(list(string), null) eviction_policy = optional(string, "Deallocate") gpu_driver = optional(string, null) }))``` | ```{ "gpu": { "eviction_policy": "Delete", "gpu_driver": "Install", "max_count": 1, "min_count": 1, "node_taints": [ "nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority = spot:NoSchedule" ], "priority": "Spot", "should_enable_auto_scaling": true, "subnet_address_prefixes": [ "10.0.7.0/24" ], "vm_size": "Standard_NV36ads_A10_v5", "zones": [] } }``` | no | +| osmo\_config | OSMO configuration including workload identity settings | ```object({ should_enable_identity = bool should_federate_identity = bool control_plane_namespace = string operator_namespace = string workflows_namespace = string })``` | ```{ "control_plane_namespace": "osmo-control-plane", "operator_namespace": "osmo-operator", "should_enable_identity": true, "should_federate_identity": true, "workflows_namespace": "osmo-workflows" }``` | no | +| postgresql\_databases | Map of databases 
to create with collation and charset | ```map(object({ collation = string charset = string }))``` | ```{ "osmo": { "charset": "utf8", "collation": "en_US.utf8" } }``` | no | +| postgresql\_high\_availability | PostgreSQL high availability configuration. Set should\_enable=false to deploy without HA | ```object({ should_enable = bool standby_availability_zone = optional(string) })``` | ```{ "should_enable": false, "standby_availability_zone": null }``` | no | +| postgresql\_location | Location for PostgreSQL Flexible Server. Defaults to the main location. Set to a different region when PostgreSQL provisioning is restricted in the primary location | `string` | `null` | no | +| postgresql\_sku\_name | SKU name for PostgreSQL server | `string` | `"GP_Standard_D2s_v3"` | no | +| postgresql\_storage\_mb | Storage size in megabytes for PostgreSQL | `number` | `32768` | no | +| postgresql\_version | PostgreSQL server version | `string` | `"16"` | no | +| postgresql\_zone | Primary availability zone for PostgreSQL. Set to null for Azure auto-selection | `string` | `null` | no | +| raw\_bags\_retention\_days | Number of days to retain raw ROS bags before automatic deletion. Set to -1 to disable deletion | `number` | `30` | no | +| redis\_clustering\_policy | Clustering policy for Redis cache (OSSCluster or EnterpriseCluster). EnterpriseCluster recommended for clients that don't support Redis Cluster MOVED redirects | `string` | `"EnterpriseCluster"` | no | +| redis\_sku\_name | SKU name for Azure Managed Redis cache. Format: {Tier}\_{Size} (e.g., Balanced\_B10, Memory\_M20, Compute\_X10) | `string` | `"Balanced_B10"` | no | +| reports\_archive\_tier\_days | Number of days before tiering validation reports to archive storage. 
Must be greater than reports\_cool\_tier\_days | `number` | `180` | no | +| reports\_cool\_tier\_days | Number of days before tiering validation reports to cool storage | `number` | `30` | no | +| resource\_group\_name | Existing resource group name containing foundational and ML resources (Otherwise 'rg-{resource\_prefix}-{environment}-{instance}') | `string` | `null` | no | +| should\_add\_current\_user\_key\_vault\_admin | Whether to add the current user as Key Vault Secrets Officer | `bool` | `true` | no | +| should\_add\_current\_user\_storage\_blob | Whether to add the current user as Storage Blob Data Contributor | `bool` | `true` | no | +| should\_create\_data\_lake\_storage | Whether to create a dedicated ADLS Gen2 storage account with hierarchical namespace for domain data (datasets, model checkpoints) | `bool` | `false` | no | +| should\_create\_resource\_group | Whether to create the resource group for the robotics infrastructure | `bool` | `true` | no | +| should\_create\_vm\_subnet | Whether to create a dedicated subnet for virtual machines in the platform virtual network | `bool` | `false` | no | +| should\_deploy\_aml\_compute | Whether to deploy an AzureML managed compute cluster for GPU workloads | `bool` | `false` | no | +| should\_deploy\_ampls | Whether to deploy Azure Monitor Private Link Scope and its private endpoint | `bool` | `true` | no | +| should\_deploy\_dce | Whether to deploy Data Collection Endpoint for observability | `bool` | `true` | no | +| should\_deploy\_grafana | Whether to deploy Azure Managed Grafana dashboard | `bool` | `true` | no | +| should\_deploy\_monitor\_workspace | Whether to deploy Azure Monitor Workspace for Prometheus metrics | `bool` | `true` | no | +| should\_deploy\_postgresql | Whether to deploy PostgreSQL Flexible Server component | `bool` | `true` | no | +| should\_deploy\_redis | Whether to deploy Azure Managed Redis component | `bool` | `true` | no | +| should\_enable\_aml\_diagnostic\_logs | Whether to 
enable AML workspace diagnostic logs in Log Analytics | `bool` | `false` | no | +| should\_enable\_converted\_datasets\_lifecycle\_policy | Whether to enable lifecycle policy for converted LeRobot datasets (auto-tier to cool storage) | `bool` | `true` | no | +| should\_enable\_microsoft\_defender | Whether to enable Microsoft Defender for Containers on the AKS cluster | `bool` | `false` | no | +| should\_enable\_nat\_gateway | Whether to deploy NAT Gateway for explicit outbound connectivity. When true, subnets use NAT Gateway; when false, subnets use Azure default outbound access | `bool` | `true` | no | +| should\_enable\_private\_aks\_cluster | Whether the AKS cluster API endpoint is private. When true, requires VPN for kubectl access. Can be set independently from should\_enable\_private\_endpoint to allow private Azure services with a public AKS control plane. | `bool` | `true` | no | +| should\_enable\_private\_endpoint | Whether to enable private endpoints across resources for secure connectivity | `bool` | `true` | no | +| should\_enable\_public\_network\_access | Whether to enable public network access to the Azure ML workspace | `bool` | `true` | no | +| should\_enable\_purge\_protection | Whether to enable purge protection on Key Vault. Set to false for dev/test to allow easy cleanup. WARNING: Once enabled, purge protection cannot be disabled | `bool` | `false` | no | +| should\_enable\_raw\_bags\_lifecycle\_policy | Whether to enable lifecycle policy for raw ROS bags (auto-delete after retention period) | `bool` | `true` | no | +| should\_enable\_redis\_high\_availability | Enable high availability for Redis. 
Increases cost but provides zone redundancy | `bool` | `false` | no | +| should\_enable\_reports\_lifecycle\_policy | Whether to enable lifecycle policy for validation reports (auto-tier to cool then archive) | `bool` | `true` | no | +| should\_enable\_system\_node\_pool\_auto\_scaling | Enable auto-scaling for the AKS system node pool | `bool` | `false` | no | +| should\_include\_aks\_dns\_zone | Whether to include the AKS private DNS zone in core DNS zones | `bool` | `true` | no | +| subnet\_address\_prefixes\_aks | Address prefixes for the AKS subnet | `list(string)` | ```[ "10.0.5.0/24" ]``` | no | +| subnet\_address\_prefixes\_aks\_pod | Address prefixes for the AKS pod subnet | `list(string)` | ```[ "10.0.6.0/24" ]``` | no | +| system\_node\_pool\_max\_count | Maximum node count for AKS system node pool when auto-scaling is enabled (0-1000) | `number` | `null` | no | +| system\_node\_pool\_min\_count | Minimum node count for AKS system node pool when auto-scaling is enabled (0-1000) | `number` | `null` | no | +| system\_node\_pool\_node\_count | Number of nodes for the AKS system node pool | `number` | `1` | no | +| system\_node\_pool\_vm\_size | VM size for the AKS system node pool | `string` | `"Standard_D8ds_v5"` | no | +| system\_node\_pool\_zones | Availability zones for AKS system node pool. Set to null or empty for regional deployment (no zone constraint) | `list(string)` | `null` | no | +| tags | Tags to apply to all resources | `map(string)` | `{}` | no | +| virtual\_network\_config | Configuration for the virtual network including address space and subnet prefixes. PE subnet prefix is required when private endpoints are enabled. 
Resolver subnet enables DNS resolution for VPN clients and on-premises networks | ```object({ address_space = string subnet_address_prefix = string subnet_address_prefix_vm = optional(string, "10.0.4.0/24") subnet_address_prefix_pe = optional(string, "10.0.2.0/24") subnet_address_prefix_resolver = optional(string, "10.0.9.0/28") })``` | ```{ "address_space": "10.0.0.0/16", "subnet_address_prefix": "10.0.1.0/24", "subnet_address_prefix_pe": "10.0.2.0/24", "subnet_address_prefix_resolver": "10.0.9.0/28", "subnet_address_prefix_vm": "10.0.4.0/24" }``` | no | ## Outputs @@ -111,6 +120,7 @@ Architecture: | application\_insights | Application Insights for application telemetry. | | azureml\_workspace | Azure ML workspace for ML workloads. | | container\_registry | Azure Container Registry for container images. | +| data\_lake\_storage\_account | Data lake storage account for domain data. Null when data lake is disabled. | | dns\_server\_ip | The IP address to use as DNS server for VPN clients or on-premises DNS forwarding. | | gpu\_node\_pool\_subnets | GPU node pool subnets created by SiL module. | | grafana | Azure Managed Grafana for dashboards. | @@ -119,6 +129,7 @@ Architecture: | log\_analytics\_workspace | Log Analytics Workspace for centralized logging. | | managed\_redis\_connection\_info | Redis connection information for OSMO control plane. | | ml\_workload\_identity | ML workload identity for federated credentials. | +| network\_security\_group | Shared network security group for robotics infrastructure. | | node\_pools | GPU node pool configurations for OSMO pool and pod template generation | | osmo\_workload\_identity | OSMO workload identity for deployment scripts | | postgresql | PostgreSQL Flexible Server object. | @@ -129,6 +140,7 @@ Architecture: | storage\_account | Storage account for ML workspace and general storage. | | subnets | Subnet details from platform module. | | virtual\_network | Virtual network for robotics infrastructure. 
|
+| vm\_subnet | Dedicated VM subnet. Null when should\_create\_vm\_subnet is false. |
diff --git a/infrastructure/terraform/automation/TERRAFORM.md b/infrastructure/terraform/automation/TERRAFORM.md
index 42bf3dba..93611a4f 100644
--- a/infrastructure/terraform/automation/TERRAFORM.md
+++ b/infrastructure/terraform/automation/TERRAFORM.md
@@ -2,7 +2,7 @@
 title: Azure Automation Standalone Configuration
 description: Deploys Azure Automation Account with scheduled runbook to start AKS cluster and PostgreSQL server every morning. Uses data sources to reference existing platform infrastructure.
 author: Microsoft Robotics-AI Team
-ms.date: 2026-03-25
+ms.date: 2026-04-08
 ms.topic: reference
 ---

diff --git a/infrastructure/terraform/dns/TERRAFORM.md b/infrastructure/terraform/dns/TERRAFORM.md
index dfedc4bf..03317f8e 100644
--- a/infrastructure/terraform/dns/TERRAFORM.md
+++ b/infrastructure/terraform/dns/TERRAFORM.md
@@ -2,7 +2,7 @@
 title: Private DNS Zone for OSMO UI Service
 description: Creates a private DNS zone for internal resolution of the OSMO UI service running on an internal LoadBalancer within the AKS cluster.
 author: Microsoft Robotics-AI Team
-ms.date: 2026-03-25
+ms.date: 2026-04-08
 ms.topic: reference
 ---

diff --git a/infrastructure/terraform/main.tf b/infrastructure/terraform/main.tf
index 415b820c..134aa9eb 100644
--- a/infrastructure/terraform/main.tf
+++ b/infrastructure/terraform/main.tf
@@ -95,6 +95,7 @@ module "platform" {
   should_add_current_user_key_vault_admin = var.should_add_current_user_key_vault_admin
   should_add_current_user_storage_blob = var.should_add_current_user_storage_blob
   should_enable_purge_protection = var.should_enable_purge_protection
+  should_create_data_lake_storage = var.should_create_data_lake_storage

   // Storage lifecycle management
   should_enable_raw_bags_lifecycle_policy = var.should_enable_raw_bags_lifecycle_policy
diff --git a/infrastructure/terraform/modules/automation/TERRAFORM.md b/infrastructure/terraform/modules/automation/TERRAFORM.md
index fc298b68..7520c5c1 100644
--- a/infrastructure/terraform/modules/automation/TERRAFORM.md
+++ b/infrastructure/terraform/modules/automation/TERRAFORM.md
@@ -2,7 +2,7 @@
 title: Azure Automation Module
 description: Creates an Azure Automation Account with a scheduled PowerShell runbook for automated startup of AKS clusters and PostgreSQL servers.
 author: Microsoft Robotics-AI Team
-ms.date: 2026-03-25
+ms.date: 2026-04-08
 ms.topic: reference
 ---

@@ -41,15 +41,15 @@ for automated startup of AKS clusters and PostgreSQL servers.
| Name | Description | Type | Default | Required | |-----------------------|--------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|:--------:| | aks\_cluster | AKS cluster object containing id and name for startup and RBAC assignment | ```object({ id = string name = string })``` | n/a | yes | +| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | +| location | Location for all resources in this module | `string` | n/a | yes | | resource\_group | Resource group object containing name, id, and location | ```object({ id = string name = string location = string })``` | n/a | yes | | resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | | runbook\_script\_path | Path to PowerShell runbook script file | `string` | n/a | yes | -| environment | Environment for all resources in this module: dev, test, or prod | `string` | `"dev"` | no | | instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | -| location | Location for all resources in this module | `string` | `null` | no | | postgresql\_server | PostgreSQL server object containing id and name for startup and RBAC assignment (null to skip) | ```object({ id = string name = string })``` | `null` | no | | schedule\_config | Schedule configuration for startup runbook including start time (HH:MM), week days, and timezone | ```object({ start_time = string week_days = list(string) timezone = string })``` | ```{ "start_time": "08:00", "timezone": "UTC", "week_days": [ "Monday", "Tuesday", "Wednesday", "Thursday", "Friday" ] }``` | no | -| tags | Tags to apply to all resources | `map(string)` | `{}` | no | +| tags | Tags to apply to all resources created by this 
module | `map(string)` | `{}` | no | ## Outputs diff --git a/infrastructure/terraform/modules/dataviewer/TERRAFORM.md b/infrastructure/terraform/modules/dataviewer/TERRAFORM.md index 81cb9abb..b42a0f65 100644 --- a/infrastructure/terraform/modules/dataviewer/TERRAFORM.md +++ b/infrastructure/terraform/modules/dataviewer/TERRAFORM.md @@ -2,7 +2,7 @@ title: Dataviewer Module description: Deploys the dataviewer application on Azure Container Apps with networking, identity, and app-level resources. author: Microsoft Robotics-AI Team -ms.date: 2026-03-25 +ms.date: 2026-04-08 ms.topic: reference --- @@ -48,6 +48,7 @@ Supports internal (VNet/VPN) and external (public) deployment modes. | [azurerm_private_dns_zone.container_apps](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_dns_zone) | resource | | [azurerm_private_dns_zone_virtual_network_link.container_apps](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_dns_zone_virtual_network_link) | resource | | [azurerm_role_assignment.dataviewer_acr_pull](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_role_assignment.dataviewer_data_lake_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.dataviewer_storage_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_subnet.container_apps](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet) | resource | | [azurerm_subnet_nat_gateway_association.container_apps](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_nat_gateway_association) | resource | @@ -75,6 +76,7 @@ Supports internal (VNet/VPN) and external (public) deployment modes. 
| backend\_cpu | CPU allocation for the backend container | `number` | `0.5` | no | | backend\_image | Full image reference for the backend container (e.g., acr.azurecr.io/dataviewer-backend:latest). Leave empty to use a placeholder for initial IaC provisioning | `string` | `""` | no | | backend\_memory | Memory allocation for the backend container | `string` | `"1Gi"` | no | +| data\_lake\_storage\_account | Data lake storage account from platform module. Null when data lake is disabled | ```object({ id = string name = string })``` | `null` | no | | dataviewer\_redirect\_uris | SPA redirect URIs for MSAL.js authentication (local development) | `list(string)` | ```[ "http://localhost:5173/", "http://localhost:5174/" ]``` | no | | frontend\_cpu | CPU allocation for the frontend container | `number` | `0.25` | no | | frontend\_image | Full image reference for the frontend container (e.g., acr.azurecr.io/dataviewer-frontend:latest). Leave empty to use a placeholder for initial IaC provisioning | `string` | `""` | no | diff --git a/infrastructure/terraform/modules/dataviewer/role-assignments.tf b/infrastructure/terraform/modules/dataviewer/role-assignments.tf index fec4e190..b8df5ec1 100644 --- a/infrastructure/terraform/modules/dataviewer/role-assignments.tf +++ b/infrastructure/terraform/modules/dataviewer/role-assignments.tf @@ -27,3 +27,16 @@ resource "azurerm_role_assignment" "dataviewer_storage_blob" { principal_id = azurerm_user_assigned_identity.dataviewer.principal_id skip_service_principal_aad_check = true } + +// ============================================================ +// Data Lake Storage Role Assignments +// ============================================================ + +resource "azurerm_role_assignment" "dataviewer_data_lake_blob" { + count = var.data_lake_storage_account != null ? 
1 : 0
+
+  scope = var.data_lake_storage_account.id
+  role_definition_name = "Storage Blob Data Contributor"
+  principal_id = azurerm_user_assigned_identity.dataviewer.principal_id
+  skip_service_principal_aad_check = true
+}
diff --git a/infrastructure/terraform/modules/dataviewer/variables.deps.tf b/infrastructure/terraform/modules/dataviewer/variables.deps.tf
index 3d19922d..76fc34d9 100644
--- a/infrastructure/terraform/modules/dataviewer/variables.deps.tf
+++ b/infrastructure/terraform/modules/dataviewer/variables.deps.tf
@@ -55,3 +55,12 @@ variable "storage_account" {
   })
   description = "Storage account from platform module"
 }
+
+variable "data_lake_storage_account" {
+  type = object({
+    id = string
+    name = string
+  })
+  description = "Data lake storage account from platform module. Null when data lake is disabled"
+  default = null
+}
diff --git a/infrastructure/terraform/modules/platform/TERRAFORM.md b/infrastructure/terraform/modules/platform/TERRAFORM.md
index 39166704..1b4e23ef 100644
--- a/infrastructure/terraform/modules/platform/TERRAFORM.md
+++ b/infrastructure/terraform/modules/platform/TERRAFORM.md
@@ -2,7 +2,7 @@
 title: Platform Module
 description: Deploys shared Azure infrastructure services for robotics ML workloads. Resources include: networking, DNS zones, security, observability, ACR, storage, ML workspace. Optional: PostgreSQL and Redis for OSMO workloads.
 author: Microsoft Robotics-AI Team
-ms.date: 2026-03-25
+ms.date: 2026-04-08
 ms.topic: reference
 ---

@@ -34,17 +34,17 @@ Optional: PostgreSQL and Redis for OSMO workloads.
| Name | Type | |--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------| | [azapi_resource.ml_workspace](https://registry.terraform.io/providers/Azure/azapi/latest/docs/resources/resource) | resource | +| [azapi_resource.postgresql_password](https://registry.terraform.io/providers/Azure/azapi/latest/docs/resources/resource) | resource | | [azurerm_application_insights.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/application_insights) | resource | | [azurerm_container_registry.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/container_registry) | resource | | [azurerm_dashboard_grafana.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/dashboard_grafana) | resource | | [azurerm_key_vault.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/key_vault) | resource | -| [azurerm_key_vault_secret.postgresql_password](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/key_vault_secret) | resource | | [azurerm_key_vault_secret.redis_primary_key](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/key_vault_secret) | resource | | [azurerm_log_analytics_workspace.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/log_analytics_workspace) | resource | | [azurerm_machine_learning_compute_cluster.gpu](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/machine_learning_compute_cluster) | resource | | [azurerm_managed_redis.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/managed_redis) | resource | | 
[azurerm_monitor_data_collection_endpoint.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_data_collection_endpoint) | resource | -| [azurerm_monitor_diagnostic_setting.ml_workspace_logs](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_diagnostic_setting) | resource | +| [azurerm_monitor_diagnostic_setting.ml_workspace_logs](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_diagnostic_setting) | resource | | [azurerm_monitor_private_link_scope.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_private_link_scope) | resource | | [azurerm_monitor_private_link_scoped_service.ai](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_private_link_scoped_service) | resource | | [azurerm_monitor_private_link_scoped_service.dce](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/monitor_private_link_scoped_service) | resource | @@ -66,6 +66,8 @@ Optional: PostgreSQL and Redis for OSMO workloads. 
| [azurerm_private_dns_zone_virtual_network_link.redis](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_dns_zone_virtual_network_link) | resource | | [azurerm_private_endpoint.acr](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | | [azurerm_private_endpoint.azureml_api](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | +| [azurerm_private_endpoint.data_lake_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | +| [azurerm_private_endpoint.data_lake_dfs](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | | [azurerm_private_endpoint.key_vault](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | | [azurerm_private_endpoint.monitor](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | | [azurerm_private_endpoint.postgresql](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/private_endpoint) | resource | @@ -78,24 +80,37 @@ Optional: PostgreSQL and Redis for OSMO workloads. 
| [azurerm_role_assignment.ml_acr_pull](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_acr_push](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_appinsights_publisher](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_role_assignment.ml_data_lake_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_kv_user](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_rg_contributor](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_storage_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.ml_storage_file](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_role_assignment.ml_storage_file_privileged](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.osmo_acr_pull](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_role_assignment.osmo_data_lake_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.osmo_kv_secrets_user](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | 
[azurerm_role_assignment.osmo_ml_data_scientist](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.osmo_storage_blob_contributor](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_role_assignment.user_data_lake_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.user_kv_officer](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | | [azurerm_role_assignment.user_storage_blob](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/role_assignment) | resource | +| [azurerm_storage_account.data_lake](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_account) | resource | | [azurerm_storage_account.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_account) | resource | +| [azurerm_storage_container.datasets](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_container) | resource | +| [azurerm_storage_container.evaluation](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_container) | resource | | [azurerm_storage_container.ml_workspace](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_container) | resource | +| [azurerm_storage_container.models](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_container) | resource | +| [azurerm_storage_management_policy.data_lake](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy) | resource | +| 
[azurerm_storage_management_policy.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/storage_management_policy) | resource | | [azurerm_subnet.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet) | resource | | [azurerm_subnet.private_endpoints](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet) | resource | | [azurerm_subnet.resolver](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet) | resource | +| [azurerm_subnet.vm_subnet](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet) | resource | | [azurerm_subnet_nat_gateway_association.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_nat_gateway_association) | resource | +| [azurerm_subnet_nat_gateway_association.vm_subnet](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_nat_gateway_association) | resource | | [azurerm_subnet_network_security_group_association.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_network_security_group_association) | resource | | [azurerm_subnet_network_security_group_association.private_endpoints](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_network_security_group_association) | resource | +| [azurerm_subnet_network_security_group_association.vm_subnet](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/subnet_network_security_group_association) | resource | | [azurerm_user_assigned_identity.ml](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/user_assigned_identity) | resource | | [azurerm_user_assigned_identity.osmo](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/user_assigned_identity) | resource | | 
[azurerm_virtual_network.main](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/virtual_network) | resource | @@ -105,64 +120,76 @@ Optional: PostgreSQL and Redis for OSMO workloads. ## Inputs -| Name | Description | Type | Default | Required | -|-----------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| -| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | -| location | Location for all resources in this module | `string` | n/a | yes | -| resource\_group | Resource group object containing name, id, and location | ```object({ id = string name = string location = string })``` | n/a | yes | -| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | -| aml\_compute\_config | AzureML managed compute cluster configuration including VM size, priority, scaling, and optional subnet placement | ```object({ vm_size = string vm_priority = string min_node_count = number max_node_count = number scale_down_after_idle = optional(string, "PT5M") cluster_name = optional(string, "gpu-cluster") subnet_id = optional(string) })``` | ```{ "cluster_name": "gpu-cluster", "max_node_count": 1, "min_node_count": 0, "scale_down_after_idle": "PT5M", "subnet_id": 
null, "vm_priority": "LowPriority", "vm_size": "Standard_NC4as_T4_v3" }``` | no | -| current\_user\_oid | Object ID of the current user for role assignments. Obtained via Microsoft Graph to avoid constant updates from azurerm\_client\_config | `string` | `null` | no | -| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | -| postgresql\_config | PostgreSQL configuration for OSMO including location, SKU, storage, zone, HA settings, and database definitions | ```object({ location = string sku_name = string storage_mb = number version = string databases = map(object({ collation = string, charset = string })) zone = optional(string) should_enable_high_availability = optional(bool, false) standby_availability_zone = optional(string) })``` | ```{ "databases": { "osmo": { "charset": "utf8", "collation": "en_US.utf8" } }, "location": "westus3", "should_enable_high_availability": false, "sku_name": "GP_Standard_D2s_v3", "standby_availability_zone": null, "storage_mb": 32768, "version": "16", "zone": null }``` | no | -| redis\_config | Redis configuration for OSMO including SKU, clustering policy, and HA settings. 
EnterpriseCluster recommended for clients that don't support Redis Cluster MOVED redirects | ```object({ sku_name = string clustering_policy = string should_enable_high_availability = optional(bool, false) })``` | ```{ "clustering_policy": "EnterpriseCluster", "should_enable_high_availability": false, "sku_name": "Balanced_B10" }``` | no | -| should\_add\_current\_user\_key\_vault\_admin | Whether to add the current user as Key Vault Secrets Officer | `bool` | `true` | no | -| should\_add\_current\_user\_storage\_blob | Whether to add the current user as Storage Blob Data Contributor | `bool` | `true` | no | -| should\_deploy\_aml\_compute | Whether to deploy an AzureML managed compute cluster for GPU workloads | `bool` | `false` | no | -| should\_enable\_aml\_diagnostic\_logs | Whether to enable AML workspace diagnostic logs in Log Analytics | `bool` | `false` | no | -| should\_deploy\_ampls | Whether to deploy Azure Monitor Private Link Scope and its private endpoint | `bool` | `true` | no | -| should\_deploy\_dce | Whether to deploy Data Collection Endpoint for observability | `bool` | `true` | no | -| should\_deploy\_grafana | Whether to deploy Azure Managed Grafana dashboard | `bool` | `true` | no | -| should\_deploy\_monitor\_workspace | Whether to deploy Azure Monitor Workspace for Prometheus metrics | `bool` | `true` | no | -| should\_deploy\_postgresql | Whether to deploy PostgreSQL for OSMO backend | `bool` | `false` | no | -| should\_deploy\_redis | Whether to deploy Azure Managed Redis for OSMO | `bool` | `false` | no | -| should\_enable\_nat\_gateway | Whether to deploy NAT Gateway for explicit outbound connectivity. 
When true, subnets use NAT Gateway; when false, subnets use Azure default outbound access | `bool` | `true` | no | -| should\_enable\_osmo\_identity | Whether to create a managed identity for OSMO workload identity authentication | `bool` | `true` | no | -| should\_enable\_private\_endpoint | Whether to enable private endpoints for all services | `bool` | `true` | no | -| should\_enable\_public\_network\_access | Whether to allow public network access (set to true for dev/test) | `bool` | `false` | no | -| should\_enable\_purge\_protection | Whether to enable purge protection on Key Vault. Set to false for dev/test to allow easy cleanup. WARNING: Once enabled, purge protection cannot be disabled | `bool` | `false` | no | -| should\_enable\_storage\_shared\_access\_key | Whether to enable Shared Key (SAS token) authorization for the storage account. When false, all requests must use Azure AD authentication | `bool` | `false` | no | -| should\_include\_aks\_dns\_zone | Whether to include the AKS private DNS zone in core DNS zones | `bool` | `true` | no | -| virtual\_network\_config | Virtual network address configuration including address space and subnet prefixes. 
PE and resolver subnet prefixes are only used when should\_enable\_private\_endpoint is true | ```object({ address_space = string subnet_address_prefix_main = string subnet_address_prefix_pe = optional(string) subnet_address_prefix_resolver = optional(string) })``` | ```{ "address_space": "10.0.0.0/16", "subnet_address_prefix_main": "10.0.1.0/24", "subnet_address_prefix_pe": "10.0.2.0/24", "subnet_address_prefix_resolver": "10.0.9.0/28" }``` | no | +| Name | Description | Type | Default | Required | +|--------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | +| location | Location for all resources in this module | `string` | n/a | yes | +| resource\_group | Resource group object containing name, id, and location | ```object({ id = string name = string location = string })``` | n/a | yes | +| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | +| aml\_compute\_config | AzureML managed compute cluster configuration including VM size, priority, scaling, and optional subnet placement | ```object({ vm_size = string vm_priority = string min_node_count = number max_node_count = number 
scale_down_after_idle = optional(string, "PT5M") cluster_name = optional(string, "gpu-cluster") subnet_id = optional(string) })``` | ```{ "cluster_name": "gpu-cluster", "max_node_count": 1, "min_node_count": 0, "scale_down_after_idle": "PT5M", "subnet_id": null, "vm_priority": "LowPriority", "vm_size": "Standard_NC4as_T4_v3" }``` | no | +| converted\_datasets\_cool\_tier\_days | Number of days before tiering converted datasets to cool storage. Set to -1 to disable tiering | `number` | `90` | no | +| current\_user\_oid | Object ID of the current user for role assignments. Obtained via Microsoft Graph to avoid constant updates from azurerm\_client\_config | `string` | `null` | no | +| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | +| nat\_gateway\_zones | Availability zones for NAT Gateway and its public IP. Leave empty for regions without AZ support | `list(string)` | ```[ "1" ]``` | no | +| postgresql\_config | PostgreSQL configuration for OSMO including location, SKU, storage, zone, HA settings, and database definitions | ```object({ location = string sku_name = string storage_mb = number version = string databases = map(object({ collation = string, charset = string })) zone = optional(string) should_enable_high_availability = optional(bool, false) standby_availability_zone = optional(string) })``` | ```{ "databases": { "osmo": { "charset": "utf8", "collation": "en_US.utf8" } }, "location": "westus3", "should_enable_high_availability": false, "sku_name": "GP_Standard_D2s_v3", "standby_availability_zone": null, "storage_mb": 32768, "version": "16", "zone": null }``` | no | +| raw\_bags\_retention\_days | Number of days to retain raw ROS bags before automatic deletion. Set to -1 to disable deletion | `number` | `30` | no | +| redis\_config | Redis configuration for OSMO including SKU, clustering policy, and HA settings. 
EnterpriseCluster recommended for clients that don't support Redis Cluster MOVED redirects | ```object({ sku_name = string clustering_policy = string should_enable_high_availability = optional(bool, false) })``` | ```{ "clustering_policy": "EnterpriseCluster", "should_enable_high_availability": false, "sku_name": "Balanced_B10" }``` | no | +| reports\_archive\_tier\_days | Number of days before tiering validation reports to archive storage. Must be greater than reports\_cool\_tier\_days | `number` | `180` | no | +| reports\_cool\_tier\_days | Number of days before tiering validation reports to cool storage | `number` | `30` | no | +| should\_add\_current\_user\_key\_vault\_admin | Whether to add the current user as Key Vault Secrets Officer | `bool` | `true` | no | +| should\_add\_current\_user\_storage\_blob | Whether to add the current user as Storage Blob Data Contributor | `bool` | `true` | no | +| should\_create\_data\_lake\_storage | Whether to create a dedicated ADLS Gen2 storage account with hierarchical namespace for domain data (datasets, model checkpoints) | `bool` | `false` | no | +| should\_create\_vm\_subnet | Whether to create a dedicated subnet for virtual machines in the platform virtual network | `bool` | `false` | no | +| should\_deploy\_aml\_compute | Whether to deploy an AzureML managed compute cluster for GPU workloads | `bool` | `false` | no | +| should\_deploy\_ampls | Whether to deploy Azure Monitor Private Link Scope and its private endpoint | `bool` | `true` | no | +| should\_deploy\_dce | Whether to deploy Data Collection Endpoint for observability | `bool` | `true` | no | +| should\_deploy\_grafana | Whether to deploy Azure Managed Grafana dashboard | `bool` | `true` | no | +| should\_deploy\_monitor\_workspace | Whether to deploy Azure Monitor Workspace for Prometheus metrics | `bool` | `true` | no | +| should\_deploy\_postgresql | Whether to deploy PostgreSQL for OSMO backend | `bool` | `false` | no | +| should\_deploy\_redis | 
Whether to deploy Azure Managed Redis for OSMO | `bool` | `false` | no | +| should\_enable\_aml\_diagnostic\_logs | Whether to enable AML workspace diagnostic logs in Log Analytics | `bool` | `false` | no | +| should\_enable\_converted\_datasets\_lifecycle\_policy | Whether to enable lifecycle policy for converted LeRobot datasets (auto-tier to cool storage) | `bool` | `true` | no | +| should\_enable\_nat\_gateway | Whether to deploy NAT Gateway for explicit outbound connectivity. When true, subnets use NAT Gateway; when false, subnets use Azure default outbound access | `bool` | `true` | no | +| should\_enable\_osmo\_identity | Whether to create a managed identity for OSMO workload identity authentication | `bool` | `true` | no | +| should\_enable\_private\_endpoint | Whether to enable private endpoints for all services | `bool` | `true` | no | +| should\_enable\_public\_network\_access | Whether to allow public network access (set to true for dev/test) | `bool` | `false` | no | +| should\_enable\_purge\_protection | Whether to enable purge protection on Key Vault. Set to false for dev/test to allow easy cleanup. WARNING: Once enabled, purge protection cannot be disabled | `bool` | `false` | no | +| should\_enable\_raw\_bags\_lifecycle\_policy | Whether to enable lifecycle policy for raw ROS bags (auto-delete after retention period) | `bool` | `true` | no | +| should\_enable\_reports\_lifecycle\_policy | Whether to enable lifecycle policy for validation reports (auto-tier to cool then archive) | `bool` | `true` | no | +| should\_enable\_storage\_shared\_access\_key | Whether to enable Shared Key (SAS token) authorization for the storage account. When false, all requests must use Azure AD authentication | `bool` | `false` | no | +| should\_include\_aks\_dns\_zone | Whether to include the AKS private DNS zone in core DNS zones | `bool` | `true` | no | +| virtual\_network\_config | Virtual network address configuration including address space and subnet prefixes. 
PE and resolver subnet prefixes are only used when should\_enable\_private\_endpoint is true | ```object({ address_space = string subnet_address_prefix_main = string subnet_address_prefix_vm = optional(string) subnet_address_prefix_pe = optional(string) subnet_address_prefix_resolver = optional(string) })``` | ```{ "address_space": "10.0.0.0/16", "subnet_address_prefix_main": "10.0.1.0/24", "subnet_address_prefix_pe": "10.0.2.0/24", "subnet_address_prefix_resolver": "10.0.9.0/28", "subnet_address_prefix_vm": "10.0.4.0/24" }``` | no | ## Outputs -| Name | Description | -|----------------------------|----------------------------------------------------------------------------------------------------------------------------------| -| aml\_compute\_cluster | AzureML managed compute cluster. Null when compute deployment is disabled | -| application\_insights | Application Insights for telemetry | -| azureml\_workspace | ML workspace for AKS extension. | -| container\_registry | Container registry for SiL workloads | -| data\_collection\_endpoint | Data Collection Endpoint for observability. Null when DCE is disabled | -| dns\_server\_ip | The IP address to use as DNS server for VPN clients or on-premises DNS forwarding. Null when resolver not configured | -| grafana | Azure Managed Grafana dashboard. Null when Grafana is disabled | -| key\_vault | Key Vault for secrets management | -| log\_analytics\_workspace | Log Analytics workspace for AKS monitoring | -| ml\_workload\_identity | ML workload identity for FICs | -| monitor\_workspace | Azure Monitor workspace for Prometheus metrics. Null when monitor workspace is disabled | -| nat\_gateway | NAT Gateway for outbound connectivity. 
Null when NAT Gateway is disabled | -| network\_security\_group | NSG for SiL subnets | -| osmo\_workload\_identity | OSMO workload identity for federated credentials | -| postgresql | PostgreSQL Flexible Server for OSMO (if deployed) | -| postgresql\_secret\_name | Key Vault secret name containing PostgreSQL admin password | -| private\_dns\_resolver | Private DNS Resolver for resolving private DNS zones. Null when private endpoints are disabled or resolver subnet not configured | -| private\_dns\_zones | Private DNS zones for private endpoints | -| redis | Azure Managed Redis for OSMO (if deployed). | -| redis\_secret\_name | Key Vault secret name containing Redis primary access key | -| storage\_account | Storage account for ML workspace | -| storage\_account\_access | Storage account access credentials. Only populated when shared\_access\_key\_enabled is true | -| subnets | Subnets for SiL resources. Private endpoints subnet is null when private endpoints are disabled | -| virtual\_network | Virtual network for SiL AKS cluster | +| Name | Description | +|--------------------------------------|----------------------------------------------------------------------------------------------------------------------------------| +| aml\_compute\_cluster | AzureML managed compute cluster. Null when compute deployment is disabled | +| application\_insights | Application Insights for telemetry | +| azureml\_workspace | ML workspace for AKS extension. | +| container\_registry | Container registry for SiL workloads | +| data\_collection\_endpoint | Data Collection Endpoint for observability. Null when DCE is disabled | +| data\_lake\_storage\_account | Data lake storage account for domain data. Null when data lake is disabled | +| data\_lake\_storage\_account\_access | Data lake storage account access credentials. Null when data lake is disabled | +| dns\_server\_ip | The IP address to use as DNS server for VPN clients or on-premises DNS forwarding. 
Null when resolver not configured | +| grafana | Azure Managed Grafana dashboard. Null when Grafana is disabled | +| key\_vault | Key Vault for secrets management | +| log\_analytics\_workspace | Log Analytics workspace for AKS monitoring | +| ml\_workload\_identity | ML workload identity for FICs | +| monitor\_workspace | Azure Monitor workspace for Prometheus metrics. Null when monitor workspace is disabled | +| nat\_gateway | NAT Gateway for outbound connectivity. Null when NAT Gateway is disabled | +| network\_security\_group | NSG for SiL subnets | +| osmo\_workload\_identity | OSMO workload identity for federated credentials | +| postgresql | PostgreSQL Flexible Server for OSMO (if deployed) | +| postgresql\_secret\_name | Key Vault secret name containing PostgreSQL admin password | +| private\_dns\_resolver | Private DNS Resolver for resolving private DNS zones. Null when private endpoints are disabled or resolver subnet not configured | +| private\_dns\_zones | Private DNS zones for private endpoints | +| redis | Azure Managed Redis for OSMO (if deployed). | +| redis\_secret\_name | Key Vault secret name containing Redis primary access key | +| storage\_account | Storage account for ML workspace | +| storage\_account\_access | Storage account access credentials. Only populated when shared\_access\_key\_enabled is true | +| subnets | Subnets for SiL resources. 
Private endpoints subnet is null when private endpoints are disabled | +| virtual\_network | Virtual network for SiL AKS cluster | diff --git a/infrastructure/terraform/modules/platform/main.tf b/infrastructure/terraform/modules/platform/main.tf index 7cf30893..b716fad2 100644 --- a/infrastructure/terraform/modules/platform/main.tf +++ b/infrastructure/terraform/modules/platform/main.tf @@ -25,6 +25,7 @@ locals { key_vault = "privatelink.vaultcore.azure.net" storage_blob = "privatelink.blob.core.windows.net" storage_file = "privatelink.file.core.windows.net" + storage_dfs = "privatelink.dfs.core.windows.net" acr = "privatelink.azurecr.io" azureml_api = "privatelink.api.azureml.ms" azureml_notebooks = "privatelink.notebooks.azure.net" diff --git a/infrastructure/terraform/modules/platform/outputs.tf b/infrastructure/terraform/modules/platform/outputs.tf index 0b82724f..c53da826 100644 --- a/infrastructure/terraform/modules/platform/outputs.tf +++ b/infrastructure/terraform/modules/platform/outputs.tf @@ -157,6 +157,24 @@ output "storage_account_access" { sensitive = true } +output "data_lake_storage_account" { + description = "Data lake storage account for domain data. Null when data lake is disabled" + value = var.should_create_data_lake_storage ? { + id = azurerm_storage_account.data_lake[0].id + name = azurerm_storage_account.data_lake[0].name + } : null +} + +output "data_lake_storage_account_access" { + description = "Data lake storage account access credentials. Null when data lake is disabled" + value = var.should_create_data_lake_storage ? 
{
+    primary_blob_endpoint = azurerm_storage_account.data_lake[0].primary_blob_endpoint
+    primary_dfs_endpoint  = azurerm_storage_account.data_lake[0].primary_dfs_endpoint
+    primary_access_key    = azurerm_storage_account.data_lake[0].primary_access_key
+  } : null
+  sensitive = true
+}
+
 /*
  * AzureML Outputs
  */
diff --git a/infrastructure/terraform/modules/platform/role-assignments.tf b/infrastructure/terraform/modules/platform/role-assignments.tf
index 017c7910..d56583b9 100644
--- a/infrastructure/terraform/modules/platform/role-assignments.tf
+++ b/infrastructure/terraform/modules/platform/role-assignments.tf
@@ -78,6 +78,28 @@ resource "azurerm_role_assignment" "ml_storage_file_privileged" {
   principal_id = azurerm_user_assigned_identity.ml.principal_id
 }
 
+// ============================================================
+// Data Lake Storage Role Assignments
+// ============================================================
+
+// Grant current user Storage Blob Data Contributor on data lake
+resource "azurerm_role_assignment" "user_data_lake_blob" {
+  count = var.should_add_current_user_storage_blob && var.should_create_data_lake_storage ? 1 : 0
+
+  scope                = azurerm_storage_account.data_lake[0].id
+  role_definition_name = "Storage Blob Data Contributor"
+  principal_id         = var.current_user_oid
+}
+
+// Grant ML identity Storage Blob Data Contributor on data lake
+resource "azurerm_role_assignment" "ml_data_lake_blob" {
+  count = var.should_create_data_lake_storage ? 
1 : 0
+
+  scope                = azurerm_storage_account.data_lake[0].id
+  role_definition_name = "Storage Blob Data Contributor"
+  principal_id         = azurerm_user_assigned_identity.ml.principal_id
+}
+
 // ============================================================
 // OSMO Identity Role Assignments
 // ============================================================
@@ -90,6 +112,14 @@ resource "azurerm_role_assignment" "osmo_storage_blob_contributor" {
   principal_id = azurerm_user_assigned_identity.osmo[0].principal_id
 }
 
+// Grant OSMO identity Storage Blob Data Contributor on data lake
+resource "azurerm_role_assignment" "osmo_data_lake_blob" {
+  count                = var.should_enable_osmo_identity && var.should_create_data_lake_storage ? 1 : 0
+  scope                = azurerm_storage_account.data_lake[0].id
+  role_definition_name = "Storage Blob Data Contributor"
+  principal_id         = azurerm_user_assigned_identity.osmo[0].principal_id
+}
+
 // Grant OSMO identity AcrPull role for pulling container images
 resource "azurerm_role_assignment" "osmo_acr_pull" {
   count = var.should_enable_osmo_identity ? 
1 : 0 diff --git a/infrastructure/terraform/modules/platform/storage.tf b/infrastructure/terraform/modules/platform/storage.tf index 1a8748a8..413784e7 100644 --- a/infrastructure/terraform/modules/platform/storage.tf +++ b/infrastructure/terraform/modules/platform/storage.tf @@ -1,10 +1,10 @@ /** * # Storage Resources * - * This file creates the Storage Account for the Platform module including: - * - Storage Account for ML workspace and general purpose - * - Default container for ML workspace - * - Private endpoints for blob and file services + * This file creates storage infrastructure for the Platform module including: + * - Storage Account for ML workspace (system data, logs, snapshots) + * - Optional ADLS Gen2 Data Lake storage account for domain data (datasets, model checkpoints) + * - Storage containers, lifecycle policies, and private endpoints */ // ============================================================ @@ -54,19 +54,160 @@ resource "azurerm_storage_container" "ml_workspace" { } // ============================================================ -// Storage Lifecycle Management Policy +// Data Lake Storage Account (ADLS Gen2) // ============================================================ +resource "azurerm_storage_account" "data_lake" { + count = var.should_create_data_lake_storage ? 
1 : 0 + + name = "stdl${var.resource_prefix}${var.environment}${var.instance}" + location = var.resource_group.location + resource_group_name = var.resource_group.name + account_tier = "Standard" + account_replication_type = "LRS" + access_tier = "Hot" + min_tls_version = "TLS1_2" + is_hns_enabled = true + shared_access_key_enabled = var.should_enable_storage_shared_access_key + public_network_access_enabled = var.should_enable_public_network_access + allow_nested_items_to_be_public = false + + blob_properties { + delete_retention_policy { + days = 7 + } + + container_delete_retention_policy { + days = 7 + } + } + + lifecycle { + prevent_destroy = true + } +} + +// ============================================================ +// Data Lake Containers +// ============================================================ + +resource "azurerm_storage_container" "datasets" { + count = var.should_create_data_lake_storage ? 1 : 0 + + name = "datasets" + storage_account_id = azurerm_storage_account.data_lake[0].id + container_access_type = "private" + + lifecycle { + prevent_destroy = true + } +} + +resource "azurerm_storage_container" "models" { + count = var.should_create_data_lake_storage ? 1 : 0 + + name = "models" + storage_account_id = azurerm_storage_account.data_lake[0].id + container_access_type = "private" + + lifecycle { + prevent_destroy = true + } +} + +resource "azurerm_storage_container" "evaluation" { + count = var.should_create_data_lake_storage ? 1 : 0 + + name = "evaluation" + storage_account_id = azurerm_storage_account.data_lake[0].id + container_access_type = "private" + + lifecycle { + prevent_destroy = true + } +} + +// ============================================================ +// Storage Lifecycle Management Policy (ML storage fallback) +// ============================================================ +// Active when data lake is disabled — ensures existing deployments retain lifecycle +// cost controls. 
Removed when data lake is enabled (rules move to data lake account). + resource "azurerm_storage_management_policy" "main" { + count = var.should_create_data_lake_storage ? 0 : 1 + storage_account_id = azurerm_storage_account.main.id + rule { + name = "delete-raw-bags" + enabled = var.should_enable_raw_bags_lifecycle_policy + + filters { + prefix_match = ["ml-workspace/raw/"] + blob_types = ["blockBlob"] + } + + actions { + base_blob { + delete_after_days_since_modification_greater_than = var.raw_bags_retention_days + } + } + } + + rule { + name = "tier-converted-datasets-to-cool" + enabled = var.should_enable_converted_datasets_lifecycle_policy + + filters { + prefix_match = ["ml-workspace/converted/"] + blob_types = ["blockBlob"] + } + + actions { + base_blob { + tier_to_cool_after_days_since_modification_greater_than = var.converted_datasets_cool_tier_days + } + } + } + + rule { + name = "tier-reports-to-cool-then-archive" + enabled = var.should_enable_reports_lifecycle_policy + + filters { + prefix_match = ["ml-workspace/reports/"] + blob_types = ["blockBlob"] + } + + actions { + base_blob { + tier_to_cool_after_days_since_modification_greater_than = var.reports_cool_tier_days + tier_to_archive_after_days_since_modification_greater_than = var.reports_archive_tier_days + } + } + } + + lifecycle { + prevent_destroy = true + } +} + +// ============================================================ +// Data Lake Lifecycle Management Policy +// ============================================================ + +resource "azurerm_storage_management_policy" "data_lake" { + count = var.should_create_data_lake_storage ? 
1 : 0 + + storage_account_id = azurerm_storage_account.data_lake[0].id + // Rule 1: Delete raw ROS bags after retention period rule { name = "delete-raw-bags" enabled = var.should_enable_raw_bags_lifecycle_policy filters { - prefix_match = ["raw/"] + prefix_match = ["datasets/raw/"] blob_types = ["blockBlob"] } @@ -83,7 +224,7 @@ resource "azurerm_storage_management_policy" "main" { enabled = var.should_enable_converted_datasets_lifecycle_policy filters { - prefix_match = ["converted/"] + prefix_match = ["datasets/converted/"] blob_types = ["blockBlob"] } @@ -100,7 +241,7 @@ resource "azurerm_storage_management_policy" "main" { enabled = var.should_enable_reports_lifecycle_policy filters { - prefix_match = ["reports/"] + prefix_match = ["evaluation/reports/"] blob_types = ["blockBlob"] } @@ -112,8 +253,6 @@ resource "azurerm_storage_management_policy" "main" { } } - // Note: No lifecycle policy for checkpoints/ prefix — model checkpoints retained indefinitely in Hot tier - lifecycle { prevent_destroy = true } @@ -167,3 +306,51 @@ resource "azurerm_private_endpoint" "storage_file" { private_dns_zone_ids = [azurerm_private_dns_zone.core["storage_file"].id] } } + +// ============================================================ +// Data Lake Private Endpoints +// ============================================================ + +// Data Lake Blob Private Endpoint +resource "azurerm_private_endpoint" "data_lake_blob" { + count = var.should_create_data_lake_storage && local.pe_enabled ? 
1 : 0
+
+  name                = "pe-datalake-blob-${local.resource_name_suffix}"
+  location            = var.resource_group.location
+  resource_group_name = var.resource_group.name
+  subnet_id           = azurerm_subnet.private_endpoints[0].id
+
+  private_service_connection {
+    name                           = "psc-datalake-blob-${local.resource_name_suffix}"
+    private_connection_resource_id = azurerm_storage_account.data_lake[0].id
+    subresource_names              = ["blob"]
+    is_manual_connection           = false
+  }
+
+  private_dns_zone_group {
+    name                 = "pdz-datalake-blob-${local.resource_name_suffix}"
+    private_dns_zone_ids = [azurerm_private_dns_zone.core["storage_blob"].id]
+  }
+}
+
+// Data Lake DFS Private Endpoint
+resource "azurerm_private_endpoint" "data_lake_dfs" {
+  count = var.should_create_data_lake_storage && local.pe_enabled ? 1 : 0
+
+  name                = "pe-datalake-dfs-${local.resource_name_suffix}"
+  location            = var.resource_group.location
+  resource_group_name = var.resource_group.name
+  subnet_id           = azurerm_subnet.private_endpoints[0].id
+
+  private_service_connection {
+    name                           = "psc-datalake-dfs-${local.resource_name_suffix}"
+    private_connection_resource_id = azurerm_storage_account.data_lake[0].id
+    subresource_names              = ["dfs"]
+    is_manual_connection           = false
+  }
+
+  private_dns_zone_group {
+    name                 = "pdz-datalake-dfs-${local.resource_name_suffix}"
+    private_dns_zone_ids = [azurerm_private_dns_zone.core["storage_dfs"].id]
+  }
+}
diff --git a/infrastructure/terraform/modules/platform/tests/conditionals.tftest.hcl b/infrastructure/terraform/modules/platform/tests/conditionals.tftest.hcl
index ed79b873..19669a1a 100644
--- a/infrastructure/terraform/modules/platform/tests/conditionals.tftest.hcl
+++ b/infrastructure/terraform/modules/platform/tests/conditionals.tftest.hcl
@@ -694,3 +694,80 @@ run "aml_compute_disabled" {
     error_message = "AML compute cluster should not be created when disabled"
   }
 }
+
+// ============================================================
+// Data Lake Storage Conditionals
+// 
============================================================
+
+run "data_lake_enabled" {
+  command = plan
+
+  variables {
+    resource_prefix                 = run.setup.resource_prefix
+    environment                     = run.setup.environment
+    instance                        = run.setup.instance
+    location                        = run.setup.location
+    resource_group                  = run.setup.resource_group
+    current_user_oid                = run.setup.current_user_oid
+    should_create_data_lake_storage = true
+  }
+
+  assert {
+    condition     = length(azurerm_storage_account.data_lake) == 1
+    error_message = "Data lake storage account should be created when enabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_container.datasets) == 1
+    error_message = "Datasets container should be created when data lake is enabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_container.models) == 1
+    error_message = "Models container should be created when data lake is enabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_container.evaluation) == 1
+    error_message = "Evaluation container should be created when data lake is enabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_management_policy.data_lake) == 1
+    error_message = "Data lake lifecycle policy should be created when data lake is enabled"
+  }
+}
+
+run "data_lake_disabled" {
+  command = plan
+
+  variables {
+    resource_prefix                 = run.setup.resource_prefix
+    environment                     = run.setup.environment
+    instance                        = run.setup.instance
+    location                        = run.setup.location
+    resource_group                  = run.setup.resource_group
+    current_user_oid                = run.setup.current_user_oid
+    should_create_data_lake_storage = false
+  }
+
+  assert {
+    condition     = length(azurerm_storage_account.data_lake) == 0
+    error_message = "Data lake storage account should not be created when disabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_container.datasets) == 0
+    error_message = "Datasets container should not exist when data lake is disabled"
+  }
+
+  assert {
+    condition     = length(azurerm_storage_container.models) == 0
+    error_message = 
"Models container should not exist when data lake is disabled" + } + + assert { + condition = length(azurerm_storage_container.evaluation) == 0 + error_message = "Evaluation container should not exist when data lake is disabled" + } +} diff --git a/infrastructure/terraform/modules/platform/tests/dns-zones.tftest.hcl b/infrastructure/terraform/modules/platform/tests/dns-zones.tftest.hcl index 4bce5824..7c1e3a3c 100644 --- a/infrastructure/terraform/modules/platform/tests/dns-zones.tftest.hcl +++ b/infrastructure/terraform/modules/platform/tests/dns-zones.tftest.hcl @@ -23,7 +23,7 @@ run "setup" { } } -// PE on + AKS zone on + AMPLS on = 6 base + 1 AKS + 4 monitor = 11 +// PE on + AKS zone on + AMPLS on = 7 base + 1 AKS + 4 monitor = 12 run "all_dns_zones" { command = plan @@ -40,12 +40,12 @@ run "all_dns_zones" { } assert { - condition = length(azurerm_private_dns_zone.core) == 11 - error_message = "Expected 11 DNS zones (6 base + 1 AKS + 4 monitor)" + condition = length(azurerm_private_dns_zone.core) == 12 + error_message = "Expected 12 DNS zones (7 base + 1 AKS + 4 monitor)" } } -// PE on + AKS zone off + AMPLS on = 6 base + 4 monitor = 10 +// PE on + AKS zone off + AMPLS on = 7 base + 4 monitor = 11 run "no_aks_zone" { command = plan @@ -62,12 +62,12 @@ run "no_aks_zone" { } assert { - condition = length(azurerm_private_dns_zone.core) == 10 - error_message = "Expected 10 DNS zones (6 base + 4 monitor, no AKS)" + condition = length(azurerm_private_dns_zone.core) == 11 + error_message = "Expected 11 DNS zones (7 base + 4 monitor, no AKS)" } } -// PE on + AKS zone on + AMPLS off = 6 base + 1 AKS = 7 +// PE on + AKS zone on + AMPLS off = 7 base + 1 AKS = 8 run "no_ampls_zones" { command = plan @@ -84,12 +84,12 @@ run "no_ampls_zones" { } assert { - condition = length(azurerm_private_dns_zone.core) == 7 - error_message = "Expected 7 DNS zones (6 base + 1 AKS, no AMPLS)" + condition = length(azurerm_private_dns_zone.core) == 8 + error_message = "Expected 8 DNS zones (7 
base + 1 AKS, no AMPLS)" } } -// PE on + AKS zone off + AMPLS off = 6 base only +// PE on + AKS zone off + AMPLS off = 7 base only run "base_zones_only" { command = plan @@ -106,8 +106,8 @@ run "base_zones_only" { } assert { - condition = length(azurerm_private_dns_zone.core) == 6 - error_message = "Expected 6 base DNS zones" + condition = length(azurerm_private_dns_zone.core) == 7 + error_message = "Expected 7 base DNS zones" } } diff --git a/infrastructure/terraform/modules/platform/tests/security.tftest.hcl b/infrastructure/terraform/modules/platform/tests/security.tftest.hcl index b2b4b23e..1f45fd8f 100644 --- a/infrastructure/terraform/modules/platform/tests/security.tftest.hcl +++ b/infrastructure/terraform/modules/platform/tests/security.tftest.hcl @@ -164,3 +164,50 @@ run "acr_security" { error_message = "ACR anonymous pull must be disabled" } } + +run "data_lake_security" { + command = plan + + variables { + resource_prefix = run.setup.resource_prefix + environment = run.setup.environment + instance = run.setup.instance + location = run.setup.location + resource_group = run.setup.resource_group + current_user_oid = run.setup.current_user_oid + should_create_data_lake_storage = true + } + + assert { + condition = azurerm_storage_account.data_lake[0].is_hns_enabled == true + error_message = "Data lake storage account must have hierarchical namespace enabled" + } + + assert { + condition = azurerm_storage_account.data_lake[0].min_tls_version == "TLS1_2" + error_message = "Data lake storage account must enforce TLS 1.2 minimum" + } + + assert { + condition = azurerm_storage_account.data_lake[0].allow_nested_items_to_be_public == false + error_message = "Data lake storage account must not allow public blob access" + } +} + +run "data_lake_disabled_by_default" { + command = plan + + variables { + resource_prefix = run.setup.resource_prefix + environment = run.setup.environment + instance = run.setup.instance + location = run.setup.location + resource_group = 
run.setup.resource_group + current_user_oid = run.setup.current_user_oid + } + + assert { + condition = length(azurerm_storage_account.data_lake) == 0 + error_message = "Data lake storage account should not exist when flag is false" + } +} diff --git a/infrastructure/terraform/modules/platform/variables.tf b/infrastructure/terraform/modules/platform/variables.tf index d30af25f..416a4a73 100644 --- a/infrastructure/terraform/modules/platform/variables.tf +++ b/infrastructure/terraform/modules/platform/variables.tf @@ -178,6 +178,12 @@ variable "should_enable_osmo_identity" { * Storage Variables */ +variable "should_create_data_lake_storage" { + type = bool + description = "Whether to create a dedicated ADLS Gen2 storage account with hierarchical namespace for domain data (datasets, model checkpoints)" + default = false +} + variable "should_enable_storage_shared_access_key" { type = bool description = "Whether to enable Shared Key (SAS token) authorization for the storage account. When false, all requests must use Azure AD authentication" diff --git a/infrastructure/terraform/modules/sil/TERRAFORM.md b/infrastructure/terraform/modules/sil/TERRAFORM.md index 1985dc93..7dbd7351 100644 --- a/infrastructure/terraform/modules/sil/TERRAFORM.md +++ b/infrastructure/terraform/modules/sil/TERRAFORM.md @@ -2,7 +2,7 @@ title: SiL Module (Software-in-the-Loop) description: Deploys AKS-specific infrastructure for robotics ML workloads with GPU node pools, AzureML integration, and observability. author: Microsoft Robotics-AI Team -ms.date: 2026-03-25 +ms.date: 2026-04-08 ms.topic: reference --- @@ -65,31 +65,33 @@ are created in the platform module and passed as dependencies. 
## Inputs -| Name | Description | Type | Default | Required | -|-----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| -| container\_registry | ACR from platform module | ```object({ id = string name = string login_server = string })``` | n/a | yes | -| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | -| location | Location for all resources in this module | `string` | n/a | yes | -| log\_analytics\_workspace | Log Analytics from platform module | ```object({ id = string workspace_id = string })``` | n/a | yes | -| nat\_gateway | NAT Gateway from platform module. 
Null when NAT Gateway is disabled | ```object({ id = string })``` | n/a | yes | -| network\_security\_group | NSG from platform module | ```object({ id = string })``` | n/a | yes | -| resource\_group | Resource group object containing name, id, and location | ```object({ id = string name = string location = string })``` | n/a | yes | -| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | -| subnets | Subnets from platform module. Private endpoints subnet is optional and only provided when private endpoints are enabled | ```object({ main = object({ id = string name = string }) private_endpoints = optional(object({ id = string name = string })) })``` | n/a | yes | -| virtual\_network | Virtual network from platform module | ```object({ id = string name = string })``` | n/a | yes | -| aks\_config | AKS cluster configuration for the system node pool | ```object({ system_node_pool_vm_size = string system_node_pool_node_count = number should_enable_system_node_pool_auto_scaling = bool system_node_pool_min_count = optional(number) system_node_pool_max_count = optional(number) should_enable_private_cluster = bool system_node_pool_zones = optional(list(string)) should_enable_microsoft_defender = optional(bool, false) })``` | ```{ "should_enable_private_cluster": true, "should_enable_system_node_pool_auto_scaling": false, "system_node_pool_max_count": null, "system_node_pool_min_count": null, "system_node_pool_node_count": 2, "system_node_pool_vm_size": "Standard_D8ds_v5", "system_node_pool_zones": null }``` | no | -| aks\_subnet\_config | AKS subnet address configuration for system node pool. When properties are null, defaults are used. Note: Pod subnets are not used with Azure CNI Overlay mode | ```object({ subnet_address_prefix_aks = optional(string, "10.0.5.0/24") })``` | `{}` | no | -| current\_user\_oid | Object ID of the current user for cluster admin role assignments. 
Obtained via Microsoft Graph to avoid constant updates from azurerm\_client\_config | `string` | `null` | no | -| data\_collection\_endpoint | Data Collection Endpoint from platform module. Null when DCE is disabled | ```object({ id = string })``` | `null` | no | -| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | -| monitor\_workspace | Azure Monitor workspace from platform module. Null when monitor workspace is disabled | ```object({ id = string })``` | `null` | no | -| node\_pools | Additional AKS node pools configuration. Map key is used as the node pool name. Note: Pod subnets are not used with Azure CNI Overlay mode | ```map(object({ vm_size = string node_count = optional(number, null) subnet_address_prefixes = list(string) node_taints = optional(list(string), []) node_labels = optional(map(string), {}) gpu_driver = optional(string) priority = optional(string, "Regular") should_enable_auto_scaling = optional(bool, false) min_count = optional(number, null) max_count = optional(number, null) zones = optional(list(string), null) eviction_policy = optional(string, "Deallocate") }))``` | ```{ "gpu": { "eviction_policy": "Delete", "gpu_driver": "Install", "max_count": 1, "min_count": 0, "node_count": null, "node_taints": [ "nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority = spot:NoSchedule" ], "priority": "Spot", "should_enable_auto_scaling": true, "subnet_address_prefixes": [ "10.0.16.0/24" ], "vm_size": "Standard_NV36ads_A10_v5", "zones": [] } }``` | no | -| osmo\_config | OSMO configuration for federated identity credentials | ```object({ should_federate_identity = bool control_plane_namespace = string operator_namespace = string workflows_namespace = string })``` | ```{ "control_plane_namespace": "osmo-control-plane", "operator_namespace": "osmo-operator", "should_federate_identity": false, "workflows_namespace": "osmo-workflows" }``` | no | -| osmo\_workload\_identity | OSMO workload identity from 
platform module for federated credential creation | ```object({ id = string principal_id = string client_id = string tenant_id = string })``` | `null` | no | -| private\_dns\_zones | Private DNS zones from platform module | ```map(object({ id = string name = string }))``` | `{}` | no | -| should\_assign\_cluster\_admin | Whether to assign Azure Kubernetes Cluster Admin Role to the current user | `bool` | `true` | no | -| should\_enable\_nat\_gateway | Whether NAT Gateway is enabled for outbound connectivity. When true, subnets disable default outbound access; when false, subnets use Azure default outbound access | `bool` | `true` | no | -| should\_enable\_private\_endpoint | Whether to enable private endpoints for AKS cluster | `bool` | `true` | no | +| Name | Description | Type | Default | Required | +|------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:| +| container\_registry | ACR from platform module | ```object({ id = string name = string login_server = 
string })``` | n/a | yes | +| environment | Environment for all resources in this module: dev, test, or prod | `string` | n/a | yes | +| location | Location for all resources in this module | `string` | n/a | yes | +| log\_analytics\_workspace | Log Analytics from platform module | ```object({ id = string workspace_id = string })``` | n/a | yes | +| nat\_gateway | NAT Gateway from platform module. Null when NAT Gateway is disabled | ```object({ id = string })``` | n/a | yes | +| network\_security\_group | NSG from platform module | ```object({ id = string })``` | n/a | yes | +| resource\_group | Resource group object containing name, id, and location | ```object({ id = string name = string location = string })``` | n/a | yes | +| resource\_prefix | Prefix for all resources in this module | `string` | n/a | yes | +| subnets | Subnets from platform module. Private endpoints subnet is optional and only provided when private endpoints are enabled | ```object({ main = object({ id = string name = string }) private_endpoints = optional(object({ id = string name = string })) })``` | n/a | yes | +| virtual\_network | Virtual network from platform module | ```object({ id = string name = string })``` | n/a | yes | +| aks\_config | AKS cluster configuration for the system node pool | ```object({ system_node_pool_vm_size = string system_node_pool_node_count = number should_enable_system_node_pool_auto_scaling = bool system_node_pool_min_count = optional(number) system_node_pool_max_count = optional(number) should_enable_private_cluster = bool system_node_pool_zones = optional(list(string)) should_enable_microsoft_defender = optional(bool, false) })``` | ```{ "should_enable_private_cluster": true, "should_enable_system_node_pool_auto_scaling": false, "system_node_pool_max_count": null, "system_node_pool_min_count": null, "system_node_pool_node_count": 2, "system_node_pool_vm_size": "Standard_D8ds_v5", "system_node_pool_zones": null }``` | no | +| aks\_subnet\_config | AKS subnet 
address configuration for system node pool. When properties are null, defaults are used. Note: Pod subnets are not used with Azure CNI Overlay mode | ```object({ subnet_address_prefix_aks = optional(string, "10.0.5.0/24") })``` | `{}` | no | +| current\_user\_oid | Object ID of the current user for cluster admin role assignments. Obtained via Microsoft Graph to avoid constant updates from azurerm\_client\_config | `string` | `null` | no | +| data\_collection\_endpoint | Data Collection Endpoint from platform module. Null when DCE is disabled | ```object({ id = string })``` | `null` | no | +| instance | Instance identifier for naming resources: 001, 002, etc | `string` | `"001"` | no | +| monitor\_workspace | Azure Monitor workspace from platform module. Null when monitor workspace is disabled | ```object({ id = string })``` | `null` | no | +| node\_pools | Additional AKS node pools configuration. Map key is used as the node pool name. Note: Pod subnets are not used with Azure CNI Overlay mode | ```map(object({ vm_size = string node_count = optional(number, null) subnet_address_prefixes = list(string) node_taints = optional(list(string), []) node_labels = optional(map(string), {}) gpu_driver = optional(string) priority = optional(string, "Regular") should_enable_auto_scaling = optional(bool, false) min_count = optional(number, null) max_count = optional(number, null) zones = optional(list(string), null) eviction_policy = optional(string, "Deallocate") }))``` | ```{ "gpu": { "eviction_policy": "Delete", "gpu_driver": "Install", "max_count": 1, "min_count": 0, "node_count": null, "node_taints": [ "nvidia.com/gpu:NoSchedule", "kubernetes.azure.com/scalesetpriority = spot:NoSchedule" ], "priority": "Spot", "should_enable_auto_scaling": true, "subnet_address_prefixes": [ "10.0.16.0/24" ], "vm_size": "Standard_NV36ads_A10_v5", "zones": [] } }``` | no | +| osmo\_config | OSMO configuration for federated identity credentials | ```object({ should_federate_identity = bool 
control_plane_namespace = string operator_namespace = string workflows_namespace = string })``` | ```{ "control_plane_namespace": "osmo-control-plane", "operator_namespace": "osmo-operator", "should_federate_identity": false, "workflows_namespace": "osmo-workflows" }``` | no | +| osmo\_workload\_identity | OSMO workload identity from platform module for federated credential creation | ```object({ id = string principal_id = string client_id = string tenant_id = string })``` | `null` | no | +| private\_dns\_zones | Private DNS zones from platform module | ```map(object({ id = string name = string }))``` | `{}` | no | +| should\_assign\_cluster\_admin | Whether to assign Azure Kubernetes Cluster Admin Role to the current user | `bool` | `true` | no | +| should\_deploy\_dce | Whether Data Collection Endpoint is enabled for AKS observability | `bool` | `true` | no | +| should\_deploy\_monitor\_workspace | Whether Azure Monitor Workspace is enabled for AKS observability | `bool` | `true` | no | +| should\_enable\_nat\_gateway | Whether NAT Gateway is enabled for outbound connectivity. When true, subnets disable default outbound access; when false, subnets use Azure default outbound access | `bool` | `true` | no | +| should\_enable\_private\_endpoint | Whether to enable private endpoints for AKS cluster | `bool` | `true` | no | ## Outputs diff --git a/infrastructure/terraform/modules/vpn/TERRAFORM.md b/infrastructure/terraform/modules/vpn/TERRAFORM.md index 9eb1990b..2cabd029 100644 --- a/infrastructure/terraform/modules/vpn/TERRAFORM.md +++ b/infrastructure/terraform/modules/vpn/TERRAFORM.md @@ -2,7 +2,7 @@ title: VPN Gateway Module description: Deploys Azure VPN Gateway for Point-to-Site and Site-to-Site connectivity. Creates GatewaySubnet within the platform's virtual network. author: Microsoft Robotics-AI Team -ms.date: 2026-03-25 +ms.date: 2026-04-08 ms.topic: reference --- @@ -49,8 +49,8 @@ Creates GatewaySubnet within the platform's virtual network. 
| root\_certificate\_name | Name for the root certificate used in P2S authentication | `string` | `"RoboticsVPNRootCert"` | no | | root\_certificate\_public\_data | Base64-encoded public certificate data for P2S authentication (without BEGIN/END markers) | `string` | `null` | no | | should\_enable\_nat\_gateway | Whether NAT Gateway is enabled for outbound connectivity. When true, disables default outbound access for GatewaySubnet | `bool` | `true` | no | -| tags | Tags to apply to all resources | `map(string)` | `{}` | no | -| vpn\_gateway\_config | VPN Gateway configuration including SKU, generation, and P2S client address pool | ```object({ sku = optional(string, "VpnGw1AZ") generation = optional(string, "Generation1") client_address_pool = optional(list(string), ["192.168.200.0/24"]) })``` | `{}` | no | +| tags | Tags to apply to all resources created by this module | `map(string)` | `{}` | no | +| vpn\_gateway\_config | VPN Gateway configuration including SKU, generation, P2S client address pool, and availability zones for the public IP | ```object({ sku = optional(string, "VpnGw1AZ") generation = optional(string, "Generation1") client_address_pool = optional(list(string), ["192.168.200.0/24"]) zones = optional(list(string), ["1", "2", "3"]) })``` | `{}` | no | | vpn\_site\_connections | Site-to-site VPN site definitions for connecting on-premises networks | ```list(object({ name = string address_spaces = list(string) shared_key_reference = string gateway_ip_address = optional(string) gateway_fqdn = optional(string) bgp_asn = optional(number) bgp_peering_address = optional(string) ike_protocol = optional(string, "IKEv2") }))``` | `[]` | no | | vpn\_site\_default\_ipsec\_policy | Default IPsec policy for all S2S connections | ```object({ dh_group = string ike_encryption = string ike_integrity = string ipsec_encryption = string ipsec_integrity = string pfs_group = string sa_datasize_kb = optional(number) sa_lifetime_seconds = optional(number) })``` | `null` | no 
| | vpn\_site\_shared\_keys | Pre-shared keys for S2S VPN connections indexed by shared\_key\_reference | `map(string)` | `{}` | no | diff --git a/infrastructure/terraform/outputs.tf b/infrastructure/terraform/outputs.tf index 7378a15f..64408ec4 100644 --- a/infrastructure/terraform/outputs.tf +++ b/infrastructure/terraform/outputs.tf @@ -144,6 +144,11 @@ output "storage_account" { value = module.platform.storage_account } +output "data_lake_storage_account" { + description = "Data lake storage account for domain data. Null when data lake is disabled." + value = module.platform.data_lake_storage_account +} + // ============================================================ // AzureML Compute Outputs // ============================================================ diff --git a/infrastructure/terraform/terraform.tfvars.example b/infrastructure/terraform/terraform.tfvars.example index 302ed123..9e2ef0b6 100644 --- a/infrastructure/terraform/terraform.tfvars.example +++ b/infrastructure/terraform/terraform.tfvars.example @@ -113,6 +113,10 @@ should_enable_microsoft_defender = true // Storage Lifecycle Management // Configure automatic deletion and tiering policies for blob storage +// Data Lake Storage (ADLS Gen2): Optional dedicated storage account with hierarchical namespace +// for domain data (datasets, model checkpoints). When enabled, lifecycle policies move to the data lake account. 
+// should_create_data_lake_storage = false + // Raw ROS bags: Auto-delete after 30 days (configurable) should_enable_raw_bags_lifecycle_policy = true raw_bags_retention_days = 30 # Set to -1 to disable auto-delete diff --git a/infrastructure/terraform/variables.tf b/infrastructure/terraform/variables.tf index d13b6a0f..1e4c4977 100644 --- a/infrastructure/terraform/variables.tf +++ b/infrastructure/terraform/variables.tf @@ -72,6 +72,12 @@ variable "should_enable_purge_protection" { * Storage Lifecycle Management */ +variable "should_create_data_lake_storage" { + type = bool + description = "Whether to create a dedicated ADLS Gen2 storage account with hierarchical namespace for domain data (datasets, model checkpoints)" + default = false +} + variable "should_enable_raw_bags_lifecycle_policy" { type = bool description = "Whether to enable lifecycle policy for raw ROS bags (auto-delete after retention period)" diff --git a/infrastructure/terraform/vpn/TERRAFORM.md b/infrastructure/terraform/vpn/TERRAFORM.md index bb61d1ae..79916ed3 100644 --- a/infrastructure/terraform/vpn/TERRAFORM.md +++ b/infrastructure/terraform/vpn/TERRAFORM.md @@ -2,7 +2,7 @@ title: VPN Gateway Standalone Configuration description: Deploys VPN Gateway for Point-to-Site and Site-to-Site connectivity using data sources to reference existing platform infrastructure. author: Microsoft Robotics-AI Team -ms.date: 2026-03-25 +ms.date: 2026-04-08 ms.topic: reference --- @@ -50,7 +50,7 @@ using data sources to reference existing platform infrastructure. 
| root\_certificate\_name | Name for the root certificate used in P2S authentication | `string` | `"RoboticsVPNRootCert"` | no | | root\_certificate\_public\_data | Base64-encoded public certificate data for P2S authentication (without BEGIN/END markers) | `string` | `null` | no | | virtual\_network\_name | Existing virtual network name (Otherwise 'vnet-{resource\_prefix}-{environment}-{instance}') | `string` | `null` | no | -| vpn\_gateway\_config | VPN Gateway configuration including SKU, generation, and P2S client address pool | ```object({ sku = optional(string, "VpnGw1AZ") generation = optional(string, "Generation1") client_address_pool = optional(list(string), ["192.168.200.0/24"]) })``` | `{}` | no | +| vpn\_gateway\_config | VPN Gateway configuration including SKU, generation, P2S client address pool, and availability zones for the public IP | ```object({ sku = optional(string, "VpnGw1AZ") generation = optional(string, "Generation1") client_address_pool = optional(list(string), ["192.168.200.0/24"]) zones = optional(list(string), ["1", "2", "3"]) })``` | `{}` | no | | vpn\_site\_connections | Site-to-site VPN site definitions for connecting on-premises networks | ```list(object({ name = string address_spaces = list(string) shared_key_reference = string gateway_ip_address = optional(string) gateway_fqdn = optional(string) bgp_asn = optional(number) bgp_peering_address = optional(string) ike_protocol = optional(string, "IKEv2") }))``` | `[]` | no | | vpn\_site\_default\_ipsec\_policy | Default IPsec policy for all S2S connections | ```object({ dh_group = string ike_encryption = string ike_integrity = string ipsec_encryption = string ipsec_integrity = string pfs_group = string sa_datasize_kb = optional(number) sa_lifetime_seconds = optional(number) })``` | `null` | no | | vpn\_site\_shared\_keys | Pre-shared keys for S2S VPN connections indexed by shared\_key\_reference | `map(string)` | `{}` | no |
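
The `vpn_gateway_config` row above gains a `zones` attribute in this change. As a minimal sketch, assuming the value is set in a `terraform.tfvars` file for this configuration, an explicit override could look like the following. Every value shown is the default taken from the type signature in the table, so this fragment should behave the same as leaving the variable at `{}`.

```hcl
// Illustrative override only; all values mirror the documented defaults.
vpn_gateway_config = {
  sku                 = "VpnGw1AZ"
  generation          = "Generation1"
  client_address_pool = ["192.168.200.0/24"]

  // New in this change: availability zones for the gateway public IP.
  zones = ["1", "2", "3"]
}
```

Because each attribute is declared with `optional(...)` and a default, it should be enough to set only the attributes that differ, for example `vpn_gateway_config = { zones = ["1"] }`.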