45 changes: 44 additions & 1 deletion releases/6.3.0.md
@@ -1,42 +1,80 @@
## Highlights

- **ConfigMap-based configuration** — All service configs (pools, backends, pod templates, roles, and more) can now be managed as Helm values via a Kubernetes ConfigMap, following standard K8s patterns and enabling GitOps workflows.
- **TLS support** — The service chart now terminates TLS at the gateway, with values for cert/key, redirect from HTTP, and SAN configuration.
- **Service chart consolidation** — The standalone `router` and `web-ui` Helm charts have been folded into the `service` chart, making a full deployment a single Helm release.
- **Multi-provider deploy scripts** — `deploy-k8s.sh` now provisions OSMO on Azure AKS, AWS EKS, microk8s, or any existing Kubernetes cluster, with idempotent installers for KAI Scheduler, GPU Operator, MinIO, and configurable storage backends (MinIO, Azure Blob, AWS S3, BYO S3).
- **Per-group timeouts** — `exec_timeout` and `queue_timeout` now meter each group independently instead of running against the workflow as a whole, so a stuck simulation group no longer kills the rest of the workflow.
- **Dataset CLI and API deprecated** — `osmo dataset` commands and the `/datasets` API endpoints are deprecated and will be removed in 6.4. Migrate to workflow-managed dataset outputs.
- **Rsync download support** — Pull files from running workflow tasks to your local machine with `osmo workflow rsync download`, complementing the existing upload capability.
- **Visual transfer progress** — File sync operations now display a progress bar showing bytes transferred, percentage, rate, and ETA.
- **Privilege escalation fix** — Policies with empty resources lists no longer grant access to resource-scoped endpoints.

## Breaking Changes

- **Router chart removed**: The standalone `router` Helm chart is gone. Router pods now deploy as part of the `service` chart. Existing router resources (`osmo-router`, `osmo-router-headless`) continue to work, but you must remove the separate router Helm release before upgrading. See the [6.2 to 6.3 upgrade guide](https://nvidia.github.io/OSMO/deployment_guide/upgrades/6_2_to_6_3/) for migration steps. (#897)
- **Web UI chart removed**: The standalone `web-ui` Helm chart has been merged into the `service` chart. Set `ui.enabled: true` in service values to deploy the UI alongside the API. Remove the separate `web-ui` release before upgrading. (#907)
- **Squid proxy removed from backend operator**: The egress allowlist and squid-proxy sidecar have been removed from the backend operator chart. Network policies now restrict pod-to-pod access directly. (#823)
- **Per-group timeout semantics**: `exec_timeout` and `queue_timeout` are now enforced per group (clock starts on the group's `RUNNING` or `SCHEDULING` transition) instead of per workflow. An expired group is marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; sibling groups continue and the workflow status aggregates only after all groups finish. (#925)
- **Dataset CLI and API deprecated**: All `osmo dataset` subcommands print a stderr deprecation warning, and the `/datasets` REST endpoints are marked deprecated in the OpenAPI schema. The Datasets page in the UI shows a deprecation banner. Both will be removed in 6.4. (#872)
- **S3 addressing default**: For S3-compatible backends with a custom `endpoint_url`, the addressing style now defaults to virtual-hosted instead of boto3's auto-selection (which picks path style for custom endpoints), fixing compatibility with providers that require virtual-hosted addressing. If a backend requires path addressing, set the `addressing_style` attribute to `path`, or force OSMO to always use path addressing via the `AWS_S3_FORCE_PATH_STYLE` environment variable. (#950)
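
The S3 addressing default above can be read as a small precedence rule. The following is a minimal sketch of that rule, not OSMO's actual code: the function name `resolve_addressing_style`, the set of values treated as "enabled" for `AWS_S3_FORCE_PATH_STYLE`, and the `auto` fallback when no custom endpoint is set are all assumptions made for illustration.

```python
import os


def resolve_addressing_style(endpoint_url, addressing_style=None, env=None):
    """Return "path", "virtual", or "auto" for an S3-compatible backend.

    Hypothetical helper: it mirrors the precedence described in the note,
    not the real OSMO implementation.
    """
    env = os.environ if env is None else env

    # Assumption: these strings count as "enabled"; real parsing may differ.
    forced = env.get("AWS_S3_FORCE_PATH_STYLE", "").strip().lower()
    if forced in {"1", "true", "yes"}:
        return "path"

    # An explicit per-backend attribute wins over the default.
    if addressing_style:
        return addressing_style

    # New 6.3 default: custom endpoints use virtual-hosted addressing.
    if endpoint_url:
        return "virtual"

    # Assumption: without a custom endpoint, leave the choice to boto3.
    return "auto"
```

A backend that only works with path addressing would pass `addressing_style="path"`, while exporting the environment variable applies path addressing to every backend.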

## Helm Charts

- **ConfigMap configuration mode**: Set `services.configs.enabled: true` to manage all service configs via Helm values. While this mode is active, config writes through the CLI or API return HTTP 409. The chart ships with default roles, pod templates, resource validations, backend, and pool. (#822)
- **ConfigMap mode for worker, agent, and logger**: The ConfigMapWatcher now runs in the worker, agent, and logger services. Previously only the API service watched the ConfigMap, so workflow pods built by the worker could be constructed from stale config. (#926)
- **TLS termination at the gateway**: Configure a serving cert/key, optional HTTP-to-HTTPS redirect, and SAN list via `gateway.tls`. The gateway template generates the matching Envoy listener config. (#953)
- **Gateway consolidation**: A unified gateway now handles load balancing for all service types (API, router, UI), simplifying ingress configuration. (#817, #799)
- **Gateway extension hooks**: Inject custom Envoy filters and additive auth-skip paths via `gateway.envoy.extensions` and `gateway.envoy.authSkipPaths`, useful for sidecar integrations and bypassing authz on specific endpoints. (#1009)
- **Default identity headers**: Minimal deployments can now inject default `x-osmo-user`, `x-osmo-roles`, and `x-osmo-allowed-pools` headers for unauthenticated browser requests via `gateway.envoy.defaultIdentity` values. (#902)
- **oauth2-proxy extraEnv**: Expose environment variables on the oauth2-proxy container via `gateway.oauth2Proxy.extraEnv`, needed for Redis AUTH when using session storage. (#898)
- **Custom HPA metrics**: Specify custom metrics for Horizontal Pod Autoscalers on service components. (#858)
- **Pool computed fields resolved at load time**: ConfigMap pools no longer require pre-expanded `parsed_pod_template` and `parsed_resource_validations`, reducing config file size by ~60%. (#866)
- **Per-field Secret mounts**: Create credential Secrets with `kubectl --from-literal` instead of packaging all fields into a single `cred.yaml`. (#884)
- **Default pod templates on default pool**: The chart's default pool now sets `common_pod_template`, so workflows submitted without an explicit template pick up `default_ctrl` and `default_user` automatically. (#1010, #1012)
- **Backend-operator startup probe configurable**: `startupProbe` thresholds on the backend listener and worker are now exposed in values, with relaxed defaults to handle slow image pulls on cold clusters. (#961)
- **Service startup probe extended**: The API service `startupProbe` failure threshold now allows up to ~2 minutes for migrations and DB warm-up before the pod is restarted. (#967)
- **podMonitor disabled by default**: Both the service and backend-operator charts now default `podMonitor.enabled` to `false`, avoiding errors on clusters without Prometheus Operator CRDs installed. (#962, #963)
- **Config export script**: New `deployments/upgrades/export_configs_to_helm.py` exports existing database configs to Helm values format. (#866)
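
To illustrate the write behavior in ConfigMap mode, here is a minimal sketch of a handler that rejects config writes with HTTP 409 when Helm values are the source of truth. The function `handle_config_write`, its arguments, and the in-memory `store` are hypothetical and stand in for whatever the service actually does.

```python
from http import HTTPStatus


def handle_config_write(configmap_mode_enabled, store, name, value):
    """Apply a config write, or refuse it when ConfigMap mode is active.

    Hypothetical handler: `store` is a plain dict standing in for the
    database-backed config store.
    """
    if configmap_mode_enabled:
        # Helm values own the configuration, so runtime writes conflict.
        return HTTPStatus.CONFLICT, "configs are managed via Helm values"

    store[name] = value
    return HTTPStatus.OK, "updated"
```

Rejecting the write, rather than accepting it and letting the next ConfigMap sync overwrite it, keeps the Helm values and the running configuration from drifting apart.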

## Deployment Scripts

- **Multi-provider deploy**: `deploy-k8s.sh` provisions a Kubernetes cluster on Azure AKS, AWS EKS, microk8s, or registers an existing cluster, then installs OSMO end-to-end. Cluster-agnostic dependency installers detect existing KAI Scheduler, GPU Operator, and MinIO so re-runs are safe. (#979)
- **Storage backend wiring**: `configure-storage.sh` provisions and registers the workflow storage backend for MinIO, Azure Blob, AWS S3, or a bring-your-own S3 endpoint, including credential creation and bucket setup. (#979, #988)
- **Idempotent token mint**: Backend operator token reconciliation now deletes any pre-existing `backend-token` before re-minting, so partial prior runs and microk8s PVC carryover no longer wedge re-deploys. (#988)
- **Helm values for minimal install**: `deploy-osmo-minimal.sh` accepts `--values` to layer custom Helm values on top of the minimal preset. (#993)

## Workflow Execution

- **Per-group exec and queue timeouts**: Each group's clock starts on its own `RUNNING` (exec) or `SCHEDULING` (queue) transition. Expired groups are marked `FAILED_EXEC_TIMEOUT` or `FAILED_QUEUE_TIMEOUT`; downstream groups cascade as `FAILED_UPSTREAM`, sibling groups keep running. Delayed jobs serialized before the upgrade fall back to the previous workflow-level enforcement with a warning log. (#925)
- **Pool quota accounting handles Jinja**: `osmo-ctrl` resource requests and limits are now pre-rendered for pool-quota accounting, so templated values like `{% if USER_CPU > 2 %}2{% else %}{{USER_CPU}}{% endif %}` are counted correctly instead of being silently treated as zero. (#931)
- **service_auth wired into worker, agent, logger**: These services now read `service_auth` and stop reading `service_base_url` from the database, fixing config-mode authentication for non-API pods. (#930)
- **KAI queues sync on every registration**: Backend registration now syncs KAI Scheduler queues unconditionally, instead of only on the first registration. (#941)
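
The per-group exec timeout semantics can be sketched as follows. This is an illustrative model only: the `Group` dataclass, the `evaluate` function, and the `PENDING` and `COMPLETED` status names are assumptions, while `RUNNING`, `FAILED_EXEC_TIMEOUT`, and `FAILED_UPSTREAM` come from the notes above. Groups are assumed to be listed in dependency order.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Group:
    name: str
    started_at: Optional[float] = None   # when the group entered RUNNING
    finished_at: Optional[float] = None  # None while still running
    upstream: List[str] = field(default_factory=list)


def evaluate(groups, exec_timeout, now):
    """Return a status per group, metering each group's clock on its own."""
    statuses = {}
    for group in groups:  # assumed topological order
        # A failed upstream group cascades to its dependents.
        if any(statuses[u].startswith("FAILED") for u in group.upstream):
            statuses[group.name] = "FAILED_UPSTREAM"
            continue
        if group.started_at is None:
            statuses[group.name] = "PENDING"  # hypothetical status name
            continue
        end = group.finished_at if group.finished_at is not None else now
        if end - group.started_at > exec_timeout:
            statuses[group.name] = "FAILED_EXEC_TIMEOUT"
        elif group.finished_at is None:
            statuses[group.name] = "RUNNING"
        else:
            statuses[group.name] = "COMPLETED"  # hypothetical status name
    return statuses
```

With a 60-second timeout, a `sim` group that has run for 100 seconds fails, a sibling `train` group that ran for 30 seconds is unaffected, and a `report` group that depends on `sim` is marked `FAILED_UPSTREAM`.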

## CLI

- **Rsync download**: Pull files from running tasks to your local machine with `osmo workflow rsync download wf-id /remote/path:/local/path`. (#792)
- **Rsync errors when remote source is missing**: Downloads now fail loudly when the requested remote path doesn't exist on the task pod. Previously the rsync daemon exited 0 with zero files transferred and the CLI reported success while leaving the destination empty. (#1019)
- **Rsync shutdown error fixed**: The spurious `ValueError: Invalid file descriptor: -1` after a successful rsync download is gone. (#987)
- **Transfer progress bar**: Rsync upload and download now display an in-place progress bar showing bytes, percentage, rate, and ETA. Suppress with `--no-progress`. (#826)
- **Structured JSON logs**: Pass `--log_format json` (or set the equivalent values key on services) to emit single-line JSON logs compatible with Fluent Bit. (#888)
- **Uninstall script**: Remove the OSMO CLI with `osmo-uninstall` (macOS/Linux). (#710)
- **Agent skill prompt**: The installer offers to install the OSMO agent skill for AI coding assistants during CLI installation. (#841)
- **Token expiry warning**: The CLI warns when your access token is within 24 hours of expiring. (#711)
- **Token roles nargs**: `osmo token set --roles` now accepts multiple roles as separate arguments instead of requiring a comma-separated list. (#754)
- **Dataset commands deprecated**: `osmo dataset *` subcommands print a stderr deprecation warning. The commands and corresponding `/datasets` API will be removed in 6.4. (#872)
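
As a rough illustration of what the transfer progress bar reports, the sketch below derives percentage, rate, and ETA from bytes transferred and elapsed time. The function `format_progress` and its output layout are hypothetical; the CLI's actual rendering may differ.

```python
def format_progress(done, total, elapsed_s, width=20):
    """Format one progress line: bar, bytes, percentage, rate, and ETA."""
    fraction = done / total if total else 1.0
    rate = done / elapsed_s if elapsed_s > 0 else 0.0  # bytes per second

    # The ETA is undefined until some data has been transferred.
    eta_text = f"{(total - done) / rate:.0f}s" if rate > 0 else "--"

    filled = int(fraction * width)
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {done}/{total} B {fraction:.1%} {rate:.0f} B/s ETA {eta_text}"
```

Printing the line with `end="\r"` redraws it in place on each update, which is the usual way to get an in-place bar in a terminal.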

## Web UI

- **Workflow version navigation**: Navigate between workflow run versions using back/forward arrows in the details panel. (#834)
- **Task failure messages**: Failed and canceled tasks now display their `failure_message` in the Details section, even when the exit code is null. (#832, #833)
- **Sign out via oauth2-proxy**: The Sign Out action now routes through the oauth2-proxy logout endpoint so the upstream session is cleared, not just the local cookie. (#996)
- **Exec cookies fix for multi-router deployments**: Exec session cookies are now scoped correctly when multiple router services are running, so terminal sessions stay attached to the right backend. (#1003)
- **Datasets deprecation banner**: The `/datasets` page shows a deprecation banner announcing the v6.4 removal. (#872)
- **Terminal resize**: The web shell now responds to window resizes, fixing display issues with applications like vim. (#727, #717)
- **Filter and retry fixes**: Resolved issues with workflow log filters, task retry display, and occupancy search fields. (#784)
- **Next.js 16.2.4 / Node 24.14.1**: The UI now ships on Next.js 16.2.4 and Node 24.14.1. (#949)
- **CodeMirror deduplicated**: `@codemirror/state` is now deduped to a single version, fixing intermittent editor crashes. (#955)

## Authorization

@@ -63,6 +101,11 @@
- **Credential env var collision**: Multiple credentials with the same payload key name (e.g., both use `key`) no longer overwrite each other. Secret data keys are now namespaced with the credential name. (#839)
- **Credential names not masked**: Credential names and field references (e.g., `AWS_ACCESS_KEY_ID`) are no longer incorrectly masked in workflow specs. (#744)
- **Dataset manifest sort**: Fixed a binary search mismatch in the dataset manifest comparator. (#903)
- **Dataset browsing from private buckets**: Dataset URLs for S3-compatible backends are now built against the credential's `override_url` instead of the AWS pattern, so the UI can fetch content from CAIOS, MinIO, and other non-AWS endpoints. (#957)
- **Storage credential setup errors**: Clearer error messages when required fields are missing or malformed during credential creation. (#947)
- **OpenAPI schema generation**: API schema export works again after the Pydantic v2 migration. (#985)
- **SSL truststore on Python 3.14 + microk8s**: Patched `ssl` with `truststore` so HTTPS calls from in-cluster pods on Python 3.14 + microk8s pick up the system trust store. (#951)
- **Web UI base image (CVE-2026-2673)**: Bumped the web-ui base image to v4.0.5 to pick up the upstream fix. (#971)
- **Workflow file 403 handling**: The streaming response now returns a proper error when workflow file access is forbidden. (#730)
- **Authz path fixes**: Corrected authorization paths for rsync, workflow exec, and credential create operations. (#739, #738, #737)
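
The credential env var collision fix can be pictured with a small sketch: when two credentials share a payload key such as `key`, a flat mapping keeps only the last value, while namespaced keys keep both. The function `build_secret_data` and the `<credential>.<key>` separator are assumptions for illustration; the actual key format is not stated in these notes.

```python
def build_secret_data(credentials, namespaced=True):
    """Flatten credential payloads into one Secret-style data mapping.

    Hypothetical helper: `credentials` maps a credential name to its
    payload dict.
    """
    data = {}
    for cred_name, payload in credentials.items():
        for key, value in payload.items():
            # Assumption: namespaced keys take the form "<credential>.<key>".
            data_key = f"{cred_name}.{key}" if namespaced else key
            data[data_key] = value
    return data


creds = {"s3-data": {"key": "AAA"}, "registry": {"key": "BBB"}}
flat = build_secret_data(creds, namespaced=False)  # second "key" overwrites the first
safe = build_secret_data(creds)                    # both values survive
```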
