
Cleanup ECR resources and Docker image handling#813

Merged
forstmeier merged 5 commits into master from prefect-cloud-fixes
Apr 5, 2026

Conversation

Collaborator

@forstmeier forstmeier commented Apr 3, 2026

Overview

Changes

  • Rename ECR repository from fund/tide-runner to fund/tide-model-runner in infrastructure code (infrastructure/storage.py, infrastructure/__main__.py)
  • Update GitHub Actions workflow to use tide application with model-runner stage instead of the old model-trainer entries; remove duplicate model-trainer deploy job matrix entries
  • Rename Docker stage in models/tide/Dockerfile from runner to model-runner to match the new naming
  • Delete unused models/Dockerfile which contained a dead server-worker stage no longer referenced anywhere
  • Update maskfile.md ECS service lookup case to tide-model-runner
  • Update infrastructure storage test to assert against the new tide_model_runner_repository_lifecycle resource name

Context

The tide-runner naming was inconsistent with the broader model-runner convention being adopted. The old models/Dockerfile (server-worker stage) was unreferenced dead code left over from a prior architecture and is no longer needed.

Summary by CodeRabbit

  • New Features

    • Provisioned GPU-backed training infrastructure and added model initialization commands for remote/local trainer setup.
    • Added Prefect deployment updates to target new trainer work pools.
  • Chores

    • Consolidated CI/CD to a unified model-runner build/deploy flow and cleaned the deployment matrix.
    • Renamed runtime/image identifiers and ECS cluster target; removed legacy model image build.
    • Expanded deployment permissions to support autoscaling and trainer instance management.
  • Tests

    • Added and updated tests validating training infra, IAM, and storage/name changes.

forstmeier and others added 2 commits April 2, 2026 22:35
… dead code

Root cause: The Prefect flow was crashing because the ECS task used a stale
fund/tide-runner:worker image built from commit f6c3a08 (March 17), which lacked
prefect-aws, mlflow, and internal dependencies added to models/tide/pyproject.toml
since then. The :worker tag was manually pushed and diverged from :latest.

Changes:
- Rename ECR repo fund/tide-runner -> fund/tide-model-runner in infrastructure/storage.py
  and infrastructure/__main__.py to match the maskfile naming convention
  (repository_name="fund/${package_name}-${stage_name}")
- Rename Dockerfile stage AS runner -> AS model-runner in models/tide/Dockerfile
  so mask infrastructure image build tide model-runner resolves the correct target
- Delete models/Dockerfile (dead code: old Prefect polling-worker image that ran
  prefect worker start; no longer needed with ecs:push work pool architecture)
- Update .github/workflows/launch_infrastructure.yaml: replace model-trainer/server-worker
  matrix entry with tide/model-runner; remove model-trainer entries from deploy_images
  matrix (tide has no ECS service, only an ECS task run by Prefect Cloud directly)
- Update maskfile.md deploy case from tide-runner) to tide-model-runner)
- Update test_infrastructure_storage.py to assert tide_model_runner_repository_lifecycle
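The repository naming convention cited above (repository_name="fund/${package_name}-${stage_name}") can be sketched as a tiny Python helper; the function name here is hypothetical, the convention itself is from the maskfile:

```python
def ecr_repository_name(package_name: str, stage_name: str) -> str:
    """Mirror the maskfile convention: fund/${package_name}-${stage_name}."""
    return f"fund/{package_name}-{stage_name}"

# The tide package with the model-runner stage resolves to the renamed repository:
print(ecr_repository_name("tide", "model-runner"))  # → fund/tide-model-runner
```

Under this convention, the old fund/tide-runner name was the odd one out, which is why the rename also touches the Dockerfile stage and CI matrix entries.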

All 4 images rebuilt and pushed to ECR at commit 2feadf6:
- fund/data-manager-server
- fund/portfolio-manager-server
- fund/ensemble-manager-server
- fund/tide-model-runner (new repo, replaces fund/tide-runner)

Prefect work pool fund-work-pool-ecs default image updated to
982604770479.dkr.ecr.us-east-1.amazonaws.com/fund/tide-model-runner:latest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@forstmeier forstmeier added the yaml (YAML code updates) label Apr 3, 2026
Copilot AI review requested due to automatic review settings April 3, 2026 22:08
@github-project-automation github-project-automation Bot moved this to In Progress in Overview Apr 3, 2026
@github-actions github-actions Bot requested a review from chrisaddy April 3, 2026 22:08
@github-actions github-actions Bot added the python (Python code updates) and markdown (Markdown code updates) labels Apr 3, 2026
Contributor

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough


Renames tide-runner artifacts to tide-model-runner, adds GPU-backed training infrastructure (ECS cluster, ASG, capacity provider, IAM changes), updates CI/CD workflow and Prefect work pools, removes legacy models Dockerfile, and updates related tests and developer scripts. (50 words)

Changes

  • CI/CD Workflow (.github/workflows/launch_infrastructure.yaml) — Replaces model-trainer entries with tide; swaps build mappings to model-runner scoped to models/**; removes model-trainer deploy mappings.
  • Pulumi entrypoints & exports (infrastructure/__main__.py) — Renames ECR export identifiers from tide_runner_* to tide_model_runner_*; adds an aws_ecs_models_cluster_name export from models_cluster.name.
  • ECR storage resources (infrastructure/storage.py) — Renames the ECR resource and lifecycle references to tide_model_runner_repository, the repository name to fund/tide-model-runner, and the image URI output.
  • New training infra module (infrastructure/training.py) — Adds GPU-backed training infrastructure: ECS cluster fund-models, IAM roles/instance profile, GPU AMI via SSM, EC2 launch template (g4dn.xlarge + GPU user data), ASG (min=0, desired=0, max=1), capacity provider, and a CloudWatch log group; exports roles/cluster/log group.
  • IAM policy updates (infrastructure/iam.py) — Adds github_actions_trainer_policy, grants autoscaling:* to the existing infrastructure policy, and attaches the trainer policy to github_actions_infrastructure_role.
  • ECS cluster name & dev scripts (infrastructure/compute.py, devenv.nix) — Cluster name changed from fund-application to fund-applications; dev scripts updated to the new cluster and renamed trainer init/setup scripts.
  • Prefect & deployment changes (prefect.yaml, models/tide/src/tide/deploy.py, maskfile.md, devenv.nix) — Prefect work pools renamed to fund-models-remote/fund-models-local; deployment name/work_pool updated; maskfile/devenv simplified the model deploy flow and commands.
  • Dockerfiles & images (models/Dockerfile, models/tide/Dockerfile) — Removed the legacy models/Dockerfile; renamed the build stage alias in models/tide/Dockerfile from runner to model-runner.
  • Tests (libraries/python/tests/*, models/tide/tests/test_deploy.py) — Updated tests to reflect renamed ECR and IAM identifiers; added tests asserting the contents of the new infrastructure/training.py (GPU AMI, g4dn.xlarge, ASG scale-to-zero, user data).
  • Misc help text / maskfile (maskfile.md) — Updated Mask commands to reference tide model-runner, added models subcommands for initializing/deploying work pools, removed the Pulumi-based PREFECT_API_URL derivation in the model deploy flow.
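The renamed lifecycle policy resource pairs each ECR repository with an expiry rule. The PR doesn't show the rule contents; as an illustration only (retention count and description are assumptions), a typical ECR lifecycle policy document looks like this:

```python
import json

def ecr_lifecycle_policy(keep_images: int = 10) -> str:
    """Build an ECR lifecycle policy JSON document that expires old images.

    The rule below is illustrative; the repository's actual retention
    settings are not shown in this PR.
    """
    return json.dumps({
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire images beyond the most recent {keep_images}",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_images,
                },
                "action": {"type": "expire"},
            }
        ]
    })

policy = json.loads(ecr_lifecycle_policy())
print(policy["rules"][0]["selection"]["countNumber"])  # → 10
```

Because the policy is keyed to a repository resource name, renaming the repository to tide_model_runner_repository requires renaming the attached lifecycle policy as well, which is what the updated storage test asserts.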

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant GH as "GitHub Actions"
  participant Registry as "ECR (fund/tide-model-runner)"
  participant Pulumi as "Pulumi"
  participant AWS as "AWS (ECS / EC2 / ASG / IAM / CloudWatch)"
  participant Prefect as "Prefect Cloud"

  GH->>Registry: Build & push model image (models/**) to fund/tide-model-runner
  GH->>Pulumi: Trigger infra deploy (update exports)
  Pulumi->>AWS: Create/Update IAM, ECR, ECS cluster, Launch Template, ASG, Capacity Provider, Log Group
  AWS-->>Pulumi: Return resource ARNs / cluster name
  Pulumi-->>GH: Export outputs (e.g., aws_ecs_models_cluster_name, ecr image URI)
  Prefect->>AWS: Create/use work pools (fund-models-remote / fund-models-local)
  GH->>Prefect: Deploy flows using new work pools (via maskfile/devenv scripts)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 9.09%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title 'Cleanup ECR resources and Docker image handling' accurately summarizes the main changes: ECR repository renaming, Docker stage updates, and removal of unused Docker configuration.



@forstmeier forstmeier changed the title Rename tide ECR repo from tide-runner to tide-model-runner and remove dead code Cleanup ECR resources and Docker image handling Apr 3, 2026
Contributor

greptile-apps Bot commented Apr 3, 2026

Greptile Summary

This PR renames the ECR repository from fund/tide-runner to fund/tide-model-runner, the corresponding Docker build stage, and all Prefect work pool names (to fund-models-remote/fund-models-local). It removes the dead root models/Dockerfile, drops the model-trainer deploy job entries from CI (correct since the tide runner is a Prefect ECS batch task), and adds a new infrastructure/training.py that provisions GPU training infrastructure (a dedicated fund-models ECS cluster, a g4dn.xlarge ASG scaling to zero, a capacity provider, launch template, IAM instance profile, and a new least-privilege github_actions_trainer_policy). A long-standing cluster name typo (fund-application → fund-applications) is also fixed in devenv.nix and maskfile.md.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style suggestions that do not affect production deployments

The rename and refactoring changes are consistent and well-tested throughout. Both issues are P2 (missing env-var guard and override in developer convenience scripts) that do not block the primary training or deployment workflow.

devenv.nix — two minor issues with PREFECT_API_URL and PREFECT_API_KEY handling in the renamed init scripts

Important Files Changed

  • .github/workflows/launch_infrastructure.yaml — CI matrix updated: replaces model-trainer/server-worker entries with tide/model-runner in the build job; removes model-trainer deploy entries since tide runs as a Prefect ECS batch task, not a long-running service.
  • devenv.nix — Renames training scripts, updates work pool names, fixes the cluster name typo; two P2 issues: a missing PREFECT_API_URL override in initialize-local-trainer and a dropped PREFECT_API_KEY guard in initialize-remote-trainer.
  • infrastructure/__main__.py — Imports and exports updated to use tide_model_runner_repository and tide_model_runner_image_uri; adds the aws_ecs_models_cluster_name export from the new training module.
  • infrastructure/compute.py — ECS cluster name corrected from fund-application to fund-applications to match the actual provisioned resource name.
  • infrastructure/iam.py — Adds github_actions_trainer_policy scoped to fund-models-instance-role and fund-models-instance-profile; the policy is attached to github_actions_infrastructure_role alongside the existing infrastructure policy.
  • infrastructure/storage.py — ECR repository renamed from tide_runner_repository/fund/tide-runner to tide_model_runner_repository/fund/tide-model-runner, with matching lifecycle policy and image URI variable renames.
  • infrastructure/training.py — New file: provisions the fund-models ECS cluster with a g4dn.xlarge ASG scaling to zero, an ECS capacity provider with managed scaling, a launch template using the GPU-optimized ECS AMI, an EC2 instance profile with AmazonEC2ContainerServiceforEC2Role, and a CloudWatch log group.
  • libraries/python/tests/test_infrastructure_iam.py — Updated to assert that the github_actions_trainer_policy resource name is declared and its ARN is attached to the infrastructure role.
  • libraries/python/tests/test_infrastructure_storage.py — Lifecycle policy assertion updated from tide_runner_repository_lifecycle to tide_model_runner_repository_lifecycle.
  • libraries/python/tests/test_infrastructure_training.py — New tests covering the GPU instance type, scale-to-zero ASG, EC2 instance profile with the ECS role, capacity provider wiring, and ECS_ENABLE_GPU_SUPPORT in user data.
  • maskfile.md — Fixes the ECS cluster name typo in ecs-deploy/ecs-status; adds a tide-model-runner case in the deploy command that exits early with an informative message since it is a batch task, not a service.
  • models/Dockerfile — Deleted: removes the dead server-worker stage that was no longer referenced after a prior architecture change.
  • models/tide/Dockerfile — Docker build stage renamed from runner to model-runner to align with the ECR repository name and CI workflow matrix stage name.
  • models/tide/src/tide/deploy.py — Deployment name updated to tide-trainer-remote and work pool to fund-models-remote; consistent with prefect.yaml and devenv.nix.
  • models/tide/tests/test_deploy.py — Test assertions updated to match the new deployment name (tide-trainer-remote) and work pool name (fund-models-remote).
  • prefect.yaml — Work pool names updated from fund-work-pool-ecs/fund-work-pool-local to fund-models-remote/fund-models-local across both deployment entries.
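The new training-infrastructure tests described above assert against the module's source text. A minimal sketch of that test style, using an inline stand-in for the real source (the loader helper and the exact resource arguments below are assumptions, not the repo's actual code):

```python
# Stand-in source text; the real tests load infrastructure/training.py from
# disk via a loader helper, and the argument names below are illustrative.
SAMPLE_TRAINING_SOURCE = '''
models_launch_template = aws.ec2.LaunchTemplate(
    "models_launch_template",
    instance_type="g4dn.xlarge",
    user_data=USER_DATA,  # echoes ECS_ENABLE_GPU_SUPPORT=true into /etc/ecs/ecs.config
)
models_asg = aws.autoscaling.Group(
    "models_asg", min_size=0, desired_capacity=0, max_size=1,
)
'''

def test_training_uses_gpu_instances() -> None:
    # The GPU instance type must appear in the launch template definition.
    assert 'instance_type="g4dn.xlarge"' in SAMPLE_TRAINING_SOURCE

def test_training_asg_scales_to_zero() -> None:
    # Scale-to-zero: the ASG idles at zero instances until a task needs capacity.
    assert "min_size=0" in SAMPLE_TRAINING_SOURCE
    assert "desired_capacity=0" in SAMPLE_TRAINING_SOURCE

test_training_uses_gpu_instances()
test_training_asg_scales_to_zero()
print("ok")
```

Source-scan tests like these are coarse (they match strings, not evaluated resources), but they catch exactly the class of regression this PR fixes: a rename applied in one file and missed in another.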
This is a comment left during a code review.
Path: devenv.nix
Line: 302

Comment:
**Missing `PREFECT_API_URL` for local deployment registration**

The old `training-setup` script prefixed every `prefect` invocation with `PREFECT_API_URL="http://localhost:4200/api"`. The new `initialize-local-trainer` drops this, so if `PREFECT_API_KEY` is present in the developer's shell environment (common after configuring remote training), Prefect will route this command to Prefect Cloud and silently register `tide-trainer-local` there instead of on the local orchestrator.

```suggestion
    PREFECT_API_URL="http://localhost:4200/api" uv run prefect --no-prompt deploy --name tide-trainer-local
```


---

This is a comment left during a code review.
Path: devenv.nix
Line: 271-273

Comment:
**`PREFECT_API_KEY` guard removed from remote trainer init**

The old `training-init` script checked for `PREFECT_API_KEY` at the top and exited early with a clear, actionable message if it was unset. The new `initialize-remote-trainer` drops this guard entirely, so a missing key produces a cryptic Prefect authentication error rather than a helpful prompt for new contributors setting up remote training.

```suggestion
  scripts.initialize-remote-trainer.exec = ''
    if [ -z "${PREFECT_API_KEY:-}" ]; then
      echo "PREFECT_API_KEY not set. Add it to .envrc and run 'direnv allow'."
      exit 1
    fi

    unset PREFECT_API_URL
```


Reviews (4) — last reviewed commit: "Address PR #813 feedback"

Contributor

Copilot AI left a comment


Pull request overview

This PR aligns the “tide” model image/repository naming with the project’s “model-runner” convention, updates CI to build/push the renamed image target, and removes an unreferenced legacy Dockerfile.

Changes:

  • Renamed the Tide ECR repository and related Pulumi outputs/resources from tide-runner to tide-model-runner.
  • Updated GitHub Actions + Tide Docker build stage naming to use tide + model-runner (and removed old model-trainer matrix entries).
  • Deleted the unused models/Dockerfile and updated the maskfile ECS deploy lookup + a storage test assertion.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • models/tide/Dockerfile — Renames the build stage from runner to model-runner to match CI naming.
  • models/Dockerfile — Removes a dead/unused Dockerfile that defined an unreferenced server-worker stage.
  • maskfile.md — Updates the ECS deploy lookup to tide-model-runner.
  • libraries/python/tests/test_infrastructure_storage.py — Updates the test assertion for the renamed ECR lifecycle policy resource.
  • infrastructure/storage.py — Renames the ECR repository and lifecycle policy resources to tide_model_runner_* and changes the repo name.
  • infrastructure/__main__.py — Updates Pulumi exports/imports to the new Tide model-runner naming.
  • .github/workflows/launch_infrastructure.yaml — Updates the image build/push matrix to tide + model-runner and removes old model-trainer deploy entries.


Collaborator

coveralls commented Apr 3, 2026

Coverage Status

coverage: 78.162% (+0.06%) from 78.101% when pulling 7c7ee50 on prefect-cloud-fixes into a14bf54 on master.

Contributor

@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libraries/python/tests/test_infrastructure_storage.py (1)

25-31: 🛠️ Refactor suggestion | 🟠 Major

Assert the renamed ECR repository, not just the lifecycle symbol.

This still passes if tide_model_runner_repository exists but its name stays fund/tide-runner, which is the regression this PR is fixing. Add an assertion for the actual repository name or image URI so the stale-repo case is covered.

🧪 Suggested assertion

```diff
 def test_storage_contains_ecr_lifecycle_policy_resources() -> None:
     infrastructure_storage = load_infrastructure_storage()

     assert '"data_manager_repository_lifecycle"' in infrastructure_storage
     assert '"portfolio_manager_repository_lifecycle"' in infrastructure_storage
     assert '"ensemble_manager_repository_lifecycle"' in infrastructure_storage
     assert '"tide_model_runner_repository_lifecycle"' in infrastructure_storage
+    assert 'fund/tide-model-runner' in infrastructure_storage
```

As per coding guidelines: "When fixing a bug, write tests that reproduce the bug before fixing it, then verify the tests pass after the fix."

Additional inline comments:

In .github/workflows/launch_infrastructure.yaml (around lines 31-33): the build_and_push_images job pushes the tide/model-runner image to the ECR repository fund/tide-model-runner, but that repository is only created later by launch_infrastructure (via the repository creation in infrastructure/storage.py), so the push can fail on a first deploy. Ensure fund/tide-model-runner exists before build_and_push_images runs: either add an early provisioning step in this workflow that creates the ECR repository, or move the repository creation into a pre-build job that completes before build_and_push_images.

In maskfile.md (line 196): the docs are inconsistent — the canonical service name was changed to tide-model-runner, but example text still says "tide runner". Update the example occurrences to "model-runner" or the full canonical "tide-model-runner" so CLI/service examples and echo messages match the renamed stage/repo.


📥 Commits

Reviewing files that changed from the base of the PR and between a14bf54 and 4eac37a.

📒 Files selected for processing (7)
  • .github/workflows/launch_infrastructure.yaml
  • infrastructure/__main__.py
  • infrastructure/storage.py
  • libraries/python/tests/test_infrastructure_storage.py
  • maskfile.md
  • models/Dockerfile
  • models/tide/Dockerfile
💤 Files with no reviewable changes (1)
  • models/Dockerfile

Root cause: training jobs were running on Fargate (no GPU), causing
tinygrad to fail with AttributeError on cuInit because CUDA=1 was set
in the Dockerfile but no GPU driver was available.

Changes:
- Add infrastructure/training.py: new fund-models ECS cluster backed by
  EC2 g4dn.xlarge GPU instances via auto-scaling group (scale-to-zero),
  capacity provider with managed scaling, and CloudWatch log group
- Add fund-github-actions-trainer-policy: separate IAM managed policy for
  trainer role/instance profile permissions, split from the main policy
  to stay within AWS's 6144-byte policy size limit
- Rename ECS application cluster from fund-application to fund-applications
- Rename all Prefect/infrastructure resources from fund-trainer to fund-models
  to reflect that the cluster will host multiple model workloads beyond training
- Rename Prefect deployment from fund-trainer-remote to tide-trainer-remote
  to be workflow-specific rather than infrastructure-scoped
- Rename mask infrastructure trainer command to mask infrastructure models
- Update devenv.nix scripts and process work pool references to fund-models-local
- Update prefect.yaml deployment names and work pool names throughout
- Add tests for new training infrastructure
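The policy split above is driven by AWS's managed-policy size limit (6,144 characters, counted on the policy JSON excluding whitespace). That constraint can be checked ahead of a deploy; a rough sketch, where the policy document is a placeholder rather than the real trainer policy:

```python
import json

AWS_MANAGED_POLICY_LIMIT = 6144  # characters; AWS does not count whitespace

def policy_size(document: dict) -> int:
    """Approximate the size AWS counts: minified JSON with no whitespace."""
    return len(json.dumps(document, separators=(",", ":")))

# Placeholder statement standing in for the trainer policy's real contents.
trainer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageTrainerInstanceRole",
            "Effect": "Allow",
            "Action": ["iam:PassRole", "iam:GetRole"],
            "Resource": "arn:aws:iam::*:role/fund-models-instance-role",
        }
    ],
}

print(policy_size(trainer_policy) <= AWS_MANAGED_POLICY_LIMIT)  # → True
```

When a single policy approaches the limit, splitting it into multiple managed policies attached to the same role (as done here with the trainer policy) keeps each document under the cap.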

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment



Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infrastructure/iam.py (1)

254-269: ⚠️ Potential issue | 🟡 Minor

Add autoscaling.amazonaws.com to DeleteServiceLinkedRoles statement or document why it's intentionally excluded.

The CreateServiceLinkedRolesForStack statement includes autoscaling.amazonaws.com (line 263), but the DeleteServiceLinkedRoles statement (lines 351–367) does not. This means GitHub Actions can create the autoscaling SLR but cannot delete it during infrastructure teardown. If this asymmetry is intentional (e.g., the SLR is shared and should persist), add a comment explaining the design choice. Otherwise, add autoscaling.amazonaws.com to the deletion condition to ensure consistent lifecycle management.
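The asymmetry described above can be checked mechanically by diffing the two statements' service-name conditions. A sketch, with ecs.amazonaws.com as an assumed second service name (only autoscaling.amazonaws.com is confirmed by the finding):

```python
# Condition values from the two IAM statements; ecs.amazonaws.com is assumed.
create_slr_services = {"ecs.amazonaws.com", "autoscaling.amazonaws.com"}
delete_slr_services = {"ecs.amazonaws.com"}  # autoscaling.amazonaws.com omitted

# Any service creatable but not deletable is a lifecycle asymmetry worth
# documenting or fixing.
asymmetry = sorted(create_slr_services - delete_slr_services)
print(asymmetry)  # → ['autoscaling.amazonaws.com']
```

A tiny check like this can live in the existing source-scan test suite so the create/delete conditions cannot silently drift apart again.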

♻️ Duplicate comments (1)
maskfile.md (1)

196-196: ⚠️ Potential issue | 🟡 Minor

Docs are still inconsistent with the renamed stage.

Line 196 uses tide-model-runner, but examples in this file still reference tide runner, leaving operator guidance out of sync.

Additional inline comments:

In infrastructure/training.py (around lines 92-111): the autoscaling group models_asg only sets the Name tag and omits the shared tags in config.tags. Extend the tags argument to also include the config.tags entries, mapped into aws.autoscaling.GroupTagArgs with propagate_at_launch=True (handling duplicate keys), so the environment/project identifiers are applied at launch.
Around lines 48-57: the models_ami lookup uses an Amazon Linux 2 ECS GPU AMI name filter. Target Amazon Linux 2023 instead: either change the filter to the AL2023 pattern al2023-ami-ecs-gpu-hvm-*-kernel-*-x86_64-ebs, or replace the get_ami_output call with an SSM Parameter Store lookup of /aws/service/ecs/optimized-ami/amazon-linux-2023/gpu/recommended and use the returned AMI ID.

In maskfile.md (around lines 448-450): the local initialization branch calls the remote deployment module (python -m tide.deploy), which deploys tide-trainer-remote to fund-models-remote. Change the branch to a local-only deploy (or drop the deployment call) so 'local' initialization does not register the remote deployment, and update the invocation where "Registering training deployment..." is echoed to use the local deploy path.



📥 Commits

Reviewing files that changed from the base of the PR and between 4eac37a and e5a260a.

📒 Files selected for processing (11)
  • devenv.nix
  • infrastructure/__main__.py
  • infrastructure/compute.py
  • infrastructure/iam.py
  • infrastructure/training.py
  • libraries/python/tests/test_infrastructure_iam.py
  • libraries/python/tests/test_infrastructure_training.py
  • maskfile.md
  • models/tide/src/tide/deploy.py
  • models/tide/tests/test_deploy.py
  • prefect.yaml

- Update models AMI lookup to use AL2023 ECS GPU AMI via SSM parameter
  store instead of AL2 AMI name filter
- Propagate config.tags on ASG in addition to the Name tag so project,
  stack, and manager tags are applied at launch
- Fix local init command to deploy tide-trainer-local from prefect.yaml
  instead of calling tide.deploy which hardcodes the remote deployment
- Update maskfile image command examples from "tide runner" to
  "tide model-runner" to match the canonical ECR stage name
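The AL2023 SSM lookup in the first bullet resolves a well-known Parameter Store path. A small sketch of how that path is composed; the helper is hypothetical, while the parameter name itself is AWS's documented one for ECS-optimized AMIs:

```python
def ecs_ami_parameter(os_family: str = "amazon-linux-2023", gpu: bool = True) -> str:
    """Compose the SSM Parameter Store path for the recommended ECS-optimized AMI."""
    base = f"/aws/service/ecs/optimized-ami/{os_family}"
    return f"{base}/gpu/recommended" if gpu else f"{base}/recommended"

print(ecs_ami_parameter())
# → /aws/service/ecs/optimized-ami/amazon-linux-2023/gpu/recommended
```

Resolving the AMI through this parameter (rather than a name filter) tracks AWS's recommended image automatically as new ECS GPU AMIs are published.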

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 16:39

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

infrastructure/iam.py:104

  • The GitHub Actions infrastructure policy now includes autoscaling:* on Resource: "*", which grants full Auto Scaling permissions account-wide. If the intent is only to manage the models GPU ASG/capacity provider, consider scoping actions to the specific ASG/launch template ARNs (and/or limiting to the minimal autoscaling actions Pulumi needs) to keep this policy least-privilege.
```python
"Sid": "ManageEC2ECSELBBudgetsAndServiceDiscovery",
"Effect": "Allow",
"Action": [
    "autoscaling:*",
    "ec2:*",
    "ecs:*",
    "elasticloadbalancing:*",
    "budgets:*",
    "servicediscovery:*",
],
"Resource": "*",
```



@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
maskfile.md (1)

844-863: ⚠️ Potential issue | 🟡 Minor

Command assumes work pool already exists and provides no validation.

The model deploy tide command has been simplified to only call tide.deploy, which hardcodes deployment to the fund-models-remote work pool. Unlike model train which validates and selects the Pulumi stack, this command skips infrastructure validation and assumes the work pool has already been created (via models initialize remote).

This creates a design gap:

  1. Implicit dependency: The command silently fails if the work pool doesn't exist, with no error message explaining the root cause.
  2. Overlapping commands: Both model deploy tide and models initialize remote register deployments, making it unclear which to use.
  3. No environment flexibility: The hardcoded remote deployment means there's no equivalent for local deployment (users must use models initialize local instead).

The command works correctly when used in the proper sequence, but the design doesn't communicate these constraints.

💡 Recommendations

Option 1: Add work pool existence check to prevent silent failures:

```shell
# In model deploy, before calling tide.deploy
if ! uv run prefect work-pool inspect fund-models-remote >/dev/null 2>&1; then
    echo "Error: fund-models-remote work pool not found"
    echo "Run 'mask models initialize remote' first"
    exit 1
fi
```

Option 2: Consolidate with models initialize remote by documenting that model deploy tide is for re-deployment only and requires prior initialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maskfile.md` around lines 844 - 863, The deploy command for model_name (the
"model deploy tide" branch) assumes the Prefect work pool "fund-models-remote"
exists and directly calls tide.deploy; add a pre-check that verifies the work
pool exists (using the same uv/prefect inspection approach used elsewhere) and
if missing print a clear error telling the user to run "mask models initialize
remote" (or run initialization) and exit non‑zero; keep the call to tide.deploy
only after the existence check so the command fails fast with a helpful message
instead of silently erroring.
♻️ Duplicate comments (1)
infrastructure/training.py (1)

101-115: ⚠️ Potential issue | 🟡 Minor

Prevent potential duplicate Name tag entries in ASG tags.

Lines 102-106 already add Name, then Lines 108-115 append all tags.items(). If config.tags contains Name, this produces duplicate tag keys. Filter Name in the list comprehension.

```shell
#!/bin/bash
# Verify whether `config.tags` may include "Name", which would duplicate ASG tag keys.
# Expected: "Name" should be absent from shared tags, or code should explicitly filter it.

set -euo pipefail

echo "== Locate config.py files =="
fd -i 'config.py'

echo
echo "== Inspect tags declarations and nearby context =="
while IFS= read -r f; do
  echo "--- $f ---"
  rg -n -C4 '^\s*tags\s*=' "$f" || true
  rg -n -C2 '"Name"|'\''Name'\''' "$f" || true
done < <(fd -i 'config.py')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure/training.py` around lines 101 - 115, The ASG tags list
currently adds a hardcoded Name tag via aws.autoscaling.GroupTagArgs and then
extends with all entries from tags.items(), which can produce duplicate "Name"
keys; modify the list comprehension that builds additional
aws.autoscaling.GroupTagArgs (the comprehension iterating "for key, value in
tags.items()") to skip entries where key == "Name" (or key.lower() == "name" if
case-insensitive) so only the single intended Name tag is included.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infrastructure/training.py`:
- Around line 56-87: The models_launch_template AWS LaunchTemplate must enforce
IMDSv2: add a metadata_options block to the aws.ec2.LaunchTemplate call (the
models_launch_template instantiation) and set http_tokens="required" (optionally
also set http_endpoint="enabled" and a conservative http_put_response_hop_limit
like 1); update the metadata_options using the appropriate LaunchTemplate
metadata args so instances launched from models_launch_template require IMDSv2.
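A hedged sketch of the metadata options this comment asks for, expressed as plain keyword values; in the real code these would be passed to aws.ec2.LaunchTemplate via LaunchTemplateMetadataOptionsArgs, and the helper function here is hypothetical:

```python
def enforces_imdsv2(metadata_options: dict) -> bool:
    """Return True when launch-template metadata options require IMDSv2 tokens."""
    return metadata_options.get("http_tokens") == "required"

# The values the review suggests; keys mirror the LaunchTemplate metadata args.
metadata_options = {
    "http_endpoint": "enabled",        # keep the metadata service reachable
    "http_tokens": "required",         # enforce IMDSv2 session tokens
    "http_put_response_hop_limit": 1,  # conservative hop limit
}
```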

In `@libraries/python/tests/test_infrastructure_training.py`:
- Around line 11-43: Add targeted unit tests in
libraries/python/tests/test_infrastructure_training.py that exercise
missing/alternate branches of load_infrastructure_training: create tests that
(1) simulate configuration where models_cluster or models_capacity_provider is
absent and assert the template omits '"models_cluster"' and
'"models_capacity_provider"', (2) simulate a non-GPU trainer to assert
instance_type is not "g4dn.xlarge" and that "amazon-linux-2023/gpu/recommended"
is absent, (3) simulate ASG with min_size>0 and desired_capacity>0 to assert
"min_size=0" and "desired_capacity=0" are absent, and (4) simulate an
environment without an instance profile to assert "models_instance_profile" and
"AmazonEC2ContainerServiceforEC2Role" are omitted; ensure each test calls
load_infrastructure_training and uses clear asserts to drive coverage above 90%
for this module.
- Around line 7-43: The tests currently assert raw substrings from
load_infrastructure_training() which is brittle; instead parse the generated
Python/templated infrastructure into an AST or load it as a resource object and
assert concrete structures and arguments. Update the tests (e.g.,
test_models_cluster_uses_ec2_backed_capacity_provider,
test_trainer_uses_gpu_instance_type, test_trainer_asg_scales_to_zero,
test_models_instance_profile_configured,
test_trainer_gpu_support_enabled_in_user_data) to call
load_infrastructure_training(), parse the content into an AST or deserialize to
a resource/model representation, then assert the presence and properties of the
relevant nodes/objects (e.g., check the Cluster creation has
models_capacity_provider and models_cluster_capacity_providers, the Trainer/ASG
resource has instance_type "g4dn.xlarge" and amazon-linux-2023/gpu/recommended
image, min_size/desired_capacity equal to 0, instance profile includes
AmazonEC2ContainerServiceforEC2Role, and user data contains
ECS_ENABLE_GPU_SUPPORT) rather than using substring membership.
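One way to realize the AST suggestion, sketched against a stand-in source string; the call shape and names are assumptions about what load_infrastructure_training() renders, not the actual template:

```python
import ast

# Stand-in for the rendered infrastructure module.
source = 'aws.ec2.LaunchTemplate("models", instance_type="g4dn.xlarge")'

def find_keyword(source: str, call_attr: str, keyword: str):
    """Return the literal value of `keyword` on the first call to `call_attr`."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == call_attr
        ):
            for kw in node.keywords:
                if kw.arg == keyword and isinstance(kw.value, ast.Constant):
                    return kw.value.value
    return None

instance_type = find_keyword(source, "LaunchTemplate", "instance_type")
```

Asserting on parsed keyword arguments like this survives reformatting and renamed variables, which plain substring checks do not.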

In `@maskfile.md`:
- Around line 443-450: In the local) branch ensure PREFECT_API_URL is set to the
local Prefect server before creating the pool and deploying: set PREFECT_API_URL
to "http://localhost:4200/api" in that branch so the uv run --package tools
prefect work-pool create "fund-models-local" and uv run prefect --no-prompt
deploy --name tide-trainer-local commands target the local server (either export
the variable at the start of the local) case or prefix those uv run commands
with the PREFECT_API_URL assignment).
- Line 196: The early-exit echo "No ECS service for tide model runner" is
ambiguous; update the message where the script handles the tide-model-runner
case (the branch that runs echo "No ECS service for tide model runner" && exit
0) to explicitly state that tide-model-runner is intentionally not an ECS
long-running service and that the script is exiting intentionally (e.g.,
"tide-model-runner is used only for Prefect training; no ECS service expected —
exiting 0"). Keep the same exit code (0) and ensure the change is made in the
branch referencing "tide-model-runner".


Run ID: 44c6b8c4-8ce2-496c-b00a-8b8a3560728e

📥 Commits

Reviewing files that changed from the base of the PR and between e5a260a and e56c5ec.

📒 Files selected for processing (3)
  • infrastructure/training.py
  • libraries/python/tests/test_infrastructure_training.py
  • maskfile.md

- Add IMDSv2 enforcement to models launch template (http_tokens=required)
- Set PREFECT_API_URL=http://localhost:4200/api in maskfile local init
  branch so work pool and deployment target the local Prefect server
- Fix devenv initialize-local-trainer to deploy tide-trainer-local via
  prefect deploy instead of calling tide.deploy which registers the
  remote deployment
- Clarify mask infrastructure models initialize local echo message
- Update mask model deploy description to say Prefect Cloud, matching
  the actual behavior of unsetting PREFECT_API_URL
- Clarify tide-model-runner early-exit message in mask infrastructure
  image deploy to explain it is a training image not an ECS service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
devenv.nix (1)

357-360: ⚠️ Potential issue | 🟠 Major

Local worker bootstrap still invokes remote deployment registration.

Line 360 still runs tide.deploy, which registers tide-trainer-remote to fund-models-remote; that conflicts with the local pool created on Line 358, and any failure is silently swallowed by 2>/dev/null || true.

Suggested fix

```diff
       PREFECT_API_URL="http://localhost:4200/api" \
-        uv run --package tide python -m tide.deploy 2>/dev/null || true
+        uv run prefect --no-prompt deploy --name tide-trainer-local 2>/dev/null || true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@devenv.nix` around lines 357 - 360, The bootstrap is still invoking the
remote registration via the command that runs `python -m tide.deploy`, which
registers `tide-trainer-remote` to `fund-models-remote` and conflicts with the
local pool created by the `uv ... prefect work-pool create "fund-models-local"`
command; remove or guard the `uv run --package tide python -m tide.deploy
2>/dev/null || true` invocation so local bootstraps do not perform remote
deployment registration (e.g., delete the line or wrap it in a conditional that
only runs for remote envs), and also remove the noisy suppression (`2>/dev/null
|| true`) so failures are visible when intended.

Run ID: 0b4982f2-cc3f-474f-bdb7-2e52cb7e3e26

📥 Commits

Reviewing files that changed from the base of the PR and between e56c5ec and 7c7ee50.

📒 Files selected for processing (3)
  • devenv.nix
  • infrastructure/training.py
  • maskfile.md

@forstmeier forstmeier merged commit 1975409 into master Apr 5, 2026
14 checks passed
@forstmeier forstmeier deleted the prefect-cloud-fixes branch April 5, 2026 19:56
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Overview Apr 5, 2026
Labels: markdown (Markdown code updates), python (Python code updates), yaml (YAML code updates)