
Cleanup ECR resources and Docker image handling#813

Merged
forstmeier merged 5 commits into master from prefect-cloud-fixes
Apr 5, 2026

Conversation

Collaborator

@forstmeier forstmeier commented Apr 3, 2026

Overview

Changes

  • Rename ECR repository from fund/tide-runner to fund/tide-model-runner in infrastructure code (infrastructure/storage.py, infrastructure/__main__.py)
  • Update GitHub Actions workflow to use tide application with model-runner stage instead of the old model-trainer entries; remove duplicate model-trainer deploy job matrix entries
  • Rename Docker stage in models/tide/Dockerfile from runner to model-runner to match the new naming
  • Delete unused models/Dockerfile which contained a dead server-worker stage no longer referenced anywhere
  • Update maskfile.md ECS service lookup case to tide-model-runner
  • Update infrastructure storage test to assert against the new tide_model_runner_repository_lifecycle resource name

Context

The tide-runner naming was inconsistent with the broader model-runner convention being adopted. The old models/Dockerfile (server-worker stage) was unreferenced dead code left over from a prior architecture and is no longer needed.

Summary by CodeRabbit

  • New Features

    • Provisioned GPU-backed training infrastructure and added model initialization commands for remote/local trainer setup.
    • Added Prefect deployment updates to target new trainer work pools.
  • Chores

    • Consolidated CI/CD to a unified model-runner build/deploy flow and cleaned the deployment matrix.
    • Renamed runtime/image identifiers and ECS cluster target; removed legacy model image build.
    • Expanded deployment permissions to support autoscaling and trainer instance management.
  • Tests

    • Added and updated tests validating training infra, IAM, and storage/name changes.

forstmeier and others added 2 commits April 2, 2026 22:35
… dead code

Root cause: The Prefect flow was crashing because the ECS task used a stale
fund/tide-runner:worker image built from commit f6c3a08 (March 17), which lacked
prefect-aws, mlflow, and internal dependencies added to models/tide/pyproject.toml
since then. The :worker tag was manually pushed and diverged from :latest.

Changes:
- Rename ECR repo fund/tide-runner -> fund/tide-model-runner in infrastructure/storage.py
  and infrastructure/__main__.py to match the maskfile naming convention
  (repository_name="fund/${package_name}-${stage_name}")
- Rename Dockerfile stage AS runner -> AS model-runner in models/tide/Dockerfile
  so mask infrastructure image build tide model-runner resolves the correct target
- Delete models/Dockerfile (dead code: old Prefect polling-worker image that ran
  prefect worker start; no longer needed with ecs:push work pool architecture)
- Update .github/workflows/launch_infrastructure.yaml: replace model-trainer/server-worker
  matrix entry with tide/model-runner; remove model-trainer entries from deploy_images
  matrix (tide has no ECS service, only an ECS task run by Prefect Cloud directly)
- Update maskfile.md deploy case from tide-runner) to tide-model-runner)
- Update test_infrastructure_storage.py to assert tide_model_runner_repository_lifecycle
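The repository naming convention cited above (repository_name="fund/${package_name}-${stage_name}") can be sketched as a tiny Python helper; the function name here is hypothetical, the convention itself is from the maskfile:

```python
def ecr_repository_name(package_name: str, stage_name: str) -> str:
    """Mirror the maskfile convention: fund/${package_name}-${stage_name}."""
    return f"fund/{package_name}-{stage_name}"

# The tide package with the model-runner stage resolves to the renamed repository:
print(ecr_repository_name("tide", "model-runner"))  # → fund/tide-model-runner
```

Under this convention, the old fund/tide-runner name was the odd one out, which is why the rename also touches the Dockerfile stage and CI matrix entries.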

All 4 images rebuilt and pushed to ECR at commit 2feadf6:
- fund/data-manager-server
- fund/portfolio-manager-server
- fund/ensemble-manager-server
- fund/tide-model-runner (new repo, replaces fund/tide-runner)

Prefect work pool fund-work-pool-ecs default image updated to
982604770479.dkr.ecr.us-east-1.amazonaws.com/fund/tide-model-runner:latest

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@forstmeier forstmeier added the yaml (YAML code updates) label Apr 3, 2026
Copilot AI review requested due to automatic review settings April 3, 2026 22:08
@github-project-automation github-project-automation Bot moved this to In Progress in Overview Apr 3, 2026
@github-actions github-actions Bot requested a review from chrisaddy April 3, 2026 22:08
@github-actions github-actions Bot added the python (Python code updates) and markdown (Markdown code updates) labels Apr 3, 2026
Contributor

coderabbitai Bot commented Apr 3, 2026

📝 Walkthrough


Renames tide-runner artifacts to tide-model-runner, adds GPU-backed training infrastructure (ECS cluster, ASG, capacity provider, IAM changes), updates CI/CD workflow and Prefect work pools, removes legacy models Dockerfile, and updates related tests and developer scripts. (50 words)

Changes

  • CI/CD Workflow (.github/workflows/launch_infrastructure.yaml) — Replaces model-trainer entries with tide; swaps build mappings to model-runner scoped to models/**; removes model-trainer deploy mappings.
  • Pulumi entrypoints & exports (infrastructure/__main__.py) — Renames ECR export identifiers from tide_runner_* to tide_model_runner_*; adds an aws_ecs_models_cluster_name export from models_cluster.name.
  • ECR storage resources (infrastructure/storage.py) — Renames the ECR resource and lifecycle references to tide_model_runner_repository, the repository name to fund/tide-model-runner, and the image URI output.
  • New training infra module (infrastructure/training.py) — Adds GPU-backed training infrastructure: ECS cluster fund-models, IAM roles/instance profile, GPU AMI via SSM, EC2 launch template (g4dn.xlarge + GPU user data), ASG (min=0, desired=0, max=1), capacity provider, and a CloudWatch log group; exports roles/cluster/log group.
  • IAM policy updates (infrastructure/iam.py) — Adds github_actions_trainer_policy, grants autoscaling:* to the existing infrastructure policy, and attaches the trainer policy to github_actions_infrastructure_role.
  • ECS cluster name & dev scripts (infrastructure/compute.py, devenv.nix) — Cluster name changed from fund-application to fund-applications; dev scripts updated to the new cluster and renamed trainer init/setup scripts.
  • Prefect & deployment changes (prefect.yaml, models/tide/src/tide/deploy.py, maskfile.md, devenv.nix) — Prefect work pools renamed to fund-models-remote/fund-models-local; deployment name/work_pool updated; maskfile/devenv simplified the model deploy flow and commands.
  • Dockerfiles & images (models/Dockerfile, models/tide/Dockerfile) — Removed the legacy models/Dockerfile; renamed the build stage alias in models/tide/Dockerfile from runner to model-runner.
  • Tests (libraries/python/tests/*, models/tide/tests/test_deploy.py) — Updated tests to reflect renamed ECR and IAM identifiers; added tests asserting the contents of the new infrastructure/training.py (GPU AMI, g4dn.xlarge, ASG scale-to-zero, user data).
  • Misc help text / maskfile (maskfile.md) — Updated Mask commands to reference tide model-runner, added models subcommands for initializing/deploying work pools, removed the Pulumi-based PREFECT_API_URL derivation in the model deploy flow.
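The renamed lifecycle policy resource pairs each ECR repository with an expiry rule. The PR doesn't show the rule contents; as an illustration only (retention count and description are assumptions), a typical ECR lifecycle policy document looks like this:

```python
import json

def ecr_lifecycle_policy(keep_images: int = 10) -> str:
    """Build an ECR lifecycle policy JSON document that expires old images.

    The rule below is illustrative; the repository's actual retention
    settings are not shown in this PR.
    """
    return json.dumps({
        "rules": [
            {
                "rulePriority": 1,
                "description": f"Expire images beyond the most recent {keep_images}",
                "selection": {
                    "tagStatus": "any",
                    "countType": "imageCountMoreThan",
                    "countNumber": keep_images,
                },
                "action": {"type": "expire"},
            }
        ]
    })

policy = json.loads(ecr_lifecycle_policy())
print(policy["rules"][0]["selection"]["countNumber"])  # → 10
```

Because the policy is keyed to a repository resource name, renaming the repository to tide_model_runner_repository requires renaming the attached lifecycle policy as well, which is what the updated storage test asserts.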

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant GH as "GitHub Actions"
  participant Registry as "ECR (fund/tide-model-runner)"
  participant Pulumi as "Pulumi"
  participant AWS as "AWS (ECS / EC2 / ASG / IAM / CloudWatch)"
  participant Prefect as "Prefect Cloud"

  GH->>Registry: Build & push model image (models/**) to fund/tide-model-runner
  GH->>Pulumi: Trigger infra deploy (update exports)
  Pulumi->>AWS: Create/Update IAM, ECR, ECS cluster, Launch Template, ASG, Capacity Provider, Log Group
  AWS-->>Pulumi: Return resource ARNs / cluster name
  Pulumi-->>GH: Export outputs (e.g., aws_ecs_models_cluster_name, ecr image URI)
  Prefect->>AWS: Create/use work pools (fund-models-remote / fund-models-local)
  GH->>Prefect: Deploy flows using new work pools (via maskfile/devenv scripts)
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — docstring coverage is 9.09%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed — check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — the title 'Cleanup ECR resources and Docker image handling' accurately summarizes the main changes: ECR repository renaming, Docker stage updates, and removal of unused Docker configuration.



@forstmeier forstmeier changed the title Rename tide ECR repo from tide-runner to tide-model-runner and remove dead code Cleanup ECR resources and Docker image handling Apr 3, 2026
Contributor

greptile-apps Bot commented Apr 3, 2026

Greptile Summary

This PR renames the ECR repository from fund/tide-runner to fund/tide-model-runner, the corresponding Docker build stage, and all Prefect work pool names (to fund-models-remote/fund-models-local). It removes the dead root models/Dockerfile, drops the model-trainer deploy job entries from CI (correct since the tide runner is a Prefect ECS batch task), and adds a new infrastructure/training.py that provisions GPU training infrastructure (a dedicated fund-models ECS cluster, a g4dn.xlarge ASG scaling to zero, a capacity provider, launch template, IAM instance profile, and a new least-privilege github_actions_trainer_policy). A long-standing cluster name typo (fund-application → fund-applications) is also fixed in devenv.nix and maskfile.md.

Confidence Score: 5/5

Safe to merge; all remaining findings are P2 style suggestions that do not affect production deployments

The rename and refactoring changes are consistent and well-tested throughout. Both issues are P2 (missing env-var guard and override in developer convenience scripts) that do not block the primary training or deployment workflow.

devenv.nix — two minor issues with PREFECT_API_URL and PREFECT_API_KEY handling in the renamed init scripts

Important Files Changed

  • .github/workflows/launch_infrastructure.yaml — CI matrix updated: replaces model-trainer/server-worker entries with tide/model-runner in the build job; removes model-trainer deploy entries since tide runs as a Prefect ECS batch task, not a long-running service.
  • devenv.nix — Renames training scripts, updates work pool names, fixes the cluster name typo; two P2 issues: a missing PREFECT_API_URL override in initialize-local-trainer and a dropped PREFECT_API_KEY guard in initialize-remote-trainer.
  • infrastructure/__main__.py — Imports and exports updated to use tide_model_runner_repository and tide_model_runner_image_uri; adds the aws_ecs_models_cluster_name export from the new training module.
  • infrastructure/compute.py — ECS cluster name corrected from fund-application to fund-applications to match the actual provisioned resource name.
  • infrastructure/iam.py — Adds github_actions_trainer_policy scoped to fund-models-instance-role and fund-models-instance-profile; the policy is attached to github_actions_infrastructure_role alongside the existing infrastructure policy.
  • infrastructure/storage.py — ECR repository renamed from tide_runner_repository/fund/tide-runner to tide_model_runner_repository/fund/tide-model-runner, with matching lifecycle policy and image URI variable renames.
  • infrastructure/training.py — New file: provisions the fund-models ECS cluster with a g4dn.xlarge ASG scaling to zero, an ECS capacity provider with managed scaling, a launch template using the GPU-optimized ECS AMI, an EC2 instance profile with AmazonEC2ContainerServiceforEC2Role, and a CloudWatch log group.
  • libraries/python/tests/test_infrastructure_iam.py — Updated to assert that the github_actions_trainer_policy resource name is declared and its ARN is attached to the infrastructure role.
  • libraries/python/tests/test_infrastructure_storage.py — Lifecycle policy assertion updated from tide_runner_repository_lifecycle to tide_model_runner_repository_lifecycle.
  • libraries/python/tests/test_infrastructure_training.py — New tests covering the GPU instance type, scale-to-zero ASG, EC2 instance profile with the ECS role, capacity provider wiring, and ECS_ENABLE_GPU_SUPPORT in user data.
  • maskfile.md — Fixes the ECS cluster name typo in ecs-deploy/ecs-status; adds a tide-model-runner case in the deploy command that exits early with an informative message since it is a batch task, not a service.
  • models/Dockerfile — Deleted: removes the dead server-worker stage that was no longer referenced after a prior architecture change.
  • models/tide/Dockerfile — Docker build stage renamed from runner to model-runner to align with the ECR repository name and CI workflow matrix stage name.
  • models/tide/src/tide/deploy.py — Deployment name updated to tide-trainer-remote and work pool to fund-models-remote; consistent with prefect.yaml and devenv.nix.
  • models/tide/tests/test_deploy.py — Test assertions updated to match the new deployment name (tide-trainer-remote) and work pool name (fund-models-remote).
  • prefect.yaml — Work pool names updated from fund-work-pool-ecs/fund-work-pool-local to fund-models-remote/fund-models-local across both deployment entries.
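The new training-infrastructure tests described above assert against the module's source text. A minimal sketch of that test style, using an inline stand-in for the real source (the loader helper and the exact resource arguments below are assumptions, not the repo's actual code):

```python
# Stand-in source text; the real tests load infrastructure/training.py from
# disk via a loader helper, and the argument names below are illustrative.
SAMPLE_TRAINING_SOURCE = '''
models_launch_template = aws.ec2.LaunchTemplate(
    "models_launch_template",
    instance_type="g4dn.xlarge",
    user_data=USER_DATA,  # echoes ECS_ENABLE_GPU_SUPPORT=true into /etc/ecs/ecs.config
)
models_asg = aws.autoscaling.Group(
    "models_asg", min_size=0, desired_capacity=0, max_size=1,
)
'''

def test_training_uses_gpu_instances() -> None:
    # The GPU instance type must appear in the launch template definition.
    assert 'instance_type="g4dn.xlarge"' in SAMPLE_TRAINING_SOURCE

def test_training_asg_scales_to_zero() -> None:
    # Scale-to-zero: the ASG idles at zero instances until a task needs capacity.
    assert "min_size=0" in SAMPLE_TRAINING_SOURCE
    assert "desired_capacity=0" in SAMPLE_TRAINING_SOURCE

test_training_uses_gpu_instances()
test_training_asg_scales_to_zero()
print("ok")
```

Source-scan tests like these are coarse (they match strings, not evaluated resources), but they catch exactly the class of regression this PR fixes: a rename applied in one file and missed in another.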
This is a comment left during a code review.
Path: devenv.nix
Line: 302

Comment:
**Missing `PREFECT_API_URL` for local deployment registration**

The old `training-setup` script prefixed every `prefect` invocation with `PREFECT_API_URL="http://localhost:4200/api"`. The new `initialize-local-trainer` drops this, so if `PREFECT_API_KEY` is present in the developer's shell environment (common after configuring remote training), Prefect will route this command to Prefect Cloud and silently register `tide-trainer-local` there instead of on the local orchestrator.

```suggestion
    PREFECT_API_URL="http://localhost:4200/api" uv run prefect --no-prompt deploy --name tide-trainer-local
```


---

This is a comment left during a code review.
Path: devenv.nix
Line: 271-273

Comment:
**`PREFECT_API_KEY` guard removed from remote trainer init**

The old `training-init` script checked for `PREFECT_API_KEY` at the top and exited early with a clear, actionable message if it was unset. The new `initialize-remote-trainer` drops this guard entirely, so a missing key produces a cryptic Prefect authentication error rather than a helpful prompt for new contributors setting up remote training.

```suggestion
  scripts.initialize-remote-trainer.exec = ''
    if [ -z "${PREFECT_API_KEY:-}" ]; then
      echo "PREFECT_API_KEY not set. Add it to .envrc and run 'direnv allow'."
      exit 1
    fi

    unset PREFECT_API_URL
```


Reviews (4) — last reviewed commit: "Address PR #813 feedback"

Contributor

Copilot AI left a comment


Pull request overview

This PR aligns the “tide” model image/repository naming with the project’s “model-runner” convention, updates CI to build/push the renamed image target, and removes an unreferenced legacy Dockerfile.

Changes:

  • Renamed the Tide ECR repository and related Pulumi outputs/resources from tide-runner to tide-model-runner.
  • Updated GitHub Actions + Tide Docker build stage naming to use tide + model-runner (and removed old model-trainer matrix entries).
  • Deleted the unused models/Dockerfile and updated the maskfile ECS deploy lookup + a storage test assertion.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.

Summary per file:

  • models/tide/Dockerfile — Renames the build stage from runner to model-runner to match CI naming.
  • models/Dockerfile — Removes a dead/unused Dockerfile that defined an unreferenced server-worker stage.
  • maskfile.md — Updates the ECS deploy lookup to tide-model-runner.
  • libraries/python/tests/test_infrastructure_storage.py — Updates the test assertion for the renamed ECR lifecycle policy resource.
  • infrastructure/storage.py — Renames the ECR repository and lifecycle policy resources to tide_model_runner_* and changes the repo name.
  • infrastructure/__main__.py — Updates Pulumi exports/imports to the new Tide model-runner naming.
  • .github/workflows/launch_infrastructure.yaml — Updates the image build/push matrix to tide + model-runner and removes old model-trainer deploy entries.


Collaborator

coveralls commented Apr 3, 2026

Coverage Status

coverage: 78.162% (+0.06%) from 78.101% when pulling 7c7ee50 on prefect-cloud-fixes into a14bf54 on master.

Contributor

@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libraries/python/tests/test_infrastructure_storage.py (1)

25-31: 🛠️ Refactor suggestion | 🟠 Major

Assert the renamed ECR repository, not just the lifecycle symbol.

This still passes if tide_model_runner_repository exists but its name stays fund/tide-runner, which is the regression this PR is fixing. Add an assertion for the actual repository name or image URI so the stale-repo case is covered.

🧪 Suggested assertion

```diff
 def test_storage_contains_ecr_lifecycle_policy_resources() -> None:
     infrastructure_storage = load_infrastructure_storage()

     assert '"data_manager_repository_lifecycle"' in infrastructure_storage
     assert '"portfolio_manager_repository_lifecycle"' in infrastructure_storage
     assert '"ensemble_manager_repository_lifecycle"' in infrastructure_storage
     assert '"tide_model_runner_repository_lifecycle"' in infrastructure_storage
+    assert 'fund/tide-model-runner' in infrastructure_storage
```

As per coding guidelines: "When fixing a bug, write tests that reproduce the bug before fixing it, then verify the tests pass after the fix."

Additional inline comments:

In .github/workflows/launch_infrastructure.yaml (around lines 31-33): the build_and_push_images job pushes the tide/model-runner image to the ECR repository fund/tide-model-runner, but that repository is only created later by launch_infrastructure (via the repository creation in infrastructure/storage.py), so the push can fail on a first deploy. Ensure fund/tide-model-runner exists before build_and_push_images runs: either add an early provisioning step in this workflow that creates the ECR repository, or move the repository creation into a pre-build job that completes before build_and_push_images.

In maskfile.md (line 196): the docs are inconsistent — the canonical service name was changed to tide-model-runner, but example text still says "tide runner". Update the example occurrences to "model-runner" or the full canonical "tide-model-runner" so CLI/service examples and echo messages match the renamed stage/repo.


📥 Commits

Reviewing files that changed from the base of the PR and between a14bf54 and 4eac37a.

📒 Files selected for processing (7)
  • .github/workflows/launch_infrastructure.yaml
  • infrastructure/__main__.py
  • infrastructure/storage.py
  • libraries/python/tests/test_infrastructure_storage.py
  • maskfile.md
  • models/Dockerfile
  • models/tide/Dockerfile
💤 Files with no reviewable changes (1)
  • models/Dockerfile

Root cause: training jobs were running on Fargate (no GPU), causing
tinygrad to fail with AttributeError on cuInit because CUDA=1 was set
in the Dockerfile but no GPU driver was available.

Changes:
- Add infrastructure/training.py: new fund-models ECS cluster backed by
  EC2 g4dn.xlarge GPU instances via auto-scaling group (scale-to-zero),
  capacity provider with managed scaling, and CloudWatch log group
- Add fund-github-actions-trainer-policy: separate IAM managed policy for
  trainer role/instance profile permissions, split from the main policy
  to stay within AWS's 6144-byte policy size limit
- Rename ECS application cluster from fund-application to fund-applications
- Rename all Prefect/infrastructure resources from fund-trainer to fund-models
  to reflect that the cluster will host multiple model workloads beyond training
- Rename Prefect deployment from fund-trainer-remote to tide-trainer-remote
  to be workflow-specific rather than infrastructure-scoped
- Rename mask infrastructure trainer command to mask infrastructure models
- Update devenv.nix scripts and process work pool references to fund-models-local
- Update prefect.yaml deployment names and work pool names throughout
- Add tests for new training infrastructure
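The policy split above is driven by AWS's managed-policy size limit (6,144 characters, counted on the policy JSON excluding whitespace). That constraint can be checked ahead of a deploy; a rough sketch, where the policy document is a placeholder rather than the real trainer policy:

```python
import json

AWS_MANAGED_POLICY_LIMIT = 6144  # characters; AWS does not count whitespace

def policy_size(document: dict) -> int:
    """Approximate the size AWS counts: minified JSON with no whitespace."""
    return len(json.dumps(document, separators=(",", ":")))

# Placeholder statement standing in for the trainer policy's real contents.
trainer_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ManageTrainerInstanceRole",
            "Effect": "Allow",
            "Action": ["iam:PassRole", "iam:GetRole"],
            "Resource": "arn:aws:iam::*:role/fund-models-instance-role",
        }
    ],
}

print(policy_size(trainer_policy) <= AWS_MANAGED_POLICY_LIMIT)  # → True
```

When a single policy approaches the limit, splitting it into multiple managed policies attached to the same role (as done here with the trainer policy) keeps each document under the cap.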

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai Bot left a comment



Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infrastructure/iam.py (1)

254-269: ⚠️ Potential issue | 🟡 Minor

Add autoscaling.amazonaws.com to DeleteServiceLinkedRoles statement or document why it's intentionally excluded.

The CreateServiceLinkedRolesForStack statement includes autoscaling.amazonaws.com (line 263), but the DeleteServiceLinkedRoles statement (lines 351–367) does not. This means GitHub Actions can create the autoscaling SLR but cannot delete it during infrastructure teardown. If this asymmetry is intentional (e.g., the SLR is shared and should persist), add a comment explaining the design choice. Otherwise, add autoscaling.amazonaws.com to the deletion condition to ensure consistent lifecycle management.
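The asymmetry described above can be checked mechanically by diffing the two statements' service-name conditions. A sketch, with ecs.amazonaws.com as an assumed second service name (only autoscaling.amazonaws.com is confirmed by the finding):

```python
# Condition values from the two IAM statements; ecs.amazonaws.com is assumed.
create_slr_services = {"ecs.amazonaws.com", "autoscaling.amazonaws.com"}
delete_slr_services = {"ecs.amazonaws.com"}  # autoscaling.amazonaws.com omitted

# Any service creatable but not deletable is a lifecycle asymmetry worth
# documenting or fixing.
asymmetry = sorted(create_slr_services - delete_slr_services)
print(asymmetry)  # → ['autoscaling.amazonaws.com']
```

A tiny check like this can live in the existing source-scan test suite so the create/delete conditions cannot silently drift apart again.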

♻️ Duplicate comments (1)
maskfile.md (1)

196-196: ⚠️ Potential issue | 🟡 Minor

Docs are still inconsistent with the renamed stage.

Line 196 uses tide-model-runner, but examples in this file still reference tide runner, leaving operator guidance out of sync.

Additional inline comments:

In infrastructure/training.py (around lines 92-111): the autoscaling group models_asg only sets the Name tag and omits the shared tags in config.tags. Extend the tags argument to also include the config.tags entries, mapped into aws.autoscaling.GroupTagArgs with propagate_at_launch=True (handling duplicate keys), so the environment/project identifiers are applied at launch.
Around lines 48-57: the models_ami lookup uses an Amazon Linux 2 ECS GPU AMI name filter. Target Amazon Linux 2023 instead: either change the filter to the AL2023 pattern al2023-ami-ecs-gpu-hvm-*-kernel-*-x86_64-ebs, or replace the get_ami_output call with an SSM Parameter Store lookup of /aws/service/ecs/optimized-ami/amazon-linux-2023/gpu/recommended and use the returned AMI ID.

In maskfile.md (around lines 448-450): the local initialization branch calls the remote deployment module (python -m tide.deploy), which deploys tide-trainer-remote to fund-models-remote. Change the branch to a local-only deploy (or drop the deployment call) so 'local' initialization does not register the remote deployment, and update the invocation where "Registering training deployment..." is echoed to use the local deploy path.



📥 Commits

Reviewing files that changed from the base of the PR and between 4eac37a and e5a260a.

📒 Files selected for processing (11)
  • devenv.nix
  • infrastructure/__main__.py
  • infrastructure/compute.py
  • infrastructure/iam.py
  • infrastructure/training.py
  • libraries/python/tests/test_infrastructure_iam.py
  • libraries/python/tests/test_infrastructure_training.py
  • maskfile.md
  • models/tide/src/tide/deploy.py
  • models/tide/tests/test_deploy.py
  • prefect.yaml

- Update models AMI lookup to use AL2023 ECS GPU AMI via SSM parameter
  store instead of AL2 AMI name filter
- Propagate config.tags on ASG in addition to the Name tag so project,
  stack, and manager tags are applied at launch
- Fix local init command to deploy tide-trainer-local from prefect.yaml
  instead of calling tide.deploy which hardcodes the remote deployment
- Update maskfile image command examples from "tide runner" to
  "tide model-runner" to match the canonical ECR stage name
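The AL2023 SSM lookup in the first bullet resolves a well-known Parameter Store path. A small sketch of how that path is composed; the helper is hypothetical, while the parameter name itself is AWS's documented one for ECS-optimized AMIs:

```python
def ecs_ami_parameter(os_family: str = "amazon-linux-2023", gpu: bool = True) -> str:
    """Compose the SSM Parameter Store path for the recommended ECS-optimized AMI."""
    base = f"/aws/service/ecs/optimized-ami/{os_family}"
    return f"{base}/gpu/recommended" if gpu else f"{base}/recommended"

print(ecs_ami_parameter())
# → /aws/service/ecs/optimized-ami/amazon-linux-2023/gpu/recommended
```

Resolving the AMI through this parameter (rather than a name filter) tracks AWS's recommended image automatically as new ECS GPU AMIs are published.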

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 5, 2026 16:39

Copilot AI left a comment


Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

infrastructure/iam.py:104

  • The GitHub Actions infrastructure policy now includes autoscaling:* on Resource: "*", which grants full Auto Scaling permissions account-wide. If the intent is only to manage the models GPU ASG/capacity provider, consider scoping actions to the specific ASG/launch template ARNs (and/or limiting to the minimal autoscaling actions Pulumi needs) to keep this policy least-privilege.
```python
"Sid": "ManageEC2ECSELBBudgetsAndServiceDiscovery",
"Effect": "Allow",
"Action": [
    "autoscaling:*",
    "ec2:*",
    "ecs:*",
    "elasticloadbalancing:*",
    "budgets:*",
    "servicediscovery:*",
],
"Resource": "*",
```



@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
maskfile.md (1)

844-863: ⚠️ Potential issue | 🟡 Minor

Command assumes work pool already exists and provides no validation.

The model deploy tide command has been simplified to only call tide.deploy, which hardcodes deployment to the fund-models-remote work pool. Unlike model train which validates and selects the Pulumi stack, this command skips infrastructure validation and assumes the work pool has already been created (via models initialize remote).

This creates a design gap:

  1. Implicit dependency: The command silently fails if the work pool doesn't exist, with no error message explaining the root cause.
  2. Overlapping commands: Both model deploy tide and models initialize remote register deployments, making it unclear which to use.
  3. No environment flexibility: The hardcoded remote deployment means there's no equivalent for local deployment (users must use models initialize local instead).

The command works correctly when used in the proper sequence, but the design doesn't communicate these constraints.

💡 Recommendations

Option 1: Add work pool existence check to prevent silent failures:

```shell
# In model deploy, before calling tide.deploy
if ! uv run prefect work-pool inspect fund-models-remote >/dev/null 2>&1; then
    echo "Error: fund-models-remote work pool not found"
    echo "Run 'mask models initialize remote' first"
    exit 1
fi
```

Option 2: Consolidate with models initialize remote by documenting that model deploy tide is for re-deployment only and requires prior initialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@maskfile.md` around lines 844 - 863, The deploy command for model_name (the
"model deploy tide" branch) assumes the Prefect work pool "fund-models-remote"
exists and directly calls tide.deploy; add a pre-check that verifies the work
pool exists (using the same uv/prefect inspection approach used elsewhere) and
if missing print a clear error telling the user to run "mask models initialize
remote" (or run initialization) and exit non‑zero; keep the call to tide.deploy
only after the existence check so the command fails fast with a helpful message
instead of silently erroring.
♻️ Duplicate comments (1)
infrastructure/training.py (1)

101-115: ⚠️ Potential issue | 🟡 Minor

Prevent potential duplicate Name tag entries in ASG tags.

Lines 102-106 already add Name, then Lines 108-115 append all tags.items(). If config.tags contains Name, this produces duplicate tag keys. Filter Name in the list comprehension.

```shell
#!/bin/bash
# Verify whether `config.tags` may include "Name", which would duplicate ASG tag keys.
# Expected: "Name" should be absent from shared tags, or code should explicitly filter it.

set -euo pipefail

echo "== Locate config.py files =="
fd -i 'config.py'

echo
echo "== Inspect tags declarations and nearby context =="
while IFS= read -r f; do
  echo "--- $f ---"
  rg -n -C4 '^\s*tags\s*=' "$f" || true
  rg -n -C2 '"Name"|'\''Name'\''' "$f" || true
done < <(fd -i 'config.py')
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infrastructure/training.py` around lines 101 - 115, The ASG tags list
currently adds a hardcoded Name tag via aws.autoscaling.GroupTagArgs and then
extends with all entries from tags.items(), which can produce duplicate "Name"
keys; modify the list comprehension that builds additional
aws.autoscaling.GroupTagArgs (the comprehension iterating "for key, value in
tags.items()") to skip entries where key == "Name" (or key.lower() == "name" if
case-insensitive) so only the single intended Name tag is included.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@infrastructure/training.py`:
- Around line 56-87: The models_launch_template AWS LaunchTemplate must enforce
IMDSv2: add a metadata_options block to the aws.ec2.LaunchTemplate call (the
models_launch_template instantiation) and set http_tokens="required" (optionally
also set http_endpoint="enabled" and a conservative http_put_response_hop_limit
like 1); update the metadata_options using the appropriate LaunchTemplate
metadata args so instances launched from models_launch_template require IMDSv2.
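A hedged sketch of the metadata options this comment asks for, expressed as plain keyword values; in the real code these would be passed to aws.ec2.LaunchTemplate via LaunchTemplateMetadataOptionsArgs, and the helper function here is hypothetical:

```python
def enforces_imdsv2(metadata_options: dict) -> bool:
    """Return True when launch-template metadata options require IMDSv2 tokens."""
    return metadata_options.get("http_tokens") == "required"

# The values the review suggests; keys mirror the LaunchTemplate metadata args.
metadata_options = {
    "http_endpoint": "enabled",        # keep the metadata service reachable
    "http_tokens": "required",         # enforce IMDSv2 session tokens
    "http_put_response_hop_limit": 1,  # conservative hop limit
}
```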

In `@libraries/python/tests/test_infrastructure_training.py`:
- Around line 11-43: Add targeted unit tests in
libraries/python/tests/test_infrastructure_training.py that exercise
missing/alternate branches of load_infrastructure_training: create tests that
(1) simulate configuration where models_cluster or models_capacity_provider is
absent and assert the template omits '"models_cluster"' and
'"models_capacity_provider"', (2) simulate a non-GPU trainer to assert
instance_type is not "g4dn.xlarge" and that "amazon-linux-2023/gpu/recommended"
is absent, (3) simulate ASG with min_size>0 and desired_capacity>0 to assert
"min_size=0" and "desired_capacity=0" are absent, and (4) simulate an
environment without an instance profile to assert "models_instance_profile" and
"AmazonEC2ContainerServiceforEC2Role" are omitted; ensure each test calls
load_infrastructure_training and uses clear asserts to drive coverage above 90%
for this module.
- Around line 7-43: The tests currently assert raw substrings from
load_infrastructure_training() which is brittle; instead parse the generated
Python/templated infrastructure into an AST or load it as a resource object and
assert concrete structures and arguments. Update the tests (e.g.,
test_models_cluster_uses_ec2_backed_capacity_provider,
test_trainer_uses_gpu_instance_type, test_trainer_asg_scales_to_zero,
test_models_instance_profile_configured,
test_trainer_gpu_support_enabled_in_user_data) to call
load_infrastructure_training(), parse the content into an AST or deserialize to
a resource/model representation, then assert the presence and properties of the
relevant nodes/objects (e.g., check the Cluster creation has
models_capacity_provider and models_cluster_capacity_providers, the Trainer/ASG
resource has instance_type "g4dn.xlarge" and amazon-linux-2023/gpu/recommended
image, min_size/desired_capacity equal to 0, instance profile includes
AmazonEC2ContainerServiceforEC2Role, and user data contains
ECS_ENABLE_GPU_SUPPORT) rather than using substring membership.
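One way to realize the AST suggestion, sketched against a stand-in source string; the call shape and names are assumptions about what load_infrastructure_training() renders, not the actual template:

```python
import ast

# Stand-in for the rendered infrastructure module.
source = 'aws.ec2.LaunchTemplate("models", instance_type="g4dn.xlarge")'

def find_keyword(source: str, call_attr: str, keyword: str):
    """Return the literal value of `keyword` on the first call to `call_attr`."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == call_attr
        ):
            for kw in node.keywords:
                if kw.arg == keyword and isinstance(kw.value, ast.Constant):
                    return kw.value.value
    return None

instance_type = find_keyword(source, "LaunchTemplate", "instance_type")
```

Asserting on parsed keyword arguments like this survives reformatting and renamed variables, which plain substring checks do not.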

In `@maskfile.md`:
- Around line 443-450: In the local) branch ensure PREFECT_API_URL is set to the
local Prefect server before creating the pool and deploying: set PREFECT_API_URL
to "http://localhost:4200/api" in that branch so the uv run --package tools
prefect work-pool create "fund-models-local" and uv run prefect --no-prompt
deploy --name tide-trainer-local commands target the local server (either export
the variable at the start of the local) case or prefix those uv run commands
with the PREFECT_API_URL assignment).
- Line 196: The early-exit echo "No ECS service for tide model runner" is
ambiguous; update the message where the script handles the tide-model-runner
case (the branch that runs echo "No ECS service for tide model runner" && exit
0) to explicitly state that tide-model-runner is intentionally not an ECS
long-running service and that the script is exiting intentionally (e.g.,
"tide-model-runner is used only for Prefect training; no ECS service expected —
exiting 0"). Keep the same exit code (0) and ensure the change is made in the
branch referencing "tide-model-runner".


Run ID: 44c6b8c4-8ce2-496c-b00a-8b8a3560728e

📥 Commits

Reviewing files that changed from the base of the PR and between e5a260a and e56c5ec.

📒 Files selected for processing (3)
  • infrastructure/training.py
  • libraries/python/tests/test_infrastructure_training.py
  • maskfile.md

- Add IMDSv2 enforcement to models launch template (http_tokens=required)
- Set PREFECT_API_URL=http://localhost:4200/api in maskfile local init
  branch so work pool and deployment target the local Prefect server
- Fix devenv initialize-local-trainer to deploy tide-trainer-local via
  prefect deploy instead of calling tide.deploy which registers the
  remote deployment
- Clarify mask infrastructure models initialize local echo message
- Update mask model deploy description to say Prefect Cloud, matching
  the actual behavior of unsetting PREFECT_API_URL
- Clarify tide-model-runner early-exit message in mask infrastructure
  image deploy to explain it is a training image not an ECS service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
devenv.nix (1)

357-360: ⚠️ Potential issue | 🟠 Major

Local worker bootstrap still invokes remote deployment registration.

Line 360 still runs tide.deploy, which registers tide-trainer-remote to fund-models-remote; that conflicts with the local pool created on Line 358, and any failure is silently swallowed by 2>/dev/null || true.

Suggested fix

```diff
       PREFECT_API_URL="http://localhost:4200/api" \
-        uv run --package tide python -m tide.deploy 2>/dev/null || true
+        uv run prefect --no-prompt deploy --name tide-trainer-local 2>/dev/null || true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@devenv.nix` around lines 357 - 360, The bootstrap is still invoking the
remote registration via the command that runs `python -m tide.deploy`, which
registers `tide-trainer-remote` to `fund-models-remote` and conflicts with the
local pool created by the `uv ... prefect work-pool create "fund-models-local"`
command; remove or guard the `uv run --package tide python -m tide.deploy
2>/dev/null || true` invocation so local bootstraps do not perform remote
deployment registration (e.g., delete the line or wrap it in a conditional that
only runs for remote envs), and also remove the noisy suppression (`2>/dev/null
|| true`) so failures are visible when intended.

Run ID: 0b4982f2-cc3f-474f-bdb7-2e52cb7e3e26

📥 Commits

Reviewing files that changed from the base of the PR and between e56c5ec and 7c7ee50.

📒 Files selected for processing (3)
  • devenv.nix
  • infrastructure/training.py
  • maskfile.md

@forstmeier forstmeier merged commit 1975409 into master Apr 5, 2026
14 checks passed
@forstmeier forstmeier deleted the prefect-cloud-fixes branch April 5, 2026 19:56
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Overview Apr 5, 2026
Labels: markdown (Markdown code updates), python (Python code updates), yaml (YAML code updates)