Cleanup ECR resources and Docker image handling #813
Conversation
Root cause: The Prefect flow was crashing because the ECS task used a stale `fund/tide-runner:worker` image built from commit f6c3a08 (March 17), which lacked `prefect-aws`, `mlflow`, and internal dependencies added to `models/tide/pyproject.toml` since then. The `:worker` tag was manually pushed and diverged from `:latest`.

Changes:
- Rename ECR repo `fund/tide-runner` -> `fund/tide-model-runner` in `infrastructure/storage.py` and `infrastructure/__main__.py` to match the maskfile naming convention (`repository_name="fund/${package_name}-${stage_name}"`)
- Rename Dockerfile stage `AS runner` -> `AS model-runner` in `models/tide/Dockerfile` so `mask infrastructure image build tide model-runner` resolves the correct target
- Delete `models/Dockerfile` (dead code: old Prefect polling-worker image that ran `prefect worker start`; no longer needed with the `ecs:push` work pool architecture)
- Update `.github/workflows/launch_infrastructure.yaml`: replace the `model-trainer`/`server-worker` matrix entry with `tide`/`model-runner`; remove `model-trainer` entries from the `deploy_images` matrix (tide has no ECS service, only an ECS task run by Prefect Cloud directly)
- Update the `maskfile.md` deploy case from `tide-runner)` to `tide-model-runner)`
- Update `test_infrastructure_storage.py` to assert `tide_model_runner_repository_lifecycle`

All 4 images rebuilt and pushed to ECR at commit 2feadf6:
- fund/data-manager-server
- fund/portfolio-manager-server
- fund/ensemble-manager-server
- fund/tide-model-runner (new repo, replaces fund/tide-runner)

Prefect work pool `fund-work-pool-ecs` default image updated to `982604770479.dkr.ecr.us-east-1.amazonaws.com/fund/tide-model-runner:latest`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
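The maskfile naming convention referenced above (`repository_name="fund/${package_name}-${stage_name}"`) can be sketched as a small helper; the function name is illustrative, not code from this repository:

```python
def ecr_repository_name(package_name: str, stage_name: str) -> str:
    """Build the ECR repository name from the maskfile convention
    fund/${package_name}-${stage_name}."""
    return f"fund/{package_name}-{stage_name}"

# The old repo name came from the stage being called "runner";
# renaming the stage to "model-runner" yields the new repo name.
assert ecr_repository_name("tide", "runner") == "fund/tide-runner"
assert ecr_repository_name("tide", "model-runner") == "fund/tide-model-runner"
```

Under this convention, the Dockerfile stage name and the ECR repository suffix are the same string, which is why the stage rename and the repository rename have to land together.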
📝 Walkthrough

Renames tide-runner artifacts to tide-model-runner, adds GPU-backed training infrastructure (ECS cluster, ASG, capacity provider, IAM changes), updates the CI/CD workflow and Prefect work pools, removes the legacy models Dockerfile, and updates related tests and developer scripts.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as "GitHub Actions"
    participant Registry as "ECR (fund/tide-model-runner)"
    participant Pulumi as "Pulumi"
    participant AWS as "AWS (ECS / EC2 / ASG / IAM / CloudWatch)"
    participant Prefect as "Prefect Cloud"
    GH->>Registry: Build & push model image (models/**) to fund/tide-model-runner
    GH->>Pulumi: Trigger infra deploy (update exports)
    Pulumi->>AWS: Create/Update IAM, ECR, ECS cluster, Launch Template, ASG, Capacity Provider, Log Group
    AWS-->>Pulumi: Return resource ARNs / cluster name
    Pulumi-->>GH: Export outputs (e.g., aws_ecs_models_cluster_name, ecr image URI)
    Prefect->>AWS: Create/use work pools (fund-models-remote / fund-models-local)
    GH->>Prefect: Deploy flows using new work pools (via maskfile/devenv scripts)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Greptile Summary

This PR renames the ECR repository from `fund/tide-runner` to `fund/tide-model-runner`.

Confidence Score: 5/5 — Safe to merge; all remaining findings are P2 style suggestions that do not affect production deployments. The rename and refactoring changes are consistent and well-tested throughout. Both issues are P2 (a missing env-var guard and a missing override in developer convenience scripts) that do not block the primary training or deployment workflow.

devenv.nix — two minor issues with PREFECT_API_URL and PREFECT_API_KEY handling in the renamed init scripts
| Filename | Overview |
|---|---|
| .github/workflows/launch_infrastructure.yaml | CI matrix updated: replaces model-trainer/server-worker entries with tide/model-runner in build job; removes model-trainer deploy entries since tide runs as a Prefect ECS batch task, not a long-running service |
| devenv.nix | Renames training scripts, updates work pool names, fixes cluster name typo; two P2 issues: missing PREFECT_API_URL override in initialize-local-trainer and dropped PREFECT_API_KEY guard in initialize-remote-trainer |
| infrastructure/main.py | Imports and exports updated to use tide_model_runner_repository and tide_model_runner_image_uri; adds aws_ecs_models_cluster_name export from new training module |
| infrastructure/compute.py | ECS cluster name corrected from fund-application to fund-applications to match actual provisioned resource name |
| infrastructure/iam.py | Adds github_actions_trainer_policy scoped to fund-models-instance-role and fund-models-instance-profile; policy attached to github_actions_infrastructure_role alongside existing infrastructure policy |
| infrastructure/storage.py | ECR repository renamed from tide_runner_repository/fund/tide-runner to tide_model_runner_repository/fund/tide-model-runner with matching lifecycle policy and image URI variable rename |
| infrastructure/training.py | New file: provisions fund-models ECS cluster with g4dn.xlarge ASG scaling to zero, ECS capacity provider with managed scaling, launch template using GPU-optimized ECS AMI, EC2 instance profile with AmazonEC2ContainerServiceforEC2Role, and CloudWatch log group |
| libraries/python/tests/test_infrastructure_iam.py | Updated to assert github_actions_trainer_policy resource name is declared and its arn is attached to the infrastructure role |
| libraries/python/tests/test_infrastructure_storage.py | Lifecycle policy assertion updated from tide_runner_repository_lifecycle to tide_model_runner_repository_lifecycle |
| libraries/python/tests/test_infrastructure_training.py | New tests covering GPU instance type, scale-to-zero ASG, EC2 instance profile with ECS role, capacity provider wiring, and ECS_ENABLE_GPU_SUPPORT in user data |
| maskfile.md | Fixes ECS cluster name typo in ecs-deploy/ecs-status; adds tide-model-runner case in deploy command that exits early with informative message since it is a batch task, not a service |
| models/Dockerfile | Deleted: removes dead code server-worker stage that was no longer referenced after a prior architecture change |
| models/tide/Dockerfile | Docker build stage renamed from runner to model-runner to align with ECR repository name and CI workflow matrix stage name |
| models/tide/src/tide/deploy.py | Deployment name updated to tide-trainer-remote and work pool to fund-models-remote; consistent with prefect.yaml and devenv.nix |
| models/tide/tests/test_deploy.py | Test assertions updated to match new deployment name (tide-trainer-remote) and work pool name (fund-models-remote) |
| prefect.yaml | Work pool names updated from fund-work-pool-ecs/fund-work-pool-local to fund-models-remote/fund-models-local across both deployment entries |
Prompt To Fix All With AI
This is a comment left during a code review.
Path: devenv.nix
Line: 302
Comment:
**Missing `PREFECT_API_URL` for local deployment registration**
The old `training-setup` script prefixed every `prefect` invocation with `PREFECT_API_URL="http://localhost:4200/api"`. The new `initialize-local-trainer` drops this, so if `PREFECT_API_KEY` is present in the developer's shell environment (common after configuring remote training), Prefect will route this command to Prefect Cloud and silently register `tide-trainer-local` there instead of on the local orchestrator.
```suggestion
PREFECT_API_URL="http://localhost:4200/api" uv run prefect --no-prompt deploy --name tide-trainer-local
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: devenv.nix
Line: 271-273
Comment:
**`PREFECT_API_KEY` guard removed from remote trainer init**
The old `training-init` script checked for `PREFECT_API_KEY` at the top and exited early with a clear, actionable message if it was unset. The new `initialize-remote-trainer` drops this guard entirely, so a missing key produces a cryptic Prefect authentication error rather than a helpful prompt for new contributors setting up remote training.
```suggestion
scripts.initialize-remote-trainer.exec = ''
  if [ -z "''${PREFECT_API_KEY:-}" ]; then
    echo "PREFECT_API_KEY not set. Add it to .envrc and run 'direnv allow'."
    exit 1
  fi
  unset PREFECT_API_URL
```
How can I resolve this? If you propose a fix, please make it concise.

Reviews (4) — last reviewed commit: "Address PR #813 feedback"
Pull request overview
This PR aligns the “tide” model image/repository naming with the project’s “model-runner” convention, updates CI to build/push the renamed image target, and removes an unreferenced legacy Dockerfile.
Changes:
- Renamed the Tide ECR repository and related Pulumi outputs/resources from `tide-runner` to `tide-model-runner`.
- Updated the GitHub Actions and Tide Docker build stage naming to use `tide` + `model-runner` (and removed the old `model-trainer` matrix entries).
- Deleted the unused `models/Dockerfile` and updated the maskfile ECS deploy lookup and a storage test assertion.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| models/tide/Dockerfile | Renames the build stage from runner to model-runner to match CI naming. |
| models/Dockerfile | Removes a dead/unused Dockerfile that defined an unreferenced server-worker stage. |
| maskfile.md | Updates ECS deploy lookup to tide-model-runner. |
| libraries/python/tests/test_infrastructure_storage.py | Updates test assertion for renamed ECR lifecycle policy resource. |
| infrastructure/storage.py | Renames the ECR repository and lifecycle policy resources to tide_model_runner_* and changes the repo name. |
| infrastructure/__main__.py | Updates Pulumi exports/imports to the new Tide model-runner naming. |
| .github/workflows/launch_infrastructure.yaml | Updates image build/push matrix to tide + model-runner and removes old model-trainer deploy entries. |
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
libraries/python/tests/test_infrastructure_storage.py (1)
Lines 25-31: 🛠️ Refactor suggestion | 🟠 Major

Assert the renamed ECR repository, not just the lifecycle symbol.

This still passes if `tide_model_runner_repository` exists but its `name` stays `fund/tide-runner`, which is the regression this PR is fixing. Add an assertion for the actual repository name or image URI so the stale-repo case is covered.

🧪 Suggested assertion

```diff
 def test_storage_contains_ecr_lifecycle_policy_resources() -> None:
     infrastructure_storage = load_infrastructure_storage()
     assert '"data_manager_repository_lifecycle"' in infrastructure_storage
     assert '"portfolio_manager_repository_lifecycle"' in infrastructure_storage
     assert '"ensemble_manager_repository_lifecycle"' in infrastructure_storage
     assert '"tide_model_runner_repository_lifecycle"' in infrastructure_storage
+    assert 'fund/tide-model-runner' in infrastructure_storage
```

As per coding guidelines: "When fixing a bug, write tests that reproduce the bug before fixing it, then verify the tests pass after the fix."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@libraries/python/tests/test_infrastructure_storage.py` around lines 25 - 31, The test test_storage_contains_ecr_lifecycle_policy_resources only checks for lifecycle symbol names and misses verifying the actual repository name/URI; update the test (which calls load_infrastructure_storage) to assert that the renamed ECR repository (e.g., the tide_model_runner_repository entry) contains the expected repository name or image URI (not the old "fund/tide-runner") so the stale-repo regression is caught; specifically, after locating the "tide_model_runner_repository_lifecycle" block in infrastructure_storage, add an assertion that the corresponding repository object or its "name"/"repositoryUri" field equals the expected new name/URI.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/launch_infrastructure.yaml:
- Around line 31-33: The build_and_push_images job attempts to push the
tide/model-runner image to the ECR repo fund/tide-model-runner but that
repository is only created later by launch_infrastructure
(infrastructure/storage.py repository creation), so the push can fail on first
deploy; to fix, ensure fund/tide-model-runner is created before
build_and_push_images runs by either (a) adding a provisioning step early in
this workflow that creates the ECR repository (or calls the existing repository
creation logic) or (b) moving/duplicating the repository creation call into a
new pre-build job that launch_and_waits before build_and_push_images,
referencing the same repository name fund/tide-model-runner and the
repository-creation routine currently used in launch_infrastructure.
In `@maskfile.md`:
- Line 196: The docs are inconsistent: the canonical service name was changed to
tide-model-runner but example text still says "tide runner"; update the example
occurrences (e.g., the strings "tide runner" shown in the example blocks around
the earlier examples and any usage examples) to use "model-runner" or the full
canonical "tide-model-runner" so they match the renamed stage/repo (search for
and replace "tide runner" and any standalone "tide" examples with
"model-runner"/"tide-model-runner" and ensure CLI/service examples, README
examples, and any echo messages reflect the new name).
Root cause: training jobs were running on Fargate (no GPU), causing tinygrad to fail with `AttributeError` on `cuInit` because `CUDA=1` was set in the Dockerfile but no GPU driver was available.

Changes:
- Add `infrastructure/training.py`: new `fund-models` ECS cluster backed by EC2 g4dn.xlarge GPU instances via an auto-scaling group (scale-to-zero), a capacity provider with managed scaling, and a CloudWatch log group
- Add `fund-github-actions-trainer-policy`: separate IAM managed policy for trainer role/instance profile permissions, split from the main policy to stay within AWS's 6144-byte policy size limit
- Rename the ECS application cluster from `fund-application` to `fund-applications`
- Rename all Prefect/infrastructure resources from `fund-trainer` to `fund-models` to reflect that the cluster will host multiple model workloads beyond training
- Rename the Prefect deployment from `fund-trainer-remote` to `tide-trainer-remote` to be workflow-specific rather than infrastructure-scoped
- Rename `mask infrastructure trainer` to `mask infrastructure models`
- Update devenv.nix scripts and process work pool references to `fund-models-local`
- Update prefect.yaml deployment names and work pool names throughout
- Add tests for the new training infrastructure

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
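The GPU instances described above have to join the `fund-models` cluster and advertise their GPUs to the ECS agent via user data. A minimal sketch of that user data, built in plain Python (the helper name is illustrative; the actual user data in `infrastructure/training.py` may differ):

```python
def build_ecs_user_data(cluster_name: str, enable_gpu: bool = True) -> str:
    """Render a minimal user-data script for an ECS container instance.

    ECS_CLUSTER tells the agent which cluster to join; ECS_ENABLE_GPU_SUPPORT
    makes the agent register the instance's GPUs with the scheduler.
    """
    lines = [
        "#!/bin/bash",
        f"echo ECS_CLUSTER={cluster_name} >> /etc/ecs/ecs.config",
    ]
    if enable_gpu:
        lines.append("echo ECS_ENABLE_GPU_SUPPORT=true >> /etc/ecs/ecs.config")
    return "\n".join(lines)

user_data = build_ecs_user_data("fund-models")
assert "ECS_CLUSTER=fund-models" in user_data
assert "ECS_ENABLE_GPU_SUPPORT=true" in user_data
```

Without `ECS_ENABLE_GPU_SUPPORT=true`, a task requesting a GPU resource would never be placed even on a g4dn instance, which is the failure mode the scale-to-zero cluster is meant to avoid.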
Actionable comments posted: 3
⚠️ Outside diff range comments (1)
infrastructure/iam.py (1)
Lines 254-269: ⚠️ Potential issue | 🟡 Minor

Add `autoscaling.amazonaws.com` to the DeleteServiceLinkedRoles statement or document why it's intentionally excluded.

The `CreateServiceLinkedRolesForStack` statement includes `autoscaling.amazonaws.com` (line 263), but the `DeleteServiceLinkedRoles` statement (lines 351–367) does not. This means GitHub Actions can create the autoscaling SLR but cannot delete it during infrastructure teardown. If this asymmetry is intentional (e.g., the SLR is shared and should persist), add a comment explaining the design choice. Otherwise, add `autoscaling.amazonaws.com` to the deletion condition to ensure consistent lifecycle management.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@infrastructure/iam.py` around lines 254 - 269, The CreateServiceLinkedRolesForStack policy includes "autoscaling.amazonaws.com" but the DeleteServiceLinkedRoles policy omits it; update the DeleteServiceLinkedRoles statement (the IAM policy block named DeleteServiceLinkedRoles) to include "autoscaling.amazonaws.com" in its iam:AWSServiceName condition to allow deletion, or if intentional, add a clear inline comment next to the DeleteServiceLinkedRoles block explaining why autoscaling SLRs must persist and should not be deleted; reference the policy names CreateServiceLinkedRolesForStack and DeleteServiceLinkedRoles when making the change.
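One way to keep the two statements from drifting is to build both from a single shared service list. A sketch in plain Python against the policy-document shape used in `infrastructure/iam.py` (the statement Sids come from the review; the shared constant and helper are hypothetical):

```python
# Hypothetical shared constant: both statements reference the same list,
# so a service added to creation is automatically allowed for deletion.
SERVICE_LINKED_ROLE_SERVICES = [
    "ecs.amazonaws.com",
    "autoscaling.amazonaws.com",
]

def service_linked_role_statements(services: list) -> list:
    """Build matching create/delete SLR statements from one service list."""
    condition = {"StringEquals": {"iam:AWSServiceName": services}}
    return [
        {
            "Sid": "CreateServiceLinkedRolesForStack",
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": condition,
        },
        {
            "Sid": "DeleteServiceLinkedRoles",
            "Effect": "Allow",
            "Action": "iam:DeleteServiceLinkedRole",
            "Resource": "*",
            "Condition": condition,
        },
    ]

create, delete = service_linked_role_statements(SERVICE_LINKED_ROLE_SERVICES)
assert create["Condition"] == delete["Condition"]
```

The exact service list is an assumption here; the point is that a shared constant makes the create/delete asymmetry impossible rather than merely documented.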
♻️ Duplicate comments (1)
maskfile.md (1)
Line 196: ⚠️ Potential issue | 🟡 Minor

Docs are still inconsistent with the renamed stage.

Line 196 uses `tide-model-runner`, but examples in this file still reference `tide runner`, leaving operator guidance out of sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maskfile.md` at line 196, Documentation is inconsistent: the stage was renamed to tide-model-runner but other examples still say "tide runner"; update all references so they match the new stage name. Search the document for occurrences of "tide runner" (and similar variants like "tide-runner") and replace them with "tide-model-runner", and update any CLI/service example lines, examples, and operator guidance to use the exact symbol "tide-model-runner" (including the switch/echo example that currently references it) so all examples and guidance are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@infrastructure/training.py`:
- Around line 92-111: The autoscaling Group models_asg only sets the "Name" tag
and omits the shared tags in config.tags; update the tags argument of models_asg
to include the common tags from config.tags (converted into
aws.autoscaling.GroupTagArgs with propagate_at_launch=True) in addition to the
existing "Name" tag, ensuring duplicate keys are handled (override or skip) so
all required environment/project identifiers are propagated at launch. Use the
same GroupTagArgs shape used for the Name tag and build the list by mapping
config.tags entries into GroupTagArgs before passing to models_asg.
- Around line 48-57: The current models_ami lookup uses an Amazon Linux 2 ECS
GPU AMI pattern; update the aws.ec2.get_ami_output call (the models_ami
variable) to target Amazon Linux 2023 ECS GPU AMIs by changing the filter values
to the new pattern "al2023-ami-ecs-gpu-hvm-*-kernel-*-x86_64-ebs" or replace the
get_ami_output approach with an SSM Parameter Store lookup for
"/aws/service/ecs/optimized-ami/amazon-linux-2023/gpu/recommended" and use that
returned AMI ID when setting models_ami so the code references the AL2023
ECS-optimized GPU image instead of AL2.
In `@maskfile.md`:
- Around line 448-450: The local initialization branch incorrectly calls the
remote deployment module (python -m tide.deploy) which deploys
tide-trainer-remote to fund-models-remote; change this to a local-only deploy or
skip remote registration by invoking the local deployment entrypoint or passing
an explicit local target flag. Replace the uv run invocation of tide.deploy with
the local deploy variant (e.g., python -m tide.deploy_local or python -m
tide.deploy --target=local) or remove the deployment call so the 'local' branch
does not deploy tide-trainer-remote to fund-models-remote; update any related
invocation in the script where "Registering training deployment..." is echoed to
reference the local deploy path.
- Update the models AMI lookup to use the AL2023 ECS GPU AMI via the SSM parameter store instead of an AL2 AMI name filter
- Propagate `config.tags` on the ASG in addition to the `Name` tag so project, stack, and manager tags are applied at launch
- Fix the local init command to deploy `tide-trainer-local` from prefect.yaml instead of calling `tide.deploy`, which hardcodes the remote deployment
- Update maskfile image command examples from "tide runner" to "tide model-runner" to match the canonical ECR stage name

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
Comments suppressed due to low confidence (1)
infrastructure/iam.py:104
- The GitHub Actions infrastructure policy now includes `autoscaling:*` on `Resource: "*"`, which grants full Auto Scaling permissions account-wide. If the intent is only to manage the models GPU ASG/capacity provider, consider scoping actions to the specific ASG/launch template ARNs (and/or limiting to the minimal autoscaling actions Pulumi needs) to keep this policy least-privilege.
```python
"Sid": "ManageEC2ECSELBBudgetsAndServiceDiscovery",
"Effect": "Allow",
"Action": [
    "autoscaling:*",
    "ec2:*",
    "ecs:*",
    "elasticloadbalancing:*",
    "budgets:*",
    "servicediscovery:*",
],
"Resource": "*",
```
Actionable comments posted: 5
⚠️ Outside diff range comments (1)
maskfile.md (1)
Lines 844-863: ⚠️ Potential issue | 🟡 Minor

Command assumes the work pool already exists and provides no validation.

The `model deploy tide` command has been simplified to only call `tide.deploy`, which hardcodes deployment to the `fund-models-remote` work pool. Unlike `model train`, which validates and selects the Pulumi stack, this command skips infrastructure validation and assumes the work pool has already been created (via `models initialize remote`). This creates a design gap:

- Implicit dependency: the command silently fails if the work pool doesn't exist, with no error message explaining the root cause.
- Overlapping commands: both `model deploy tide` and `models initialize remote` register deployments, making it unclear which to use.
- No environment flexibility: the hardcoded remote deployment means there's no equivalent for local deployment (users must use `models initialize local` instead).

The command works correctly when used in the proper sequence, but the design doesn't communicate these constraints.

💡 Recommendations

Option 1: Add a work pool existence check to prevent silent failures:

```bash
# In model deploy, before calling tide.deploy
if ! uv run prefect work-pool inspect fund-models-remote >/dev/null 2>&1; then
  echo "Error: fund-models-remote work pool not found"
  echo "Run 'mask models initialize remote' first"
  exit 1
fi
```

Option 2: Consolidate with `models initialize remote` by documenting that `model deploy tide` is for re-deployment only and requires prior initialization.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@maskfile.md` around lines 844 - 863, The deploy command for model_name (the "model deploy tide" branch) assumes the Prefect work pool "fund-models-remote" exists and directly calls tide.deploy; add a pre-check that verifies the work pool exists (using the same uv/prefect inspection approach used elsewhere) and if missing print a clear error telling the user to run "mask models initialize remote" (or run initialization) and exit non‑zero; keep the call to tide.deploy only after the existence check so the command fails fast with a helpful message instead of silently erroring.
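The same fail-fast guard can be expressed as a small Python function with the CLI call injected, which makes the check unit-testable without Prefect installed (function and parameter names are illustrative, not repo code):

```python
import subprocess

def ensure_work_pool(pool: str, run=None) -> None:
    """Fail fast with an actionable message when a Prefect work pool is missing.

    `run` executes a command and returns its exit code; the default shells
    out to the prefect CLI, while tests can inject a fake runner.
    """
    if run is None:
        def run(cmd):
            return subprocess.run(
                cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            ).returncode
    if run(["uv", "run", "prefect", "work-pool", "inspect", pool]) != 0:
        raise SystemExit(
            f"Error: {pool} work pool not found.\n"
            "Run 'mask models initialize remote' first."
        )

# A fake runner exercises the guard without touching Prefect:
ensure_work_pool("fund-models-remote", run=lambda cmd: 0)  # pool exists: no-op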
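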
♻️ Duplicate comments (1)
infrastructure/training.py (1)
Lines 101-115: ⚠️ Potential issue | 🟡 Minor

Prevent potential duplicate `Name` tag entries in ASG tags.

Lines 102-106 already add `Name`, then lines 108-115 append all `tags.items()`. If `config.tags` contains `Name`, this produces duplicate tag keys. Filter `Name` in the list comprehension.

```bash
#!/bin/bash
# Verify whether `config.tags` may include "Name", which would duplicate ASG tag keys.
# Expected: "Name" should be absent from shared tags, or code should explicitly filter it.
set -euo pipefail

echo "== Locate config.py files =="
fd -i 'config.py'
echo

echo "== Inspect tags declarations and nearby context =="
while IFS= read -r f; do
  echo "--- $f ---"
  rg -n -C4 '^\s*tags\s*=' "$f" || true
  rg -n -C2 '"Name"' "$f" || true
done < <(fd -i 'config.py')
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@infrastructure/training.py` around lines 101 - 115, The ASG tags list currently adds a hardcoded Name tag via aws.autoscaling.GroupTagArgs and then extends with all entries from tags.items(), which can produce duplicate "Name" keys; modify the list comprehension that builds additional aws.autoscaling.GroupTagArgs (the comprehension iterating "for key, value in tags.items()") to skip entries where key == "Name" (or key.lower() == "name" if case-insensitive) so only the single intended Name tag is included.
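The filter this comment asks for is a one-line change; sketched here with plain dicts standing in for `aws.autoscaling.GroupTagArgs` so the shape is easy to see (the helper is illustrative):

```python
def build_asg_tags(name: str, shared_tags: dict) -> list:
    """Combine the explicit Name tag with shared tags, skipping any
    duplicate Name key so the ASG never carries two tags with one key."""
    tags = [{"key": "Name", "value": name, "propagate_at_launch": True}]
    tags.extend(
        {"key": key, "value": value, "propagate_at_launch": True}
        for key, value in shared_tags.items()
        if key != "Name"  # avoid duplicating the explicit Name tag
    )
    return tags

tags = build_asg_tags("fund-models", {"project": "fund", "Name": "ignored"})
assert [t["key"] for t in tags] == ["Name", "project"]
```

The same `if key != "Name"` clause drops into the existing comprehension over `tags.items()` in `infrastructure/training.py`.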
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@infrastructure/training.py`:
- Around line 56-87: The models_launch_template AWS LaunchTemplate must enforce
IMDSv2: add a metadata_options block to the aws.ec2.LaunchTemplate call (the
models_launch_template instantiation) and set http_tokens="required" (optionally
also set http_endpoint="enabled" and a conservative http_put_response_hop_limit
like 1); update the metadata_options using the appropriate LaunchTemplate
metadata args so instances launched from models_launch_template require IMDSv2.
In `@libraries/python/tests/test_infrastructure_training.py`:
- Around line 11-43: Add targeted unit tests in
libraries/python/tests/test_infrastructure_training.py that exercise
missing/alternate branches of load_infrastructure_training: create tests that
(1) simulate configuration where models_cluster or models_capacity_provider is
absent and assert the template omits '"models_cluster"' and
'"models_capacity_provider"', (2) simulate a non-GPU trainer to assert
instance_type is not "g4dn.xlarge" and that "amazon-linux-2023/gpu/recommended"
is absent, (3) simulate ASG with min_size>0 and desired_capacity>0 to assert
"min_size=0" and "desired_capacity=0" are absent, and (4) simulate an
environment without an instance profile to assert "models_instance_profile" and
"AmazonEC2ContainerServiceforEC2Role" are omitted; ensure each test calls
load_infrastructure_training and uses clear asserts to drive coverage above 90%
for this module.
- Around line 7-43: The tests currently assert raw substrings from
load_infrastructure_training() which is brittle; instead parse the generated
Python/templated infrastructure into an AST or load it as a resource object and
assert concrete structures and arguments. Update the tests (e.g.,
test_models_cluster_uses_ec2_backed_capacity_provider,
test_trainer_uses_gpu_instance_type, test_trainer_asg_scales_to_zero,
test_models_instance_profile_configured,
test_trainer_gpu_support_enabled_in_user_data) to call
load_infrastructure_training(), parse the content into an AST or deserialize to
a resource/model representation, then assert the presence and properties of the
relevant nodes/objects (e.g., check the Cluster creation has
models_capacity_provider and models_cluster_capacity_providers, the Trainer/ASG
resource has instance_type "g4dn.xlarge" and amazon-linux-2023/gpu/recommended
image, min_size/desired_capacity equal to 0, instance profile includes
AmazonEC2ContainerServiceforEC2Role, and user data contains
ECS_ENABLE_GPU_SUPPORT) rather than using substring membership.
In `@maskfile.md`:
- Around line 443-450: In the local) branch ensure PREFECT_API_URL is set to the
local Prefect server before creating the pool and deploying: set PREFECT_API_URL
to "http://localhost:4200/api" in that branch so the uv run --package tools
prefect work-pool create "fund-models-local" and uv run prefect --no-prompt
deploy --name tide-trainer-local commands target the local server (either export
the variable at the start of the local) case or prefix those uv run commands
with the PREFECT_API_URL assignment).
- Line 196: The early-exit echo "No ECS service for tide model runner" is
ambiguous; update the message where the script handles the tide-model-runner
case (the branch that runs echo "No ECS service for tide model runner" && exit
0) to explicitly state that tide-model-runner is intentionally not an ECS
long-running service and that the script is exiting intentionally (e.g.,
"tide-model-runner is used only for Prefect training; no ECS service expected —
exiting 0"). Keep the same exit code (0) and ensure the change is made in the
branch referencing "tide-model-runner".
---
Outside diff comments:
In `@maskfile.md`:
- Around line 844-863: The deploy command for model_name (the "model deploy
tide" branch) assumes the Prefect work pool "fund-models-remote" exists and
directly calls tide.deploy; add a pre-check that verifies the work pool exists
(using the same uv/prefect inspection approach used elsewhere) and if missing
print a clear error telling the user to run "mask models initialize remote" (or
run initialization) and exit non‑zero; keep the call to tide.deploy only after
the existence check so the command fails fast with a helpful message instead of
silently erroring.
---
Duplicate comments:
In `@infrastructure/training.py`:
- Around line 101-115: The ASG tags list currently adds a hardcoded Name tag via
aws.autoscaling.GroupTagArgs and then extends with all entries from
tags.items(), which can produce duplicate "Name" keys; modify the list
comprehension that builds additional aws.autoscaling.GroupTagArgs (the
comprehension iterating "for key, value in tags.items()") to skip entries where
key == "Name" (or key.lower() == "name" if case-insensitive) so only the single
intended Name tag is included.
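The dedup described above can be sketched as follows. `GroupTagArgs` here is a plain dataclass standing in for `aws.autoscaling.GroupTagArgs`, and `build_asg_tags` is a hypothetical helper, so the example stays self-contained:

```python
from dataclasses import dataclass

@dataclass
class GroupTagArgs:
    # Stand-in for aws.autoscaling.GroupTagArgs.
    key: str
    value: str
    propagate_at_launch: bool = True

def build_asg_tags(name: str, tags: dict[str, str]) -> list[GroupTagArgs]:
    """Hardcode a single Name tag, then append the shared tags while
    skipping any 'Name' entry (case-insensitive) so the key is never
    duplicated in the ASG tag list."""
    return [GroupTagArgs(key="Name", value=name)] + [
        GroupTagArgs(key=key, value=value)
        for key, value in tags.items()
        if key.lower() != "name"
    ]

result = build_asg_tags("models-trainer", {"Name": "stale", "Team": "fund"})
assert [t.key for t in result] == ["Name", "Team"]
assert result[0].value == "models-trainer"
```

The filter guarantees the explicit `Name` tag always wins over any `Name` entry that leaks in through the shared `tags` dict.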
📒 Files selected for processing (3)
`infrastructure/training.py`, `libraries/python/tests/test_infrastructure_training.py`, `maskfile.md`
- Add IMDSv2 enforcement to models launch template (`http_tokens=required`)
- Set `PREFECT_API_URL=http://localhost:4200/api` in maskfile local init branch so work pool and deployment target the local Prefect server
- Fix devenv `initialize-local-trainer` to deploy `tide-trainer-local` via `prefect deploy` instead of calling `tide.deploy`, which registers the remote deployment
- Clarify `mask infrastructure models initialize local` echo message
- Update `mask model deploy` description to say Prefect Cloud, matching the actual behavior of unsetting `PREFECT_API_URL`
- Clarify `tide-model-runner` early-exit message in `mask infrastructure image deploy` to explain it is a training image, not an ECS service

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (1)
devenv.nix (1)
357-360: ⚠️ Potential issue | 🟠 Major: Local worker bootstrap still invokes remote deployment registration.
Line 360 still runs `tide.deploy`, which registers `tide-trainer-remote` to `fund-models-remote`; that conflicts with the local pool created on Line 358 and is silently ignored due to `2>/dev/null || true`.

Suggested fix:

```diff
  PREFECT_API_URL="http://localhost:4200/api" \
- uv run --package tide python -m tide.deploy 2>/dev/null || true
+ uv run prefect --no-prompt deploy --name tide-trainer-local 2>/dev/null || true
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@devenv.nix`:
- Around line 357-360: The bootstrap is still invoking the remote registration
via the command that runs `python -m tide.deploy`, which registers
`tide-trainer-remote` to `fund-models-remote` and conflicts with the local pool
created by the `uv ... prefect work-pool create "fund-models-local"` command;
remove or guard the `uv run --package tide python -m tide.deploy 2>/dev/null ||
true` invocation so local bootstraps do not perform remote deployment
registration (e.g., delete the line or wrap it in a conditional that only runs
for remote envs), and also remove the noisy suppression (`2>/dev/null || true`)
so failures are visible when intended.
📒 Files selected for processing (3)
`devenv.nix`, `infrastructure/training.py`, `maskfile.md`
Overview
Changes
- Rename `fund/tide-runner` to `fund/tide-model-runner` in infrastructure code (`infrastructure/storage.py`, `infrastructure/__main__.py`)
- Build the `tide` application with the `model-runner` stage instead of the old `model-trainer` entries; remove duplicate model-trainer deploy job matrix entries
- Rename the stage in `models/tide/Dockerfile` from `runner` to `model-runner` to match the new naming
- Delete `models/Dockerfile`, which contained a dead `server-worker` stage no longer referenced anywhere
- Update the `maskfile.md` ECS service lookup case to `tide-model-runner`
- Update tests to use the `tide_model_runner_repository_lifecycle` resource name

Context
The `tide-runner` naming was inconsistent with the broader `model-runner` convention being adopted. The old `models/Dockerfile` (server-worker stage) was unreferenced dead code left over from a prior architecture and is no longer needed.

Summary by CodeRabbit
New Features
Chores
Tests