
Rebuild TiDE model data/model training pipeline #821

Merged
forstmeier merged 11 commits into master from model-data-rebuild
Apr 14, 2026

Conversation

@forstmeier
Collaborator

@forstmeier forstmeier commented Apr 13, 2026

Overview

Changes

  • Renamed tide_data.py → data.py and tide_model.py → model.py for consistency
  • Rebuilt Data preprocessing pipeline with named stages (ValidateColumns, EngineerFeatures, CleanData, ScaleAndEncode) and a Pipeline orchestrator
  • Fixed CleanData to collect stats on the original dataframe before filtering, and use pl.all_horizontal() for a single-pass filter; corrected is_infinite() vs ~is_finite() distinction
  • Fixed filter_equity_bars in tasks.py to exclude all lowercase-suffix tickers (warrants, rights, preferred shares) using str.contains("[a-z]") instead of only filtering p
  • Fixed consolidate_data in tasks.py to filter rows with null sector or industry after the join, resolving pandera SchemaError on non-nullable columns
  • Added explicit GPU device logging (Device.DEFAULT) before training loop in model.py
  • Pre-load full dataset as GPU tensors once before the epoch loop to eliminate per-step host-to-device transfers; fixed batch indexing with .tolist() for tinygrad compatibility
  • Updated trainer.py default configuration: early_stopping_patience 10 → 3, batch_size 256 → 512, learning_rate 0.0005 → 0.001
  • Removed run.py and test_run.py; replaced with deploy.py and trainer.py
  • Added tasks.py with S3 read/write and data consolidation functions, fully tested
  • Added tools/src/tools/build_work_pool_template.py for ECS work pool configuration generation
  • Updated maskfile.md model commands and training infrastructure section
  • Added infrastructure/training.py with ECS GPU cluster and TiDE trainer task definition

Context

This branch rebuilds the TiDE model training pipeline after migrating the model code into models/tide/. Key fixes address pandera schema validation failures observed in production (lowercase ticker suffixes and null sector/industry values post-join). Training loop performance is improved by pre-loading dataset tensors onto the GPU once per training run rather than per batch step. Early stopping and batch size defaults are tuned to reduce wall-clock training time on the g4dn.xlarge ECS instance.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added monitoring capabilities via Prometheus client integration
    • Introduced ECS-based distributed training infrastructure with GPU support
  • Improvements

    • Consolidated Docker image build and push into single operation
    • Refactored training system to use dataset-based processing for improved efficiency
    • Enhanced data validation with improved ticker filtering and schema enforcement
    • Extended model training timeout for complex workloads
    • Updated Prefect deployment scheduling configuration
  • Bug Fixes

    • Improved data cleaning logic for handling invalid values across multiple columns
    • Added nullable field validation in data consolidation
  • Documentation

    • Updated documentation to reference current source file paths

Copilot AI review requested due to automatic review settings April 13, 2026 19:08
@forstmeier forstmeier added the python (Python code updates) label Apr 13, 2026
@github-project-automation github-project-automation Bot moved this to In Progress in Overview Apr 13, 2026
@github-actions github-actions Bot added the rust (Rust code updates), markdown (Markdown code updates), and yaml (YAML code updates) labels Apr 13, 2026
@github-actions github-actions Bot requested a review from chrisaddy April 13, 2026 19:08
@coderabbitai
Contributor

coderabbitai Bot commented Apr 13, 2026

Warning

Rate limit exceeded

@forstmeier has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 38 minutes and 47 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 38 minutes and 47 seconds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e268805d-9903-497b-afea-755dc3c9e18f

📥 Commits

Reviewing files that changed from the base of the PR and between 9c1e8b8 and 7df676e.

📒 Files selected for processing (10)
  • applications/data_manager/src/storage.rs
  • applications/data_manager/tests/test_storage.rs
  • applications/ensemble_manager/src/ensemble_manager/server.py
  • maskfile.md
  • models/tide/src/tide/model.py
  • models/tide/src/tide/tasks.py
  • models/tide/src/tide/trainer.py
  • models/tide/tests/test_tasks.py
  • tools/src/tools/build_work_pool_template.py
  • tools/tests/test_build_work_pool_template.py
📝 Walkthrough

Walkthrough

This PR refactors the Tide model training pipeline by reorganizing module structure (tide_model/tide_data → model/data), replacing batch-list-based training with a dataset-centric TrainingDataset API, removing the is_holiday feature, adding ECS task definitions for remote training, consolidating CI image build/push steps, and updating Prefect deployment to cloud-based run invocation with work-pool provisioning via infrastructure outputs.

Changes

Cohort / File(s) — Summary

  • Documentation & Skills (.claude/skills/autotrain/SKILL.md): Updated file references from tide_model.py/tide_data.py to models/tide/src/tide/model.py and models/tide/src/tide/data.py.
  • CI Workflow & Infrastructure Deployment (.github/workflows/launch_infrastructure.yaml, maskfile.md): Consolidated image build and push into a single step (build-and-push); changed deployment from an image-specific command to a service update command; updated the Prefect trainer section to provision the ECS work pool via Pulumi outputs and trigger Prefect Cloud deployment runs.
  • Storage & Validation (applications/data_manager/src/storage.rs, applications/data_manager/tests/test_storage.rs): Increased the DuckDB config value length limit from 512 to 4096 characters; updated validation tests accordingly.
  • Dependency Management (applications/ensemble_manager/pyproject.toml, applications/portfolio_manager/pyproject.toml): Added prometheus-client>=0.21.0 to both projects.
  • Ensemble Manager (applications/ensemble_manager/src/ensemble_manager/server.py): Updated tide imports to the new paths (tide.data/tide.model); refactored the prediction flow to use dataset-based inference with single-batch construction instead of per-batch iteration; updated empty-input handling and metrics reporting.
  • Tide Core Model API (models/tide/src/tide/model.py): Major refactor: replaced batch-list methods with a TrainingDataset-based API; changed the training loop from batch iteration to shuffled sample indices; updated validation from validate(batches) to validate_model(dataset); adjusted the default output_length from 7 to 5; added device logging and early-stopping logic with an optional validation dataset.
  • Tide Data Pipeline (models/tide/src/tide/data.py): Replaced get_batches() with get_dataset() returning a TrainingDataset; removed the is_holiday feature and the ExpandDateRange/FillNulls stages; consolidated continuous-column cleaning into a single pl.all_horizontal() filter; updated empty-dataset behavior to return correctly-shaped zero arrays; gated the Tensor import behind TYPE_CHECKING.
  • Tide Training & Tasks (models/tide/src/tide/trainer.py, models/tide/src/tide/tasks.py): Updated the trainer to use get_dataset() and create a separate validation dataset; added schema validation with equity-bars column-type enforcement; added lowercase ticker filtering and null sector/industry filtering; adjusted hyperparameter defaults and batch-size handling.
  • Tide Deployment & Workflow (models/tide/src/tide/deploy.py, models/tide/src/tide/workflow.py): Updated deployment to accept an image argument and use Schedule(cron=..., timezone=...) instead of separate fields; changed tags and added job variables; extended the workflow task timeout from 3600 to 14400 seconds; added FUND_TIDE_IMAGE_URI environment variable validation.
  • Tide Removed Module (models/tide/src/tide/run.py): Deleted the entire module, including the run_training_job function and its CLI entrypoint.
  • Docker & Infrastructure (models/tide/Dockerfile, infrastructure/__main__.py, infrastructure/training.py): Added the venv bin directory to PATH in the Dockerfile; exported new ECS networking/task definition outputs from infrastructure; added the tide_trainer_task_definition ECS resource with GPU support and CloudWatch logging.
  • Work Pool Template Building (tools/src/tools/build_work_pool_template.py, tools/tests/test_build_work_pool_template.py): Added a new module to mutate the Prefect ECS work-pool base job template with cluster/network/task-definition configuration and GPU/logging resource requirements.
  • Prefect Configuration (prefect.yaml): Removed the tide-trainer-remote deployment configuration.
  • Test Suite Updates (models/tide/tests/test_data.py, models/tide/tests/test_deploy.py, models/tide/tests/test_model.py, models/tide/tests/test_tasks.py): Updated tests to use TrainingDataset and the new get_dataset() API; adjusted schema assertions; verified Prefect schedule object configuration; expanded equity-bar filtering and schema validation coverage.
  • Test Removed Module (models/tide/tests/test_run.py): Deleted the entire test suite for run_training_job.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues

  • Improve model training pipeline and data handling #691: The refactors to models/tide's data/model/trainer modules—removing preprocessing stages, switching from batch lists to TrainingDataset, and fixing validation logic—directly address the same code-level concerns in this issue.

Possibly related PRs

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 19.51%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main change: rebuilding the TiDE model data and model training pipeline, encompassing file renames, pipeline restructuring, training improvements, and infrastructure updates.


Contributor

Copilot AI left a comment


Pull request overview

Rebuilds the TiDE model data + training pipeline (now under models/tide/) and updates the surrounding Prefect/ECS deployment tooling, while addressing production schema-validation issues and improving GPU training performance.

Changes:

  • Refactors TiDE preprocessing into a staged Pipeline and introduces a TrainingDataset array-based interface to feed training/prediction.
  • Updates TiDE training loop to pre-load tensors onto the compute device once per run, adds validation-dataset support, and tunes trainer defaults.
  • Adds/updates infrastructure + tooling for ECS GPU training (task definition, Prefect work pool template generator, Mask/GitHub Actions updates) and fixes S3 consolidation edge cases.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.

Show a summary per file
  • uv.lock: Adds prometheus-client to locked deps.
  • tools/tests/test_build_work_pool_template.py: Adds test coverage for ECS work pool template generation.
  • tools/src/tools/build_work_pool_template.py: New utility to generate/patch a Prefect ECS GPU work pool base job template.
  • prefect.yaml: Removes the remote TiDE deployment entry (the local deployment remains).
  • models/tide/tests/test_tasks.py: Expands tests for S3 reads, ticker filtering, and consolidation null-handling.
  • models/tide/tests/test_run.py: Removes tests for the deleted tide.run.
  • models/tide/tests/test_model.py: Updates tests for renamed modules and the dataset-based training/validation APIs.
  • models/tide/tests/test_deploy.py: Updates deploy tests to use Schedule plus image/build options.
  • models/tide/tests/test_data.py: Updates tests for the staged pipeline, get_dataset output, and new dimensions.
  • models/tide/src/tide/workflow.py: Increases the model training task timeout to 4 hours.
  • models/tide/src/tide/trainer.py: Switches to dataset-based training with new defaults (batch size, LR, patience, output length).
  • models/tide/src/tide/tasks.py: Adds schema validation; improves filtering/join consolidation logic and column normalization.
  • models/tide/src/tide/run.py: Removes the old CLI entrypoint wrapper.
  • models/tide/src/tide/model.py: Migrates to dataset-based batching; adds device logging, GPU preloading, and validation-dataset early stopping.
  • models/tide/src/tide/deploy.py: Updates deployment registration to supply image, schedule, and job variables, and enforces the image env var.
  • models/tide/src/tide/data.py: Implements the staged preprocessing pipeline and TrainingDataset; replaces batch-list generation.
  • models/tide/Dockerfile: Adds the venv bin directory to PATH in the runtime image.
  • maskfile.md: Updates the image build workflow (build-and-push), adds the Prefect work pool template flow, and updates trainer commands.
  • infrastructure/training.py: Adds the ECS GPU task definition for the TiDE trainer.
  • infrastructure/__main__.py: Exports ECS network IDs and the TiDE trainer task definition ARN for tooling.
  • applications/portfolio_manager/pyproject.toml: Adds the prometheus-client dependency (metrics support).
  • applications/ensemble_manager/src/ensemble_manager/server.py: Updates TiDE usage to the dataset-based prediction path and single-call predict flow.
  • applications/ensemble_manager/pyproject.toml: Adds the prometheus-client dependency (metrics support).
  • applications/data_manager/tests/test_storage.rs: Updates tests for the increased DuckDB config length limit.
  • applications/data_manager/src/storage.rs: Raises the DuckDB config value max length to 4096 (ECS session token support).
  • .github/workflows/launch_infrastructure.yaml: Switches CI to build-and-push and uses a service update for deployments.
  • .claude/skills/autotrain/SKILL.md: Updates documentation references to the renamed TiDE files.
Comments suppressed due to low confidence (2)

models/tide/src/tide/model.py:321

  • Model.train computes total_batches = (num_samples + batch_size - 1) // batch_size but does not validate batch_size > 0. A zero/negative batch size would raise a ZeroDivisionError (or behave unexpectedly). Add an explicit check for a positive batch size (and mirror it in validate_model).

models/tide/src/tide/model.py:396

  • Calling Tensor.realize(*get_parameters(self)) on every optimization step forces parameter realization/synchronization each batch and can significantly reduce the benefit of the new GPU preloading (and slow training). If this is not strictly required for correctness with tinygrad, prefer removing it or realizing less frequently (e.g., per log interval / epoch).
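
The batch-count arithmetic flagged in the first comment, plus the suggested guard, can be sketched in plain Python (names are illustrative, not the repo's actual code):

```python
import random

def iter_batches(num_samples: int, batch_size: int):
    # Explicit guard against the ZeroDivisionError noted in review
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(range(num_samples))
    random.shuffle(indices)  # fresh sample order each epoch
    # Ceiling division: the last batch may be smaller than batch_size
    total_batches = (num_samples + batch_size - 1) // batch_size
    for b in range(total_batches):
        yield indices[b * batch_size : (b + 1) * batch_size]

batches = list(iter_batches(10, 4))  # 3 batches of sizes 4, 4, 2
```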


Comment thread models/tide/src/tide/tasks.py Outdated
Comment thread applications/ensemble_manager/src/ensemble_manager/server.py Outdated
Comment thread tools/src/tools/build_work_pool_template.py
@greptile-apps
Contributor

greptile-apps Bot commented Apr 13, 2026

Greptile Summary

This PR rebuilds the TiDE model training pipeline: renames source files, introduces named pipeline stages and a TrainingDataset dataclass, pre-loads the full dataset as GPU tensors once per run to eliminate per-step host-to-device transfers, and fixes two production pandera failures (lowercase-suffix ticker filtering and null sector/industry rows post-join).

Confidence Score: 5/5

Safe to merge; both remaining findings are P2 defensive-coding suggestions with no impact on current production behavior.

Production bug fixes (ticker filtering, null sector/industry) are correct and well-tested. The GPU pre-loading pattern is sound. Both inline comments are P2 style/hardening suggestions that do not block merge.

models/tide/src/tide/model.py (Tensor.training guard), models/tide/src/tide/trainer.py (error string match)

Important Files Changed

  • models/tide/src/tide/model.py: Rebuilt training loop: GPU dataset pre-loading, dataset-based API replacing batch lists, added validation-dataset early-stopping support; Tensor.training state may not be restored if GPU pre-loading raises.
  • models/tide/src/tide/data.py: Named pipeline stages (ValidateColumns, EngineerFeatures, CleanData, ScaleAndEncode), new TrainingDataset dataclass; CleanData correctly uses is_infinite() for stats and pl.all_horizontal() for single-pass filtering.
  • models/tide/src/tide/tasks.py: Fixed ticker filtering to exclude all lowercase-suffix instruments via the regex [a-z]; consolidate_data now filters null sector/industry rows post-join; added per-file column type normalization on S3 reads.
  • models/tide/src/tide/trainer.py: Updated hyperparameters (batch_size 512, patience 3, epochs 20); creates separate train/validation datasets; brittle error-message string match for the validation-set fallback.
  • infrastructure/training.py: Adds the tide_trainer_task_definition ECS task with a GPU resource requirement, awsvpc networking, and CloudWatch log configuration.
  • applications/ensemble_manager/src/ensemble_manager/server.py: Migrated from a per-batch prediction loop to single-batch dataset-based prediction; updated imports from the renamed data/model modules.
  • tools/src/tools/build_work_pool_template.py: New utility to populate the Prefect ECS work pool template with GPU resource requirements, network config, and CloudWatch logging for the TiDE trainer.
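
A template mutation of this kind might look like the following sketch; the key paths follow the ECS task-definition shape, and every name here is an assumption rather than the repo's actual code:

```python
import os

def add_gpu_and_logging(template: dict, log_group: str) -> dict:
    # Patch the first container definition with a GPU requirement and
    # a CloudWatch awslogs configuration
    container = template["task_definition"]["containerDefinitions"][0]
    container["resourceRequirements"] = [{"type": "GPU", "value": "1"}]
    container["logConfiguration"] = {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": os.environ.get("AWS_REGION", "us-east-1"),
        },
    }
    return template

base = {"task_definition": {"containerDefinitions": [{"name": "tide-trainer"}]}}
patched = add_gpu_and_logging(base, "/ecs/tide-trainer")
```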
Prompt To Fix All With AI
This is a comment left during a code review.
Path: models/tide/src/tide/model.py
Line: 337-344

Comment:
**`Tensor.training` not restored if GPU pre-loading raises**

`Tensor.training = True` is set on line 315 before the `try/finally` block. If any of the five `Tensor(...)` calls on lines 338–342 raises (e.g., OOM), the `finally` clause never runs, leaving `Tensor.training = True` permanently. Any subsequent inference call on the same model instance would then run with dropout active.

Move the tensor creation inside the `try` block, or expand the `try` to start at line 314:

```suggestion
        logger.info("Training device", device=Device.DEFAULT)

        try:
            # Pre-load dataset onto compute device once to eliminate per-step transfers
            gpu_past_continuous = Tensor(dataset.past_continuous)
            gpu_past_categorical = Tensor(dataset.past_categorical)
            gpu_future_categorical = Tensor(dataset.future_categorical)
            gpu_static_categorical = Tensor(dataset.static_categorical)
            gpu_targets = Tensor(dataset.targets)

```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: models/tide/src/tide/trainer.py
Line: 69-71

Comment:
**Fragile error-message string match**

`if "Total days available" not in str(e)` couples this catch to the exact wording of the error message in `data.py`. If that string changes, the guard silently swallows unrelated `ValueError`s instead of re-raising them, hiding real failures.

Consider raising a distinct exception subclass in `get_dataset` (e.g., `InsufficientDataError`) so the catch here can be type-based rather than text-based.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "Address PR #821 review feedback" | Re-trigger Greptile

Comment thread applications/ensemble_manager/src/ensemble_manager/server.py
@coveralls
Collaborator

coveralls commented Apr 13, 2026

Coverage Status

coverage: 78.714% (+0.6%) from 78.162% — model-data-rebuild into master

Contributor

@coderabbitai coderabbitai Bot left a comment


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
models/tide/src/tide/data.py (1)

411-413: ⚠️ Potential issue | 🟠 Major

predict is reusing historical rows as future covariates.

For data_type="predict", the last output_length historical rows become future_categorical, so features like day_of_month, month, and year describe the past rather than the forecast horizon. That also makes prediction require input_length + output_length history instead of just the encoder window. Build the decoder categorical features from dates after the latest timestamp, then pair them with only the last input_length historical rows.

Also applies to: 472-483

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@models/tide/src/tide/data.py` around lines 411 - 413, For data_type ==
"predict" (the branch setting self.batch_data via _get_prediction_data), stop
using the last output_length historical rows as future_categorical; instead
generate decoder categorical/temporal features from timestamps after the most
recent timestamp (i.e., the actual forecast horizon) and pair those decoder
features with only the last input_length encoder rows (not input_length +
output_length history). Update _get_prediction_data (and the analogous logic at
the other block around the 472-483 region) to build future_categorical/decoder
features from generated future dates, and ensure batch_data uses the encoder
window of length input_length plus decoder features for output_length.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@applications/data_manager/src/storage.rs`:
- Around line 133-134: The magic number 4096 used in storage.rs should be
extracted into a shared constant (e.g., DUCKDB_CONFIG_MAX_LEN: usize = 4096) so
the length limit is a single cross-file contract; add the constant in a common
place (top of storage.rs or a shared constants module) with appropriate
visibility (pub or pub(crate)) and replace the literal usage in the value.len()
> 4096 check and any tests that assert against 4096 to reference
DUCKDB_CONFIG_MAX_LEN instead; update imports/usages accordingly so all
production code and tests use the constant.

In `@applications/ensemble_manager/src/ensemble_manager/server.py`:
- Line 421: The metric variable prediction_batch_count is being set to
len(dataset) (sample count) which mismatches its name; update the metric to
either be renamed to prediction_sample_count everywhere it's defined/used
(search for prediction_batch_count) or adjust its help/description to indicate
it tracks sample count, and update any telemetry/metrics registration and usages
(e.g., in server.py where prediction_batch_count.set(len(dataset)) is called) to
use the new name/description so the metric name and value semantics match.

In `@maskfile.md`:
- Around line 791-809: The script currently calls prefect directly in the line
prefect deployment run "${deployment}" --param "lookback_days=${lookback_days}",
which fails when Prefect is only available via the uv environment; change that
invocation to run Prefect through uv (e.g., use uv run prefect deployment run
...), preserving the deployment and lookback_days variables (deployment,
lookback_days, model_name) so the exact call becomes an uv-run of prefect with
the same arguments and params; ensure quotes and variable expansions remain
intact and test that mask model train tide triggers the deployment via uv.

In `@models/tide/src/tide/model.py`:
- Around line 420-423: The early-stopping branch uses
validate_model(validation_dataset) and lets a NaN stopping_loss consume
patience, so change the logic in the block that sets stopping_loss (and the
identical block later) to treat an empty validation_dataset the same as None:
first check if validation_dataset is None or empty (e.g.,
len(validation_dataset)==0 or validation_dataset.is_empty()), and if so set
stopping_loss = epoch_loss; otherwise call validate_model(...), then if the
returned stopping_loss is NaN (use math.isnan or np.isnan) fall back to
stopping_loss = epoch_loss before comparing against best_loss and
early_stopping_min_delta; apply this change for both occurrences around the
validate_model calls.
- Around line 420-423: The code uses validate_model to compute stopping_loss for
early stopping but still selects/checkpoints on epoch_loss, causing restored
weights to mismatch; update the checkpoint selection and saving logic so that
when validation_dataset is provided you use stopping_loss (the value returned by
validate_model) as the metric for saving best checkpoints and for restoring
weights, i.e., set a unified checkpoint_metric variable to stopping_loss (else
epoch_loss), use that metric in the checkpoint saving/compare code (the same
branch that currently references epoch_loss), and update the
best-loss/best-epoch tracking and restore logic to reference that unified metric
so early stopping and checkpoint restore use the same criterion.
- Around line 530-535: validate_model is calling quantile_loss without the
huber_delta used during training, so validation uses a different objective;
update the quantile_loss call in validate_model (where predictions and
targets_reshaped are computed) to pass huber_delta=self.huber_delta (same
argument used in train()), ensuring the validation loss matches the training
loss function.

In `@models/tide/src/tide/trainer.py`:
- Around line 61-73: In the validation dataset creation in trainer.py, narrow
the except block around tide_data.get_dataset to only swallow the specific "not
enough data for validation windows" error: catch the ValueError as e, inspect e
(e.g., check for a distinctive substring like "not enough data" or "too small
for windowing") and only set validation_dataset = None and log the warning in
that case; for any other ValueError re-raise it so invalid split/configuration
errors from tide_data.get_dataset surface instead. Use the existing symbols
tide_data.get_dataset, validation_dataset, and the surrounding logger.warning
call to locate where to implement this check.

In `@models/tide/tests/test_tasks.py`:
- Around line 261-302: The test currently lets DLNGpB be dropped by the join
because _SAMPLE_CATEGORIES lacks a category for it, so update the test in
test_prepare_training_data_succeeds_when_raw_data_contains_preferred_tickers to
add a category row for "DLNGpB" to _SAMPLE_CATEGORIES (so both tickers are
present through the join), then capture and inspect the parquet uploaded by
prepare_training_data (mock_s3_client.put_object or the object returned from
model_artifacts_bucket upload) to assert that the resulting parquet contains
only the preferred ticker(s) you expect (e.g., only "AAPL" if filter_equity_bars
should remove DLNGpB); reference prepare_training_data and the mocked
boto3.client (mock_s3_client.get_object/put_object) to locate where to read the
uploaded bytes and validate the parquet contents.

In `@tools/tests/test_build_work_pool_template.py`:
- Around line 91-105: Add an assertion to verify that the AWS region is
propagated into the container log config: in the test function
test_build_work_pool_template_configures_gpu_and_logging (or a new test like
test_build_work_pool_template_uses_aws_region_env_var), set or monkeypatch the
AWS_REGION env var and assert that
result["job_configuration"]["task_definition"]["containerDefinitions"][0]["logConfiguration"]["options"]["awslogs-region"]
equals the expected region (e.g., "eu-west-1"); this ensures
build_work_pool_template reads the environment variable and populates
awslogs-region in the logConfiguration.
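
The NaN fallback described in the model.py comments above might be sketched as (names illustrative):

```python
import math

def select_stopping_loss(epoch_loss: float, validation_loss):
    # Treat a missing or NaN validation loss as "fall back to training loss"
    # so a bad validation pass cannot silently consume early-stopping patience
    if validation_loss is None or math.isnan(validation_loss):
        return epoch_loss
    return validation_loss
```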

---

Outside diff comments:
In `@models/tide/src/tide/data.py`:
- Around line 411-413: For data_type == "predict" (the branch setting
self.batch_data via _get_prediction_data), stop using the last output_length
historical rows as future_categorical; instead generate decoder
categorical/temporal features from timestamps after the most recent timestamp
(i.e., the actual forecast horizon) and pair those decoder features with only
the last input_length encoder rows (not input_length + output_length history).
Update _get_prediction_data (and the analogous logic at the other block around
the 472-483 region) to build future_categorical/decoder features from generated
future dates, and ensure batch_data uses the encoder window of length
input_length plus decoder features for output_length.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9ac4c5ef-efc1-4146-8a4c-6d7c60b2a66f

📥 Commits

Reviewing files that changed from the base of the PR and between 8f39795 and 9c1e8b8.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (26)
  • .claude/skills/autotrain/SKILL.md
  • .github/workflows/launch_infrastructure.yaml
  • applications/data_manager/src/storage.rs
  • applications/data_manager/tests/test_storage.rs
  • applications/ensemble_manager/pyproject.toml
  • applications/ensemble_manager/src/ensemble_manager/server.py
  • applications/portfolio_manager/pyproject.toml
  • infrastructure/__main__.py
  • infrastructure/training.py
  • maskfile.md
  • models/tide/Dockerfile
  • models/tide/src/tide/data.py
  • models/tide/src/tide/deploy.py
  • models/tide/src/tide/model.py
  • models/tide/src/tide/run.py
  • models/tide/src/tide/tasks.py
  • models/tide/src/tide/trainer.py
  • models/tide/src/tide/workflow.py
  • models/tide/tests/test_data.py
  • models/tide/tests/test_deploy.py
  • models/tide/tests/test_model.py
  • models/tide/tests/test_run.py
  • models/tide/tests/test_tasks.py
  • prefect.yaml
  • tools/src/tools/build_work_pool_template.py
  • tools/tests/test_build_work_pool_template.py
💤 Files with no reviewable changes (3)
  • prefect.yaml
  • models/tide/tests/test_run.py
  • models/tide/src/tide/run.py

Comment thread applications/data_manager/src/storage.rs Outdated
Comment thread applications/ensemble_manager/src/ensemble_manager/server.py Outdated
Comment thread maskfile.md Outdated
Comment thread models/tide/src/tide/model.py
Comment thread models/tide/src/tide/model.py
Comment thread models/tide/src/tide/trainer.py
Comment thread models/tide/tests/test_tasks.py
Comment thread tools/tests/test_build_work_pool_template.py
forstmeier and others added 2 commits April 13, 2026 21:38
- server.py: set prediction_batch_count to 1 (all predictions run in a
  single in-memory batch after refactor; previously used len(dataset)
  which counted symbols, not batches)

- tasks.py: guard sector/industry filter with column presence check so
  consolidate_data does not fail when those columns are absent from the
  consolidated frame

- build_work_pool_template.py: fix CLI usage string to reflect actual
  invocation via `python -m tools.build_work_pool_template`

- storage.rs: extract magic 4096 to DUCKDB_CONFIG_VALUE_MAX_LENGTH const;
  update test to use the constant

- maskfile.md: prefix prefect deployment run with `uv run` so Prefect
  executes inside the uv environment on a clean workspace

- model.py: add math import; fall back stopping_loss to epoch_loss when
  validate_model returns NaN; use stopping_loss (not epoch_loss) as
  checkpoint metric when validation is available to avoid saving overfit
  weights; pass huber_delta to quantile_loss in validate_model to match
  training loop

- trainer.py: narrow ValueError catch to only swallow "Total days
  available" errors; re-raise config/programming errors immediately

- test_tasks.py: add DLNGpB category row so filter_equity_bars test
  exercises the actual filter path; assert only AAPL appears in the
  uploaded parquet output from prepare_training_data

- test_build_work_pool_template.py: assert awslogs-region in GPU/logging
  test to match what build_work_pool_template writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@forstmeier forstmeier merged commit 6259b0a into master Apr 14, 2026
14 checks passed
@forstmeier forstmeier deleted the model-data-rebuild branch April 14, 2026 01:53
@github-project-automation github-project-automation Bot moved this from In Progress to Done in Overview Apr 14, 2026

Labels

markdown (Markdown code updates), python (Python code updates), rust (Rust code updates), yaml (YAML code updates)

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

3 participants