Rebuild TiDE model data/model training pipeline#821
Conversation
⚠️ Warning: Rate limit exceeded

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 38 minutes and 47 seconds.

⌛ How to resolve this issue? After the wait time has elapsed, a review can be triggered. We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work? CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source, and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID:
📒 Files selected for processing (10)
📝 Walkthrough

This PR refactors the TiDE model training pipeline by reorganizing module structure (tide_model/tide_data → model/data) and replacing batch-list-based training with a dataset-centric `TrainingDataset` interface.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes

Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Pull request overview
Rebuilds the TiDE model data + training pipeline (now under models/tide/) and updates the surrounding Prefect/ECS deployment tooling, while addressing production schema-validation issues and improving GPU training performance.
Changes:
- Refactors TiDE preprocessing into a staged `Pipeline` and introduces a `TrainingDataset` array-based interface to feed training/prediction.
- Updates TiDE training loop to pre-load tensors onto the compute device once per run, adds validation-dataset support, and tunes trainer defaults.
- Adds/updates infrastructure + tooling for ECS GPU training (task definition, Prefect work pool template generator, Mask/GitHub Actions updates) and fixes S3 consolidation edge cases.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| `uv.lock` | Adds prometheus-client to locked deps. |
| `tools/tests/test_build_work_pool_template.py` | Adds test coverage for ECS work pool template generation. |
| `tools/src/tools/build_work_pool_template.py` | New utility to generate/patch a Prefect ECS GPU work pool base job template. |
| `prefect.yaml` | Removes the remote TiDE deployment entry (local deployment remains). |
| `models/tide/tests/test_tasks.py` | Expands tests for S3 reads, ticker filtering, and consolidation null-handling. |
| `models/tide/tests/test_run.py` | Removes tests for deleted `tide.run`. |
| `models/tide/tests/test_model.py` | Updates tests for renamed modules + dataset-based training/validation APIs. |
| `models/tide/tests/test_deploy.py` | Updates deploy tests to use Schedule + image/build options. |
| `models/tide/tests/test_data.py` | Updates tests for staged pipeline + `get_dataset` output and new dimensions. |
| `models/tide/src/tide/workflow.py` | Increases model training task timeout to 4 hours. |
| `models/tide/src/tide/trainer.py` | Switches to dataset-based training + new defaults (batch size, LR, patience, output length). |
| `models/tide/src/tide/tasks.py` | Adds schema validation + improves filtering/join consolidation logic + column normalization. |
| `models/tide/src/tide/run.py` | Removes old CLI entrypoint wrapper. |
| `models/tide/src/tide/model.py` | Migrates to dataset-based batching, adds device logging, GPU preloading, and validation-dataset early stopping. |
| `models/tide/src/tide/deploy.py` | Updates deployment registration to supply image + schedule + job variables and enforce image env var. |
| `models/tide/src/tide/data.py` | Implements staged preprocessing pipeline + TrainingDataset and replaces batch list generation. |
| `models/tide/Dockerfile` | Adds venv bin directory to PATH in the runtime image. |
| `maskfile.md` | Updates image build workflow (build-and-push), adds Prefect work pool template flow, updates trainer commands. |
| `infrastructure/training.py` | Adds ECS GPU task definition for the TiDE trainer. |
| `infrastructure/__main__.py` | Exports ECS network IDs + TiDE trainer task definition ARN for tooling. |
| `applications/portfolio_manager/pyproject.toml` | Adds prometheus-client dependency (metrics support). |
| `applications/ensemble_manager/src/ensemble_manager/server.py` | Updates TiDE usage to dataset-based prediction path and single-call predict flow. |
| `applications/ensemble_manager/pyproject.toml` | Adds prometheus-client dependency (metrics support). |
| `applications/data_manager/tests/test_storage.rs` | Updates tests for increased DuckDB config length limit. |
| `applications/data_manager/src/storage.rs` | Raises DuckDB config value max length to 4096 (ECS session token support). |
| `.github/workflows/launch_infrastructure.yaml` | Switches CI to build-and-push and uses service update for deployments. |
| `.claude/skills/autotrain/SKILL.md` | Updates documentation references to renamed TiDE files. |
Comments suppressed due to low confidence (2)
models/tide/src/tide/model.py:321
`Model.train` computes `total_batches = (num_samples + batch_size - 1) // batch_size` but does not validate `batch_size > 0`. A zero or negative batch size would raise a `ZeroDivisionError` (or behave unexpectedly). Add an explicit check for a positive batch size (and mirror it in `validate_model`).

models/tide/src/tide/model.py:396
Calling `Tensor.realize(*get_parameters(self))` on every optimization step forces parameter realization/synchronization each batch and can significantly reduce the benefit of the new GPU preloading (and slow training). If this is not strictly required for correctness with tinygrad, prefer removing it or realizing less frequently (e.g., per log interval / epoch).
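The "realize less frequently" suggestion amounts to gating the realization call on a step counter. A plain-Python sketch of that cadence (tinygrad is not assumed available here; `realize_params` is a hypothetical stand-in for `Tensor.realize(*get_parameters(self))`):

```python
def train_steps(num_steps: int, log_interval: int) -> list[int]:
    """Run a toy training loop, realizing parameters only every
    `log_interval` steps instead of on every optimization step."""
    realized_at = []

    def realize_params(step: int) -> None:
        # Stand-in for forcing parameter realization/synchronization.
        realized_at.append(step)

    for step in range(num_steps):
        # ... forward pass, loss, optimizer.step() would go here ...
        if step % log_interval == 0:  # realize per log interval, not per batch
            realize_params(step)
    return realized_at
```

With `num_steps=10, log_interval=4` this realizes at steps 0, 4, and 8 rather than all ten, which is the kind of reduction in device synchronization the comment is after.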
Greptile Summary

This PR rebuilds the TiDE model training pipeline: renames source files, introduces named pipeline stages and a `TrainingDataset` interface.

Confidence Score: 5/5

Safe to merge; both remaining findings are P2 defensive-coding suggestions with no impact on current production behavior. Production bug fixes (ticker filtering, null sector/industry) are correct and well-tested. The GPU pre-loading pattern is sound. Both inline comments are P2 style/hardening suggestions that do not block merge: models/tide/src/tide/model.py (Tensor.training guard), models/tide/src/tide/trainer.py (error string match).
| Filename | Overview |
|---|---|
| models/tide/src/tide/model.py | Rebuilt training loop: GPU dataset pre-loading, dataset-based API replacing batch lists, added validation dataset early-stopping support; Tensor.training state may not be restored if GPU pre-loading raises |
| models/tide/src/tide/data.py | Named pipeline stages (ValidateColumns, EngineerFeatures, CleanData, ScaleAndEncode), new TrainingDataset dataclass, CleanData correctly uses is_infinite() for stats and pl.all_horizontal() for single-pass filtering |
| models/tide/src/tide/tasks.py | Fixed ticker filtering to exclude all lowercase-suffix instruments via regex [a-z]; consolidate_data now filters null sector/industry rows post-join; added per-file column type normalization on S3 reads |
| models/tide/src/tide/trainer.py | Updated hyperparameters (batch_size 512, patience 3, epochs 20), creates separate train/validation datasets, brittle error-message string match for validation-set fallback |
| infrastructure/training.py | Adds tide_trainer_task_definition ECS task with GPU resource requirement, awsvpc networking, and CloudWatch log configuration |
| applications/ensemble_manager/src/ensemble_manager/server.py | Migrated from per-batch prediction loop to single-batch dataset-based prediction; updated imports from renamed data/model modules |
| tools/src/tools/build_work_pool_template.py | New utility to populate Prefect ECS work pool template with GPU resource requirements, network config, and CloudWatch logging for the TiDE trainer |
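The ticker-filtering fix summarized in the table above (exclude any instrument whose symbol contains a lowercase character) can be illustrated with a plain-Python equivalent of the Polars `str.contains("[a-z]")` predicate. The real `filter_equity_bars` in `tasks.py` operates on a dataframe; this list-based sketch only mirrors the rule:

```python
import re

def keep_common_shares(tickers: list[str]) -> list[str]:
    """Drop warrants, rights, and preferred shares, which carry a
    lowercase suffix (e.g. "DLNGpB"), keeping only plain symbols."""
    return [t for t in tickers if not re.search(r"[a-z]", t)]
```

For example, `keep_common_shares(["AAPL", "DLNGpB", "BRK.A"])` keeps "AAPL" and "BRK.A" but drops the preferred-share symbol, whereas the old filter (only matching `p`) would have missed suffixes like "w" or "r".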
Prompt To Fix All With AI
This is a comment left during a code review.
Path: models/tide/src/tide/model.py
Line: 337-344
Comment:
**`Tensor.training` not restored if GPU pre-loading raises**
`Tensor.training = True` is set on line 315 before the `try/finally` block. If any of the five `Tensor(...)` calls on lines 338–342 raises (e.g., OOM), the `finally` clause never runs, leaving `Tensor.training = True` permanently. Any subsequent inference call on the same model instance would then run with dropout active.
Move the tensor creation inside the `try` block, or expand the `try` to start at line 314:
```suggestion
logger.info("Training device", device=Device.DEFAULT)
try:
# Pre-load dataset onto compute device once to eliminate per-step transfers
gpu_past_continuous = Tensor(dataset.past_continuous)
gpu_past_categorical = Tensor(dataset.past_categorical)
gpu_future_categorical = Tensor(dataset.future_categorical)
gpu_static_categorical = Tensor(dataset.static_categorical)
gpu_targets = Tensor(dataset.targets)
```
How can I resolve this? If you propose a fix, please make it concise.
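The exception-safe shape of this fix is the standard try/finally pattern: flip the global flag, then guarantee restoration no matter what raises. A self-contained sketch (a `TrainingFlag` class stands in for tinygrad's global `Tensor.training`, and `load` stands in for the GPU tensor pre-loading that might OOM):

```python
class TrainingFlag:
    """Stand-in for tinygrad's global Tensor.training flag."""
    training = False

def train(load) -> None:
    TrainingFlag.training = True
    try:
        load()  # tensor creation inside try: an OOM here still reaches finally
        # ... training loop would go here ...
    finally:
        TrainingFlag.training = False  # always restored, even on raise
```

If `load()` raises, the exception still propagates, but the flag is back to `False`, so later inference on the same process does not silently run with dropout active.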
---
This is a comment left during a code review.
Path: models/tide/src/tide/trainer.py
Line: 69-71
Comment:
**Fragile error-message string match**
`if "Total days available" not in str(e)` couples this catch to the exact wording of the error message in `data.py`. If that string changes, the guard silently swallows unrelated `ValueError`s instead of re-raising them, hiding real failures.
Consider raising a distinct exception subclass in `get_dataset` (e.g., `InsufficientDataError`) so the catch here can be type-based rather than text-based.
How can I resolve this? If you propose a fix, please make it concise.

Reviews (2): Last reviewed commit: "Address PR #821 review feedback" | Re-trigger Greptile
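A minimal sketch of the type-based alternative suggested in the comment above, using hypothetical names (`InsufficientDataError`, a simplified `get_dataset` signature) rather than the project's real ones:

```python
class InsufficientDataError(ValueError):
    """Raised when there is too little history to build validation windows."""

def get_dataset(total_days: int, required_days: int) -> str:
    if total_days < required_days:
        raise InsufficientDataError(
            f"Total days available ({total_days}) < required ({required_days})"
        )
    return "dataset"  # placeholder for the real dataset object

def make_validation_dataset(total_days: int, required_days: int):
    try:
        return get_dataset(total_days, required_days)
    except InsufficientDataError:
        return None  # only the expected shortage is swallowed
    # any other ValueError (bad split, bad config) propagates unchanged
```

Because the catch is on the subclass, rewording the error message can never cause unrelated `ValueError`s to be swallowed.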
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
models/tide/src/tide/data.py (1)
411-413: ⚠️ Potential issue | 🟠 Major

`predict` is reusing historical rows as future covariates. For `data_type="predict"`, the last `output_length` historical rows become `future_categorical`, so features like `day_of_month`, `month`, and `year` describe the past rather than the forecast horizon. That also makes prediction require `input_length + output_length` history instead of just the encoder window. Build the decoder categorical features from dates after the latest timestamp, then pair them with only the last `input_length` historical rows.

Also applies to: 472-483
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@models/tide/src/tide/data.py` around lines 411 - 413, For data_type == "predict" (the branch setting self.batch_data via _get_prediction_data), stop using the last output_length historical rows as future_categorical; instead generate decoder categorical/temporal features from timestamps after the most recent timestamp (i.e., the actual forecast horizon) and pair those decoder features with only the last input_length encoder rows (not input_length + output_length history). Update _get_prediction_data (and the analogous logic at the other block around the 472-483 region) to build future_categorical/decoder features from generated future dates, and ensure batch_data uses the encoder window of length input_length plus decoder features for output_length.
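One way to build decoder features from dates strictly after the latest observed timestamp, as the comment suggests. This is a hypothetical sketch using calendar days; the real pipeline would likely need a trading/business-day calendar, and the function name is illustrative:

```python
from datetime import date, timedelta

def future_calendar_features(latest: date, output_length: int) -> list[dict]:
    """Generate day_of_month/month/year features for the forecast horizon,
    i.e. for dates *after* `latest`, instead of reusing the last
    output_length historical rows."""
    features = []
    for step in range(1, output_length + 1):
        d = latest + timedelta(days=step)  # strictly after the last timestamp
        features.append({"day_of_month": d.day, "month": d.month, "year": d.year})
    return features
```

Pairing these horizon features with only the last `input_length` encoder rows removes the need for `input_length + output_length` history at prediction time.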
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@applications/data_manager/src/storage.rs`:
- Around line 133-134: The magic number 4096 used in storage.rs should be
extracted into a shared constant (e.g., DUCKDB_CONFIG_MAX_LEN: usize = 4096) so
the length limit is a single cross-file contract; add the constant in a common
place (top of storage.rs or a shared constants module) with appropriate
visibility (pub or pub(crate)) and replace the literal usage in the value.len()
> 4096 check and any tests that assert against 4096 to reference
DUCKDB_CONFIG_MAX_LEN instead; update imports/usages accordingly so all
production code and tests use the constant.
In `@applications/ensemble_manager/src/ensemble_manager/server.py`:
- Line 421: The metric variable prediction_batch_count is being set to
len(dataset) (sample count) which mismatches its name; update the metric to
either be renamed to prediction_sample_count everywhere it's defined/used
(search for prediction_batch_count) or adjust its help/description to indicate
it tracks sample count, and update any telemetry/metrics registration and usages
(e.g., in server.py where prediction_batch_count.set(len(dataset)) is called) to
use the new name/description so the metric name and value semantics match.
In `@maskfile.md`:
- Around line 791-809: The script currently calls prefect directly in the line
prefect deployment run "${deployment}" --param "lookback_days=${lookback_days}",
which fails when Prefect is only available via the uv environment; change that
invocation to run Prefect through uv (e.g., use uv run prefect deployment run
...), preserving the deployment and lookback_days variables (deployment,
lookback_days, model_name) so the exact call becomes an uv-run of prefect with
the same arguments and params; ensure quotes and variable expansions remain
intact and test that mask model train tide triggers the deployment via uv.
In `@models/tide/src/tide/model.py`:
- Around line 420-423: The early-stopping branch uses
validate_model(validation_dataset) and lets a NaN stopping_loss consume
patience, so change the logic in the block that sets stopping_loss (and the
identical block later) to treat an empty validation_dataset the same as None:
first check if validation_dataset is None or empty (e.g.,
len(validation_dataset)==0 or validation_dataset.is_empty()), and if so set
stopping_loss = epoch_loss; otherwise call validate_model(...), then if the
returned stopping_loss is NaN (use math.isnan or np.isnan) fall back to
stopping_loss = epoch_loss before comparing against best_loss and
early_stopping_min_delta; apply this change for both occurrences around the
validate_model calls.
- Around line 420-423: The code uses validate_model to compute stopping_loss for
early stopping but still selects/checkpoints on epoch_loss, causing restored
weights to mismatch; update the checkpoint selection and saving logic so that
when validation_dataset is provided you use stopping_loss (the value returned by
validate_model) as the metric for saving best checkpoints and for restoring
weights, i.e., set a unified checkpoint_metric variable to stopping_loss (else
epoch_loss), use that metric in the checkpoint saving/compare code (the same
branch that currently references epoch_loss), and update the
best-loss/best-epoch tracking and restore logic to reference that unified metric
so early stopping and checkpoint restore use the same criterion.
- Around line 530-535: validate_model is calling quantile_loss without the
huber_delta used during training, so validation uses a different objective;
update the quantile_loss call in validate_model (where predictions and
targets_reshaped are computed) to pass huber_delta=self.huber_delta (same
argument used in train()), ensuring the validation loss matches the training
loss function.
In `@models/tide/src/tide/trainer.py`:
- Around line 61-73: In the validation dataset creation in trainer.py, narrow
the except block around tide_data.get_dataset to only swallow the specific "not
enough data for validation windows" error: catch the ValueError as e, inspect e
(e.g., check for a distinctive substring like "not enough data" or "too small
for windowing") and only set validation_dataset = None and log the warning in
that case; for any other ValueError re-raise it so invalid split/configuration
errors from tide_data.get_dataset surface instead. Use the existing symbols
tide_data.get_dataset, validation_dataset, and the surrounding logger.warning
call to locate where to implement this check.
In `@models/tide/tests/test_tasks.py`:
- Around line 261-302: The test currently lets DLNGpB be dropped by the join
because _SAMPLE_CATEGORIES lacks a category for it, so update the test in
test_prepare_training_data_succeeds_when_raw_data_contains_preferred_tickers to
add a category row for "DLNGpB" to _SAMPLE_CATEGORIES (so both tickers are
present through the join), then capture and inspect the parquet uploaded by
prepare_training_data (mock_s3_client.put_object or the object returned from
model_artifacts_bucket upload) to assert that the resulting parquet contains
only the preferred ticker(s) you expect (e.g., only "AAPL" if filter_equity_bars
should remove DLNGpB); reference prepare_training_data and the mocked
boto3.client (mock_s3_client.get_object/put_object) to locate where to read the
uploaded bytes and validate the parquet contents.
In `@tools/tests/test_build_work_pool_template.py`:
- Around line 91-105: Add an assertion to verify that the AWS region is
propagated into the container log config: in the test function
test_build_work_pool_template_configures_gpu_and_logging (or a new test like
test_build_work_pool_template_uses_aws_region_env_var), set or monkeypatch the
AWS_REGION env var and assert that
result["job_configuration"]["task_definition"]["containerDefinitions"][0]["logConfiguration"]["options"]["awslogs-region"]
equals the expected region (e.g., "eu-west-1"); this ensures
build_work_pool_template reads the environment variable and populates
awslogs-region in the logConfiguration.
---
Outside diff comments:
In `@models/tide/src/tide/data.py`:
- Around line 411-413: For data_type == "predict" (the branch setting
self.batch_data via _get_prediction_data), stop using the last output_length
historical rows as future_categorical; instead generate decoder
categorical/temporal features from timestamps after the most recent timestamp
(i.e., the actual forecast horizon) and pair those decoder features with only
the last input_length encoder rows (not input_length + output_length history).
Update _get_prediction_data (and the analogous logic at the other block around
the 472-483 region) to build future_categorical/decoder features from generated
future dates, and ensure batch_data uses the encoder window of length
input_length plus decoder features for output_length.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9ac4c5ef-efc1-4146-8a4c-6d7c60b2a66f
⛔ Files ignored due to path filters (1)
`uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (26)
- .claude/skills/autotrain/SKILL.md
- .github/workflows/launch_infrastructure.yaml
- applications/data_manager/src/storage.rs
- applications/data_manager/tests/test_storage.rs
- applications/ensemble_manager/pyproject.toml
- applications/ensemble_manager/src/ensemble_manager/server.py
- applications/portfolio_manager/pyproject.toml
- infrastructure/__main__.py
- infrastructure/training.py
- maskfile.md
- models/tide/Dockerfile
- models/tide/src/tide/data.py
- models/tide/src/tide/deploy.py
- models/tide/src/tide/model.py
- models/tide/src/tide/run.py
- models/tide/src/tide/tasks.py
- models/tide/src/tide/trainer.py
- models/tide/src/tide/workflow.py
- models/tide/tests/test_data.py
- models/tide/tests/test_deploy.py
- models/tide/tests/test_model.py
- models/tide/tests/test_run.py
- models/tide/tests/test_tasks.py
- prefect.yaml
- tools/src/tools/build_work_pool_template.py
- tools/tests/test_build_work_pool_template.py
💤 Files with no reviewable changes (3)
- prefect.yaml
- models/tide/tests/test_run.py
- models/tide/src/tide/run.py
- server.py: set prediction_batch_count to 1 (all predictions run in a single in-memory batch after refactor; previously used len(dataset), which counted symbols, not batches)
- tasks.py: guard sector/industry filter with a column presence check so consolidate_data does not fail when those columns are absent from the consolidated frame
- build_work_pool_template.py: fix CLI usage string to reflect actual invocation via `python -m tools.build_work_pool_template`
- storage.rs: extract magic 4096 to DUCKDB_CONFIG_VALUE_MAX_LENGTH const; update test to use the constant
- maskfile.md: prefix prefect deployment run with `uv run` so Prefect executes inside the uv environment on a clean workspace
- model.py: add math import; fall back stopping_loss to epoch_loss when validate_model returns NaN; use stopping_loss (not epoch_loss) as checkpoint metric when validation is available to avoid saving overfit weights; pass huber_delta to quantile_loss in validate_model to match training loop
- trainer.py: narrow ValueError catch to only swallow "Total days available" errors; re-raise config/programming errors immediately
- test_tasks.py: add DLNGpB category row so filter_equity_bars test exercises the actual filter path; assert only AAPL appears in the uploaded parquet output from prepare_training_data
- test_build_work_pool_template.py: assert awslogs-region in GPU/logging test to match what build_work_pool_template writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Overview
Changes
- Renamed `tide_data.py` → `data.py` and `tide_model.py` → `model.py` for consistency
- Rebuilt the `Data` preprocessing pipeline with named stages (`ValidateColumns`, `EngineerFeatures`, `CleanData`, `ScaleAndEncode`) and a `Pipeline` orchestrator
- Fixed `CleanData` to collect stats on the original dataframe before filtering, and to use `pl.all_horizontal()` for a single-pass filter; corrected the `is_infinite()` vs `~is_finite()` distinction
- Fixed `filter_equity_bars` in `tasks.py` to exclude all lowercase-suffix tickers (warrants, rights, preferred shares) using `str.contains("[a-z]")` instead of only filtering `p`
- Fixed `consolidate_data` in `tasks.py` to filter rows with null `sector` or `industry` after the join, resolving a pandera `SchemaError` on non-nullable columns
- Added device logging (`Device.DEFAULT`) before the training loop in `model.py`; uses `.tolist()` for tinygrad compatibility
- Updated `trainer.py` default configuration: `early_stopping_patience` 10 → 3, `batch_size` 256 → 512, `learning_rate` 0.0005 → 0.001
- Removed `run.py` and `test_run.py`; replaced with `deploy.py` and `trainer.py`
- `tasks.py`: S3 read/write and data consolidation functions, fully tested
- Added `tools/src/tools/build_work_pool_template.py` for ECS work pool configuration generation
- Updated `maskfile.md` model commands and training infrastructure section
- Added `infrastructure/training.py` with ECS GPU cluster and TiDE trainer task definition

Context

This branch rebuilds the TiDE model training pipeline after migrating the model code into `models/tide/`. Key fixes address pandera schema-validation failures observed in production (lowercase ticker suffixes and null sector/industry values post-join). Training-loop performance is improved by pre-loading dataset tensors onto the GPU once per training run rather than per batch step. Early-stopping and batch-size defaults are tuned to reduce wall-clock training time on the `g4dn.xlarge` ECS instance.

Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
Documentation