Conversation
> ⚠️ Caution: Review failed. Failed to post review comments.

**Walkthrough**

This pull request adds comprehensive observability via Sentry and structured logging across the datamanager, equitypricemodel, and portfoliomanager services. It introduces new training infrastructure by adding SageMaker-based and local model training workflows, new data preparation tooling, and support for S3-based model artifact management. Infrastructure changes create dedicated ECR repositories for each service, S3 buckets for data and model artifacts, and configure corresponding ECS services with environment variables and secrets. Configuration files are expanded with additional permissions, dependencies, and Docker build environment variables for DuckDB integration. Collectively, these changes establish a complete observability, training, and deployment pipeline for the equity price prediction system.

**Sequence Diagram**

```mermaid
sequenceDiagram
    participant Server as equitypricemodel Server
    participant S3 as AWS S3
    participant Filesystem as Temp Directory
    participant Model as TiDE Model
    participant Client as HTTP Client
    Server->>Server: App startup (lifespan context)
    Server->>S3: Find latest artifact key
    S3-->>Server: Artifact key path
    Server->>S3: Download artifact (.tar.gz)
    S3-->>Server: Artifact file
    Server->>Filesystem: Extract to temp directory
    Filesystem-->>Server: Model files ready
    Server->>Model: Load model from directory
    Model-->>Server: Model instance loaded
    Server->>Server: Store model in app.state
    Server-->>Client: Application ready
    Client->>Server: POST /model/predictions
    Server->>Server: Retrieve model from app.state
    Server->>Model: predict(data)
    Model-->>Server: Predictions
    Server-->>Client: JSON response
    Server->>Server: App shutdown (lifespan cleanup)
    Server->>Filesystem: Clean up temp directory
```
🚀 The Update (preview) for forstmeier/pocketsizefund/production (at 99b3ea8) was successful.

✨ Neo

**Explanation**

Initial production deployment creating complete AWS infrastructure for a containerized financial application with three microservices (datamanager, equitypricemodel, portfoliomanager) running on ECS with load balancing, private networking, and ML training capabilities via SageMaker.

**Root Cause Analysis**

This deployment was triggered by a code change in the repository. The infrastructure is being created from scratch - this appears to be an initial deployment of the "pocketsizefund" application to the production environment. The developer has defined a complete AWS infrastructure stack for a microservices-based financial application with three main services: datamanager, equitypricemodel, and portfoliomanager.

**Dependency Chain**

The deployment follows a logical infrastructure creation order:

**Risk Analysis**

LOW RISK - This is a greenfield deployment with no existing resources to impact. All resources are being created fresh, so there's no risk of data loss or service disruption.

**Resource Changes**

| Name | Type | Operation |
|---|---|---|
| equitypricemodel_trainer_repository | aws:ecr/repository:Repository | create |
| s3_gateway_endpoint | aws:ec2/vpcEndpoint:VpcEndpoint | create |
| nat_route | aws:ec2/route:Route | create |
| task_role | aws:iam/role:Role | create |
| model_artifacts_bucket | aws:s3/bucketV2:BucketV2 | create |
| sagemaker_s3_policy | aws:iam/rolePolicy:RolePolicy | create |
| private_subnet_2_rta | aws:ec2/routeTableAssociation:RouteTableAssociation | create |
| equitypricemodel_repository | aws:ecr/repository:Repository | create |
| vpc | aws:ec2/vpc:Vpc | create |
| igw | aws:ec2/internetGateway:InternetGateway | create |
| public_subnet_1 | aws:ec2/subnet:Subnet | create |
| public_route_table | aws:ec2/routeTable:RouteTable | create |
| portfoliomanager_sd | aws:servicediscovery/service:Service | create |
| ecr_dkr_endpoint | aws:ec2/vpcEndpoint:VpcEndpoint | create |
| datamanager_service | aws:ecs/service:Service | create |
| datamanager_logs | aws:cloudwatch/logGroup:LogGroup | create |
| ecs_sg | aws:ec2/securityGroup:SecurityGroup | create |
| vpc_endpoints_sg | aws:ec2/securityGroup:SecurityGroup | create |
| private_route_table | aws:ec2/routeTable:RouteTable | create |
| alb_sg | aws:ec2/securityGroup:SecurityGroup | create |
| ecs_from_alb | aws:ec2/securityGroupRule:SecurityGroupRule | create |
| equitypricemodel_task | aws:ecs/taskDefinition:TaskDefinition | create |
| datamanager_rule | aws:lb/listenerRule:ListenerRule | create |
| equitypricemodel_tg | aws:lb/targetGroup:TargetGroup | create |
| public_subnet_1_rta | aws:ec2/routeTableAssociation:RouteTableAssociation | create |
| equitypricemodel_rule | aws:lb/listenerRule:ListenerRule | create |
| datamanager_tg | aws:lb/targetGroup:TargetGroup | create |
| private_subnet_2 | aws:ec2/subnet:Subnet | create |
| task_role_s3_policy | aws:iam/rolePolicy:RolePolicy | create |
| nat_gateway | aws:ec2/natGateway:NatGateway | create |
| http_listener | aws:lb/listener:Listener | create |
| portfoliomanager_rule | aws:lb/listenerRule:ListenerRule | create |
| equitypricemodel_service | aws:ecs/service:Service | create |
| equitypricemodel_logs | aws:cloudwatch/logGroup:LogGroup | create |
| sagemaker_cloudwatch_policy | aws:iam/rolePolicy:RolePolicy | create |
| sagemaker_ecr_policy | aws:iam/rolePolicy:RolePolicy | create |
| vpc_endpoints_ingress | aws:ec2/securityGroupRule:SecurityGroupRule | create |
| portfoliomanager_task | aws:ecs/taskDefinition:TaskDefinition | create |
| portfoliomanager_service | aws:ecs/service:Service | create |
| execution_role | aws:iam/role:Role | create |
| public_subnet_2 | aws:ec2/subnet:Subnet | create |

... and 26 other changes
Confidence Score: 4/5
Additional Comments (1)
- `infrastructure/__main__.py`, line 10 (link): **syntax:** Changed from `.name` to `.region`, but `aws.get_region()` returns an `AwaitableGetRegionResult` with a `name` field, not `region`
30 files reviewed, 2 comments
```python
logger = structlog.get_logger()

logger.info(f"Using device: {Device.DEFAULT}")
```
style: Don't use f-strings with structlog, pass values as kwargs for structured logging
```diff
- logger.info(f"Using device: {Device.DEFAULT}")
+ logger.info("Using device", device=Device.DEFAULT)
```
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Pull request overview
This pull request implements equity data integration for model training with significant infrastructure and application improvements. It adds data preparation pipelines, enhances the training workflow, integrates Sentry for error monitoring, and improves the datamanager's S3 query capabilities.
Changes:
- Added new data management tools for syncing equity categories and preparing training data
- Integrated Sentry SDK for error monitoring across all services
- Enhanced infrastructure with S3 buckets, ECR repositories, and SageMaker IAM roles
- Improved training workflow with local training support, GPU support, and enhanced logging
- Fixed datamanager S3 query patterns and added comprehensive logging
Reviewed changes
Copilot reviewed 22 out of 25 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| uv.lock | Added sentry-sdk, boto3, numpy, structlog, and tinygrad dependencies for equitypricemodel |
| tools/sync_equity_categories.py | New script to fetch and sync equity sector/industry data from Polygon API to S3 |
| tools/sync_equity_bars_data.py | Reduced API rate limit sleep time from 15s to 1s |
| tools/run_training_local.py | New script for local model training with GPU support |
| tools/run_training_job.py | Added configurable SageMaker instance types with presets |
| tools/prepare_training_data.py | New script to consolidate equity bars with categories for training |
| maskfile.md | Added data management commands and model training workflows |
| infrastructure/main.py | Created S3 buckets, ECR repositories, SageMaker roles, and improved resource management |
| infrastructure/Pulumi.production.yaml | Changed AWS region from encrypted to plain text |
| applications/portfoliomanager | Added Sentry integration and structured logging |
| applications/equitypricemodel/trainer.py | Added comprehensive logging with structlog and device detection |
| applications/equitypricemodel/tide_model.py | Added data validation, early stopping, and improved training metrics |
| applications/equitypricemodel/tide_data.py | Optimized batch creation with numpy and added NaN/Inf validation |
| applications/equitypricemodel/server.py | Added S3 artifact loading, Sentry integration, and improved data handling |
| applications/equitypricemodel/preprocess.py | Optimized filtering using semi-join instead of filtering after aggregation |
| applications/equitypricemodel/predictions_schema.py | Fixed timestamp dtype and predictions count validation |
| applications/equitypricemodel/Dockerfile | Changed trainer base image to CUDA-enabled for GPU support |
| applications/datamanager | Added Sentry integration, improved S3 glob patterns, enhanced logging throughout |
| Cargo.lock | Added sentry and sentry-tower dependencies |
| .flox/env/manifest.toml | Added direnv package |
| .claude/settings.local.json | Added AWS CLI and Sentry MCP permissions |
```python
    days=70
)  # need 42 trading days (35 input + 7 output), ~60 calendar days + buffer
```
The magic number 70 is hardcoded with an explanatory comment. Consider extracting this to a named constant like LOOKBACK_DAYS = 70 with the comment explaining the calculation (35 input days + 7 output days + buffer for non-trading days). This makes the code more maintainable and the intent clearer.
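A minimal sketch of the suggested extraction (the constant names and the buffer split are illustrative, not taken from the PR):

```python
from datetime import date, timedelta

# Hypothetical constants: the model needs 35 input days plus a 7-day
# prediction horizon (42 trading days). Since ~42 trading days span roughly
# 60 calendar days, the remainder buffers weekends and market holidays.
INPUT_DAYS = 35
OUTPUT_DAYS = 7
NON_TRADING_BUFFER_DAYS = 28
LOOKBACK_DAYS = INPUT_DAYS + OUTPUT_DAYS + NON_TRADING_BUFFER_DAYS  # 70

def lookback_start(end: date) -> date:
    """Earliest calendar date to fetch so 42 trading days are covered."""
    return end - timedelta(days=LOOKBACK_DAYS)
```

With named constants, changing the model's input or output window updates the lookback automatically instead of requiring someone to re-derive 70.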
```python
cat_array = ticker_df[self.categorical_columns].to_numpy().astype(np.int32)
cont_array = ticker_df[self.continuous_columns].to_numpy().astype(np.float32)
static_array = ticker_df[self.static_categorical_columns].head(1).to_numpy().astype(np.int32)
```
The dtype conversion from float64 to float32 is mentioned in a comment about Metal GPU compatibility. Consider adding a configuration parameter or constant for the target dtype instead of hardcoding float32, especially since this could affect model precision. Document the trade-off between precision and GPU compatibility.
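One way to realize that suggestion, as a hedged sketch (the constant and function names are hypothetical): hoist the target dtype into a single module-level constant so the precision-versus-GPU-compatibility trade-off is documented in one place rather than at every `.astype()` call site.

```python
import numpy as np

# Hypothetical configuration constant: float32 halves memory and is required
# by Metal GPU backends; switch to np.float64 on CPU if precision matters
# more than speed. Document the trade-off here, once.
CONTINUOUS_DTYPE = np.float32

def to_continuous_array(values, dtype=CONTINUOUS_DTYPE) -> np.ndarray:
    """Convert raw continuous feature values to the configured dtype."""
    return np.asarray(values, dtype=dtype)
```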
```diff
  pa.Check(
      check_fn=lambda data: check_dates_count_per_ticker(
-         data=data, dates_count=7
+         data=data, dates_count=1
```
The predictions schema check changed from expecting 7 dates per ticker to expecting 1 date per ticker. This seems inconsistent with the model's output_length of 7 days. If the model predicts 7 days ahead, shouldn't each ticker have 7 prediction rows? This change may indicate a logic error in how predictions are being structured or validated.
```diff
- data=data, dates_count=1
+ data=data
```
The expected output should be seven dates - is there a reason it was changed to 1?
```diff
  RUN uv sync --no-dev

- FROM python:3.12.10-slim AS trainer
+ FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04 AS trainer
```
The base image changed from python:3.12.10-slim to nvidia/cuda:12.4.1-runtime-ubuntu22.04 for GPU support, but the comment on line 31 sets ENV CUDA=1. Consider documenting whether this is for tinygrad GPU detection or if additional CUDA environment variables may be needed. Also verify that CUDA 12.4.1 is compatible with the target SageMaker GPU instances.
```diff
  #[derive(Deserialize)]
  pub struct SavePayload {
-     pub data: DataFrame,
+     pub data: Vec<Prediction>,
```
The SavePayload now expects Vec<Prediction> instead of DataFrame. This is a breaking API change that requires the client to send JSON-serialized predictions instead of binary Parquet. While this may be intentional for better error handling and validation, ensure all clients are updated to use the new format. Document this change in API documentation and consider versioning the endpoint if backward compatibility is needed.
```diff
  );

- let required_columns = vec!["sector", "industry"];
+ let required_columns = vec!["ticker", "sector", "industry"];
```
The required columns for equity_details_dataframe now include 'ticker', but the equity_details_schema in the Python code doesn't include 'ticker' as a required field (only 'sector' and 'industry'). This mismatch will cause validation failures when the dataframe is sent to equitypricemodel. Either add 'ticker' to the Python schema or remove it from the required columns in Rust.
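A hedged sketch of one way to close that gap on the Python side (all names here are illustrative, not the PR's actual schema code): declare the required columns once and validate against that single source of truth, so the Rust producer and Python consumer cannot drift apart again.

```python
# Hypothetical shared constant mirroring the Rust side's required_columns.
REQUIRED_EQUITY_DETAIL_COLUMNS = ("ticker", "sector", "industry")

def validate_equity_details(rows: list[dict]) -> list[dict]:
    """Raise ValueError if any row lacks a required column."""
    for i, row in enumerate(rows):
        missing = [c for c in REQUIRED_EQUITY_DETAIL_COLUMNS if c not in row]
        if missing:
            raise ValueError(f"row {i} missing columns: {missing}")
    return rows
```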
forstmeier
left a comment
Some comments/questions and also there are Ruff errors on a chunk of the Python stuff that needs to be fixed.
Remove unused comments.
Also I like extracting the model from the image. In the future, I'd like to explore having things like equitypricemodel be more of an aggregator and normalizer with just a pass-through call to an external service (e.g. SageMaker inference, HuggingFace inference, etc).
You should be able to remove some of the logic around downloading artifacts from the launch_infrastructure.yaml GitHub workflow (I say we still keep the Mask command/helper script for now). Another cleanup (not necessary for this pull request) would likely be extracting the common logic from that tools/ helper script and here into the shared library.
```python
nan_count = data.filter(pl.col(col).is_nan()).height
null_count = data.filter(pl.col(col).is_null()).height
inf_count = data.filter(~pl.col(col).is_finite()).height
if nan_count > 0 or null_count > 0 or inf_count > 0:
```
Should any of these checks exit the method or is it safe to continue even if there are issues?
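One hedged answer to that question, as a sketch rather than the PR's actual behavior: fail fast instead of logging and continuing, so rows containing NaN, null, or infinite values can never silently reach training. The function name is hypothetical.

```python
import math

# Hypothetical fail-fast variant of the check above: raise on the first
# column with non-finite values rather than only logging counts.
def assert_finite(column: str, values: list) -> None:
    """Raise ValueError if the column contains None, NaN, or +/-inf."""
    bad = [v for v in values if v is None or not math.isfinite(v)]
    if bad:
        raise ValueError(f"column {column!r} has {len(bad)} non-finite values")
```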
```python
if early_stopping_patience is not None:
    if epoch_loss < best_loss - early_stopping_min_delta:
        best_loss = epoch_loss
        epochs_without_improvement = 0
        logger.info(
            "New best loss",
            best_loss=f"{best_loss:.4f}",
        )
    else:
        epochs_without_improvement += 1
        logger.info(
            "No improvement",
            epochs_without_improvement=epochs_without_improvement,
            patience=early_stopping_patience,
        )

    if epochs_without_improvement >= early_stopping_patience:
        logger.info(
            "Early stopping triggered",
            epoch=epoch + 1,
            best_loss=f"{best_loss:.4f}",
            epochs_without_improvement=epochs_without_improvement,
```
This is really nice. Trick of the trade I didn't know.
```python
    tags=tags,
)

equitypricemodel_tg = aws.lb.TargetGroup(
```
You said this is the one that's temporary, right? If so, just toss in a comment indicating it is.
```python
input_dir.mkdir(parents=True, exist_ok=True)
output_dir.mkdir(parents=True, exist_ok=True)

training_data_path = input_dir / "filtered_tft_training_data.parquet"
```
These files and S3 references should be updated from tft to tide.
In-line comments can be pulled.
```shell
uv run python tools/run_training_job.py
```

### train-local (application_name)
Nitpick but maybe have one of the options for "instance type" in the above be local and that does this logic here? A bit of a concept overload but I love a clean Mask command list.
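A sketch of that nitpick, with entirely hypothetical preset names: treat `local` as just another instance-type option so a single Mask command covers both SageMaker and on-machine training, using `None` as the sentinel for "no SageMaker instance".

```python
# Hypothetical preset table: "local" is a peer of the SageMaker instance
# types, so the Mask command list stays flat and uniform.
INSTANCE_PRESETS = {
    "small": "ml.m5.xlarge",
    "gpu": "ml.g5.xlarge",
    "local": None,  # sentinel: train on this machine instead of SageMaker
}

def launch_training(preset: str) -> str:
    """Return a description of the launch target for the chosen preset."""
    instance_type = INSTANCE_PRESETS[preset]
    if instance_type is None:
        # Here one would shell out to the local trainer, e.g.
        # subprocess.run([sys.executable, "tools/run_training_local.py"], check=True)
        return "local training run"
    return f"SageMaker job on {instance_type}"
```

The concept overload is real, though: "local" is not an instance type, so a `--local` flag on the same command is an equally defensible design.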