Conversation
**Caution:** Review failed. The pull request is closed.

**Walkthrough**

This PR establishes AWS infrastructure for data and model management, adds comprehensive observability to the datamanager Rust application, and introduces Python utilities for data preparation pipelines. Changes include:

- New S3 buckets for data and model artifacts with versioning enabled
- ECR repositories for containerized services
- IAM roles and policies for SageMaker integration
- CloudWatch log groups for application observability
- DuckDB environment variable configuration in the datamanager Dockerfile
- Extensive distributed tracing instrumentation across datamanager modules (data.rs, storage.rs, equity_bars.rs, state.rs, health.rs, main.rs)
- New Python scripts for syncing equity categories from the Polygon API and consolidating training data from S3
- Development workflow automation via maskfile commands
- Adjusted rate limiting in existing data sync tooling

**Sequence Diagrams**

```mermaid
sequenceDiagram
    participant User
    participant Mask as Mask Command
    participant S3 as AWS S3
    participant PolygonAPI as Polygon API
    participant DuckDB as DuckDB/Local
    participant SageMaker as SageMaker
    User->>Mask: data sync-categories
    Mask->>PolygonAPI: Fetch all US stock tickers
    PolygonAPI-->>Mask: Ticker list with sectors/industries
    Mask->>S3: Upload categories CSV
    S3-->>Mask: Stored
    User->>Mask: models prepare (app_name)
    Mask->>S3: Read equity bars parquet (date range)
    Mask->>S3: Read categories CSV
    S3-->>Mask: Data retrieved
    Mask->>DuckDB: Filter & join data
    DuckDB-->>Mask: Consolidated training dataset
    Mask->>S3: Write training data parquet
    S3-->>Mask: Stored
    User->>Mask: models train (app_name)
    Mask->>SageMaker: Submit training job
    SageMaker->>S3: Read training data
    SageMaker->>S3: Write model artifacts
    SageMaker-->>Mask: Training complete
```
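The "Filter & join data" step in the diagram above can be sketched in pandas. This is a hypothetical illustration only: the column names and the price/volume thresholds are invented, not taken from `tools/prepare_training_data.py`.

```python
# Hypothetical sketch: join equity bars with the categories CSV on ticker,
# then drop rows below illustrative price/volume thresholds.
import pandas as pd

bars = pd.DataFrame({
    "ticker": ["AAPL", "XPEN"],
    "close": [185.0, 0.40],
    "volume": [1_000_000, 900],
})
categories = pd.DataFrame({
    "ticker": ["AAPL", "XPEN"],
    "sector": ["Technology", "Consumer Cyclical"],
})

# Inner merge keeps only tickers present in both datasets.
training = bars.merge(categories, on="ticker", how="inner")
# Illustrative thresholds: drop sub-$1 prices and thin volume.
training = training[(training["close"] >= 1.0) & (training["volume"] >= 10_000)]
```

Only the AAPL row survives the filter in this toy dataset; the real script's thresholds and schema may differ.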
```mermaid
sequenceDiagram
    participant Client as Client/API
    participant Datamanager as Datamanager
    participant DuckDB as DuckDB
    participant S3 as AWS S3
    Client->>Datamanager: Request equity bars
    Datamanager->>Datamanager: Log request (tracing)
    Datamanager->>S3: Query parquet with glob (daily/**/*.parquet)
    S3-->>DuckDB: S3 credential setup + read_parquet
    DuckDB->>DuckDB: Apply date range filter (YYYYMMDD)
    DuckDB->>DuckDB: Hive partitioned scan
    DuckDB-->>Datamanager: DataFrame results
    Datamanager->>Datamanager: Log row count & dimensions
    Datamanager-->>Client: Return data
    Client->>Datamanager: Request portfolio snapshot
    Datamanager->>Datamanager: Log timestamp filter
    Datamanager->>S3: Glob most recent partition
    S3-->>DuckDB: read_parquet with Hive partitioning
    DuckDB-->>Datamanager: Latest portfolio state
    Datamanager->>Datamanager: Log final row count
    Datamanager-->>Client: Return portfolio
```
📜 Recent review details — Configuration used: Organization UI. Review profile: CHILL. Plan: Pro.
⛔ Files ignored due to path filters (1)
📒 Files selected for processing (14)
🚀 The Update (preview) for forstmeier/pocketsizefund/production (at 43cf7b9) was successful.

**✨ Neo Explanation**

Initial production deployment of a containerized financial portfolio management platform on AWS, creating a complete microservices architecture with ECS, load balancing, service discovery, and ML training capabilities.

**Root Cause Analysis**

This is a complete initial deployment of the PocketSizeFund production infrastructure. The developer has written new Pulumi infrastructure-as-code that defines a microservices architecture for a financial portfolio management application running on AWS ECS (containerized services).

**Dependency Chain**

The infrastructure builds in layers:
**Risk Analysis**

MODERATE RISK - This is a greenfield deployment creating all new infrastructure. The primary risks are:
**Resource Changes** (Name / Type / Operation)
+ public_subnet_2 aws:ec2/subnet:Subnet create
+ public_subnet_1_rta aws:ec2/routeTableAssociation:RouteTableAssociation create
+ data_bucket_versioning aws:s3/bucketVersioningV2:BucketVersioningV2 create
+ portfoliomanager_tg aws:lb/targetGroup:TargetGroup create
+ nat_gateway aws:ec2/natGateway:NatGateway create
+ s3_gateway_endpoint aws:ec2/vpcEndpoint:VpcEndpoint create
+ datamanager_repository aws:ecr/repository:Repository create
+ ecs_cluster aws:ecs/cluster:Cluster create
+ sagemaker_execution_role aws:iam/role:Role create
+ equitypricemodel_logs aws:cloudwatch/logGroup:LogGroup create
+ vpc_endpoints_sg aws:ec2/securityGroup:SecurityGroup create
+ service_discovery aws:servicediscovery/privateDnsNamespace:PrivateDnsNamespace create
+ sagemaker_cloudwatch_policy aws:iam/rolePolicy:RolePolicy create
+ public_internet_route aws:ec2/route:Route create
+ private_subnet_2 aws:ec2/subnet:Subnet create
+ ecr_dkr_endpoint aws:ec2/vpcEndpoint:VpcEndpoint create
+ equitypricemodel_sd aws:servicediscovery/service:Service create
+ portfoliomanager_rule aws:lb/listenerRule:ListenerRule create
+ equitypricemodel_service aws:ecs/service:Service create
+ equitypricemodel_trainer_repository aws:ecr/repository:Repository create
+ public_subnet_2_rta aws:ec2/routeTableAssociation:RouteTableAssociation create
+ model_artifacts_bucket_versioning aws:s3/bucketVersioningV2:BucketVersioningV2 create
+ igw aws:ec2/internetGateway:InternetGateway create
+ alb_sg aws:ec2/securityGroup:SecurityGroup create
+ alb aws:lb/loadBalancer:LoadBalancer create
+ portfoliomanager_service aws:ecs/service:Service create
+ datamanager_tg aws:lb/targetGroup:TargetGroup create
+ datamanager_task aws:ecs/taskDefinition:TaskDefinition create
+ private_subnet_1_rta aws:ec2/routeTableAssociation:RouteTableAssociation create
+ vpc aws:ec2/vpc:Vpc create
+ execution_role_secrets_policy aws:iam/rolePolicy:RolePolicy create
+ execution_role_policy aws:iam/rolePolicyAttachment:RolePolicyAttachment create
+ ecs_egress aws:ec2/securityGroupRule:SecurityGroupRule create
+ public_subnet_1 aws:ec2/subnet:Subnet create
+ ecs_sg aws:ec2/securityGroup:SecurityGroup create
+ ecs_from_alb aws:ec2/securityGroupRule:SecurityGroupRule create
+ nat_route aws:ec2/route:Route create
+ datamanager_service aws:ecs/service:Service create
+ task_role_s3_policy aws:iam/rolePolicy:RolePolicy create
+ sagemaker_s3_policy aws:iam/rolePolicy:RolePolicy create
+ datamanager_sd aws:servicediscovery/service:Service create
... and 24 other changes
Confidence Score: 4/5
```diff
 logger.info("Waiting 15 seconds before next request")
-time.sleep(15)  # Massive API rate limit
+time.sleep(1)  # Massive API rate limit
```
**syntax:** Log message says "Waiting 15 seconds" but code sleeps for 1 second
```suggestion
logger.info("Waiting 1 second before next request")
time.sleep(1)  # Massive API rate limit
```
Prompt To Fix With AI
This is a comment left during a code review.
Path: tools/sync_equity_bars_data.py
Line: 121:122
Comment:
**syntax:** Log message says "Waiting 15 seconds" but code sleeps for 1 second
```suggestion
logger.info("Waiting 1 second before next request")
time.sleep(1) # Massive API rate limit
```
How can I resolve this? If you propose a fix, please make it concise.

```diff
 account_id = current_identity.account_id

-region = aws.get_region().name
+region = aws.get_region().region
```
**style:** Changed from `aws.get_region().name` to `.region` - check that this is the correct attribute for the AWS provider version being used
Prompt To Fix With AI
This is a comment left during a code review.
Path: infrastructure/__main__.py
Line: 10:10
Comment:
**style:** Changed from `aws.get_region().name` to `.region` - check that this is the correct attribute for the AWS provider version being used
How can I resolve this? If you propose a fix, please make it concise.
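One defensive option, if pinning the pulumi-aws version is not practical, is to read whichever attribute the installed SDK exposes. This is a sketch only — it assumes nothing about Pulumi beyond `get_region()` returning an object with either a `.region` or a `.name` attribute, and the stand-in classes below are not real Pulumi types.

```python
# Hypothetical compatibility shim: newer pulumi-aws releases expose
# `.region` on the get_region() result, while older ones expose `.name`.
def region_name(get_region_result):
    for attr in ("region", "name"):
        value = getattr(get_region_result, attr, None)
        if value:
            return value
    raise AttributeError("get_region() result has neither .region nor .name")

class OldResult:    # stands in for an older SDK's result object
    name = "us-east-1"

class NewResult:    # stands in for a newer SDK's result object
    region = "us-east-1"
```

A pinned dependency plus the single correct attribute is simpler; the shim only helps while both SDK generations are in play.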
Pull request overview
This pull request prepares data infrastructure for model training by adding new data synchronization tools, fixing datamanager issues, and expanding infrastructure to support SageMaker training jobs.
Changes:
- Added two new Python scripts for syncing equity categories from Polygon API and preparing consolidated training data
- Updated infrastructure to create S3 buckets, ECR repositories, and SageMaker execution roles instead of looking up existing resources
- Enhanced datamanager with improved logging, better S3 query patterns using glob and hive partitioning, and fixed data type handling
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tools/sync_equity_categories.py | New script to fetch ticker categories (sector/industry) from Polygon API and upload to S3 |
| tools/prepare_training_data.py | New script to consolidate equity bars with categories and filter by price/volume thresholds |
| tools/sync_equity_bars_data.py | Reduced API rate limit wait time from 15s to 1s |
| `infrastructure/__main__.py` | Major refactor: creates S3 buckets, ECR repositories, SageMaker role; fixes region attribute; updates task definitions |
| infrastructure/Pulumi.production.yaml | Changed region from secure/encrypted value to plain text |
| applications/datamanager/src/storage.rs | Improved logging, switched to glob patterns with hive partitioning for S3 queries, better error handling |
| applications/datamanager/src/state.rs | Enhanced initialization logging and improved default value handling |
| applications/datamanager/src/main.rs | Updated logging filter and added startup log |
| applications/datamanager/src/health.rs | Added debug logging to health check |
| applications/datamanager/src/equity_bars.rs | Fixed data types (u64→f64 for prices/volumes), improved logging, enhanced error messages |
| applications/datamanager/src/data.rs | Added comprehensive logging throughout data transformation functions |
| applications/datamanager/Dockerfile | Added DuckDB environment variables for build |
| maskfile.md | Added new commands for syncing categories and preparing training data |
| .flox/env/manifest.lock | Minor formatting fix (trailing newline) |
| .claude/settings.local.json | Reformatted JSON and added AWS CLI permissions |
Comments suppressed due to low confidence (1)
applications/datamanager/src/storage.rs:1
- Logging raw API response content could expose sensitive data if the API returns any credentials, tokens, or PII in error responses. Consider sanitizing or limiting what gets logged, or use debug level instead of warn.
```python
if next_url:
    url = next_url
    params = {"apiKey": api_key}
    time.sleep(0.25)
```
The sleep duration of 0.25 seconds may be too aggressive for API rate limiting. Consider documenting the specific rate limit from Polygon API or making this configurable via environment variable.
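Making the pause configurable, as suggested, could look like the sketch below. The environment variable name `POLYGON_PAGE_SLEEP_SECONDS` and the 0.25 s default are illustrative, not an existing knob in the script, and the real default should come from the documented Polygon rate limit for the account tier.

```python
import os
import time

# Hypothetical: read the pagination pause from the environment instead of
# hard-coding 0.25s, so each deployment can match its API tier's limits.
PAGE_SLEEP_SECONDS = float(os.environ.get("POLYGON_PAGE_SLEEP_SECONDS", "0.25"))

def throttle(sleep=time.sleep):
    """Pause between paginated requests; `sleep` is injectable for tests."""
    sleep(PAGE_SLEEP_SECONDS)
```

The injectable `sleep` parameter keeps the pause testable without actually waiting.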
```python
    output_key: str,
) -> str:
    """Write consolidated training data to S3 as parquet."""
    import io
```
Import statement should be moved to the top of the file with other imports (after line 17) following PEP 8 style guidelines.
```diff
 if current_date <= end_date:
     logger.info("Waiting 15 seconds before next request")
-    time.sleep(15)  # Massive API rate limit
+    time.sleep(1)  # Massive API rate limit
```
Reduced sleep time from 15 seconds to 1 second represents a 15x increase in request rate. Verify this doesn't violate the Massive API rate limits, as comment still references 'Massive API rate limit'. If rate limit has changed, update the comment to reflect the actual limit.
```diff
 account_id = current_identity.account_id

-region = aws.get_region().name
+region = aws.get_region().region
```
This changes the attribute from .name to .region. Verify that .region is the correct attribute for aws.get_region() in the Pulumi AWS SDK version being used. The old code used .name which is a common pattern.
```suggestion
region = aws.get_region().name
```
```diff
 config:
-  aws:region:
-    secure: AAABALPeEekY8m4V3bzTX5idnUTmjjjRWfJQit7uhk+w4mxTWyDK5r0=
+  aws:region: us-east-1
```
The AWS region was changed from an encrypted value to plain text. While region information is generally not sensitive, if the previous configuration used encryption for consistency or organizational policy, this change breaks that pattern. Consider whether this was intentional or if it should remain encrypted.
```diff
 t: u64,
-v: Option<u64>,
-vw: Option<u64>,
+v: Option<f64>,
```
Changed price and volume fields from u64 to f64. While this fixes the type mismatch issue (prices should be floating point), verify this doesn't introduce precision issues for volume which was originally u64. Volume typically represents share counts (whole numbers) and may have precision requirements that exceed f64's 53-bit mantissa for high-volume stocks.
```suggestion
v: Option<u64>,
```
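The precision concern is easy to demonstrate: IEEE 754 doubles (Rust's `f64`, Python's `float`) carry a 53-bit significand, so not every integer above 2^53 ≈ 9.0 × 10^15 is representable. Daily share volumes are orders of magnitude below that boundary in practice, but the check itself is short (shown in Python, whose `float` is the same 64-bit double):

```python
# 2**53 is the boundary: every integer up to it is exactly representable
# as a double, but 2**53 + 1 is not and rounds back down to 2**53.
exact = 2**53          # 9_007_199_254_740_992
above = 2**53 + 1      # first integer a double cannot hold

assert float(exact) == exact
assert int(float(above)) == exact   # precision silently lost
```

So the `u64` → `f64` change is lossless for realistic volumes, but aggregate fields (e.g. cumulative volume across many days) would be worth auditing against that bound.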
```rust
let start_date_int = start_timestamp.format("%Y%m%d").to_string().parse::<i32>().unwrap_or(0);
let end_date_int = end_timestamp.format("%Y%m%d").to_string().parse::<i32>().unwrap_or(99999999);
```
Using unwrap_or with fallback values (0 and 99999999) silently hides parse errors. If the timestamp formatting or parsing fails, the query will execute with incorrect date bounds without warning. Consider logging a warning or returning an error instead of using fallback values.
```suggestion
let start_date_str = start_timestamp.format("%Y%m%d").to_string();
let end_date_str = end_timestamp.format("%Y%m%d").to_string();
let start_date_int = start_date_str.parse::<i32>().map_err(|e| {
    warn!(
        "Failed to parse start date '{}' as integer for date filter: {}",
        start_date_str, e
    );
    Error::Other(format!(
        "Failed to parse start date '{}' as integer",
        start_date_str
    ))
})?;
let end_date_int = end_date_str.parse::<i32>().map_err(|e| {
    warn!(
        "Failed to parse end date '{}' as integer for date filter: {}",
        end_date_str, e
    );
    Error::Other(format!(
        "Failed to parse end date '{}' as integer",
        end_date_str
    ))
})?;
```