data prep #667

Merged
chrisaddy merged 5 commits into datamanager-fixes from data-prep
Jan 16, 2026

Collaborator

@chrisaddy chrisaddy commented Jan 14, 2026

This pull request enhances logging, error handling, and data normalization in the datamanager Rust application, especially around DataFrame creation and S3/HTTP interactions. It also updates the expected data types for equity bar fields and adds new environment variables for DuckDB library paths in the Dockerfile. The changes improve observability, debugging, and robustness of the application's data pipelines.

Logging and Error Handling Improvements:

  • Added detailed logging (using debug, info, and warn) throughout DataFrame creation functions in applications/datamanager/src/data.rs to trace row counts, column normalization, errors, and final DataFrame shapes. Error handling now logs warnings before returning errors.
  • Enhanced HTTP request/response logging and error reporting in applications/datamanager/src/equity_bars.rs, including API request details, response sizes, JSON parsing, and S3 upload status. Errors now use warn! instead of info! for failures.
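
The warn-before-return pattern in the two bullets above can be illustrated with a minimal Python sketch. The PR's actual changes are in Rust; the function name `build_frame`, the logger name, and the missing-ticker check are hypothetical and chosen only for illustration.

```python
import logging

logger = logging.getLogger("datamanager_sketch")  # hypothetical logger name


def build_frame(rows: list) -> list:
    """Hypothetical stand-in for a DataFrame constructor that logs as it validates."""
    logger.debug("creating frame from %d rows", len(rows))
    for index, row in enumerate(rows):
        if "ticker" not in row:
            # Log at warning level before surfacing the error to the caller.
            logger.warning("row %d is missing the ticker column", index)
            raise ValueError(f"row {index} is missing 'ticker'")
    logger.info("frame created with %d rows", len(rows))
    return rows
```

Logging the failure at warning level (rather than info) keeps it visible when the log level is raised in production, which is the motivation the description gives for switching from info! to warn!.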

Data Normalization and Type Updates:

  • Normalized string columns (such as ticker, side, action, sector, industry) to uppercase in DataFrames for consistency.
  • Changed several fields in the BarResult struct in applications/datamanager/src/equity_bars.rs from Option<u64> to Option<f64> to better reflect the expected data types from the API.
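
As a rough illustration of the uppercase normalization, here is a minimal Python sketch over plain row dictionaries. The PR itself does this on DataFrames in Rust; `STRING_COLUMNS` and `normalize_rows` are hypothetical names, and the column list is taken from the examples in the bullet above.

```python
# Columns named in the PR description as examples of normalized string fields.
STRING_COLUMNS = ("ticker", "side", "action", "sector", "industry")


def normalize_rows(rows):
    """Return copies of the rows with the known string columns uppercased."""
    normalized = []
    for row in rows:
        cleaned = dict(row)  # copy so the caller's rows are not mutated
        for column in STRING_COLUMNS:
            value = cleaned.get(column)
            if isinstance(value, str):
                cleaned[column] = value.upper()
        normalized.append(cleaned)
    return normalized
```

Non-string values and columns outside the list pass through unchanged, so numeric fields such as prices are not affected.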

Docker and Environment Configuration:

  • Added environment variables for DuckDB library and include paths in the applications/datamanager/Dockerfile to support database operations.

Local Settings Update:

  • Expanded the allowed tool list in .claude/settings.local.json to include additional AWS and Python commands, improving local development capabilities.

Copilot AI review requested due to automatic review settings January 14, 2026 05:18
Contributor

coderabbitai Bot commented Jan 14, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.




pulumi Bot commented Jan 14, 2026

🚀 The Update (preview) for forstmeier/pocketsizefund/production (at 071e83e) was successful.

✨ Neo Explanation

Initial production infrastructure deployment for PocketSizeFund, creating a complete AWS environment with VPC networking, ECS container orchestration, load balancing, and storage for a three-service portfolio management application.

Root Cause Analysis

This is an initial infrastructure deployment for the PocketSizeFund application on AWS. The developer is provisioning an entirely new production environment from scratch - there are no existing resources, and all 64 resources shown are being created for the first time. This represents the foundation of a microservices-based portfolio management system with three main services: a data manager, an equity price model, and a portfolio manager.

Dependency Chain

The deployment follows a standard AWS infrastructure pattern:

  1. Network Foundation: VPC → Subnets (public/private across 2 AZs) → Internet Gateway + NAT Gateway
  2. Security Layer: Security groups configured for the ALB, ECS tasks, and VPC endpoints
  3. Routing: Route tables and associations to direct traffic between public internet, private subnets, and NAT gateway
  4. Service Infrastructure:
    • ECR repositories for storing container images for all three services
    • ECS cluster to run the containerized applications
    • Application Load Balancer with listener rules routing traffic to specific services
    • CloudWatch log groups for application logging
  5. Data Storage: S3 buckets with versioning enabled for data storage and ML model artifacts
  6. IAM & Permissions: Execution and task roles with policies for ECS, SageMaker, S3, and secrets access
  7. Service Discovery: Private DNS namespace for inter-service communication
  8. Application Deployment: ECS task definitions and services for the three microservices

Risk analysis

Low Risk - This is a greenfield deployment with no existing infrastructure. There are no resources being replaced or deleted, eliminating concerns about data loss or service disruption.

Resource Changes

    Name                               Type                                                          Operation
+   pocketsizefund-production          pulumi:pulumi:Stack                                           create
+   execution_role                     aws:iam/role:Role                                             create
+   public_route_table                 aws:ec2/routeTable:RouteTable                                 create
+   private_route_table                aws:ec2/routeTable:RouteTable                                 create
+   service_discovery                  aws:servicediscovery/privateDnsNamespace:PrivateDnsNamespace  create
+   portfoliomanager_tg                aws:lb/targetGroup:TargetGroup                                create
+   execution_role_secrets_policy      aws:iam/rolePolicy:RolePolicy                                 create
+   public_subnet_2_rta                aws:ec2/routeTableAssociation:RouteTableAssociation           create
+   portfoliomanager_sd                aws:servicediscovery/service:Service                          create
+   datamanager_sd                     aws:servicediscovery/service:Service                          create
+   portfoliomanager_rule              aws:lb/listenerRule:ListenerRule                              create
+   portfoliomanager_service           aws:ecs/service:Service                                       create
+   datamanager_task                   aws:ecs/taskDefinition:TaskDefinition                         create
+   private_subnet_1_rta               aws:ec2/routeTableAssociation:RouteTableAssociation           create
+   ecs_egress                         aws:ec2/securityGroupRule:SecurityGroupRule                   create
+   portfoliomanager_task              aws:ecs/taskDefinition:TaskDefinition                         create
+   equitypricemodel_service           aws:ecs/service:Service                                       create
+   datamanager_logs                   aws:cloudwatch/logGroup:LogGroup                              create
+   execution_role_policy              aws:iam/rolePolicyAttachment:RolePolicyAttachment             create
+   model_artifacts_bucket_versioning  aws:s3/bucketVersioningV2:BucketVersioningV2                  create
+   datamanager_service                aws:ecs/service:Service                                       create
+   vpc_endpoints_sg                   aws:ec2/securityGroup:SecurityGroup                           create
+   private_subnet_2_rta               aws:ec2/routeTableAssociation:RouteTableAssociation           create
+   nat_elastic_ip                     aws:ec2/eip:Eip                                               create
+   data_bucket_versioning             aws:s3/bucketVersioningV2:BucketVersioningV2                  create
+   igw                                aws:ec2/internetGateway:InternetGateway                       create
+   datamanager_tg                     aws:lb/targetGroup:TargetGroup                                create
+   public_internet_route              aws:ec2/route:Route                                           create
+   ecs_from_alb                       aws:ec2/securityGroupRule:SecurityGroupRule                   create
+   http_listener                      aws:lb/listener:Listener                                      create
+   data_bucket                        aws:s3/bucketV2:BucketV2                                      create
+   model_artifacts_bucket             aws:s3/bucketV2:BucketV2                                      create
+   equitypricemodel_repository        aws:ecr/repository:Repository                                 create
+   task_role                          aws:iam/role:Role                                             create
+   ecs_cluster                        aws:ecs/cluster:Cluster                                       create
+   sagemaker_ecr_policy               aws:iam/rolePolicy:RolePolicy                                 create
+   ecr_api_endpoint                   aws:ec2/vpcEndpoint:VpcEndpoint                               create
+   sagemaker_execution_role           aws:iam/role:Role                                             create
+   private_subnet_2                   aws:ec2/subnet:Subnet                                         create
+   alb_sg                             aws:ec2/securityGroup:SecurityGroup                           create
+   alb                                aws:lb/loadBalancer:LoadBalancer                              create
... and 24 other changes

Contributor

greptile-apps Bot commented Jan 14, 2026

Confidence Score: 4/5

  • This PR is safe to merge with minor log message inconsistency
  • The changes are well-structured with proper logging, type safety improvements, and infrastructure modernization. The main issue is a minor log/sleep mismatch. The infrastructure changes are significant but follow best practices by creating managed resources. Type changes from u64 to f64 correctly align with API response types. New training data preparation tools follow Python type hints convention per CLAUDE.md.
  • Review tools/sync_equity_bars_data.py for the log message inconsistency at line 121

Contributor

@greptile-apps greptile-apps Bot left a comment


14 files reviewed, 1 comment


Comment thread tools/sync_equity_bars_data.py Outdated
@chrisaddy chrisaddy changed the base branch from master to datamanager-fixes January 14, 2026 05:54
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment thread tools/sync_equity_categories.py Outdated
ticker = ticker_data.get("ticker", "")
# Polygon uses 'sic_description' for industry, but we can also check other fields
# The primary_exchange and type fields help filter
if ticker_data.get("type") not in ("CS", "ADRC"): # Common Stock or ADR

Copilot AI Jan 14, 2026


Corrected spelling of 'ADRC' to 'ADR' in comment. The ticker type code is 'ADRC' but the comment should say 'ADR Common' or just keep 'ADR' as written.

Suggested change
- if ticker_data.get("type") not in ("CS", "ADRC"): # Common Stock or ADR
+ if ticker_data.get("type") not in ("CS", "ADRC"): # Common Stock or ADR Common

Comment thread tools/prepare_training_data.py Outdated
output_key: str,
) -> str:
"""Write consolidated training data to S3 as parquet."""
import io

Copilot AI Jan 14, 2026


The io module is imported inside the function write_training_data_to_s3 rather than at the module level. Move this import to the top of the file with other imports for consistency and better code organization.
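
A minimal sketch of the suggested layout, with `io` imported at module level. The function `serialize_rows` and its body are hypothetical stand-ins for illustration; they are not the PR's `write_training_data_to_s3` implementation.

```python
import io  # module-level import, alongside the file's other imports


def serialize_rows(rows):
    """Hypothetical stand-in: write text rows into an in-memory buffer and return the bytes."""
    buffer = io.BytesIO()
    for row in rows:
        buffer.write(row.encode("utf-8") + b"\n")
    return buffer.getvalue()
```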

Collaborator


Seconded.

Comment thread infrastructure/__main__.py
Comment thread applications/datamanager/src/storage.rs Outdated
Comment on lines +185 to +186
let start_date_int = start_timestamp.format("%Y%m%d").to_string().parse::<i32>().unwrap_or(0);
let end_date_int = end_timestamp.format("%Y%m%d").to_string().parse::<i32>().unwrap_or(99999999);

Copilot AI Jan 14, 2026


The magic number 99999999 is used as a fallback for end_date_int. Consider defining this as a named constant (e.g., MAX_DATE_INT) to improve code readability and maintainability.
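
A Python rendering of the suggestion, as a sketch only. The original code is Rust, and `MIN_DATE_INT` and `date_to_int` are illustrative names (only `MAX_DATE_INT` is proposed in the comment above). One assumption to note: the Rust fallback applies when parsing fails, whereas this sketch applies it when no date is supplied.

```python
from datetime import date
from typing import Optional

MIN_DATE_INT = 0           # lower-bound fallback, mirrors unwrap_or(0)
MAX_DATE_INT = 99_999_999  # upper-bound fallback, mirrors unwrap_or(99999999)


def date_to_int(value: Optional[date], fallback: int) -> int:
    """Format a date as a YYYYMMDD integer, or return the fallback when no date is given."""
    if value is None:
        return fallback
    return int(value.strftime("%Y%m%d"))
```

With named bounds, a call such as `date_to_int(end, MAX_DATE_INT)` states the intent (an open-ended upper limit) without the reader having to decode the literal.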

Collaborator

@forstmeier forstmeier left a comment


Bots left good cleanup comments. Overall looks good especially the categories stuff.


Comment thread tools/sync_equity_categories.py Outdated

logger = structlog.get_logger()

POLYGON_BASE_URL = "https://api.polygon.io"
Collaborator


This can probably be fetched from Secrets Manager, and it should use the Massive URL and variable name.

Collaborator Author


done

Comment thread tools/sync_equity_categories.py
Comment thread tools/sync_equity_categories.py
Comment thread maskfile.md
Comment on lines +543 to +544
export AWS_S3_DATA_BUCKET="$(pulumi stack output aws_s3_data_bucket)"
export AWS_S3_MODEL_ARTIFACTS_BUCKET="$(pulumi stack output aws_s3_model_artifacts_bucket)"
Collaborator


These expect the stack to be up rather than permanent buckets just FYI.

Collaborator Author


buckets being made permanent

@forstmeier
Collaborator

Oh, and one general thing (I can't remember if this is in the pull request changes or not) but any file/variable names that reference the model should be tide now not tft.

@chrisaddy chrisaddy merged commit 19e2bbf into datamanager-fixes Jan 16, 2026
2 of 3 checks passed
@chrisaddy chrisaddy deleted the data-prep branch January 16, 2026 18:56