
Update infrastructure resources#610

Merged
chrisaddy merged 1 commit into master from 07-23-update_infrastructure_resources on Jul 29, 2025

Conversation

@forstmeier
Collaborator

@forstmeier forstmeier commented Jul 24, 2025

Overview

Changes

  • update EKS configuration
  • add VPC resources
  • update ingress logic

Comments

This is a broad update to the infrastructure. There are still a number of tweaks to make (e.g., monitoring), and it needs to be manually tested.

Summary by CodeRabbit

  • New Features

    • Introduced a comprehensive Grafana dashboard for real-time monitoring of trading services and infrastructure.
    • Added a task to upload Grafana dashboards directly from the command line with enhanced AWS integration and error handling.
    • Implemented a fully AWS-native infrastructure with explicit VPC, subnets, NAT gateway, ALB, and API Gateway integration for improved reliability and scalability.
    • Added creation of Knative eventing Broker resource for enhanced event management.
  • Improvements

    • Unified and simplified service run and test tasks for easier local development and testing.
    • Enhanced environment variable management and AWS IAM role configuration for better security and flexibility.
    • Improved Knative service integration with AWS ALB and API Gateway for more robust service exposure and routing.
    • Added stricter resource dependency management and custom timeouts to improve deployment reliability.
    • Refactored Kubernetes cluster and IAM role creation for modularity and clearer dependency handling.
    • Updated Prometheus scraper configuration for more structured and reliable monitoring setup.
    • Parameterized Docker image builds with external DockerHub credentials for better security.
    • Refined environment variable naming and Docker Compose configuration for streamlined AWS S3 and IAM credential usage.
  • Bug Fixes

    • Updated environment variable names and Docker Compose configuration for better compatibility with AWS S3 and IAM credentials.
  • Removals

    • Removed legacy Kubernetes ingress, ALB controller, and API Gateway constructs in favor of AWS-native networking.
    • Deleted outdated health check and dashboard scripts, as well as unused environment variable and role management files.
  • Chores

    • Updated Python version and cleaned up dependencies for infrastructure provisioning tools.
    • Renamed common tags dictionary and introduced manual tags for clearer resource tagging.

Copilot AI review requested due to automatic review settings July 24, 2025 03:20
@coderabbitai
Contributor

coderabbitai Bot commented Jul 24, 2025

"""

Walkthrough

This update extensively refactors the infrastructure provisioning, focusing on AWS-native VPC, networking, ALB, and API Gateway integration, while removing Kubernetes-native ingress and simplifying environment variable and IAM management. Knative services are reworked for ALB integration, and a new, comprehensive Grafana dashboard is introduced. Numerous supporting scripts and modules are added, deleted, or refactored for improved modularity and explicit dependency handling.

Changes

File(s) Change Summary
.mise.toml Consolidated and simplified task definitions, switched from arg to option interpolation, added conditional logic to test tasks, and introduced a new Grafana dashboard upload task.
application/datamanager/compose.yaml Updated environment variables and removed Google Cloud credentials volume mount for datamanager service.
application/datamanager/src/datamanager/main.py Renamed environment variables for S3 and DuckDB IAM credentials.
infrastructure/__main__.py Major refactor: implements detailed AWS VPC, ALB, API Gateway, and Knative service setup; removes Kubernetes ingress and simplifies secrets and environment variable management.
infrastructure/api.py New: Defines API Gateway, VPC Link, integrations, and IAM role creation for Knative services and ALB.
infrastructure/cluster.py Refactored cluster creation to accept external VPC/subnets and modularized IAM role creation; improved dependency and config management.
infrastructure/environment_variables.py, infrastructure/roles.py, infrastructure/ping.nu, infrastructure/grafana_dashboard.json Deleted: Removed files for legacy environment variable, IAM role, and service health check management, and old Grafana dashboard.
infrastructure/grafana-dashboard.json New: Introduces a comprehensive, multi-panel Grafana dashboard for monitoring trading and infrastructure metrics.
infrastructure/images.py Updated image build function to accept DockerHub credentials as parameters.
infrastructure/ingress.py Replaced Kubernetes/Helm-based ingress and API Gateway logic with AWS-native ALB resource definitions.
infrastructure/keys.py Parameterized DuckDB IAM access key creation and updated tagging.
infrastructure/monitors.py Refactored Prometheus scraper to accept explicit parameters and structured config, with improved dependency management.
infrastructure/pyproject.toml Updated project version, Python version, and removed unused dependencies.
infrastructure/services.py Added utility for environment variables, new Knative broker creation, enhanced dependency and timeout management, and reworked Knative service for ALB integration.
infrastructure/tags.py Renamed common_tags to pulumi_tags and added manual_tags.
infrastructure/upload_grafana_dashboard.nu Refactored script for parameterization, AWS integration, error handling, and temporary API key management for dashboard upload.
infrastructure/vpc.py New: Provides modular functions for VPC, subnets, gateways, and route tables.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Pulumi
    participant AWS
    participant ALB
    participant API_Gateway
    participant Knative_Service

    User->>Pulumi: Run infrastructure deployment
    Pulumi->>AWS: Provision VPC, subnets, IGW, NAT, route tables
    Pulumi->>AWS: Create EKS cluster with IAM roles
    Pulumi->>AWS: Build and push Docker images (with DockerHub credentials)
    Pulumi->>AWS: Create ALB, security group, target group, listener
    Pulumi->>AWS: Deploy Knative services (with ALB target group integration)
    Pulumi->>AWS: Create API Gateway and VPC Link to ALB
    Pulumi->>AWS: Create IAM role for API Gateway access
    API_Gateway->>ALB: Proxy HTTP requests via VPC Link
    ALB->>Knative_Service: Forward traffic to Knative pods

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • #606: Updates DuckDB AWS IAM credentials and environment variable handling, directly related to the new parameterization and secret management.
  • #605: Reimplements infrastructure with AWS EKS and removes GCP services, matching the core architectural changes here.
  • #547: Updates Pulumi infrastructure for Docker image builds and DockerHub credential management, aligning with the new image build logic.

Suggested labels

feature

Suggested reviewers

  • chrisaddy

Poem

🐇
In the cloud, a VPC grows,
With subnets and gateways in neat little rows.
An ALB listens, API Gateway calls,
Knative pods answer, behind secure walls.
Dashboards now gleam, metrics in view—
Oh, what a burrow for our trading crew!

"""



Collaborator Author

This stack of pull requests is managed by Graphite. Learn more about stacking.

@forstmeier forstmeier requested a review from chrisaddy July 24, 2025 03:22
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements a comprehensive infrastructure overhaul with significant changes to AWS resource management, Kubernetes deployment configurations, and monitoring capabilities. The changes modernize the deployment architecture with improved load balancing, API gateway integration, and enhanced monitoring dashboards.

  • Major restructuring of infrastructure modules with consolidated networking, improved Kubernetes cluster management, and streamlined service deployments
  • Migration from direct ingress to API Gateway with Application Load Balancer integration for better traffic management
  • Enhanced monitoring with updated Grafana dashboard and improved Prometheus scraping configuration

Reviewed Changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
infrastructure/vpc.py New VPC management module with comprehensive networking functions
infrastructure/upload_grafana_dashboard.nu Complete rewrite with AWS integration and improved error handling
infrastructure/tags.py Renamed tags and added manual tagging support
infrastructure/services.py Enhanced Knative service configurations with ALB integration
infrastructure/roles.py File removed - functionality moved to cluster.py
infrastructure/pyproject.toml Updated version and simplified dependencies
infrastructure/ping.nu File removed
infrastructure/monitors.py Improved Prometheus configuration with structured YAML
infrastructure/keys.py Updated to use new environment variable naming
infrastructure/ingress.py Simplified to focus on ALB instead of complex ingress controllers
infrastructure/images.py Refactored to accept credentials as parameters
infrastructure/grafana-dashboard.json New comprehensive monitoring dashboard
infrastructure/environment_variables.py File removed - functionality moved to main.py
infrastructure/cluster.py Consolidated cluster and role management
infrastructure/api.py New API Gateway management module
infrastructure/main.py Complete restructure with new resource organization
application/datamanager/src/datamanager/main.py Updated environment variable references
application/datamanager/compose.yaml Simplified environment configuration
.mise.toml Updated task definitions and options

Comment thread infrastructure/services.py
Comment thread infrastructure/monitors.py
Comment thread infrastructure/services.py
Comment thread infrastructure/upload_grafana_dashboard.nu Outdated
Comment thread infrastructure/keys.py
Comment thread infrastructure/__main__.py Outdated
Comment thread infrastructure/upload_grafana_dashboard.nu Outdated
@forstmeier forstmeier added this to the Refactor milestone Jul 24, 2025
@graphite-app

graphite-app Bot commented Jul 24, 2025

Graphite Automations

"Assign author to pull request" took an action on this PR • (07/24/25)

1 assignee was added to this PR based on John Forstmeier's automation.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🔭 Outside diff range comments (1)
infrastructure/keys.py (1)

21-42: Fix Pulumi Output interpolation in JSON policy.

The f-string interpolation in the JSON policy (lines 34-35) uses data_bucket_name directly, but since it's a pulumi.Output[str], this won't work correctly. Pulumi Outputs need to be resolved before being used in string formatting.

Apply this fix:

-        policy=json.dumps(
-            {
-                "Version": "2012-10-17",
-                "Statement": [
-                    {
-                        "Effect": "Allow",
-                        "Action": [
-                            "s3:GetObject",
-                            "s3:ListBucket",
-                            "s3:PutObject",
-                            "s3:DeleteObject",
-                        ],
-                        "Resource": [
-                            f"arn:aws:s3:::{data_bucket_name}/*",
-                            f"arn:aws:s3:::{data_bucket_name}",
-                        ],
-                    }
-                ],
-            }
-        ),
+        policy=data_bucket_name.apply(
+            lambda bucket_name: json.dumps(
+                {
+                    "Version": "2012-10-17",
+                    "Statement": [
+                        {
+                            "Effect": "Allow",
+                            "Action": [
+                                "s3:GetObject",
+                                "s3:ListBucket",
+                                "s3:PutObject",
+                                "s3:DeleteObject",
+                            ],
+                            "Resource": [
+                                f"arn:aws:s3:::{bucket_name}/*",
+                                f"arn:aws:s3:::{bucket_name}",
+                            ],
+                        }
+                    ],
+                }
+            )
+        ),
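More generally, moving the policy document into a plain helper keeps the `apply` lambda trivial and makes the policy unit-testable without a Pulumi runtime. A sketch of that pattern (the helper name is hypothetical, not part of this PR):

```python
import json


def build_s3_policy(bucket_name: str) -> str:
    """Build the S3 access policy document for an already-resolved bucket name."""
    return json.dumps(
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        "s3:GetObject",
                        "s3:ListBucket",
                        "s3:PutObject",
                        "s3:DeleteObject",
                    ],
                    "Resource": [
                        f"arn:aws:s3:::{bucket_name}/*",
                        f"arn:aws:s3:::{bucket_name}",
                    ],
                }
            ],
        }
    )


# At the Pulumi call site, the fix above then collapses to:
#     policy=data_bucket_name.apply(build_s3_policy)
```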
🧹 Nitpick comments (10)
infrastructure/monitors.py (1)

29-31: Consider extracting service account file paths as constants.

The hardcoded service account paths are repeated multiple times. Consider defining them as module-level constants for better maintainability.

Add these constants at the module level:

SERVICE_ACCOUNT_CA_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
SERVICE_ACCOUNT_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"

Then use them in the configuration:

-                        "ca_file": "/var/run/secrets/kubernetes.io/serviceaccount/ca.crt",  # noqa: E501
+                        "ca_file": SERVICE_ACCOUNT_CA_PATH,

Also applies to: 54-56

infrastructure/upload_grafana_dashboard.nu (1)

86-88: Consider reducing API key lifetime for enhanced security.

While 1 hour (3600 seconds) is reasonable, consider reducing to the minimum needed time (e.g., 300 seconds) since the upload operation should complete quickly.

-            --seconds-to-live 3600 
+            --seconds-to-live 300 
infrastructure/grafana-dashboard.json (1)

492-493: Update placeholder timestamps to use dynamic values.

The created and updated timestamps appear to be hardcoded placeholder values. These should typically be set dynamically when the dashboard is created/updated.

Consider removing these fields or setting them to null/empty to let Grafana handle them automatically:

-    "created": "2025-01-19T10:00:00Z",
-    "updated": "2025-01-19T10:00:00Z",
+    "created": null,
+    "updated": null,
infrastructure/api.py (1)

40-52: Consider the implications of auto_deploy in production.

While auto_deploy=True simplifies deployments, you may want to consider setting it to False for production environments to have more control over when changes are deployed.

infrastructure/ingress.py (2)

76-76: Be aware of potential downtime with replace_on_changes.

The replace_on_changes=["*"] setting means any modification to the target group will trigger a recreation, which could cause brief service interruption. Consider if this is necessary for all properties or if it can be limited to specific fields.


93-94: Consider adding HTTPS listener for production.

The listener is configured for HTTP only on port 80. For production environments, consider adding an HTTPS listener on port 443 with appropriate SSL/TLS certificates.
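A hedged sketch of what such a listener could look like, assuming an ACM certificate already exists (the certificate ARN, SSL policy choice, and variable names here are illustrative, not from this PR):

```python
import pulumi_aws as aws

# Hypothetical HTTPS listener reusing the PR's ALB and target group.
https_listener = aws.lb.Listener(
    "application-load-balancer-https-listener",
    load_balancer_arn=application_load_balancer.arn,
    port=443,
    protocol="HTTPS",
    ssl_policy="ELBSecurityPolicy-TLS13-1-2-2021-06",
    certificate_arn="arn:aws:acm:us-east-1:123456789012:certificate/example",  # placeholder
    default_actions=[
        aws.lb.ListenerDefaultActionArgs(
            type="forward",
            target_group_arn=target_group.arn,
        )
    ],
)
```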

infrastructure/services.py (1)

157-158: Consider parameterizing the broker namespace.

The broker is hardcoded to use the "default" namespace. Consider making this configurable if you need brokers in different namespaces.

infrastructure/__main__.py (3)

83-86: Consider the availability trade-off with single NAT gateway

Using a single NAT gateway reduces costs but creates a single point of failure. If high availability is required, consider creating one NAT gateway per availability zone.


208-267: Consider service-specific environment variables

All three services (datamanager, predictionengine, positionmanager) receive the same environment variables. Some services might not need all variables (e.g., positionmanager might not need POLYGON_API_KEY).

Consider creating service-specific environment variable sets to follow the principle of least privilege and reduce the attack surface.
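One lightweight way to do that is a per-service allowlist that filters the shared variable set before it reaches each Knative service. The service names below match the PR, but the allowlist contents are illustrative assumptions:

```python
# Hypothetical allowlists: which environment variables each service may see.
SERVICE_ENV_VARS = {
    "datamanager": {"AWS_S3_DATA_BUCKET_NAME", "POLYGON_API_KEY"},
    "predictionengine": {"AWS_S3_DATA_BUCKET_NAME", "POLYGON_API_KEY"},
    "positionmanager": {"AWS_S3_DATA_BUCKET_NAME"},
}


def environment_for(service: str, shared_vars: dict) -> dict:
    """Filter the shared environment down to what one service actually needs."""
    allowed = SERVICE_ENV_VARS[service]
    return {name: value for name, value in shared_vars.items() if name in allowed}
```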


304-340: Simplify lambda functions in exports

The lambda functions lambda ref: f"{ref}" are unnecessary since they just convert to string without modification.

Apply this pattern to simplify:

-    "DATAMANAGER_SERVICE_IMAGE", datamanager_image.ref.apply(lambda ref: f"{ref}")
+    "DATAMANAGER_SERVICE_IMAGE", datamanager_image.ref

This applies to all image exports and other exports where the lambda just formats without modification.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 78f1d77 and 31fbc44.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .mise.toml (2 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • infrastructure/__main__.py (2 hunks)
  • infrastructure/api.py (1 hunks)
  • infrastructure/cluster.py (2 hunks)
  • infrastructure/environment_variables.py (0 hunks)
  • infrastructure/grafana-dashboard.json (1 hunks)
  • infrastructure/grafana_dashboard.json (0 hunks)
  • infrastructure/images.py (2 hunks)
  • infrastructure/ingress.py (1 hunks)
  • infrastructure/keys.py (2 hunks)
  • infrastructure/monitors.py (1 hunks)
  • infrastructure/ping.nu (0 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/roles.py (0 hunks)
  • infrastructure/services.py (14 hunks)
  • infrastructure/tags.py (1 hunks)
  • infrastructure/upload_grafana_dashboard.nu (1 hunks)
  • infrastructure/vpc.py (1 hunks)
💤 Files with no reviewable changes (4)
  • infrastructure/environment_variables.py
  • infrastructure/ping.nu
  • infrastructure/roles.py
  • infrastructure/grafana_dashboard.json
🧰 Additional context used
🧠 Learnings (1)
.mise.toml (1)

Learnt from: CR
PR: pocketsizefund/pocketsizefund#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-13T00:52:51.438Z
Learning: Applies to .mise.toml : Always use mise tasks for running tests, linting, formatting. If there is not a command already present in .mise.toml, suggest creating one.

🧬 Code Graph Analysis (3)
infrastructure/keys.py (1)
infrastructure/environment_variables.py (1)
  • create_environment_variables (17-32)
application/datamanager/src/datamanager/main.py (1)
application/datamanager/src/datamanager/clients.py (1)
  • S3Client (21-45)
infrastructure/cluster.py (1)
infrastructure/roles.py (2)
  • create_cluster_role (5-33)
  • create_node_role (36-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (25)
infrastructure/tags.py (1)

3-13: Excellent tagging strategy improvement.

The separation of pulumi_tags and manual_tags provides clear resource ownership tracking and consistent metadata across different management approaches. The dynamic stack value in pulumi_tags enables environment-specific identification while maintaining consistency.

application/datamanager/compose.yaml (1)

10-12: Verify DuckDB IAM credentials are provided at runtime.

The compose file now only includes AWS_S3_DATA_BUCKET_NAME, but application/datamanager/src/datamanager/main.py (lines 61-62) still expects AWS_IAM_DUCKDB_USER_ACCESS_KEY_ID and AWS_IAM_DUCKDB_USER_ACCESS_KEY_SECRET environment variables. Ensure these credentials are provided through the deployment environment or add them to the compose file for local development.
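For local development, one option is to reintroduce the two credential variables in the compose file and source their values from an untracked `.env` file; Docker Compose fills list-form environment entries from the host environment (or `.env`) when no explicit value is given. A sketch, assuming the variable names from main.py:

```yaml
services:
  datamanager:
    environment:
      - AWS_S3_DATA_BUCKET_NAME
      - AWS_IAM_DUCKDB_USER_ACCESS_KEY_ID      # value from a local .env, never committed
      - AWS_IAM_DUCKDB_USER_ACCESS_KEY_SECRET  # value from a local .env, never committed
```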

application/datamanager/src/datamanager/main.py (1)

57-62: Environment variable updates look good and consistent.

The renaming from generic names to AWS-prefixed environment variables (DATA_BUCKET_NAME → AWS_S3_DATA_BUCKET_NAME, DUCKDB_USER_ACCESS_KEY_ID → AWS_IAM_DUCKDB_USER_ACCESS_KEY_ID, etc.) provides better clarity about the credential source and aligns well with the AWS-centric infrastructure refactoring.

infrastructure/keys.py (1)

8-10: Good parameterization approach.

Making the function accept data_bucket_name as a parameter instead of relying on imports improves modularity and makes the infrastructure more flexible.

infrastructure/images.py (1)

10-37: Well-structured refactoring for explicit secret management.

The changes improve modularity by accepting Docker Hub credentials as parameters rather than retrieving them internally. The use of pulumi.Output[str] for secret parameters and direct return of the Image object are appropriate design choices.

infrastructure/monitors.py (1)

7-151: Excellent refactoring to structured configuration.

The migration from YAML string to Python dictionary with json_dumps significantly improves maintainability and type safety. The explicit parameter passing and proper dependency management align well with infrastructure best practices.

infrastructure/vpc.py (1)

1-121: Well-structured VPC module with proper resource dependencies.

The module provides a clean abstraction for VPC infrastructure components with consistent tagging and explicit dependency management. The separation of concerns and modular design facilitate infrastructure as code best practices.

infrastructure/upload_grafana_dashboard.nu (1)

1-179: Well-crafted Nu script with comprehensive error handling and AWS integration.

The script demonstrates excellent practices including parameterization, dry-run mode, temporary credential management with cleanup, and proper error handling throughout. The AWS integration is well-implemented with appropriate security considerations.

infrastructure/grafana-dashboard.json (1)

1-505: Comprehensive monitoring dashboard with well-structured panels and queries.

The dashboard provides excellent coverage of infrastructure and application metrics with appropriate visualizations, time ranges, and refresh intervals. The Prometheus queries are well-constructed and the panel organization facilitates effective monitoring.

.mise.toml (4)

43-48: LGTM! Unified task definition improves maintainability.

The consolidation of separate production/development tasks into a single application:service:run task with the switch to option parameter handling simplifies the configuration and provides better flexibility.


60-66: Good addition of cleanup flag for flexible test execution.

The conditional logic based on the cleanup flag provides flexibility to either teardown containers or run tests, which is useful for different testing scenarios.


100-101: Consistent parameter handling update.

The switch to option(user-name="user-name") maintains consistency with other task definitions in the file.


103-108: Welcome addition for Grafana dashboard management.

The new dashboard:upload task provides a convenient way to upload the Grafana dashboard configuration, aligning with the infrastructure updates in this PR.

infrastructure/api.py (2)

8-24: Well-structured VPC Link creation.

The function properly creates a VPC Link with correct security group and subnet configurations, including appropriate dependency declarations.


104-161: Secure IAM role implementation with least privilege.

The IAM role properly restricts assume role permissions to the specified user ARN and scopes API invocation permissions to specific endpoints, following security best practices.

infrastructure/ingress.py (2)

12-18: Confirm open HTTP access is intentional.

The security group allows HTTP traffic from anywhere (0.0.0.0/0). Please confirm this open access is intentional for your use case. Consider restricting the CIDR blocks if the ALB should only be accessible from specific IP ranges.


35-52: Correct ALB configuration for internet-facing setup.

The ALB is properly configured as internet-facing with public subnets and appropriate security group attachment.

infrastructure/cluster.py (4)

10-35: Improved network isolation with explicit VPC and private subnets.

The updated function signature properly accepts VPC and private subnet parameters, improving network isolation and security for the EKS cluster.


38-51: Good improvements to provider configuration.

The addition of replace_on_changes for kubeconfig and proper JSON serialization improves resource lifecycle management.


61-94: Cleaner implementation with JSON serialization.

The use of pulumi.Output.json_dumps for building mapRoles and mapUsers is much cleaner and less error-prone than manual string concatenation.


121-188: Well-structured IAM role creation functions.

Both IAM role creation functions properly configure assume role policies and attach the necessary AWS managed policies for EKS cluster and node operations. The use of json.dumps for policy serialization is consistent and clean.

infrastructure/services.py (3)

9-12: Clean utility for environment variable management.

This utility function provides a nice abstraction for converting environment variable tuples to Pulumi Output dictionaries.


66-77: Verify Knative network configuration requirements.

The ConfigMap sets specific network configurations including http-protocol: Redirected. Please verify these settings align with your requirements, especially the HTTP redirect behavior.


290-295: Improved dependency management and timeouts.

The explicit dependency on knative_eventing_core and addition of custom timeouts improve resource creation reliability and lifecycle management.

infrastructure/__main__.py (1)

278-300: Verify cron schedules align with market hours

The cron schedules appear to use UTC time:

  • Monday 10:00 UTC (5-6 AM ET depending on DST) - before market open
  • Friday 13:00 UTC (8-9 AM ET depending on DST) - before the 9:30 AM ET regular market open, though within pre-market hours

Ensure these times align with your trading strategy and market hours (US markets open 9:30 AM ET).
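Because the UTC-to-ET offset shifts with daylight saving time, the same UTC cron hour lands at different local times in summer and winter; a quick standard-library sanity check:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")


def utc_hour_in_eastern(year: int, month: int, day: int, utc_hour: int) -> int:
    """Return the US Eastern local hour for a given UTC hour on a given date."""
    moment = datetime(year, month, day, utc_hour, tzinfo=timezone.utc)
    return moment.astimezone(EASTERN).hour


# Friday 13:00 UTC is 9 AM ET under EDT (summer) and 8 AM ET under EST
# (winter) - both before the 9:30 AM ET regular session open.
```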

Comment thread infrastructure/__main__.py
Comment thread infrastructure/__main__.py
Comment thread infrastructure/__main__.py
Comment thread infrastructure/api.py Outdated
Comment thread infrastructure/pyproject.toml
Comment thread infrastructure/services.py
Comment thread infrastructure/upload_grafana_dashboard.nu Outdated
Comment thread infrastructure/vpc.py
@forstmeier forstmeier force-pushed the 07-23-update_infrastructure_resources branch from 31fbc44 to 23b6911 on July 25, 2025 01:33
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (8)
infrastructure/monitors.py (1)

104-104: Regex pattern correctly uses raw string.

The implementation already addresses the previous review suggestion by using a raw string with the r prefix.

infrastructure/ingress.py (1)

12-24: HTTPS ingress properly configured.

The security group correctly includes ingress rules for both HTTP (port 80) and HTTPS (port 443), addressing the previous review concern.

infrastructure/__main__.py (2)

60-70: Region configuration properly synchronized.

The implementation correctly uses a single aws_region variable for both availability zone queries and environment configuration, addressing the previous review concern about region mismatch.


155-156: Use IAM roles for service accounts instead of access keys.

Passing AWS access keys as environment variables is less secure than using IAM roles for service accounts (IRSA). This enables better security and automatic credential rotation.

infrastructure/services.py (4)

103-106: Remove redundant kubernetes_provider dependency.

Same issue as in the serving core function - the kubernetes_provider dependency is redundant when already specified as the provider.


120-124: Consider increasing timeout values for eventing components.

Same timeout concern applies to eventing core components as they also require time to initialize properly.


294-294: Remove redundant kubernetes_provider dependency.

Same redundant dependency issue as in other functions.


338-338: Remove redundant kubernetes_provider dependency.

Same redundant dependency issue as in other functions throughout the file.

🧹 Nitpick comments (9)
infrastructure/upload_grafana_dashboard.nu (1)

165-184: Consider clarifying the jq requirement.

The prerequisites check is well-implemented. However, since the script uses Nu's built-in JSON parsing (from json), the jq requirement might be confusing to users.

Consider updating the jq check message to be more specific about when it might be needed:

-        print "jq not found (optional but recommended for JSON debugging)"
+        print "jq not found (optional - only needed for manual JSON debugging outside this script)"
infrastructure/vpc.py (1)

3-3: Consider using absolute imports for better maintainability.

Using relative imports can cause issues when the module structure changes or when importing from different contexts. Consider using absolute imports instead.

-from tags import pulumi_tags
+from infrastructure.tags import pulumi_tags
infrastructure/ingress.py (2)

82-82: Reconsider the aggressive replacement policy for target group.

Using replace_on_changes=["*"] forces replacement of the target group on any change, which could cause service disruptions. Most target group properties can be updated in-place.

Consider removing this option or limiting it to specific properties that truly require replacement:

         opts=pulumi.ResourceOptions(
-            replace_on_changes=["*"],
             depends_on=[
                 virtual_private_cloud,
                 application_load_balancer,
             ],
         ),

92-114: Consider adding HTTPS listener for secure traffic.

The security group allows HTTPS traffic on port 443, but there's no corresponding HTTPS listener configured. This means HTTPS traffic will be dropped at the ALB.

For production use, consider adding an HTTPS listener with an ACM certificate:

  1. Create or import an SSL certificate in AWS Certificate Manager
  2. Add an HTTPS listener on port 443 with the certificate
  3. Optionally redirect HTTP to HTTPS for better security
infrastructure/cluster.py (1)

128-195: Well-structured IAM role creation functions.

The IAM role creation functions are properly implemented with all necessary policies. The migration from roles.py maintains the same functionality while improving modularity.

Consider adding docstrings to document the purpose and required policies for each role:

 def create_kubernetes_cluster_role() -> aws.iam.Role:
+    """Create IAM role for EKS cluster with required AWS managed policies.
+    
+    Returns:
+        aws.iam.Role: IAM role with AmazonEKSClusterPolicy attached
+    """
     assume_role_policy = {
infrastructure/__main__.py (2)

85-88: Single NAT gateway configuration noted.

Using a single NAT gateway for cost efficiency is reasonable, but be aware this creates a single point of failure. For production workloads requiring high availability, consider one NAT gateway per availability zone.


318-340: Remove redundant lambda functions in exports.

The lambda functions used for string formatting in these exports are unnecessary since the values are already strings.

 pulumi.export(
     "AWS_EKS_CLUSTER_NAME",
-    kubernetes_cluster.eks_cluster.name.apply(lambda cluster_name: f"{cluster_name}"),
+    kubernetes_cluster.eks_cluster.name,
 )

 pulumi.export(
     "AWS_VIRTUAL_PRIVATE_CLOUD_ID",
-    virtual_private_cloud.id.apply(lambda vpc_id: f"{vpc_id}"),
+    virtual_private_cloud.id,
 )

 pulumi.export(
     "AWS_API_GATEWAY_ACCESS_IAM_ROLE_ARN",
-    api_access_iam_role.arn.apply(lambda arn: f"{arn}"),
+    api_access_iam_role.arn,
 )

 pulumi.export(
     "AWS_API_GATEWAY_ENDPOINT_URL",
     api_gateway.api_endpoint.apply(lambda endpoint: f"{endpoint}/production"),
 )

Note: keep this last lambda as-is; unlike the others, it performs real work by appending "/production" to the endpoint.

infrastructure/services.py (2)

38-42: Consider increasing timeout values for Kubernetes core components.

The 2-minute timeout might be insufficient for Knative serving core components, which can take longer to initialize, especially in resource-constrained environments.

Consider increasing to at least 5-10 minutes:

            custom_timeouts=pulumi.CustomTimeouts(
-                create="2m",
-                update="2m",
-                delete="2m",
+                create="10m",
+                update="10m",
+                delete="10m",
            ),

66-66: Address the NOTE comment about ConfigMap necessity.

The NOTE comment indicates uncertainty about whether the ConfigMap or its configurations are necessary. This should be resolved before deployment.

Would you like me to help research the necessity of this Knative network configuration or create an issue to track this investigation?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 31fbc44 and 23b6911.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .mise.toml (2 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • infrastructure/__main__.py (2 hunks)
  • infrastructure/api.py (1 hunks)
  • infrastructure/cluster.py (2 hunks)
  • infrastructure/environment_variables.py (0 hunks)
  • infrastructure/grafana-dashboard.json (1 hunks)
  • infrastructure/grafana_dashboard.json (0 hunks)
  • infrastructure/images.py (2 hunks)
  • infrastructure/ingress.py (1 hunks)
  • infrastructure/keys.py (2 hunks)
  • infrastructure/monitors.py (1 hunks)
  • infrastructure/ping.nu (0 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/roles.py (0 hunks)
  • infrastructure/services.py (14 hunks)
  • infrastructure/tags.py (1 hunks)
  • infrastructure/upload_grafana_dashboard.nu (1 hunks)
  • infrastructure/vpc.py (1 hunks)
💤 Files with no reviewable changes (4)
  • infrastructure/roles.py
  • infrastructure/ping.nu
  • infrastructure/environment_variables.py
  • infrastructure/grafana_dashboard.json
✅ Files skipped from review due to trivial changes (1)
  • application/datamanager/compose.yaml
🚧 Files skipped from review as they are similar to previous changes (8)
  • application/datamanager/src/datamanager/main.py
  • infrastructure/tags.py
  • infrastructure/images.py
  • infrastructure/pyproject.toml
  • infrastructure/keys.py
  • infrastructure/grafana-dashboard.json
  • .mise.toml
  • infrastructure/api.py
🧰 Additional context used
🧠 Learnings (1)
infrastructure/__main__.py (1)

Learnt from: chrisaddy
PR: #535
File: application/datamanager/src/datamanager/main.py:147-147
Timestamp: 2025-05-22T02:26:11.306Z
Learning: In the datamanager service, debug-style f-string syntax (e.g., {year=}) is intentionally used for constructing GCS path prefixes to match Hive-style partitioned directory structures for data storage and retrieval.

🧬 Code Graph Analysis (2)
infrastructure/cluster.py (1)
infrastructure/roles.py (2)
  • create_cluster_role (5-33)
  • create_node_role (36-74)
infrastructure/services.py (2)
infrastructure/publishers_subscribers.py (6)
  • create_knative_service (99-149)
  • create_knative_eventing_core (62-96)
  • create_knative_serving_core (5-59)
  • create_knative_schedule (219-255)
  • create_knative_trigger (174-216)
  • create_knative_broker (152-171)
infrastructure/environment_variables.py (1)
  • create_environment_variables (17-32)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (15)
infrastructure/upload_grafana_dashboard.nu (8)

3-19: LGTM! Well-structured function signature and validation.

The parameterized approach with sensible defaults and early file validation is excellent. The error handling provides clear feedback to users.


20-46: Excellent improvement! Past review feedback has been addressed.

The workspace discovery logic now correctly:

  • Uses --output json format (fixing the previous text/JSON parsing issue)
  • Includes proper error handling for empty workspace results
  • Provides clear error messages for troubleshooting

48-64: Good implementation of workspace endpoint retrieval.

The error handling and JSON parsing are consistent with the rest of the script. The user feedback helps with debugging.


66-85: Well-implemented dashboard loading and dry-run functionality.

The JSON validation, payload structure with timestamp, and dry-run mode are all excellent features that enhance usability and reliability.


87-104: Good implementation addressing previous feedback.

The API key creation now uses a deterministic naming pattern with timestamp, which addresses the previous concern about key accumulation. The 1-hour lifetime provides a reasonable balance between functionality and security.
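The naming pattern being approved can be sketched as follows (a hypothetical helper mirroring the scheme; the exact prefix in the Nu script may differ):

```python
from datetime import datetime, timezone

def api_key_name(workspace_id: str) -> str:
    # Deterministic per-run name: the timestamp makes each key unique, while
    # the fixed prefix makes stale keys easy to identify and clean up.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"dashboard-upload-{workspace_id}-{timestamp}"
```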


106-129: Robust upload implementation with proper cleanup.

The HTTP request structure is correct, and the error handling includes cleanup attempts for the temporary API key, which prevents resource leaks.


131-147: Excellent response handling with defensive programming.

The optional field access with defaults and comprehensive success reporting provide a robust user experience.


148-163: Proper cleanup and user-friendly completion.

The final cleanup handles failures gracefully without affecting the overall success, and the completion message provides actionable information to users.

infrastructure/vpc.py (1)

90-96: Good implementation of subnet validation!

The subnet number validation correctly prevents CIDR conflicts by ensuring the value is within the valid range (0-255), addressing the concern from previous reviews.
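The range check can be sketched in plain Python — `subnet_cidr` and the /16-to-/24 split are hypothetical stand-ins for the module's actual helper:

```python
def subnet_cidr(vpc_prefix: str, subnet_number: int) -> str:
    # For a 10.0.0.0/16 VPC carved into /24 subnets, the subnet number
    # becomes the third octet, so it must fit in a single byte.
    if not 0 <= subnet_number <= 255:
        raise ValueError(f"subnet number {subnet_number} is outside 0-255")
    return f"{vpc_prefix}.{subnet_number}.0/24"

assert subnet_cidr("10.0", 3) == "10.0.3.0/24"
```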

infrastructure/services.py (6)

9-12: LGTM! Clean utility function implementation.

This new utility function properly handles Pulumi Output conversion and provides a clean interface for environment variable processing. The implementation correctly uses pulumi.Output.all() and type hints are appropriate.


177-184: Excellent refactoring that addresses previous concerns.

The function signature has been improved by reducing from 7 to 5 parameters and now properly integrates with the ALB infrastructure. The use of docker_build.Image object provides better type safety.


206-208: ALB integration annotations look correct.

The ALB-specific annotations are properly configured for internet-facing load balancer with IP targeting. The target group ARN reference ensures proper integration with the AWS infrastructure.


220-226: Resource limits and ports properly configured.

Good to see resource limits have been retained (addressing the previous review comment) and container ports are explicitly defined for ALB integration.


259-259: Improved function signature with explicit dependencies.

The change from generic depends_on parameter to specific knative_eventing_core parameter makes dependencies more explicit and type-safe.


309-309: Consistent improvement in function signature.

Same positive change as in the trigger function - explicit dependency parameter improves type safety and clarity.

@forstmeier forstmeier force-pushed the 07-23-update_infrastructure_resources branch from 23b6911 to d91e994 on July 25, 2025 01:45
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (11)
infrastructure/upload_grafana_dashboard.nu (1)

29-33: Past review comment has been addressed.

The empty workspace handling issue raised in previous reviews has been properly addressed with the addition of the empty result check and graceful error handling.

infrastructure/keys.py (1)

21-40: String interpolation in policy still needs improvement.

The IAM policy still uses f-string interpolation with the data_bucket_name parameter, which could be error-prone as mentioned in previous reviews. The data_bucket_name is a pulumi.Output[str], so direct string interpolation won't work correctly.

Apply this fix to properly handle the Pulumi Output:

-        policy=json.dumps(
-            {
-                "Version": "2012-10-17",
-                "Statement": [
-                    {
-                        "Effect": "Allow",
-                        "Action": [
-                            "s3:GetObject",
-                            "s3:ListBucket",
-                            "s3:PutObject",
-                            "s3:DeleteObject",
-                        ],
-                        "Resource": [
-                            f"arn:aws:s3:::{data_bucket_name}/*",
-                            f"arn:aws:s3:::{data_bucket_name}",
-                        ],
-                    }
-                ],
-            }
-        ),
+        policy=data_bucket_name.apply(lambda bucket_name: json.dumps(
+            {
+                "Version": "2012-10-17",
+                "Statement": [
+                    {
+                        "Effect": "Allow",
+                        "Action": [
+                            "s3:GetObject",
+                            "s3:ListBucket",
+                            "s3:PutObject",
+                            "s3:DeleteObject",
+                        ],
+                        "Resource": [
+                            f"arn:aws:s3:::{bucket_name}/*",
+                            f"arn:aws:s3:::{bucket_name}",
+                        ],
+                    }
+                ],
+            }
+        )),
infrastructure/api.py (1)

65-66: Path handling improvement has been implemented.

The previous concern about handling complex paths has been addressed: line 66 now filters out empty segments, improving robustness for paths with multiple consecutive slashes.
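The empty-segment filtering described here amounts to the following idiom (`split_path` is an illustrative name, not the function in `api.py`):

```python
def split_path(path: str) -> list[str]:
    # Leading, trailing, or doubled slashes all produce empty strings
    # from split("/"); filtering falsy segments drops them.
    return [segment for segment in path.split("/") if segment]

assert split_path("/health//check/") == ["health", "check"]
assert split_path("simple") == ["simple"]
```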

infrastructure/__main__.py (2)

155-156: Consider using IAM roles instead of access keys for service authentication

Passing AWS access keys as environment variables is less secure than using IAM roles for service accounts (IRSA). Consider implementing IRSA for better security and credential rotation.


161-165: Verify ALB security group rules include HTTPS (443) ingress

The ALB security group is created but the security rules are not visible here. Based on previous analysis, ensure it allows HTTPS traffic on port 443 in addition to HTTP on port 80.

infrastructure/services.py (6)

21-25: Remove redundant kubernetes_provider dependency

The kubernetes_provider is already specified in the provider parameter of ResourceOptions, making it redundant in the depends_on list.

        opts=pulumi.ResourceOptions(
            provider=kubernetes_provider,
-            depends_on=[kubernetes_provider],
+            depends_on=[],
        ),

Apply similar changes to lines 34-37, 54-57, and 81-85.

Also applies to: 34-37, 54-57, 81-85


103-106: Remove redundant kubernetes_provider dependency

Same issue as in create_knative_serving_core - the provider is redundant in depends_on.

Remove kubernetes_provider from the depends_on lists while keeping other dependencies.

Also applies to: 116-119, 135-139


167-167: Remove redundant kubernetes_provider from depends_on

The kubernetes_provider is redundant in the depends_on list as noted in previous reviews.

-            depends_on=[kubernetes_provider, knative_eventing_core],
+            depends_on=[knative_eventing_core],

177-183: Consider reducing function parameters for better maintainability

This function has too many parameters (6). Consider grouping related parameters into a configuration object or dataclass to improve maintainability and readability.


240-240: Remove redundant kubernetes_provider from depends_on

The kubernetes_provider is redundant in the depends_on list.

            depends_on=[
-                kubernetes_provider,
                image,
                application_load_balancer_service_target_group,
                knative_serving_core,
            ],

294-294: Remove redundant kubernetes_provider from depends_on lists

Both create_knative_trigger and create_knative_schedule have the same redundancy issue.

# Line 294
-            depends_on=[kubernetes_provider, knative_eventing_core],
+            depends_on=[knative_eventing_core],

# Line 338
-            depends_on=[kubernetes_provider, knative_eventing_core],
+            depends_on=[knative_eventing_core],

Also applies to: 338-338

🧹 Nitpick comments (3)
infrastructure/upload_grafana_dashboard.nu (1)

74-78: Consider validating dashboard structure.

The script loads the dashboard JSON but doesn't validate that it contains the expected dashboard key structure. This could lead to runtime errors during upload.

let upload_payload = {
+    dashboard: ($dashboard_content | get dashboard? | default {})
-    dashboard: $dashboard_content.dashboard
    overwrite: true
    message: $"Uploaded via Nu script at (date now | format date '%Y-%m-%d %H:%M:%S')"
}
infrastructure/__main__.py (1)

319-319: Remove redundant lambda functions in exports

The lambda functions that simply format values as strings are unnecessary. Pulumi outputs can be exported directly.

-    kubernetes_cluster.eks_cluster.name.apply(lambda cluster_name: f"{cluster_name}"),
+    kubernetes_cluster.eks_cluster.name,
-    virtual_private_cloud.id.apply(lambda vpc_id: f"{vpc_id}"),
+    virtual_private_cloud.id,
-    api_access_iam_role.arn.apply(lambda arn: f"{arn}"),
+    api_access_iam_role.arn,

Also applies to: 329-329, 334-334

infrastructure/services.py (1)

66-66: Address the TODO comment about ConfigMap necessity

The NOTE comment indicates uncertainty about whether the Knative network configuration is necessary. This should be verified and documented properly.

Would you like me to research the necessity of these specific Knative network configurations or create an issue to track this verification task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23b6911 and d91e994.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .mise.toml (2 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • infrastructure/__main__.py (2 hunks)
  • infrastructure/api.py (1 hunks)
  • infrastructure/cluster.py (2 hunks)
  • infrastructure/environment_variables.py (0 hunks)
  • infrastructure/grafana-dashboard.json (1 hunks)
  • infrastructure/grafana_dashboard.json (0 hunks)
  • infrastructure/images.py (2 hunks)
  • infrastructure/ingress.py (1 hunks)
  • infrastructure/keys.py (2 hunks)
  • infrastructure/monitors.py (1 hunks)
  • infrastructure/ping.nu (0 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/roles.py (0 hunks)
  • infrastructure/services.py (14 hunks)
  • infrastructure/tags.py (1 hunks)
  • infrastructure/upload_grafana_dashboard.nu (1 hunks)
  • infrastructure/vpc.py (1 hunks)
💤 Files with no reviewable changes (4)
  • infrastructure/environment_variables.py
  • infrastructure/roles.py
  • infrastructure/grafana_dashboard.json
  • infrastructure/ping.nu
🚧 Files skipped from review as they are similar to previous changes (9)
  • infrastructure/tags.py
  • application/datamanager/compose.yaml
  • infrastructure/pyproject.toml
  • application/datamanager/src/datamanager/main.py
  • infrastructure/images.py
  • infrastructure/monitors.py
  • .mise.toml
  • infrastructure/grafana-dashboard.json
  • infrastructure/vpc.py
🧰 Additional context used
🧠 Learnings (1)
infrastructure/__main__.py (1)

Learnt from: chrisaddy
PR: #535
File: application/datamanager/src/datamanager/main.py:147-147
Timestamp: 2025-05-22T02:26:11.306Z
Learning: In the datamanager service, debug-style f-string syntax (e.g., {year=}) is intentionally used for constructing GCS path prefixes to match Hive-style partitioned directory structures for data storage and retrieval.

🧬 Code Graph Analysis (3)
infrastructure/keys.py (1)
infrastructure/environment_variables.py (1)
  • create_environment_variables (17-32)
infrastructure/services.py (2)
infrastructure/publishers_subscribers.py (6)
  • create_knative_service (99-149)
  • create_knative_eventing_core (62-96)
  • create_knative_serving_core (5-59)
  • create_knative_schedule (219-255)
  • create_knative_trigger (174-216)
  • create_knative_broker (152-171)
infrastructure/environment_variables.py (1)
  • create_environment_variables (17-32)
infrastructure/cluster.py (1)
infrastructure/roles.py (2)
  • create_cluster_role (5-33)
  • create_node_role (36-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (12)
infrastructure/upload_grafana_dashboard.nu (2)

91-91: API key naming strategy addresses previous concern.

The deterministic naming pattern using workspace ID and timestamp effectively addresses the previous concern about random UUID accumulation, providing both uniqueness and traceability.


23-28: AWS CLI output format correctly uses JSON.

The previous concern about using --output text with from json has been resolved by properly using --output json.

infrastructure/keys.py (1)

8-10: Approve parameterized function signature.

The updated function signature accepting data_bucket_name as a parameter improves modularity and follows good practices for dynamic resource creation.

infrastructure/api.py (3)

8-24: Approve VPC Link creation with proper dependencies.

The VPC Link creation properly handles dependencies and resource configuration for connecting API Gateway to the ALB.


87-102: Approve route creation with proper target configuration.

The route creation correctly uses the integration ID with the proper integrations/ prefix format required by API Gateway v2.


105-162: Verify API Gateway v2 execute-api ARN format

Ensure the IAM policy’s Resource entries follow AWS’s documented pattern for HTTP APIs:

arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method}/{resource_path}

Key checks in infrastructure/api.py (lines 105–162):

• Confirm that api_gateway.arn returns the execute-api ARN (e.g.
arn:aws:execute-api:us-east-1:123456789012:a1b2c3d4e5) rather than the API-management ARN.
– If it does not, use the API’s execution_arn output or a Stage resource’s ExecutionArn.
• Verify your use of /*/${method}/${path} correctly covers the stage (wildcard vs. explicit).
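The documented pattern can be made concrete with a small helper (hypothetical; shown only to make the segment order explicit):

```python
def execute_api_arn(
    region: str,
    account_id: str,
    api_id: str,
    stage: str = "*",
    method: str = "*",
    resource_path: str = "*",
) -> str:
    # arn:aws:execute-api:{region}:{account_id}:{api_id}/{stage}/{method}/{resource_path}
    return (
        f"arn:aws:execute-api:{region}:{account_id}:{api_id}"
        f"/{stage}/{method}/{resource_path}"
    )

assert (
    execute_api_arn("us-east-1", "123456789012", "a1b2c3d4e5")
    == "arn:aws:execute-api:us-east-1:123456789012:a1b2c3d4e5/*/*/*"
)
```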

infrastructure/ingress.py (2)

6-38: Review security group configuration for production readiness.

The security group allows unrestricted ingress from 0.0.0.0/0 on ports 80 and 443, and unrestricted egress. While this may be appropriate for a public-facing load balancer, consider if any restrictions are needed for your use case.

Consider whether the security group rules align with your security requirements:

  • Ingress from 0.0.0.0/0 on HTTP/HTTPS ports
  • Unrestricted egress to 0.0.0.0/0

61-89: Approve target group configuration with proper health checks.

The target group configuration is well-structured with appropriate health check settings targeting the /health endpoint on port 8080, matching typical service configurations.

infrastructure/cluster.py (4)

10-42: Approve modularized cluster creation with external dependencies.

The updated cluster creation function properly accepts external VPC, subnets, and IAM roles, improving modularity and dependency management.


68-101: Approve improved JSON serialization for ConfigMap data.

The use of pulumi.Output.json_dumps() for mapRoles and mapUsers provides better handling of Pulumi outputs compared to string concatenation.


128-195: IAM role creation functions look correct.

The new IAM role creation functions properly implement the required policies for EKS cluster and node roles, with appropriate assume role policies and managed policy attachments.


26-32: Verify node group configuration aligns with cluster settings.

The node group options specify the same capacity and instance type settings as the main cluster configuration. Consider if this duplication is intentional or if the node group should inherit from cluster settings.

Confirm whether the duplicated capacity settings in node_group_options are intentional or if they should be removed to inherit from the main cluster configuration.

@forstmeier forstmeier force-pushed the 07-23-update_infrastructure_resources branch from d91e994 to 292ca49 on July 25, 2025 01:57
@forstmeier forstmeier force-pushed the 07-23-update_infrastructure_resources branch from 292ca49 to ff497c7 on July 25, 2025 02:14
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (7)
infrastructure/ingress.py (1)

92-114: Add HTTPS listener to match security group configuration.

The current implementation only creates an HTTP listener on port 80, but the security group allows HTTPS traffic on port 443. For production environments, you should add an HTTPS listener with proper SSL certificate configuration.

Consider adding an HTTPS listener like this:

def create_application_load_balancer_https_listener(
    application_load_balancer: aws.lb.LoadBalancer,
    application_load_balancer_target_group: aws.lb.TargetGroup,
    certificate_arn: str,
) -> aws.lb.Listener:
    return aws.lb.Listener(
        resource_name="pocketsizefund-https-listener",
        load_balancer_arn=application_load_balancer.arn,
        port=443,
        protocol="HTTPS",
        ssl_policy="ELBSecurityPolicy-TLS-1-2-2017-01",
        certificate_arn=certificate_arn,
        default_actions=[
            aws.lb.ListenerDefaultActionArgs(
                type="forward",
                target_group_arn=application_load_balancer_target_group.arn,
            )
        ],
        opts=pulumi.ResourceOptions(
            depends_on=[
                application_load_balancer,
                application_load_balancer_target_group,
            ],
        ),
        tags=pulumi_tags,
    )
infrastructure/services.py (5)

21-25: Remove redundant kubernetes_provider dependency.

The kubernetes_provider is already specified in the provider parameter of ResourceOptions, making it redundant in the depends_on list.

        opts=pulumi.ResourceOptions(
            provider=kubernetes_provider,
-            depends_on=[kubernetes_provider],
        ),

53-63: Remove redundant kubernetes_provider dependency.

The same dependency cleanup is needed here as well.

            depends_on=[
-                kubernetes_provider,
                knative_serving_namespace,
                knative_serving_crds,
            ],

80-91: Clean up redundant provider dependencies in ConfigMap.

The ConfigMap creation also has the same redundant provider dependency issue.

            depends_on=[
-                kubernetes_provider,
                knative_serving_namespace,
                knative_serving_core,
                knative_serving_crds,
            ],

167-167: Remove redundant kubernetes_provider dependency in broker creation.

The create_knative_broker function has the same redundant provider dependency.

        opts=pulumi.ResourceOptions(
            provider=kubernetes_provider,
-            depends_on=[kubernetes_provider, knative_eventing_core],
+            depends_on=[knative_eventing_core],

239-244: Clean up redundant kubernetes_provider dependency.

The dependency list is more explicit which is good, but the kubernetes_provider is redundant as it's already specified as the provider.

            depends_on=[
-                kubernetes_provider,
                image,
                application_load_balancer_service_target_group,
                knative_serving_core,
            ],
infrastructure/__main__.py (1)

146-159: Consider using IAM roles instead of access keys for service authentication.

Passing AWS access keys as environment variables is less secure than using IAM roles for service accounts (IRSA). Consider implementing IRSA for better security and credential rotation.

For improved security, consider implementing IAM Roles for Service Accounts (IRSA) instead of passing access keys directly to services.

🧹 Nitpick comments (1)
infrastructure/upload_grafana_dashboard.nu (1)

167-186: Well-implemented prerequisites checking.

The function properly validates required tools and handles optional dependencies appropriately.

For consistency with AWS CLI checking, consider using a similar pattern for jq:

     try {
-        which jq | ignore
+        jq --version | ignore
         print "jq available"
     } catch {
         print "jq not found (optional but recommended for JSON debugging)"
     }

This would be more consistent with the AWS CLI check pattern.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 292ca49 and ff497c7.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (20)
  • .mise.toml (2 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • infrastructure/__main__.py (2 hunks)
  • infrastructure/api.py (1 hunks)
  • infrastructure/cluster.py (2 hunks)
  • infrastructure/environment_variables.py (0 hunks)
  • infrastructure/grafana-dashboard.json (1 hunks)
  • infrastructure/grafana_dashboard.json (0 hunks)
  • infrastructure/images.py (2 hunks)
  • infrastructure/ingress.py (1 hunks)
  • infrastructure/keys.py (2 hunks)
  • infrastructure/monitors.py (1 hunks)
  • infrastructure/ping.nu (0 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/roles.py (0 hunks)
  • infrastructure/services.py (14 hunks)
  • infrastructure/tags.py (1 hunks)
  • infrastructure/upload_grafana_dashboard.nu (1 hunks)
  • infrastructure/vpc.py (1 hunks)
💤 Files with no reviewable changes (4)
  • infrastructure/environment_variables.py
  • infrastructure/roles.py
  • infrastructure/ping.nu
  • infrastructure/grafana_dashboard.json
🚧 Files skipped from review as they are similar to previous changes (11)
  • application/datamanager/src/datamanager/main.py
  • infrastructure/tags.py
  • application/datamanager/compose.yaml
  • infrastructure/images.py
  • infrastructure/pyproject.toml
  • infrastructure/keys.py
  • infrastructure/monitors.py
  • infrastructure/grafana-dashboard.json
  • .mise.toml
  • infrastructure/api.py
  • infrastructure/vpc.py
🧰 Additional context used
🧠 Learnings (1)
infrastructure/__main__.py (1)

Learnt from: chrisaddy
PR: #535
File: application/datamanager/src/datamanager/main.py:147-147
Timestamp: 2025-05-22T02:26:11.306Z
Learning: In the datamanager service, debug-style f-string syntax (e.g., {year=}) is intentionally used for constructing GCS path prefixes to match Hive-style partitioned directory structures for data storage and retrieval.

🧬 Code Graph Analysis (1)
infrastructure/cluster.py (1)
infrastructure/roles.py (2)
  • create_cluster_role (5-33)
  • create_node_role (36-74)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (24)
infrastructure/ingress.py (3)

6-38: LGTM: Well-structured ALB security group with proper ingress rules.

The security group correctly allows both HTTP (port 80) and HTTPS (port 443) inbound traffic, addressing the security concern from previous reviews. The egress rule appropriately allows all outbound traffic, and proper dependencies are established.
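As a rough sketch, the rule set described above can be expressed as plain Python dicts of the kind a pulumi_aws SecurityGroup accepts (names and exact argument shapes here are illustrative, not the repo's actual code):

```python
# Hypothetical sketch of the ALB security group rules: HTTP and HTTPS
# inbound from anywhere, all traffic outbound.
alb_ingress_rules = [
    {"protocol": "tcp", "from_port": 80, "to_port": 80, "cidr_blocks": ["0.0.0.0/0"]},
    {"protocol": "tcp", "from_port": 443, "to_port": 443, "cidr_blocks": ["0.0.0.0/0"]},
]
alb_egress_rules = [
    # protocol "-1" is the AWS convention for "all protocols"
    {"protocol": "-1", "from_port": 0, "to_port": 0, "cidr_blocks": ["0.0.0.0/0"]},
]
```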


41-58: Good ALB configuration with proper subnet placement.

The ALB is correctly configured as internet-facing and attached to public subnets. The security group association and explicit dependencies ensure proper resource creation order.


61-89: Health check endpoints verified
All services expose a /health endpoint returning HTTP 200 OK as expected:

  • predictionengine: application/predictionengine/src/predictionengine/main.py
  • positionmanager: application/positionmanager/src/positionmanager/main.py
  • datamanager: application/datamanager/src/datamanager/main.py

No further changes required.
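The contract the ALB health check relies on is simply "GET /health returns 200 OK with a small body." A stdlib-only sketch of that contract (the actual services presumably use a web framework, so this is illustrative only):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Minimal handler satisfying an ALB-style health check on /health."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        # Silence per-request logging for this sketch.
        pass
```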

infrastructure/cluster.py (5)

10-35: Excellent modularization of cluster creation.

The updated function signature properly accepts VPC and subnet dependencies, making the infrastructure more explicit and testable. The explicit dependency management ensures correct resource creation order.


38-51: Good provider configuration with proper timeouts.

The Kubernetes provider is well-configured with appropriate custom timeouts and the replace_on_changes option for kubeconfig updates.


61-118: Improved ConfigMap creation with JSON serialization.

The switch from manual string concatenation to pulumi.Output.json_dumps for mapRoles and mapUsers is a significant improvement in maintainability and reduces the risk of formatting errors.
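For illustration, the serialized shape of a mapRoles entry can be sketched with the stdlib json module; the real code uses pulumi.Output.json_dumps so that the role ARN Output is resolved before serialization, and the ARN below is a placeholder (JSON is valid YAML, which is why this works inside the aws-auth ConfigMap):

```python
import json

# Placeholder ARN; in the real code this is a pulumi Output resolved
# via pulumi.Output.json_dumps rather than plain json.dumps.
node_role_arn = "arn:aws:iam::123456789012:role/example-node-role"

map_roles = json.dumps([
    {
        "rolearn": node_role_arn,
        "username": "system:node:{{EC2PrivateDNSName}}",
        "groups": ["system:bootstrappers", "system:nodes"],
    }
])
```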


121-147: Well-implemented IAM role for EKS cluster.

The cluster role creation follows AWS best practices with proper assume role policy and managed policy attachment. Using a Python dictionary for the assume role policy instead of a JSON string is cleaner and more maintainable than the previous implementation.
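As a sketch of the dict-based approach, the standard EKS cluster trust policy looks like this; it is serialized only at the point it is handed to the IAM role resource (exact variable names in the repo may differ):

```python
import json

# EKS control plane trust policy as a plain Python dict.
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "eks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialize once, at the boundary where the IAM role expects a JSON string.
policy_json = json.dumps(assume_role_policy)
```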


150-188: Comprehensive node role with all required policies.

The node role correctly includes all three required managed policies for EKS worker nodes: AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, and AmazonEKS_CNI_Policy. The implementation is consistent with the cluster role pattern.

infrastructure/services.py (3)

9-12: Clean utility function for environment variables.

The new create_service_environment_variables function provides a clean abstraction for converting key-value tuples to Pulumi Output dictionaries.
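One plausible shape for such a helper, minus the Pulumi Output wrapping (the real signature and return shape may differ; this assumes the Kubernetes-style name/value env list):

```python
def create_service_environment_variables(pairs):
    """Convert (name, value) tuples into container env-var dicts."""
    return [{"name": name, "value": value} for name, value in pairs]
```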


198-208: LGTM: ALB integration annotations properly configured.

The service metadata correctly includes ALB-specific annotations for internet-facing scheme, IP target type, and target group ARN. This properly integrates Knative services with the AWS ALB infrastructure.


222-222: Good resource limits for production workloads.

The CPU limit has been appropriately increased to 1000m (1 CPU core) which should provide better performance for production workloads while maintaining reasonable resource constraints.

infrastructure/__main__.py (7)

60-70: Good fix for region consistency issue.

The region is now properly extracted and used consistently for both availability zone queries and AWS configuration. This addresses the previous review comment about region mismatch.


72-105: Well-structured VPC and subnet creation.

The explicit creation of public and private subnets across multiple availability zones provides good redundancy and follows AWS best practices for high availability.
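A quick sketch of how per-AZ subnet CIDRs can be carved out of a VPC range with the stdlib ipaddress module (the CIDR block and AZ names are assumptions for illustration, not the repo's actual values):

```python
import ipaddress

# Split a /16 VPC into /24 subnets, assigning one public and one
# private subnet per availability zone.
vpc_cidr = ipaddress.ip_network("10.0.0.0/16")
azs = ["us-east-1a", "us-east-1b"]
subnets = list(vpc_cidr.subnets(new_prefix=24))

public = {az: str(subnets[i]) for i, az in enumerate(azs)}
private = {az: str(subnets[i + len(azs)]) for i, az in enumerate(azs)}
```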


108-117: Excellent modularization of IAM role creation.

Moving the IAM role creation to dedicated functions improves code organization and maintainability. The explicit dependencies ensure proper resource creation order.


161-189: Comprehensive ALB and API Gateway setup.

The ALB infrastructure setup is well-orchestrated with proper security groups, target groups, and API Gateway integration. The modular approach makes the infrastructure maintainable.


200-208: Good security practice requiring DockerHub credentials.

Explicitly requiring DockerHub credentials as secrets rather than embedding them or using defaults is a good security practice.


210-238: Well-integrated service deployment with API Gateway.

The datamanager service deployment properly integrates with the ALB target group and includes comprehensive API Gateway endpoint configurations. The explicit endpoint information makes the API structure clear.


305-340: Comprehensive infrastructure exports.

The exports provide good visibility into the deployed infrastructure components, including service images, cluster information, VPC details, and API Gateway endpoints. The lambda functions ensure proper string formatting.

infrastructure/upload_grafana_dashboard.nu (6)

3-9: Well-structured function signature with sensible defaults.

The main function parameters are properly typed and provide reasonable defaults for common use cases.


10-18: Good improvement: Prerequisites check is now properly called.

The addition of check_prerequisites at the start of main addresses the previous review feedback. The file existence validation is also well-implemented.


52-66: Solid implementation of workspace endpoint retrieval.

The error handling and JSON structure navigation are well-implemented.


68-87: Excellent implementation with safety features.

The dashboard loading includes proper JSON validation, and the dry-run functionality is a great safety feature for testing changes before actual deployment.


89-148: Comprehensive implementation with proper error handling and cleanup.

The API key management addresses previous feedback with timestamped names, and the upload process includes robust error handling with cleanup on failure. The response parsing is also well-handled.


150-165: Good cleanup implementation with graceful error handling.

The API key cleanup logic handles failures appropriately and provides clear completion feedback.

Comment thread infrastructure/services.py
Comment thread infrastructure/upload_grafana_dashboard.nu
@forstmeier
Collaborator Author

@chrisaddy here's some additional context to hand this off to you.

  1. Copilot and CodeRabbit reviewed this pretty thoroughly, but I didn't feel all of their suggestions needed to be included - feel free to check their feedback
  2. We're going to need to add ping.nu back, both as an updated script for API Gateway and as part of the infrastructure:up Mise command
  3. I haven't tested any of the inter-service functionality yet (e.g. triggers, schedules) - these definitely need to be checked, so I'd recommend tweaking the schedules and manually invoking the services (I'm not sure on the specifics of how to do that), but do it with "development" values (e.g. the paper Alpaca key)
  4. The Grafana dashboard and upload script also haven't been tested (I'm pretty sure there are inconsistencies between the definitions there and the custom metrics in the services), but that's more of a nice-to-have we can add whenever
  5. One of the slower resources to provision is coredns, so if it (or anything else) is slow, we can look into making those "permanent" resources like the data bucket and referencing them via import

@forstmeier
Collaborator Author

@chrisaddy also, the data will need to be backfilled into the S3 bucket. There's still data in the GCP bucket, but it's probably easier to do a manual backfill to test the endpoints/functionality.

@forstmeier
Collaborator Author

@chrisaddy also, I've run into some issues with pulumi down not destroying everything properly - I thought I'd fixed it with depends_on, but my terminal showed failures to destroy today after a previously successful run (it could be that I closed my laptop and interrupted the process, but just FYI).

@chrisaddy chrisaddy merged commit 4dd98e5 into master Jul 29, 2025
4 checks passed
@forstmeier forstmeier deleted the 07-23-update_infrastructure_resources branch August 24, 2025 02:01
@coderabbitai coderabbitai Bot mentioned this pull request Dec 23, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Jan 20, 2026