
Update Pulumi infrastructure to build Docker images #547

Merged
chrisaddy merged 10 commits into master from codex/push-docker-images-to-docker-hub-using-pulumi
May 22, 2025

Conversation

@chrisaddy
Collaborator

@chrisaddy chrisaddy commented May 21, 2025

Overview

Changes

  • build Docker images from application/*/Dockerfile using Pulumi
  • store images in Docker Hub with configurable credentials
  • use newly built datamanager image for Cloud Run service
  • include new pulumi-docker dependency

Comments

Summary by CodeRabbit

  • New Features

    • Introduced a cloud-native API for managing equity bars data, supporting create, retrieve, and delete operations with Google Cloud Storage integration.
    • Added health check endpoint for service monitoring.
    • Implemented Flyte workflow for automated backfilling of equity bar data over date ranges.
    • Added Docker Compose setup for local development and testing, including health checks and environment variable management.
  • Bug Fixes

    • Improved error handling and response codes across API endpoints.
  • Documentation

    • Added behavior-driven development (BDD) feature files and test scenarios for API endpoints.
  • Refactor

    • Migrated configuration and data management to use environment variables, cloud storage, and modern Python data models.
    • Reorganized source code structure for clarity and maintainability.
  • Chores

    • Updated dependencies and project metadata.
    • Enhanced .gitignore to exclude sensitive and local configuration files.
    • Added and updated development and test automation tasks.
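The configuration migration noted under Refactor can be sketched with a minimal, stdlib-only model. This is illustrative only: the field names and the `DataManagerConfig` class are hypothetical (the actual project defines Pydantic models in `config.py`), though `GCP_GCS_BUCKET` and `BASE_URL` are environment variables referenced elsewhere in this PR.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DataManagerConfig:
    """Environment-driven configuration (illustrative field names only)."""

    polygon_api_key: str
    gcs_bucket: str
    base_url: str = "http://datamanager:8000"

    @classmethod
    def from_env(cls) -> "DataManagerConfig":
        # Fail fast on required values; fall back to a default for the rest.
        return cls(
            polygon_api_key=os.environ["POLYGON_API_KEY"],
            gcs_bucket=os.environ["GCP_GCS_BUCKET"],
            base_url=os.environ.get("BASE_URL", "http://datamanager:8000"),
        )
```

Keeping all deployment-specific values behind one constructor makes the Docker Compose and Cloud Run environments interchangeable at startup.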

chrisaddy added 4 commits May 19, 2025 12:41
Implement integration testing and lifespan config
some fixes from codex

Refine bucket management steps

Merge pull request #537 from pocketsizefund/codex/create-behave-tests-for-data-manager-api

Add behave tests for bucket management
@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2025

Warning

Rate limit exceeded

@chrisaddy has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 25 minutes and 2 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between ad366d2 and 3f98463.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • .gitignore (1 hunks)
  • .mise.toml (4 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/features/environment.py (1 hunks)
  • application/datamanager/features/equity_bars.feature (1 hunks)
  • application/datamanager/pyproject.toml (2 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • application/datamanager/src/datamanager/models.py (1 hunks)
  • infrastructure/cloud_run.py (2 hunks)
  • infrastructure/pyproject.toml (0 hunks)
  • pyproject.toml (1 hunks)
  • workflows/backfill_datamanager.py (1 hunks)
  • workflows/pyproject.toml (1 hunks)

Walkthrough

This update introduces new cloud-native, partition-aware data management and testing infrastructure for the datamanager application. It adds Flyte-based workflow automation, Docker Compose integration, BDD test suites, and robust configuration models. The refactor replaces legacy code with asynchronous FastAPI endpoints, Google Cloud Storage integration, and Polars/DuckDB-based data processing.

Changes

File(s) Change Summary
application/datamanager/Dockerfile, application/datamanager/Dockerfile.test, application/datamanager/compose.yaml Refactored service Dockerfile, added test Dockerfile, and introduced Docker Compose configuration for orchestrated service and test execution with environment variables and health checks.
application/datamanager/features/environment.py, application/datamanager/features/equity_bars.feature, application/datamanager/features/health.feature, application/datamanager/features/steps/equity_bars_steps.py, application/datamanager/features/steps/health_steps.py Added BDD feature files and step definitions for API health and equity bars management, including environment setup for tests.
application/datamanager/mise.toml Added tasks for running and cleaning up Docker Compose-based behave tests.
application/datamanager/pyproject.toml Added runtime and development dependencies for cloud storage, logging, HTTP, and BDD testing.
application/datamanager/src/datamanager/config.py Introduced configuration models for Polygon API, GCP buckets, and credentials using Pydantic.
application/datamanager/src/datamanager/main.py Refactored FastAPI app: removed legacy database logic, added GCS/DuckDB/Polars integration, new endpoints, and improved error handling.
application/datamanager/src/datamanager/models.py Added Pydantic models for date handling and API schemas, with validation logic.
application/datamanager/tests/test_datamanager_main.py Removed legacy test module for health check endpoint.
infrastructure/buckets.py Refactored bucket resource naming and direct secret usage for production data bucket.
infrastructure/cloud_run.py Updated Cloud Run service configuration to use new secret environment variables and revised bucket/project environment variable names.
infrastructure/pyproject.toml Removed description and readme fields from project metadata.
infrastructure/Pulumi.yaml Added YAML document start marker.
infrastructure/main.py Removed script with a simple print statement.
.github/workflows/test.yaml Enabled Ubuntu in the test matrix, removed comments and unused installation step.
.gitignore Ignored local Claude settings and .envrc files.
.mise.toml Added environment variable and new tasks for testing, development, and linting.
pyproject.toml Added pytest-asyncio dependency, reordered workspace members, and added workflows project.
workflows/backfill_datamanager.py, workflows/pyproject.toml Added Flyte workflow and task for backfilling equity bars data; introduced workflows project configuration.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant FastAPI
    participant DuckDB
    participant GCS
    participant PolygonAPI

    Client->>FastAPI: GET /equity-bars?start_date&end_date
    FastAPI->>DuckDB: Query Parquet files on GCS via httpfs
    DuckDB->>GCS: Read partitioned Parquet files
    DuckDB-->>FastAPI: Return filtered data
    FastAPI-->>Client: Respond with Arrow IPC or 404

    Client->>FastAPI: POST /equity-bars { date }
    FastAPI->>PolygonAPI: Fetch daily bars JSON
    FastAPI->>Polars: Convert JSON to DataFrame, add partitions
    FastAPI->>GCS: Write Parquet partitioned by date
    FastAPI-->>Client: Respond with summary

    Client->>FastAPI: DELETE /equity-bars { date }
    FastAPI->>GCS: Delete Parquet files for date prefix
    FastAPI-->>Client: Respond 204 or 404
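The DELETE flow in the diagram above removes objects by date prefix. Per the review comments later in this thread, the data is written under a Hive-style layout `equity/bars/year=YYYY/month=MM/day=DD/…`; a small helper can build that prefix (a sketch assuming that scheme, with unpadded month/day to match the review snippet):

```python
from datetime import date


def partition_prefix(day: date) -> str:
    """Return the GCS object prefix for one day's equity bars partition."""
    return f"equity/bars/year={day.year}/month={day.month}/day={day.day}"
```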
sequenceDiagram
    participant FlyteWorkflow
    participant FastAPI

    loop For each date in [start_date, end_date]
        FlyteWorkflow->>FastAPI: POST /equity-bars { date }
        FastAPI-->>FlyteWorkflow: Return count or raise error
    end
    FlyteWorkflow-->>FlyteWorkflow: Aggregate counts and return
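The backfill loop in the second diagram walks the inclusive date range one day at a time. A stdlib sketch with the HTTP call abstracted out (the real task uses `httpx` inside a Flyte `@task` with retries; `backfill_range` and `post_one_day` are names invented here for illustration):

```python
from datetime import date, timedelta
from typing import Callable, List


def backfill_range(
    start: date,
    end: date,
    post_one_day: Callable[[date], int],
) -> List[int]:
    """POST each date in [start, end] and collect the returned row counts."""
    counts: List[int] = []
    current = start
    while current <= end:
        counts.append(post_one_day(current))
        current += timedelta(days=1)
    return counts
```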

Possibly related PRs

  • Rebuild full application #485: Modifies and later removes infrastructure/cloud_run.py, directly related to the updated Cloud Run configuration in this PR.

Suggested labels

application

Suggested reviewers

  • forstmeier

Poem

In the warren of code, new tunnels appear,
With Docker and Flyte, our data flows clear.
Parquet and Polars, on GCS they play,
While rabbits run tests in a BDD ballet.
Secrets are guarded, configs in a row—
Hopping ahead, our cloud carrots grow! 🥕


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
infrastructure/images.py (2)

12-26: Consider parameterizing the Docker Hub organization name.

The Docker Hub organization name "pocketsizefund" is hardcoded. Consider making it configurable via Pulumi config to improve flexibility.

 config = Config()
 docker_username = config.require_secret("dockerhub_username")
 docker_password = config.require_secret("dockerhub_password")
 image_tag = config.get("image_tag") or pulumi.get_stack()
+docker_org = config.get("docker_org") or "pocketsizefund"
 
 images = {}
 for dockerfile in glob(os.path.join("..", "application", "*", "Dockerfile")):
     service_name = os.path.basename(os.path.dirname(dockerfile))
     images[service_name] = docker.Image(
         f"{service_name}-image",
         build=docker.DockerBuild(
             context="..", dockerfile=f"application/{service_name}/Dockerfile"
         ),
-        image_name=f"docker.io/pocketsizefund/{service_name}:{image_tag}",
+        image_name=f"docker.io/{docker_org}/{service_name}:{image_tag}",
         registry=docker.RegistryArgs(
             server="docker.io",
             username=docker_username,
             password=docker_password,
         ),
     )

1-27: Consider exporting image URIs for other consumers.

For improved usability, consider exporting the full image URIs as Pulumi outputs, which would make them accessible to other stacks or CI/CD systems.

 import os
 from glob import glob
 import pulumi
 import pulumi_docker as docker
 from pulumi import Config
+from pulumi import export

 config = Config()
 docker_username = config.require_secret("dockerhub_username")
 docker_password = config.require_secret("dockerhub_password")
 image_tag = config.get("image_tag") or pulumi.get_stack()
 
 images = {}
+image_uris = {}
 for dockerfile in glob(os.path.join("..", "application", "*", "Dockerfile")):
     service_name = os.path.basename(os.path.dirname(dockerfile))
+    image_uri = f"docker.io/pocketsizefund/{service_name}:{image_tag}"
+    image_uris[service_name] = image_uri
     images[service_name] = docker.Image(
         f"{service_name}-image",
         build=docker.DockerBuild(
             context="..", dockerfile=f"application/{service_name}/Dockerfile"
         ),
-        image_name=f"docker.io/pocketsizefund/{service_name}:{image_tag}",
+        image_name=image_uri,
         registry=docker.RegistryArgs(
             server="docker.io",
             username=docker_username,
             password=docker_password,
         ),
     )
+
+# Export image URIs as stack outputs
+export('image_uris', image_uris)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 31eb7f7 and 54bcb3d.

📒 Files selected for processing (4)
  • infrastructure/__main__.py (1 hunks)
  • infrastructure/cloud_run.py (2 hunks)
  • infrastructure/images.py (1 hunks)
  • infrastructure/pyproject.toml (1 hunks)
🔇 Additional comments (5)
infrastructure/__main__.py (1)

3-3: LGTM! Import properly added.

The import statement for the new images module has been correctly added. The # noqa: F401 comment is appropriate, as it acknowledges that while the import isn't directly used in this file, it's necessary to ensure the Pulumi Docker image resources are loaded during execution.

infrastructure/pyproject.toml (1)

10-10: LGTM! New dependency properly added.

The pulumi-docker dependency with a minimum version of 3.0.0 has been correctly added to support the Docker image building and publishing functionality implemented in the new images.py module.

infrastructure/cloud_run.py (2)

7-7: LGTM! Import correctly added.

The import for the images module has been properly added to access the Docker images built and published by the Pulumi infrastructure.


23-23: LGTM! Image reference updated to use dynamically built image.

The container image reference has been properly updated to use the dynamically built and published Docker image from the images module, replacing the previously hardcoded image reference.

infrastructure/images.py (1)

1-11: LGTM! Module initialization with proper configuration.

The imports and configuration setup look good. Using secrets for Docker Hub credentials is a security best practice, and defaulting to the Pulumi stack name for the image tag is a sensible approach.

Comment thread infrastructure/images.py
@graphite-app

graphite-app Bot commented May 21, 2025

Graphite Automations

"Assign author to pull request" took an action on this PR • (05/21/25)

1 assignee was added to this PR based on John Forstmeier's automation.

Collaborator

@forstmeier forstmeier left a comment


CodeRabbit suggestion is good.

Comment thread infrastructure/images.py
@chrisaddy chrisaddy requested a review from forstmeier May 22, 2025 02:02
@chrisaddy chrisaddy force-pushed the codex/push-docker-images-to-docker-hub-using-pulumi branch 2 times, most recently from ad366d2 to d800417 on May 22, 2025 02:35
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 10

🔭 Outside diff range comments (1)
application/datamanager/features/equity_bars.feature (1)

24-25: ⚠️ Potential issue

Incomplete "Skip weekends" scenario needs implementation

The "Skip weekends" scenario is incomplete - it has a title but no steps defined. This should either be completed or removed before merging.

Complete the scenario with appropriate steps or remove it if it's not ready for implementation:

- Scenario Outline: Skip weekends
+# TODO: Implement weekend skipping scenario
+# Scenario Outline: Skip weekends
+#     Given I have weekend dates:
+#         | start_date | end_date   |
+#         | <start_date> | <end_date> |
+#     When I send a POST request to "/equity-bars" for date range
+#     Then the response status code should be 200
+#     And the response should indicate weekends were skipped
🧹 Nitpick comments (9)
workflows/backfill_datamanager.py (2)

8-12: Good use of task retries but could benefit from more robust error handling

The Flyte task includes retries which is excellent for resilience. However, consider adding more detailed error handling and logging to help diagnose failures.

@task(retries=3)
def backfill_single_date(base_url: str, day: date) -> int:
-    response = httpx.post(f"{base_url}/equity-bars", json={"date": day.isoformat()})
-    response.raise_for_status()
-    return response.json().get("count", 0)
+    try:
+        response = httpx.post(
+            f"{base_url}/equity-bars", 
+            json={"date": day.isoformat()},
+            timeout=30.0
+        )
+        response.raise_for_status()
+        return response.json().get("count", 0)
+    except httpx.HTTPStatusError as e:
+        print(f"HTTP error for {day}: {e.response.status_code} - {e.response.text}")
+        raise
+    except httpx.RequestError as e:
+        print(f"Request error for {day}: {str(e)}")
+        raise

15-22: Sequential processing might be inefficient for large date ranges

The workflow processes dates sequentially, which could be inefficient for large date ranges. Consider adding parallelism options for better performance with extended date ranges.

@workflow
def backfill_equity_bars(base_url: str, start_date: date, end_date: date) -> List[int]:
    results: List[int] = []
    current = start_date
+    # For large date ranges, consider using map_task for parallel execution
+    # dates = [start_date + timedelta(days=i) for i in range((end_date - start_date).days + 1)]
+    # return map_task(backfill_single_date)(base_url=[base_url]*len(dates), day=dates)
    while current <= end_date:
        results.append(backfill_single_date(base_url=base_url, day=current))
        current += timedelta(days=1)
    return results
application/datamanager/src/datamanager/models.py (2)

24-35: DateRange model lacks to_payload method for consistency

The DateRange model in positionmanager includes a to_payload method that's missing here. Consider adding this method for consistency across services.

class DateRange(BaseModel):
    start: datetime.date
    end: datetime.date

    @field_validator("end")
    @classmethod
    def check_end_after_start(cls, end_value, info):
        start_value = info.data.get("start")
        if start_value and end_value <= start_value:
            raise ValueError("End date must be after start date.")
        return end_value
+
+    def to_payload(self) -> dict[str, str]:
+        return {
+            "start_date": self.start.isoformat(),
+            "end_date": self.end.isoformat(),
+        }

37-39: Consider adding validation for BarsSummary date field

The BarsSummary class has a string date field but lacks validation to ensure it follows a consistent date format. Consider adding validation or using a datetime.date type with appropriate serialization.

class BarsSummary(BaseModel):
-    date: str
+    date: datetime.date
    count: int
+
+    model_config = {"json_encoders": {datetime.date: lambda d: d.strftime("%Y-%m-%d")}}
.mise.toml (3)

80-85: Expose the API on a configurable host/port for local dev.

uvicorn is started without the --host/--port flags, so it will default to 127.0.0.1:8000.
Most Cloud Run-bound services listen on 0.0.0.0:${PORT:-8080}. Without mirroring this locally you risk “works-on-my-machine” bugs (e.g. container health-checks failing).

-cd application/{{arg(name="service_name")}}
-uv run uvicorn src.{{arg(name="service_name")}}.main:application --reload
+cd application/{{arg(name="service_name")}}
+uv run uvicorn \
+  src.{{arg(name="service_name")}}.main:application \
+  --reload \
+  --host 0.0.0.0 \
+  --port ${PORT:-8080}

90-92: Use the new Docker CLI syntax to avoid deprecation warnings.

docker-compose (with a hyphen) is now superseded by docker compose (space). Newer Docker releases emit deprecation notices; switching avoids noise in CI logs.

-docker-compose up --build --abort-on-container-exit --remove-orphans
+docker compose up --build --abort-on-container-exit --remove-orphans

52-58: Missing dependency chain can lead to stale test images.

python:test:behave jumps straight to behave without guaranteeing the service image is (re)built first. Consider depending on application:service:build (or the compose test task) to ensure the freshest image is under test.

[tasks."python:test:behave"]
depends = ["application:service:build"]  # <- new
application/datamanager/features/steps/equity_bars_steps.py (1)

31-38: Minor: function name typo.

step_imp_get_request is missing the “l” in impl. Harmless at runtime, but it breaks naming consistency and hampers discoverability.

application/datamanager/src/datamanager/main.py (1)

141-147: Preserve the original traceback when re-raising.

Raising a new HTTPException inside except loses the root cause. Use exception chaining as Ruff hints (B904):

-raise HTTPException(
-    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-    detail="Failed to write data",
-)
+raise HTTPException(
+    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+    detail="Failed to write data",
+) from e
🧰 Tools
🪛 Ruff (0.11.9)

144-147: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

🛑 Comments failed to post (10)
application/datamanager/features/steps/health_steps.py (2)

8-8: ⚠️ Potential issue

Fix variable naming inconsistency.

There's a mismatch between the variable names: environment.py sets context.base_url but this file uses context.api_url.

def step_impl(context, endpoint):
    """Send a GET request to the specified endpoint."""
-    url = f"{context.api_url}{endpoint}"
+    url = f"{context.base_url}{endpoint}"
    context.response = requests.get(url)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def step_impl(context, endpoint):
    """Send a GET request to the specified endpoint."""
    url = f"{context.base_url}{endpoint}"
    context.response = requests.get(url)
🤖 Prompt for AI Agents
In application/datamanager/features/steps/health_steps.py at line 8, the
variable context.api_url is used but the correct variable set in environment.py
is context.base_url. Replace context.api_url with context.base_url in the URL
construction to fix the naming inconsistency.

6-9: 🛠️ Refactor suggestion

Add error handling and timeout to the request.

The current implementation lacks error handling and timeout settings for the HTTP request, which could cause test failures or hangs if the service is unavailable or slow to respond.

@when('I send a GET request to "{endpoint}"')
def step_impl(context, endpoint):
    """Send a GET request to the specified endpoint."""
    url = f"{context.api_url}{endpoint}"
-    context.response = requests.get(url)
+    try:
+        context.response = requests.get(url, timeout=10)
+    except requests.exceptions.RequestException as e:
+        context.response = None
+        context.request_error = str(e)
📝 Committable suggestion


@when('I send a GET request to "{endpoint}"')
def step_impl(context, endpoint):
    """Send a GET request to the specified endpoint."""
    url = f"{context.api_url}{endpoint}"
    try:
        context.response = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        context.response = None
        context.request_error = str(e)
🤖 Prompt for AI Agents
In application/datamanager/features/steps/health_steps.py around lines 6 to 9,
the GET request lacks error handling and timeout, risking test failures or hangs
if the service is down or slow. Modify the requests.get call to include a
timeout parameter and wrap it in a try-except block to catch exceptions like
requests.exceptions.RequestException. On exception, handle the error
appropriately, such as logging it or setting a failure state in the context.
application/datamanager/features/environment.py (1)

4-6: ⚠️ Potential issue

Fix variable naming inconsistency with step definitions.

There's a mismatch between variable names: this file sets context.base_url but health_steps.py uses context.api_url.

def before_all(context):
    """Set up test environment."""
-    context.base_url = os.environ.get("BASE_URL", "http://datamanager:8000")
+    context.api_url = os.environ.get("BASE_URL", "http://datamanager:8000")
+    print(f"Using API URL: {context.api_url}")
📝 Committable suggestion


def before_all(context):
    """Set up test environment."""
    context.api_url = os.environ.get("BASE_URL", "http://datamanager:8000")
    print(f"Using API URL: {context.api_url}")
🤖 Prompt for AI Agents
In application/datamanager/features/environment.py around lines 4 to 6, the
variable name set on the context is `base_url` but the step definitions in
health_steps.py expect `api_url`. Rename `context.base_url` to `context.api_url`
to ensure consistency and proper variable usage across the test steps.
application/datamanager/Dockerfile (1)

15-15: 💡 Verification agent

🧩 Analysis chain

Port changed from 8000 to 8080

The port change to 8080 aligns with Cloud Run's default port. Note that any existing services communicating with this container will need to be updated accordingly.


🏁 Script executed:

#!/bin/bash
# Check if there are any references to port 8000 in the codebase that might need updating
rg -g '!Dockerfile' '8000' --context 2 | grep -v '8080'

Length of output: 2672


Inconsistent port referenced in test environment

The Dockerfile’s exposed port was updated to 8080, but your test setup still defaults to port 8000. Update the BASE_URL default in application/datamanager/features/environment.py so tests target the new port:

--- a/application/datamanager/features/environment.py
+++ b/application/datamanager/features/environment.py
@@ def before_all(context):
-    context.base_url = os.environ.get("BASE_URL", "http://datamanager:8000")
+    context.base_url = os.environ.get("BASE_URL", "http://datamanager:8080")

No other occurrences of port 8000 were found outside lock files.

📝 Committable suggestion


def before_all(context):
    context.base_url = os.environ.get("BASE_URL", "http://datamanager:8080")
🤖 Prompt for AI Agents
In application/datamanager/Dockerfile at line 15, the exposed port was changed
from 8000 to 8080 to match Cloud Run's default. To fix the inconsistency, update
the BASE_URL default port in application/datamanager/features/environment.py
from 8000 to 8080 so that the test environment targets the correct port. This
ensures all references to the service port are consistent across the codebase.
application/datamanager/src/datamanager/models.py (1)

5-21: 🛠️ Refactor suggestion

SummaryDate model uses deprecated validator decorator

The SummaryDate class uses the deprecated @validator decorator instead of the newer @field_validator which is used elsewhere in the file. For consistency and forward compatibility, update to the newer decorator.

class SummaryDate(BaseModel):
    date: datetime.date = Field(
        default_factory=lambda: datetime.datetime.utcnow().date()
    )

-    @validator("date", pre=True)
-    def parse_date(cls, value):
+    @field_validator("date", mode="before")
+    @classmethod
+    def parse_date(cls, value):
        if isinstance(value, datetime.date):
            return value
        for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
            try:
                return datetime.datetime.strptime(value, fmt).date()
            except ValueError:
                continue
        raise ValueError("Invalid date format: expected YYYY-MM-DD or YYYY/MM/DD")

    model_config = {"json_encoders": {datetime.date: lambda d: d.strftime("%Y/%m/%d")}}
📝 Committable suggestion


class SummaryDate(BaseModel):
    date: datetime.date = Field(
        default_factory=lambda: datetime.datetime.utcnow().date()
    )

    @field_validator("date", mode="before")
    @classmethod
    def parse_date(cls, value):
        if isinstance(value, datetime.date):
            return value
        for fmt in ("%Y-%m-%d", "%Y/%m/%d"):
            try:
                return datetime.datetime.strptime(value, fmt).date()
            except ValueError:
                continue
        raise ValueError("Invalid date format: expected YYYY-MM-DD or YYYY/MM/DD")

    model_config = {"json_encoders": {datetime.date: lambda d: d.strftime("%Y/%m/%d")}}
🤖 Prompt for AI Agents
In application/datamanager/src/datamanager/models.py between lines 5 and 21,
replace the deprecated @validator decorator on the parse_date method with the
newer @field_validator decorator to ensure consistency and forward
compatibility. Update the decorator usage accordingly, maintaining the same
validation logic and parameters.
application/datamanager/features/steps/equity_bars_steps.py (2)

57-63: ⚠️ Potential issue

Deletion assertion is a no-op for GCS and path mismatch locally.

  1. When GCP_GCS_BUCKET is set the test simply executes assert True, so any DELETE response will pass even if data remains in the bucket.
  2. The local fallback checks for equity_bars_{date}.parquet, but the API writes partitioned files under equity/bars/year=YYYY/month=MM/day=DD/….

Consider:

if os.environ.get("GCP_GCS_BUCKET"):
    client = storage.Client()
    bucket = client.bucket(os.environ["GCP_GCS_BUCKET"])
    blobs  = list(bucket.list_blobs(prefix=f"equity/bars/year={date.year}/month={date.month}/day={date.day}"))
    assert not blobs, f"Blobs still present for {date_str}"
else:
    expected_dir = Path(f"equity/bars/year={date.year}/month={date.month}/day={date.day}")
    assert not expected_dir.exists(), f"{expected_dir} still exists"
🤖 Prompt for AI Agents
In application/datamanager/features/steps/equity_bars_steps.py around lines 57
to 63, the deletion assertion is ineffective because it uses a no-op assert for
GCP and checks the wrong local path. Replace the GCP check to instantiate a
storage client, list blobs with the correct prefix for the date partition, and
assert that no blobs remain. For the local fallback, check the existence of the
partitioned directory path matching the date
(equity/bars/year=YYYY/month=MM/day=DD) and assert it does not exist.
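The partition prefix used in both the GCS and local checks can be sketched as a small stdlib helper. Note that whether month and day are zero-padded depends on how the API actually writes partitions; the unpadded form below is an assumption to verify against real blob names:

```python
# Hypothetical helper building the Hive-style partition prefix used by the
# deletion checks. Assumes unpadded month/day, matching the f-strings in the
# suggested assertion above.
from datetime import date


def partition_prefix(d: date) -> str:
    # Mirrors equity/bars/year=YYYY/month=MM/day=DD without zero padding.
    return f"equity/bars/year={d.year}/month={d.month}/day={d.day}"


prefix = partition_prefix(date(2025, 5, 22))
assert prefix == "equity/bars/year=2025/month=5/day=22"
```

The same prefix string can then feed both bucket.list_blobs(prefix=…) and the local Path check, keeping the two branches of the assertion consistent.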

12-16: ⚠️ Potential issue

Loop overwrites the range; only the last row is preserved.

If the .feature table supplies multiple ranges you will quietly lose all but the final one:

for row in context.table:
    context.start_date = ...
    context.end_date = ...

Either accumulate into a list or assert the table contains exactly one row.

-for row in context.table:
-    context.start_date = row["start_date"]
-    context.end_date   = row["end_date"]
+rows = context.table.rows
+if len(rows) != 1:
+    raise AssertionError(f"Expected a single date range, got {len(rows)} rows")
+row = rows[0]
+context.start_date, context.end_date = row["start_date"], row["end_date"]
🤖 Prompt for AI Agents
In application/datamanager/features/steps/equity_bars_steps.py around lines 12
to 16, the loop overwrites context.start_date and context.end_date for each row,
so only the last row's values are kept. To fix this, either accumulate all date
ranges into a list stored in context (e.g., context.date_ranges) or add an
assertion to ensure context.table has exactly one row before assigning
start_date and end_date. This prevents silent data loss when multiple rows are
present.
application/datamanager/src/datamanager/main.py (3)

48-53: 🛠️ Refactor suggestion

Potential connection-sharing race condition.

duckdb.connect() returns a connection that is not thread-safe. FastAPI’s default worker model (uvicorn) serves requests concurrently, so two requests could mutate/read the same connection simultaneously, corrupting state.

Options:

  1. Open a fresh connection per request in a dependency.
  2. Use duckdb.connect(":memory:", read_only=True) clones.
  3. Protect calls with an asyncio.Lock().

Recommend moving connection creation into a request-scoped dependency unless there is a strong performance reason not to.

🤖 Prompt for AI Agents
In application/datamanager/src/datamanager/main.py around lines 48 to 53, the
current code creates a single DuckDB connection stored in app.state, which is
not thread-safe and can cause race conditions under concurrent FastAPI requests.
To fix this, refactor the code to create a new DuckDB connection per request by
defining a FastAPI dependency that opens and yields a fresh connection for each
request, ensuring no shared mutable state. Replace usage of the global
connection with this dependency injection to avoid concurrent access issues.
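The request-scoped dependency (option 1) follows FastAPI's yield-dependency pattern. The sketch below uses stdlib sqlite3 as a stand-in for duckdb so it stays self-contained; with DuckDB the body would call duckdb.connect() instead, and FastAPI would consume the generator via Depends(get_connection). The name get_connection is illustrative:

```python
# Sketch of a request-scoped connection dependency. sqlite3 stands in for
# duckdb here; the pattern is identical for duckdb.connect().
import sqlite3
from collections.abc import Iterator


def get_connection() -> Iterator[sqlite3.Connection]:
    connection = sqlite3.connect(":memory:")  # fresh connection per request
    try:
        yield connection  # injected into the endpoint for the request's duration
    finally:
        connection.close()  # cleanup runs after the response is sent


# Manually drive the generator the way FastAPI's dependency system would:
dependency = get_connection()
connection = next(dependency)
(value,) = connection.execute("SELECT 1").fetchone()
assert value == 1
try:
    next(dependency)  # exhausting the generator triggers the finally block
except StopIteration:
    pass
```

Because each request gets its own connection, no lock is needed and no mutable state is shared between concurrent workers.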

45-59: ⚠️ Potential issue

Fail fast when GCS credentials are missing.

DUCKDB_ACCESS_KEY/DUCKDB_SECRET may be None, leading to CREATE SECRET with literal "None" and obscure connection errors. Validate first:

-DUCKDB_ACCESS_KEY = os.getenv("DUCKDB_ACCESS_KEY")
-DUCKDB_SECRET = os.getenv("DUCKDB_SECRET")
+DUCKDB_ACCESS_KEY = os.getenv("DUCKDB_ACCESS_KEY")
+DUCKDB_SECRET     = os.getenv("DUCKDB_SECRET")
+
+if not DUCKDB_ACCESS_KEY or not DUCKDB_SECRET:
+    raise RuntimeError("DUCKDB_ACCESS_KEY and DUCKDB_SECRET must be set")
📝 Committable suggestion


    DUCKDB_ACCESS_KEY = os.getenv("DUCKDB_ACCESS_KEY")
    DUCKDB_SECRET     = os.getenv("DUCKDB_SECRET")

    if not DUCKDB_ACCESS_KEY or not DUCKDB_SECRET:
        raise RuntimeError("DUCKDB_ACCESS_KEY and DUCKDB_SECRET must be set")

    app.state.connection = duckdb.connect()
    app.state.connection.execute(f"""
      INSTALL httpfs;
      LOAD httpfs;
      SET http_keep_alive=true;
      SET http_timeout=30000;
      CREATE SECRET (
        TYPE GCS, 
        KEY_ID '{DUCKDB_ACCESS_KEY}',
        SECRET '{DUCKDB_SECRET}'
      );
    """)
🤖 Prompt for AI Agents
In application/datamanager/src/datamanager/main.py around lines 45 to 59, the
environment variables DUCKDB_ACCESS_KEY and DUCKDB_SECRET are used directly
without validation, which can cause the CREATE SECRET command to use "None" as
credentials leading to obscure errors. Add a check after retrieving these
variables to verify they are not None, and if either is missing, raise an
explicit error or exit the application immediately to fail fast and provide
clear feedback about missing credentials.
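The fail-fast check can be factored into a tiny helper so every required variable gets the same treatment. require_env is a hypothetical name, not part of the existing codebase:

```python
# Minimal fail-fast sketch for required environment variables.
# require_env is an illustrative helper, not an existing function.
import os


def require_env(name: str) -> str:
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} must be set")
    return value


# Demo with throwaway variable names to avoid touching real credentials:
os.environ["DUCKDB_ACCESS_KEY_DEMO"] = "abc"
assert require_env("DUCKDB_ACCESS_KEY_DEMO") == "abc"
try:
    require_env("DUCKDB_SECRET_DEMO_MISSING")
except RuntimeError as error:
    assert "must be set" in str(error)
```

Calling require_env("DUCKDB_ACCESS_KEY") and require_env("DUCKDB_SECRET") during lifespan startup would surface a clear error instead of the literal string "None" ending up in CREATE SECRET.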

18-35: ⚠️ Potential issue

Return-type annotation is incorrect.

bars_query constructs a SQL string but is annotated as returning pl.DataFrame, which misleads IDEs and static type checkers.

-def bars_query(*, bucket: str, start_date: date, end_date: date) -> pl.DataFrame:
+def bars_query(*, bucket: str, start_date: date, end_date: date) -> str:
📝 Committable suggestion


def bars_query(*, bucket: str, start_date: date, end_date: date) -> str:
    path_pattern = f"gs://{bucket}/equity/bars/*/*/*/*"

    return f"""
        SELECT *
        FROM read_parquet(
            '{path_pattern}', 
            HIVE_PARTITIONING=1
        )
        WHERE 
            (year > {start_date.year} OR 
             (year = {start_date.year} AND month > {start_date.month}) OR 
             (year = {start_date.year} AND month = {start_date.month} AND day >= {start_date.day}))
            AND
            (year < {end_date.year} OR 
             (year = {end_date.year} AND month < {end_date.month}) OR 
             (year = {end_date.year} AND month = {end_date.month} AND day <= {end_date.day}))
    """
🤖 Prompt for AI Agents
In application/datamanager/src/datamanager/main.py around lines 18 to 35, the
function bars_query is annotated to return pl.DataFrame but actually returns a
SQL query string. Change the return type annotation to str to accurately reflect
that the function returns a SQL string, not a DataFrame.
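The three-clause year/month/day predicate in the query is simply a lexicographic comparison over the Hive partition columns. A quick stdlib check (with illustrative dates) confirms it matches a plain date-range test:

```python
# Stdlib verification that the WHERE clause in bars_query is equivalent to
# start_date <= d <= end_date over the partition columns.
from datetime import date, timedelta


def partition_predicate(d: date, start: date, end: date) -> bool:
    # Mirrors the SQL WHERE clause term by term.
    after_start = (
        d.year > start.year
        or (d.year == start.year and d.month > start.month)
        or (d.year == start.year and d.month == start.month and d.day >= start.day)
    )
    before_end = (
        d.year < end.year
        or (d.year == end.year and d.month < end.month)
        or (d.year == end.year and d.month == end.month and d.day <= end.day)
    )
    return after_start and before_end


start, end = date(2024, 12, 30), date(2025, 1, 2)  # range spanning a year boundary
d = date(2024, 12, 1)
while d <= date(2025, 2, 1):
    assert partition_predicate(d, start, end) == (start <= d <= end)
    d += timedelta(days=1)
```

So the predicate logic is sound; only the return-type annotation needs correcting.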

…-using-datamanager

Update datamanager backfill workflow
fixes
@chrisaddy chrisaddy force-pushed the codex/push-docker-images-to-docker-hub-using-pulumi branch from d800417 to 55681d5 on May 22, 2025 at 02:41
forstmeier
forstmeier previously approved these changes May 22, 2025
Collaborator

@forstmeier forstmeier left a comment


I think this is good. There's lots of duplicate code I've seen in the PRs so I'm not sold on whatever Git flow/tool this is.

@chrisaddy chrisaddy merged commit c8bebfd into master May 22, 2025
3 of 4 checks passed
@chrisaddy chrisaddy deleted the codex/push-docker-images-to-docker-hub-using-pulumi branch May 22, 2025 03:30
@coderabbitai coderabbitai Bot mentioned this pull request Jun 2, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Aug 15, 2025
@coderabbitai coderabbitai Bot mentioned this pull request Sep 4, 2025
