
Data fetcher#535

Merged
chrisaddy merged 7 commits into master from data-fetcher
May 28, 2025

Conversation

Collaborator

@chrisaddy chrisaddy commented May 20, 2025

This pull request introduces a comprehensive set of changes to enhance the datamanager service, including new testing frameworks, configuration updates, and mock services for integration testing. The updates focus on supporting end-to-end testing, improving developer workflows, and adding mock data capabilities. Below is a summary of the most important changes grouped by theme:

Testing Framework and End-to-End Tests

  • Added behave end-to-end tests for the datamanager service, including scenarios for fetching, storing, retrieving, and deleting equity bars data, as well as health check endpoint tests (application/datamanager/features/equity_bars.feature, application/datamanager/features/health.feature) [1] [2].
  • Introduced step definitions for the behave tests, implementing API interactions and data validation (application/datamanager/features/steps/equity_bars_steps.py, application/datamanager/features/steps/health_steps.py) [1] [2].
  • Added Docker Compose tasks to run and clean up behave tests in a containerized environment (application/datamanager/mise.toml R1-R11).
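A minimal sketch of what such behave step definitions typically look like (the endpoint names and context attributes here are illustrative assumptions, not the PR's actual code; the fallback decorator only exists so the sketch runs even where behave is not installed):

```python
# Illustrative behave step definitions for the health/status scenarios.
try:
    from behave import given, then
except ImportError:  # fall back so the sketch runs without behave installed
    def _step(pattern):
        def decorator(func):
            return func
        return decorator
    given = then = _step


@given("the datamanager API is running")
def step_api_running(context):
    # environment.py's before_all() normally injects the base URL;
    # default to a local address for standalone runs.
    context.api_url = getattr(context, "api_url", "http://localhost:8080")


@then("the response status code should be {status_code}")
def step_check_status(context, status_code):
    actual = context.response.status_code
    assert actual == int(status_code), f"expected {status_code}, got {actual}"
```

In the real suite, a `@when` step would issue the HTTP request (e.g. with requests) and store the response on `context.response` for the assertion step above.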

Mock Services for Testing

  • Created a mock service using FastAPI to simulate Polygon API responses, including endpoints for health checks and grouped aggregates (application/datamanager/mock/server.py, application/datamanager/mock/Dockerfile) [1] [2].
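The idea of the mock can be sketched with the standard library alone (the PR's version uses FastAPI; the response shape below is an illustrative stand-in for Polygon's grouped-aggregates payload, not a copy of the PR's code):

```python
# Stdlib-only sketch of a mock market-data API with /health and a
# grouped-aggregates route. Payload fields are illustrative assumptions.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class MockPolygonHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            body = {"status": "ok"}
        elif self.path.startswith("/v2/aggs/grouped/locale/us/market/stocks/"):
            body = {
                "status": "OK",
                "resultsCount": 1,
                "results": [{"T": "ACME", "o": 10.0, "c": 10.5,
                             "h": 10.7, "l": 9.9, "v": 12345, "t": 0}],
            }
        else:
            self.send_error(404)
            return
        payload = json.dumps(body).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep test output quiet
        pass


def serve(port=0):
    """Start the mock on a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MockPolygonHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A containerized test suite would point its API client at this server's address instead of the real Polygon endpoint.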

Configuration and Dependency Updates

  • Added a Settings class using Pydantic to manage environment variables for API keys and GCP configurations (application/datamanager/src/datamanager/config.py R1-R9).
  • Updated pyproject.toml to include new dependencies for testing (behave, httpx, pytest-asyncio) and Google Cloud Storage support (application/datamanager/pyproject.toml) [1] [2].
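The shape of such a settings object can be sketched with a plain dataclass (the PR uses a Pydantic model; `POLYGON_API_KEY` is an assumed variable name, while `GCP_PROJECT` and `DATA_BUCKET` appear elsewhere in this PR's Cloud Run changes):

```python
# Stdlib sketch of env-var-backed settings; the real Settings class uses
# Pydantic, which adds validation and type coercion on top of this idea.
import os
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Settings:
    polygon_api_key: str = field(
        default_factory=lambda: os.environ.get("POLYGON_API_KEY", ""))
    gcp_project: str = field(
        default_factory=lambda: os.environ.get("GCP_PROJECT", ""))
    data_bucket: str = field(
        default_factory=lambda: os.environ.get("DATA_BUCKET", ""))
```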

Dockerfile and Compose Enhancements

  • Updated the Dockerfile for the datamanager service to include additional dependencies, expose port 8080, and adjust the entry point for running the application (application/datamanager/Dockerfile L2-R18).
  • Added a Dockerfile.test for running behave tests in a containerized environment (application/datamanager/Dockerfile.test R1-R12).
  • Introduced a compose.yaml file defining services for the datamanager application and its tests, including health checks and environment variable configuration (application/datamanager/compose.yaml R1-R42).

SQL and Utility Updates

These changes collectively improve the robustness, testability, and developer experience of the datamanager service.

Summary by CodeRabbit

  • New Features
    • Introduced a new model training workflow that fetches data, trains a simple model, and stores the resulting model artifact.
  • Chores
    • Added tasks to facilitate running and cleaning up behave tests using Docker Compose.
    • Updated dependency list to include loguru for enhanced logging capabilities.
    • Updated Docker image tag configuration for production tasks.
  • Documentation
    • Added a descriptive docstring to the test environment setup function.
  • Other
    • Adjusted file ignore patterns and made minor formatting changes.
    • Duplicated certain environment variables in the Cloud Run service configuration (no user-facing impact).

@chrisaddy chrisaddy self-assigned this May 20, 2025
Contributor

coderabbitai Bot commented May 20, 2025

Warning

Rate limit exceeded

@chrisaddy has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 5 minutes and 42 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between ae8b56f and e4abe43.

⛔ Files ignored due to path filters (2)
  • .flox/env/manifest.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (8)
  • .gitignore (1 hunks)
  • .mise.toml (1 hunks)
  • application/datamanager/features/environment.py (1 hunks)
  • application/datamanager/mise.toml (1 hunks)
  • application/datamanager/pyproject.toml (1 hunks)
  • application/datamanager/src/datamanager/main.py (0 hunks)
  • infrastructure/cloud_run.py (1 hunks)
  • workflows/prediction_model.py (1 hunks)

Walkthrough

This update introduces a Flyte workflow for model training, adds Docker Compose test tasks, updates dependencies, and modifies configuration files for image tagging and file ignoring. Minor documentation and formatting adjustments are included, along with duplicated environment variables in a Cloud Run configuration.

Changes

Files/Paths Change Summary
.gitignore Changed ignore pattern from .coverage.xml to coverage.xml.
.mise.toml Updated Docker image tag variable from service_name to application_name in a production task.
application/datamanager/features/environment.py Added a docstring to the before_all(context) function.
application/datamanager/mise.toml Added tasks for running and cleaning up behave tests using Docker Compose.
application/datamanager/pyproject.toml Added loguru>=0.7.3 as a runtime dependency.
application/datamanager/src/datamanager/main.py Removed a trailing blank line before a return statement.
infrastructure/cloud_run.py Duplicated three environment variables in the Cloud Run container configuration.
workflows/prediction_model.py Introduced a new Flyte workflow module for end-to-end model training with three tasks and one workflow.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Flyte
    participant HTTP_API
    participant Storage

    User->>Flyte: Trigger training_workflow(start_date, end_date)
    Flyte->>HTTP_API: fetch_data(start_date, end_date)
    HTTP_API-->>Flyte: Return equity bar data
    Flyte->>Flyte: train_dummy_model(data)
    Flyte->>Storage: store_model(model_bytes)
    Storage-->>Flyte: Return artifact path
    Flyte-->>User: Return model artifact path
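The workflow in the diagram can be sketched as three composable steps (a plain-Python stand-in: in the PR these are flytekit tasks, `fetch_data` calls the datamanager HTTP API, and `store_model` writes to cloud storage; the in-memory `store` dict and artifact path below are illustrative assumptions):

```python
# Plain-Python sketch of the training workflow's data flow.
import pickle


def fetch_data(start_date: str, end_date: str) -> list[dict]:
    # Stand-in for the HTTP fetch; returns equity-bar-like rows.
    return [{"date": start_date, "close": 10.5},
            {"date": end_date, "close": 10.7}]


def train_dummy_model(data: list[dict]) -> bytes:
    # "Training" here just averages closing prices and pickles the result.
    mean_close = sum(row["close"] for row in data) / len(data)
    return pickle.dumps({"mean_close": mean_close})


def store_model(model_bytes: bytes, store: dict) -> str:
    # Stand-in for an object-storage upload; returns the artifact path.
    path = "models/dummy.pkl"
    store[path] = model_bytes
    return path


def training_workflow(start_date: str, end_date: str, store: dict) -> str:
    data = fetch_data(start_date, end_date)
    model = train_dummy_model(data)
    return store_model(model, store)
```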

Possibly related PRs

  • pocketsizefund/pocketsizefund#558: Also modifies .mise.toml tasks and behavioral test workflow definitions, overlapping with changes to task management and test orchestration.

Suggested labels

application, python

Poem

A workflow hops from code to cloud,
With Docker Compose, it’s feeling proud.
Loguru logs each bunny leap,
While models train and data sweep.
Ignore the files you do not need,
And let the Flyte tasks now proceed!
🐇✨


@chrisaddy chrisaddy force-pushed the data-fetcher branch 5 times, most recently from c1bd352 to 6cb1bc2 on May 21, 2025 03:38
@chrisaddy chrisaddy requested a review from forstmeier May 21, 2025 03:43
@chrisaddy chrisaddy marked this pull request as ready for review May 21, 2025 03:43
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🔭 Outside diff range comments (1)
application/datamanager/src/datamanager/bars.sql (1)

1-2: ⚠️ Potential issue

Empty SQL file requires content

This file appears to be empty with just a newline character. According to the PR summary, SQL queries for reading parquet files were added to the service, but this file doesn't contain any actual SQL.

Please add the intended SQL query content or remove this file if it's not needed.

🧹 Nitpick comments (21)
application/datamanager/pyproject.toml (1)

24-31: Development dependencies properly segregated

Good practice to separate development dependencies into their own group. These dependencies support the new testing infrastructure described in the PR objectives, particularly the behave framework for end-to-end testing.

However, note that polars is included in both main dependencies (line 10) and dev dependencies (line 30), which appears redundant.

Consider removing the redundant polars dependency from the dev group since it's already included in the main dependencies:

 dev = [
     "httpx>=0.28.1",
     "pytest-asyncio>=0.26.0",
     "behave>=1.2.6",
     "requests>=2.31.0",
-    "polars>=1.29.0",
 ]
application/datamanager/features/steps/health_steps.py (1)

4-5: Consider using requests.Session for reusable connections

For test efficiency, especially when running multiple tests against the same API, consider using requests.Session() instead of direct requests.get() calls.

-import requests
+from requests import Session
+
+@given('the datamanager API is running')
+def step_impl(context):
+    """Set up a session for API requests."""
+    context.session = Session()
 
 @when('I send a GET request to "{endpoint}"')
 def step_impl(context, endpoint):
     """Send a GET request to the specified endpoint."""
     url = f"{context.api_url}{endpoint}"
-    context.response = requests.get(url)
+    context.response = context.session.get(url)
application/datamanager/Dockerfile.test (2)

1-2: Python and UV version management needs improvement.

The Dockerfile uses python:3.13-slim which is appropriate, but pulls uv from the latest tag. Consider pinning to a specific version of UV for better reproducibility and to prevent unexpected changes.

FROM python:3.13-slim
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.1.25 /uv /bin/uv

4-11: Well-structured test container, but consider security improvements.

The Dockerfile follows good practices with appropriate working directory setup, dependency installation, and execution command. However, it runs as root by default.

Consider adding a non-root user for improved security:

COPY features /tests

+# Create non-root user for running tests
+RUN groupadd -r testuser && useradd -r -g testuser testuser
+USER testuser

CMD ["uv", "run", "behave", "features/"]
infrastructure/cloud_run.py (1)

43-49: Inconsistent formatting in environment variable definitions.

There's inconsistent formatting in the DuckDB environment variable definitions - one has name and value on a single line while the other spreads across multiple lines. Also, there's extra indentation in the closing parenthesis.

cloudrun.ServiceTemplateSpecContainerEnvArgs(
-    name="DUCKDB_ACCESS_KEY", value=duckdb_access_key
+    name="DUCKDB_ACCESS_KEY",
+    value=duckdb_access_key,
),
cloudrun.ServiceTemplateSpecContainerEnvArgs(
    name="DUCKDB_SECRET",
    value=duckdb_secret,
-    ),
+),
application/datamanager/compose.yaml (1)

1-24: Good datamanager service configuration but could benefit from resource limits.

The datamanager service is well configured with proper environment variables, port mapping, and health check. However, it lacks resource constraints which could lead to resource contention in shared environments.

Consider adding resource limits:

    healthcheck:
      test: ["CMD", "curl", "-f", "http://0.0.0.0:8080/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 1s
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.5'
+          memory: 512M
application/datamanager/mise.toml (1)

1-11: Update to modern Docker Compose command syntax and add combined task.

The tasks use the deprecated docker-compose command (with hyphen) instead of the current docker compose command. Also, consider adding a combined task that runs tests and cleans up regardless of the test result.

[tasks."test:docker:behave"]
description = "Run behave tests with Docker Compose"
run = """
-docker-compose up --build --abort-on-container-exit
+docker compose up --build --abort-on-container-exit
"""

[tasks."test:docker:behave:cleanup"]
description = "Clean up after Docker Compose tests"
run = """
-docker-compose down -v
+docker compose down -v
"""
+
+[tasks."test:docker:behave:run-and-cleanup"]
+description = "Run behave tests and clean up regardless of result"
+run = """
+docker compose up --build --abort-on-container-exit
+docker compose down -v
+"""
application/datamanager/src/datamanager/query.py (1)

22-24: SQL query construction could be improved

The current SQL query construction using an f-string with direct variable interpolation is generally discouraged for SQL queries.

While DuckDB might handle this safely, consider a more parameterized approach:

 async def query_bars(*, filepaths: list[Path]) -> DataFrame:
-    query = f"""
-    SELECT * FROM read_parquet({filepaths})
-    """
+    # Use a safer approach for file paths
+    filepaths_str = str(filepaths)
+    query = f"""
+    SELECT * FROM read_parquet({filepaths_str})
+    """
     try:
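One way to act on this concern without changing the query shape is to render the paths as an explicit SQL list literal with single-quoted, escaped entries, instead of interpolating a Python list's repr (a sketch under assumed helper names, not the PR's code; DuckDB's read_parquet accepts a list of path strings):

```python
# Hypothetical helpers: quote each path for SQL so a stray quote in a
# filename cannot break out of the string context.
from pathlib import Path


def sql_path_list(filepaths: list[Path]) -> str:
    quoted = ", ".join(
        "'" + str(p).replace("'", "''") + "'" for p in filepaths
    )
    return f"[{quoted}]"


def build_bars_query(filepaths: list[Path]) -> str:
    return f"SELECT * FROM read_parquet({sql_path_list(filepaths)})"
```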
application/datamanager/features/equity_bars.feature (2)

35-36: Missing newline at end of file

Best practice is to end files with a newline character.

Add a newline at the end of this file.


20-22: Consider expanding test examples

The test only has one example date. Adding more diverse dates could improve test coverage.

Consider adding more diverse test dates:

         Examples: dates
             | start_date | end_date   |
             | 2025-05-20 | 2025-05-20 |
+            | 2025-05-20 | 2025-05-21 | 
+            | 2025-05-01 | 2025-05-05 |
application/datamanager/src/datamanager/models.py (1)

38-41: BarsResult has missing newline at end of file

Best practice is to end files with a newline character.

Add a newline at the end of the file.

.mise.toml (1)

87-92: Docker Compose test task is useful but could be improved

The application:test task correctly sets up integration tests but lacks cleanup options.

Consider adding a flag to allow cleaning up after tests:

 [tasks."application:test"]
 description = "Run integration tests"
 run = """
 cd application/{{arg(name="application_name")}}
-docker-compose up --build --abort-on-container-exit --remove-orphans
+docker-compose up --build --abort-on-container-exit --remove-orphans {{arg(name="clean", default="")}}
 """
+
+[tasks."application:test:clean"]
+description = "Run integration tests and clean up containers after"
+run = """
+mise run application:test application_name={{arg(name="application_name")}} clean="--rm"
+"""
application/datamanager/src/datamanager/config.py (2)

1-6: Remove unused imports and avoid dead code

date, datetime, and logger are not referenced. They trigger Ruff F401 and create noise for readers.

-from datetime import date, datetime
-from loguru import logger
+# from datetime import date, datetime  # Uncomment when actually needed
+# from loguru import logger            # Uncomment when logging is added
🧰 Tools
🪛 Ruff (0.11.9)

4-4: datetime.date imported but unused

Remove unused import

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


6-6: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)


38-41: Escape all dangerous characters or switch to base-64

secret only escapes single quotes. Double quotes or back-slashes in the JSON
can still break the CREATE SECRET statement.
Base-64 encoding is safer and keeps SQL readable.

import base64
@computed_field
def secret(self) -> str:
    return base64.b64encode(json.dumps(self._creds).encode()).decode()
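Standalone, the encoding step round-trips regardless of quotes or backslashes in the JSON, since the base64 alphabet contains neither (the credential values below are fabricated placeholders):

```python
# Base64-encoding JSON credentials sidesteps SQL-escaping entirely:
# the encoded string contains only [A-Za-z0-9+/=].
import base64
import json

creds = {
    "client_email": "svc@example.iam.gserviceaccount.com",
    "private_key": "contains 'single', \"double\", and \\backslash",
}

encoded = base64.b64encode(json.dumps(creds).encode()).decode()
decoded = json.loads(base64.b64decode(encoded))
```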
application/datamanager/features/steps/equity_bars_steps.py (5)

1-12: Cull unused imports

json, datetime, timedelta, and httpx are never used – remove them to satisfy linting and speed-up start-up.

-import json
-from datetime import datetime, timedelta
-import httpx
🧰 Tools
🪛 Ruff (0.11.9)

1-1: json imported but unused

Remove unused import: json

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


4-4: datetime.timedelta imported but unused

Remove unused import

(F401)


10-10: httpx imported but unused

Remove unused import: httpx

(F401)


8-9: Avoid mutating sys.path in test code

Manipulating sys.path is order-dependent and brittle. Prefer installing the
package in editable mode inside the test container or using relative imports.


34-42: Typo: step_imp → step_impl

Keeps naming consistent and simplifies navigation.

-def step_imp(context, endpoint):
+def step_impl(context, endpoint):

125-130: Use equality, not substring, when validating dates

"2024-05-1" is considered in "2024-05-10". Compare with == to prevent
false positives.


142-153: Local parquet checks break when data is written to GCS

Path.exists() cannot verify remote gs://… objects.
Mock the storage client or rely on the GET endpoint to confirm deletion.

🧰 Tools
🪛 Ruff (0.11.9)

142-142: Redefinition of unused step_impl from line 134

(F811)

application/datamanager/src/datamanager/main.py (2)

12-14: Trim unused imports

HTTPException, Query, and .models.DateRange are imported but unused.
Cleaning them keeps Ruff quiet.

🧰 Tools
🪛 Ruff (0.11.9)

12-12: fastapi.HTTPException imported but unused

Remove unused import

(F401)


12-12: fastapi.Query imported but unused

Remove unused import

(F401)


14-14: .models.DateRange imported but unused

Remove unused import: .models.DateRange

(F401)


41-59: Reuse config-derived credentials instead of ad-hoc env vars

Settings.gcp already exposes key_id and secret. Relying on separate
DUCKDB_ACCESS_KEY/SECRET duplicates state and can drift.

DUCKDB_ACCESS_KEY = app.state.settings.gcp.key_id
DUCKDB_SECRET = app.state.settings.gcp.secret
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 31eb7f7 and 91e72df.

⛔ Files ignored due to path filters (2)
  • .flox/env/manifest.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (23)
  • .gitignore (1 hunks)
  • .mise.toml (3 hunks)
  • application/datamanager/Dockerfile (1 hunks)
  • application/datamanager/Dockerfile.test (1 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/features/environment.py (1 hunks)
  • application/datamanager/features/equity_bars.feature (1 hunks)
  • application/datamanager/features/health.feature (1 hunks)
  • application/datamanager/features/steps/equity_bars_steps.py (1 hunks)
  • application/datamanager/features/steps/health_steps.py (1 hunks)
  • application/datamanager/mise.toml (1 hunks)
  • application/datamanager/pyproject.toml (2 hunks)
  • application/datamanager/src/datamanager/bars.sql (1 hunks)
  • application/datamanager/src/datamanager/config.py (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • application/datamanager/src/datamanager/models.py (1 hunks)
  • application/datamanager/src/datamanager/query.py (1 hunks)
  • application/datamanager/tests/test_datamanager_main.py (0 hunks)
  • infrastructure/buckets.py (1 hunks)
  • infrastructure/cloud_run.py (2 hunks)
  • infrastructure/main.py (0 hunks)
  • infrastructure/pyproject.toml (0 hunks)
  • pyproject.toml (1 hunks)
💤 Files with no reviewable changes (3)
  • infrastructure/pyproject.toml
  • infrastructure/main.py
  • application/datamanager/tests/test_datamanager_main.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
infrastructure/buckets.py (1)
application/datamanager/src/datamanager/config.py (1)
  • Bucket (15-21)
application/datamanager/src/datamanager/models.py (1)
application/positionmanager/src/positionmanager/models.py (2)
  • DateRange (36-53)
  • check_end_after_start (42-47)
🪛 Hadolint (2.12.0)
application/datamanager/Dockerfile

[error] 12-12: COPY with more than 2 arguments requires the last argument to end with /

(DL3021)

🪛 Ruff (0.11.9)
application/datamanager/src/datamanager/query.py

28-28: SyntaxError: Expected an indented block after except clause

application/datamanager/src/datamanager/config.py

4-4: datetime.date imported but unused

Remove unused import

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


6-6: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)

application/datamanager/src/datamanager/main.py

12-12: fastapi.HTTPException imported but unused

Remove unused import

(F401)


12-12: fastapi.Query imported but unused

Remove unused import

(F401)


14-14: .models.DateRange imported but unused

Remove unused import: .models.DateRange

(F401)

application/datamanager/features/steps/equity_bars_steps.py

1-1: json imported but unused

Remove unused import: json

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


4-4: datetime.timedelta imported but unused

Remove unused import

(F401)


10-10: httpx imported but unused

Remove unused import: httpx

(F401)


28-28: Redefinition of unused step_impl from line 16

(F811)


45-45: Redefinition of unused step_impl from line 28

(F811)


52-52: Redefinition of unused step_impl from line 45

(F811)


59-59: Redefinition of unused step_impl from line 52

(F811)


65-65: Redefinition of unused step_impl from line 59

(F811)


77-77: Redefinition of unused step_impl from line 65

(F811)


87-87: Redefinition of unused step_impl from line 77

(F811)


102-102: Redefinition of unused step_impl from line 87

(F811)


114-114: Redefinition of unused step_impl from line 102

(F811)


134-134: Redefinition of unused step_impl from line 114

(F811)


142-142: Redefinition of unused step_impl from line 134

(F811)


157-157: Redefinition of unused step_impl from line 142

(F811)

🪛 GitHub Actions: Code quality check
application/datamanager/src/datamanager/query.py

[error] 28-28: Failed to parse file: Expected an indented block after except clause.

🔇 Additional comments (21)
.gitignore (1)

10-11: Appropriate additions to gitignore

These new entries properly exclude local configuration files that should not be committed to version control, aligning with the enhanced configuration management introduced in this PR.

pyproject.toml (2)

8-8: Good addition of pytest-asyncio dependency

Adding pytest-asyncio supports the new asynchronous testing capabilities in the datamanager service, which aligns with the testing framework enhancements mentioned in the PR objectives.


14-17: Workspace members reordering

The reordering of workspace members doesn't affect functionality but maintains consistent organization in the project structure.

application/datamanager/pyproject.toml (1)

12-14: Appropriate runtime dependencies added

The new dependencies align well with the PR objectives:

  • google-cloud-storage supports the GCP configurations mentioned in the PR
  • loguru provides enhanced logging capabilities
  • datamanager suggests a modular design approach
application/datamanager/features/environment.py (1)

4-6: Well-structured behave test environment setup

The before_all function follows best practices for setting up the test environment. It appropriately configures the base URL from an environment variable with a sensible default, supporting the containerized testing approach described in the PR.

application/datamanager/Dockerfile (3)

5-5: LGTM: Added curl as a dependency

Good addition of curl, which will be useful for health checks and API testing in the container.


16-16: LGTM: Updated exposed port

Correctly updated the exposed port from 8000 to 8080 to match the port specified in the entrypoint command.


18-18: LGTM: Updated entrypoint to use new module path

The entrypoint correctly reflects the new module structure with the FastAPI application now located at datamanager.src.datamanager.main:application.

application/datamanager/features/health.feature (1)

1-11: Well-structured BDD test for health endpoint

This is a clean, well-formatted BDD feature file that follows Gherkin syntax best practices. The scenario effectively tests the health endpoint functionality mentioned in the PR objectives.

application/datamanager/features/steps/health_steps.py (1)

7-11:


✅ Verification successful

Well-implemented step definition, but is the assertion step defined elsewhere?

The step definition is clear and follows behave's conventions. However, I don't see the implementation for the "Then the response status code should be 200" step from the feature file.

Please verify that the assertion step is defined in another file, such as a common steps file. You can check with:


🏁 Script executed:

#!/bin/bash
# Look for the step definition that checks response status code
cd application/datamanager
rg -l "response status code should be" --type py features/

Length of output: 125


🏁 Script executed:

#!/bin/bash
cd application/datamanager
rg -n "@then.*response status code should be" -C2 features/steps/equity_bars_steps.py

Length of output: 291


Assertion step found and no action required

I located the corresponding @then("the response status code should be {status_code}") step in application/datamanager/features/steps/equity_bars_steps.py at lines 44–46. The assertion is already implemented—no missing definitions.

infrastructure/buckets.py (2)

8-10: Improved naming consistency and resource identification

Good refactoring to use more explicit and consistent naming patterns, changing from "prod" to "production". This aligns better with the Bucket configuration class shown in the application code.


10-10: Cleaner configuration access

Nice improvement to directly use config.require_secret() inline rather than storing it in an intermediate variable first.

infrastructure/cloud_run.py (2)

12-13: Secure handling of DuckDB credentials.

The code correctly uses require_secret to retrieve sensitive DuckDB credentials, maintaining good security practices.


36-42: Environment variable renaming and addition looks good.

Renaming GCP_GCS_BUCKET to GCP_PROJECT and adding the new DATA_BUCKET variable pointing to the production data bucket is appropriate for the refactored architecture.

application/datamanager/compose.yaml (1)

25-36: Test service dependency setup is well-defined.

The tests service properly depends on the datamanager service's health check, ensuring tests only run once the service is ready. The environment variable setup and volume mounting look good.

application/datamanager/features/equity_bars.feature (1)

1-36: LGTM - Well-structured BDD tests for equity bars data management

The feature file follows good BDD practices with clear scenarios for testing the equity bars data management functionality. The tests cover the full lifecycle (POST, GET, DELETE) and also include a specific test for handling weekend dates.

application/datamanager/src/datamanager/models.py (1)

25-36:


❌ Incorrect review comment

DateRange implementation differs from positionmanager

The DateRange class implementation differs from the version in positionmanager that uses datetime instead of date types.

Verify whether this is intentional or if the implementations should be consistent:


🏁 Script executed:

#!/bin/bash
# Check how DateRange is used across the codebase
rg -A 5 -B 5 "DateRange" --glob "*.py"

Length of output: 6281


DateRange implementations intentionally differ
The two DateRange models serve distinct domains and thus use different types:

  • In datamanager, start/end are datetime.date (daily‐aggregated data).
  • In positionmanager, start/end are full datetime (time‐window queries).

No changes required—this difference is intentional.

Likely an incorrect or invalid review comment.

.mise.toml (2)

51-58: LGTM - Well-defined behave test task

The new python:test:behave task is well structured and aligns with the BDD testing approach introduced in the pull request.


80-86: LGTM - Useful local development task

The application:dev task provides a convenient way to run the application locally with hot reloading, which is great for development workflow.

application/datamanager/features/steps/equity_bars_steps.py (1)

27-33: POST currently ignores the end-date context

Only start_date is sent in the body:

json={"date": context.start_date}

Confirm whether the scenario intends a single date or a range, and adjust the
payload accordingly.

🧰 Tools
🪛 Ruff (0.11.9)

28-28: Redefinition of unused step_impl from line 16

(F811)

application/datamanager/src/datamanager/main.py (1)

88-104: Helpful 404 body

When no rows are found, the 404 has no body or headers. Clients may benefit
from a JSON { "detail": "no bars found" }.

@forstmeier
Collaborator

@chrisaddy are you addressing any of the CodeRabbit comments or are they resolved?


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (2)
application/datamanager/src/datamanager/main.py (2)

18-18: Fix incorrect return type annotation

The function bars_query returns a SQL query string, not a Polars DataFrame as indicated by the type annotation.

-def bars_query(*, bucket: str, start_date: date, end_date: date) -> pl.DataFrame:
+def bars_query(*, bucket: str, start_date: date, end_date: date) -> str:

141-159: Return JSON response instead of 204 No Content

The current DELETE endpoint returns a 204 No Content status, but previous review comments indicate that tests expect a JSON response.

    logger.info(f"deleting {prefix=}")
    bucket.delete_blobs(blobs)

-   return Response(status_code=status.HTTP_204_NO_CONTENT)
+   return {"status": "success", "deleted": len(blobs), "date": summary_date.date.strftime("%Y-%m-%d")}
🧹 Nitpick comments (1)
workflows/backfill_datamanager.py (1)

1-22: Consider adding logging and progress tracking

For a backfill workflow that might run over a long date range, it would be helpful to add logging and progress tracking.

from datetime import date, timedelta
from typing import List

import httpx
from flytekit import task, workflow
+import logging

+logger = logging.getLogger(__name__)

@task(retries=3)
def backfill_single_date(base_url: str, day: date) -> int:
+    logger.info(f"Backfilling data for {day.isoformat()}")
    response = httpx.post(f"{base_url}/equity-bars", json={"date": day.isoformat()})
    response.raise_for_status()
-    return response.json().get("count", 0)
+    count = response.json().get("count", 0)
+    logger.info(f"Successfully backfilled {count} records for {day.isoformat()}")
+    return count

@workflow
def backfill_equity_bars(base_url: str, start_date: date, end_date: date) -> List[int]:
+    logger.info(f"Starting backfill from {start_date.isoformat()} to {end_date.isoformat()}")
    results: List[int] = []
    current = start_date
+    total_days = (end_date - start_date).days + 1
+    day_count = 0
    while current <= end_date:
+        day_count += 1
+        logger.info(f"Processing day {day_count}/{total_days} - {current.isoformat()}")
        results.append(backfill_single_date(base_url=base_url, day=current))
        current += timedelta(days=1)
+    logger.info(f"Backfill complete. Processed {len(results)} days with a total of {sum(results)} records")
    return results
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd0c0a2 and f0f0902.

⛔ Files ignored due to path filters (1)
  • .flox/env/manifest.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/test.yaml (1 hunks)
  • .mise.toml (4 hunks)
  • application/datamanager/features/steps/equity_bars_steps.py (1 hunks)
  • application/datamanager/features/steps/health_steps.py (1 hunks)
  • application/datamanager/src/datamanager/config.py (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • application/datamanager/src/datamanager/models.py (1 hunks)
  • infrastructure/Pulumi.yaml (1 hunks)
  • infrastructure/cloud_run.py (2 hunks)
  • pyproject.toml (1 hunks)
  • workflows/backfill_datamanager.py (1 hunks)
  • workflows/pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • infrastructure/Pulumi.yaml
  • workflows/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (7)
  • pyproject.toml
  • application/datamanager/features/steps/health_steps.py
  • application/datamanager/src/datamanager/models.py
  • infrastructure/cloud_run.py
  • application/datamanager/src/datamanager/config.py
  • application/datamanager/features/steps/equity_bars_steps.py
  • .mise.toml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/datamanager/src/datamanager/main.py (2)
application/datamanager/src/datamanager/config.py (2)
  • Settings (41-43)
  • daily_bars_path (18-19)
application/datamanager/src/datamanager/models.py (2)
  • BarsSummary (38-40)
  • SummaryDate (5-22)
🔇 Additional comments (5)
.github/workflows/test.yaml (1)

11-12:


❓ Verification inconclusive

Confirm the OS matrix update is intentional and sufficient

The matrix now only includes ubuntu-latest, removing macos-latest. Please verify that:

  • All unit, integration, and Behave-based end-to-end tests (including Docker‐based scenarios) execute without macOS-specific adjustments.
  • Dropping macOS aligns with your support and coverage requirements.

To ensure no residual references exist, you can run:


🏁 Script executed:

#!/bin/bash
rg "macos-latest" -n .github/workflows

Length of output: 233


Confirm OS matrix consistency across workflows

It looks like you’ve limited test.yaml to ubuntu-latest, but quality.yaml still runs on both ubuntu-latest and macos-latest, and deploy.yaml uses runs-on: macos-latest. Please verify that:

  • Dropping macOS in test jobs is intentional and doesn’t leave coverage gaps for any macOS-specific logic.
  • You truly need macOS runners in quality and deploy workflows; otherwise, remove or align them to your updated matrix.
  • All affected workflows (unit, integration, Behave end-to-end, Docker scenarios, deployments) behave correctly on the specified OS runners.

Pinpointed locations:

  • .github/workflows/test.yaml (lines 11–12): `# os: [macos-latest]` → `os: [ubuntu-latest]`
  • .github/workflows/quality.yaml (line 11): os: [ubuntu-latest, macos-latest]
  • .github/workflows/deploy.yaml (line 10): runs-on: macos-latest

To find all remaining references, you can run:

rg "macos-latest" -n .github/workflows
application/datamanager/src/datamanager/main.py (4)

19-35: LGTM: Well-structured query for partitioned data

The query correctly uses Hive partitioning to filter Parquet files by year, month, and day columns, with appropriate date range logic.


38-62: Well-structured lifespan context manager for FastAPI

The lifespan setup correctly initializes GCS bucket access and configures DuckDB with HTTPFS and GCS credentials. The use of environment variables for secrets is appropriate.


67-69: LGTM: Simple health check endpoint

The health check endpoint is appropriately implemented to return a 200 OK status without a body, which is a common pattern for health checks.


72-107: Robust error handling and Arrow IPC response format

The implementation correctly handles the case of empty results with a 404 response, includes comprehensive error logging, and returns data in an efficient Apache Arrow IPC format with useful metadata headers.

@chrisaddy chrisaddy force-pushed the data-fetcher branch 3 times, most recently from d800417 to 55681d5 on May 22, 2025 at 02:41
chrisaddy added 5 commits May 21, 2025 22:47
Implement integration testing and lifespan config
some fixes from codex

Refine bucket management steps

Merge pull request #537 from pocketsizefund/codex/create-behave-tests-for-data-manager-api

Add behave tests for bucket management
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

♻️ Duplicate comments (2)
application/datamanager/src/datamanager/query.py (1)

21-28: ⚠️ Potential issue

Fix the incomplete exception handling

The function has a syntax error in the exception handling block. The return statement after the except clause doesn't specify a return value and lacks proper indentation.

Apply this fix:

async def query_bars(*, filepaths: list[Path]) -> DataFrame:
    query = f"""
    SELECT * FROM read_parquet({filepaths})
    """
    try:
        return duckdb.sql(query).pl()
    except duckdb.duckdb.HTTPException as e:
-    return 
+        # Log the error for debugging
+        print(f"HTTP error occurred: {e}")
+        # Return an empty DataFrame
+        return DataFrame()
🧰 Tools
🪛 Ruff (0.11.9)

28-28: SyntaxError: Expected an indented block after except clause

application/datamanager/src/datamanager/config.py (1)

28-32: ⚠️ Potential issue

Fail fast – but gracefully – when credentials are missing

Opening GOOGLE_APPLICATION_CREDENTIALS at import-time will crash the service with FileNotFoundError if the env-var is unset or the file is absent.

Move the check inside the cached property and raise an explanatory RuntimeError:

@cached_property
def _creds(self) -> dict:
-    with open(self.credentials_path) as f:
-        creds = json.load(f)
-    return creds
+    if not self.credentials_path or not os.path.isfile(self.credentials_path):
+        raise RuntimeError(
+            "GOOGLE_APPLICATION_CREDENTIALS not set or file cannot be found."
+        )
+    try:
+        with open(self.credentials_path) as f:
+            return json.load(f)
+    except (OSError, json.JSONDecodeError) as exc:
+        raise RuntimeError(
+            f"Unable to load GCP credentials from {self.credentials_path}"
+        ) from exc
🧹 Nitpick comments (6)
workflows/prediction_model.py (1)

11-20: Consider more robust error handling for API requests

The function correctly uses a timeout and raises HTTP errors, but there's no handling for network errors or malformed responses.

Consider enhancing error handling:

@task
def fetch_data(start_date: datetime, end_date: datetime) -> List[Dict[str, Any]]:
    base_url = os.getenv("DATAMANAGER_BASE_URL", "http://localhost:8080")
-    response = requests.get(
-        f"{base_url}/equity-bars",
-        params={"start_date": start_date.isoformat(), "end_date": end_date.isoformat()},
-        timeout=10,
-    )
-    response.raise_for_status()
-    return response.json().get("data", [])
+    try:
+        response = requests.get(
+            f"{base_url}/equity-bars",
+            params={"start_date": start_date.isoformat(), "end_date": end_date.isoformat()},
+            timeout=10,
+        )
+        response.raise_for_status()
+        return response.json().get("data", [])
+    except requests.RequestException as e:
+        print(f"Error fetching data: {e}")
+        raise RuntimeError(f"Failed to fetch data: {e}") from e
application/datamanager/features/environment.py (1)

1-6: Consider adding docstrings and environment configuration validation

The current implementation lacks comprehensive documentation and validation for the environment configuration.

Consider enhancing the module with better documentation and validation:

import os


def before_all(context):
    """Set up test environment."""
    context.base_url = os.environ.get("BASE_URL", "http://datamanager:8080")
+    
+    # Validate the base URL
+    if not context.base_url.startswith(("http://", "https://")):
+        raise ValueError(f"Invalid base URL: {context.base_url}. Must start with http:// or https://")
+    
+    print(f"Using base URL: {context.base_url}")
application/datamanager/src/datamanager/query.py (1)

7-18: Enhance error handling and add validation for bucket name

The function assumes the bucket name is valid and doesn't handle potential errors in path construction.

Consider enhancing the function with validation and error handling:

async def list_file_paths(
    *, bucket: str, start_date: date, end_date: date
) -> list[Path]:
+    if not bucket:
+        raise ValueError("Bucket name cannot be empty")
+    
+    if start_date > end_date:
+        raise ValueError(f"Start date ({start_date}) cannot be after end date ({end_date})")
+
    filepaths = []
    current = start_date
    while current <= end_date:
        filepaths.append(
            f"gs://{bucket}/equity/bars/{current.strftime('%Y-%m-%d')}/data.parquet"
        )
        current += timedelta(days=1)

    return filepaths
application/datamanager/src/datamanager/config.py (3)

1-6: Remove unused imports

Several imported modules are not used in the code.

Apply this diff to remove the unused imports:

import os
import json
from functools import cached_property
-from datetime import date, datetime
from pydantic import BaseModel, Field, computed_field
-from loguru import logger
🧰 Tools
🪛 Ruff (0.11.9)

4-4: datetime.date imported but unused

Remove unused import

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


6-6: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)


9-12: Consider renaming to PolygonClient as suggested in previous review

The class name could be more descriptive to indicate it's a client configuration.

As suggested in the previous review, consider renaming the class to better represent its purpose:

-class Polygon(BaseModel):
+class PolygonClient(BaseModel):
    api_key: str = Field(default=os.getenv("POLYGON_API_KEY"))
    base_url: str = "https://api.polygon.io"
    daily_bars: str = "/v2/aggs/grouped/locale/us/market/stocks/"

Don't forget to update the Settings class reference on line 45.


15-21: Consider renaming to GCSBucket as suggested in previous review

The current name doesn't clearly indicate it's specifically for Google Cloud Storage.

As suggested in the previous review, consider renaming the class:

-class Bucket(BaseModel):
+class GCSBucket(BaseModel):
    name: str = Field(default=os.getenv("DATA_BUCKET"))
    project: str = Field(default=os.getenv("GCP_PROJECT"))

    @computed_field
    def daily_bars_path(self) -> str:
        return f"gs://{self.name}/equity/bars/"

Don't forget to update the GCP class reference on line 25.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0f0902 and eb0b0b5.

⛔ Files ignored due to path filters (2)
  • .flox/env/manifest.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (18)
  • .gitignore (1 hunks)
  • .mise.toml (3 hunks)
  • application/datamanager/compose.yaml (1 hunks)
  • application/datamanager/features/environment.py (1 hunks)
  • application/datamanager/features/equity_bars.feature (1 hunks)
  • application/datamanager/features/steps/equity_bars_steps.py (1 hunks)
  • application/datamanager/features/steps/health_steps.py (1 hunks)
  • application/datamanager/pyproject.toml (2 hunks)
  • application/datamanager/src/datamanager/bars.sql (1 hunks)
  • application/datamanager/src/datamanager/config.py (1 hunks)
  • application/datamanager/src/datamanager/main.py (1 hunks)
  • application/datamanager/src/datamanager/models.py (1 hunks)
  • application/datamanager/src/datamanager/query.py (1 hunks)
  • infrastructure/cloud_run.py (2 hunks)
  • pyproject.toml (1 hunks)
  • workflows/backfill_datamanager.py (1 hunks)
  • workflows/prediction_model.py (1 hunks)
  • workflows/pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • application/datamanager/src/datamanager/bars.sql
  • application/datamanager/compose.yaml
🚧 Files skipped from review as they are similar to previous changes (10)
  • pyproject.toml
  • .gitignore
  • application/datamanager/features/steps/health_steps.py
  • application/datamanager/pyproject.toml
  • workflows/pyproject.toml
  • infrastructure/cloud_run.py
  • application/datamanager/features/equity_bars.feature
  • workflows/backfill_datamanager.py
  • application/datamanager/features/steps/equity_bars_steps.py
  • application/datamanager/src/datamanager/models.py
🧰 Additional context used
🧠 Learnings (1)
application/datamanager/src/datamanager/main.py (1)
Learnt from: chrisaddy
PR: pocketsizefund/pocketsizefund#535
File: application/datamanager/src/datamanager/main.py:147-147
Timestamp: 2025-05-22T02:26:11.263Z
Learning: In the datamanager service, debug-style f-string syntax (e.g., `{year=}`) is intentionally used for constructing GCS path prefixes to match Hive-style partitioned directory structures for data storage and retrieval.
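As a quick illustration of why that syntax works here (the values and path layout below are made up for the example):

```python
# The `{name=}` debug f-string renders as "name=value", which happens to
# coincide with Hive-style partition directory names.
year, month, day = 2025, 5, 22
prefix = f"equity/bars/{year=}/{month=:02d}/{day=:02d}"
print(prefix)  # equity/bars/year=2025/month=05/day=22
```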
🧬 Code Graph Analysis (1)
application/datamanager/src/datamanager/main.py (2)
application/datamanager/src/datamanager/config.py (2)
  • Settings (43-45)
  • daily_bars_path (20-21)
application/datamanager/src/datamanager/models.py (2)
  • BarsSummary (37-39)
  • SummaryDate (5-21)
🪛 Ruff (0.11.9)
application/datamanager/src/datamanager/query.py

28-28: SyntaxError: Expected an indented block after except clause

workflows/prediction_model.py

27-27: Undefined name statistics

(F821)


29-29: Undefined name pickle

(F821)

application/datamanager/src/datamanager/main.py

144-147: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

application/datamanager/src/datamanager/config.py

4-4: datetime.date imported but unused

Remove unused import

(F401)


4-4: datetime.datetime imported but unused

Remove unused import

(F401)


6-6: loguru.logger imported but unused

Remove unused import: loguru.logger

(F401)

🪛 GitHub Actions: Test and coverage check
.mise.toml

[error] 44-44: TOML parse error: invalid table header due to duplicate key "python:test:behave" in table "tasks".

🔇 Additional comments (15)
application/datamanager/features/environment.py (1)

4-6:


✅ Verification successful

Verify the port configuration

Confirm whether port 8080 is the correct port for the datamanager service in the test environment, as questioned in a previous review.


🏁 Script executed:

#!/bin/bash
# Check the port exposed in the Dockerfile and compose.yaml
grep -A 3 EXPOSE application/datamanager/Dockerfile || echo "No EXPOSE in Dockerfile"
grep -A 5 "datamanager:" application/datamanager/compose.yaml || echo "No datamanager service in compose.yaml"

Length of output: 508


Verified port 8080 is correct

The Dockerfile exposes port 8080 and docker-compose.yaml maps 8080:8080. The BASE_URL=http://datamanager:8080 environment variable also matches. The default in before_all therefore aligns with your service configuration—no change needed.

• application/datamanager/Dockerfile: EXPOSE 8080
• application/datamanager/compose.yaml:
ports: - 8080:8080
environment: - BASE_URL=http://datamanager:8080

application/datamanager/src/datamanager/config.py (1)

34-40: 🛠️ Refactor suggestion

Protect against missing credential fields and improve error handling

The current implementation assumes that the credential file always contains a "client_email" field and that it can be properly serialized to JSON.

Add defensive checks to handle missing fields and potential serialization errors:

@computed_field
def key_id(self) -> str | None:
-    return self._creds.get("client_email")
+    try:
+        return self._creds.get("client_email")
+    except Exception as e:
+        print(f"Error accessing client_email: {e}")
+        return None

@computed_field
def secret(self) -> str:
-    return json.dumps(self._creds).replace("'", "\\'")
+    try:
+        return json.dumps(self._creds).replace("'", "\\'")
+    except (TypeError, ValueError) as e:
+        print(f"Error serializing credentials: {e}")
+        return "{}"

Likely an incorrect or invalid review comment.

.mise.toml (5)

1-2: LGTM! Environment variable addition for Compose Bake

Adding COMPOSE_BAKE=true enables Docker Compose's Bake feature, which is appropriate for the new Docker Compose setup in the datamanager service.


57-59: LGTM! Updated Docker build commands

The updated Docker build commands now use application_name consistently, which aligns with the PR's goal of improving configuration management.


72-77: LGTM! New development task

This task allows running the service locally with hot reloading, which is a good addition for improving developer workflows.


79-84: LGTM! New integration test task

This task provides a convenient way to run integration tests using Docker Compose, which aligns with the PR's goal of enhancing testing capabilities.


90-90: LGTM! Updated lint task

The lint task now includes YAML linting with relaxed configuration and a maximum line length of 110 characters, which is appropriate for the new YAML files in the project.

application/datamanager/src/datamanager/main.py (8)

18-35: LGTM! Well-structured SQL query generation

The bars_query function generates a well-formed SQL query for retrieving data from partitioned Parquet files. The date range filtering logic is comprehensive, handling year, month, and day comparisons correctly.
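A hedged sketch of how such a query string might be assembled — the function name echoes the review, but the path layout and the use of DuckDB's `make_date` for range filtering are assumptions, not the repo's actual implementation:

```python
# Illustrative only: builds a DuckDB query over Hive-partitioned Parquet;
# bucket layout and filtering strategy are assumed for this sketch.
from datetime import date


def bars_query(*, bucket: str, start_date: date, end_date: date) -> str:
    # hive_partitioning exposes the year/month/day directory keys as columns
    return f"""
    SELECT *
    FROM read_parquet('gs://{bucket}/equity/bars/**/*.parquet', hive_partitioning = true)
    WHERE make_date(year, month, day)
          BETWEEN DATE '{start_date.isoformat()}' AND DATE '{end_date.isoformat()}'
    """
```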


38-61: LGTM! Robust application lifecycle management

The lifespan context manager properly initializes the application state, including:

  • Loading settings from environment variables
  • Setting up GCS bucket access
  • Configuring DuckDB with proper HTTP and GCS credentials

This approach ensures resources are properly initialized before the application starts handling requests.


67-69: LGTM! Simple health check endpoint

The health check endpoint is appropriately implemented to return a 200 OK status without a response body, which is a common pattern for health check endpoints.


72-107: LGTM! Comprehensive error handling in GET endpoint

The GET endpoint includes proper error handling for:

  • Empty result sets (returning 404)
  • Exceptions during query execution (returning 500)
  • Logging detailed error information

The Arrow IPC format for response data is an efficient choice for transferring tabular data.


1-16: LGTM! Appropriate imports

The imports are well-organized and include all necessary libraries for:

  • Cloud storage integration (Google Cloud Storage)
  • Data processing (Polars, PyArrow)
  • HTTP client (httpx)
  • Web framework (FastAPI)
  • Error handling (traceback)

110-149: LGTM! Well-implemented data ingestion endpoint

The POST endpoint correctly:

  1. Fetches data from the Polygon API
  2. Transforms it into a Polars DataFrame
  3. Adds necessary datetime and partition columns
  4. Writes to partitioned Parquet files in GCS
  5. Includes comprehensive error handling

The error handling with try/except blocks for the write_parquet operation is a good addition based on previous review feedback.

🧰 Tools
🪛 Ruff (0.11.9)

144-147: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


151-165: LGTM! Thorough error handling for DELETE endpoint

The DELETE endpoint properly handles edge cases:

  • Non-existent bucket (returns 404)
  • No matching blobs (returns 404)
  • Successfully deleting blobs

The use of debug-style f-string syntax ({year=}) is intentional for Hive-style partitioning as noted in the retrieved learnings.


166-168:


❌ Incorrect review comment

Mismatch between DELETE endpoint return type and test expectations

The DELETE endpoint currently returns a 204 No Content status, but according to previous review comments, the tests expect a JSON response. This inconsistency should be resolved.

 logger.info(f"deleting {prefix=}")
 bucket.delete_blobs(blobs)
-return Response(status_code=status.HTTP_204_NO_CONTENT)
+return {"status": "success", "deleted": len(blobs), "date": summary_date.date.strftime("%Y-%m-%d")}

Let's verify this is still an issue:


🏁 Script executed:

#!/bin/bash
# Find tests that expect JSON response from DELETE endpoint
grep -r "delete.*json" --include="*.py" --include="*.feature" application/datamanager/features/

Length of output: 218


🏁 Script executed:

#!/bin/bash
sed -n '1,200p' application/datamanager/features/steps/equity_bars_steps.py

Length of output: 2072


No JSON response expected in DELETE endpoint

The DELETE feature tests only verify the HTTP status code and side‐effects (file/bucket deletion), not any response body. There is no call to response.json() or assertions against a JSON payload. Returning a 204 No Content is therefore consistent with the tests and no change is needed.

Likely an incorrect or invalid review comment.

Collaborator

@forstmeier forstmeier left a comment


CodeRabbit caught a couple of things to address.

…-syntax-and-destination-path

Fix datamanager Dockerfile and improve error handling
PR cleanups
@chrisaddy chrisaddy force-pushed the data-fetcher branch 2 times, most recently from 8c103b9 to ae8b56f on May 26, 2025 at 19:02
@chrisaddy chrisaddy merged commit e148432 into master May 28, 2025
5 checks passed
@chrisaddy chrisaddy deleted the data-fetcher branch May 28, 2025 19:47
@coderabbitai coderabbitai Bot mentioned this pull request Jun 2, 2025
