Conversation
Walkthrough

This update introduces a Flyte workflow for model training, adds Docker Compose test tasks, updates dependencies, and modifies configuration files for image tagging and file ignoring. Minor documentation and formatting adjustments are included, along with duplicated environment variables in a Cloud Run configuration.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Flyte
    participant HTTP_API
    participant Storage
    User->>Flyte: Trigger training_workflow(start_date, end_date)
    Flyte->>HTTP_API: fetch_data(start_date, end_date)
    HTTP_API-->>Flyte: Return equity bar data
    Flyte->>Flyte: train_dummy_model(data)
    Flyte->>Storage: store_model(model_bytes)
    Storage-->>Flyte: Return artifact path
    Flyte-->>User: Return model artifact path
```
Actionable comments posted: 12
🔭 Outside diff range comments (1)
application/datamanager/src/datamanager/bars.sql (1)
`1-2`: ⚠️ Potential issue: Empty SQL file requires content
This file appears to be empty with just a newline character. According to the PR summary, SQL queries for reading parquet files were added to the service, but this file doesn't contain any actual SQL.
Please add the intended SQL query content or remove this file if it's not needed.
🧹 Nitpick comments (21)
application/datamanager/pyproject.toml (1)
`24-31`: Development dependencies properly segregated

Good practice to separate development dependencies into their own group. These dependencies support the new testing infrastructure described in the PR objectives, particularly the behave framework for end-to-end testing.
However, note that `polars` is included in both main dependencies (line 10) and dev dependencies (line 30), which appears redundant.

Consider removing the redundant `polars` dependency from the dev group since it's already included in the main dependencies:

```diff
 dev = [
     "httpx>=0.28.1",
     "pytest-asyncio>=0.26.0",
     "behave>=1.2.6",
     "requests>=2.31.0",
-    "polars>=1.29.0",
 ]
```

application/datamanager/features/steps/health_steps.py (1)
`4-5`: Consider using requests.Session for reusable connections

For test efficiency, especially when running multiple tests against the same API, consider using `requests.Session()` instead of direct `requests.get()` calls.

```diff
-import requests
+from requests import Session
+
+@given('the datamanager API is running')
+def step_impl(context):
+    """Set up a session for API requests."""
+    context.session = Session()

 @when('I send a GET request to "{endpoint}"')
 def step_impl(context, endpoint):
     """Send a GET request to the specified endpoint."""
     url = f"{context.api_url}{endpoint}"
-    context.response = requests.get(url)
+    context.response = context.session.get(url)
```

application/datamanager/Dockerfile.test (2)
`1-2`: Python and UV version management needs improvement.

The Dockerfile uses `python:3.13-slim`, which is appropriate, but pulls `uv` from the `latest` tag. Consider pinning to a specific version of UV for better reproducibility and to prevent unexpected changes.

```diff
 FROM python:3.13-slim
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.1.25 /uv /bin/uv
```
`4-11`: Well-structured test container, but consider security improvements.

The Dockerfile follows good practices with appropriate working directory setup, dependency installation, and execution command. However, it runs as root by default.

Consider adding a non-root user for improved security:

```diff
 COPY features /tests
+# Create non-root user for running tests
+RUN groupadd -r testuser && useradd -r -g testuser testuser
+USER testuser
 CMD ["uv", "run", "behave", "features/"]
```

infrastructure/cloud_run.py (1)
`43-49`: Inconsistent formatting in environment variable definitions.

There's inconsistent formatting in the DuckDB environment variable definitions: one has name and value on a single line while the other spreads across multiple lines. Also, there's extra indentation in the closing parenthesis.

```diff
 cloudrun.ServiceTemplateSpecContainerEnvArgs(
-    name="DUCKDB_ACCESS_KEY", value=duckdb_access_key
+    name="DUCKDB_ACCESS_KEY",
+    value=duckdb_access_key,
 ),
 cloudrun.ServiceTemplateSpecContainerEnvArgs(
     name="DUCKDB_SECRET",
     value=duckdb_secret,
-    ),
+),
```

application/datamanager/compose.yaml (1)
`1-24`: Good datamanager service configuration but could benefit from resource limits.

The datamanager service is well configured with proper environment variables, port mapping, and health check. However, it lacks resource constraints, which could lead to resource contention in shared environments.

Consider adding resource limits:

```diff
     healthcheck:
       test: ["CMD", "curl", "-f", "http://0.0.0.0:8080/health"]
       interval: 10s
       timeout: 5s
       retries: 3
       start_period: 1s
+    deploy:
+      resources:
+        limits:
+          cpus: '1.0'
+          memory: 1G
+        reservations:
+          cpus: '0.5'
+          memory: 512M
```

application/datamanager/mise.toml (1)
`1-11`: Update to modern Docker Compose command syntax and add a combined task.

The tasks use the deprecated `docker-compose` command (with hyphen) instead of the current `docker compose` command. Also, consider adding a combined task that runs tests and cleans up regardless of the test result.

```diff
 [tasks."test:docker:behave"]
 description = "Run behave tests with Docker Compose"
 run = """
-docker-compose up --build --abort-on-container-exit
+docker compose up --build --abort-on-container-exit
 """

 [tasks."test:docker:behave:cleanup"]
 description = "Clean up after Docker Compose tests"
 run = """
-docker-compose down -v
+docker compose down -v
 """
+
+[tasks."test:docker:behave:run-and-cleanup"]
+description = "Run behave tests and clean up regardless of result"
+run = """
+docker compose up --build --abort-on-container-exit
+docker compose down -v
+"""
```

application/datamanager/src/datamanager/query.py (1)
`22-24`: SQL query construction could be improved

The current SQL query construction using an f-string with direct variable interpolation is generally discouraged for SQL queries. While DuckDB might handle this safely, consider a more parameterized approach:

```diff
 async def query_bars(*, filepaths: list[Path]) -> DataFrame:
-    query = f"""
-    SELECT * FROM read_parquet({filepaths})
-    """
+    # Use a safer approach for file paths
+    filepaths_str = str(filepaths)
+    query = f"""
+    SELECT * FROM read_parquet({filepaths_str})
+    """
     try:
```

application/datamanager/features/equity_bars.feature (2)
`35-36`: Missing newline at end of file

Best practice is to end files with a newline character.
Add a newline at the end of this file.
`20-22`: Consider expanding test examples

The test only has one example date. Adding more diverse dates could improve test coverage:

```diff
 Examples: dates
   | start_date | end_date   |
   | 2025-05-20 | 2025-05-20 |
+  | 2025-05-20 | 2025-05-21 |
+  | 2025-05-01 | 2025-05-05 |
```

application/datamanager/src/datamanager/models.py (1)
`38-41`: BarsResult is missing a newline at end of file

Best practice is to end files with a newline character.
Add a newline at the end of the file.
.mise.toml (1)
`87-92`: Docker Compose test task is useful but could be improved

The `application:test` task correctly sets up integration tests but lacks cleanup options. Consider adding a flag to allow cleaning up after tests:

```diff
 [tasks."application:test"]
 description = "Run integration tests"
 run = """
 cd application/{{arg(name="application_name")}}
-docker-compose up --build --abort-on-container-exit --remove-orphans
+docker-compose up --build --abort-on-container-exit --remove-orphans {{arg(name="clean", default="")}}
 """
+
+[tasks."application:test:clean"]
+description = "Run integration tests and clean up containers after"
+run = """
+mise run application:test application_name={{arg(name="application_name")}} clean="--rm"
+"""
```

application/datamanager/src/datamanager/config.py (2)
`1-6`: Remove unused imports and avoid dead code

`date`, `datetime`, and `logger` are not referenced. They trigger Ruff F401 and create noise for readers.

```diff
-from datetime import date, datetime
-from loguru import logger
+# from datetime import date, datetime  # Uncomment when actually needed
+# from loguru import logger  # Uncomment when logging is added
```

🧰 Tools
🪛 Ruff (0.11.9)

4-4: `datetime.date` imported but unused. Remove unused import (F401)

4-4: `datetime.datetime` imported but unused. Remove unused import (F401)

6-6: `loguru.logger` imported but unused. Remove unused import: `loguru.logger` (F401)
`38-41`: Escape all dangerous characters or switch to base-64

`secret` only escapes single quotes. Double quotes or back-slashes in the JSON can still break the `CREATE SECRET` statement. Base-64 encoding is safer and keeps SQL readable.

```python
import base64

@computed_field
def secret(self) -> str:
    return base64.b64encode(json.dumps(self._creds).encode()).decode()
```

application/datamanager/features/steps/equity_bars_steps.py (5)
`1-12`: Cull unused imports

`json`, `datetime`, `timedelta`, and `httpx` are never used; remove them to satisfy linting and speed up start-up.

```diff
-import json
-from datetime import datetime, timedelta
-import httpx
```

🧰 Tools
🪛 Ruff (0.11.9)

1-1: `json` imported but unused. Remove unused import: `json` (F401)

4-4: `datetime.datetime` imported but unused. Remove unused import (F401)

4-4: `datetime.timedelta` imported but unused. Remove unused import (F401)

10-10: `httpx` imported but unused. Remove unused import: `httpx` (F401)
`8-9`: Avoid mutating `sys.path` in test code

Manipulating `sys.path` is order-dependent and brittle. Prefer installing the package in editable mode inside the test container or using relative imports.
`34-42`: Typo: `step_imp` → `step_impl`

Keeps naming consistent and simplifies navigation.

```diff
-def step_imp(context, endpoint):
+def step_impl(context, endpoint):
```
`125-130`: Use equality, not substring, when validating dates

`"2024-05-1"` is considered in `"2024-05-10"`. Compare with `==` to prevent false positives.
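A quick illustration of the false positive described above, using the same dates from the comment:

```python
returned_date = "2024-05-10"
expected_date = "2024-05-1"

# Substring check: passes even though the dates differ.
substring_match = expected_date in returned_date
# Equality check: correctly rejects the mismatch.
exact_match = expected_date == returned_date

print(substring_match, exact_match)
# → True False
```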
`142-153`: Local parquet checks break when data is written to GCS

`Path.exists()` cannot verify remote `gs://…` objects. Mock the storage client or rely on the GET endpoint to confirm deletion.

🧰 Tools
🪛 Ruff (0.11.9)

142-142: Redefinition of unused `step_impl` from line 134 (F811)
application/datamanager/src/datamanager/main.py (2)
`12-14`: Trim unused imports

`HTTPException`, `Query`, and `.models.DateRange` are imported but unused. Cleaning them keeps Ruff quiet.

🧰 Tools
🪛 Ruff (0.11.9)

12-12: `fastapi.HTTPException` imported but unused. Remove unused import (F401)

12-12: `fastapi.Query` imported but unused. Remove unused import (F401)

14-14: `.models.DateRange` imported but unused. Remove unused import: `.models.DateRange` (F401)
`41-59`: Reuse config-derived credentials instead of ad-hoc env vars

`Settings.gcp` already exposes `key_id` and `secret`. Relying on separate `DUCKDB_ACCESS_KEY`/`SECRET` duplicates state and can drift.

```python
DUCKDB_ACCESS_KEY = app.state.settings.gcp.key_id
DUCKDB_SECRET = app.state.settings.gcp.secret
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (2)
- `.flox/env/manifest.lock` is excluded by `!**/*.lock`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (23)
- `.gitignore` (1 hunk)
- `.mise.toml` (3 hunks)
- `application/datamanager/Dockerfile` (1 hunk)
- `application/datamanager/Dockerfile.test` (1 hunk)
- `application/datamanager/compose.yaml` (1 hunk)
- `application/datamanager/features/environment.py` (1 hunk)
- `application/datamanager/features/equity_bars.feature` (1 hunk)
- `application/datamanager/features/health.feature` (1 hunk)
- `application/datamanager/features/steps/equity_bars_steps.py` (1 hunk)
- `application/datamanager/features/steps/health_steps.py` (1 hunk)
- `application/datamanager/mise.toml` (1 hunk)
- `application/datamanager/pyproject.toml` (2 hunks)
- `application/datamanager/src/datamanager/bars.sql` (1 hunk)
- `application/datamanager/src/datamanager/config.py` (1 hunk)
- `application/datamanager/src/datamanager/main.py` (1 hunk)
- `application/datamanager/src/datamanager/models.py` (1 hunk)
- `application/datamanager/src/datamanager/query.py` (1 hunk)
- `application/datamanager/tests/test_datamanager_main.py` (0 hunks)
- `infrastructure/buckets.py` (1 hunk)
- `infrastructure/cloud_run.py` (2 hunks)
- `infrastructure/main.py` (0 hunks)
- `infrastructure/pyproject.toml` (0 hunks)
- `pyproject.toml` (1 hunk)
💤 Files with no reviewable changes (3)
- infrastructure/pyproject.toml
- infrastructure/main.py
- application/datamanager/tests/test_datamanager_main.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
infrastructure/buckets.py (1)
application/datamanager/src/datamanager/config.py (1)
- `Bucket` (15-21)
application/datamanager/src/datamanager/models.py (1)
application/positionmanager/src/positionmanager/models.py (2)
- `DateRange` (36-53)
- `check_end_after_start` (42-47)
🪛 Hadolint (2.12.0)
application/datamanager/Dockerfile
[error] 12-12: COPY with more than 2 arguments requires the last argument to end with /
(DL3021)
🪛 Ruff (0.11.9)
application/datamanager/src/datamanager/query.py
28-28: SyntaxError: Expected an indented block after except clause
application/datamanager/src/datamanager/config.py
4-4: datetime.date imported but unused
Remove unused import
(F401)
4-4: datetime.datetime imported but unused
Remove unused import
(F401)
6-6: loguru.logger imported but unused
Remove unused import: loguru.logger
(F401)
application/datamanager/src/datamanager/main.py
12-12: fastapi.HTTPException imported but unused
Remove unused import
(F401)
12-12: fastapi.Query imported but unused
Remove unused import
(F401)
14-14: .models.DateRange imported but unused
Remove unused import: .models.DateRange
(F401)
application/datamanager/features/steps/equity_bars_steps.py
1-1: json imported but unused
Remove unused import: json
(F401)
4-4: datetime.datetime imported but unused
Remove unused import
(F401)
4-4: datetime.timedelta imported but unused
Remove unused import
(F401)
10-10: httpx imported but unused
Remove unused import: httpx
(F401)
28-28: Redefinition of unused step_impl from line 16
(F811)
45-45: Redefinition of unused step_impl from line 28
(F811)
52-52: Redefinition of unused step_impl from line 45
(F811)
59-59: Redefinition of unused step_impl from line 52
(F811)
65-65: Redefinition of unused step_impl from line 59
(F811)
77-77: Redefinition of unused step_impl from line 65
(F811)
87-87: Redefinition of unused step_impl from line 77
(F811)
102-102: Redefinition of unused step_impl from line 87
(F811)
114-114: Redefinition of unused step_impl from line 102
(F811)
134-134: Redefinition of unused step_impl from line 114
(F811)
142-142: Redefinition of unused step_impl from line 134
(F811)
157-157: Redefinition of unused step_impl from line 142
(F811)
🪛 GitHub Actions: Code quality check
application/datamanager/src/datamanager/query.py
[error] 28-28: Failed to parse file: Expected an indented block after except clause.
🔇 Additional comments (21)
.gitignore (1)
`10-11`: Appropriate additions to `.gitignore`

These new entries properly exclude local configuration files that should not be committed to version control, aligning with the enhanced configuration management introduced in this PR.
pyproject.toml (2)
`8-8`: Good addition of pytest-asyncio dependency

Adding pytest-asyncio supports the new asynchronous testing capabilities in the datamanager service, which aligns with the testing framework enhancements mentioned in the PR objectives.
`14-17`: Workspace members reordering

The reordering of workspace members doesn't affect functionality but maintains consistent organization in the project structure.
application/datamanager/pyproject.toml (1)
`12-14`: Appropriate runtime dependencies added

The new dependencies align well with the PR objectives:

- `google-cloud-storage` supports the GCP configurations mentioned in the PR
- `loguru` provides enhanced logging capabilities
- `datamanager` suggests a modular design approach

application/datamanager/features/environment.py (1)
`4-6`: Well-structured behave test environment setup

The `before_all` function follows best practices for setting up the test environment. It appropriately configures the base URL from an environment variable with a sensible default, supporting the containerized testing approach described in the PR.

application/datamanager/Dockerfile (3)
`5-5`: LGTM: Added curl as a dependency

Good addition of curl, which will be useful for health checks and API testing in the container.

`16-16`: LGTM: Updated exposed port

Correctly updated the exposed port from 8000 to 8080 to match the port specified in the entrypoint command.

`18-18`: LGTM: Updated entrypoint to use new module path

The entrypoint correctly reflects the new module structure, with the FastAPI application now located at `datamanager.src.datamanager.main:application`.

application/datamanager/features/health.feature (1)

`1-11`: Well-structured BDD test for health endpoint

This is a clean, well-formatted BDD feature file that follows Gherkin syntax best practices. The scenario effectively tests the health endpoint functionality mentioned in the PR objectives.
application/datamanager/features/steps/health_steps.py (1)
`7-11`: ✅ Verification successful
Well-implemented step definition, but is the assertion step defined elsewhere?
The step definition is clear and follows behave's conventions. However, I don't see the implementation for the "Then the response status code should be 200" step from the feature file.
Please verify that the assertion step is defined in another file, such as a common steps file. You can check with:
🏁 Script executed:
```shell
#!/bin/bash
# Look for the step definition that checks response status code
cd application/datamanager
rg -l "response status code should be" --type py features/
```

Length of output: 125
🏁 Script executed:
```shell
#!/bin/bash
cd application/datamanager
rg -n "@then.*response status code should be" -C2 features/steps/equity_bars_steps.py
```

Length of output: 291
Assertion step found and no action required
I located the corresponding `@then("the response status code should be {status_code}")` step in `application/datamanager/features/steps/equity_bars_steps.py` at lines 44-46. The assertion is already implemented; no missing definitions.

infrastructure/buckets.py (2)
`8-10`: Improved naming consistency and resource identification

Good refactoring to use more explicit and consistent naming patterns, changing from "prod" to "production". This aligns better with the Bucket configuration class shown in the application code.
`10-10`: Cleaner configuration access

Nice improvement to directly use `config.require_secret()` inline rather than storing it in an intermediate variable first.
`12-13`: Secure handling of DuckDB credentials.

The code correctly uses `require_secret` to retrieve sensitive DuckDB credentials, maintaining good security practices.
`36-42`: Environment variable renaming and addition looks good.

Renaming `GCP_GCS_BUCKET` to `GCP_PROJECT` and adding the new `DATA_BUCKET` variable pointing to the production data bucket is appropriate for the refactored architecture.
`25-36`: Test service dependency setup is well-defined.

The tests service properly depends on the datamanager service's health check, ensuring tests only run once the service is ready. The environment variable setup and volume mounting look good.
application/datamanager/features/equity_bars.feature (1)
`1-36`: LGTM - Well-structured BDD tests for equity bars data management

The feature file follows good BDD practices with clear scenarios for testing the equity bars data management functionality. The tests cover the full lifecycle (POST, GET, DELETE) and also include a specific test for handling weekend dates.
application/datamanager/src/datamanager/models.py (1)
`25-36`: ❌ Incorrect review comment
DateRange implementation differs from positionmanager
The DateRange class implementation differs from the version in positionmanager that uses datetime instead of date types.
Verify whether this is intentional or if the implementations should be consistent:
🏁 Script executed:
```shell
#!/bin/bash
# Check how DateRange is used across the codebase
rg -A 5 -B 5 "DateRange" --glob "*.py"
```

Length of output: 6281
DateRange implementations intentionally differ

The two `DateRange` models serve distinct domains and thus use different types:

- In datamanager, `start`/`end` are `datetime.date` (daily-aggregated data).
- In positionmanager, `start`/`end` are full `datetime` (time-window queries).

No changes required; this difference is intentional.
Likely an incorrect or invalid review comment.
.mise.toml (2)
`51-58`: LGTM - Well-defined behave test task

The new `python:test:behave` task is well structured and aligns with the BDD testing approach introduced in the pull request.
`80-86`: LGTM - Useful local development task

The `application:dev` task provides a convenient way to run the application locally with hot reloading, which is great for development workflow.

application/datamanager/features/steps/equity_bars_steps.py (1)
`27-33`: POST currently ignores the end-date context

Only `start_date` is sent in the body:

```python
json={"date": context.start_date}
```

Confirm whether the scenario intends a single date or a range, and adjust the payload accordingly.

🧰 Tools
🪛 Ruff (0.11.9)

28-28: Redefinition of unused `step_impl` from line 16 (F811)
application/datamanager/src/datamanager/main.py (1)
`88-104`: Helpful 404 body

When no rows are found, the 404 has no body or headers. Clients may benefit from a JSON `{"detail": "no bars found"}`.
@chrisaddy are you addressing any of the CodeRabbit comments or are they resolved? |
Actionable comments posted: 4
♻️ Duplicate comments (2)
application/datamanager/src/datamanager/main.py (2)
`18-18`: Fix incorrect return type annotation

The function `bars_query` returns a SQL query string, not a Polars DataFrame as indicated by the type annotation.

```diff
-def bars_query(*, bucket: str, start_date: date, end_date: date) -> pl.DataFrame:
+def bars_query(*, bucket: str, start_date: date, end_date: date) -> str:
```
`141-159`: Return JSON response instead of 204 No Content

The current DELETE endpoint returns a 204 No Content status, but previous review comments indicate that tests expect a JSON response.

```diff
     logger.info(f"deleting {prefix=}")
     bucket.delete_blobs(blobs)
-    return Response(status_code=status.HTTP_204_NO_CONTENT)
+    return {"status": "success", "deleted": len(blobs), "date": summary_date.date.strftime("%Y-%m-%d")}
```
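A sketch of a small helper that builds the suggested JSON body; the function name is hypothetical, but the field names follow the diff above:

```python
from datetime import date


def deletion_summary(deleted: int, day: date) -> dict:
    """Build the JSON body for a successful delete response."""
    return {
        "status": "success",
        "deleted": deleted,
        "date": day.strftime("%Y-%m-%d"),
    }


print(deletion_summary(3, date(2025, 5, 20)))
# → {'status': 'success', 'deleted': 3, 'date': '2025-05-20'}
```

Returning this dict from the FastAPI route gives clients a machine-readable confirmation instead of an empty 204.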
🧹 Nitpick comments (1)
workflows/backfill_datamanager.py (1)
`1-22`: Consider adding logging and progress tracking

For a backfill workflow that might run over a long date range, it would be helpful to add logging and progress tracking.

```diff
 from datetime import date, timedelta
 from typing import List

 import httpx
 from flytekit import task, workflow
+import logging
+
+logger = logging.getLogger(__name__)

 @task(retries=3)
 def backfill_single_date(base_url: str, day: date) -> int:
+    logger.info(f"Backfilling data for {day.isoformat()}")
     response = httpx.post(f"{base_url}/equity-bars", json={"date": day.isoformat()})
     response.raise_for_status()
-    return response.json().get("count", 0)
+    count = response.json().get("count", 0)
+    logger.info(f"Successfully backfilled {count} records for {day.isoformat()}")
+    return count

 @workflow
 def backfill_equity_bars(base_url: str, start_date: date, end_date: date) -> List[int]:
+    logger.info(f"Starting backfill from {start_date.isoformat()} to {end_date.isoformat()}")
     results: List[int] = []
     current = start_date
+    total_days = (end_date - start_date).days + 1
+    day_count = 0
     while current <= end_date:
+        day_count += 1
+        logger.info(f"Processing day {day_count}/{total_days} - {current.isoformat()}")
         results.append(backfill_single_date(base_url=base_url, day=current))
         current += timedelta(days=1)
+    logger.info(f"Backfill complete. Processed {len(results)} days with a total of {sum(results)} records")
     return results
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
- `.flox/env/manifest.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (12)
- `.github/workflows/test.yaml` (1 hunk)
- `.mise.toml` (4 hunks)
- `application/datamanager/features/steps/equity_bars_steps.py` (1 hunk)
- `application/datamanager/features/steps/health_steps.py` (1 hunk)
- `application/datamanager/src/datamanager/config.py` (1 hunk)
- `application/datamanager/src/datamanager/main.py` (1 hunk)
- `application/datamanager/src/datamanager/models.py` (1 hunk)
- `infrastructure/Pulumi.yaml` (1 hunk)
- `infrastructure/cloud_run.py` (2 hunks)
- `pyproject.toml` (1 hunk)
- `workflows/backfill_datamanager.py` (1 hunk)
- `workflows/pyproject.toml` (1 hunk)
✅ Files skipped from review due to trivial changes (2)
- infrastructure/Pulumi.yaml
- workflows/pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (7)
- pyproject.toml
- application/datamanager/features/steps/health_steps.py
- application/datamanager/src/datamanager/models.py
- infrastructure/cloud_run.py
- application/datamanager/src/datamanager/config.py
- application/datamanager/features/steps/equity_bars_steps.py
- .mise.toml
🧰 Additional context used
🧬 Code Graph Analysis (1)
application/datamanager/src/datamanager/main.py (2)
application/datamanager/src/datamanager/config.py (2)
- `Settings` (41-43)
- `daily_bars_path` (18-19)

application/datamanager/src/datamanager/models.py (2)

- `BarsSummary` (38-40)
- `SummaryDate` (5-22)
🔇 Additional comments (5)
.github/workflows/test.yaml (1)
`11-12`: ❓ Verification inconclusive
Confirm the OS matrix update is intentional and sufficient
The matrix now only includes `ubuntu-latest`, removing `macos-latest`. Please verify that:
- All unit, integration, and Behave-based end-to-end tests (including Docker‐based scenarios) execute without macOS-specific adjustments.
- Dropping macOS aligns with your support and coverage requirements.
To ensure no residual references exist, you can run:
🏁 Script executed:
```shell
#!/bin/bash
rg "macos-latest" -n .github/workflows
```

Length of output: 233
Confirm OS matrix consistency across workflows
It looks like you've limited `test.yaml` to `ubuntu-latest`, but `quality.yaml` still runs on both `ubuntu-latest` and `macos-latest`, and `deploy.yaml` uses `runs-on: macos-latest`. Please verify that:
- Dropping macOS in test jobs is intentional and doesn’t leave coverage gaps for any macOS-specific logic.
- You truly need macOS runners in quality and deploy workflows; otherwise, remove or align them to your updated matrix.
- All affected workflows (unit, integration, Behave end-to-end, Docker scenarios, deployments) behave correctly on the specified OS runners.
Pinpointed locations:
.github/workflows/test.yaml(lines 11–12):# os: [macos-latest]→os: [ubuntu-latest].github/workflows/quality.yaml(line 11):os: [ubuntu-latest, macos-latest].github/workflows/deploy.yaml(line 10):runs-on: macos-latestTo find all remaining references, you can run:
rg "macos-latest" -n .github/workflowsapplication/datamanager/src/datamanager/main.py (4)
`19-35`: LGTM: Well-structured query for partitioned data

The query correctly uses Hive partitioning to filter Parquet files by year, month, and day columns, with appropriate date range logic.
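For readers unfamiliar with the pattern, a hypothetical sketch of such a Hive-partitioned DuckDB query; the bucket path, column names, and dates here are assumptions for illustration, not taken from the PR:

```sql
-- Partition columns year/month/day come from the directory layout
-- (hive_partitioning = true), so DuckDB can prune whole files
-- before reading any row data.
SELECT *
FROM read_parquet('gs://data-bucket/equity/bars/**/*.parquet',
                  hive_partitioning = true)
WHERE make_date(year, month, day)
      BETWEEN DATE '2025-05-01' AND DATE '2025-05-20';
```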
`38-62`: Well-structured lifespan context manager for FastAPI

The lifespan setup correctly initializes GCS bucket access and configures DuckDB with HTTPFS and GCS credentials. The use of environment variables for secrets is appropriate.
`67-69`: LGTM: Simple health check endpoint

The health check endpoint is appropriately implemented to return a 200 OK status without a body, which is a common pattern for health checks.
`72-107`: Robust error handling and Arrow IPC response format

The implementation correctly handles the case of empty results with a 404 response, includes comprehensive error logging, and returns data in an efficient Apache Arrow IPC format with useful metadata headers.
prediction pipeline
…-destination-path
Actionable comments posted: 5
♻️ Duplicate comments (2)
application/datamanager/src/datamanager/query.py (1)
`21-28`: ⚠️ Potential issue: Fix the incomplete exception handling

The function has a syntax error in the exception handling block. The `return` statement after the `except` clause doesn't specify a return value and lacks proper indentation. Apply this fix:

```diff
 async def query_bars(*, filepaths: list[Path]) -> DataFrame:
     query = f"""
     SELECT * FROM read_parquet({filepaths})
     """
     try:
         return duckdb.sql(query).pl()
     except duckdb.duckdb.HTTPException as e:
-        return
+        # Log the error for debugging
+        print(f"HTTP error occurred: {e}")
+        # Return an empty DataFrame
+        return DataFrame()
```

🧰 Tools
🪛 Ruff (0.11.9)

28-28: SyntaxError: Expected an indented block after `except` clause

application/datamanager/src/datamanager/config.py (1)
`28-32`: ⚠️ Potential issue: Fail fast, but gracefully, when credentials are missing

Opening `GOOGLE_APPLICATION_CREDENTIALS` at import-time will crash the service with `FileNotFoundError` if the env-var is unset or the file is absent. Move the check inside the cached property and raise an explanatory `RuntimeError`:

```diff
 @cached_property
 def _creds(self) -> dict:
-    with open(self.credentials_path) as f:
-        creds = json.load(f)
-    return creds
+    if not self.credentials_path or not os.path.isfile(self.credentials_path):
+        raise RuntimeError(
+            "GOOGLE_APPLICATION_CREDENTIALS not set or file cannot be found."
+        )
+    try:
+        with open(self.credentials_path) as f:
+            return json.load(f)
+    except (OSError, json.JSONDecodeError) as exc:
+        raise RuntimeError(
+            f"Unable to load GCP credentials from {self.credentials_path}"
+        ) from exc
```
🧹 Nitpick comments (6)
workflows/prediction_model.py (1)
`11-20`: Consider more robust error handling for API requests

The function correctly uses a timeout and raises HTTP errors, but there's no handling for network errors or malformed responses. Consider enhancing error handling:

```diff
 @task
 def fetch_data(start_date: datetime, end_date: datetime) -> List[Dict[str, Any]]:
     base_url = os.getenv("DATAMANAGER_BASE_URL", "http://localhost:8080")
-    response = requests.get(
-        f"{base_url}/equity-bars",
-        params={"start_date": start_date.isoformat(), "end_date": end_date.isoformat()},
-        timeout=10,
-    )
-    response.raise_for_status()
-    return response.json().get("data", [])
+    try:
+        response = requests.get(
+            f"{base_url}/equity-bars",
+            params={"start_date": start_date.isoformat(), "end_date": end_date.isoformat()},
+            timeout=10,
+        )
+        response.raise_for_status()
+        return response.json().get("data", [])
+    except requests.RequestException as e:
+        print(f"Error fetching data: {e}")
+        raise RuntimeError(f"Failed to fetch data: {e}") from e
```

application/datamanager/features/environment.py (1)
1-6: Consider adding docstrings and environment configuration validationThe current implementation lacks comprehensive documentation and validation for the environment configuration.
Consider enhancing the module with better documentation and validation:
```diff
 import os

 def before_all(context):
     """Set up test environment."""
     context.base_url = os.environ.get("BASE_URL", "http://datamanager:8080")
+
+    # Validate the base URL
+    if not context.base_url.startswith(("http://", "https://")):
+        raise ValueError(f"Invalid base URL: {context.base_url}. Must start with http:// or https://")
+
+    print(f"Using base URL: {context.base_url}")
```

application/datamanager/src/datamanager/query.py (1)
7-18: Enhance error handling and add validation for bucket name

The function assumes the bucket name is valid and doesn't handle potential errors in path construction.
Consider enhancing the function with validation and error handling:
```diff
 async def list_file_paths(
     *, bucket: str, start_date: date, end_date: date
 ) -> list[Path]:
+    if not bucket:
+        raise ValueError("Bucket name cannot be empty")
+
+    if start_date > end_date:
+        raise ValueError(f"Start date ({start_date}) cannot be after end date ({end_date})")
+
     filepaths = []
     current = start_date
     while current <= end_date:
         filepaths.append(
             f"gs://{bucket}/equity/bars/{current.strftime('%Y-%m-%d')}/data.parquet"
         )
         current += timedelta(days=1)
     return filepaths
```

application/datamanager/src/datamanager/config.py (3)
1-6: Remove unused imports

Several imported modules are not used in the code.
Apply this diff to remove the unused imports:
```diff
 import os
 import json
 from functools import cached_property
-from datetime import date, datetime
 from pydantic import BaseModel, Field, computed_field
-from loguru import logger
```

🧰 Tools
🪛 Ruff (0.11.9)
4-4: `datetime.date` imported but unused

Remove unused import

(F401)

4-4: `datetime.datetime` imported but unused

Remove unused import

(F401)

6-6: `loguru.logger` imported but unused

Remove unused import: `loguru.logger`

(F401)
9-12: Consider renaming to `PolygonClient` as suggested in previous review

The class name could be more descriptive to indicate it's a client configuration.
As suggested in the previous review, consider renaming the class to better represent its purpose:
```diff
-class Polygon(BaseModel):
+class PolygonClient(BaseModel):
     api_key: str = Field(default=os.getenv("POLYGON_API_KEY"))
     base_url: str = "https://api.polygon.io"
     daily_bars: str = "/v2/aggs/grouped/locale/us/market/stocks/"
```

Don't forget to update the `Settings` class reference on line 45.
15-21: Consider renaming to `GCSBucket` as suggested in previous review

The current name doesn't clearly indicate it's specifically for Google Cloud Storage.
As suggested in the previous review, consider renaming the class:
```diff
-class Bucket(BaseModel):
+class GCSBucket(BaseModel):
     name: str = Field(default=os.getenv("DATA_BUCKET"))
     project: str = Field(default=os.getenv("GCP_PROJECT"))

     @computed_field
     def daily_bars_path(self) -> str:
         return f"gs://{self.name}/equity/bars/"
```

Don't forget to update the `GCP` class reference on line 25.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `.flox/env/manifest.lock` is excluded by `!**/*.lock`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (18)
- `.gitignore` (1 hunks)
- `.mise.toml` (3 hunks)
- `application/datamanager/compose.yaml` (1 hunks)
- `application/datamanager/features/environment.py` (1 hunks)
- `application/datamanager/features/equity_bars.feature` (1 hunks)
- `application/datamanager/features/steps/equity_bars_steps.py` (1 hunks)
- `application/datamanager/features/steps/health_steps.py` (1 hunks)
- `application/datamanager/pyproject.toml` (2 hunks)
- `application/datamanager/src/datamanager/bars.sql` (1 hunks)
- `application/datamanager/src/datamanager/config.py` (1 hunks)
- `application/datamanager/src/datamanager/main.py` (1 hunks)
- `application/datamanager/src/datamanager/models.py` (1 hunks)
- `application/datamanager/src/datamanager/query.py` (1 hunks)
- `infrastructure/cloud_run.py` (2 hunks)
- `pyproject.toml` (1 hunks)
- `workflows/backfill_datamanager.py` (1 hunks)
- `workflows/prediction_model.py` (1 hunks)
- `workflows/pyproject.toml` (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- application/datamanager/src/datamanager/bars.sql
- application/datamanager/compose.yaml
🚧 Files skipped from review as they are similar to previous changes (10)
- pyproject.toml
- .gitignore
- application/datamanager/features/steps/health_steps.py
- application/datamanager/pyproject.toml
- workflows/pyproject.toml
- infrastructure/cloud_run.py
- application/datamanager/features/equity_bars.feature
- workflows/backfill_datamanager.py
- application/datamanager/features/steps/equity_bars_steps.py
- application/datamanager/src/datamanager/models.py
🧰 Additional context used
🧠 Learnings (1)
application/datamanager/src/datamanager/main.py (1)
Learnt from: chrisaddy
PR: pocketsizefund/pocketsizefund#535
File: application/datamanager/src/datamanager/main.py:147-147
Timestamp: 2025-05-22T02:26:11.263Z
Learning: In the datamanager service, debug-style f-string syntax (e.g., `{year=}`) is intentionally used for constructing GCS path prefixes to match Hive-style partitioned directory structures for data storage and retrieval.
🧬 Code Graph Analysis (1)
application/datamanager/src/datamanager/main.py (2)
application/datamanager/src/datamanager/config.py (2)
- `Settings` (43-45)
- `daily_bars_path` (20-21)

application/datamanager/src/datamanager/models.py (2)

- `BarsSummary` (37-39)
- `SummaryDate` (5-21)
🪛 Ruff (0.11.9)
application/datamanager/src/datamanager/query.py
28-28: SyntaxError: Expected an indented block after except clause
workflows/prediction_model.py
27-27: Undefined name `statistics`

(F821)

29-29: Undefined name `pickle`

(F821)
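Ruff's F821 findings mean the task body references `statistics` and `pickle` without importing them, so the names are undefined at runtime. A minimal sketch of the kind of fix required — the function body here is illustrative, not the actual `workflows/prediction_model.py` code:

```python
import pickle
import statistics


def train_dummy_model(data: list[dict]) -> bytes:
    # With both modules imported at module top level, the F821 findings go
    # away; the "model" is just summary statistics serialized with pickle.
    closes = [row["close"] for row in data]
    model = {"mean_close": statistics.mean(closes)}
    return pickle.dumps(model)
```
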
application/datamanager/src/datamanager/main.py
144-147: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
application/datamanager/src/datamanager/config.py
4-4: datetime.date imported but unused
Remove unused import
(F401)
4-4: datetime.datetime imported but unused
Remove unused import
(F401)
6-6: loguru.logger imported but unused
Remove unused import: loguru.logger
(F401)
🪛 GitHub Actions: Test and coverage check
.mise.toml
[error] 44-44: TOML parse error: invalid table header due to duplicate key "python:test:behave" in table "tasks".
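The parse error means the same `[tasks."python:test:behave"]` table header appears twice in `.mise.toml`; TOML forbids defining a table more than once. A sketch of the shape of the fix — the task bodies and the replacement name are illustrative, not the actual file contents:

```toml
# Invalid: the same table header defined twice is a TOML parse error.
# [tasks."python:test:behave"]
# run = "behave features/"
# [tasks."python:test:behave"]   # duplicate key -> error on this line
# run = "docker compose run tests"

# Fixed: give the second task a distinct name.
[tasks."python:test:behave"]
run = "behave features/"

[tasks."python:test:behave:compose"]
run = "docker compose run tests"
```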
🔇 Additional comments (15)
application/datamanager/features/environment.py (1)
4-6: Details
✅ Verification successful
Verify the port configuration
Confirm whether port 8080 is the correct port for the datamanager service in the test environment, as questioned in a previous review.
🏁 Script executed:
```bash
#!/bin/bash
# Check the port exposed in the Dockerfile and compose.yaml
grep -A 3 EXPOSE application/datamanager/Dockerfile || echo "No EXPOSE in Dockerfile"
grep -A 5 "datamanager:" application/datamanager/compose.yaml || echo "No datamanager service in compose.yaml"
```

Length of output: 508
Verified port 8080 is correct
The Dockerfile exposes port 8080 and `docker-compose.yaml` maps `8080:8080`. The `BASE_URL=http://datamanager:8080` environment variable also matches. The default in `before_all` therefore aligns with your service configuration; no change needed.

• `application/datamanager/Dockerfile`: `EXPOSE 8080`
• `application/datamanager/compose.yaml`:
  – `ports: - 8080:8080`
  – `environment: - BASE_URL=http://datamanager:8080`

application/datamanager/src/datamanager/config.py (1)
34-40: 🛠️ Refactor suggestion

Protect against missing credential fields and improve error handling
The current implementation assumes that the credential file always contains a "client_email" field and that it can be properly serialized to JSON.
Add defensive checks to handle missing fields and potential serialization errors:
```diff
 @computed_field
 def key_id(self) -> str | None:
-    return self._creds.get("client_email")
+    try:
+        return self._creds.get("client_email")
+    except Exception as e:
+        print(f"Error accessing client_email: {e}")
+        return None

 @computed_field
 def secret(self) -> str:
-    return json.dumps(self._creds).replace("'", "\\'")
+    try:
+        return json.dumps(self._creds).replace("'", "\\'")
+    except (TypeError, ValueError) as e:
+        print(f"Error serializing credentials: {e}")
+        return "{}"
```

Likely an incorrect or invalid review comment.
.mise.toml (5)
1-2: LGTM! Environment variable addition for Compose Bake

Adding `COMPOSE_BAKE=true` enables Docker Compose's Bake feature, which is appropriate for the new Docker Compose setup in the datamanager service.
57-59: LGTM! Updated Docker build commands

The updated Docker build commands now use `application_name` consistently, which aligns with the PR's goal of improving configuration management.
72-77: LGTM! New development task

This task allows running the service locally with hot reloading, which is a good addition for improving developer workflows.
79-84: LGTM! New integration test task

This task provides a convenient way to run integration tests using Docker Compose, which aligns with the PR's goal of enhancing testing capabilities.
90-90: LGTM! Updated lint task

The lint task now includes YAML linting with relaxed configuration and a maximum line length of 110 characters, which is appropriate for the new YAML files in the project.
application/datamanager/src/datamanager/main.py (8)
18-35: LGTM! Well-structured SQL query generation

The `bars_query` function generates a well-formed SQL query for retrieving data from partitioned Parquet files. The date range filtering logic is comprehensive, handling year, month, and day comparisons correctly.
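As a rough illustration of the comparison logic such a query needs (a hypothetical helper, not the actual `bars_query` implementation): filtering Hive partitions by a date range means recombining the year/month/day partition columns into a date before comparing, so that ranges crossing month or year boundaries are handled correctly.

```python
from datetime import date


def partition_filter(start: date, end: date) -> str:
    # Rebuild a DATE from the partition columns so a single BETWEEN covers
    # ranges that span month/year edges, instead of a fragile three-way
    # (year, month, day) comparison chain.
    return (
        "make_date(year, month, day) "
        f"BETWEEN DATE '{start.isoformat()}' AND DATE '{end.isoformat()}'"
    )
```

`make_date` is a DuckDB scalar function; engines without it need the equivalent explicit year/month/day comparisons.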
38-61: LGTM! Robust application lifecycle management

The lifespan context manager properly initializes the application state, including:
- Loading settings from environment variables
- Setting up GCS bucket access
- Configuring DuckDB with proper HTTP and GCS credentials
This approach ensures resources are properly initialized before the application starts handling requests.
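The lifespan pattern described above can be sketched with the standard library alone. This is a minimal stand-in, assuming a plain `app` object rather than the real FastAPI instance and hook:

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def lifespan(app):
    # Set-up runs once before the app serves requests; the values here are
    # placeholders, not the service's real settings.
    app.state = {"settings": {"bucket": "example-bucket"}}
    try:
        yield  # the application handles requests while suspended here
    finally:
        app.state.clear()  # tear-down runs once on shutdown
```

FastAPI accepts such an async context manager via `FastAPI(lifespan=...)`, which is how set-up and tear-down get tied to server start and stop.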
67-69: LGTM! Simple health check endpoint

The health check endpoint is appropriately implemented to return a 200 OK status without a response body, which is a common pattern for health check endpoints.
72-107: LGTM! Comprehensive error handling in GET endpoint

The GET endpoint includes proper error handling for:
- Empty result sets (returning 404)
- Exceptions during query execution (returning 500)
- Logging detailed error information
The Arrow IPC format for response data is an efficient choice for transferring tabular data.
1-16: LGTM! Appropriate imports

The imports are well-organized and include all necessary libraries for:
- Cloud storage integration (Google Cloud Storage)
- Data processing (Polars, PyArrow)
- HTTP client (httpx)
- Web framework (FastAPI)
- Error handling (traceback)
110-149: LGTM! Well-implemented data ingestion endpoint

The POST endpoint correctly:
- Fetches data from the Polygon API
- Transforms it into a Polars DataFrame
- Adds necessary datetime and partition columns
- Writes to partitioned Parquet files in GCS
- Includes comprehensive error handling
The error handling with try/except blocks for the write_parquet operation is a good addition based on previous review feedback.
🧰 Tools
🪛 Ruff (0.11.9)
144-147: Within an `except` clause, raise exceptions with `raise ... from err` or `raise ... from None` to distinguish them from errors in exception handling

(B904)
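A minimal illustration of the B904 fix; the function and names are made up for the example, not taken from the service:

```python
def parse_port(raw: str) -> int:
    try:
        return int(raw)
    except ValueError as err:
        # `from err` chains the original exception as __cause__, so the
        # traceback shows both the root cause and the re-raised error
        # instead of implying the second error happened during handling.
        raise RuntimeError(f"invalid port: {raw!r}") from err
```
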
151-165: LGTM! Thorough error handling for DELETE endpoint

The DELETE endpoint properly handles edge cases:
- Non-existent bucket (returns 404)
- No matching blobs (returns 404)
- Successfully deleting blobs
The use of debug-style f-string syntax (`{year=}`) is intentional for Hive-style partitioning as noted in the retrieved learnings.
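The debug specifier renders as `name=value`, which is exactly the Hive `key=value` directory convention. A small sketch — the path layout is assumed from the review discussion, not copied from the service:

```python
def hive_prefix(year: int, month: int, day: int) -> str:
    # f"{year=}" expands to "year=2025", so the f-string emits key=value
    # path segments that match Hive-style partition directories.
    return f"equity/bars/{year=}/{month=}/{day=}"
```
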
166-168: Details
❌ Incorrect review comment
Mismatch between DELETE endpoint return type and test expectations
The DELETE endpoint currently returns a 204 No Content status, but according to previous review comments, the tests expect a JSON response. This inconsistency should be resolved.
```diff
 logger.info(f"deleting {prefix=}")
 bucket.delete_blobs(blobs)
-return Response(status_code=status.HTTP_204_NO_CONTENT)
+return {"status": "success", "deleted": len(blobs), "date": summary_date.date.strftime("%Y-%m-%d")}
```

Let's verify this is still an issue:
🏁 Script executed:
```bash
#!/bin/bash
# Find tests that expect JSON response from DELETE endpoint
grep -r "delete.*json" --include="*.py" --include="*.feature" application/datamanager/features/
```

Length of output: 218
🏁 Script executed:
```bash
#!/bin/bash
sed -n '1,200p' application/datamanager/features/steps/equity_bars_steps.py
```

Length of output: 2072
No JSON response expected in DELETE endpoint
The DELETE feature tests only verify the HTTP status code and side effects (file/bucket deletion), not any response body. There is no call to `response.json()` or assertions against a JSON payload. Returning a 204 No Content is therefore consistent with the tests and no change is needed.

Likely an incorrect or invalid review comment.
forstmeier
left a comment
There was a problem hiding this comment.
CodeRabbit caught a couple things to address.
…-syntax-and-destination-path Fix datamanager Dockerfile and improve error handling PR cleanups
8c103b9 to
ae8b56f
Compare
This pull request introduces a comprehensive set of changes to enhance the `datamanager` service, including new testing frameworks, configuration updates, and mock services for integration testing. The updates focus on supporting end-to-end testing, improving developer workflows, and adding mock data capabilities. Below is a summary of the most important changes grouped by theme:

Testing Framework and End-to-End Tests
- `behave` end-to-end tests for the `datamanager` service, including scenarios for fetching, storing, retrieving, and deleting equity bars data, as well as health check endpoint tests (`application/datamanager/features/equity_bars.feature`, `application/datamanager/features/health.feature`, [1] [2]).
- Step definitions for the `behave` tests, implementing functionality for API interactions and data validation (`application/datamanager/features/steps/equity_bars_steps.py`, `application/datamanager/features/steps/health_steps.py`, [1] [2]).
- Tasks for running the `behave` tests in a containerized environment (`application/datamanager/mise.toml`, R1-R11).

Mock Services for Testing
- Mock services for integration testing (`application/datamanager/mock/server.py`, `application/datamanager/mock/Dockerfile`, [1] [2]).

Configuration and Dependency Updates
- A `Settings` class using Pydantic to manage environment variables for API keys and GCP configurations (`application/datamanager/src/datamanager/config.py`, R1-R9).
- Updated `pyproject.toml` to include new dependencies for testing (`behave`, `httpx`, `pytest-asyncio`) and Google Cloud Storage support (`application/datamanager/pyproject.toml`, [1] [2]).

Dockerfile and Compose Enhancements
- Updated the `Dockerfile` for the `datamanager` service to include additional dependencies, expose port 8080, and adjust the entry point for running the application (`application/datamanager/Dockerfile`, L2-R18).
- A `Dockerfile.test` for running `behave` tests in a containerized environment (`application/datamanager/Dockerfile.test`, R1-R12).
- A `compose.yaml` file to define services for the `datamanager` application and its tests, including health checks and environment variable configurations (`application/datamanager/compose.yaml`, R1-R42).

SQL and Utility Updates
- A SQL file for the `datamanager` service (`application/datamanager/src/datamanager/bars.sql`, R1).

These changes collectively improve the robustness, testability, and developer experience of the `datamanager` service.

Summary by CodeRabbit