
Create initial datamanager helper classes #618

Merged
forstmeier merged 8 commits into master from 08-20-create_initial_datamanager_helper_classes
Aug 23, 2025

Conversation

@forstmeier
Collaborator

@forstmeier forstmeier commented Aug 21, 2025

Overview

Changes

  • build S3Client helper class
  • build AlpacaClient helper class

Comments

This will probably change a bit, but it's basically a slimmed-down port of the logic we had in the earlier datamanager iteration.

Summary by CodeRabbit

  • New Features

    • Fetch end-of-day US equity data from Alpaca.
    • S3-backed equity bars storage with partitioned Parquet writes and date-range reads.
    • New Docker Swarm-based deployment and stack (Traefik, Prometheus, Grafana, monitoring) and orchestration scripts.
  • Tests

    • Unit tests for Alpaca data fetching (valid data, missing bars, date mismatches).
    • Unit tests for S3 read/write behavior with mocked I/O.
  • Chores

    • Updated dependencies and packaging; toolchain pinned to Python 3.12 and refined test build sync.
    • .gitignore and local settings adjusted.

Copilot AI review requested due to automatic review settings August 21, 2025 02:13
@coderabbitai
Contributor

coderabbitai Bot commented Aug 21, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds Datamanager Alpaca and S3 clients with unit tests and packaging updates; updates test Dockerfile. Replaces large Pulumi/EKS-based infra with Docker Swarm/Lightsail stack and Nu orchestration, removes many Pulumi modules, and adds Docker Swarm stack, Prometheus config, and helper scripts/settings.

Changes

Cohort / File(s) Change summary
Test Docker image
Dockerfile.tests
Base image pinned python:3.13 → python:3.12.10; removed ENV PYTEST_ADDOPTS; replaced RUN uv sync --all-packages --dev → RUN uv sync --all-packages --all-groups; added an extra RUN uv sync --all-packages --all-groups after COPY libraries/.
Datamanager packaging
applications/datamanager/pyproject.toml
Expanded dependencies to include loguru, alpaca-py, boto3, duckdb alongside internal; added [tool.uv] with package = true and src = ["src"].
Datamanager clients (new)
applications/datamanager/src/datamanager/alpaca_client.py, applications/datamanager/src/datamanager/s3_client.py
Added AlpacaClient to fetch EOD US equity snapshots (chunked requests, timezone/date checks, returns Polars DF). Added S3Client that configures DuckDB httpfs, writes Parquet partitioned by year/month/day to S3, and reads ranges via DuckDB.
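The chunked-request behavior described above (split the symbol universe into ≤100-symbol batches, pause between requests) can be sketched in plain Python. This is a minimal illustration, not the shipped client: the names `chunk_symbols`, `fetch_all`, and the `fetch_chunk` callback are hypothetical, and the chunk size and sleep interval are assumptions drawn from the walkthrough.

```python
from time import sleep

CHUNK_SIZE = 100        # per-request symbol cap (assumption from the walkthrough)
RATE_LIMIT_SLEEP = 0.5  # seconds between requests (assumption)

def chunk_symbols(symbols: list[str], size: int = CHUNK_SIZE) -> list[list[str]]:
    """Split a symbol universe into request-sized batches."""
    return [symbols[i : i + size] for i in range(0, len(symbols), size)]

def fetch_all(symbols: list[str], fetch_chunk) -> list[dict]:
    """Fetch each batch via the supplied callback, sleeping between
    requests to stay under the API rate limit."""
    records: list[dict] = []
    for chunk in chunk_symbols(symbols):
        records.extend(fetch_chunk(chunk))  # e.g. a wrapper around the snapshot API
        sleep(RATE_LIMIT_SLEEP)
    return records
```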
Datamanager tests (new)
applications/datamanager/tests/test_alpaca_client.py, applications/datamanager/tests/test_s3_client.py
Added unit tests for AlpacaClient (normal, missing daily_bar, wrong date) and S3Client (write/read/empty-write), mocking external services and I/O.
Claude/local settings
.claude/settings.local.json, infrastructure/.claude/settings.local.json
Added and expanded local Claude permission allow-lists and defaults; new infra-local Claude settings include Bash permissions and additionalDirectories.
Flox / build manifest
.flox/env/manifest.toml
Replaced awscli mapping with awscli2 and introduced cargo.pkg-path = "cargo".
Repo ignore & tooling
.gitignore, infrastructure/.gitignore, infrastructure/.python-version
Modified .gitignore (removed some ignores, added infrastructure/swarm.pem); infra .gitignore added *.pyc, venv/; added infrastructure/.python-version pinning 3.12.
Project misc / scripts
.mise.toml, infrastructure/main.nu, infrastructure/upload_grafana_dashboard.nu (removed)
Reworked mise tasks (infrastructure:up now script-driven); added new Nu script infrastructure/main.nu providing create-contexts, launch-stacks, and an "infrastructure up" orchestration; removed previous upload_grafana_dashboard.nu.
Pulumi infra rework
infrastructure/__main__.py, infrastructure/pyproject.toml, infrastructure/requirements.txt
Replaced AWS/EKS/Knative Pulumi orchestration with Lightsail-based Docker Swarm bootstrap in __main__.py; updated infra pyproject metadata and dependencies (pulumi, pulumi-aws, pulumi-command, pulumi-tls); added requirements pins.
Pulumi modules removed
infrastructure/api.py, infrastructure/cluster.py, infrastructure/ingress.py, infrastructure/services.py, infrastructure/keys.py, infrastructure/monitors.py, infrastructure/vpc.py, infrastructure/tags.py
Deleted multiple Pulumi helper modules and their exported functions/variables that provisioned VPC, EKS, ALB/API Gateway, Knative services, Grafana upload, AMP scraper, IAM keys, and tagging.
Monitoring / stack
infrastructure/stack.yml, infrastructure/prometheus.yml, infrastructure/grafana-dashboard.json (removed)
Added Docker Swarm stack manifest stack.yml (Traefik, Grafana, Prometheus, node-exporter, cadvisor, Portainer) and Prometheus config prometheus.yml; removed the previous Grafana JSON dashboard file.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Caller
  participant AC as AlpacaClient
  participant AssetsAPI as Alpaca Assets API
  participant SnapAPI as Alpaca Snapshot API
  participant TZ as Timezone Normalizer
  participant DF as Polars

  Caller->>AC: fetch_latest_data(current_date)
  AC->>AssetsAPI: Get active US equity assets
  AssetsAPI-->>AC: symbols[]
  loop for each chunk (≤100)
    AC->>SnapAPI: Request snapshots(chunk, feed=iex)
    SnapAPI-->>AC: snapshots
    AC->>TZ: Normalize daily_bar timestamp → America/New_York
    TZ-->>AC: local_date
    alt daily_bar present AND local_date == current_date
      AC->>DF: Append record (ticker, ts_ms, o/h/l/c, vol, vwap)
    else
      AC-->>AC: Skip / log
    end
    AC-->>AC: Sleep (rate limit)
  end
  AC-->>Caller: Polars DataFrame
sequenceDiagram
  autonumber
  actor Caller
  participant S3C as S3Client
  participant DuckDB as DuckDB+HTTPFS
  participant S3 as S3
  participant DF as Polars

  rect rgb(230,245,255)
    note right of S3C: Write flow
    Caller->>S3C: write_equity_bars_data(df)
    S3C->>DF: add dt, extract year/month/day
    S3C->>S3: write_parquet(path=s3://.../equity/bars/, partition_by=[year,month,day])
  end

  rect rgb(240,255,240)
    note right of S3C: Read flow
    Caller->>S3C: read_equity_bars_data(start_date,end_date)
    S3C->>DuckDB: read_parquet('s3://.../**/*.parquet', hive_partitioning=1)
    DuckDB->>DuckDB: apply WHERE year/month/day range filter
    DuckDB-->>S3C: results (Pandas/Arrow)
    S3C-->>Caller: Polars DataFrame
  end
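The read flow in the diagram above amounts to building a glob path over the hive-style year/month/day partitions and a date-filtered `read_parquet` query. A minimal query-builder sketch follows; the bucket layout matches the walkthrough, but the function name is illustrative and the `make_date`/`BETWEEN` filter is one possible form of the range predicate, not necessarily the shipped SQL.

```python
def equity_bars_read_query(bucket: str) -> tuple[str, str]:
    """Build the S3 glob path and DuckDB SQL for reading hive-partitioned
    equity bars over a date range (illustrative sketch)."""
    # Matches year=YYYY/month=M/day=D partition directories under the prefix.
    path_pattern = f"s3://{bucket}/equity/bars/**/*.parquet"
    query = (
        "SELECT * FROM read_parquet(?, HIVE_PARTITIONING=1) "
        "WHERE make_date(CAST(year AS INTEGER), CAST(month AS INTEGER), "
        "CAST(day AS INTEGER)) BETWEEN ? AND ?"
    )
    return path_pattern, query
```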

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested reviewers

  • chrisaddy

Poem

A rabbit taps keys with a caffeinated cheer,
Alpaca snapshots hop in, neat rows of the year.
DuckDB digs tunnels to S3 where Parquets lay,
Tests curl in burrows while Swarm greets the day.
Hop, deploy, repeat — data carrots on display! 🐇📊


Collaborator Author

forstmeier commented Aug 21, 2025

@forstmeier forstmeier added this to the Refactor milestone Aug 21, 2025
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces initial datamanager helper classes for collecting and managing financial market data. The implementation creates client classes for integrating with Alpaca API for market data fetching and AWS S3 for data storage.

  • Adds AlpacaClient class for fetching equity market data from Alpaca API
  • Adds S3Client class for reading/writing equity bars data to/from S3 storage
  • Updates project dependencies and Docker configuration to support the new functionality

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

File Description
applications/datamanager/src/datamanager/alpaca_client.py Implements AlpacaClient for fetching latest equity market data with rate limiting and error handling
applications/datamanager/src/datamanager/s3_client.py Implements S3Client for reading/writing partitioned equity bars data using DuckDB and Polars
applications/datamanager/tests/test_alpaca_client.py Comprehensive test suite covering AlpacaClient functionality including edge cases
applications/datamanager/tests/test_s3_client.py Test suite for S3Client covering write/read operations and empty dataframe handling
applications/datamanager/pyproject.toml Updates dependencies to include required packages for Alpaca API, AWS, and data processing
Dockerfile.tests Updates Python version and sync configuration for the test environment


@graphite-app

graphite-app Bot commented Aug 21, 2025

Graphite Automations

"Assign author to pull request" took an action on this PR • (08/21/25)

1 assignee was added to this PR based on John Forstmeier's automation.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (11)
applications/datamanager/pyproject.toml (1)

5-12: Consider Python version consistency across the project.

The pyproject.toml specifies Python 3.12.10, while the Dockerfile.tests uses Python 3.12 (which defaults to the latest 3.12.x). This could lead to subtle differences between local development and CI/CD environments.

Consider aligning the Python version specifications. Either:

  1. Update the Dockerfile to use python:3.12.10 for exact version matching, or
  2. Relax the pyproject.toml requirement to >=3.12,<3.13 for flexibility
-requires-python = "==3.12.10"
+requires-python = ">=3.12,<3.13"
applications/datamanager/tests/test_alpaca_client.py (2)

56-56: Consider using a constant for the is_paper parameter.

The FBT003 noqa comment suggests the linter is warning about boolean positional arguments. Consider defining a constant or using keyword-only arguments for better readability.

-        alpaca_client = AlpacaClient("test-key", "test-secret", True)  # noqa: FBT003
+        alpaca_client = AlpacaClient("test-key", "test-secret", is_paper=True)

1-157: Consider adding tests for error handling and chunk processing.

While the current tests cover the main scenarios well, consider adding tests for:

  1. Exception handling when get_all_assets fails
  2. Chunk processing with more than 100 tickers
  3. Partial chunk failures (when some chunks succeed and others fail)
  4. VWAP being None scenario

Would you like me to generate additional test cases to cover these scenarios?

applications/datamanager/src/datamanager/alpaca_client.py (2)

26-26: Consider making rate limit configurable.

The hardcoded rate limit of 0.5 seconds might not be optimal for all API tiers. Consider making this configurable via an environment variable or constructor parameter.

     def __init__(
         self,
         api_key: str,
         api_secret: str,
         is_paper: bool,  # noqa: FBT001
+        rate_limit_sleep: float = 0.5,
     ) -> None:
-        self.rate_limit_sleep = 0.5  # seconds
+        self.rate_limit_sleep = rate_limit_sleep  # seconds

62-63: Consider validating chunk_size parameter.

The chunk size of 100 is hardcoded. Consider making it configurable and validate it's a positive integer.

Add chunk_size as a class attribute or method parameter:

 class AlpacaClient:
+    DEFAULT_CHUNK_SIZE = 100
+    
     def __init__(
         self,
         api_key: str,
         api_secret: str,
         is_paper: bool,  # noqa: FBT001
+        chunk_size: int = DEFAULT_CHUNK_SIZE,
     ) -> None:
         self.rate_limit_sleep = 0.5  # seconds
+        if chunk_size <= 0:
+            raise ValueError("chunk_size must be positive")
+        self.chunk_size = chunk_size

Then use self.chunk_size instead of the hardcoded value.

applications/datamanager/src/datamanager/s3_client.py (1)

68-70: Consider memory efficiency for large datasets.

The fetchdf() method loads all data into memory as a pandas DataFrame before converting to Polars. For large date ranges, this could cause memory issues.

Consider using DuckDB's arrow interface for more efficient memory usage:

-        result = self.duckdb_connection.execute(query).fetchdf()
-
-        return pl.from_pandas(result)
+        result = self.duckdb_connection.execute(query).arrow()
+        return pl.from_arrow(result)
applications/datamanager/tests/test_s3_client.py (5)

11-14: Patch at the point of use to make mocks reliable and future-proof.

Patch where the symbols are looked up (datamanager.s3_client.*) to avoid leaky mocks if import bindings change.

-        patch("boto3.Session") as mock_session,
-        patch("duckdb.connect") as mock_duckdb_connect,
+        patch("datamanager.s3_client.boto3.Session") as mock_session,
+        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,
         patch("polars.DataFrame.write_parquet") as mock_write_parquet,
-        patch("boto3.Session") as mock_session,
-        patch("duckdb.connect") as mock_duckdb_connect,
+        patch("datamanager.s3_client.boto3.Session") as mock_session,
+        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,
-        patch("boto3.Session") as mock_session,
-        patch("duckdb.connect") as mock_duckdb_connect,
+        patch("datamanager.s3_client.boto3.Session") as mock_session,
+        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,

Also applies to: 47-49, 86-88
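The "patch where the name is looked up" advice can be demonstrated with a toy module; this is a self-contained illustration (the module name `client_mod` and the `math.sqrt` binding stand in for `datamanager.s3_client` and its `boto3`/`duckdb` imports), not code from the PR.

```python
import sys
import types
from unittest.mock import patch

# Build a toy module that binds `sqrt` at import time, analogous to how
# s3_client.py binds names from boto3/duckdb when it is imported.
client_mod = types.ModuleType("client_mod")
exec(
    "from math import sqrt\n"
    "def hypot(x, y):\n"
    "    return sqrt(x * x + y * y)\n",
    client_mod.__dict__,
)
sys.modules["client_mod"] = client_mod

# Patching the *source* module does NOT affect the copied binding:
with patch("math.sqrt", return_value=0.0):
    assert client_mod.hypot(3, 4) == 5.0  # original sqrt still used

# Patching at the *point of use* does:
with patch("client_mod.sqrt", return_value=0.0):
    assert client_mod.hypot(3, 4) == 0.0  # mock intercepted
```

This is why `patch("datamanager.s3_client.boto3.Session")` is more reliable than `patch("boto3.Session")` whenever the client module's import style might change.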


39-43: Strengthen assertions: verify partition columns exist on the DataFrame being written and that DuckDB init sets the region.

This ensures we don’t just check call shape but also the transformed data and initialization side effects.

         mock_write_parquet.assert_called_once()
         call_args = mock_write_parquet.call_args
         assert call_args[1]["file"] == "s3://test-bucket/equity/bars/"
         assert call_args[1]["partition_by"] == ["year", "month", "day"]
+        # also assert that the DF passed into write_parquet contains the partition columns
+        df_written = call_args[0][0]  # 'self' of the bound method
+        assert set(["year", "month", "day"]).issubset(set(df_written.columns))
+
+        # verify DuckDB init SQL applied the region from boto3
+        init_sql = mock_duckdb_conn.execute.call_args_list[0].args[0]
+        assert "SET s3_region='us-east-1'" in init_sql

71-79: Make the read test robust: avoid brittle call_count and assert SQL intent; distinguish init vs. query with side_effect.

Relying on exact call counts is fragile if init is refactored. Assert on SQL content instead and use side_effect to model two different returns.

-        mock_duckdb_conn.execute.return_value = mock_result
+        # first execute() for init returns a dummy, second for query returns a result with fetchdf()
+        mock_duckdb_conn.execute.side_effect = [MagicMock(), mock_result]
@@
-        assert mock_duckdb_conn.execute.call_count == 2  # noqa: PLR2004
-
-        assert isinstance(result, pl.DataFrame)
+        # Assert init SQL and query SQL contents instead of exact counts
+        calls = mock_duckdb_conn.execute.call_args_list
+        assert any("INSTALL httpfs" in str(c.args[0]) for c in calls)
+        assert any("SET s3_region='us-east-1'" in str(c.args[0]) for c in calls)
+        assert "HIVE_PARTITIONING=1" in calls[-1].args[0]
+        assert "s3://test-bucket/equity/bars/*/*/*/*" in calls[-1].args[0]
+
+        # Sanity-check conversion to Polars and expected columns propagated
+        assert isinstance(result, pl.DataFrame)
+        expected_cols = {
+            "ticker",
+            "timestamp",
+            "open_price",
+            "high_price",
+            "low_price",
+            "close_price",
+            "volume",
+            "volume_weighted_average_price",
+            "year",
+            "month",
+            "day",
+        }
+        assert expected_cols.issubset(set(result.columns))

84-98: Empty DataFrame test should assert no write occurs.

Currently it only ensures “does not raise.” Tighten the contract by asserting write_parquet is not called.

-    with (
-        patch("boto3.Session") as mock_session,
-        patch("duckdb.connect") as mock_duckdb_connect,
-    ):
+    with (
+        patch("datamanager.s3_client.boto3.Session") as mock_session,
+        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,
+        patch("polars.DataFrame.write_parquet") as mock_write_parquet,
+    ):
@@
         s3_client.write_equity_bars_data(empty_data)
+        mock_write_parquet.assert_not_called()

9-37: Optional: Add edge-case tests for default-region fallback and error propagation.

These will harden behavior around AWS env variability and logging on failures.

# New tests (can be appended to this file)

def test_s3_client_defaults_region_when_boto3_none() -> None:
    with (
        patch("datamanager.s3_client.boto3.Session") as mock_session,
        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,
    ):
        mock_session.return_value.region_name = None
        mock_conn = MagicMock()
        mock_duckdb_connect.return_value = mock_conn

        S3Client("test-bucket")

        init_sql = mock_conn.execute.call_args_list[0].args[0]
        assert "SET s3_region='us-east-1'" in init_sql

def test_s3_client_write_equity_bars_data_propagates_exceptions() -> None:
    with (
        patch("datamanager.s3_client.boto3.Session") as mock_session,
        patch("datamanager.s3_client.duckdb.connect") as mock_duckdb_connect,
        patch("datamanager.s3_client.logger") as mock_logger,
        patch("polars.DataFrame.write_parquet", side_effect=RuntimeError("boom")),
    ):
        mock_session.return_value.region_name = "us-east-1"
        mock_duckdb_connect.return_value = MagicMock()
        s3_client = S3Client("test-bucket")

        df = pl.DataFrame({"timestamp": [1]})
        try:
            s3_client.write_equity_bars_data(df)
            assert False, "Expected exception"
        except RuntimeError:
            pass
        mock_logger.error.assert_called()
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between da0c0a3 and 8fdec8f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • Dockerfile.tests (2 hunks)
  • applications/datamanager/pyproject.toml (1 hunks)
  • applications/datamanager/src/datamanager/alpaca_client.py (1 hunks)
  • applications/datamanager/src/datamanager/s3_client.py (1 hunks)
  • applications/datamanager/tests/test_alpaca_client.py (1 hunks)
  • applications/datamanager/tests/test_s3_client.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
applications/datamanager/tests/test_s3_client.py (1)
applications/datamanager/src/datamanager/s3_client.py (3)
  • S3Client (9-70)
  • write_equity_bars_data (21-44)
  • read_equity_bars_data (46-70)
applications/datamanager/tests/test_alpaca_client.py (1)
applications/datamanager/src/datamanager/alpaca_client.py (2)
  • AlpacaClient (19-141)
  • fetch_latest_data (40-141)
⏰ Context from checks skipped due to timeout of 90000ms. (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (5)
applications/datamanager/tests/test_alpaca_client.py (3)

9-76: Test coverage looks comprehensive for the happy path.

The test effectively validates the core functionality of fetch_latest_data with proper mocking of external dependencies and assertion of all expected fields in the resulting DataFrame.


78-110: Good edge case coverage for missing daily bar data.

The test properly validates that the client handles missing daily bar data gracefully by returning an empty DataFrame.


112-156: Well-structured test for date mismatch scenario.

The test effectively validates that the client filters out equity bars that don't match the requested date.

applications/datamanager/src/datamanager/alpaca_client.py (1)

19-39: Well-structured initialization with proper client setup.

The initialization properly sets up both historical and trading clients with appropriate sandbox/paper mode configuration.

applications/datamanager/tests/test_s3_client.py (1)

41-42: No action needed: Polars and DuckDB already align on Hive-style partitions

Polars’ DataFrame.write_parquet(..., partition_by=[…]) uses hive_style=True by default, emitting directories like year=2022/month=1/day=2, which matches DuckDB’s HIVE_PARTITIONING=1 expectation. Since your code doesn’t override hive_style, it will produce the correct key=value layout end-to-end, and the existing test correctly verifies the partition_by setting.

If you’d like extra clarity or to guard against any future default changes, you can optionally:

  • Pass hive_style=True explicitly in your write_parquet call.
  • Add an assertion in test_s3_client.py to ensure the parameter is present, for example:
 assert call_args[1]["partition_by"] == ["year", "month", "day"]
+assert call_args[1].get("hive_style", True) is True

Otherwise, this concern can be considered resolved.

@forstmeier forstmeier force-pushed the 08-20-create_initial_datamanager_helper_classes branch from 8fdec8f to 98fca58 on August 21, 2025 02:53
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (4)
applications/datamanager/src/datamanager/alpaca_client.py (1)

54-57: Log full exception context; avoid bare Exception when possible.

Switch to logger.exception to capture the stack trace, and consider narrowing the exception type (HTTP/network/API error) if the SDK exposes them.

-        except Exception as e:
-            logger.error(f"Error fetching Alpaca assets: {e}")
+        except Exception as e:  # consider narrowing to the SDK's error types
+            logger.exception(f"Error fetching Alpaca assets: {e}")
             raise
applications/datamanager/src/datamanager/s3_client.py (3)

34-41: Nice: context manager and explicit close implemented.

The connection lifecycle improvements address earlier feedback about leaks and make usage with with-statements straightforward.


65-65: Path pattern fix looks good.

Switching to s3://…/equity/bars/**/*.parquet resolves the earlier mismatch with the hive-style year/month/day partitioning.


67-84: Simplify and harden the date filter; cast partition columns or build a DATE.

With HIVE_PARTITIONING=1, year/month/day arrive as strings. Explicit casts avoid accidental lexical comparisons; using make_date + BETWEEN is simpler and often improves partition pruning.

-        query = """ 
-            SELECT *
-            FROM read_parquet(
-                ?, 
-                HIVE_PARTITIONING=1
-            )
-            WHERE 
-                (year > ? OR 
-                (year = ? AND month > ?) OR 
-                (year = ? AND month = ?
-                AND day >= ?))
-                AND
-                (year < ? OR 
-                (year = ? AND month < ?) OR 
-                (year = ?
-                AND month = ?
-                AND day <= ?))
-        """
+        query = """
+            SELECT *
+            FROM read_parquet(?, HIVE_PARTITIONING=1)
+            WHERE make_date(CAST(year AS INTEGER),
+                            CAST(month AS INTEGER),
+                            CAST(day AS INTEGER))
+                  BETWEEN ? AND ?
+        """
 
-        params = (
-            path_pattern,
-            start_date.year,
-            start_date.year,
-            start_date.month,
-            start_date.year,
-            start_date.month,
-            start_date.day,
-            end_date.year,
-            end_date.year,
-            end_date.month,
-            end_date.year,
-            end_date.month,
-            end_date.day,
-        )
+        params = (path_pattern, start_date, end_date)

Also applies to: 86-101
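The equivalence the suggestion relies on can be checked in plain Python: the verbose OR-chain over year/month/day is exactly lexicographic tuple comparison, which in turn matches comparing constructed dates. A small sketch (function names are illustrative):

```python
from datetime import date

def in_range_or_chain(y: int, m: int, d: int, start: date, end: date) -> bool:
    """The verbose OR-chain filter, expressed as Python tuple
    comparisons (lexicographic order over (year, month, day))."""
    return (
        (start.year, start.month, start.day)
        <= (y, m, d)
        <= (end.year, end.month, end.day)
    )

def in_range_make_date(y: int, m: int, d: int, start: date, end: date) -> bool:
    """The simplified form: build a date and apply BETWEEN semantics,
    mirroring make_date(...) BETWEEN ? AND ? in the suggested SQL."""
    return start <= date(y, m, d) <= end
```

Both predicates agree on every valid calendar date, which is why the `make_date`/`BETWEEN` rewrite is a safe simplification once the partition columns are cast to integers.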

🧹 Nitpick comments (14)
applications/datamanager/src/datamanager/alpaca_client.py (8)

1-5: Add missing import for math (used in failure-threshold calc suggestion).

You’ll need math for a safer failure-threshold calculation below.

+import math
 import time
 from datetime import date
 from typing import TYPE_CHECKING, cast
 from zoneinfo import ZoneInfo

46-51: Use enum for asset attributes instead of a raw string.

attributes="has_options" is brittle. Prefer the enum list to align with the SDK’s request model (and future-proof parsing). Also import the enum.

-from alpaca.trading.enums import AssetClass, AssetStatus
+from alpaca.trading.enums import AssetClass, AssetStatus, AssetAttributes
@@
-                    GetAssetsRequest(
-                        status=AssetStatus.ACTIVE,
-                        asset_class=AssetClass.US_EQUITY,
-                        attributes="has_options",
-                    )
+                    GetAssetsRequest(
+                        status=AssetStatus.ACTIVE,
+                        asset_class=AssetClass.US_EQUITY,
+                        attributes=[AssetAttributes.OPTIONABLE],
+                    )

If your intent was to include all equities (optionable or not), drop the attributes filter entirely.


65-66: Define UTC alongside market timezone for consistent normalization.

We’ll need UTC for safe timestamp normalization below.

-        timezone_filter = ZoneInfo("America/New_York")
+        timezone_filter = ZoneInfo("America/New_York")
+        UTC = ZoneInfo("UTC")

81-83: Prefer enum member over string for DataFeed.

Use DataFeed.IEX instead of constructing via string; it’s clearer and safer across SDK versions.

-                            feed=DataFeed("iex"),
+                            feed=DataFeed.IEX,

111-115: Reduce log noise for expected skips.

Skipping non-matching dates is normal; demote to debug to avoid overwhelming logs with large universes.

-                        logger.info(
+                        logger.debug(
                             f"Skipping equity bar for {snapshot.symbol} on {daily_equity_bar_date}"  # noqa: E501
                         )

149-150: Ensure rate limiting even on failures (sleep in finally).

Right now you sleep only on success. If a chunk fails, you immediately try the next chunk—risking rate limits. Move the sleep into a finally.

-                time.sleep(self.rate_limit_sleep)
-
-            except Exception as e:
+            except Exception as e:
                 logger.error(
                     f"Error fetching Alpaca snapshots for chunk {i // chunk_size + 1}: {e}"  # noqa: E501
                 )
                 failed_chunks += 1
                 if failed_chunks > maximum_failed_chunks:
                     message = f"Too many chunk failures: {failed_chunks} chunks failed"
                     raise RuntimeError(message) from e
                 continue  # continue with next chunk instead of raising
+            finally:
+                time.sleep(self.rate_limit_sleep)

Also applies to: 151-160
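The sleep-in-`finally` pattern guarantees the throttle runs on both the success and failure paths, even when the `except` branch ends in `continue`. A minimal stand-alone sketch (the loop shape mirrors the chunked fetch, but `process_chunks`, `handle`, and `pause` are illustrative names):

```python
def process_chunks(chunks, handle, pause) -> int:
    """Process each chunk; `pause` runs in `finally`, so throttling
    happens whether the chunk succeeded or raised. Returns the
    number of failed chunks."""
    failed = 0
    for chunk in chunks:
        try:
            handle(chunk)
        except Exception:
            failed += 1
            continue  # finally still runs before the next iteration
        finally:
            pause()  # rate-limit sleep stand-in; runs on success AND failure
    return failed
```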


161-165: Optional: stabilize output schema when no records.

If nothing is collected, pl.DataFrame(equity_bars) yields a 0x0 frame. Consider returning an empty DataFrame with a fixed schema so downstream code doesn’t branch on shape.

I can provide a tiny helper to build an empty Polars frame with the expected dtypes if you want to standardize this.


68-69: Optional: make chunk size configurable.

Consider promoting chunk_size to an instance attribute or constructor arg to tune throughput vs. rate limits without code changes.
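
A sketch of the constructor-argument approach; the parameter name, default, and `_chunks` helper are assumptions for illustration, not the client's current signature:

```python
class AlpacaClient:
    """Simplified sketch; only the chunking knob is shown."""

    def __init__(self, api_key: str, api_secret: str, chunk_size: int = 100) -> None:
        # chunk_size becomes tunable per deployment without code changes.
        self.api_key = api_key
        self.api_secret = api_secret
        self.chunk_size = chunk_size

    def _chunks(self, symbols: list[str]) -> list[list[str]]:
        # Split the symbol universe into request-sized chunks.
        return [
            symbols[i : i + self.chunk_size]
            for i in range(0, len(symbols), self.chunk_size)
        ]
```

Callers that need higher throughput (or stricter rate limiting) then pass a different `chunk_size` at construction time.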

applications/datamanager/src/datamanager/s3_client.py (6)

24-28: Parameterize DuckDB SET and split multi-statement SQL.

Avoid string interpolation in SQL and prefer one statement per execute for clearer errors and safety. Parameterization also makes the prior regex validation less critical.

Apply this diff:

-        self.duckdb_connection.execute(f"""
-            INSTALL httpfs;
-            LOAD httpfs;
-            SET s3_region='{region}';
-        """)
+        # Run one statement per execute; parameterize the region setting.
+        self.duckdb_connection.execute("INSTALL httpfs")
+        self.duckdb_connection.execute("LOAD httpfs")
+        self.duckdb_connection.execute("SET s3_region = ?", [region])

30-33: Make close() idempotent and avoid stale handle after closing.

Set the connection to None after closing so repeated close() or exit calls are safe and references don’t point to a closed handle.

 def close(self) -> None:
-        if self.duckdb_connection:
-            self.duckdb_connection.close()
+        con = getattr(self, "duckdb_connection", None)
+        if con is not None:
+            try:
+                con.close()
+            finally:
+                self.duckdb_connection = None

Also applies to: 40-41


57-61: Prefer logger.exception to capture stack traces.

You already include rich context. logger.exception preserves the traceback automatically; no need to interpolate the exception.

-            except Exception as e:
-                logger.error(
-                    f"Error writing equity bars data to bucket '{self.data_bucket_name}' "  # noqa: E501
-                    f"at path '{self.daily_equity_bars_path}' with {count} rows: {e}"
-                )
+            except Exception:
+                logger.exception(
+                    f"Error writing equity bars data to bucket '{self.data_bucket_name}' "  # noqa: E501
+                    f"at path '{self.daily_equity_bars_path}' with {count} rows"
+                )
                 raise

64-66: Validate date bounds and reuse the path constant.

Guard against start_date > end_date early, and reuse the existing path attribute for consistency.

-        path_pattern = f"s3://{self.data_bucket_name}/equity/bars/**/*.parquet"
+        if start_date > end_date:
+            raise ValueError("start_date must be on or before end_date")
+        path_pattern = f"{self.daily_equity_bars_path}**/*.parquet"

102-104: Return Polars directly from DuckDB; avoid pandas hop.

DuckDB’s Python API can yield a Polars DataFrame directly, reducing dependencies and conversion overhead.

-        result = self.duckdb_connection.execute(query, params).fetchdf()
-
-        return pl.from_pandas(result)
+        return self.duckdb_connection.execute(query, params).pl()

34-41: Optional: add precise type hints for context manager dunder methods.

If you decide to remove noqa suppressions, these hints keep mypy/ruff happy and improve readability.

+from typing import Optional, Type
+from types import TracebackType
@@
-    def __enter__(self):  # noqa: ANN204
-        return self
+    def __enter__(self) -> "S3Client":
+        return self
@@
-    def __exit__(self, exc_type, exc_val, exc_tb):  # noqa: ANN001, ANN204
-        _ = exc_type, exc_val, exc_tb
-
-        self.close()
+    def __exit__(
+        self,
+        exc_type: Optional[Type[BaseException]],
+        exc_val: Optional[BaseException],
+        exc_tb: Optional[TracebackType],
+    ) -> None:
+        _ = exc_type, exc_val, exc_tb
+        self.close()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 8fdec8f and 98fca58.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • Dockerfile.tests (2 hunks)
  • applications/datamanager/pyproject.toml (1 hunks)
  • applications/datamanager/src/datamanager/alpaca_client.py (1 hunks)
  • applications/datamanager/src/datamanager/s3_client.py (1 hunks)
  • applications/datamanager/tests/test_alpaca_client.py (1 hunks)
  • applications/datamanager/tests/test_s3_client.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • applications/datamanager/tests/test_s3_client.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • applications/datamanager/tests/test_alpaca_client.py
  • Dockerfile.tests
  • applications/datamanager/pyproject.toml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (3)
applications/datamanager/src/datamanager/alpaca_client.py (2)

35-39: TradingClient initialization looks correct.

Passing paper=is_paper is the intended knob for live vs. paper trading.


29-33: StockHistoricalDataClient sandbox parameter is supported

I’ve confirmed that as of the July 2025 release, StockHistoricalDataClient’s constructor includes a sandbox: bool = False parameter (and does not use a paper flag). The existing code using sandbox=is_paper is therefore correct—no changes are needed.

applications/datamanager/src/datamanager/s3_client.py (1)

52-55: Remove the LazyFrame.sink_parquet refactor—use DataFrame.write_parquet for partitioned writes

The LazyFrame.sink_parquet API does not support a partition_by parameter as of August 2025; partitioned (“Hive‐style”) Parquet datasets—especially to S3—must be written with DataFrame.write_parquet.

File: applications/datamanager/src/datamanager/s3_client.py
Lines: 52–55

Suggested update

-                ).drop("dt").write_parquet(
-                    file=self.daily_equity_bars_path,
-                    partition_by=["year", "month", "day"],
-                )
+                ).drop("dt").write_parquet(
+                    file=self.daily_equity_bars_path,
+                    partition_by=["year", "month", "day"],
+                    use_pyarrow=True,                              # ensure pyarrow dataset write
+                    compression="snappy",
+                    storage_options=self.s3_storage_options,      # pass AWS creds/options
+                )

Verification

  • Confirm that your project’s Polars version (in requirements.txt or pyproject.toml) is ≥ the release where DataFrame.write_parquet(..., partition_by=…) on S3 was stabilized.

Likely an incorrect or invalid review comment.

@forstmeier force-pushed the 08-20-create_initial_datamanager_helper_classes branch from 98fca58 to fc5e48f on August 21, 2025 at 03:18
Contributor

@coderabbitai (bot) left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
applications/datamanager/src/datamanager/s3_client.py (3)

30-33: Make close() idempotent and drop unused variables in exit.

Set the connection to None after closing to avoid double-closing; exit doesn’t need to consume its args.

     def close(self) -> None:
-        if self.duckdb_connection:
-            self.duckdb_connection.close()
+        if self.duckdb_connection is not None:
+            try:
+                self.duckdb_connection.close()
+            finally:
+                self.duckdb_connection = None
@@
-    def __exit__(self, exc_type, exc_val, exc_tb):  # noqa: ANN001, ANN204
-        _ = exc_type, exc_val, exc_tb
-
-        self.close()
+    def __exit__(self, exc_type, exc_val, exc_tb) -> None:  # noqa: ANN001
+        self.close()

Also applies to: 37-41


65-65: Path pattern looks correct for recursive partitions.

Good move to a recursive glob with explicit Parquet suffix; matches the partitioned dataset layout and avoids the prior 4-wildcard mismatch.


67-84: Date filtering can return wrong rows; cast partitions and compare as a proper DATE.

year/month/day come from Hive partitions and are strings; comparing them to integers is fragile and can be lexicographic in some engines. Build a DATE from the partitions and compare using BETWEEN with date-typed parameters.

Apply these diffs to simplify and harden the query and parameters:

-        query = """ 
-            SELECT *
-            FROM read_parquet(
-                ?, 
-                HIVE_PARTITIONING=1
-            )
-            WHERE 
-                (year > ? OR 
-                (year = ? AND month > ?) OR 
-                (year = ? AND month = ?
-                AND day >= ?))
-                AND
-                (year < ? OR 
-                (year = ? AND month < ?) OR 
-                (year = ?
-                AND month = ?
-                AND day <= ?))
-        """
+        query = """
+            SELECT *
+            FROM read_parquet(
+                ?,
+                HIVE_PARTITIONING=1
+            )
+            WHERE make_date(CAST(year AS INT), CAST(month AS INT), CAST(day AS INT))
+                  BETWEEN ? AND ?
+        """
@@
-        params = (
-            path_pattern,
-            start_date.year,
-            start_date.year,
-            start_date.month,
-            start_date.year,
-            start_date.month,
-            start_date.day,
-            end_date.year,
-            end_date.year,
-            end_date.month,
-            end_date.year,
-            end_date.month,
-            end_date.day,
-        )
+        params = (path_pattern, start_date, end_date)

Also applies to: 86-100

🧹 Nitpick comments (7)
applications/datamanager/tests/test_alpaca_client.py (5)

15-16: Assert the rate-limit sleeps to lock in behavior.

You already patch time.sleep; asserting the call count makes the tests guard the rate limiting contract (once after assets, once after snapshots).

Apply this diff to add an explicit assertion and drop the noqa:

-        patch("time.sleep") as mock_sleep,  # noqa: F841
+        patch("time.sleep") as mock_sleep,
@@
         assert row["timestamp"] == expected_timestamp
+
+        # rate-limiting: one sleep after assets call and one after snapshots
+        assert mock_sleep.call_count == 2

Also applies to: 60-76


37-41: Nit: clarify timezone variable name for readability.

Avoid shadowing the stdlib symbol name “timezone” and make intent explicit.

-        timezone = ZoneInfo("America/New_York")
+        ny_tz = ZoneInfo("America/New_York")
@@
-        mock_daily_bar.timestamp = datetime(2024, 1, 15, 16, 0, 0, tzinfo=timezone)
+        mock_daily_bar.timestamp = datetime(2024, 1, 15, 16, 0, 0, tzinfo=ny_tz)
@@
-        expected_timestamp = int(
-            datetime(2024, 1, 15, 16, 0, 0, tzinfo=timezone).timestamp() * 1000
-        )
+        expected_timestamp = int(datetime(2024, 1, 15, 16, 0, 0, tzinfo=ny_tz).timestamp() * 1000)

Also applies to: 72-75


84-85: Replicate the sleep assertion in the “no daily_bar” path.

Even when skipping rows, the client still rate-limits. This assertion protects against accidental changes.

-        patch("time.sleep") as mock_sleep,  # noqa: F841
+        patch("time.sleep") as mock_sleep,
@@
         assert isinstance(result, pl.DataFrame)
         assert len(result) == 0
+        assert mock_sleep.call_count == 2

Also applies to: 108-110


118-119: Replicate the sleep assertion in the “wrong date” path.

Keeps rate limiting guarantees consistent across branches.

-        patch("time.sleep") as mock_sleep,  # noqa: F841
+        patch("time.sleep") as mock_sleep,
@@
         assert isinstance(result, pl.DataFrame)
         assert len(result) == 0
+        assert mock_sleep.call_count == 2

Also applies to: 155-156


1-157: Add coverage for naive timestamps and missing VWAP.

fetch_latest_data treats naive timestamps as UTC and allows vwap=None. A focused test will prevent regressions.

Proposed test to append to this file:

def test_alpaca_client_fetch_latest_data_naive_timestamp_and_none_vwap() -> None:
    from datetime import date, datetime

    with (
        patch("datamanager.alpaca_client.StockHistoricalDataClient") as mock_hist,
        patch("datamanager.alpaca_client.TradingClient") as mock_trading,
        patch("time.sleep"),
    ):
        trading = MagicMock()
        mock_trading.return_value = trading
        asset = MagicMock()
        asset.symbol = "NAIVE"
        trading.get_all_assets.return_value = [asset]

        hist = MagicMock()
        mock_hist.return_value = hist

        # naive timestamp -> interpreted as UTC in implementation
        mock_daily_bar = MagicMock()
        mock_daily_bar.timestamp = datetime(2024, 1, 15, 21, 0, 0)  # naive; 21:00 UTC
        mock_daily_bar.open = 1.0
        mock_daily_bar.high = 2.0
        mock_daily_bar.low = 0.5
        mock_daily_bar.close = 1.5
        mock_daily_bar.volume = 123
        mock_daily_bar.vwap = None  # explicit None path

        snap = MagicMock()
        snap.symbol = "NAIVE"
        snap.daily_bar = mock_daily_bar
        hist.get_stock_snapshot.return_value = {"NAIVE": snap}

        client = AlpacaClient("k", "s", True)  # noqa: FBT003
        df = client.fetch_latest_data(date(2024, 1, 15))

        assert len(df) == 1
        row = df.row(0, named=True)
        assert row["ticker"] == "NAIVE"
        assert row["volume_weighted_average_price"] is None
        # naive 2024-01-15 21:00:00 interpreted as UTC -> epoch ms matches timestamp()
        assert row["timestamp"] == int(mock_daily_bar.timestamp.timestamp() * 1000)
applications/datamanager/src/datamanager/s3_client.py (2)

64-66: Validate input dates early.

Defensive check improves error messages and avoids surprising empty results.

     def read_equity_bars_data(self, start_date: date, end_date: date) -> pl.DataFrame:
-        path_pattern = f"s3://{self.data_bucket_name}/equity/bars/**/*.parquet"
+        if start_date > end_date:
+            raise ValueError(f"start_date {start_date} must be on or before end_date {end_date}")
+        path_pattern = f"s3://{self.data_bucket_name}/equity/bars/**/*.parquet"

24-29: Minor: prefer separate executes for clarity and error locality.

Splitting the multi-statement SQL helps isolate failures and improves readability when diagnosing connection/setup issues.

-        self.duckdb_connection.execute(f"""
-            INSTALL httpfs;
-            LOAD httpfs;
-            SET s3_region='{region}';
-        """)
+        self.duckdb_connection.execute("INSTALL httpfs;")
+        self.duckdb_connection.execute("LOAD httpfs;")
+        self.duckdb_connection.execute(f"SET s3_region='{region}';")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 98fca58 and fc5e48f.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (6)
  • Dockerfile.tests (2 hunks)
  • applications/datamanager/pyproject.toml (1 hunks)
  • applications/datamanager/src/datamanager/alpaca_client.py (1 hunks)
  • applications/datamanager/src/datamanager/s3_client.py (1 hunks)
  • applications/datamanager/tests/test_alpaca_client.py (1 hunks)
  • applications/datamanager/tests/test_s3_client.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • applications/datamanager/tests/test_s3_client.py
  • Dockerfile.tests
  • applications/datamanager/src/datamanager/alpaca_client.py
  • applications/datamanager/pyproject.toml
🧰 Additional context used
🧬 Code Graph Analysis (1)
applications/datamanager/tests/test_alpaca_client.py (1)
applications/datamanager/src/datamanager/alpaca_client.py (2)
  • AlpacaClient (21-169)
  • fetch_latest_data (63-169)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests

chrisaddy previously approved these changes Aug 23, 2025
…hub.com:pocketsizefund/pocketsizefund into 08-20-create_initial_datamanager_helper_classes
Contributor

@coderabbitai (bot) left a comment


Actionable comments posted: 11

♻️ Duplicate comments (1)
Dockerfile.tests (1)

28-35: Remove redundant second uv sync; install local packages in editable mode instead, and make the first sync deterministic and faster

Running uv sync --all-packages --all-groups again after copying sources re-resolves the entire environment and lengthens builds. Use a single, cached, frozen sync for third-party deps, then install the local packages you need in editable mode. This preserves layer caching and avoids full re-resolution. This mirrors the prior feedback on the same lines.

-RUN uv sync --all-packages --all-groups
+RUN --mount=type=cache,target=/root/.cache/uv uv sync --all-packages --all-groups --frozen
@@
-COPY libraries/ ./libraries/
-
-RUN uv sync --all-packages --all-groups
+COPY libraries/ ./libraries/
+RUN uv pip install -e ./applications/datamanager
+# If this is a first-party package consumed by tests, keep the editable install;
+# otherwise, drop the line below.
+RUN uv pip install -e ./libraries/python

Notes:

  • --frozen ensures the environment matches uv.lock exactly, failing fast on drift.
  • The BuildKit cache mount speeds up repeat builds considerably.
  • If you truly need a second uv sync (e.g., to pick up workspace resolution across multiple local packages), prefer uv sync --frozen and scope it to the specific project; otherwise the editable installs above should suffice.
🧹 Nitpick comments (19)
Dockerfile.tests (1)

1-1: Good call aligning tests to Python 3.12.x

Pinning to 3.12.10 matches the repo-wide 3.12 standard and keeps test runs reproducible. As an optional tweak to shrink image size, consider switching to the slim variant and using --no-install-recommends if the toolchain allows.

Apply if desired:

-FROM python:3.12.10
+FROM python:3.12.10-slim
@@
-RUN apt-get update && apt-get install -y \
+RUN apt-get update && apt-get install -y --no-install-recommends \
     build-essential \
     clang \
     libc6-dev \
     linux-libc-dev \
     && rm -rf /var/lib/apt/lists/*
infrastructure/Pulumi.yaml (2)

7-10: Good move to config-based tags; consider adding a minimal standard tag set for discoverability.

The pulumi:template tag is fine. To replace the prior code-based tagging and improve search/ownership in the Pulumi Service, add stable tags (non-reserved keys) like repo, owner, and service. These won’t affect resources, only project metadata.

Apply if desired:

 config:
   pulumi:tags:
     value:
       pulumi:template: aws-python
+      repo: pocketsizefund/pocketsizefund
+      service: infrastructure
+      owner: datamanager

Note: environment-specific tags (e.g., environment) are better attached at the stack level rather than project-level.


2-2: Nit: make the description clearer.

Minor clarity/readability tweak.

-description: pocketsizefund infra
+description: Pocket Size Fund infrastructure
.claude/settings.local.json (4)

19-19: ln: allow only safe symlink creation within the repo.

Unrestricted ln can create hardlinks/symlinks to system locations. Restrict to symlinks and relative targets in the workspace.

Apply:

-      "Bash(ln:*)",
+      "Bash(ln -s:./*)",

Optional: if supported by the matcher, also add a deny to block absolute paths:

  • "Bash(ln -s:/*)" in the deny list.

21-21: docker build: restrict to infra contexts to avoid arbitrary code execution.

Building arbitrary Dockerfiles can execute RUN steps over the network. Scope it to known-safe paths.

Apply:

-      "Bash(docker build:*)",
+      "Bash(docker build:infrastructure/**)",

Optional hardening:

  • Prefer pinned builders/targets (e.g., buildx bake) if applicable.
  • If supported, add deny entries for parent-directory contexts and absolute paths (e.g., "Bash(docker build:../)", "Bash(docker build:/)").
  • Consider requiring an approval mode for docker build even if other commands are auto-approved.

22-22: Add CI step to verify uv lock consistency

It looks like none of your CI definitions (e.g. .github/workflows/… or tests.yaml) currently run uv lock --frozen before executing tests. To ensure that your lockfile remains up-to-date and reproducible, add a pre-test check in your pipeline:

• In your CI YAML (e.g. tests.yaml), insert before the coverage or test step:

- name: Verify uv lock consistency
  run: |
    set -euo pipefail
    uv lock --frozen || { echo "Lockfile is out of date"; exit 1; }

• If you ever migrate to GitHub Actions, you can similarly add in your workflow .github/workflows/ci.yml:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install uv
      - name: Verify uv lock consistency
        run: |
          set -euo pipefail
          uv lock --frozen || { echo "Lockfile is out of date"; exit 1; }
      - name: Run tests
        run: uv run coverage run --parallel-mode -m pytest

This will cause CI to fail whenever the lockfile drifts from your declared dependencies, ensuring updates are deliberate.


20-20: Prefer allowlisted WebFetch domains over broad WebSearch.

WebSearch broadens exfiltration surface and introduces non-deterministic behavior. Either replace it with explicit WebFetch rules for only the domains you trust, or if you must keep WebSearch, lock down its default approval mode to ensure you’re prompted before any search.

• Option A — Replace WebSearch with explicit WebFetch entries (add alongside your existing allow rules):

"permissions": {
  "allow": [
    "WebFetch(domain:docs.github.com)",
    "WebFetch(domain:docs.astral.sh)",
    "WebFetch(domain:alpaca.markets)",
    "WebFetch(domain:docs.aws.amazon.com)"
  ]
}

• Option B — Keep WebSearch but require human approval by adjusting permissions.defaultMode in .claude/settings.local.json. Valid modes are:
  • default – prompts on first use of each tool (standard behavior) (docs.anthropic.com)
  • acceptEdits – automatically accepts file-edit permissions for the session (docs.anthropic.com)
  • plan – analysis-only mode; Claude can inspect code but cannot modify files or execute commands (docs.anthropic.com)
  • bypassPermissions – skips all permission prompts (use only in isolated/safe environments) (docs.anthropic.com)

Command patterns follow these rules (all case-sensitive):

  • Bash(cmd) exactly matches the entire shell invocation.
  • Bash(prefix:*) matches any command beginning with that prefix (e.g. Bash(docker build:*) matches docker build …) but does not span chained operators like &&. (docs.anthropic.com)

Example to require a prompt before any WebSearch:

"permissions": {
  "allow": [
    "WebSearch"
  ],
  "defaultMode": "default"
}
.gitignore (1)

18-18: Good call ignoring the private key; also ignore the pub/known_hosts counterparts

Swarm key material often generates a .pub and SSH writes known_hosts. Prevent accidental commits.

Apply this diff:

 infrastructure/swarm.pem
+infrastructure/swarm.pem.pub
+infrastructure/known_hosts
+infrastructure/.ssh/
infrastructure/prometheus.yml (2)

1-3: LGTM; consider setting a scrape_timeout to bound slow targets

A small timeout improves scheduler behavior when a target hangs.

Apply this diff:

 global:
   scrape_interval: 15s
   evaluation_interval: 15s
+  scrape_timeout: 10s

10-16: Optional: future-proof service discovery

Static targets work in Swarm if service VIPs resolve, but DNS SD avoids hardcoding. If you later expose exporters with DNS names, consider dns_sd_configs or file_sd_configs. No change required now.
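
For reference, a dns_sd_configs variant might look like the following; the service name `tasks.cadvisor` is an assumption based on Swarm's per-task DNS convention, not taken from this stack:

```yaml
scrape_configs:
  - job_name: cadvisor
    dns_sd_configs:
      - names:
          - tasks.cadvisor   # Swarm resolves this to every task IP of the service
        type: A
        port: 8080
```

This picks up new replicas automatically as the service scales, without editing prometheus.yml.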

infrastructure/__main__.py (3)

114-131: Quote the join token to avoid accidental shell interpretation

Defensive quoting of the token avoids edge cases if whitespace/newlines ever sneak in.

Apply this diff:

 create_cmd = pulumi.Output.all(mgr_ip.ip_address, worker_token).apply(
     lambda vals: (
-        "bash -lc 'for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done && "
-        f"sudo docker swarm join --token {vals[1]} {vals[0]}:2377'"
+        "bash -lc 'for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done && "
+        f"sudo docker swarm join --token \"{vals[1]}\" {vals[0]}:2377'"
     )
 )

Also applies to: 143-149


11-16: Avoid cross-stack name collisions by suffixing resource names with the stack

Lightsail KeyPair and Instance names are region-unique. Appending the Pulumi stack name prevents clashes when deploying multiple stacks.

Apply this diff:

-ssh_key = tls.PrivateKey("swarm-key", algorithm="RSA", rsa_bits=4096)
-ls_key = aws.lightsail.KeyPair(
-    "swarm-ls-key",
-    name="swarm-ls-key",
+ssh_key = tls.PrivateKey("swarm-key", algorithm="RSA", rsa_bits=4096)
+stack_name = pulumi.get_stack()
+ls_key = aws.lightsail.KeyPair(
+    f"swarm-ls-key-{stack_name}",
+    name=f"swarm-ls-key-{stack_name}",
     public_key=ssh_key.public_key_openssh,
 )
@@
-    inst = aws.lightsail.Instance(
-        name,
-        name=name,
+    inst = aws.lightsail.Instance(
+        f"{name}-{stack_name}",
+        name=f"{name}-{stack_name}",
@@
-    ip = aws.lightsail.StaticIp(f"{name}-ip", name=f"{name}-ip")
+    ip = aws.lightsail.StaticIp(f"{name}-ip-{stack_name}", name=f"{name}-ip-{stack_name}")
@@
-mgr_inst, mgr_ip = mk_instance("swarm-mgr-1", bundle_mgr)
-w1_inst, w1_ip = mk_instance("swarm-wkr-1", bundle_wkr)
-w2_inst, w2_ip = mk_instance("swarm-wkr-2", bundle_wkr)
+mgr_inst, mgr_ip = mk_instance("swarm-mgr-1", bundle_mgr)
+w1_inst, w1_ip = mk_instance("swarm-wkr-1", bundle_wkr)
+w2_inst, w2_ip = mk_instance("swarm-wkr-2", bundle_wkr)

Note: Instance logical names passed to mk_instance remain stable; the AWS-side names gain the stack suffix.

Also applies to: 52-61, 91-96, 101-103


1-5: Optional: add typing for return values and role

Improves readability and IDE support; small ergonomics boost.

Apply this diff:

 import pulumi
 import pulumi_aws as aws
 import pulumi_tls as tls
 from pulumi_command import remote
+from typing import NamedTuple, Literal
@@
-def mk_instance(name: str, bundle_id: str):  # noqa: ANN201
+class InstanceResources(NamedTuple):
+    inst: aws.lightsail.Instance
+    ip: aws.lightsail.StaticIp
+    attach: aws.lightsail.StaticIpAttachment
+
+def mk_instance(name: str, bundle_id: str) -> InstanceResources:  # noqa: ANN201
.mise.toml (1)

48-48: Yamllint invocation looks good; consider centralizing config.

Inline config is fine, but a repo-level .yamllint keeps local IDE/CI consistent and easier to tweak without touching tasks. Not a blocker.
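
A repo-level .yamllint might look like the following; the rule values here are placeholders and should mirror whatever flags the mise task currently passes inline:

```yaml
# .yamllint — example only; keep in sync with the inline task config
extends: default
rules:
  line-length:
    max: 120
  new-line-at-end-of-file: enable
```

Both local editors and CI then pick up the same rules without duplicating them in task definitions.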

infrastructure/main.nu (1)

46-49: Minor hardening suggestion: prefer ED25519 and avoid accepting new host keys implicitly.

You already remove old keys and repopulate via ssh-keyscan. Consider pinning the key type and keeping StrictHostKeyChecking default/“yes” to avoid TOFU surprises. Not blocking given current workflow.

Example tweak:

ssh-keyscan -t ed25519 -H $manager_ip | save --append $"($env.HOME)/.ssh/known_hosts"

And in the SSH block:

  StrictHostKeyChecking yes
infrastructure/stack.yml (4)

45-52: Enable ACME/TLS and persist certificates; you already declared a traefik_letsencrypt volume.

Right now Grafana is HTTP-only and Traefik has no cert resolver. Recommend wiring Let’s Encrypt and switching Grafana to websecure.

Apply this diff in traefik service:

       - --metrics.prometheus.addServicesLabels=true
+      - --certificatesresolvers.letsencrypt.acme.email=ops@pocketsizefund.com
+      - --certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json
+      - --certificatesresolvers.letsencrypt.acme.httpchallenge=true
+      - --certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web
@@
     volumes:
-      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+      - traefik_letsencrypt:/letsencrypt

And in grafana labels (Lines 134-137), prefer TLS:

-        - traefik.http.routers.grafana.entrypoints=web
+        - traefik.http.routers.grafana.entrypoints=websecure
+        - traefik.http.routers.grafana.tls=true
+        - traefik.http.routers.grafana.tls.certresolver=letsencrypt

55-56: Avoid floating tags for Portainer.

latest can introduce breaking changes unexpectedly. Pin to a known-good version used in your environment.

Example:

-    image: portainer/portainer-ce:latest
+    image: portainer/portainer-ce:2.20.3

If you prefer, create a version variable and inject via CI.


135-137: Grafana router host is a placeholder; make it configurable.

Hardcoding grafana.example.com may rot. Consider an env or compose variable, e.g., ${GRAFANA_HOST} from a .env.

Apply this diff:

-        - traefik.http.routers.grafana.rule=Host(`grafana.example.com`)
+        - traefik.http.routers.grafana.rule=Host(`${GRAFANA_HOST}`)

1-138: Add missing newline at end of file.

Yamllint flagged this. Add a trailing newline to keep tooling happy.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between fc5e48f and d1e45c9.

⛔ Files ignored due to path filters (1)
  • .flox/env/manifest.lock is excluded by !**/*.lock
📒 Files selected for processing (25)
  • .claude/settings.local.json (1 hunks)
  • .flox/env/manifest.toml (1 hunks)
  • .gitignore (1 hunks)
  • .mise.toml (1 hunks)
  • Dockerfile.tests (2 hunks)
  • infrastructure/.claude/settings.local.json (1 hunks)
  • infrastructure/.gitignore (1 hunks)
  • infrastructure/.python-version (1 hunks)
  • infrastructure/Pulumi.yaml (1 hunks)
  • infrastructure/__main__.py (1 hunks)
  • infrastructure/api.py (0 hunks)
  • infrastructure/cluster.py (0 hunks)
  • infrastructure/grafana-dashboard.json (0 hunks)
  • infrastructure/ingress.py (0 hunks)
  • infrastructure/keys.py (0 hunks)
  • infrastructure/main.nu (1 hunks)
  • infrastructure/monitors.py (0 hunks)
  • infrastructure/prometheus.yml (1 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/requirements.txt (1 hunks)
  • infrastructure/services.py (0 hunks)
  • infrastructure/stack.yml (1 hunks)
  • infrastructure/tags.py (0 hunks)
  • infrastructure/upload_grafana_dashboard.nu (0 hunks)
  • infrastructure/vpc.py (0 hunks)
💤 Files with no reviewable changes (10)
  • infrastructure/tags.py
  • infrastructure/keys.py
  • infrastructure/ingress.py
  • infrastructure/vpc.py
  • infrastructure/grafana-dashboard.json
  • infrastructure/upload_grafana_dashboard.nu
  • infrastructure/services.py
  • infrastructure/cluster.py
  • infrastructure/monitors.py
  • infrastructure/api.py
✅ Files skipped from review due to trivial changes (3)
  • infrastructure/.gitignore
  • infrastructure/.python-version
  • infrastructure/.claude/settings.local.json
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-15T13:54:49.230Z
Learnt from: forstmeier
PR: pocketsizefund/pocketsizefund#613
File: infrastructure/Pulumi.yaml:1-10
Timestamp: 2025-08-15T13:54:49.230Z
Learning: The user forstmeier prefers Pulumi stack naming convention as `pocketsizefund/pocketsizefund/production` to maintain organization/repository/environment mapping pattern for the pocketsizefund project.

Applied to files:

  • infrastructure/Pulumi.yaml
📚 Learning: 2025-08-15T13:54:45.910Z
Learnt from: forstmeier
PR: pocketsizefund/pocketsizefund#613
File: infrastructure/main.nu:10-13
Timestamp: 2025-08-15T13:54:45.910Z
Learning: The team typically runs Nu shell scripts in infrastructure/main.nu through Mise commands, which provides a controlled execution environment where relative paths like "stack.yml" work reliably.

Applied to files:

  • infrastructure/main.nu
  • .mise.toml
🪛 YAMLlint (1.37.1)
infrastructure/stack.yml

[error] 138-138: no new line character at the end of file

(new-line-at-end-of-file)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (4)
.flox/env/manifest.toml (1)

14-14: ✅ AWS CLI v2 migration verified

The automated scan completed without detecting any legacy v1-only commands:

  • No usages of aws ecr get-login (v1) were found.
  • No commands with --payload patterns that typically require --cli-binary-format raw-in-base64-out were detected.

All existing scripts appear compatible with awscli2. Feel free to proceed with the migration.

infrastructure/Pulumi.yaml (1)

3-6: Confirm uv toolchain availability in all environments

We’ve verified the following:

  • All expected pyproject.toml files are present at the repo root and within each sub-package (applications/datamanager, applications/models, applications/portfoliomanager, infrastructure, libraries/python) so uv sync has the manifests it needs.
  • The tests’ Docker image installs uv via COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv and runs uv sync --all-packages --all-groups in Dockerfile.tests, ensuring containerized CI builds succeed.

Next steps to ensure smooth operation outside that Docker context:

  • Install uv in local and non-Docker CI runners
    Add an installation step to your setup scripts or CI workflows, for example:
    # Install uv globally via pipx
    pipx install uv
  • Verify the lockfile is committed
    Ensure uv.lock exists at the repository root for reproducible installs (note: fd exits 0 even when nothing matches, so a plain test -f is a more reliable guard):
    test -f uv.lock || echo "Missing uv.lock – lockfile for reproducible installs"
  • (Optional) Fallback to pip if needed
    If any environment can’t use uv, you can temporarily switch to the default pip toolchain:
     runtime:
       name: python
       options:
    -    toolchain: uv
    +    toolchain: pip

Finally, consider adding a short note to your README or docs/ summarizing how to install and use uv for local development and CI setups.
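
For local, non-Docker setups, that note could boil down to a few commands (a sketch; pipx is just one of several supported installers, and the sync flags mirror what Dockerfile.tests already runs):

```shell
# Hypothetical local bootstrap mirroring the containerized CI steps.
pipx install uv                      # one-time toolchain install
uv sync --all-packages --all-groups  # resolve and install from the committed uv.lock
uv run pytest                        # run tests inside the synced environment
```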

infrastructure/main.nu (1)

10-13: Invalid flag on docker stack deploy (version-dependent).

docker stack deploy only accepts --detach on Docker CLI 23.0 and newer; on older clients the flag errors out and the stack won't deploy. If older engines must be supported, drop the flag.

Apply this diff:

-  docker stack deploy -c stack.yml infrastructure --detach=false
+  docker stack deploy -c stack.yml infrastructure

Likely an incorrect or invalid review comment.
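
Either way, a quick sketch of deploying and confirming the stack converged (assumes a Swarm manager node; the infrastructure_traefik service name is inferred from the stack name and is illustrative):

```shell
docker stack deploy -c stack.yml infrastructure
docker stack services infrastructure                  # replicas should reach desired count
docker service logs infrastructure_traefik --tail 20  # spot-check Traefik startup
```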

infrastructure/stack.yml (1)

25-35: Traefik provider flags: confirm which Traefik major version the image pins.

Traefik v3 split Swarm support into a dedicated provider, so providers.swarm.* is the correct v3 syntax; the Docker provider with swarmMode=true (below) is the v2 form. Apply the diff only if the stack actually runs Traefik v2.

Apply this diff to the command block:

-      - --providers.swarm.endpoint=unix:///var/run/docker.sock
-      - --providers.swarm.exposedByDefault=false
+      - --providers.docker.endpoint=unix:///var/run/docker.sock
+      - --providers.docker.swarmMode=true
+      - --providers.docker.exposedByDefault=false

Likely an incorrect or invalid review comment.
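
If the stack does run Traefik v3 with the swarm provider, a sketch of exposing a service (image and port are illustrative; note that in Swarm mode Traefik reads labels set under deploy.labels, not container labels):

```yaml
services:
  grafana:
    image: grafana/grafana:latest
    deploy:
      labels:  # Swarm-mode Traefik reads service labels under deploy:
        - traefik.enable=true
        - traefik.http.routers.grafana.rule=Host(`grafana.example.com`)
        - traefik.http.services.grafana.loadbalancer.server.port=3000
```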

Comment threads (10)
  • .claude/settings.local.json
  • .flox/env/manifest.toml
  • .mise.toml
  • infrastructure/__main__.py (2)
  • infrastructure/main.nu
  • infrastructure/Pulumi.yaml
  • infrastructure/pyproject.toml
  • infrastructure/requirements.txt
  • infrastructure/stack.yml
…hub.com:pocketsizefund/pocketsizefund into 08-20-create_initial_datamanager_helper_classes
@forstmeier
Collaborator Author

I have no idea why stuff that was deleted in master is also being deleted here (it's like your changes are in this now for some reason, @chrisaddy). I'm gonna merge it regardless.

@forstmeier forstmeier merged commit f50c94b into master Aug 23, 2025
4 checks passed
@forstmeier forstmeier deleted the 08-20-create_initial_datamanager_helper_classes branch August 23, 2025 18:42