
docker swarm #613

Merged
chrisaddy merged 1 commit into master from 08-14-docker_swarm
Aug 23, 2025
Conversation

@chrisaddy
Collaborator

@chrisaddy chrisaddy commented Aug 15, 2025


Summary by CodeRabbit

  • New Features

    • Provision an AWS Lightsail–backed Docker Swarm with a single command.
    • Deploy a built-in monitoring/admin stack: Traefik, Portainer, Prometheus, Grafana.
    • Added Prometheus scraping configuration and automatic Docker context setup.
  • Refactor

    • Migrated from Kubernetes/EKS and API Gateway to Lightsail + Swarm.
    • Removed legacy infrastructure components and dashboard uploader.
  • Chores

    • Updated dependencies/tooling (Pulumi), added Python 3.12 version file, and switched tests to Python 3.12.
    • Adjusted ignore rules and environment packages (awscli2, cargo).

@coderabbitai
Contributor

coderabbitai Bot commented Aug 15, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Caution

Review failed

The pull request is closed.



📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between dbfa692 and 4b9f516.

⛔ Files ignored due to path filters (2)
  • .flox/env/manifest.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (25)
  • .claude/settings.local.json (1 hunks)
  • .flox/env/manifest.toml (1 hunks)
  • .gitignore (1 hunks)
  • .mise.toml (1 hunks)
  • Dockerfile.tests (1 hunks)
  • infrastructure/.claude/settings.local.json (1 hunks)
  • infrastructure/.gitignore (1 hunks)
  • infrastructure/.python-version (1 hunks)
  • infrastructure/Pulumi.yaml (1 hunks)
  • infrastructure/__main__.py (1 hunks)
  • infrastructure/api.py (0 hunks)
  • infrastructure/cluster.py (0 hunks)
  • infrastructure/grafana-dashboard.json (0 hunks)
  • infrastructure/ingress.py (0 hunks)
  • infrastructure/keys.py (0 hunks)
  • infrastructure/main.nu (1 hunks)
  • infrastructure/monitors.py (0 hunks)
  • infrastructure/prometheus.yml (1 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/requirements.txt (1 hunks)
  • infrastructure/services.py (0 hunks)
  • infrastructure/stack.yml (1 hunks)
  • infrastructure/tags.py (0 hunks)
  • infrastructure/upload_grafana_dashboard.nu (0 hunks)
  • infrastructure/vpc.py (0 hunks)


Collaborator Author

chrisaddy commented Aug 15, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@chrisaddy chrisaddy mentioned this pull request Aug 15, 2025
@chrisaddy chrisaddy marked this pull request as ready for review August 15, 2025 03:58
Copilot AI review requested due to automatic review settings August 15, 2025 03:58
Contributor

Copilot AI left a comment


Pull Request Overview

This PR significantly refactors the infrastructure setup by replacing Kubernetes/EKS with Docker Swarm on AWS Lightsail instances. The change simplifies the architecture by moving from a complex Kubernetes-based serverless setup to a more straightforward Docker Swarm deployment.

Key changes:

  • Complete replacement of Kubernetes/EKS infrastructure with Docker Swarm on AWS Lightsail
  • Removal of Knative, API Gateway, and associated AWS services
  • Introduction of Docker Compose stack configuration for monitoring services (Traefik, Prometheus, Grafana)
  • Addition of automated deployment scripts using Nushell

Reviewed Changes

Copilot reviewed 23 out of 26 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
__main__.py Complete rewrite to provision AWS Lightsail instances and set up Docker Swarm cluster
stack.yml New Docker Compose stack configuration defining monitoring and networking services
main.nu New Nushell script for automated infrastructure deployment and stack management
prometheus.yml New Prometheus configuration for monitoring the Docker Swarm cluster
Multiple deleted files Removal of VPC, Kubernetes, Knative, and API Gateway related infrastructure code


@graphite-app

graphite-app Bot commented Aug 15, 2025

Graphite Automations

"Assign author to pull request" took an action on this PR • (08/15/25)

1 assignee was added to this PR based on John Forstmeier's automation.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🔭 Outside diff range comments (2)
.mise.toml (2)

98-102: Outdated task: references removed Pulumi output DATAMANAGER_API_ACCESS_POLICY_ARN.

The new infra removed API Gateway/IAM helpers; this task will fail.

Apply this diff to disable the task until a replacement exists:

-[tasks."cli:datamanager:authorize"]
-description = "Authorize the CLI with AWS credentials"
-run = """
-aws iam attach-user-policy --user-name {{option(user-name="user-name")}} --policy-arn ${{pulumi stack output DATAMANAGER_API_ACCESS_POLICY_ARN}}
-"""
+# [tasks."cli:datamanager:authorize"]
+# description = "Authorize the CLI with AWS credentials"
+# run = """
+# echo "Deprecated: DATAMANAGER_API_ACCESS_POLICY_ARN output was removed with the migration."
+# exit 1
+# """

104-110: Broken task: upload_grafana_dashboard.nu removed.

This references a file removed by the migration.

Apply this diff to remove or replace the task:

-[tasks."dashboard:upload"]
-description = "Upload Grafana dashboard"
-run = """
-cd infrastructure
-nu upload_grafana_dashboard.nu
-"""
+# Grafana dashboard upload task removed during migration to Swarm.
🧹 Nitpick comments (16)
.flox/env/manifest.toml (1)

14-15: awscli2 & cargo present — please confirm cross-platform availability; keep google-cloud-sdk for now

Short summary:

  • .flox/env/manifest.toml contains awscli2.pkg-path = "awscli2" and cargo.pkg-path = "cargo" (lines 14–15).
  • The manifest's systems list includes "aarch64-darwin" and "aarch64-linux" (lines 22–26) — I did not see x86_64 entries in the provided snippet, so platform coverage is inconclusive.

Locations needing attention:

  • .flox/env/manifest.toml — awscli2 & cargo entries (lines 14–15) and systems list (lines ~22–26).
  • workflows/fetch_data.py — calls ["gcloud", "auth", "print-identity-token"] (around line 15).
  • application/predictionengine/compose.yaml — mounts ~/.config/gcloud/application_default_credentials.json (around line 15).

Requested actions:

  • Verify awscli2 and cargo resolve on all target platforms (evaluate with flox/nix or confirm availability for x86_64-darwin and x86_64-linux if those are required).
  • Do not remove google-cloud-sdk yet — remove/update the gcloud references above before deleting the package.
infrastructure/.gitignore (1)

1-2: LGTM — ignore pyc and local venv

This keeps infra state clean in Git. Optional: consider ignoring common virtualenv names like .venv as well.

Apply:

 *.pyc
-venv/
+venv/
+.venv/
infrastructure/requirements.txt (1)

1-2: Avoid dependency drift between pyproject.toml and requirements.txt

The pyproject (per PR context) pins higher minimums (e.g., pulumi>=3.189.0, pulumi-aws>=7.4.0). Divergent constraints can cause non-reproducible environments.

Options:

  • Prefer a single source of truth (pyproject with uv/pip and drop requirements.txt).
  • Or align constraints here to match pyproject:
-pulumi>=3.0.0,<4.0.0
-pulumi-aws>=7.0.0,<8.0.0
+pulumi>=3.189.0,<4.0.0
+pulumi-aws>=7.4.0,<8.0.0

If you want, I can generate a constraints.txt wired to uv/pip-compile to keep ranges consistent.

infrastructure/.claude/settings.local.json (2)

3-12: Do not commit local, user-specific Claude settings to the repo.

This file hardcodes a local absolute path (Line 10) and grants broad command allowances. It’s environment-specific and better kept out of version control to avoid portability issues and inadvertent exposure of local environment details.

Apply one of the following:

  • Move this file to a local-only path and add the pattern to .gitignore (e.g., infrastructure/.claude/*.local.json).
  • Commit an example template (settings.local.example.json) without absolute paths and instruct devs to copy it locally.

If you prefer to keep it, at minimum replace the absolute path with a home-relative or environment-variable-based path (e.g., $HOME/Library/...).


4-8: Scope the allow-list and defaultMode more tightly.

  • "Bash(nu:*)" and "Bash(docker stack ps:*)" are very permissive; tighten to the specific subcommands used in your flow.
  • defaultMode=acceptEdits can be surprising; consider defaultMode=ask and explicitly allow key commands, reducing accidental execution.
.mise.toml (2)

69-75: Optional: move yamllint inline config to a project config file.

Using an inline -d string is fine, but a .yamllint config improves reuse and editor integration.

- yamllint -d "{extends: relaxed, rules: {line-length: {max: 110}}}" .
+ yamllint .

And add a .yamllint at repo root with the same rules.
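A minimal `.yamllint` equivalent to the inline `-d` string might look like this (a sketch; extend the rules as needed):

```yaml
# .yamllint — mirrors the previous inline config
extends: relaxed
rules:
  line-length:
    max: 110
```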


1-110: Add mise tasks for Swarm workflows.

Given the shift to Docker Swarm, consider adding tasks to:

  • initialize docker context to the manager (ssh with swarm.pem)
  • deploy/update the Swarm stack (e.g., using infrastructure/stack.yml)
  • check Swarm status (docker stack ps)

I can draft these tasks if helpful.
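A sketch of what the context and status tasks could look like in `.mise.toml` (task names are hypothetical, and `MANAGER_IP` is a placeholder for the manager's static IP, e.g. from `pulumi stack output`):

```toml
# Hypothetical mise tasks for the Swarm workflow
[tasks."swarm:context"]
description = "Point the local Docker CLI at the Swarm manager over SSH"
run = """
docker context create swarm --docker "host=ssh://ubuntu@$MANAGER_IP" || true
docker context use swarm
"""

[tasks."swarm:status"]
description = "Show task state for the deployed stack"
run = "docker stack ps infrastructure"
```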

infrastructure/pyproject.toml (1)

6-6: Pinning Python to exactly 3.12.10 may be unnecessarily strict.

If reproducibility is not hard-required, consider a range (e.g., >=3.12,<3.13) to ease local installs and CI images. Otherwise, keep as-is.
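The relaxed pin would be a one-line change in `pyproject.toml` (a sketch, using the standard PEP 621 field):

```toml
[project]
requires-python = ">=3.12,<3.13"
```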

infrastructure/__main__.py (4)

117-130: Ruff FLY002: Prefer a multiline script string over join.

Using " && ".join([...]) makes quoting complex and contributes to E501. A single multi-line shell script (triple-quoted) improves maintainability and avoids quoting traps.

Proposed pattern:

-    create=" && ".join([...]),
+    create="""
+set -euo pipefail
+for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done
+PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
+echo "Using-Public-IP:${PUBIP}"
+STATE=$(sudo docker info --format "{{.Swarm.LocalNodeState}}") || true
+CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true
+if [ "$STATE" != "active" ] || [ "$CTRL" != "true" ]; then
+  sudo docker swarm leave --force || true
+  sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377"
+fi
+for i in {1..30}; do sudo ss -ltn | grep -q ':2377$' && break || sleep 2; done
+sudo docker swarm join-token -q worker | sudo tee /home/ubuntu/worker.token >/dev/null
+sudo docker swarm join-token -q manager | sudo tee /home/ubuntu/manager.token >/dev/null
+""",

This also helps with line lengths; add the file-level E501 ignore if any lines still exceed 88 chars.


51-99: Add return type annotation to mk_instance.

Static analysis flagged missing return type. It returns (Instance, StaticIp).

Apply this diff:

-def mk_instance(name: str, bundle_id: str):
+def mk_instance(name: str, bundle_id: str) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:

63-88: Open Swarm and SSH ports to 0.0.0.0/0; consider restricting to node IPs.

While Swarm uses mutual TLS, exposing 2377, 7946 (TCP/UDP), and 4789/UDP to the entire Internet increases attack surface and noise. Where feasible, restrict these to your known static IPs (/32) for manager↔worker traffic. Keep 80/443 public if required.

I can sketch a variant of InstancePublicPorts that uses the known IP outputs to build CIDR /32 lists per node role.


18-48: Cloud-init hardening nit: pin curl to fail fast and enable retries on repo key fetch.

You already wrap most apt actions with retry; consider adding retry around the GPG key fetch to account for transient network issues.

Example:

-      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
+      retry bash -lc 'curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg'
infrastructure/prometheus.yml (1)

5-9: Optional: add Traefik metrics scrape (and network reachability)

You’ve enabled Traefik Prometheus metrics in stack.yml, but there’s no scrape job here. If you want Traefik metrics:

  • Either attach Prometheus to the public network or attach Traefik to internal.
  • Add a Traefik job (default metrics are on port 8080).

Example job once network reachability is in place:

  - job_name: traefik
    static_configs:
      - targets: ['traefik:8080']

If you keep Prometheus on internal, add traefik to internal networks in stack.yml instead.

infrastructure/main.nu (1)

20-21: Nit: write PEM under ~/.ssh for clarity and standardization

Storing swarm.pem under the working directory can be surprising; ~/.ssh is conventional.

Option:

let pem_path = $"($env.HOME)/.ssh/swarm.pem"
$pem | save --raw --force $pem_path
chmod 600 $pem_path
# and update IdentityFile in the block accordingly
infrastructure/stack.yml (2)

24-53: Optional: expose Traefik metrics to Prometheus network

You’ve enabled Traefik metrics but Prometheus isn’t connected to the public network, so it cannot scrape Traefik. Consider also attaching Traefik to internal or Prometheus to public.

Two options:

  • Attach Traefik to both networks:
   traefik:
@@
-    networks: [public]
+    networks: [public, internal]
  • Or attach Prometheus to public as well:
   prometheus:
@@
-    networks: [internal]
+    networks: [internal, public]

Then add a Traefik scrape job in prometheus.yml.


138-138: Add a newline at EOF

YAMLlint flagged a missing newline at the end of file.

-    networks: [public, internal]
\ No newline at end of file
+    networks: [public, internal]
+
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4dd98e5 and 0e5cd76.

⛔ Files ignored due to path filters (2)
  • .flox/env/manifest.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (24)
  • .flox/env/manifest.toml (1 hunks)
  • .gitignore (1 hunks)
  • .mise.toml (1 hunks)
  • infrastructure/.claude/settings.local.json (1 hunks)
  • infrastructure/.gitignore (1 hunks)
  • infrastructure/.python-version (1 hunks)
  • infrastructure/Pulumi.yaml (1 hunks)
  • infrastructure/__main__.py (1 hunks)
  • infrastructure/api.py (0 hunks)
  • infrastructure/cluster.py (0 hunks)
  • infrastructure/grafana-dashboard.json (0 hunks)
  • infrastructure/images.py (0 hunks)
  • infrastructure/ingress.py (0 hunks)
  • infrastructure/keys.py (0 hunks)
  • infrastructure/main.nu (1 hunks)
  • infrastructure/monitors.py (0 hunks)
  • infrastructure/prometheus.yml (1 hunks)
  • infrastructure/pyproject.toml (1 hunks)
  • infrastructure/requirements.txt (1 hunks)
  • infrastructure/services.py (0 hunks)
  • infrastructure/stack.yml (1 hunks)
  • infrastructure/tags.py (0 hunks)
  • infrastructure/upload_grafana_dashboard.nu (0 hunks)
  • infrastructure/vpc.py (0 hunks)
💤 Files with no reviewable changes (11)
  • infrastructure/vpc.py
  • infrastructure/grafana-dashboard.json
  • infrastructure/tags.py
  • infrastructure/keys.py
  • infrastructure/cluster.py
  • infrastructure/upload_grafana_dashboard.nu
  • infrastructure/images.py
  • infrastructure/api.py
  • infrastructure/monitors.py
  • infrastructure/ingress.py
  • infrastructure/services.py
🧰 Additional context used
📓 Path-based instructions (1)
.mise.toml

📄 CodeRabbit Inference Engine (CLAUDE.md)

Always use mise tasks for running tests, linting, formatting. If there is not a command already present in .mise.toml, suggest creating one.

Files:

  • .mise.toml
🧠 Learnings (1)
📚 Learning: 2025-07-29T13:26:20.287Z
Learnt from: CR
PR: pocketsizefund/pocketsizefund#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-29T13:26:20.287Z
Learning: Applies to .mise.toml : Always use mise tasks for running tests, linting, formatting. If there is not a command already present in .mise.toml, suggest creating one.

Applied to files:

  • .mise.toml
🧬 Code Graph Analysis (1)
infrastructure/prometheus.yml (1)
infrastructure/monitors.py (1)
  • create_prometheus_scraper (7-151)
🪛 YAMLlint (1.37.1)
infrastructure/stack.yml

[error] 138-138: no new line character at the end of file

(new-line-at-end-of-file)

🪛 GitHub Check: Run Python quality checks
infrastructure/__main__.py

[failure] 121-121: Ruff (E501)
infrastructure/__main__.py:121:89: E501 Line too long (97 > 88)

[failure] 120-120: Ruff (E501)
infrastructure/__main__.py:120:89: E501 Line too long (123 > 88)

[failure] 119-119: Ruff (E501)
infrastructure/__main__.py:119:89: E501 Line too long (106 > 88)

[failure] 117-130: Ruff (FLY002)
infrastructure/__main__.py:117:12: FLY002 Consider f-string instead of string join

[failure] 51-51: Ruff (ANN201)
infrastructure/__main__.py:51:5: ANN201 Missing return type annotation for public function mk_instance

[failure] 47-47: Ruff (E501)
infrastructure/__main__.py:47:89: E501 Line too long (131 > 88)

[failure] 40-40: Ruff (E501)
infrastructure/__main__.py:40:89: E501 Line too long (111 > 88)

[failure] 37-37: Ruff (E501)
infrastructure/__main__.py:37:89: E501 Line too long (197 > 88)

[failure] 33-33: Ruff (E501)
infrastructure/__main__.py:33:89: E501 Line too long (109 > 88)

[failure] 27-27: Ruff (E501)
infrastructure/__main__.py:27:89: E501 Line too long (92 > 88)

🪛 GitHub Actions: Code quality check
infrastructure/__main__.py

[error] 27-27: Ruff: E501 Line too long (92 > 88) in infrastructure/__main__.py:27.

🔇 Additional comments (8)
infrastructure/Pulumi.yaml (1)

1-10: Pulumi project metadata update looks good.

Project rename, description, uv toolchain, and pulumi:tags template are consistent and align with the migration.

infrastructure/pyproject.toml (1)

3-12: Dependencies and metadata updates look aligned with the new infra.

Upgrading pulumi and aws provider; adding pulumi-command and pulumi-tls matches the remote + key flow.

infrastructure/__main__.py (3)

143-155: Worker join sequencing and retries look solid.

The wait for Docker readiness and explicit depends_on(get_worker_token) ensure proper ordering. The command is minimal and to the point.


11-16: Lightsail KeyPair handling LGTM.

Generating a TLS key and injecting its public key into a Lightsail KeyPair is a clean, reproducible approach.


161-163: Exporting private key — verified: no unsafe pulumi stack output usages found

Verified the repository for calls that print the exported sshPrivateKeyPem without --show-secrets. No matches found.

Findings:

  • infrastructure/__main__.py:163 — pulumi.export("sshPrivateKeyPem", pulumi.Output.secret(ssh_key.private_key_pem))
  • infrastructure/main.nu:16–20 (line 18) — let pem = (^pulumi stack output --show-secrets sshPrivateKeyPem) then save to swarm.pem and chmod 600 (explicitly uses --show-secrets)
  • infrastructure/__main__.py:106–110 — remote connection uses private_key=ssh_key.private_key_pem (expected; key is kept secret in exports)

Note: infrastructure/.claude/settings.local.json contains a permissive "Bash(pulumi stack output:*)" permission, but there are no direct calls that would expose sshPrivateKeyPem without --show-secrets.

Recommendation: keep the export marked secret (as it is) and continue to avoid printing or storing it in world-readable locations.

infrastructure/prometheus.yml (1)

1-3: Confirm Pulumi/AMP scraper (monitors.py) usage and reconcile Prometheus intervals

I searched the repository for Pulumi/AMP scraper code and Kubernetes scrape configs. Only infrastructure/prometheus.yml sets 15s intervals; I did not find monitors.py, aws.amp.Scraper, or kubernetes_sd_configs.

Files to review:

  • infrastructure/prometheus.yml — lines 1–3:
    global:
    scrape_interval: 15s
    evaluation_interval: 15s

If a Pulumi/AMP scraper (with 15m intervals) exists in another repo or deployment, confirm it’s retired or align its intervals/targets with this static-targets config to avoid conflict.

infrastructure/stack.yml (2)

70-88: Prometheus/cAdvisor/node-exporter connectivity is good; version pins look sane

Prometheus is manager-scoped with a persistent volume and lifecycle enabled; node-exporter and cAdvisor are global with appropriate mounts and sensible memory limits.

Also applies to: 90-118


1-21: Ensure docker stack deploy reads the intended stack.yml (run from infra dir or use an explicit path)

Verified: infrastructure/stack.yml contains configs.prometheus.yml.file: ./prometheus.yml and infrastructure/prometheus.yml exists. infrastructure/main.nu calls:

  • docker stack deploy -c stack.yml infrastructure --detach=false

Notes: the Compose spec resolves config file paths relative to the compose file’s directory (so ./prometheus.yml is correct when the intended stack.yml is the one read), but docker stack deploy can be sensitive to the working directory and there are historical version-specific bugs. Also: main.nu does not use $env.FILE_PWD — it relies on an unqualified stack.yml.

Action (choose one):

  • Ensure callers run the deploy from the infrastructure directory (cd infrastructure) so -c stack.yml is the correct file (mise tasks already cd into infrastructure in several places).
  • Or make the deploy invocation explicit/robust — e.g. deploy with an explicit path: docker stack deploy -c infrastructure/stack.yml ... or use an absolute path.

Files to review:

  • infrastructure/stack.yml — configs.prometheus.yml.file: ./prometheus.yml
  • infrastructure/prometheus.yml — present
  • infrastructure/main.nu — launch-stacks: docker stack deploy -c stack.yml ...
  • .mise.toml — tasks that cd into infrastructure
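One way to make the deploy robust regardless of working directory is a mise task with an explicit path (a sketch; the task name is hypothetical):

```toml
[tasks."swarm:deploy"]
description = "Deploy the stack with an explicit path, independent of cwd"
run = "docker stack deploy -c infrastructure/stack.yml infrastructure --detach=false"
```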

Collaborator

@forstmeier forstmeier left a comment


Mostly agreeing with CR plus some naming tweaks.

Beautiful diff.


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
infrastructure/stack.yml (1)

142-146: Do not commit Grafana admin password; read from a Swarm secret instead.

The admin password is hardcoded in plaintext. You already declare a GRAFANA_ADMIN_PASSWORD secret at the top; wire it into the Grafana service and read it via the __FILE variant.

Apply this diff within the Grafana service:

   grafana:
     image: grafana/grafana:10.4.2
     environment:
       GF_SECURITY_ADMIN_USER: admin
-      # TODO: Change this to a random password
-      GF_SECURITY_ADMIN_PASSWORD: admin123
+      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/GRAFANA_ADMIN_PASSWORD
+    secrets:
+      - GRAFANA_ADMIN_PASSWORD
🧹 Nitpick comments (6)
infrastructure/stack.yml (6)

11-15: traefik_letsencrypt volume is currently unused.

Either remove it or mount it in Traefik for ACME storage as suggested above. Keeping unused volumes adds noise.


75-89: Avoid using the 'latest' tag for Portainer; pin to a specific version.

Pinning ensures repeatable, predictable deployments and avoids surprise upgrades.

Example:

-    image: portainer/portainer-ce:latest
+    image: portainer/portainer-ce:2.21.5

If you prefer a different stable version, update accordingly.


126-139: cAdvisor may require additional privileges on modern Docker/cgroups.

Depending on host OS/cgroups v2, cAdvisor might need extra mounts/caps and can report partial metrics. If you observe gaps, consider:

  • adding device/proc/sys mounts recommended in cAdvisor docs,
  • or switching to node_exporter + container-specific exporters.

Would you like a hardened cAdvisor service snippet tailored for your target OS?
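For reference, a hardened service sketch along the lines of the cAdvisor docs (image tag and mounts are assumptions to verify against your host OS; note that docker stack deploy ignores privileged:, so some metrics may still be unavailable under Swarm):

```yaml
# Sketch only — adjust mounts for your cgroups version before relying on it
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    command:
      - --docker_only=true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    deploy:
      mode: global
```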


22-43: Declared secrets are unused in this stack file.

If these are for app services not included yet, ignore. Otherwise, remove unused secrets to reduce operational overhead and confusion.


67-73: Add resource reservations/limits for Traefik.

You’ve set resources on other services; do the same for Traefik to improve scheduling and avoid noisy-neighbor issues.

Example:

     deploy:
       placement:
         constraints: [node.role == manager]
       update_config:
         parallelism: 1
         order: start-first
+      resources:
+        reservations: { memory: 128M }
+        limits: { memory: 512M }

140-159: Add a Grafana healthcheck for safer rollouts.

A healthcheck helps Swarm detect readiness and can make updates safer alongside start-first on Traefik.

Apply this diff:

   grafana:
     image: grafana/grafana:10.4.2
+    healthcheck:
+      test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health | grep -q 'ok'"]
+      interval: 30s
+      timeout: 5s
+      retries: 5
+      start_period: 30s
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0e5cd76 and d075629.

📒 Files selected for processing (1)
  • infrastructure/stack.yml (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (1)
infrastructure/stack.yml (1)

111-124: Verify bind propagation rslave support in Swarm for node-exporter.

Bind propagation flags can behave differently in Swarm. If deployment errors or metrics are missing, consider dropping rslave or following the official node-exporter-on-Swarm recipe.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
infrastructure/stack.yml (2)

46-67: Enable HTTPS (ACME) and persist certs; also expose metrics on a dedicated entrypoint.

You declare traefik_letsencrypt but don’t mount it or enable ACME; HTTPS on 443 won’t have valid certs. Also, Prometheus metrics are enabled without an entrypoint, and Traefik isn’t on the internal network for Prometheus to reach it.

Apply this diff to add ACME, HTTP→HTTPS, metrics entrypoint, mount the volume, and join the internal network:

     command:
       - --providers.docker=true
       - --providers.docker.endpoint=unix:///var/run/docker.sock
       - --providers.docker.swarmMode=true
       - --providers.docker.exposedByDefault=false
       - --entrypoints.web.address=:80
       - --entrypoints.websecure.address=:443
+      - --entrypoints.web.http.redirections.entryPoint.to=websecure
+      - --entrypoints.web.http.redirections.entryPoint.scheme=https
       - --api.dashboard=true
       - --metrics.prometheus=true
       - --metrics.prometheus.addEntryPointsLabels=true
       - --metrics.prometheus.addServicesLabels=true
+      - --entrypoints.metrics.address=:8080
+      - --metrics.prometheus.entryPoint=metrics
+      # ACME/Let's Encrypt
+      - --certificatesresolvers.le.acme.email=${ACME_EMAIL}
+      - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
+      - --certificatesresolvers.le.acme.httpchallenge=true
+      - --certificatesresolvers.le.acme.httpchallenge.entrypoint=web
     ports:
@@
     volumes:
       - /var/run/docker.sock:/var/run/docker.sock:ro
+      - traefik_letsencrypt:/letsencrypt
@@
-    networks: [public]
+    networks: [public, internal]

Ensure ACME_EMAIL is supplied via .env or environment at deploy time.

Also applies to: 73-73
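One way to supply ACME_EMAIL at deploy time, sketched under the assumption the stack is deployed from infrastructure/stack.yml with stack name infra (docker stack deploy does not read .env files itself):

```shell
export ACME_EMAIL=ops@example.com   # placeholder address
# Substitute only ACME_EMAIL, leaving other ${...} references untouched,
# then deploy the rendered stack from stdin.
envsubst '${ACME_EMAIL}' < infrastructure/stack.yml | docker stack deploy -c - infra
```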


140-159: Don’t pass Grafana admin password via env; use a Swarm secret and serve over HTTPS.

Environment variables are exposed in container metadata; you already declared GRAFANA_ADMIN_PASSWORD as an external secret but don’t use it. Also, route Grafana via TLS with the ACME resolver added above.

Apply this diff:

   grafana:
     image: grafana/grafana:10.4.2
     environment:
       GF_SECURITY_ADMIN_USER: admin
-      # TODO: Change this to a random password
-      GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD}
+      GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/GRAFANA_ADMIN_PASSWORD
+      # Optional but recommended so links use the correct scheme/host:
+      # GF_SERVER_ROOT_URL: https://grafana.example.com
     volumes:
       - grafana_data:/var/lib/grafana
+    secrets:
+      - GRAFANA_ADMIN_PASSWORD
     deploy:
@@
       labels:
         - traefik.enable=true
         - traefik.http.routers.grafana.rule=Host(`grafana.example.com`)
-        - traefik.http.routers.grafana.entrypoints=web
+        - traefik.http.routers.grafana.entrypoints=websecure
+        - traefik.http.routers.grafana.tls=true
+        - traefik.http.routers.grafana.tls.certresolver=le
+        - traefik.docker.network=public
         - traefik.http.services.grafana.loadbalancer.server.port=3000

Make sure the external secret GRAFANA_ADMIN_PASSWORD exists before deploying (docker secret ls).
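One way to create it before the first deploy, sketched with a random password (run on a manager node; the generation method is just one option):

```shell
# Create the external secret the stack references, then confirm it exists
openssl rand -base64 32 | docker secret create GRAFANA_ADMIN_PASSWORD -
docker secret ls --filter name=GRAFANA_ADMIN_PASSWORD
```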

🧹 Nitpick comments (2)
infrastructure/stack.yml (2)

22-43: Many external secrets are declared but unused in this stack.

If they belong to application services deployed in a separate stack, consider removing them here to reduce operational drift and confusion.

Alternatively, add a short README comment pointing to where each secret is used to keep infra clear.


126-139: Confirm Prometheus cAdvisor Scrape Job

We’ve verified that infrastructure/prometheus.yml includes:

  • job_name: cadvisor
    static_configs → targets: ['tasks.cadvisor:8080']

Your cAdvisor service in infrastructure/stack.yml is currently running unprivileged. Without elevated privileges (or additional device mounts), some container and cgroup metrics may be missing. If you observe incomplete data, consider:

• File: infrastructure/stack.yml
Under service cadvisor, add:

   cadvisor:
     image: gcr.io/cadvisor/cadvisor:v0.47.2
+    privileged: true

Alternatively, leave it unprivileged and first validate that you’re receiving the full set of metrics in your environment.
[optional_refactors_recommended]

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d075629 and fe7c1d0.

📒 Files selected for processing (1)
  • infrastructure/stack.yml (1 hunks)
🔇 Additional comments (3)
infrastructure/stack.yml (3)

90-109: Prometheus service looks solid.

Good use of config mounts, TSDB retention, lifecycle, and resource limits. Network isolation to internal is appropriate.


111-125: Prometheus scrape configuration for node-exporter is present and correct

The infrastructure/prometheus.yml file already defines a node-exporter job with:

  • static_configs:
    • targets: ['tasks.node-exporter:9100']

This matches the Docker Swarm service on the internal network, so Prometheus will discover and scrape the node-exporter tasks as intended. No further changes required.


75-83: Pin Portainer to a stable, explicit tag

The latest semantic-versioned Portainer CE release is 2.32.0. Replace the floating latest tag to ensure reproducible deployments and intentional upgrades.

• File: infrastructure/stack.yml
• Lines: 75–83

   portainer:
-    image: portainer/portainer-ce:latest
+    image: portainer/portainer-ce:2.32.0
     ports:
       - "9000:9000"
       - "9443:9443"

Comment thread infrastructure/stack.yml
Comment thread infrastructure/stack.yml
@graphite-app graphite-app Bot requested a review from forstmeier August 20, 2025 02:57
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
infrastructure/Pulumi.yaml (1)

1-1: Ensure automation uses the intended fully-qualified stack names

Confirm CI and local tooling use pocketsizefund/pocketsizefund/<env> to match this project name.

Quick check:

#!/bin/bash
rg -n -C2 -P 'pulumi\s+(up|down|stack\s+(select|output)).*(pocketsizefund/pocketsizefund/[^"'\''\s]+)|PULUMI_STACK'
infrastructure/__main__.py (3)

10-10: SSH open to 0.0.0.0/0 by default is unsafe; require explicit allowed CIDRs.

Defaulting SSH to world-accessible is a serious risk in a production environment. Make it a required config value, or fail fast if the default would be 0.0.0.0/0.

Apply this diff:

-allowed_ssh_cidrs = pulumi.Config().get_object("allowedSshCidrs") or ["0.0.0.0/0"]
+cfg = pulumi.Config()
+allowed_ssh_cidrs = cfg.get_object("allowedSshCidrs")
+if not allowed_ssh_cidrs or "0.0.0.0/0" in allowed_ssh_cidrs:
+    raise pulumi.RunError(
+        'Refusing to open SSH (22/tcp) to 0.0.0.0/0. '
+        'Set "allowedSshCidrs" to trusted CIDR(s), e.g.: '
+        "pulumi config set allowedSshCidrs '[\"203.0.113.42/32\"]'"
+    )

21-61: Unblock lints and improve maintainability by moving cloud-init to a file.

The inline cloud-init causes Ruff E501 failures and is hard to maintain. Read it from infrastructure/cloud-init.yml instead.

Apply this diff:

+from pathlib import Path
@@
-cloud_init = """#cloud-config
-package_update: true
-package_upgrade: true
-write_files:
-  - path: /usr/local/bin/install-docker.sh
-    permissions: "0755"
-    content: |
-      #!/usr/bin/env bash
-      set -euo pipefail
-      retry() {
-        for i in {1..10}; do "$@" && break || { sleep 3; echo "retry $i"; };
-        done
-      }
-
-      retry apt-get update
-      retry apt-get install -y ca-certificates curl gnupg
-
-      install -m 0755 -d /etc/apt/keyrings
-      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
-        gpg --dearmor -o /etc/apt/keyrings/docker.gpg
-      chmod a+r /etc/apt/keyrings/docker.gpg
-
-      . /etc/os-release
-      echo "deb [arch=$(dpkg --print-architecture) \
-        signed-by=/etc/apt/keyrings/docker.gpg] \
-        https://download.docker.com/linux/ubuntu ${VERSION_CODENAME} stable" > \
-        /etc/apt/sources.list.d/docker.list
-
-      retry apt-get update
-      retry apt-get install -y docker-ce docker-ce-cli containerd.io \
-        docker-buildx-plugin docker-compose-plugin
-
-      usermod -aG docker ubuntu || true
-      systemctl enable --now docker
-
-runcmd:
-  - /usr/local/bin/install-docker.sh
-  - bash -lc 'for i in {1..60}; do \
-      [ -x /usr/bin/docker ] && systemctl is-active --quiet docker && \
-      exit 0 || sleep 2; done; exit 1'
-"""
+cloud_init = Path(__file__).with_name("cloud-init.yml").read_text()

And add the removed content to infrastructure/cloud-init.yml (same repo path).


129-165: Broken quoting and PUBIP scope: swarm init will not see $PUBIP; simplify to one bash -lc.

Each bash -lc '...' runs in a separate shell; PUBIP set in one won’t be visible in the next. Also, the trailing \' leaves an unmatched single quote. This will lead to init failures.

Replace the multi-fragment command with a single, readable script:

- wait_for_docker = (
-     "bash -lc 'for i in {1..120}; do sudo docker info >/dev/null 2>&1 && "
-     "break || sleep 3; done'"
- )
- get_public_ip = (
-     "bash -lc 'PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/"
-     "public-ipv4); echo Using-Public-IP:$PUBIP'"
- )
- init_swarm = (
-     "bash -lc 'STATE=$(sudo docker info --format "
-     "{{.Swarm.LocalNodeState}}") || true; "
-     'CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true; '
-     '[ "$STATE" = active -a "$CTRL" = true ] || '
-     "(sudo docker swarm leave --force || true; "
-     ' sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377")\''
- )
- wait_for_swarm = (
-     "bash -lc 'for i in {1..30}; do sudo ss -ltn | "
-     'awk "\\$4 ~ /:2377$/" && break || sleep 2; done\''
- )
- save_worker_token = (
-     "bash -lc 'sudo docker swarm join-token -q worker | "
-     "sudo tee /home/ubuntu/worker.token >/dev/null'"
- )
- save_manager_token = (
-     "bash -lc 'sudo docker swarm join-token -q manager | "
-     "sudo tee /home/ubuntu/manager.token >/dev/null'"
- )
-
- init_manager = remote.Command(
-     "init-manager",
-     connection=conn(mgr_ip.ip_address),
-     create=(
-         f"{wait_for_docker} && {get_public_ip} && {init_swarm} && "
-         f"{wait_for_swarm} && {save_worker_token} && {save_manager_token}"
-     ),
- )
+init_manager = remote.Command(
+    "init-manager",
+    connection=conn(mgr_ip.ip_address),
+    create="""bash -lc '
+set -e
+# Wait for Docker daemon
+for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done
+# Discover public IP (Lightsail/EC2 IMDS)
+PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
+echo Using-Public-IP:"$PUBIP"
+# Initialize swarm if not active/controlling
+STATE=$(sudo docker info --format "{{.Swarm.LocalNodeState}}") || true
+CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true
+if [ "$STATE" != "active" ] || [ "$CTRL" != "true" ]; then
+  sudo docker swarm leave --force || true
+  sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377"
+fi
+# Wait for swarm port to be listening
+for i in {1..30}; do sudo ss -ltn | awk '"'"'$4 ~ /:2377$/'"'"' && break || sleep 2; done
+# Persist tokens for later retrieval
+sudo docker swarm join-token -q worker | sudo tee /home/ubuntu/worker.token >/dev/null
+sudo docker swarm join-token -q manager | sudo tee /home/ubuntu/manager.token >/dev/null
+'""",
+)

Note the use of '"'"' to embed single quotes safely inside the single-quoted script (it is an ordinary quoted string, not a heredoc).

🧹 Nitpick comments (17)
.claude/settings.local.json (4)

20-20: Restrict flox activation to non-interactive forms to avoid hangs.

Bash(flox activate:*) may spawn an interactive shell and hang CI/editor sessions. Prefer non-interactive commands (e.g., flox run or activate with an explicit --command), and narrow the pattern to exactly what the workflows require.

If feasible, narrow the permission:

-      "Bash(flox activate:*)"
+      "Bash(flox run:*)",
+      "Bash(flox activate --command:*)"

Please confirm the exact flox invocation your Nushell/Pulumi orchestration needs so we can scope this precisely.


18-18: Addition of gh issue close aligns with existing gh allowances—confirm repo scope.

Granting Bash(gh issue close:*) matches the existing create/view capabilities. Just ensure these commands run only in the intended repository context to avoid accidental closes elsewhere.

If you want to be extra cautious, consider documenting this permission in CONTRIBUTING.md and limiting usage to automation that references the current repo explicitly (e.g., --repo pocketsizefund/pocketsizefund).
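For example, a close that is explicitly scoped (the issue number is a placeholder):

```shell
# --repo pins the target repo instead of inferring it from the working directory
gh issue close 123 --repo pocketsizefund/pocketsizefund
```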


1-3: Confirm that committing a “local” settings file is intentional.

The filename suggests developer-local overrides. If this is meant to be user-specific, consider adding it to .gitignore and using a team-wide non-local settings file for shared permissions. If team-wide, the name may cause confusion.

I can propose a follow-up PR to standardize naming (e.g., .claude/settings.team.json) and update ignore rules if desired.


19-19: Scope or whitelist WebSearch instead of enabling it globally

The root settings file includes WebSearch, but the infra-specific config does not. Granting global WebSearch enables broad outbound access across all assistant workflows. If you only need external lookups for specific docs or infra tasks, consider narrowing the scope:

• File to update:

  • .claude/settings.local.json (entry "WebSearch")
    • Options:
  • Replace with WebFetch(domain:…) for known domains
  • Move WebSearch into infrastructure/.claude/settings.local.json or another narrow config
  • Remove the global entry entirely

Suggested diff:

       "Bash(gh issue close:*)",
-      "WebSearch",
       "Bash(flox activate:*)"
Dockerfile.tests (2)

3-3: Pin uv image tag/digest for reproducible builds

Using ghcr.io/astral-sh/uv:latest can introduce non-determinism as the tag moves. Pin a version/tag or digest.

Apply:

-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.5.6 /uv /bin/uv

(Replace 0.5.6 with the exact version you standardize on.)


14-14: Question: ENV PYTHON=1 appears unused

If not consumed by tooling/tests, drop it to keep the image environment minimal.

pyproject.toml (2)

123-123: Broad Bandit ignore (S105) on infrastructure/__main__.py; prefer narrowing

S105 flags hardcoded secrets. A file-wide ignore can mask real issues. Prefer inline # noqa: S105 on specific literals or a targeted per-file rule with context (e.g., only test fake creds).

Example narrowing:

[tool.ruff.lint.per-file-ignores]
"infrastructure/__main__.py" = ["S105"]  # TODO: replace with line-level noqa after refactor

Or remove this and add inline # noqa: S105 with a rationale where necessary.


5-5: Exact Python pin is consistent; consider minor-range pin if patch bumps are safe

requires-python = "==3.12.10" ensures strict alignment (good). If you want to pick up security patch releases without editing config, consider a compatible-release pin such as ~=3.12.0 (which tracks 3.12.x patches only).
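The two pinning styles side by side, as a config sketch (note that under PEP 440 a bare ~=3.12 would also admit 3.13.x, while ~=3.12.0 stays within 3.12.x):

```toml
[project]
# exact pin: fully reproducible, but every patch release needs a config edit
requires-python = "==3.12.10"

# compatible-release pin: tracks 3.12.x patches automatically
# requires-python = "~=3.12.0"
```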

application/datamanager/Dockerfile.test (1)

2-2: Pin uv image for reproducibility

Same concern as in Dockerfile.tests—avoid :latest for deterministic builds.

-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.5.6 /uv /bin/uv
infrastructure/prometheus.yml (1)

10-13: Consider DNS SD for Traefik to scrape all replicas

If Traefik is scaled >1, traefik:8080 hits the VIP and may not scrape every task each interval. Prefer dns_sd_configs with tasks.traefik.

-  - job_name: traefik
-    metrics_path: /metrics
-    static_configs:
-      - targets: ['traefik:8080']
+  - job_name: traefik
+    metrics_path: /metrics
+    dns_sd_configs:
+      - names: ['tasks.traefik']
+        type: A
+        port: 8080

Also ensure Traefik metrics are enabled in the stack (prometheus provider and /metrics exposed).

infrastructure/README.md (4)

19-21: Avoid brittle references to line numbers; fix MD034 bare URL and clarify where to edit Grafana domain.

Referencing “stack.yml line 162” will rot quickly. Point to the Traefik router label instead and wrap the URL to satisfy markdownlint.

Apply this diff:

-- **URL**: https://grafana.example.com (update the domain in `stack.yml` line 162)
+- **URL**: <https://grafana.example.com> (update the Traefik router Host() rule in `stack.yml`; search for `traefik.http.routers.grafana.rule`)

28-31: Fix MD034 bare URL warning for Prometheus.

Wrap the URL in angle brackets to silence markdownlint and improve consistency.

-- **Internal URL**: http://prometheus:9090 (only accessible from within the swarm)
+- **Internal URL**: <http://prometheus:9090> (only accessible from within the swarm)

43-72: Align docs with actual Prometheus config (static vs DNS SD).

This section labels static targets as a “Known Issue” and recommends DNS service discovery. Per the PR summary and file comments, infrastructure/prometheus.yml already uses dns_sd_configs. If that’s true, update this section to describe the current state (DNS SD), or remove the “Known Issue” framing.

Proposed wording if DNS SD is already in place:

-### Current Issue: Prometheus Service Discovery
-
-⚠️ **Known Issue**: The current Prometheus configuration uses static targets that only scrape one replica per service:
-...
-**Problem**: This resolves to only one IP address, missing metrics from other swarm nodes.
-
-**Solution**: Switch to DNS service discovery to scrape all replicas:
+### Prometheus Service Discovery
+
+The Prometheus configuration uses DNS service discovery to scrape all replicas of global services:
 ...
-# Recommended configuration
+# Active configuration
 - job_name: node-exporter
   dns_sd_configs:
     - names: ['tasks.node-exporter']
       type: A
       port: 9100
 ...

115-132: Good security guidance; consider making SSH CIDR restriction a hard requirement.

Docs are solid. Given the infra is “production,” consider matching the code to this guidance by refusing to deploy with SSH open to 0.0.0.0/0 unless explicitly overridden.

infrastructure/__main__.py (3)

78-103: Narrow exposed ports to the minimum per node role.

80/443 are opened on every instance (manager and workers). If Traefik only runs on manager(s), you can reduce attack surface by not opening 80/443 on worker nodes.

Option A (parameterize):

-def mk_instance(
-    name: str, bundle_id: str
-) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:
+def mk_instance(
+    name: str, bundle_id: str, expose_web: bool = False
+) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:
@@
-    aws.lightsail.InstancePublicPorts(
+    web_ports = []
+    if expose_web:
+        web_ports = [
+            aws.lightsail.InstancePublicPortsPortInfoArgs(
+                from_port=80, to_port=80, protocol="tcp", cidrs=["0.0.0.0/0"]
+            ),
+            aws.lightsail.InstancePublicPortsPortInfoArgs(
+                from_port=443, to_port=443, protocol="tcp", cidrs=["0.0.0.0/0"]
+            ),
+        ]
+    aws.lightsail.InstancePublicPorts(
         f"{name}-ports",
         instance_name=inst.name,
-        port_infos=[
+        port_infos=[
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=22, to_port=22, protocol="tcp", cidrs=allowed_ssh_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=2377, to_port=2377, protocol="tcp", cidrs=swarm_manager_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=7946, to_port=7946, protocol="tcp", cidrs=swarm_cluster_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=7946, to_port=7946, protocol="udp", cidrs=swarm_cluster_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=4789, to_port=4789, protocol="udp", cidrs=swarm_cluster_allowed_cidrs
             ),
-            aws.lightsail.InstancePublicPortsPortInfoArgs(
-                from_port=80, to_port=80, protocol="tcp", cidrs=["0.0.0.0/0"]
-            ),
-            aws.lightsail.InstancePublicPortsPortInfoArgs(
-                from_port=443, to_port=443, protocol="tcp", cidrs=["0.0.0.0/0"]
-            ),
-        ],
+        ] + web_ports,
     )

Then call with expose_web=True for the manager only.


168-175: Token handling looks fine; minor improvement: trim at source.

Current code trims via apply(lambda s: s.strip()). You can trim server-side to avoid accidental trailing newlines earlier, but this is non-blocking.


196-198: Exporting the private SSH key increases blast radius; prefer local file or external secret.

Even as a Pulumi secret, exporting sshPrivateKeyPem propagates it into stack state and any automation outputs. Prefer one of:

  • Generate the key locally and pass the public key via config; keep the private key out of state.
  • Use pulumi-command local to write the key to a local file and don’t export it.

If you want, I can provide a small helper that writes ssh_key.private_key_pem to infrastructure/swarm.pem locally (ignored by .gitignore) and removes the export.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between fe7c1d0 and dbfa692.

⛔ Files ignored due to path filters (1)
  • .flox/env/manifest.lock is excluded by !**/*.lock
📒 Files selected for processing (15)
  • .claude/commands/infrastructure-up.md (1 hunks)
  • .claude/commands/lint.md (1 hunks)
  • .claude/commands/pytests.md (1 hunks)
  • .claude/settings.local.json (1 hunks)
  • .flox/env/manifest.toml (1 hunks)
  • Dockerfile.tests (1 hunks)
  • application/datamanager/Dockerfile.test (1 hunks)
  • infrastructure/Pulumi.yaml (1 hunks)
  • infrastructure/README.md (1 hunks)
  • infrastructure/__main__.py (1 hunks)
  • infrastructure/main.nu (1 hunks)
  • infrastructure/prometheus.yml (1 hunks)
  • infrastructure/requirements.txt (1 hunks)
  • infrastructure/stack.yml (1 hunks)
  • pyproject.toml (1 hunks)
✅ Files skipped from review due to trivial changes (3)
  • .claude/commands/lint.md
  • .claude/commands/infrastructure-up.md
  • .claude/commands/pytests.md
🚧 Files skipped from review as they are similar to previous changes (4)
  • .flox/env/manifest.toml
  • infrastructure/requirements.txt
  • infrastructure/main.nu
  • infrastructure/stack.yml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T13:54:49.196Z
Learnt from: forstmeier
PR: pocketsizefund/pocketsizefund#613
File: infrastructure/Pulumi.yaml:1-10
Timestamp: 2025-08-15T13:54:49.196Z
Learning: The user forstmeier prefers Pulumi stack naming convention as `pocketsizefund/pocketsizefund/production` to maintain organization/repository/environment mapping pattern for the pocketsizefund project.

Applied to files:

  • infrastructure/Pulumi.yaml
🧬 Code Graph Analysis (1)
infrastructure/prometheus.yml (1)
infrastructure/monitors.py (1)
  • create_prometheus_scraper (7-151)
🪛 LanguageTool
infrastructure/README.md

32 low-confidence hits of the form "[grammar] There might be a mistake here. (QB_NEW_EN)", all triggered by the README's bullet lists (lines ~9 through ~218); none point to a concrete error.

🪛 markdownlint-cli2 (0.17.2)
infrastructure/README.md

19-19: Bare URL used

(MD034, no-bare-urls)


29-29: Bare URL used

(MD034, no-bare-urls)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Run Python tests
🔇 Additional comments (7)
.claude/settings.local.json (1)

1-25: LGTM on JSON validity and placement within allow list.

The structure is valid JSON, integrates cleanly with existing entries, and preserves deny as empty.

Dockerfile.tests (1)

1-1: Python versions are aligned; verify CI pinning
All pyproject.toml files pin requires-python = "==3.12.10" and every Dockerfile uses FROM python:3.12.10 (one test image uses 3.12.10-slim, which still matches the patch version).

• workflows/pyproject.toml, cli/pyproject.toml, pyproject.toml, application/.../pyproject.toml, infrastructure/pyproject.toml
• Dockerfile.tests, application/predictionengine/Dockerfile, application/datamanager/Dockerfile, application/datamanager/Dockerfile.test, application/positionmanager/Dockerfile

Please confirm your CI workflows (e.g. in .github/workflows/*.yml) also pin python-version: 3.12.10 to avoid environment drift.
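A matching workflow fragment might look like this (file path and step placement are assumptions):

```yaml
# .github/workflows/tests.yml (sketch)
- uses: actions/setup-python@v5
  with:
    python-version: "3.12.10"   # keep in lockstep with pyproject and Dockerfiles
```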

application/datamanager/Dockerfile.test (1)

1-1: Good: Aligns to Python 3.12.10-slim

Slim base reduces image size and matches repo-wide versioning.

infrastructure/Pulumi.yaml (1)

2-10: Pulumi project metadata and uv toolchain config LGTM

name: pocketsizefund aligns with the preferred stack naming pattern and the uv toolchain block is correct for Python projects.

infrastructure/prometheus.yml (3)

6-8: Prometheus self-scrape target OK

Static target is fine for a single Prometheus instance.


15-20: Good use of Swarm DNS SD for node-exporter

This will enumerate all global tasks; no VIP skew.


21-25: Good use of Swarm DNS SD for cAdvisor

Covers all replicas across nodes.

forstmeier
forstmeier previously approved these changes Aug 20, 2025
Collaborator

@forstmeier forstmeier left a comment

Just comments/questions but nothing that holds up merging. Merge whenever.

Collaborator

Why is a requirements.txt file required?

Comment thread infrastructure/README.md Outdated
Collaborator

I hate having a README doc but we can move it into code later.

@forstmeier forstmeier added this to the Refactor milestone Aug 21, 2025
@chrisaddy chrisaddy merged commit b376faa into master Aug 23, 2025
1 of 2 checks passed
@chrisaddy chrisaddy deleted the 08-14-docker_swarm branch August 23, 2025 16:17
3 participants