Note: Other AI code review bot(s) detected — CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Caution: Review failed — the pull request is closed.

📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
📒 Files selected for processing (25)
✨ Finishing Touches
🧪 Generate unit tests
Pull Request Overview
This PR significantly refactors the infrastructure setup by replacing Kubernetes/EKS with Docker Swarm on AWS Lightsail instances. The change simplifies the architecture by moving from a complex Kubernetes-based serverless setup to a more straightforward Docker Swarm deployment.
Key changes:
- Complete replacement of Kubernetes/EKS infrastructure with Docker Swarm on AWS Lightsail
- Removal of Knative, API Gateway, and associated AWS services
- Introduction of Docker Compose stack configuration for monitoring services (Traefik, Prometheus, Grafana)
- Addition of automated deployment scripts using Nushell
Reviewed Changes
Copilot reviewed 23 out of 26 changed files in this pull request and generated 6 comments.
Summary per file:

| File | Description |
|---|---|
| `__main__.py` | Complete rewrite to provision AWS Lightsail instances and set up Docker Swarm cluster |
| `stack.yml` | New Docker Compose stack configuration defining monitoring and networking services |
| `main.nu` | New Nushell script for automated infrastructure deployment and stack management |
| `prometheus.yml` | New Prometheus configuration for monitoring the Docker Swarm cluster |
| Multiple deleted files | Removal of VPC, Kubernetes, Knative, and API Gateway related infrastructure code |
Graphite Automations: "Assign author to pull request" took an action on this PR (08/15/25) — 1 assignee was added based on John Forstmeier's automation.
Actionable comments posted: 12
🔭 Outside diff range comments (2)
.mise.toml (2)
98-102: Outdated task: references removed Pulumi output `DATAMANAGER_API_ACCESS_POLICY_ARN`. The new infra removed the API Gateway/IAM helpers, so this task will fail.
Apply this diff to disable the task until a replacement exists:
```diff
-[tasks."cli:datamanager:authorize"]
-description = "Authorize the CLI with AWS credentials"
-run = """
-aws iam attach-user-policy --user-name {{option(user-name="user-name")}} --policy-arn ${{pulumi stack output DATAMANAGER_API_ACCESS_POLICY_ARN}}
-"""
+# [tasks."cli:datamanager:authorize"]
+# description = "Authorize the CLI with AWS credentials"
+# run = """
+# echo "Deprecated: DATAMANAGER_API_ACCESS_POLICY_ARN output was removed with the migration."
+# exit 1
+# """
```
104-110: Broken task: upload_grafana_dashboard.nu removed. This references a file removed by the migration.
Apply this diff to remove or replace the task:
```diff
-[tasks."dashboard:upload"]
-description = "Upload Grafana dashboard"
-run = """
-cd infrastructure
-nu upload_grafana_dashboard.nu
-"""
+# Grafana dashboard upload task removed during migration to Swarm.
```
🧹 Nitpick comments (16)
.flox/env/manifest.toml (1)
14-15: awscli2 & cargo present — please confirm cross-platform availability; keep google-cloud-sdk for now. Short summary:
- .flox/env/manifest.toml contains awscli2.pkg-path = "awscli2" and cargo.pkg-path = "cargo" (lines 14–15).
- The manifest's systems list includes "aarch64-darwin" and "aarch64-linux" (lines 22–26) — I did not see x86_64 entries in the provided snippet, so platform coverage is inconclusive.
Locations needing attention:
- .flox/env/manifest.toml — awscli2 & cargo entries (lines 14–15) and systems list (lines ~22–26).
- workflows/fetch_data.py — calls ["gcloud", "auth", "print-identity-token"] (around line 15).
- application/predictionengine/compose.yaml — mounts ~/.config/gcloud/application_default_credentials.json (around line 15).
Requested actions:
- Verify awscli2 and cargo resolve on all target platforms (evaluate with flox/nix or confirm availability for x86_64-darwin and x86_64-linux if those are required).
- Do not remove google-cloud-sdk yet — remove/update the gcloud references above before deleting the package.
infrastructure/.gitignore (1)
1-2: LGTM — ignore pyc and local venv. This keeps infra state clean in Git. Optional: consider ignoring common virtualenv names like .venv as well.
Apply:
```diff
 *.pyc
-venv/
+venv/
+.venv/
```

infrastructure/requirements.txt (1)
1-2: Avoid dependency drift between pyproject.toml and requirements.txt. The pyproject (per PR context) pins higher minimums (e.g., pulumi>=3.189.0, pulumi-aws>=7.4.0). Divergent constraints can cause non-reproducible environments.
Options:
- Prefer a single source of truth (pyproject with uv/pip and drop requirements.txt).
- Or align constraints here to match pyproject:
```diff
-pulumi>=3.0.0,<4.0.0
-pulumi-aws>=7.0.0,<8.0.0
+pulumi>=3.189.0,<4.0.0
+pulumi-aws>=7.4.0,<8.0.0
```

If you want, I can generate a constraints.txt wired to uv/pip-compile to keep ranges consistent.
infrastructure/.claude/settings.local.json (2)
3-12: Do not commit local, user-specific Claude settings to the repo. This file hardcodes a local absolute path (Line 10) and grants broad command allowances. It's environment-specific and better kept out of version control to avoid portability issues and inadvertent exposure of local environment details.
Apply one of the following:
- Move this file to a local-only path and add the pattern to .gitignore (e.g., infrastructure/.claude/*.local.json).
- Commit an example template (settings.local.example.json) without absolute paths and instruct devs to copy it locally.
If you prefer to keep it, at minimum replace the absolute path with a home-relative or environment-variable-based path (e.g., $HOME/Library/...).
4-8: Scope the allow-list and defaultMode more tightly.
- `Bash(nu:*)` and `Bash(docker stack ps:*)` are very permissive; tighten to the specific subcommands used in your flow.
- defaultMode=acceptEdits can be surprising; consider defaultMode=ask and explicitly allow key commands, reducing accidental execution.
.mise.toml (2)
69-75: Optional: move yamllint inline config to a project config file. Using an inline `-d` string is fine, but a .yamllint config improves reuse and editor integration.
```diff
-yamllint -d "{extends: relaxed, rules: {line-length: {max: 110}}}" .
+yamllint .
```

And add a .yamllint at repo root with the same rules.
1-110: Add mise tasks for Swarm workflows. Given the shift to Docker Swarm, consider adding tasks to:
- initialize docker context to the manager (ssh with swarm.pem)
- deploy/update the Swarm stack (e.g., using infrastructure/stack.yml)
- check Swarm status (docker stack ps)
I can draft these tasks if helpful.
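For example, a rough sketch of such tasks (task names, the `MANAGER_IP` variable, and the docker-context approach are all assumptions, not existing config):

```toml
# Hypothetical mise tasks — adjust names/paths to match the repo's conventions.
[tasks."swarm:context"]
description = "Point a docker context at the Swarm manager over SSH"
run = """
docker context create swarm --docker "host=ssh://ubuntu@$MANAGER_IP" || true
docker context use swarm
"""

[tasks."swarm:deploy"]
description = "Deploy/update the monitoring stack"
run = """
cd infrastructure
docker stack deploy -c stack.yml infrastructure --detach=false
"""

[tasks."swarm:status"]
description = "Check Swarm stack status"
run = "docker stack ps infrastructure"
```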
infrastructure/pyproject.toml (1)
6-6: Pinning Python to exactly 3.12.10 may be unnecessarily strict. If reproducibility is not hard-required, consider a range (e.g., >=3.12,<3.13) to ease local installs and CI images. Otherwise, keep as-is.
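Sketch of the relaxed pin, assuming a standard PEP 621 `[project]` table in the pyproject:

```toml
[project]
requires-python = ">=3.12,<3.13"
```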
infrastructure/__main__.py (4)
117-130: Ruff FLY002: prefer a multiline script string over join. Using `" && ".join([...])` makes quoting complex and contributes to E501. A single multi-line shell script (triple-quoted) improves maintainability and avoids quoting traps.
Proposed pattern:
```diff
-    create=" && ".join([...]),
+    create="""
+set -euo pipefail
+for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done
+PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
+echo "Using-Public-IP:${PUBIP}"
+STATE=$(sudo docker info --format "{{.Swarm.LocalNodeState}}") || true
+CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true
+if ! [ "$STATE" = active -a "$CTRL" = true ]; then
+  sudo docker swarm leave --force || true
+  sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377"
+fi
+for i in {1..30}; do sudo ss -ltn | awk '$4 ~ /:2377$/' && break || sleep 2; done
+sudo docker swarm join-token -q worker | sudo tee /home/ubuntu/worker.token >/dev/null
+sudo docker swarm join-token -q manager | sudo tee /home/ubuntu/manager.token >/dev/null
+""",
```

This also helps with line lengths; add the file-level E501 ignore if any lines still exceed 88 chars.
51-99: Add return type annotation to mk_instance. Static analysis flagged the missing return type. It returns (Instance, StaticIp).
Apply this diff:
```diff
-def mk_instance(name: str, bundle_id: str):
+def mk_instance(name: str, bundle_id: str) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:
```
63-88: Open Swarm and SSH ports to 0.0.0.0/0; consider restricting to node IPs. While Swarm uses mutual TLS, exposing 2377, 7946 (TCP/UDP), and 4789/UDP to the entire Internet increases attack surface and noise. Where feasible, restrict these to your known static IPs (/32) for manager↔worker traffic. Keep 80/443 public if required.
I can sketch a variant of InstancePublicPorts that uses the known IP outputs to build CIDR /32 lists per node role.
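A minimal sketch of that idea — the `to_cidr32` helper is hypothetical, and the commented `InstancePublicPorts` wiring is drawn from the pulumi-aws Lightsail API shape, so verify argument names against the provider docs before using:

```python
def to_cidr32(ips: list[str]) -> list[str]:
    """Turn bare node IPs into /32 CIDR blocks for port rules."""
    return [ip if "/" in ip else f"{ip}/32" for ip in ips]

# In the Pulumi program this list would come from the StaticIp outputs;
# placeholder addresses are used here for illustration.
node_ips = ["203.0.113.10", "203.0.113.11"]
swarm_cidrs = to_cidr32(node_ips)

# Sketch (assumes aws.lightsail.InstancePublicPorts / PortInfoArgs exist as named):
# aws.lightsail.InstancePublicPorts(
#     "manager-ports",
#     instance_name=manager.name,
#     port_infos=[
#         aws.lightsail.InstancePublicPortsPortInfoArgs(
#             protocol="tcp", from_port=2377, to_port=2377, cidrs=swarm_cidrs,
#         ),
#         # restrict 7946 tcp/udp and 4789/udp the same way;
#         # keep 80/443 open to 0.0.0.0/0 if they must stay public.
#     ],
# )
print(swarm_cidrs)  # prints: ['203.0.113.10/32', '203.0.113.11/32']
```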
18-48: Cloud-init hardening nit: pin curl to fail fast and enable retries on the repo key fetch. You already wrap most apt actions with retry; consider adding retry around the GPG key fetch to account for transient network issues.
Example:
```diff
-curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
+retry bash -lc 'curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /etc/apt/keyrings/docker.gpg'
```

infrastructure/prometheus.yml (1)
5-9: Optional: add a Traefik metrics scrape (and network reachability). You've enabled Traefik Prometheus metrics in stack.yml, but there's no scrape job here. If you want Traefik metrics:
- Either attach Prometheus to the public network or attach Traefik to internal.
- Add a Traefik job (default metrics are on port 8080).
Example job once network reachability is in place:
```yaml
- job_name: traefik
  static_configs:
    - targets: ['traefik:8080']
```

If you keep Prometheus on internal, add traefik to internal networks in stack.yml instead.
infrastructure/main.nu (1)
20-21: Nit: write the PEM under ~/.ssh for clarity and standardization. Storing swarm.pem under the working directory can be surprising; ~/.ssh is conventional.
Option:
```nu
let pem_path = $"($env.HOME)/.ssh/swarm.pem"
$pem | save --raw --force $pem_path
chmod 600 $pem_path
# and update IdentityFile in the block accordingly
```

infrastructure/stack.yml (2)
24-53: Optional: expose Traefik metrics to the Prometheus network. You've enabled Traefik metrics but Prometheus isn't connected to the public network, so it cannot scrape Traefik. Consider also attaching Traefik to internal or Prometheus to public.
Two options:
- Attach Traefik to both networks:
```diff
 traefik:
 @@
-  networks: [public]
+  networks: [public, internal]
```
- Or attach Prometheus to public as well:
```diff
 prometheus:
 @@
-  networks: [internal]
+  networks: [internal, public]
```

Then add a Traefik scrape job in prometheus.yml.
138-138: Add a newline at EOF. YAMLlint flagged a missing newline at the end of the file.
```diff
-networks: [public, internal]
\ No newline at end of file
+networks: [public, internal]
+
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (2)
- `.flox/env/manifest.lock` is excluded by `!**/*.lock`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (24)
- `.flox/env/manifest.toml` (1 hunks)
- `.gitignore` (1 hunks)
- `.mise.toml` (1 hunks)
- `infrastructure/.claude/settings.local.json` (1 hunks)
- `infrastructure/.gitignore` (1 hunks)
- `infrastructure/.python-version` (1 hunks)
- `infrastructure/Pulumi.yaml` (1 hunks)
- `infrastructure/__main__.py` (1 hunks)
- `infrastructure/api.py` (0 hunks)
- `infrastructure/cluster.py` (0 hunks)
- `infrastructure/grafana-dashboard.json` (0 hunks)
- `infrastructure/images.py` (0 hunks)
- `infrastructure/ingress.py` (0 hunks)
- `infrastructure/keys.py` (0 hunks)
- `infrastructure/main.nu` (1 hunks)
- `infrastructure/monitors.py` (0 hunks)
- `infrastructure/prometheus.yml` (1 hunks)
- `infrastructure/pyproject.toml` (1 hunks)
- `infrastructure/requirements.txt` (1 hunks)
- `infrastructure/services.py` (0 hunks)
- `infrastructure/stack.yml` (1 hunks)
- `infrastructure/tags.py` (0 hunks)
- `infrastructure/upload_grafana_dashboard.nu` (0 hunks)
- `infrastructure/vpc.py` (0 hunks)
💤 Files with no reviewable changes (11)
- infrastructure/vpc.py
- infrastructure/grafana-dashboard.json
- infrastructure/tags.py
- infrastructure/keys.py
- infrastructure/cluster.py
- infrastructure/upload_grafana_dashboard.nu
- infrastructure/images.py
- infrastructure/api.py
- infrastructure/monitors.py
- infrastructure/ingress.py
- infrastructure/services.py
🧰 Additional context used
📓 Path-based instructions (1)
.mise.toml
📄 CodeRabbit Inference Engine (CLAUDE.md)
Always use mise tasks for running tests, linting, formatting. If there is not a command already present in .mise.toml, suggest creating one.
Files:
.mise.toml
🧠 Learnings (1)
📚 Learning: 2025-07-29T13:26:20.287Z
Learnt from: CR
PR: pocketsizefund/pocketsizefund#0
File: CLAUDE.md:0-0
Timestamp: 2025-07-29T13:26:20.287Z
Learning: Applies to .mise.toml : Always use mise tasks for running tests, linting, formatting. If there is not a command already present in .mise.toml, suggest creating one.
Applied to files:
.mise.toml
🧬 Code Graph Analysis (1)
infrastructure/prometheus.yml (1)
infrastructure/monitors.py (1)
`create_prometheus_scraper` (lines 7-151)
🪛 YAMLlint (1.37.1)
infrastructure/stack.yml
[error] 138-138: no new line character at the end of file
(new-line-at-end-of-file)
🪛 GitHub Check: Run Python quality checks
infrastructure/__main__.py
[failure] 121-121: Ruff (E501)
infrastructure/__main__.py:121:89: E501 Line too long (97 > 88)
[failure] 120-120: Ruff (E501)
infrastructure/__main__.py:120:89: E501 Line too long (123 > 88)
[failure] 119-119: Ruff (E501)
infrastructure/__main__.py:119:89: E501 Line too long (106 > 88)
[failure] 117-130: Ruff (FLY002)
infrastructure/__main__.py:117:12: FLY002 Consider f-string instead of string join
[failure] 51-51: Ruff (ANN201)
infrastructure/__main__.py:51:5: ANN201 Missing return type annotation for public function mk_instance
[failure] 47-47: Ruff (E501)
infrastructure/__main__.py:47:89: E501 Line too long (131 > 88)
[failure] 40-40: Ruff (E501)
infrastructure/__main__.py:40:89: E501 Line too long (111 > 88)
[failure] 37-37: Ruff (E501)
infrastructure/__main__.py:37:89: E501 Line too long (197 > 88)
[failure] 33-33: Ruff (E501)
infrastructure/__main__.py:33:89: E501 Line too long (109 > 88)
[failure] 27-27: Ruff (E501)
infrastructure/__main__.py:27:89: E501 Line too long (92 > 88)
🪛 GitHub Actions: Code quality check
infrastructure/__main__.py
[error] 27-27: Ruff: E501 Line too long (92 > 88) in infrastructure/__main__.py:27.
🔇 Additional comments (8)
infrastructure/Pulumi.yaml (1)
1-10: Pulumi project metadata update looks good. Project rename, description, uv toolchain, and the pulumi:tags template are consistent and align with the migration.
infrastructure/pyproject.toml (1)
3-12: Dependencies and metadata updates look aligned with the new infra. Upgrading pulumi and the aws provider and adding pulumi-command and pulumi-tls matches the remote + key flow.
infrastructure/__main__.py (3)
143-155: Worker join sequencing and retries look solid. The wait for Docker readiness and explicit depends_on(get_worker_token) ensure proper ordering. The command is minimal and to the point.
11-16: Lightsail KeyPair handling LGTM. Generating a TLS key and injecting its public key into a Lightsail KeyPair is a clean, reproducible approach.
161-163: Exporting private key — verified: no unsafe pulumi stack output usages found. I checked the repository for calls that print the exported sshPrivateKeyPem without --show-secrets; no matches found.
Findings:
- infrastructure/__main__.py:163 — pulumi.export("sshPrivateKeyPem", pulumi.Output.secret(ssh_key.private_key_pem))
- infrastructure/main.nu:16–20 (line 18) — let pem = (^pulumi stack output --show-secrets sshPrivateKeyPem), then saves to swarm.pem and chmod 600 (explicitly uses --show-secrets)
- infrastructure/__main__.py:106–110 — the remote connection uses private_key=ssh_key.private_key_pem (expected; the key is kept secret in exports)
Note: infrastructure/.claude/settings.local.json contains a permissive "Bash(pulumi stack output:*)" permission, but there are no direct calls that would expose sshPrivateKeyPem without --show-secrets.
Recommendation: keep the export marked secret (as it is) and continue to avoid printing or storing it in world-readable locations.
infrastructure/prometheus.yml (1)
1-3: Confirm Pulumi/AMP scraper (monitors.py) usage and reconcile Prometheus intervals. I searched the repository for Pulumi/AMP scraper code and Kubernetes scrape configs. Only infrastructure/prometheus.yml sets 15s intervals; I did not find monitors.py, aws.amp.Scraper, or kubernetes_sd_configs.
Files to review:
- infrastructure/prometheus.yml — lines 1–3:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
```

If a Pulumi/AMP scraper (with 15m intervals) exists in another repo or deployment, confirm it's retired or align its intervals/targets with this static-targets config to avoid conflict.
infrastructure/stack.yml (2)
70-88: Prometheus/cAdvisor/node-exporter connectivity is good; version pins look sane. Prometheus is manager-scoped with a persistent volume and lifecycle enabled; node-exporter and cAdvisor are global with appropriate mounts and sensible memory limits.
Also applies to: 90-118
1-21: Ensure docker stack deploy reads the intended stack.yml (run from the infra dir or use an explicit path). Verified: infrastructure/stack.yml contains configs.prometheus.yml.file: ./prometheus.yml and infrastructure/prometheus.yml exists. infrastructure/main.nu calls:
- docker stack deploy -c stack.yml infrastructure --detach=false
Notes: the Compose spec resolves config file paths relative to the compose file’s directory (so ./prometheus.yml is correct when the intended stack.yml is the one read), but docker stack deploy can be sensitive to the working directory and there are historical version-specific bugs. Also: main.nu does not use $env.FILE_PWD — it relies on an unqualified stack.yml.
Action (choose one):
- Ensure callers run the deploy from the infrastructure directory (cd infrastructure) so -c stack.yml is the correct file (mise tasks already cd into infrastructure in several places).
- Or make the deploy invocation explicit/robust — e.g. deploy with an explicit path: docker stack deploy -c infrastructure/stack.yml ... or use an absolute path.
Files to review:
- infrastructure/stack.yml — configs.prometheus.yml.file: ./prometheus.yml
- infrastructure/prometheus.yml — present
- infrastructure/main.nu — launch-stacks: docker stack deploy -c stack.yml ...
- .mise.toml — tasks that cd into infrastructure
Actionable comments posted: 3
♻️ Duplicate comments (1)
infrastructure/stack.yml (1)
142-146: Do not commit the Grafana admin password; read it from a Swarm secret instead. The admin password is hardcoded in plaintext. You already declare a GRAFANA_ADMIN_PASSWORD secret at the top; wire it into the Grafana service and read it via the __FILE variant.
Apply this diff within the Grafana service:
```diff
 grafana:
   image: grafana/grafana:10.4.2
   environment:
     GF_SECURITY_ADMIN_USER: admin
-    # TODO: Change this to a random password
-    GF_SECURITY_ADMIN_PASSWORD: admin123
+    GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/GRAFANA_ADMIN_PASSWORD
+  secrets:
+    - GRAFANA_ADMIN_PASSWORD
```
🧹 Nitpick comments (6)
infrastructure/stack.yml (6)
11-15: traefik_letsencrypt volume is currently unused. Either remove it or mount it in Traefik for ACME storage as suggested above. Keeping unused volumes adds noise.
75-89: Avoid using the 'latest' tag for Portainer; pin to a specific version. Pinning ensures repeatable, predictable deployments and avoids surprise upgrades.
Example:
```diff
-image: portainer/portainer-ce:latest
+image: portainer/portainer-ce:2.21.5
```

If you prefer a different stable version, update accordingly.
126-139: cAdvisor may require additional privileges on modern Docker/cgroups. Depending on host OS/cgroups v2, cAdvisor might need extra mounts/caps and can report partial metrics. If you observe gaps, consider:
- adding device/proc/sys mounts recommended in cAdvisor docs,
- or switching to node_exporter + container-specific exporters.
Would you like a hardened cAdvisor service snippet tailored for your target OS?
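For reference, a sketch of a hardened variant — the mount list follows the upstream cAdvisor README, but verify it against your target OS (cgroups v2 paths can differ):

```yaml
# Sketch only; validate mounts and privileges on your actual hosts.
cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.47.2
  privileged: true
  devices:
    - /dev/kmsg
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker:/var/lib/docker:ro
    - /dev/disk:/dev/disk:ro
  deploy:
    mode: global
```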
22-43: Declared secrets are unused in this stack file. If these are for app services not included yet, ignore. Otherwise, remove unused secrets to reduce operational overhead and confusion.
67-73: Add resource reservations/limits for Traefik. You've set resources on other services; do the same for Traefik to improve scheduling and avoid noisy-neighbor issues.
Example:
```diff
 deploy:
   placement:
     constraints: [node.role == manager]
   update_config:
     parallelism: 1
     order: start-first
+  resources:
+    reservations: { memory: 128M }
+    limits: { memory: 512M }
```
140-159: Add a Grafana healthcheck for safer rollouts. A healthcheck helps Swarm detect readiness and can make updates safer alongside start-first on Traefik.
Apply this diff:
```diff
 grafana:
   image: grafana/grafana:10.4.2
+  healthcheck:
+    test: ["CMD-SHELL", "wget -qO- http://localhost:3000/api/health | grep -q 'ok'"]
+    interval: 30s
+    timeout: 5s
+    retries: 5
+    start_period: 30s
```
📒 Files selected for processing (1)
- `infrastructure/stack.yml` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run Python tests
🔇 Additional comments (1)
infrastructure/stack.yml (1)
111-124: Verify bind propagation rslave support in Swarm for node-exporter. Bind propagation flags can behave differently in Swarm. If deployment errors or metrics are missing, consider dropping rslave or following the official node-exporter-on-Swarm recipe.
Actionable comments posted: 2
♻️ Duplicate comments (2)
infrastructure/stack.yml (2)
46-67: Enable HTTPS (ACME) and persist certs; also expose metrics on a dedicated entrypoint. You declare `traefik_letsencrypt` but don't mount it or enable ACME, so HTTPS on 443 won't have valid certs. Also, Prometheus metrics are enabled without an entrypoint, and Traefik isn't on the internal network for Prometheus to reach it. Apply this diff to add ACME, HTTP→HTTPS redirection, a metrics entrypoint, the volume mount, and the internal network:
```diff
 command:
   - --providers.docker=true
   - --providers.docker.endpoint=unix:///var/run/docker.sock
   - --providers.docker.swarmMode=true
   - --providers.docker.exposedByDefault=false
   - --entrypoints.web.address=:80
   - --entrypoints.websecure.address=:443
+  - --entrypoints.web.http.redirections.entryPoint.to=websecure
+  - --entrypoints.web.http.redirections.entryPoint.scheme=https
   - --api.dashboard=true
   - --metrics.prometheus=true
   - --metrics.prometheus.addEntryPointsLabels=true
   - --metrics.prometheus.addServicesLabels=true
+  - --entrypoints.metrics.address=:8080
+  - --metrics.prometheus.entryPoint=metrics
+  # ACME/Let's Encrypt
+  - --certificatesresolvers.le.acme.email=${ACME_EMAIL}
+  - --certificatesresolvers.le.acme.storage=/letsencrypt/acme.json
+  - --certificatesresolvers.le.acme.httpchallenge=true
+  - --certificatesresolvers.le.acme.httpchallenge.entrypoint=web
 ports:
 @@
 volumes:
   - /var/run/docker.sock:/var/run/docker.sock:ro
+  - traefik_letsencrypt:/letsencrypt
 @@
-networks: [public]
+networks: [public, internal]
```

Ensure ACME_EMAIL is supplied via .env or the environment at deploy time.
Also applies to: 73-73
140-159: Don't pass the Grafana admin password via env; use a Swarm secret and serve over HTTPS. Environment variables are exposed in container metadata; you already declared `GRAFANA_ADMIN_PASSWORD` as an external secret but don't use it. Also, route Grafana via TLS with the ACME resolver added above. Apply this diff:
```diff
 grafana:
   image: grafana/grafana:10.4.2
   environment:
     GF_SECURITY_ADMIN_USER: admin
-    # TODO: Change this to a random password
-    GF_SECURITY_ADMIN_PASSWORD: ${GF_SECURITY_ADMIN_PASSWORD}
+    GF_SECURITY_ADMIN_PASSWORD__FILE: /run/secrets/GRAFANA_ADMIN_PASSWORD
+    # Optional but recommended so links use the correct scheme/host:
+    # GF_SERVER_ROOT_URL: https://grafana.example.com
   volumes:
     - grafana_data:/var/lib/grafana
+  secrets:
+    - GRAFANA_ADMIN_PASSWORD
   deploy:
 @@
   labels:
     - traefik.enable=true
     - traefik.http.routers.grafana.rule=Host(`grafana.example.com`)
-    - traefik.http.routers.grafana.entrypoints=web
+    - traefik.http.routers.grafana.entrypoints=websecure
+    - traefik.http.routers.grafana.tls=true
+    - traefik.http.routers.grafana.tls.certresolver=le
+    - traefik.docker.network=public
     - traefik.http.services.grafana.loadbalancer.server.port=3000
```

Make sure the external secret `GRAFANA_ADMIN_PASSWORD` exists before deploying (`docker secret ls`).
🧹 Nitpick comments (2)
infrastructure/stack.yml (2)
22-43: Many external secrets are declared but unused in this stack. If they belong to application services deployed in a separate stack, consider removing them here to reduce operational drift and confusion.
Alternatively, add a short README comment pointing to where each secret is used to keep infra clear.
126-139: Confirm the Prometheus cAdvisor scrape job. We've verified that `infrastructure/prometheus.yml` includes:

```yaml
- job_name: cadvisor
  static_configs:
    - targets: ['tasks.cadvisor:8080']
```

Your cAdvisor service in `infrastructure/stack.yml` is currently running unprivileged. Without elevated privileges (or additional device mounts), some container and cgroup metrics may be missing. If you observe incomplete data, consider:

• File: infrastructure/stack.yml — under the `cadvisor` service, add:

```diff
 cadvisor:
   image: gcr.io/cadvisor/cadvisor:v0.47.2
+  privileged: true
```

Alternatively, leave it unprivileged and first validate that you're receiving the full set of metrics in your environment.
📒 Files selected for processing (1)
- `infrastructure/stack.yml` (1 hunks)
🔇 Additional comments (3)
infrastructure/stack.yml (3)
90-109: Prometheus service looks solid. Good use of config mounts, TSDB retention, lifecycle, and resource limits. Network isolation to internal is appropriate.
111-125: Prometheus scrape configuration for node-exporter is present and correct. The `infrastructure/prometheus.yml` file already defines a `node-exporter` job with:

```yaml
- static_configs:
    - targets: ['tasks.node-exporter:9100']
```

This matches the Docker Swarm service on the `internal` network, so Prometheus will discover and scrape the node-exporter tasks as intended. No further changes required.
75-83: Pin Portainer to a stable, explicit tag. The latest semantic-versioned Portainer CE release is `2.32.0`. Replace the floating `latest` tag to ensure reproducible deployments and intentional upgrades.

• File: infrastructure/stack.yml
• Lines: 75–83

```diff
 portainer:
-  image: portainer/portainer-ce:latest
+  image: portainer/portainer-ce:2.32.0
   ports:
     - "9000:9000"
     - "9443:9443"
```
Force-pushed fe7c1d0 to dbfa692.
Actionable comments posted: 0
♻️ Duplicate comments (4)
infrastructure/Pulumi.yaml (1)
1-1: Ensure automation uses the intended fully-qualified stack names. Confirm CI and local tooling use `pocketsizefund/pocketsizefund/<env>` to match this project name. Quick check:
```shell
#!/bin/bash
rg -n -C2 -P 'pulumi\s+(up|down|stack\s+(select|output)).*(pocketsizefund/pocketsizefund/[^"'\''\s]+)|PULUMI_STACK'
```

infrastructure/__main__.py (3)
10-10: SSH open to 0.0.0.0/0 by default is unsafe; require explicit allowed CIDRs. Defaulting SSH to world-accessible is a production-grade risk. Make it a required config or fail fast if the default would be 0.0.0.0/0.
Apply this diff:
```diff
-allowed_ssh_cidrs = pulumi.Config().get_object("allowedSshCidrs") or ["0.0.0.0/0"]
+cfg = pulumi.Config()
+allowed_ssh_cidrs = cfg.get_object("allowedSshCidrs")
+if not allowed_ssh_cidrs or "0.0.0.0/0" in allowed_ssh_cidrs:
+    raise pulumi.RunError(
+        'Refusing to open SSH (22/tcp) to 0.0.0.0/0. '
+        'Set "allowedSshCidrs" to trusted CIDR(s), e.g.: '
+        "pulumi config set allowedSshCidrs '[\"203.0.113.42/32\"]'"
+    )
```
21-61: Unblock lints and improve maintainability by moving cloud-init to a file. The inline cloud-init causes Ruff E501 failures and is hard to maintain. Read it from `infrastructure/cloud-init.yml` instead. Apply this diff:
```diff
+from pathlib import Path
 @@
-cloud_init = """#cloud-config
-package_update: true
-package_upgrade: true
-write_files:
-  - path: /usr/local/bin/install-docker.sh
-    permissions: "0755"
-    content: |
-      #!/usr/bin/env bash
-      set -euo pipefail
-      retry() {
-        for i in {1..10}; do "$@" && break || { sleep 3; echo "retry $i"; }; done
-      }
-
-      retry apt-get update
-      retry apt-get install -y ca-certificates curl gnupg
-
-      install -m 0755 -d /etc/apt/keyrings
-      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
-        gpg --dearmor -o /etc/apt/keyrings/docker.gpg
-      chmod a+r /etc/apt/keyrings/docker.gpg
-
-      . /etc/os-release
-      echo "deb [arch=$(dpkg --print-architecture) \
-        signed-by=/etc/apt/keyrings/docker.gpg] \
-        https://download.docker.com/linux/ubuntu ${VERSION_CODENAME} stable" > \
-        /etc/apt/sources.list.d/docker.list
-
-      retry apt-get update
-      retry apt-get install -y docker-ce docker-ce-cli containerd.io \
-        docker-buildx-plugin docker-compose-plugin
-
-      usermod -aG docker ubuntu || true
-      systemctl enable --now docker
-
-runcmd:
-  - /usr/local/bin/install-docker.sh
-  - bash -lc 'for i in {1..60}; do \
-      [ -x /usr/bin/docker ] && systemctl is-active --quiet docker && \
-      exit 0 || sleep 2; done; exit 1'
-"""
+cloud_init = Path(__file__).with_name("cloud-init.yml").read_text()
```
infrastructure/cloud-init.yml(same repo path).
129-165: Broken quoting and PUBIP scope: swarm init will not see $PUBIP; simplify to one bash -lc. Each `bash -lc '...'` runs in a separate shell, so `PUBIP` set in one won't be visible in the next. Also, the trailing `\'` leaves an unmatched single quote. This will lead to init failures. Replace the multi-fragment command with a single, readable script:
```diff
-    wait_for_docker = (
-        "bash -lc 'for i in {1..120}; do sudo docker info >/dev/null 2>&1 && "
-        "break || sleep 3; done'"
-    )
-    get_public_ip = (
-        "bash -lc 'PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/"
-        "public-ipv4); echo Using-Public-IP:$PUBIP'"
-    )
-    init_swarm = (
-        "bash -lc 'STATE=$(sudo docker info --format "
-        '"{{.Swarm.LocalNodeState}}") || true; '
-        'CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true; '
-        '[ "$STATE" = active -a "$CTRL" = true ] || '
-        "(sudo docker swarm leave --force || true; "
-        ' sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377")\''
-    )
-    wait_for_swarm = (
-        "bash -lc 'for i in {1..30}; do sudo ss -ltn | "
-        'awk "\\$4 ~ /:2377$/" && break || sleep 2; done\''
-    )
-    save_worker_token = (
-        "bash -lc 'sudo docker swarm join-token -q worker | "
-        "sudo tee /home/ubuntu/worker.token >/dev/null'"
-    )
-    save_manager_token = (
-        "bash -lc 'sudo docker swarm join-token -q manager | "
-        "sudo tee /home/ubuntu/manager.token >/dev/null'"
-    )
-
-    init_manager = remote.Command(
-        "init-manager",
-        connection=conn(mgr_ip.ip_address),
-        create=(
-            f"{wait_for_docker} && {get_public_ip} && {init_swarm} && "
-            f"{wait_for_swarm} && {save_worker_token} && {save_manager_token}"
-        ),
-    )
+init_manager = remote.Command(
+    "init-manager",
+    connection=conn(mgr_ip.ip_address),
+    create="""bash -lc '
+set -e
+# Wait for Docker daemon
+for i in {1..120}; do sudo docker info >/dev/null 2>&1 && break || sleep 3; done
+# Discover public IP (Lightsail/EC2 IMDS)
+PUBIP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
+echo Using-Public-IP:"$PUBIP"
+# Initialize swarm if not active/controlling
+STATE=$(sudo docker info --format "{{.Swarm.LocalNodeState}}") || true
+CTRL=$(sudo docker info --format "{{.Swarm.ControlAvailable}}") || true
+if [ "$STATE" != "active" ] || [ "$CTRL" != "true" ]; then
+  sudo docker swarm leave --force || true
+  sudo docker swarm init --advertise-addr "$PUBIP" --listen-addr "0.0.0.0:2377"
+fi
+# Wait for swarm port to be listening
+for i in {1..30}; do sudo ss -ltn | awk '"'"'$4 ~ /:2377$/'"'"' && break || sleep 2; done
+# Persist tokens for later retrieval
+sudo docker swarm join-token -q worker | sudo tee /home/ubuntu/worker.token >/dev/null
+sudo docker swarm join-token -q manager | sudo tee /home/ubuntu/manager.token >/dev/null
+'""",
+)
```

Note the use of
'"'"'to embed single quotes safely inside the single-quoted heredoc.
🧹 Nitpick comments (17)
.claude/settings.local.json (4)
20-20: Restrict flox activation to non-interactive forms to avoid hangs.
`Bash(flox activate:*)` may spawn an interactive shell and hang CI/editor sessions. Prefer non-interactive commands (e.g., `flox run` or activate with an explicit `--command`), and narrow the pattern to exactly what the workflows require.

If feasible, narrow the permission:
- "Bash(flox activate:*)" + "Bash(flox run:*)" + "Bash(flox activate --command:*)"Please confirm the exact flox invocation your Nushell/Pulumi orchestration needs so we can scope this precisely.
18-18: Addition of `gh issue close` aligns with existing gh allowances; confirm repo scope.

Granting `Bash(gh issue close:*)` matches the existing create/view capabilities. Just ensure these commands run only in the intended repository context to avoid accidental closes elsewhere.

If you want to be extra cautious, consider documenting this permission in CONTRIBUTING.md and limiting usage to automation that references the current repo explicitly (e.g.,
--repo pocketsizefund/pocketsizefund).
1-3: Confirm that committing a “local” settings file is intentional.

The filename suggests developer-local overrides. If this is meant to be user-specific, consider adding it to .gitignore and using a team-wide non-local settings file for shared permissions. If team-wide, the name may cause confusion.
I can propose a follow-up PR to standardize naming (e.g.,
.claude/settings.team.json) and update ignore rules if desired.
19-19: Scope or whitelist WebSearch instead of enabling it globally

The root settings file includes `WebSearch`, but the infra-specific config does not. Granting global WebSearch enables broad outbound access across all assistant workflows. If you only need external lookups for specific docs or infra tasks, consider narrowing the scope:

- File to update: `.claude/settings.local.json` (entry `"WebSearch"`)
- Options:
  - Replace with `WebFetch(domain:…)` for known domains
  - Move `WebSearch` into `infrastructure/.claude/settings.local.json` or another narrow config
  - Remove the global entry entirely

Suggested diff:

```diff
   "Bash(gh issue close:*)",
-  "WebSearch",
   "Bash(flox activate:*)"
```

Dockerfile.tests (2)
3-3: Pin uv image tag/digest for reproducible builds

Using `ghcr.io/astral-sh/uv:latest` can introduce non-determinism as the tag moves. Pin a version/tag or digest.
Apply:
```diff
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.5.6 /uv /bin/uv
```

(Replace 0.5.6 with the exact version you standardize on.)
14-14: Question: ENV PYTHON=1 appears unused

If not consumed by tooling/tests, drop it to keep the image environment minimal.
pyproject.toml (2)
123-123: Broad Bandit ignore (S105) on `infrastructure/__main__.py`; prefer narrowing

S105 flags hardcoded secrets. A file-wide ignore can mask real issues. Prefer inline `# noqa: S105` on specific literals or a targeted per-file rule with context (e.g., only test fake creds).

Example narrowing:
```toml
[tool.ruff.lint.per-file-ignores]
"infrastructure/__main__.py" = ["S105"]  # TODO: replace with line-level noqa after refactor
```

Or remove this and add inline `# noqa: S105` with a rationale where necessary.
5-5: Exact Python pin is consistent; consider minor-range pin if patch bumps are safe
`requires-python = "==3.12.10"` ensures strict alignment (good). If you want to pick up security patch releases without editing config, consider `~=3.12`.

application/datamanager/Dockerfile.test (1)
2-2: Pin uv image for reproducibility

Same concern as in Dockerfile.tests: avoid :latest for deterministic builds.
```diff
-COPY --from=ghcr.io/astral-sh/uv:latest /uv /bin/uv
+COPY --from=ghcr.io/astral-sh/uv:0.5.6 /uv /bin/uv
```

infrastructure/prometheus.yml (1)
10-13: Consider DNS SD for Traefik to scrape all replicas

If Traefik is scaled >1, `traefik:8080` hits the VIP and may not scrape every task each interval. Prefer `dns_sd_configs` with `tasks.traefik`.

```diff
-  - job_name: traefik
-    metrics_path: /metrics
-    static_configs:
-      - targets: ['traefik:8080']
+  - job_name: traefik
+    metrics_path: /metrics
+    dns_sd_configs:
+      - names: ['tasks.traefik']
+        type: A
+        port: 8080
```

Also ensure Traefik metrics are enabled in the stack (prometheus provider and /metrics exposed).
infrastructure/README.md (4)
19-21: Avoid brittle references to line numbers; fix MD034 bare URL and clarify where to edit Grafana domain.

Referencing “stack.yml line 162” will rot quickly. Point to the Traefik router label instead and wrap the URL to satisfy markdownlint.
Apply this diff:
```diff
-- **URL**: https://grafana.example.com (update the domain in `stack.yml` line 162)
+- **URL**: <https://grafana.example.com> (update the Traefik router Host() rule in `stack.yml`; search for `traefik.http.routers.grafana.rule`)
```
28-31: Fix MD034 bare URL warning for Prometheus.

Wrap the URL in angle brackets to silence markdownlint and improve consistency.
```diff
-- **Internal URL**: http://prometheus:9090 (only accessible from within the swarm)
+- **Internal URL**: <http://prometheus:9090> (only accessible from within the swarm)
```
43-72: Align docs with actual Prometheus config (static vs DNS SD).

This section labels static targets as a “Known Issue” and recommends DNS service discovery. Per the PR summary and file comments, `infrastructure/prometheus.yml` already uses `dns_sd_configs`. If that’s true, update this section to describe the current state (DNS SD), or remove the “Known Issue” framing.

Proposed wording if DNS SD is already in place:
```diff
-### Current Issue: Prometheus Service Discovery
-
-⚠️ **Known Issue**: The current Prometheus configuration uses static targets that only scrape one replica per service:
-...
-**Problem**: This resolves to only one IP address, missing metrics from other swarm nodes.
-
-**Solution**: Switch to DNS service discovery to scrape all replicas:
+### Prometheus Service Discovery
+
+The Prometheus configuration uses DNS service discovery to scrape all replicas of global services:
 ...
-# Recommended configuration
+# Active configuration
   - job_name: node-exporter
     dns_sd_configs:
       - names: ['tasks.node-exporter']
         type: A
         port: 9100
 ...
```
115-132: Good security guidance; consider making SSH CIDR restriction a hard requirement.

Docs are solid. Given the infra is “production,” consider matching the code to this guidance by refusing to deploy with SSH open to 0.0.0.0/0 unless explicitly overridden.
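For instance, a deploy-time guard could fail fast when the configured SSH CIDRs include the whole internet. This is a minimal sketch under stated assumptions: the function name, the override flag, and the config key mentioned in the error message are hypothetical, not part of the current `__main__.py`.

```python
def validate_ssh_cidrs(cidrs: list[str], allow_open_ssh: bool = False) -> list[str]:
    """Refuse world-open SSH unless explicitly overridden.

    Intended to run on the value read from Pulumi config, before the
    Lightsail public-port rules are created.
    """
    open_cidrs = [c for c in cidrs if c in ("0.0.0.0/0", "::/0")]
    if open_cidrs and not allow_open_ssh:
        raise ValueError(
            f"SSH would be open to the world via {open_cidrs}; "
            "restrict the allowed SSH CIDRs or set an explicit override"
        )
    return cidrs
```

Called on the config value before `mk_instance`, this turns the README guidance into an enforced invariant instead of a recommendation.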
infrastructure/__main__.py (3)
78-103: Narrow exposed ports to the minimum per node role.

80/443 are opened on every instance (manager and workers). If Traefik only runs on manager(s), you can reduce attack surface by not opening 80/443 on worker nodes.
Option A (parameterize):
```diff
-def mk_instance(
-    name: str, bundle_id: str
-) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:
+def mk_instance(
+    name: str, bundle_id: str, expose_web: bool = False
+) -> tuple[aws.lightsail.Instance, aws.lightsail.StaticIp]:
@@
-    aws.lightsail.InstancePublicPorts(
+    web_ports = []
+    if expose_web:
+        web_ports = [
+            aws.lightsail.InstancePublicPortsPortInfoArgs(
+                from_port=80, to_port=80, protocol="tcp", cidrs=["0.0.0.0/0"]
+            ),
+            aws.lightsail.InstancePublicPortsPortInfoArgs(
+                from_port=443, to_port=443, protocol="tcp", cidrs=["0.0.0.0/0"]
+            ),
+        ]
+    aws.lightsail.InstancePublicPorts(
         f"{name}-ports",
         instance_name=inst.name,
-        port_infos=[
+        port_infos=[
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=22, to_port=22, protocol="tcp", cidrs=allowed_ssh_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=2377, to_port=2377, protocol="tcp", cidrs=swarm_manager_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=7946, to_port=7946, protocol="tcp", cidrs=swarm_cluster_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=7946, to_port=7946, protocol="udp", cidrs=swarm_cluster_allowed_cidrs
             ),
             aws.lightsail.InstancePublicPortsPortInfoArgs(
                 from_port=4789, to_port=4789, protocol="udp", cidrs=swarm_cluster_allowed_cidrs
             ),
-            aws.lightsail.InstancePublicPortsPortInfoArgs(
-                from_port=80, to_port=80, protocol="tcp", cidrs=["0.0.0.0/0"]
-            ),
-            aws.lightsail.InstancePublicPortsPortInfoArgs(
-                from_port=443, to_port=443, protocol="tcp", cidrs=["0.0.0.0/0"]
-            ),
-        ],
+        ] + web_ports,
     )
```

Then call with `expose_web=True` for the manager only.
168-175: Token handling looks fine; minor improvement: trim at source.

Current code trims via `apply(lambda s: s.strip())`. You can trim server-side to avoid accidental trailing newlines earlier, but this is non-blocking.
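For example, piping the token through `tr -d "\n"` on the instance strips the trailing newline before `tee` writes the file, so no client-side `.strip()` is needed when reading it back. The command shape below is assumed from the original string fragments, not copied from the final code:

```python
# Hypothetical reconstruction of the token-saving command; tr -d "\n"
# removes the trailing newline server-side before the file is written.
save_worker_token = (
    "bash -lc 'sudo docker swarm join-token -q worker | tr -d \"\\n\" | "
    "sudo tee /home/ubuntu/worker.token >/dev/null'"
)
```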
196-198: Exporting the private SSH key increases blast radius; prefer local file or external secret.

Even as a Pulumi secret, exporting `sshPrivateKeyPem` propagates it into stack state and any automation outputs. Prefer one of:
- Generate the key locally and pass the public key via config; keep the private key out of state.
- Use pulumi-command local to write the key to a local file and don’t export it.
If you want, I can provide a small helper that writes `ssh_key.private_key_pem` to `infrastructure/swarm.pem` locally (ignored by .gitignore) and removes the export.
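Such a helper could look like the stdlib-only sketch below, meant to be invoked from `ssh_key.private_key_pem.apply(...)`. The function name and default path are assumptions, not existing code:

```python
import stat
from pathlib import Path


def write_private_key(pem_text: str, path: str = "infrastructure/swarm.pem") -> str:
    """Write the PEM to a local file with owner-only permissions (0600).

    Keeping the key out of pulumi.export means it never lands in stack
    outputs; remember to add the path to .gitignore.
    """
    key_file = Path(path)
    key_file.parent.mkdir(parents=True, exist_ok=True)
    key_file.write_text(pem_text)
    # ssh refuses group/world-readable private keys
    key_file.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return str(key_file)
```

Wired up as `ssh_key.private_key_pem.apply(write_private_key)`, the `sshPrivateKeyPem` export line can then be dropped entirely.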
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (1)
`.flox/env/manifest.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (15)
- `.claude/commands/infrastructure-up.md` (1 hunks)
- `.claude/commands/lint.md` (1 hunks)
- `.claude/commands/pytests.md` (1 hunks)
- `.claude/settings.local.json` (1 hunks)
- `.flox/env/manifest.toml` (1 hunks)
- `Dockerfile.tests` (1 hunks)
- `application/datamanager/Dockerfile.test` (1 hunks)
- `infrastructure/Pulumi.yaml` (1 hunks)
- `infrastructure/README.md` (1 hunks)
- `infrastructure/__main__.py` (1 hunks)
- `infrastructure/main.nu` (1 hunks)
- `infrastructure/prometheus.yml` (1 hunks)
- `infrastructure/requirements.txt` (1 hunks)
- `infrastructure/stack.yml` (1 hunks)
- `pyproject.toml` (1 hunks)
✅ Files skipped from review due to trivial changes (3)
- .claude/commands/lint.md
- .claude/commands/infrastructure-up.md
- .claude/commands/pytests.md
🚧 Files skipped from review as they are similar to previous changes (4)
- .flox/env/manifest.toml
- infrastructure/requirements.txt
- infrastructure/main.nu
- infrastructure/stack.yml
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-15T13:54:49.196Z
Learnt from: forstmeier
PR: pocketsizefund/pocketsizefund#613
File: infrastructure/Pulumi.yaml:1-10
Timestamp: 2025-08-15T13:54:49.196Z
Learning: The user forstmeier prefers Pulumi stack naming convention as `pocketsizefund/pocketsizefund/production` to maintain organization/repository/environment mapping pattern for the pocketsizefund project.
Applied to files:
infrastructure/Pulumi.yaml
🧬 Code Graph Analysis (1)
infrastructure/prometheus.yml (1)
infrastructure/monitors.py (1)
`create_prometheus_scraper` (7-151)
🪛 LanguageTool
infrastructure/README.md
[grammar] ~9-~9: There might be a mistake here.
Context: ...y and load balancer with automatic HTTPS - Portainer: Docker management UI - **Pr...
(QB_NEW_EN)
[grammar] ~10-~10: There might be a mistake here.
Context: ...PS - Portainer: Docker management UI - Prometheus: Metrics collection and mon...
(QB_NEW_EN)
[grammar] ~11-~11: There might be a mistake here.
Context: ...eus**: Metrics collection and monitoring - Node Exporter: Host-level metrics coll...
(QB_NEW_EN)
[grammar] ~12-~12: There might be a mistake here.
Context: ...l metrics collection (runs on all nodes) - cAdvisor: Container metrics collection...
(QB_NEW_EN)
[grammar] ~13-~13: There might be a mistake here.
Context: ...r metrics collection (runs on all nodes) - Grafana: Metrics visualization and das...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...ccessing Services ### Grafana Dashboard - URL: https://grafana.example.com (upda...
(QB_NEW_EN)
[grammar] ~19-~19: There might be a mistake here.
Context: ...date the domain in stack.yml line 162) - Username: admin - Password: Set ...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: ...n stack.yml line 162) - Username: admin - Password: Set via `GRAFANA_ADMIN_PASSW...
(QB_NEW_EN)
[grammar] ~21-~21: There might be a mistake here.
Context: ... Set via GRAFANA_ADMIN_PASSWORD secret - TLS: Automatic HTTPS certificates via ...
(QB_NEW_EN)
[grammar] ~24-~24: There might be a mistake here.
Context: ... Let's Encrypt ### Portainer Management - URL: https://[manager-node-ip]:9443 - ...
(QB_NEW_EN)
[grammar] ~25-~25: There might be a mistake here.
Context: ... URL: https://[manager-node-ip]:9443 - Port: 9000 (HTTP) or 9443 (HTTPS) ###...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...0 (HTTP) or 9443 (HTTPS) ### Prometheus - Internal URL: http://prometheus:9090 (...
(QB_NEW_EN)
[grammar] ~29-~29: There might be a mistake here.
Context: ... (only accessible from within the swarm) - Port: 9090 ## Network Architecture T...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...s (Traefik, Portainer, Grafana frontend) - internal: Internal monitoring communic...
(QB_NEW_EN)
[grammar] ~101-~101: There might be a mistake here.
Context: ...tack.yml: Docker Swarm stack definition - prometheus.yml: Prometheus scraping configuration - m...
(QB_NEW_EN)
[grammar] ~102-~102: There might be a mistake here.
Context: ....yml: Prometheus scraping configuration - main.nu: Infrastructure management scripts - p...
(QB_NEW_EN)
[grammar] ~103-~103: There might be a mistake here.
Context: ...n.nu: Infrastructure management scripts - ping.nu: Service health checks - upload_grafan...
(QB_NEW_EN)
[grammar] ~104-~104: There might be a mistake here.
Context: ...ripts - ping.nu: Service health checks - upload_grafana_dashboard.nu: Dashboard deployment script ## Enviro...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: ...32"]' ``` Security Recommendations: - Always restrict SSH access to known IP r...
(QB_NEW_EN)
[grammar] ~148-~148: There might be a mistake here.
Context: .../32"]' ``` Critical Security Notes: - Never expose port 2377 to 0.0.0.0/0 - ...
(QB_NEW_EN)
[grammar] ~169-~169: There might be a mistake here.
Context: ...'["10.0.0.0/8"]' ``` Ports secured: - Port 7946 (TCP/UDP): Inter-node commun...
(QB_NEW_EN)
[grammar] ~170-~170: There might be a mistake here.
Context: ...cation for cluster membership and gossip - Port 4789 (UDP): Overlay network traff...
(QB_NEW_EN)
[grammar] ~173-~173: There might be a mistake here.
Context: ...ffic (VXLAN) Security Requirements: - These ports should never be exposed ...
(QB_NEW_EN)
[grammar] ~183-~183: There might be a mistake here.
Context: ...N_PASSWORD`: Grafana admin user password - Application-specific secrets (Alpaca, Po...
(QB_NEW_EN)
[grammar] ~197-~197: There might be a mistake here.
Context: ...sted in traefik_letsencrypt volume at /letsencrypt/acme.json - Services using TLS: Configure Traefik ...
(QB_NEW_EN)
[grammar] ~206-~206: There might be a mistake here.
Context: ...a_data: Grafana dashboards and settings - prom_data`: Prometheus time-series data (15-day re...
(QB_NEW_EN)
[grammar] ~207-~207: There might be a mistake here.
Context: ...heus time-series data (15-day retention) - portainer_data: Portainer configuration - `traefik_let...
(QB_NEW_EN)
[grammar] ~208-~208: There might be a mistake here.
Context: ...portainer_data: Portainer configuration - traefik_letsencrypt`: SSL certificates ## Troubleshooting ...
(QB_NEW_EN)
[grammar] ~213-~213: There might be a mistake here.
Context: ...oubleshooting ### Service Access Issues 1. Verify swarm node connectivity: `docker ...
(QB_NEW_EN)
[grammar] ~214-~214: There might be a mistake here.
Context: ...sues 1. Verify swarm node connectivity: docker node ls 2. Check service status: `docker service ls...
(QB_NEW_EN)
[grammar] ~215-~215: There might be a mistake here.
Context: ...ocker node ls2. Check service status:docker service ls3. Inspect service logs:docker service lo...
(QB_NEW_EN)
[grammar] ~218-~218: There might be a mistake here.
Context: ...s ` ### Monitoring Issues 1. Check Prometheus targets: Visit Promethe...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
infrastructure/README.md
19-19: Bare URL used
(MD034, no-bare-urls)
29-29: Bare URL used
(MD034, no-bare-urls)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Run Python tests
🔇 Additional comments (7)
.claude/settings.local.json (1)
1-25: LGTM on JSON validity and placement within allow list.

The structure is valid JSON, integrates cleanly with existing entries, and preserves deny as empty.
Dockerfile.tests (1)
1-1: Python versions are aligned; verify CI pinning
All `pyproject.toml` files pin `requires-python = "==3.12.10"` and every Dockerfile uses `FROM python:3.12.10` (one test image uses `3.12.10-slim`, which still matches the patch).

• workflows/pyproject.toml, cli/pyproject.toml, pyproject.toml, application/.../pyproject.toml, infrastructure/pyproject.toml
• Dockerfile.tests, application/predictionengine/Dockerfile, application/datamanager/Dockerfile, application/datamanager/Dockerfile.test, application/positionmanager/Dockerfile

Please confirm your CI workflows (e.g. in `.github/workflows/*.yml`) also pin `python-version: 3.12.10` to avoid environment drift.

application/datamanager/Dockerfile.test (1)
1-1: Good: Aligns to Python 3.12.10-slim

Slim base reduces image size and matches repo-wide versioning.
infrastructure/Pulumi.yaml (1)
2-10: Pulumi project metadata and uv toolchain config LGTM
`name: pocketsizefund` aligns with the preferred stack naming pattern and the uv toolchain block is correct for Python projects.

infrastructure/prometheus.yml (3)
6-8: Prometheus self-scrape target OK

Static target is fine for a single Prometheus instance.

15-20: Good use of Swarm DNS SD for node-exporter

This will enumerate all global tasks; no VIP skew.

21-25: Good use of Swarm DNS SD for cAdvisor

Covers all replicas across nodes.
forstmeier
left a comment
There was a problem hiding this comment.
Just comments/questions but nothing that holds up merging. Merge whenever.
Why is a requirements.txt file required?
I hate having a README doc but we can move it into code later.
dbfa692 to 4b9f516 (Compare)


Summary by CodeRabbit
New Features
Refactor
Chores