
Codex/phase2 platform hardening #9

Merged
pahuldeepp merged 7 commits into master from codex/phase2-platform-hardening on Mar 28, 2026

Conversation

@pahuldeepp (Owner) commented Mar 28, 2026

Summary by CodeRabbit

  • New Features

    • Added PostgreSQL and Redis backup/restore runbooks and CLI scripts for local data protection and recovery.
    • Added per-service NetworkPolicies and configurable PodDisruptionBudgets for improved cluster safety.
  • Infrastructure & Monitoring

    • Revised monitoring stack: new scrape targets, dashboard panels, datasource provisioning with stable IDs, and collector batching/health endpoint.
    • Exposed a local health port for the collector.
  • Chores

    • Reorganized deployment targets and Argo CD project/role configuration.
    • Bumped OpenTelemetry dependency versions.

@coderabbitai Bot commented Mar 28, 2026

📝 Walkthrough

This PR updates Argo CD project wiring, adds an AppProject and RBAC roles, introduces Helm NetworkPolicy/PDB templates and values, adds PostgreSQL/Redis backup+restore scripts and runbooks, adjusts observability (Prometheus, OTEL, Grafana), tweaks docker-compose and Vite port, and bumps OpenTelemetry Go deps.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Argo CD root manifests**<br>`argocd/app-of-apps.yaml` | Renamed application, switched project to `grainguard`, and updated repository URL to https://github.com/pahuldeepp/GrainGuard-.git. |
| **Argo CD environment applications**<br>`argocd/environments/dev.yaml`, `argocd/environments/prod.yaml` | Pointed Applications to project `grainguard`, updated repo URL, changed chart path to `k8s/helm/grainguard`, and renamed values files (`values.*.yaml` → `values-*.yaml`). |
| **Argo CD project manifest (argocd + k8s)**<br>`argocd/project.yaml`, `k8s/argocd/project.yaml` | Added/updated AppProject to allow the new repo and configured destinations, resource whitelists, and project roles/policies. |
| **k8s Argo CD apps & install flow**<br>`k8s/argocd/app-of-apps.yaml`, `k8s/argocd/apps/grainguard-dev.yaml`, `k8s/argocd/apps/grainguard-prod.yaml`, `k8s/argocd/install.sh` | Updated Applications to use the `grainguard` project; the install script now applies the project manifest before the App-of-Apps. |
| **Helm templates & values**<br>`k8s/helm/grainguard/templates/deployment.yaml`, `k8s/helm/grainguard/templates/networkpolicy.yaml`, `k8s/helm/grainguard/templates/pdb.yaml`, `k8s/helm/grainguard/values.yaml` | Added per-service RollingUpdate strategy, a new NetworkPolicy template (conditionally rendered per service), configurable PDB rendering and `minAvailable`, and corresponding values enabling PDBs and network policies for many services. |
| **Backup & restore scripts**<br>`infra/scripts/backup-postgres.sh`, `infra/scripts/restore-postgres.sh`, `infra/scripts/backup-redis.sh`, `infra/scripts/restore-redis.sh` | Added timestamped PostgreSQL and Redis backup and restore scripts with validation, metadata files, container operations, and readiness checks. |
| **Runbooks / docs**<br>`docs/runbooks/postgres-backup-restore.md`, `docs/runbooks/redis-backup-restore.md` | New operational runbooks documenting end-to-end backup, restore, verification, and escalation procedures for Postgres and Redis. |
| **Observability: OTEL / Prometheus / Grafana**<br>`infra/docker/otel-collector-config.yaml`, `infra/docker/prometheus.yml`, `infra/docker/grafana/provisioning/datasources/datasources.yaml`, `infra/docker/grafana/provisioning/dashboards/grainguard-overview.json` | Switched the OTEL exporter to `otlp_http` with queue/retry/batch and a `health_check` extension; added a collector health port mapping; increased Prometheus scrape/eval intervals and added gateway/bff/tempo jobs; added datasource UIDs and pruning; replaced dashboard panels with gateway-focused metrics. |
| **Docker compose & local dev**<br>`infra/docker/docker-compose.yml`, `apps/dashboard/vite.config.ts` | Added host binding for the OTEL collector health port (`127.0.0.1:13133:13133`); changed the dashboard dev server port (5174 → 5173). |
| **Go module deps**<br>`go.mod` | Bumped OpenTelemetry Go modules from v1.41.0 → v1.42.0. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Codex/phase2 platform hardening' is vague and does not clearly summarize the changes, which include infrastructure updates, Argo CD configuration changes, backup/restore scripts, and monitoring enhancements. | Replace with a more specific title, such as 'Update Argo CD project configuration and add backup/restore automation', or break the work into multiple PRs with more focused scopes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |



Warning

Review ran into problems

🔥 Problems

Git: Failed to clone repository. Please run the @coderabbitai full review command to re-trigger a full review. If the issue persists, set path_filters to include or exclude specific files.



@coderabbitai Bot left a comment

Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
k8s/helm/grainguard/values.yaml (1)

Lines 220-266: 🧹 Nitpick | 🔵 Trivial

Consider adding PDB and NetworkPolicy to remaining services.

Services cdc-transformer, dlq-reprocessor, and search-indexer lack pdb and networkPolicy configurations. While cdc-transformer and dlq-reprocessor have replicaCount: 1 (making PDB less useful), search-indexer could benefit from these for consistency.

If this is intentional (e.g., these are non-critical background workers), consider adding a comment to document the decision.

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@argocd/app-of-apps.yaml`:
- Around line 11-13: Two manifests define the same Application identity
(metadata.name: grainguard-apps in namespace: argocd) but point to different
spec.source.path values, causing last-write-wins drift; resolve by choosing one
canonical root App-of-Apps or renaming one Application to a unique metadata.name
and update any scripts/RBAC/policy references that depend on that name, or
consolidate the differing spec.source.path entries into the single canonical
manifest so there is only one grainguard-apps Application in the cluster.
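
As a sketch, the rename option could look like the following; the follow-up commit on this PR took this route, using the name grainguard-apps-legacy (other fields elided, and the spec shown here is assumed rather than copied from the repo):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: grainguard-apps-legacy   # previously grainguard-apps; must now be unique
  namespace: argocd
spec:
  project: grainguard
  # source/destination left as in the original manifest
```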

In `@docs/runbooks/postgres-backup-restore.md`:
- Around line 27-29: Replace the hardcoded absolute path example in
docs/runbooks/postgres-backup-restore.md with an instruction to run the script
from the repo root: remove the "cd /Users/..." line and instead prefix the
script invocation with a comment like "From the repository root" and run
./infra/scripts/backup-postgres.sh (and make the identical change for the second
occurrence later in the file around the restore/other example), so users use
relative paths to their own clones rather than a local absolute path.
- Around line 15-16: Update the Markdown links in
docs/runbooks/postgres-backup-restore.md to avoid machine-specific absolute
paths by using repository-relative paths for the scripts; replace
`/Users/pahuldeep/Developer/grainguard/infra/scripts/backup-postgres.sh` and
`/Users/pahuldeep/Developer/grainguard/infra/scripts/restore-postgres.sh` with
relative links such as `infra/scripts/backup-postgres.sh` and
`infra/scripts/restore-postgres.sh` (or use `./infra/scripts/...`) so
contributors can open the scripts regardless of their local clone location.

In `@docs/runbooks/redis-backup-restore.md`:
- Around line 14-15: Update the markdown to stop linking to machine-specific
absolute paths and to remove hardcoded cd commands: replace the absolute links
to backup-redis.sh and restore-redis.sh with relative repository paths (e.g.,
./infra/scripts/backup-redis.sh and ./infra/scripts/restore-redis.sh or path
relative to the docs folder) and remove or change the hardcoded "cd /Users/..."
occurrences to a generic project-root relative `cd` (or instruct readers to run
commands from the repo root) where the docs currently reference the
user-specific path; target the links for backup-redis.sh and restore-redis.sh
and the two "cd" instances referenced in the file.

In `@infra/docker/docker-compose.yml`:
- Line 660: The port mapping "13133:13133" exposes the collector health endpoint
on all host interfaces; change the mapping so the port is bound to loopback or
internal-only (e.g., replace "13133:13133" with "127.0.0.1:13133:13133" or
remove the host mapping and expose only on the internal network) in the
docker-compose.yml service that currently declares the "13133:13133" mapping to
ensure the health endpoint is not reachable from external hosts.
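
For illustration, the loopback-only mapping would look like this in the collector's service definition (the service name and the OTLP port shown for context are assumptions):

```yaml
services:
  otel-collector:                  # service name is an assumption
    ports:
      - "4317:4317"                # existing OTLP port, shown for context (assumed)
      - "127.0.0.1:13133:13133"    # health endpoint reachable from localhost only
```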

In `@infra/docker/grafana/provisioning/dashboards/grainguard-overview.json`:
- Around line 10-11: The dashboard panels "Kafka Consumer Lag" and "Publish
Queue Depth" use Prometheus expressions that include "or vector(0)" for metrics
kafka_consumer_lag, publish_queue_depth, worker_queue_depth and
commit_queue_depth which masks missing telemetry; either (A) add/enable the
scrape targets and instrumentation that emit those metric names so the
expressions in the targets (e.g., sum(kafka_consumer_lag) and
sum(publish_queue_depth) / sum(worker_queue_depth) / sum(commit_queue_depth))
return real series, or (B) remove the "or vector(0)" from the target expressions
in those panels (titles "Kafka Consumer Lag" and "Publish Queue Depth") so
Grafana surfaces "no data" instead of showing zeros, or replace the expressions
with existing, backed metrics used elsewhere in the gateway if appropriate.

In `@infra/scripts/backup-redis.sh`:
- Around line 13-14: Replace the blocking redis-cli SAVE call with a
non-blocking BGSAVE and wait loop: invoke docker exec "${REDIS_CONTAINER}"
redis-cli BGSAVE, then poll docker exec "${REDIS_CONTAINER}" redis-cli LASTSAVE
in a loop (with a short sleep and timeout) until LASTSAVE is newer than the
start timestamp; update the echo/log messages accordingly and ensure the script
fails if the BGSAVE does not complete within the timeout to avoid hanging
indefinitely.
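
A minimal sketch of this non-blocking save pattern, reusing the script's ${REDIS_CONTAINER} variable; the 60-second timeout is illustrative:

```bash
#!/usr/bin/env bash
set -euo pipefail

REDIS_CONTAINER="${REDIS_CONTAINER:-redis}"   # container name is an assumption
TIMEOUT=60

# Record the timestamp of the last completed save, then trigger a background save.
start=$(docker exec "${REDIS_CONTAINER}" redis-cli LASTSAVE)
docker exec "${REDIS_CONTAINER}" redis-cli BGSAVE

# Poll LASTSAVE until it advances past the recorded timestamp, or time out.
elapsed=0
until [ "$(docker exec "${REDIS_CONTAINER}" redis-cli LASTSAVE)" -gt "${start}" ]; do
  sleep 2
  elapsed=$((elapsed + 2))
  if [ "${elapsed}" -ge "${TIMEOUT}" ]; then
    echo "BGSAVE did not complete within ${TIMEOUT}s" >&2
    exit 1
  fi
done
echo "BGSAVE complete"
```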

In `@infra/scripts/restore-postgres.sh`:
- Around line 41-43: The pg_restore invocation in restore-postgres.sh currently
includes the redundant flags --clean and --if-exists when restoring into a
freshly DROP/CREATE'd database; remove these flags from the pg_restore command
that uses "${dump_file}" so the command no longer passes --clean or --if-exists
(leave other flags like --no-owner intact) to avoid unnecessary/no-op options.
- Around line 34-37: The DROP DATABASE command fails when other clients hold
connections; before the docker exec psql call that runs DROP DATABASE for
"${db}" (using "${container}" and "${PGUSER}"), run a psql command that
terminates all other backends for that database (e.g., query pg_stat_activity
and call pg_terminate_backend(pid) for pid != pg_backend_pid()), then proceed to
DROP and CREATE; alternatively update the script to require stopping dependent
services before restore and document that requirement.
- Around line 35-37: The script interpolates ${db} directly into SQL for
DROP/CREATE which is unsafe for identifiers; update the psql commands that use
${db} (the DROP and CREATE statements executed by docker exec ... psql -U
"${PGUSER}" -d postgres -v ON_ERROR_STOP=1) to quote the identifier instead of
plain interpolation (e.g., construct the SQL using double-quoted identifiers so
the database name is wrapped in quotes), and ensure any embedded quotes in ${db}
are handled (e.g., via proper shell quoting/escaping or passing the identifier
as a psql variable) so both DROP DATABASE and CREATE DATABASE use a quoted
identifier.
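
A combined sketch of the two fixes above (terminate other backends, then drop/recreate with a quoted identifier), reusing the ${container}, ${PGUSER}, and ${db} variables named in the comments and assuming ${db} contains no quote characters:

```bash
# Kick out any other sessions connected to the target database.
docker exec "${container}" psql -U "${PGUSER}" -d postgres -v ON_ERROR_STOP=1 -c \
  "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE datname = '${db}' AND pid <> pg_backend_pid();"

# Drop and recreate using a double-quoted identifier.
docker exec "${container}" psql -U "${PGUSER}" -d postgres -v ON_ERROR_STOP=1 \
  -c "DROP DATABASE IF EXISTS \"${db}\";" \
  -c "CREATE DATABASE \"${db}\";"
```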

In `@infra/scripts/restore-redis.sh`:
- Around line 19-20: The restore script currently copies RDB into the stopped
container but does not remove existing AOF, so Redis running with --appendonly
yes may replay AOF and override the restore; update restore-redis.sh to ensure
/data/appendonly.aof (and any appendonly.aof.*) is removed from the Redis data
volume before starting Redis: since docker exec won't work on a stopped
container, remove the AOF by either (a) mounting the same volume into a
short-lived helper container and rm /data/appendonly.aof after copying
${RDB_FILE} into the volume, or (b) start the ${REDIS_CONTAINER} in a controlled
way, immediately run docker exec ${REDIS_CONTAINER} rm -f /data/appendonly.aof
&& rm -f /data/appendonly.aof.* before allowing Redis to start; ensure the
script references REDIS_CONTAINER and RDB_FILE and logs the AOF removal step.
- Around line 22-23: After starting the container with docker start
"${REDIS_CONTAINER}", add a retry loop that repeatedly runs docker exec
"${REDIS_CONTAINER}" redis-cli PING until it returns PONG (or until a
configurable timeout/maximum attempts is reached), sleeping briefly between
attempts; if the loop times out, log an error and exit non‑zero. Place this loop
immediately after docker start and before any commands that assume Redis is
ready to ensure the script waits for the redis-cli PING to succeed.
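
A sketch combining both restore-redis.sh fixes (AOF cleanup via a helper container before start, then a readiness loop); the alpine helper image and the 30-attempt limit are assumptions:

```bash
# Remove stale AOF files via a throwaway container sharing the data volume;
# docker exec cannot be used because ${REDIS_CONTAINER} is still stopped here.
docker run --rm --volumes-from "${REDIS_CONTAINER}" alpine \
  sh -c 'rm -f /data/appendonly.aof /data/appendonly.aof.*'

docker start "${REDIS_CONTAINER}"

# Wait until Redis answers PING, failing after 30 attempts.
attempts=0
until [ "$(docker exec "${REDIS_CONTAINER}" redis-cli PING 2>/dev/null)" = "PONG" ]; do
  attempts=$((attempts + 1))
  if [ "${attempts}" -ge 30 ]; then
    echo "Redis did not become ready" >&2
    exit 1
  fi
  sleep 1
done
echo "Redis is ready"
```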

In `@k8s/argocd/install.sh`:
- Around line 17-18: The install script applies a conflicting AppProject
(kubectl apply -f k8s/argocd/project.yaml) that can overwrite the role policies
applied by scripts/deploy.sh (argocd/project.yaml); fix by making a single
source of truth: remove the redundant apply from k8s/argocd/install.sh or change
it to reuse the same canonical AppProject (e.g., call the shared deploy script
or apply the single agreed project file), or add a guard that compares the two
project specs and aborts if they differ so the dev-deployer/prod-deployer
policies are never silently dropped.

In `@k8s/helm/grainguard/templates/networkpolicy.yaml`:
- Around line 22-34: The network policy fragment using
$networkPolicy.allowExternal currently emits only a ports list (no from clause),
which permits traffic from any source; update the template logic around
allowExternal in networkpolicy.yaml so that when allowExternal is true you
either (a) explicitly add a from selector limiting sources to your ingress
controller namespace/labels (e.g., match ingress controller
podSelector/namespaceSelector) or (b) document/guard that allowExternal must
only be set for gateway services—locate the block referencing
$networkPolicy.allowExternal and $svc.port/$svc.grpcPort/$svc.healthPort and
implement the chosen change (add a from: with appropriate selector when
restricting ingress, or add validation/commentary preventing accidental use on
internal services).

In `@k8s/helm/grainguard/templates/pdb.yaml`:
- Line 3: The PDB template currently forces replicaCount > 1 and ignores
explicit pdb.enabled: true; update the conditional so a PDB is rendered when the
service is enabled and either pdb.enabled is truthy or replicaCount > 1.
Concretely, modify the if-expression that uses $svc.enabled, $pdb.enabled, and
$svc.replicaCount so $pdb.enabled (with a sensible default) can opt-in
regardless of replicaCount—i.e., test pdb.enabled OR replicaCount > 1 instead of
requiring replicaCount > 1 in all cases.

---

Outside diff comments:
In `@k8s/helm/grainguard/values.yaml`:
- Around line 220-266: Add explicit pdb and networkPolicy configuration entries
for the cdc-transformer, dlq-reprocessor, and search-indexer blocks in
values.yaml (or document why omitted); specifically, add a pdb subsection (even
if disabled when replicaCount: 1) and a networkPolicy subsection under each
service (e.g., cdc-transformer, dlq-reprocessor, search-indexer) with sensible
defaults or disabled flags so chart consumers see the options; if you
intentionally want them omitted, add a short comment next to the replicaCount or
service block stating they are non-critical background workers and why
pdb/networkPolicy are not provided.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c9f0978e-fd4e-45cd-8896-99cb55caf913

📥 Commits

Reviewing files that changed from the base of the PR and between 2ae66f6 and 6de789d.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (26)
  • apps/dashboard/vite.config.ts
  • argocd/app-of-apps.yaml
  • argocd/environments/dev.yaml
  • argocd/environments/prod.yaml
  • argocd/project.yaml
  • docs/runbooks/postgres-backup-restore.md
  • docs/runbooks/redis-backup-restore.md
  • go.mod
  • infra/docker/docker-compose.yml
  • infra/docker/grafana/provisioning/dashboards/grainguard-overview.json
  • infra/docker/grafana/provisioning/datasources/datasources.yaml
  • infra/docker/otel-collector-config.yaml
  • infra/docker/prometheus.yml
  • infra/scripts/backup-postgres.sh
  • infra/scripts/backup-redis.sh
  • infra/scripts/restore-postgres.sh
  • infra/scripts/restore-redis.sh
  • k8s/argocd/app-of-apps.yaml
  • k8s/argocd/apps/grainguard-dev.yaml
  • k8s/argocd/apps/grainguard-prod.yaml
  • k8s/argocd/install.sh
  • k8s/argocd/project.yaml
  • k8s/helm/grainguard/templates/deployment.yaml
  • k8s/helm/grainguard/templates/networkpolicy.yaml
  • k8s/helm/grainguard/templates/pdb.yaml
  • k8s/helm/grainguard/values.yaml

Collapsed comment threads:
  • argocd/app-of-apps.yaml
  • docs/runbooks/postgres-backup-restore.md (outdated)
  • docs/runbooks/postgres-backup-restore.md
  • docs/runbooks/redis-backup-restore.md (outdated)
  • infra/docker/docker-compose.yml (outdated)
  • infra/scripts/restore-redis.sh
  • infra/scripts/restore-redis.sh (outdated)

Comment thread: k8s/argocd/install.sh
Comment on lines +17 to +18
echo "==> Applying GrainGuard ArgoCD project..."
kubectl apply -f k8s/argocd/project.yaml
⚠️ Potential issue | 🟠 Major

This applies an RBAC-incomplete AppProject variant and can remove role policies.

Line 18 applies k8s/argocd/project.yaml, while scripts/deploy.sh applies argocd/project.yaml (same AppProject identity, different spec content). Last apply wins, so dev-deployer/prod-deployer policies can be silently dropped.

🔧 Proposed fix (single source of truth)
 echo "==> Applying GrainGuard ArgoCD project..."
-kubectl apply -f k8s/argocd/project.yaml
+kubectl apply -f argocd/project.yaml

Comment on lines +22 to +34
{{- if $networkPolicy.allowExternal }}
- ports:
- protocol: TCP
port: {{ $svc.port }}
{{- if $svc.grpcPort }}
- protocol: TCP
port: {{ $svc.grpcPort }}
{{- end }}
{{- if and $svc.healthPort (ne $svc.healthPort $svc.port) }}
- protocol: TCP
port: {{ $svc.healthPort }}
{{- end }}
{{- end }}

🧹 Nitpick | 🔵 Trivial

Clarify: allowExternal permits traffic from any source.

The ingress rule for allowExternal: true has no from selector, meaning it allows traffic from any source (inside or outside the cluster) on the specified ports. This is correct for the gateway service (LoadBalancer), but ensure this flag isn't accidentally enabled on internal services.

# Current behavior when allowExternal: true
- ports:          # No 'from' selector = allow from anywhere
    - protocol: TCP
      port: {{ $svc.port }}

If you want to restrict external traffic to only come through an ingress controller, you'd need to add a from selector with the ingress namespace/labels.
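
If option (a) is chosen, a hedged sketch of the restricted rule might look like the following; the ingress-nginx namespace label is an assumption about the cluster's ingress controller, not taken from this chart:

```yaml
{{- if $networkPolicy.allowExternal }}
- from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-nginx   # assumed ingress namespace
  ports:
    - protocol: TCP
      port: {{ $svc.port }}
{{- end }}
```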

🧰 Tools
🪛 YAMLlint (1.38.0)
  • [error] 22-22: too many spaces inside braces (braces)
  • [error] 25-25: too many spaces inside braces (braces)
  • [error] 25-25: too many spaces inside braces (braces)

Comment thread: k8s/helm/grainguard/templates/pdb.yaml (outdated)

A follow-up commit on this branch addressed the review findings:

- argocd/app-of-apps.yaml: rename to grainguard-apps-legacy to prevent
  last-write-wins collision with k8s/argocd/app-of-apps.yaml (same name,
  different source paths)
- k8s/argocd/project.yaml: add dev-deployer/prod-deployer RBAC roles so
  install.sh no longer silently drops policies vs argocd/project.yaml
- infra/docker/docker-compose.yml: bind OTel health port to 127.0.0.1
  so 13133 is not reachable from external interfaces
- infra/scripts/restore-redis.sh: clear AOF files before starting Redis
  to prevent appendonly.aof from overwriting restored RDB; add retry loop
  so PING check doesn't race against Redis startup
- docs/runbooks: replace hardcoded /Users/pahuldeep/Developer/grainguard
  absolute paths with relative paths for portability
- k8s/helm/grainguard/templates/pdb.yaml: respect explicit pdb.enabled:
  true for single-replica services instead of ignoring it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (1)
docs/runbooks/postgres-backup-restore.md (1)

Lines 14-16: ⚠️ Potential issue | 🟡 Minor

Markdown links won't resolve correctly.

Since this file is at docs/runbooks/postgres-backup-restore.md, the links infra/scripts/... resolve to docs/runbooks/infra/scripts/... which doesn't exist. Use relative paths that traverse up to the repo root.

📝 Proposed fix
 - scripts:
-  - [`backup-postgres.sh`](infra/scripts/backup-postgres.sh)
-  - [`restore-postgres.sh`](infra/scripts/restore-postgres.sh)
+  - [`backup-postgres.sh`](../../infra/scripts/backup-postgres.sh)
+  - [`restore-postgres.sh`](../../infra/scripts/restore-postgres.sh)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/runbooks/redis-backup-restore.md`:
- Around line 13-15: Update the markdown links for the script references so they
resolve from docs/runbooks to the repository root: replace the current relative
links for `backup-redis.sh` and `restore-redis.sh` with links that traverse up
two levels (e.g., `../../infra/scripts/backup-redis.sh` and
`../../infra/scripts/restore-redis.sh`) so the `backup-redis.sh` and
`restore-redis.sh` links point to the correct files.

In `@k8s/helm/grainguard/templates/pdb.yaml`:
- Line 4: The helm template uses the default function so an explicit
pdb.enabled: false is treated as unset and ignored; update the logic that sets
$pdbEnabled (currently using "$pdb.enabled | default (gt $replicaCount 1)") to
first check whether the pdb.enabled key exists using hasKey and only fall back
to (gt $replicaCount 1) when pdb.enabled is not present; locate the pdb.enabled
reference and replace the default-based expression with a hasKey-based
conditional that honors an explicit false while still enabling PDBs by default
when replicaCount > 1.

---

Duplicate comments:
In `@docs/runbooks/postgres-backup-restore.md`:
- Around line 14-16: The markdown links to the scripts are resolving relative to
docs/runbooks; update the two links for `backup-postgres.sh` and
`restore-postgres.sh` so they traverse up to the repo root (replace
`infra/scripts/...` with `../../infra/scripts/...`) so the links point to the
actual files in the infra/scripts directory.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3b34c9b1-d12c-4280-84da-5e094d9389cb

📥 Commits

Reviewing files that changed from the base of the PR and between 6de789d and aac2710.

📒 Files selected for processing (7)
  • argocd/app-of-apps.yaml
  • docs/runbooks/postgres-backup-restore.md
  • docs/runbooks/redis-backup-restore.md
  • infra/docker/docker-compose.yml
  • infra/scripts/restore-redis.sh
  • k8s/argocd/project.yaml
  • k8s/helm/grainguard/templates/pdb.yaml

Comment on lines +13 to +15
- scripts:
- [`backup-redis.sh`](infra/scripts/backup-redis.sh)
- [`restore-redis.sh`](infra/scripts/restore-redis.sh)

⚠️ Potential issue | 🟡 Minor

Markdown links won't resolve correctly.

Same issue as the Postgres runbook—since this file is at docs/runbooks/redis-backup-restore.md, these links need to traverse up to the repo root.

📝 Proposed fix
 - scripts:
-  - [`backup-redis.sh`](infra/scripts/backup-redis.sh)
-  - [`restore-redis.sh`](infra/scripts/restore-redis.sh)
+  - [`backup-redis.sh`](../../infra/scripts/backup-redis.sh)
+  - [`restore-redis.sh`](../../infra/scripts/restore-redis.sh)

Comment thread: k8s/helm/grainguard/templates/pdb.yaml
{{- if and $svc.enabled (gt ($svc.replicaCount | default 1 | int) 1) }}
{{- $pdb := $svc.pdb | default dict }}
{{- $replicaCount := ($svc.replicaCount | default 1 | int) }}
{{- $pdbEnabled := $pdb.enabled | default (gt $replicaCount 1) }}

⚠️ Potential issue | 🟡 Minor

Explicit pdb.enabled: false is ignored for multi-replica services.

Helm's default function treats false as falsy, so $pdb.enabled | default (gt $replicaCount 1) returns true when replicaCount > 1 even if pdb.enabled is explicitly set to false. This prevents operators from opting out of PDBs for high-replica services.

Use hasKey to distinguish between "not set" and "explicitly false":

🔧 Proposed fix using hasKey
 {{- $pdb := $svc.pdb | default dict }}
 {{- $replicaCount := ($svc.replicaCount | default 1 | int) }}
-{{- $pdbEnabled := $pdb.enabled | default (gt $replicaCount 1) }}
+{{- $pdbEnabled := ternary $pdb.enabled (gt $replicaCount 1) (hasKey $pdb "enabled") }}
 {{- if and $svc.enabled $pdbEnabled }}
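
With the proposed hasKey-based expression, values like the following behave as operators expect (the service blocks shown are illustrative, not copied from values.yaml):

```yaml
gateway:
  replicaCount: 3          # pdb key absent -> PDB rendered (replicas > 1)

bff:
  replicaCount: 3
  pdb:
    enabled: false         # explicit opt-out now honored -> no PDB

dlq-reprocessor:
  replicaCount: 1
  pdb:
    enabled: true          # explicit opt-in -> PDB rendered despite one replica
```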

pahuldeepp merged commit 584c795 into master on Mar 28, 2026
59 checks passed
pahuldeepp deleted the codex/phase2-platform-hardening branch on March 28, 2026 at 14:40
This was referenced Mar 28, 2026
pahuldeepp added a commit that referenced this pull request Mar 28, 2026
HIGH PRIORITY:
- #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual
- #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods)
- #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying

MEDIUM PRIORITY:
- #11: Add webhook idempotency check (dedup by endpoint_id + event_type)
- #12: Assert Stripe webhook body is Buffer before signature verification
- #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query)
- #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results
- #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow

LOW PRIORITY:
- #20: Add 10s query timeout to all saga repository DB operations
- #23: Add IsValidStatus validator for saga status constants
- #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF
- #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers
- #7: Fix remaining saga JSON marshal error check (initialErr)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
