
feat(staging): scaffold pre-prod environment and harden local/runtime flows#15

Merged
pahuldeepp merged 21 commits into master from codex/staging-bootstrap
Mar 29, 2026

Conversation

@pahuldeepp
Owner

@pahuldeepp pahuldeepp commented Mar 29, 2026

Summary by CodeRabbit

  • New Features

    • Staging infra & automated deployment: new staging Terraform, Argo CD application/role, and staging Helm values; dev/prod namespace values added
    • New k6 load tests: GraphQL, ingest, mixed-stack; writes JSON summaries
    • Docker Compose now reads a shared root .env
    • Export filename now respects server-provided filename
  • Bug Fixes

    • Improved export error handling and clearer user-facing messages
    • SSO endpoint degrades gracefully when Auth0 management API is unavailable
    • Tenant deletion response now reflects retained/immutable audit events
  • Documentation

    • README updated: capabilities, deployment status, runbooks (Postgres/Redis backups), roadmap, and local validation metrics

@coderabbitai

coderabbitai Bot commented Mar 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Adds a staging Terraform environment and ArgoCD/Helm staging deployment, introduces multiple k6 load tests, tightens error handling for SSO and account export/deletion (audit-aware), centralizes Docker Compose envs, updates README/runbooks/roadmap, and improves proto resolution for gRPC device service.

Changes

  • Documentation & runbooks — README.md: Rewrote the capability description and testing/deployment guidance, added Postgres/Redis backup runbooks, updated the roadmap/ADRs, replaced performance metrics with local Docker validation, and adjusted the final tagline.
  • Docker Compose env centralization — infra/docker/docker-compose.yml: Shared env_file: - ../../.env for gateway, grafana, and jobs-worker; removed multiple inline secret/default environment overrides.
  • Account deletion & export (server + tests) — apps/gateway/src/routes/account.ts, apps/gateway/src/routes/__tests__/account.test.ts: Added apiRateLimiter; the delete flow now removes telemetry_readings, deletes devices, then COUNTs audit_events and conditionally returns a message if immutable audits remain; the export endpoint is wrapped in try/catch returning HTTP 500 { error: "internal_error" } on failure; tests updated for the new sequence and audit-aware response.
  • SSO resilience (server + tests) — apps/gateway/src/routes/sso.ts, apps/gateway/src/routes/__tests__/sso.test.ts: The listOrgConnections(orgId) call is wrapped in try/catch; on failure it logs and returns { configured: true, orgId, connections: [], warning }; a test was added for Auth0 management API failure (imports the mocked function).
  • Frontend export handler — apps/dashboard/src/features/settings/SettingsPage.tsx: handleExport checks res.ok, parses JSON error bodies on failure, extracts the filename from content-disposition when present, and surfaces error messages from the thrown Error.
  • gRPC proto resolution — apps/gateway/src/services/device.ts: Proto resolution builds prioritized candidate paths (cwd, __dirname variants, absolute paths) and throws an explicit error listing the checked paths if none exist before loading the proto.
  • Load testing scripts (k6) — scripts/load-tests/graphql-stress.js, scripts/load-tests/ingest-stress.js, scripts/load-tests/mixed-stack-stress.js, scripts/load-tests/performance-budget.js: Added three k6 scripts (GraphQL, ingest, mixed) with staged scenarios, thresholds, Trends/Rates, env-driven config, and handleSummary artifacts; minor catch-variable fix in performance-budget.js.
  • Staging Terraform environment — infra/terraform/environments/staging/*, infra/terraform/environments/dev/main.tf: Added a complete staging Terraform environment (providers, backend, variables, lockfile, outputs, main.tf) wiring the vpc, eks, rds, elasticache, msk, dynamodb, secrets_manager, iam_irsa, and ecr modules; changed the dev IRSA namespace to grainguard-dev; added .terraform/ to the staging .gitignore.
  • ArgoCD & Helm (staging/dev/prod) — k8s/argocd/apps/grainguard-staging.yaml, k8s/argocd/project.yaml, k8s/helm/grainguard/values-staging.yaml, k8s/helm/grainguard/values-dev.yaml, k8s/helm/grainguard/values-prod.yaml: Added the ArgoCD Application grainguard-staging and updated project destinations/roles; new values-staging.yaml with the grainguard-staging namespace, image.pullPolicy, and per-service replica/autoscaling settings; dev/prod values set explicit namespaces.
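The prioritized proto-path resolution described for apps/gateway/src/services/device.ts could look roughly like this minimal sketch; resolveProtoPath and its arguments are illustrative names, not the PR's actual code:

```typescript
import * as fs from "fs";
import * as path from "path";

// Build prioritized candidate locations and return the first that exists;
// throw an error listing every checked path when none match. All names here
// are assumptions, not the repository's actual identifiers.
function resolveProtoPath(filename: string, roots: string[]): string {
  const candidates = roots.map((root) => path.resolve(root, filename));
  for (const candidate of candidates) {
    if (fs.existsSync(candidate)) {
      return candidate;
    }
  }
  throw new Error(`Proto file not found. Checked:\n${candidates.join("\n")}`);
}
```

Listing the checked paths in the error, as the PR does, makes a misdeployed proto file immediately diagnosable from logs.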

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant K6 as K6 runner
    participant Gateway as Gateway (GraphQL)
    participant Ingest as Ingest service
    participant DB as Data stores (DB/TS)
    K6->>Gateway: POST /graphql (graphqlReader)
    Gateway->>DB: Query devices / read model
    DB-->>Gateway: devices payload
    Gateway-->>K6: HTTP 200 + data
    K6->>Ingest: POST /ingest (ingestWriter) with X-Api-Key
    Ingest->>DB: Insert telemetry/event
    DB-->>Ingest: accepted:true / 202
    Ingest-->>K6: HTTP 202 + {accepted:true}
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
✅ Passed checks (2 passed)
Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
Title check — ✅ Passed: The title accurately captures the main changes: introducing a staging/pre-production environment via new Terraform configs and Kubernetes manifests, plus hardening of runtime flows through improved error handling and resilience in account/SSO/device management routes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Comment on lines 79 to -135
@@ -113,33 +133,38 @@ accountRouter.get(
  "/account/export",
  authMiddleware,
  async (req: Request, res: Response) => {
    const tenantId = req.user!.tenantId;

    const [tenantResult, usersResult, devicesResult, alertsResult, auditResult, keysResult] =
      await Promise.all([
        pool.query("SELECT id, name, slug, plan, email, created_at FROM tenants WHERE id = $1", [tenantId]),
        pool.query("SELECT id, email, role, created_at FROM tenant_users WHERE tenant_id = $1", [tenantId]),
        pool.query("SELECT id, serial_number, created_at FROM devices WHERE tenant_id = $1", [tenantId]),
        pool.query("SELECT id, name, metric, operator, threshold, enabled, created_at FROM alert_rules WHERE tenant_id = $1", [tenantId]),
        pool.query("SELECT id, event_type, actor_id, resource_type, payload, created_at FROM audit_events WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT 1000", [tenantId]),
        pool.query("SELECT id, name, created_at, expires_at, revoked_at FROM api_keys WHERE tenant_id = $1", [tenantId]),
      ]);

    const exportData = {
      exportedAt: new Date().toISOString(),
      tenant: tenantResult.rows[0] || null,
      users: usersResult.rows,
      devices: devicesResult.rows,
      alertRules: alertsResult.rows,
      auditEvents: auditResult.rows,
      apiKeys: keysResult.rows,

Check failure

Code scanning / CodeQL

Missing rate limiting High

This route handler performs a database access, but is not rate-limited. (This finding is repeated for each of the 12 database accesses in the handler.)

Copilot Autofix

AI — 29 days ago

Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or, if the problem persists, contact support.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 15

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
apps/gateway/src/routes/sso.ts (2)

129-144: 🛠️ Refactor suggestion | 🟠 Major

Auth0 errors propagate unhandled on mutating routes.

The POST /saml route (and similarly /oidc and DELETE) calls Auth0 functions (createSamlConnection, enableConnectionOnOrg) without try/catch. If Auth0 returns an error (e.g., rate limit, invalid cert format, network timeout), it propagates as an unhandled exception.

Unlike the GET route which now gracefully degrades, these mutating routes will return 500 errors with potentially leaky error messages. Consider wrapping Auth0 calls with consistent error handling that returns structured error responses.

Example error handling pattern
try {
  const { connectionId } = await createSamlConnection({
    tenantId, name, signInUrl, signingCert, emailDomains,
  });
  await enableConnectionOnOrg(orgId, connectionId);
  // ... DB updates
} catch (error) {
  console.error("[sso] SAML setup failed:", error);
  const message = error instanceof Error ? error.message : "unknown";
  if (message.includes("SSO not configured")) {
    return res.status(503).json({ error: "sso_not_configured" });
  }
  return res.status(502).json({ error: "auth0_error", message: "Failed to configure SAML connection" });
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/sso.ts` around lines 129 - 144, Wrap the Auth0 calls
in the POST /saml handler (specifically the createSamlConnection and
enableConnectionOnOrg calls and the subsequent DB update using pool.query) in a
try/catch block, logging the caught error and returning a structured HTTP error
response instead of letting exceptions bubble; map known error cases (e.g.,
message includes "SSO not configured") to a 503 with error key
"sso_not_configured" and return a 502 with error key "auth0_error" and a generic
message like "Failed to configure SAML connection" for other Auth0/network
errors; apply the same try/catch/error-mapping pattern to the POST /oidc and
DELETE SSO routes so all mutating SSO endpoints consistently handle and sanitize
Auth0 errors.

84-94: ⚠️ Potential issue | 🟠 Major

Risk of inconsistent state without transaction or compensating action.

If createOrganization at line 85 succeeds but the pool.query UPDATE at lines 88-91 fails, the Auth0 org exists but auth0_org_id is not recorded. Subsequent calls will create another org (the idempotency check at line 79 reads from the DB, not Auth0).

Since Auth0 is external, a simple transaction won't fully solve this. Consider:

  1. Wrap DB operations in a transaction
  2. On DB failure after Auth0 success, either retry the DB update or delete the Auth0 org

The same pattern applies to POST /saml, POST /oidc, and DELETE routes below.

As per coding guidelines: "Verify every mutating route uses a real transaction client instead of pool-level BEGIN/COMMIT."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/sso.ts` around lines 84 - 94, The code may create an
Auth0 org via createOrganization(tenantId, rows[0].name) but then fail to
persist auth0_org_id with pool.query, leaving an inconsistent external state;
update the POST /sso handler to use a real DB transaction client (start a
transaction client, use that client for the SELECT/UPDATE/INSERT operations
instead of pool), and after createOrganization succeeds ensure either (a) retry
the DB UPDATE within the same transaction client and commit only on success, or
(b) if the DB update ultimately fails, call the Auth0 cleanup delete (using the
orgId returned from createOrganization) as a compensating action before rolling
back; apply the same pattern (transaction client + retry or compensating delete)
to the POST /saml, POST /oidc and relevant DELETE routes that create external
Auth0 resources so external and DB state remain consistent.
apps/gateway/src/services/device.ts (1)

50-56: ⚠️ Potential issue | 🔴 Critical

K8s deployments (staging and production) lack gRPC mTLS certificate volume mounts.

The concern is valid. Verification reveals that:

  • Docker-compose correctly mounts /certs:/certs for both gateway and telemetry-service
  • K8s base deployments (gateway.yaml, telemetry-service.yaml) have no volume mounts for certs
  • Helm template does not include a volumes/volumeMounts section
  • Helm values files for staging and production contain no cert volume configuration

With current K8s deployments, both the gateway and telemetry-service will always trigger the fs.existsSync("/certs/ca.crt") check to fail, falling back to grpc.credentials.createInsecure(). This means unencrypted gRPC traffic in staging and production environments.

Add volumeMounts and volumes to the K8s base manifests and/or Helm template to mount the TLS certificates, or update the code to source cert paths from environment variables that point to K8s secret-mounted paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/services/device.ts` around lines 50 - 56, The code currently
checks fs.existsSync("/certs/ca.crt") and falls back to
grpc.credentials.createInsecure(), causing unencrypted gRPC in K8s because certs
are not mounted; fix by either (A) adding volume and volumeMount entries to the
K8s base manifests/Helm chart and values so /certs (containing ca.crt,
gateway-client.key, gateway-client.crt) is mounted into the gateway and
telemetry-service pods, or (B) update the certificate loading in this module to
read cert file paths from environment variables (e.g., CERT_DIR or CA_PATH)
instead of hardcoding "/certs", and ensure Helm values and K8s manifests mount
the Secret at those paths; ensure the code uses those env vars when calling
fs.existsSync and grpc.credentials.createSsl so createSsl is used when certs are
present.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/dashboard/src/features/settings/SettingsPage.tsx`:
- Around line 106-123: The catch currently shows raw backend error codes (e.g.,
"internal_error") to users; add a small translation layer and use it before
calling toast.error. Implement a helper like translateExportError(errorCode:
string): string (add near SettingsPage or above the export handler) that maps
known codes (e.g., "internal_error" -> "An internal server error occurred.
Please try again later.", "not_authenticated" -> "Please sign in to export
data.", etc.) and returns a friendly default for unknown values; then replace
the toast.error call in the catch block to use
toast.error(translateExportError(e instanceof Error ? e.message : "")) so users
see human-friendly messages instead of raw machine codes.
- Around line 116-118: The regex used to extract the filename in
SettingsPage.tsx (where disposition is matched and filenameMatch is derived)
contains unnecessary escapes for double quotes; update the pattern used in
disposition.match(...) (the expression that sets filenameMatch) to remove the
backslashes so it reads /filename="?([^"]+)"?/i so ESLint's no-useless-escape
rule is satisfied and a.download assignment still falls back to the same default
filename when no match is found.
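A minimal sketch of the translation layer suggested in the first comment above; the code-to-copy mapping is an assumption, not the dashboard's actual strings:

```typescript
// Map machine error codes from the gateway to user-facing copy, with a
// friendly default for unknown codes. The specific messages are illustrative.
function translateExportError(errorCode: string): string {
  const messages: Record<string, string> = {
    internal_error: "An internal server error occurred. Please try again later.",
    not_authenticated: "Please sign in to export data.",
  };
  return messages[errorCode] ?? "Export failed. Please try again.";
}
```

The catch block would then call toast.error(translateExportError(e instanceof Error ? e.message : "")) so raw codes never reach the user.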

In `@apps/gateway/src/routes/account.ts`:
- Around line 95-109: The backend now retains immutable audit_events when
deleting a tenant (see the deletion response constructed in account route:
return res.json(...) in the handler that performs DELETE FROM tenants), so
update the dashboard confirmation copy to match: in the SettingsPage component
(SettingsPage.tsx) locate the tenant erasure confirmation/confirmation modal
text (the string that currently says audit logs will be permanently deleted) and
change it to indicate audit logs are retained for compliance (e.g., “Audit logs
will be retained and cannot be deleted for compliance”); update any related help
text or docs strings in the same component to reflect retained-audit behavior so
the UI matches the account.ts response.
- Around line 136-146: This route is returning tenant-wide sensitive data;
enforce RBAC by checking the authenticated user's role (req.user and
req.user.role) before running the tenant-wide queries: if req.user.role !==
'admin' return 403, otherwise continue; alternatively, for non-admins scope each
query to the requesting user by replacing tenant-wide queries for tenant_users,
devices, alert_rules, audit_events, and api_keys with user-scoped queries
(filter by req.user.id or actor_id where appropriate) so only the requester’s
records are exported; update the code around the existing tenantId and
Promise.all block that runs pool.query for tenant_users, devices, alert_rules,
audit_events, and api_keys to enforce the chosen behavior.
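The RBAC check suggested above could be factored as a small predicate; the role names and user shape here are assumptions for illustration:

```typescript
// Gate the tenant-wide export on the authenticated user's role; non-admins
// would receive a 403 (or user-scoped queries, per the alternative).
type ExportUser = { id: string; role: "admin" | "member" } | undefined;

function canExportTenantData(user: ExportUser): boolean {
  return user?.role === "admin";
}
```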

In `@apps/gateway/src/routes/sso.ts`:
- Around line 46-58: The catch in the route swallowing every error from
listOrgConnections hides operator configuration problems (e.g. errors from
getManagementToken when AUTH0_* env vars are missing); update error handling so
configuration errors surface: have getManagementToken (or listOrgConnections)
throw a distinct error type or include a recognisable property/message (e.g.
ConfigError or error.code === 'CONFIG_MISSING') and in the sso route catch
distinguish those and rethrow or return a 5xx response indicating
misconfiguration, while still converting transient Auth0 API failures into the
existing configured:true with warning response; reference listOrgConnections and
getManagementToken when adding the new error type check and branching logic.
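The distinct-error-type branching described above might be sketched like this; ConfigError and the code value are illustrative names:

```typescript
// A recognizable error type thrown by getManagementToken-style helpers when
// AUTH0_* configuration is missing, so the route can surface misconfiguration
// instead of masking it as a transient Auth0 failure.
class ConfigError extends Error {
  readonly code = "CONFIG_MISSING";
}

function classifySsoError(err: unknown): "misconfiguration" | "transient" {
  return err instanceof ConfigError ? "misconfiguration" : "transient";
}
```

The route's catch would rethrow (or return 5xx) on "misconfiguration" and keep the existing configured:true-with-warning response for "transient".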

In `@infra/docker/docker-compose.yml`:
- Around line 560-561: The docker-compose services currently include a global
env_file reference (env_file: - ../../.env) for the gateway, grafana, and
jobs-worker which exposes the entire root .env to containers; replace that broad
injection by scoping environment variables per-service: remove the ../../.env
env_file entry from the gateway, grafana, and jobs-worker service blocks and
either (a) point each service to a minimal service-specific env file (e.g.,
env_file: - ./gateway.env) that contains only the vars it needs, or (b)
enumerate only the required variables in the service's environment section using
explicit ${VAR} mappings; update the gateway, grafana, and jobs-worker service
definitions accordingly to enforce least-privilege.
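A sketch of the least-privilege scoping the comment describes, assuming a hypothetical gateway.env file and an explicit Grafana mapping:

```yaml
services:
  gateway:
    env_file:
      - ./gateway.env   # only the gateway's variables, not the whole root .env
  grafana:
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}  # explicit per-var mapping
```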

In `@infra/terraform/environments/staging/.gitignore`:
- Line 1: Add common Terraform ignore entries in addition to .terraform/ to
prevent accidental commits of state and local artifacts: include patterns for
lock and state files (e.g., .terraform.lock.hcl, *.tfstate, *.tfstate.backup),
crash and override artifacts (crash.log, override.tf, override.tf.json), CLI and
variable files (.terraform.tfvars, *.tfvars, *.tfvars.json, *.tfvars.local) and
workspace/local backend files; update the .gitignore in this Terraform
environment to include those entries so local state, credentials, and
machine-generated files are not committed.
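The suggested additions would look roughly like the following; note that .terraform.lock.hcl is often intentionally committed (this PR adds one), so whether to ignore it is a judgment call:

```gitignore
# Alongside the existing .terraform/ entry
*.tfstate
*.tfstate.backup
crash.log
override.tf
override.tf.json
*.tfvars
*.tfvars.json
```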

In `@infra/terraform/environments/staging/main.tf`:
- Line 5: Extract the hardcoded CIDR by adding a locals block (locals { vpc_cidr
= "10.10.0.0/16" }) and replace all occurrences of the literal "10.10.0.0/16"
with local.vpc_cidr; update the vpc_cidr argument passed into the modules that
reference it (the module blocks for vpc, rds, elasticache, and msk) to use
local.vpc_cidr so future changes only require editing the locals entry.
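A sketch of the suggested locals extraction; the module source path is an assumption:

```hcl
locals {
  vpc_cidr = "10.10.0.0/16"
}

module "vpc" {
  source   = "../../modules/vpc"   # assumed module path
  vpc_cidr = local.vpc_cidr        # repeat for rds, elasticache, msk
}
```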

In `@infra/terraform/environments/staging/outputs.tf`:
- Around line 1-17: Mark the exposed connection endpoints as sensitive to avoid
accidental leakage in logs: add sensitive = true to the output blocks for
"rds_endpoint", "redis_endpoint", "kafka_brokers_tls", and "kafka_brokers_sasl"
(the output blocks referencing module.rds.endpoint,
module.elasticache.primary_endpoint, module.msk.bootstrap_brokers_tls, and
module.msk.bootstrap_brokers_sasl respectively); update those output
declarations accordingly while leaving other outputs unchanged.
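For example, two of the outputs with the suggested flag applied:

```hcl
output "rds_endpoint" {
  value     = module.rds.endpoint
  sensitive = true
}

output "redis_endpoint" {
  value     = module.elasticache.primary_endpoint
  sensitive = true
}
```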

In `@infra/terraform/environments/staging/providers.tf`:
- Around line 23-25: The AWS provider block (provider "aws") should include a
default_tags block to apply consistent tags to all resources; add a default_tags
{ tags = { ... } } section inside provider "aws" and populate keys like
"Environment", "Owner", "Project" using variables (e.g. var.environment,
var.owner, var.project) or hardcoded values as appropriate, and create any
missing vars if they don't exist so tags reference valid variables; keep region
= var.aws_region and ensure tag keys follow your org's naming convention.
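A possible default_tags block under the stated assumptions (var.owner is hypothetical and may need to be created):

```hcl
provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = "staging"
      Project     = var.project
      Owner       = var.owner   # hypothetical variable per the comment
    }
  }
}
```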

In `@infra/terraform/environments/staging/variables.tf`:
- Around line 1-14: Add descriptive text for each Terraform input variable to
improve documentation and terraform plan clarity: add a description to variable
"project" explaining it holds the GCP/AWS project or service name, to variable
"aws_region" describing the AWS region (e.g., default region for resources), and
to variable "db_password" clarifying this is the sensitive database password
with no default and must be provided; update the variable blocks for project,
aws_region, and db_password to include these description strings.
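For instance, the aws_region and db_password blocks with descriptions added (wording illustrative):

```hcl
variable "aws_region" {
  description = "AWS region where staging resources are provisioned."
  type        = string
}

variable "db_password" {
  description = "RDS master password; sensitive, must be supplied (no default)."
  type        = string
  sensitive   = true
}
```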

In `@k8s/argocd/apps/grainguard-staging.yaml`:
- Around line 13-14: The ArgoCD application currently sets targetRevision: HEAD
which will promote every push on the default branch into the staging cluster;
update the ArgoCD app config by replacing targetRevision: HEAD with a controlled
ref such as a dedicated branch name (e.g., targetRevision: staging) or a
tag/semver release ref to ensure only intended commits are deployed, and
document the chosen strategy so repoURL and targetRevision remain consistent
with your deployment workflow.
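A minimal sketch of the pinned revision, assuming a dedicated staging branch:

```yaml
spec:
  source:
    targetRevision: staging   # dedicated branch (or a release tag) instead of HEAD
```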

In `@k8s/argocd/project.yaml`:
- Around line 36-40: The staging-deployer role (name: staging-deployer) defines
policies but has no groups field, so add a groups array to bind it to the same
subject group(s) used by dev-deployer/prod-deployer (e.g., the CI/service
account or team group) or explicitly note that bindings are handled elsewhere;
update the staging-deployer role to include a groups: [...] entry (matching the
existing dev-deployer/prod-deployer group names) so the policies can be assumed
by the intended users/service accounts.
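A sketch of the suggested groups binding; the group name is hypothetical and should mirror whatever dev-deployer/prod-deployer use:

```yaml
roles:
  - name: staging-deployer
    groups:
      - grainguard-deployers   # hypothetical group, matching the other deployer roles
```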

In `@k8s/helm/grainguard/values-prod.yaml`:
- Around line 1-2: The Helm change added namespace: grainguard-prod but there is
no Terraform prod environment; create a new Terraform environment directory
mirroring infra/terraform/environments/dev and
infra/terraform/environments/staging (e.g., infra/terraform/environments/prod/)
and copy the same IRSA module configuration found in those main.tf files,
updating the module/input that sets environment to "prod" so the IRSA module and
any resources match the grainguard-prod namespace; ensure any variables or
backend configs are adjusted to match prod conventions and that the environment
value is consistently "prod" across the module and Terraform workspace
configuration.

In `@README.md`:
- Around line 190-201: Update the README Testing section to point to the new
load-test locations and filenames: replace references to tests/load/ with
scripts/load-tests/ and list the committed scripts graphql-stress.js,
ingest-stress.js, and mixed-stack-stress.js (e.g., k6 run
scripts/load-tests/graphql-stress.js, etc.); also update the Note bullet that
mentions tests/load/ to reference scripts/load-tests/ so the README and the new
PR-added scripts are in sync while leaving the existing replay path
(./scripts/replay/replay_test.sh) unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7c54fd6d-56b2-41c6-8646-2d95d47c2c19

📥 Commits

Reviewing files that changed from the base of the PR and between 2be6ae0 and 6126b70.

📒 Files selected for processing (23)
  • README.md
  • apps/dashboard/src/features/settings/SettingsPage.tsx
  • apps/gateway/src/routes/__tests__/account.test.ts
  • apps/gateway/src/routes/__tests__/sso.test.ts
  • apps/gateway/src/routes/account.ts
  • apps/gateway/src/routes/sso.ts
  • apps/gateway/src/services/device.ts
  • infra/docker/docker-compose.yml
  • infra/terraform/environments/dev/main.tf
  • infra/terraform/environments/staging/.gitignore
  • infra/terraform/environments/staging/.terraform.lock.hcl
  • infra/terraform/environments/staging/main.tf
  • infra/terraform/environments/staging/outputs.tf
  • infra/terraform/environments/staging/providers.tf
  • infra/terraform/environments/staging/variables.tf
  • k8s/argocd/apps/grainguard-staging.yaml
  • k8s/argocd/project.yaml
  • k8s/helm/grainguard/values-dev.yaml
  • k8s/helm/grainguard/values-prod.yaml
  • k8s/helm/grainguard/values-staging.yaml
  • scripts/load-tests/graphql-stress.js
  • scripts/load-tests/ingest-stress.js
  • scripts/load-tests/mixed-stack-stress.js

Comment on lines +106 to +123
      if (!res.ok) {
        const body = await res.json().catch(() => ({}));
        throw new Error(
          typeof body?.error === "string" ? body.error : `HTTP ${res.status}`
        );
      }
      const blob = await res.blob();
      const url = URL.createObjectURL(blob);
      const a = document.createElement("a");
      a.href = url;
      const disposition = res.headers.get("content-disposition") ?? "";
      const filenameMatch = disposition.match(/filename=\"?([^"]+)\"?/i);
      a.download = filenameMatch?.[1] ?? `grainguard-export-${new Date().toISOString().slice(0, 10)}.json`;
      a.click();
      URL.revokeObjectURL(url);
      toast.success("Data exported");
    } catch (e) {
      toast.error(e instanceof Error ? e.message : "Export failed");

⚠️ Potential issue | 🟡 Minor

Translate backend error codes before showing the toast.

The gateway currently returns machine codes like internal_error on export failures, so this branch will show "internal_error" directly to end users. Map known codes to friendly copy and keep raw values out of the toast.

💡 Proposed fix
       if (!res.ok) {
         const body = await res.json().catch(() => ({}));
-        throw new Error(
-          typeof body?.error === "string" ? body.error : `HTTP ${res.status}`
-        );
+        const code = typeof body?.error === "string" ? body.error : null;
+        throw new Error(
+          code === "internal_error"
+            ? "Export failed. Please try again."
+            : code ?? `HTTP ${res.status}`
+        );
       }
🧰 Tools
🪛 GitHub Actions: CI

[error] 117-117: ESLint (no-useless-escape): Unnecessary escape character: "

🪛 GitHub Check: TS Checks — dashboard

[failure] 117-117:
Unnecessary escape character: "



🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/features/settings/SettingsPage.tsx` around lines 106 -
123, The catch currently shows raw backend error codes (e.g., "internal_error")
to users; add a small translation layer and use it before calling toast.error.
Implement a helper like translateExportError(errorCode: string): string (add
near SettingsPage or above the export handler) that maps known codes (e.g.,
"internal_error" -> "An internal server error occurred. Please try again
later.", "not_authenticated" -> "Please sign in to export data.", etc.) and
returns a friendly default for unknown values; then replace the toast.error call
in the catch block to use toast.error(translateExportError(e instanceof Error ?
e.message : "")) so users see human-friendly messages instead of raw machine
codes.
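
The helper described above can be sketched as follows (the `translateExportError` name and the specific code-to-copy mapping are illustrative, not part of the committed code):

```typescript
// Hypothetical mapping of gateway machine codes to user-facing copy.
const EXPORT_ERROR_COPY: Record<string, string> = {
  internal_error: "An internal server error occurred. Please try again later.",
  not_authenticated: "Please sign in to export your data.",
};

function translateExportError(code: string): string {
  // Fall back to generic copy so raw machine codes never reach the toast.
  return EXPORT_ERROR_COPY[code] ?? "Export failed. Please try again.";
}
```

The catch block would then call `toast.error(translateExportError(e instanceof Error ? e.message : ""))`.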

Comment thread apps/dashboard/src/features/settings/SettingsPage.tsx
Comment on lines +95 to +109
// Most tenant-linked tables cascade from tenants, so deleting the
// tenant removes them automatically. Immutable audit_events are
// intentionally retained for compliance and cannot be deleted.
await client.query("DELETE FROM tenants WHERE id = $1", [tenantId]);

await client.query("COMMIT");
return res.json({ deleted: true, scope: "tenant", message: "Tenant and all data deleted" });
const immutableAuditEvents = auditEventRows[0]?.count ?? 0;
return res.json({
deleted: true,
scope: "tenant",
message:
immutableAuditEvents > 0
? "Tenant deleted. Immutable audit events were retained for compliance."
: "Tenant and all mutable data deleted",
});

⚠️ Potential issue | 🟠 Major

Update the delete UX/docs to match the retained-audit behavior.

This branch now intentionally keeps audit_events, but the dashboard confirmation in apps/dashboard/src/features/settings/SettingsPage.tsx still says audit logs will be permanently deleted. Please update that copy in the same PR so the erasure flow does not promise behavior the backend no longer performs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/account.ts` around lines 95 - 109, The backend now
retains immutable audit_events when deleting a tenant (see the deletion response
constructed in account route: return res.json(...) in the handler that performs
DELETE FROM tenants), so update the dashboard confirmation copy to match: in the
SettingsPage component (SettingsPage.tsx) locate the tenant erasure
confirmation/confirmation modal text (the string that currently says audit logs
will be permanently deleted) and change it to indicate audit logs are retained
for compliance (e.g., “Audit logs will be retained and cannot be deleted for
compliance”); update any related help text or docs strings in the same component
to reflect retained-audit behavior so the UI matches the account.ts response.

Comment on lines +136 to +146
try {
const tenantId = req.user!.tenantId;

const [tenantResult, usersResult, devicesResult, alertsResult, auditResult, keysResult] =
await Promise.all([
pool.query("SELECT id, name, slug, plan, email, created_at FROM tenants WHERE id = $1", [tenantId]),
pool.query("SELECT id, email, role, created_at FROM tenant_users WHERE tenant_id = $1", [tenantId]),
pool.query("SELECT id, serial_number, created_at FROM devices WHERE tenant_id = $1", [tenantId]),
pool.query("SELECT id, name, metric, operator, threshold, enabled, created_at FROM alert_rules WHERE tenant_id = $1", [tenantId]),
pool.query("SELECT id, event_type, actor_id, resource_type, payload, created_at FROM audit_events WHERE tenant_id = $1 ORDER BY created_at DESC LIMIT 1000", [tenantId]),
pool.query("SELECT id, name, created_at, expires_at, revoked_at FROM api_keys WHERE tenant_id = $1", [tenantId]),

⚠️ Potential issue | 🔴 Critical

Restrict this export to admins or narrow it to per-user data.

This handler exports tenant_users, devices, alert_rules, audit_events, and api_keys for the whole tenant, so any authenticated member can download other users’ emails and tenant-wide operational data. Either add an admin check here or scope the export to the requesting user’s own records.

As per coding guidelines, "Verify tenantId, RBAC, and CSRF protections are applied consistently on all protected routes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/account.ts` around lines 136 - 146, This route is
returning tenant-wide sensitive data; enforce RBAC by checking the authenticated
user's role (req.user and req.user.role) before running the tenant-wide queries:
if req.user.role !== 'admin' return 403, otherwise continue; alternatively, for
non-admins scope each query to the requesting user by replacing tenant-wide
queries for tenant_users, devices, alert_rules, audit_events, and api_keys with
user-scoped queries (filter by req.user.id or actor_id where appropriate) so
only the requester’s records are exported; update the code around the existing
tenantId and Promise.all block that runs pool.query for tenant_users, devices,
alert_rules, audit_events, and api_keys to enforce the chosen behavior.
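
One way to express the suggested check is a pure guard that the Express handler calls before running the tenant-wide queries (the role names and the inline usage comment are assumptions, not the gateway's actual types):

```typescript
type Role = "admin" | "member";

// Decide whether a requester may export tenant-wide data.
// Non-admins should instead receive only their own records.
function canExportTenantData(role: Role): boolean {
  return role === "admin";
}

// Hypothetical usage inside the route handler:
// if (!canExportTenantData(req.user.role)) return res.status(403).json({ error: "forbidden" });
```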

Comment on lines +46 to +58
try {
const connections = await listOrgConnections(orgId);

return res.json({ configured: true, orgId, connections });
} catch (error) {
console.error("[sso] failed to list org connections:", error);
return res.json({
configured: true,
orgId,
connections: [],
warning: "Auth0 management API is unavailable for the gateway right now.",
});
}

⚠️ Potential issue | 🟠 Major

Catching all errors masks configuration issues.

The try/catch catches every error from listOrgConnections, including configuration errors thrown by getManagementToken() (e.g., missing AUTH0_DOMAIN, AUTH0_MANAGEMENT_CLIENT_ID, AUTH0_MANAGEMENT_CLIENT_SECRET). These are operator misconfigurations that should surface clearly, not be hidden behind a generic "unavailable" warning.

Consider distinguishing transient API failures from configuration errors:

Proposed fix to differentiate error types
     try {
       const connections = await listOrgConnections(orgId);
 
       return res.json({ configured: true, orgId, connections });
     } catch (error) {
+      const message = error instanceof Error ? error.message : "";
+      // Configuration errors should fail loudly, not be masked
+      if (message.includes("SSO not configured") || message.includes("must be set")) {
+        console.error("[sso] configuration error:", error);
+        return res.status(500).json({ error: "sso_misconfigured" });
+      }
       console.error("[sso] failed to list org connections:", error);
       return res.json({
         configured: true,
         orgId,
         connections: [],
         warning: "Auth0 management API is unavailable for the gateway right now.",
       });
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/sso.ts` around lines 46 - 58, The catch in the route
swallows every error from listOrgConnections, hiding operator configuration
problems (e.g. errors from getManagementToken when AUTH0_* env vars are
missing); update error handling so configuration errors surface: have
getManagementToken (or listOrgConnections) throw a distinct error type or
include a recognisable property/message (e.g. ConfigError or error.code ===
'CONFIG_MISSING') and in the sso route catch distinguish those and rethrow or
return a 5xx response indicating misconfiguration, while still converting
transient Auth0 API failures into the existing configured:true with warning
response; reference listOrgConnections and getManagementToken when adding the
new error type check and branching logic.
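
A distinct error class keeps that branching explicit; a minimal sketch (the `ConfigError` name and `classifySsoError` helper are illustrative, not existing code):

```typescript
// Thrown by getManagementToken() when required AUTH0_* env vars are absent.
class ConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigError";
  }
}

// In the route's catch block: misconfiguration fails loudly (5xx),
// while transient Auth0 outages keep the soft "unavailable" response.
function classifySsoError(error: unknown): "misconfigured" | "unavailable" {
  return error instanceof ConfigError ? "misconfigured" : "unavailable";
}
```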

Comment on lines +1 to +14
variable "project" {
type = string
default = "grainguard"
}

variable "aws_region" {
type = string
default = "us-east-1"
}

variable "db_password" {
type = string
sensitive = true
}

🧹 Nitpick | 🔵 Trivial

LGTM with minor suggestion.

Variables are correctly defined. The db_password is appropriately marked sensitive = true with no default, requiring explicit input.

Consider adding description fields for better documentation and terraform plan output clarity:

📝 Add descriptions
 variable "project" {
   type    = string
   default = "grainguard"
+  description = "Project name used for resource naming"
 }

 variable "aws_region" {
   type    = string
   default = "us-east-1"
+  description = "AWS region for staging infrastructure"
 }

 variable "db_password" {
   type      = string
   sensitive = true
+  description = "RDS database master password"
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/terraform/environments/staging/variables.tf` around lines 1 - 14, Add
descriptive text for each Terraform input variable to improve documentation and
terraform plan clarity: add a description to variable "project" explaining it
holds the project name used for resource naming, to variable "aws_region" describing
the AWS region (e.g., default region for resources), and to variable
"db_password" clarifying this is the sensitive database password with no default
and must be provided; update the variable blocks for project, aws_region, and
db_password to include these description strings.

Comment on lines +13 to +14
repoURL: https://github.com/pahuldeepp/GrainGuard-.git
targetRevision: HEAD

⚠️ Potential issue | 🟡 Minor

targetRevision: HEAD may cause unintended deployments to staging.

Using HEAD means every push to the default branch triggers a staging deployment. For a pre-production environment meant to validate rollouts, consider tracking a dedicated branch (e.g., staging) or using tags/release candidates to control what gets deployed.

♻️ Suggested alternatives
-    targetRevision: HEAD
+    targetRevision: staging  # or use release tags like v1.2.0-rc1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/argocd/apps/grainguard-staging.yaml` around lines 13 - 14, The ArgoCD
application currently sets targetRevision: HEAD which will promote every push on
the default branch into the staging cluster; update the ArgoCD app config by
replacing targetRevision: HEAD with a controlled ref such as a dedicated branch
name (e.g., targetRevision: staging) or a tag/semver release ref to ensure only
intended commits are deployed, and document the chosen strategy so repoURL and
targetRevision remain consistent with your deployment workflow.

Comment thread k8s/argocd/project.yaml
Comment on lines +36 to +40
- name: staging-deployer
description: Can sync staging for pre-production validation
policies:
- p, proj:grainguard:staging-deployer, applications, sync, grainguard/grainguard-staging, allow
- p, proj:grainguard:staging-deployer, applications, get, grainguard/grainguard-staging, allow

🧹 Nitpick | 🔵 Trivial

Consider adding groups to bind the role to subjects.

The staging-deployer role is defined with correct policies, but like the existing dev-deployer and prod-deployer roles, it lacks a groups field. Without group bindings, no users or service accounts can assume this role.

If this is intentional (bindings managed elsewhere via Argo CD RBAC ConfigMap), this is fine. Otherwise, add:

📝 Example group binding
     - name: staging-deployer
       description: Can sync staging for pre-production validation
       policies:
         - p, proj:grainguard:staging-deployer, applications, sync, grainguard/grainguard-staging, allow
         - p, proj:grainguard:staging-deployer, applications, get,  grainguard/grainguard-staging, allow
+      groups:
+        - staging-team
+        # or for CI: system:serviceaccount:argocd:<ci-service-account>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/argocd/project.yaml` around lines 36 - 40, The staging-deployer role
(name: staging-deployer) defines policies but has no groups field, so add a
groups array to bind it to the same subject group(s) used by
dev-deployer/prod-deployer (e.g., the CI/service account or team group) or
explicitly note that bindings are handled elsewhere; update the staging-deployer
role to include a groups: [...] entry (matching the existing
dev-deployer/prod-deployer group names) so the policies can be assumed by the
intended users/service accounts.

Comment on lines +1 to +2
namespace: grainguard-prod


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check prod Terraform IRSA configuration
echo "=== Prod IRSA namespace configuration ==="
rg -n 'k8s_namespace' infra/terraform/environments/prod/ 2>/dev/null || echo "No prod environment found or no k8s_namespace set"

echo -e "\n=== Full prod main.tf iam_irsa module block ==="
ast-grep --pattern $'module "iam_irsa" {
  $$$
}' | head -50

Repository: pahuldeepp/GrainGuard-

Length of output: 206


🏁 Script executed:

#!/bin/bash
# Check Terraform directory structure
echo "=== Terraform environments structure ==="
find infra/terraform/environments -type d -maxdepth 1 2>/dev/null || echo "No environments dir found"

echo -e "\n=== All Terraform files in infra/terraform ==="
find infra/terraform -name "*.tf" -type f | head -20

echo -e "\n=== Check for any IRSA module definitions ==="
rg -l 'module.*iam_irsa' infra/terraform/ 2>/dev/null || echo "No IRSA modules found"

echo -e "\n=== Check if prod namespace is set anywhere ==="
rg -r 'grainguard-prod' infra/terraform/ 2>/dev/null || echo "No grainguard-prod references found in Terraform"

Repository: pahuldeepp/GrainGuard-

Length of output: 1506


🏁 Script executed:

#!/bin/bash
# Examine dev environment IRSA configuration to see the pattern
echo "=== Dev IRSA module configuration ==="
rg -A 5 'module.*"iam_irsa"' infra/terraform/environments/dev/main.tf 2>/dev/null

echo -e "\n=== Staging IRSA module configuration ==="
rg -A 5 'module.*"iam_irsa"' infra/terraform/environments/staging/main.tf 2>/dev/null

echo -e "\n=== Check dev/staging helm values for namespace ==="
find k8s/helm -name "values*.yaml" -type f | xargs grep -l namespace 2>/dev/null || echo "No namespaces in values files"

Repository: pahuldeepp/GrainGuard-

Length of output: 866


🏁 Script executed:

#!/bin/bash
# Check the iam_irsa module to understand namespace handling
echo "=== iam_irsa module variables/main ==="
cat infra/terraform/modules/iam_irsa/variables.tf 2>/dev/null | head -30

echo -e "\n=== iam_irsa module main logic (relevant parts) ==="
rg -A 2 'namespace|environment' infra/terraform/modules/iam_irsa/main.tf 2>/dev/null | head -20

echo -e "\n=== Check dev values namespace ==="
grep namespace k8s/helm/grainguard/values-dev.yaml 2>/dev/null

echo -e "\n=== Check staging values namespace ==="
grep namespace k8s/helm/grainguard/values-staging.yaml 2>/dev/null

echo -e "\n=== Check prod values namespace ==="
grep namespace k8s/helm/grainguard/values-prod.yaml 2>/dev/null

Repository: pahuldeepp/GrainGuard-

Length of output: 1267


Prod infrastructure setup is incomplete - Terraform environment missing.

Adding namespace: grainguard-prod to the Helm values is correct for consistency, but the prod Terraform environment doesn't exist yet. Dev and staging both have environment configurations (infra/terraform/environments/{dev,staging}/main.tf), but there is no prod environment. Creating prod namespace in Helm values without corresponding prod Terraform infrastructure means pods cannot be deployed to production. Create infra/terraform/environments/prod/ with the same IRSA module configuration before deploying to production, passing environment = "prod" to match the grainguard-prod namespace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@k8s/helm/grainguard/values-prod.yaml` around lines 1 - 2, The Helm change
added namespace: grainguard-prod but there is no Terraform prod environment;
create a new Terraform environment directory mirroring
infra/terraform/environments/dev and infra/terraform/environments/staging (e.g.,
infra/terraform/environments/prod/) and copy the same IRSA module configuration
found in those main.tf files, updating the module/input that sets environment to
"prod" so the IRSA module and any resources match the grainguard-prod namespace;
ensure any variables or backend configs are adjusted to match prod conventions
and that the environment value is consistently "prod" across the module and
Terraform workspace configuration.
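
A minimal sketch of what `infra/terraform/environments/prod/main.tf` might contain, assuming the IRSA module interface matches the dev/staging environments (the module path and variable names are inferred from the review above, not verified against the repo):

```hcl
module "iam_irsa" {
  source      = "../../modules/iam_irsa"
  project     = var.project
  environment = "prod" # must match the grainguard-prod namespace in values-prod.yaml
}
```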

Comment thread README.md
Comment on lines 190 to +201
# k6 load tests (requires running stack)
k6 run tests/load/spike.js
k6 run tests/load/soak.js
k6 run tests/load/stress.js

# Chaos tests (requires kubectl + live cluster)
bash tests/chaos/run-all.sh

# Replay + idempotency test
./scripts/replay/replay_test.sh
```

Note:
- The core load-test scripts above are committed in `tests/load/`.
- Cluster-level chaos automation is not currently committed on `master`; add or restore it before relying on README-driven chaos drills.

⚠️ Potential issue | 🟡 Minor

Load-test paths in Testing section are out of sync with the new committed scripts.

Line 191-Line 193 and Line 200 reference tests/load/, but this PR adds load tests under scripts/load-tests/ (graphql-stress.js, ingest-stress.js, mixed-stack-stress.js). Please align the README so users run the intended scripts.

📌 Proposed README fix
 # k6 load tests (requires running stack)
-k6 run tests/load/spike.js
-k6 run tests/load/soak.js
-k6 run tests/load/stress.js
+k6 run scripts/load-tests/graphql-stress.js
+k6 run scripts/load-tests/ingest-stress.js
+k6 run scripts/load-tests/mixed-stack-stress.js

 # Replay + idempotency test
 ./scripts/replay/replay_test.sh
@@
 Note:
-- The core load-test scripts above are committed in `tests/load/`.
+- The core load-test scripts above are committed in `scripts/load-tests/`.
 - Cluster-level chaos automation is not currently committed on `master`; add or restore it before relying on README-driven chaos drills.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 190 - 201, Update the README Testing section to point
to the new load-test locations and filenames: replace references to tests/load/
with scripts/load-tests/ and list the committed scripts graphql-stress.js,
ingest-stress.js, and mixed-stack-stress.js (e.g., k6 run
scripts/load-tests/graphql-stress.js, etc.); also update the Note bullet that
mentions tests/load/ to reference scripts/load-tests/ so the README and the new
PR-added scripts are in sync while leaving the existing replay path
(./scripts/replay/replay_test.sh) unchanged.
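
Since the new k6 scripts write JSON summaries, CI could gate on them with a small checker like the one below. This assumes the `handleSummary` data shape (`metrics.<name>.values["p(95)"]`); adjust the lookup if the scripts use `--summary-export`, whose JSON flattens the percentile keys:

```javascript
// Fail the build when the recorded p95 request latency exceeds a budget (ms).
function p95WithinBudget(summary, budgetMs) {
  const values = summary.metrics?.http_req_duration?.values ?? {};
  const p95 = values["p(95)"];
  // Treat a missing metric as a failure so broken runs don't pass silently.
  return typeof p95 === "number" && p95 <= budgetMs;
}
```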

pahuldeepp added a commit that referenced this pull request Mar 29, 2026
HIGH PRIORITY:
- #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual
- #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods)
- #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying

MEDIUM PRIORITY:
- #11: Add webhook idempotency check (dedup by endpoint_id + event_type)
- #12: Assert Stripe webhook body is Buffer before signature verification
- #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query)
- #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results
- #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow

LOW PRIORITY:
- #20: Add 10s query timeout to all saga repository DB operations
- #23: Add IsValidStatus validator for saga status constants
- #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF
- #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers
- #7: Fix remaining saga JSON marshal error check (initialErr)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
apps/dashboard/src/features/settings/SettingsPage.tsx (1)

100-125: 🧹 Nitpick | 🔵 Trivial

Consider adding a loading state to prevent duplicate export requests.

Unlike handleDeleteAccount which tracks deleting state, handleExport has no loading indicator. Users can click "Export My Data" multiple times while the request is in-flight, triggering redundant downloads.

♻️ Proposed improvement
 const [deleting,    setDeleting]    = useState(false);
+const [exporting,   setExporting]   = useState(false);

 async function handleExport() {
+  if (exporting) return;
+  setExporting(true);
   try {
     // ... existing logic ...
   } catch (e) {
     toast.error(e instanceof Error ? e.message : "Export failed");
+  } finally {
+    setExporting(false);
   }
 }

Then disable the button:

 <button
   onClick={handleExport}
+  disabled={exporting}
   className="px-4 py-2 text-sm bg-gray-100 ..."
 >
-  Export My Data
+  {exporting ? "Exporting..." : "Export My Data"}
 </button>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/features/settings/SettingsPage.tsx` around lines 100 -
125, handleExport has no loading guard so users can trigger duplicate export
requests; add a boolean state (e.g., exporting) and set it true at start of
handleExport and false in finally, bail out early if exporting is already true,
and use that state to disable the "Export My Data" button and show a
spinner/disabled text; update references to getAccessTokenSilently, fetch, and
toast inside handleExport accordingly and ensure URL.revokeObjectURL runs even
on errors by placing it in the finally block.
♻️ Duplicate comments (1)
apps/dashboard/src/features/settings/SettingsPage.tsx (1)

106-111: ⚠️ Potential issue | 🟡 Minor

Raw backend error codes are still exposed to end users.

The error handling extracts machine codes (internal_error, missing_token, tenant_missing, invalid_token) and displays them directly via toast.error(e.message). Users will see cryptic messages instead of friendly copy.

Map known error codes to user-friendly messages before displaying.

💡 Proposed fix
+const ERROR_MESSAGES: Record<string, string> = {
+  internal_error: "Export failed. Please try again later.",
+  missing_token: "Please sign in to export your data.",
+  invalid_token: "Your session has expired. Please sign in again.",
+  tenant_missing: "Account not found. Please contact support.",
+};
+
+function friendlyErrorMessage(code: string): string {
+  return ERROR_MESSAGES[code] ?? code;
+}
+
 async function handleExport() {
   try {
     const token = await getAccessTokenSilently();
     const res = await fetch(`${GW}/account/export`, {
       headers: { Authorization: `Bearer ${token}` },
     });
     if (!res.ok) {
       const body = await res.json().catch(() => ({}));
       throw new Error(
         typeof body?.error === "string" ? body.error : `HTTP ${res.status}`
       );
     }
     // ... blob handling ...
   } catch (e) {
-    toast.error(e instanceof Error ? e.message : "Export failed");
+    toast.error(
+      e instanceof Error ? friendlyErrorMessage(e.message) : "Export failed"
+    );
   }
 }

Also applies to: 122-123

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/dashboard/src/features/settings/SettingsPage.tsx` around lines 106 -
111, The current response error handling throws raw backend error codes
(body?.error) which then get shown to users; update the error handling around
the fetch response check (the block that reads const body = await
res.json().catch(() => ({})); and throws new Error(...)) to map known machine
error codes (e.g., "internal_error", "missing_token", "tenant_missing",
"invalid_token") to user-friendly messages via a simple map/object, and throw or
return a sanitized message instead; ensure the same mapping is applied wherever
you consume e.message for toast.error (the other occurrence mentioned), using a
default fallback like "An unexpected error occurred" if the code is unknown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 69d173e2-4575-4768-bff7-29843ce7e93b

📥 Commits

Reviewing files that changed from the base of the PR and between 6126b70 and 3c14d5c.

📒 Files selected for processing (1)
  • apps/dashboard/src/features/settings/SettingsPage.tsx


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/services/device.ts`:
- Around line 11-18: The protoCandidates array contains an unreachable fallback
"/libs/proto/device.proto" for production because the Docker runtime copies libs
to /app/libs; remove that entry from protoCandidates (or replace it with a
documented environment-specific path) and update any comment near
protoCandidates to state that the production image uses /app/libs so only the
valid paths are listed; reference the protoCandidates constant in
apps/gateway/src/services/device.ts when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3fc277f2-5fb5-449e-a899-f3bcbc49e01b

📥 Commits

Reviewing files that changed from the base of the PR and between 3c14d5c and f8bac84.

📒 Files selected for processing (1)
  • apps/gateway/src/services/device.ts

Comment on lines +11 to +18
const protoCandidates = [
path.resolve(process.cwd(), "libs/proto/device.proto"),
path.resolve(__dirname, "../../libs/proto/device.proto"),
path.resolve(__dirname, "../../../libs/proto/device.proto"),
path.resolve(__dirname, "../../../../libs/proto/device.proto"),
"/app/libs/proto/device.proto",
"/libs/proto/device.proto",
];


⚠️ Potential issue | 🟡 Minor

Unreachable fallback path /libs/proto/device.proto.

Per the Dockerfile context, the libs directory is copied to /app/libs in the runtime image, not /libs. Line 17's path will never match in the production container. Consider removing it to avoid confusion, or add a comment explaining which environment it targets.

 const protoCandidates = [
   path.resolve(process.cwd(), "libs/proto/device.proto"),
   path.resolve(__dirname, "../../libs/proto/device.proto"),
   path.resolve(__dirname, "../../../libs/proto/device.proto"),
   path.resolve(__dirname, "../../../../libs/proto/device.proto"),
   "/app/libs/proto/device.proto",
-  "/libs/proto/device.proto",
 ];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/services/device.ts` around lines 11 - 18, The
protoCandidates array contains an unreachable fallback
"/libs/proto/device.proto" for production because the Docker runtime copies libs
to /app/libs; remove that entry from protoCandidates (or replace it with a
documented environment-specific path) and update any comment near
protoCandidates to state that the production image uses /app/libs so only the
valid paths are listed; reference the protoCandidates constant in
apps/gateway/src/services/device.ts when making the change.
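The candidate-list approach that protoCandidates implements can be sketched as a first-existing-path lookup. The helper name and paths below are illustrative assumptions, not the actual gateway code:

```typescript
// Hypothetical sketch: resolve a proto file by trying each candidate
// location and returning the first that exists on disk.
import * as fs from "fs";
import * as path from "path";

function resolveFirstExisting(candidates: string[]): string | undefined {
  for (const candidate of candidates) {
    if (fs.existsSync(candidate)) {
      return candidate;
    }
  }
  // No candidate found; the caller should fail loudly here.
  return undefined;
}

// Illustrative usage (paths are assumptions, not the real list):
// const protoPath = resolveFirstExisting([
//   path.resolve(process.cwd(), "libs/proto/device.proto"),
//   "/app/libs/proto/device.proto",
// ]);
```

Because resolution stops at the first hit, a dead trailing entry like /libs/proto/device.proto is harmless at runtime but misleading to readers, which is why the review suggests removing it.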

Comment thread apps/gateway/src/routes/account.ts Fixed
Comment thread apps/gateway/src/routes/account.ts Fixed

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
apps/gateway/src/routes/account.ts (1)

141-149: ⚠️ Potential issue | 🔴 Critical

Gate the tenant-wide export behind admin RBAC.

This still exports tenant_users, devices, alert_rules, audit_events, and api_keys for the whole tenant without any role check. Any authenticated member can download other users’ emails and tenant-wide operational data. Require an admin role before these queries run, or scope the export to the requesting user’s own records.

As per coding guidelines, "Verify tenantId, RBAC, and CSRF protections are applied consistently on all protected routes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/account.ts` around lines 141 - 149, This endpoint
performs tenant-wide exports (tenantResult, usersResult, devicesResult,
alertsResult, auditResult, keysResult via pool.query using tenantId) without
RBAC; add an explicit admin role check (e.g., verify req.user.role === "admin"
or call hasAdminRole(req.user)) before running these queries and return 403 if
not authorized, or alternatively scope queries to the requesting user (filter
tenant_users, devices, api_keys, audit_events by actor_id or user_id) so only
the caller’s records are exported; implement the check at the top of the export
handler (before the Promise.all) and use the existing auth principal from the
request to decide.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/routes/account.ts`:
- Around line 161-166: The export response currently sets Content-Type and
Content-Disposition but does not prevent caching; update the handler around
where res.setHeader(...) and return res.json(exportData) are called to add
cache-busting headers (e.g. set "Cache-Control: no-store, no-cache,
must-revalidate", "Pragma: no-cache", and "Expires: 0") before returning; locate
the logic in the account export route (the block that calls res.setHeader(...)
and res.json(exportData)) and insert these headers so the exported tenant file
is not stored by browsers or intermediaries.
- Around line 49-50: The accountRouter currently applies apiRateLimiter globally
via accountRouter.use(), but the DELETE /account/me and GET /account/export
routes still re-apply apiRateLimiter; replace those redundant uses with
route-level stricter limiters: remove apiRateLimiter from the DELETE handler for
the account erasure route and attach strictRateLimiter (100 req/60s) to that
route, and remove apiRateLimiter from the GET export handler and attach
bulkRateLimiter (10 req/60s) to the export route; locate the routes by the
DELETE /account/me and GET /account/export handlers and the global
accountRouter.use() and swap the middleware references accordingly so global
apiRateLimiter remains in place for other endpoints.

---

Duplicate comments:
In `@apps/gateway/src/routes/account.ts`:
- Around line 141-149: This endpoint performs tenant-wide exports (tenantResult,
usersResult, devicesResult, alertsResult, auditResult, keysResult via pool.query
using tenantId) without RBAC; add an explicit admin role check (e.g., verify
req.user.role === "admin" or call hasAdminRole(req.user)) before running these
queries and return 403 if not authorized, or alternatively scope queries to the
requesting user (filter tenant_users, devices, api_keys, audit_events by
actor_id or user_id) so only the caller’s records are exported; implement the
check at the top of the export handler (before the Promise.all) and use the
existing auth principal from the request to decide.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1ab515ac-e3f6-4620-b541-be514fa79ba2

📥 Commits

Reviewing files that changed from the base of the PR and between bd8768f and 4b00e4b.

📒 Files selected for processing (1)
  • apps/gateway/src/routes/account.ts

Comment thread apps/gateway/src/routes/account.ts Outdated
Comment on lines 49 to 50
apiRateLimiter,
authMiddleware,


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the account.ts file to understand the router setup
wc -l apps/gateway/src/routes/account.ts

Repository: pahuldeepp/GrainGuard-

Length of output: 104


🏁 Script executed:

# Read the relevant sections of account.ts
head -20 apps/gateway/src/routes/account.ts

Repository: pahuldeepp/GrainGuard-

Length of output: 823


🏁 Script executed:

# Check lines around line 49-50
sed -n '40,60p' apps/gateway/src/routes/account.ts

Repository: pahuldeepp/GrainGuard-

Length of output: 662


🏁 Script executed:

# Check lines around line 135-136
sed -n '125,145p' apps/gateway/src/routes/account.ts

Repository: pahuldeepp/GrainGuard-

Length of output: 905


🏁 Script executed:

# Check the rateLimiting middleware to see what limiters are available
cat apps/gateway/src/middleware/rateLimiting.ts

Repository: pahuldeepp/GrainGuard-

Length of output: 2116


Replace redundant apiRateLimiter applications with dedicated route-level limiters.

Line 8 already applies apiRateLimiter to all account routes via accountRouter.use(). Reapplying it on the DELETE and GET /export endpoints at lines 49-50 and 135-136 is redundant and prevents these sensitive GDPR endpoints from using tighter rate limits.

Replace with:

  • strictRateLimiter (100 req/60s) for DELETE /account/me (account erasure)
  • bulkRateLimiter (10 req/60s) for GET /account/export (data portability)
Suggested change
-import { apiRateLimiter } from "../middleware/rateLimiting";
+import { apiRateLimiter, bulkRateLimiter, strictRateLimiter } from "../middleware/rateLimiting";

...
 accountRouter.delete(
   "/account/me",
-  apiRateLimiter,
+  strictRateLimiter,
   authMiddleware,

...
 accountRouter.get(
   "/account/export",
-  apiRateLimiter,
+  bulkRateLimiter,
   authMiddleware,
🧰 Tools
🪛 GitHub Check: CodeQL

[failure] 49-49: Missing rate limiting
This route handler performs a database access, but is not rate-limited.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/account.ts` around lines 49 - 50, The accountRouter
currently applies apiRateLimiter globally via accountRouter.use(), but the
DELETE /account/me and GET /account/export routes still re-apply apiRateLimiter;
replace those redundant uses with route-level stricter limiters: remove
apiRateLimiter from the DELETE handler for the account erasure route and attach
strictRateLimiter (100 req/60s) to that route, and remove apiRateLimiter from
the GET export handler and attach bulkRateLimiter (10 req/60s) to the export
route; locate the routes by the DELETE /account/me and GET /account/export
handlers and the global accountRouter.use() and swap the middleware references
accordingly so global apiRateLimiter remains in place for other endpoints.

Comment on lines +161 to +166
res.setHeader("Content-Type", "application/json");
res.setHeader(
"Content-Disposition",
`attachment; filename="grainguard-export-${tenantId}-${new Date().toISOString().slice(0, 10)}.json"`
);
return res.json(exportData);


⚠️ Potential issue | 🟠 Major

Mark the export response as non-cacheable.

This attachment contains sensitive tenant data, but the response never sets Cache-Control: no-store. Browsers and intermediaries can retain the file unless caching is explicitly disabled.

🛡️ Suggested header hardening
       res.setHeader("Content-Type", "application/json");
+      res.setHeader("Cache-Control", "private, no-store, max-age=0");
+      res.setHeader("Pragma", "no-cache");
       res.setHeader(
         "Content-Disposition",
         `attachment; filename="grainguard-export-${tenantId}-${new Date().toISOString().slice(0, 10)}.json"`
       );
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/routes/account.ts` around lines 161 - 166, The export
response currently sets Content-Type and Content-Disposition but does not
prevent caching; update the handler around where res.setHeader(...) and return
res.json(exportData) are called to add cache-busting headers (e.g. set
"Cache-Control: no-store, no-cache, must-revalidate", "Pragma: no-cache", and
"Expires: 0") before returning; locate the logic in the account export route
(the block that calls res.setHeader(...) and res.json(exportData)) and insert
these headers so the exported tenant file is not stored by browsers or
intermediaries.

accountRouter.get(
"/account/me",
authMiddleware,
apiRateLimiter,

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

In general, to fix missing rate limiting on an Express route that performs database access, we should attach a proper rate-limiting middleware (for example, using the well-known express-rate-limit package) to that route or to a router that encloses it. This middleware tracks requests per client (typically per IP or auth identifier) over a sliding window and rejects or slows excessive requests, preventing easy exploitation for denial-of-service.

For this specific file, the simplest, least-disruptive fix that preserves existing behavior is to define a concrete rate limiter using express-rate-limit in apps/gateway/src/routes/account.ts and use it instead of the currently imported apiRateLimiter on this route. Because we cannot modify files other than the one shown, we will:

  • add an import of express-rate-limit,
  • define a local accountMeRateLimiter configured to allow a reasonable burst (for example, 60 requests per 15 minutes per IP), and
  • replace apiRateLimiter in the /account/me route middleware chain with this new limiter.

This guarantees that the /account/me route is clearly and directly protected by a recognized rate-limiting middleware, satisfying the security requirement without changing the functional logic of the handler itself.

Concretely:

  • Edit the import section in apps/gateway/src/routes/account.ts to import rateLimit from express-rate-limit.
  • Directly below the accountRouter creation, define const accountMeRateLimiter = rateLimit({...}) with a sensible windowMs, max, and standardHeaders/legacyHeaders options.
  • Update the accountRouter.get("/account/me", ...) call to use accountMeRateLimiter instead of apiRateLimiter. All other code in the handler stays unchanged.
Suggested changeset 1
apps/gateway/src/routes/account.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/gateway/src/routes/account.ts b/apps/gateway/src/routes/account.ts
--- a/apps/gateway/src/routes/account.ts
+++ b/apps/gateway/src/routes/account.ts
@@ -2,15 +2,23 @@
 import { authMiddleware } from "../middleware/auth";
 import { apiRateLimiter } from "../middleware/rateLimiting";
 import { writePool as pool } from "../database/db";
+import rateLimit from "express-rate-limit";
 
 export const accountRouter = Router();
 
+const accountMeRateLimiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 100, // limit each IP to 100 requests per windowMs
+  standardHeaders: true,
+  legacyHeaders: false,
+});
+
 // ── GET /account/me ─────────────────────────────────────────────────────────
 // Returns the current user's profile and tenant info.
 accountRouter.get(
   "/account/me",
   authMiddleware,
-  apiRateLimiter,
+  accountMeRateLimiter,
   async (req: Request, res: Response) => {
     const tenantId = req.user!.tenantId;
     const userId = req.user!.sub;
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
accountRouter.delete(
"/account/me",
authMiddleware,
apiRateLimiter,

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

Copilot could not generate an autofix suggestion

Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.

alertRules: alertsResult.rows,
auditEvents: auditResult.rows,
apiKeys: keysResult.rows,
};

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

Copilot could not generate an autofix suggestion

Copilot could not generate an autofix suggestion for this alert. Try pushing a new commit or if the problem persists contact support.

pahuldeepp and others added 2 commits March 29, 2026 16:08
…haining syntax

- billing.ts: add apiRateLimiter to GET /billing/subscription, POST /billing/checkout,
  POST /billing/portal to resolve CodeQL js/missing-rate-limiting alerts
- performance-budget.js: replace body?.errors optional chaining with body && body.errors
  for compatibility with older k6/Babel versions in CI

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
billingRouter.get(
"/billing/subscription",
authMiddleware,
apiRateLimiter,

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

In general, the fix is to wrap this expensive route with a concrete, well-defined rate limiter so that bursts of requests cannot exhaust DB resources or Stripe’s API quota. The typical approach in Express is to use express-rate-limit to define a limiter (with a sensible window and max count) and then apply it as a middleware for this specific route.

Concretely for apps/gateway/src/routes/billing.ts, we will:

  1. Import rateLimit from the express-rate-limit package at the top of the file.
  2. Define a subscriptionRateLimiter constant configured with appropriate values (for example, a 15-minute window and a lowish per-IP cap, e.g., 30 or 50 calls).
  3. Apply this new subscriptionRateLimiter middleware to the /billing/subscription route, in addition to the existing authMiddleware and apiRateLimiter. This keeps all current behavior while adding an explicit and recognizable rate-limiting layer for this specific route.

We will only modify the shown regions: the import section and the billingRouter.get("/billing/subscription", ...) declaration.

Suggested changeset 1
apps/gateway/src/routes/billing.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/gateway/src/routes/billing.ts b/apps/gateway/src/routes/billing.ts
--- a/apps/gateway/src/routes/billing.ts
+++ b/apps/gateway/src/routes/billing.ts
@@ -1,6 +1,7 @@
 import { Router, Request, Response } from "express";
 import Stripe from "stripe";
 import amqp from "amqplib";
+import rateLimit from "express-rate-limit";
 import { writePool as pool } from "../lib/db";
 import { authMiddleware } from "../lib/auth";
 import { writeAuditLog } from "../lib/audit";
@@ -9,6 +10,13 @@
 const RABBITMQ_URL = process.env.RABBITMQ_URL || "amqp://grainguard:grainguard@rabbitmq:5672/grainguard";
 const STRIPE_BILLING_QUEUE = "grainguard.stripe.billing";
 
+const subscriptionRateLimiter = rateLimit({
+  windowMs: 15 * 60 * 1000, // 15 minutes
+  max: 50,                  // limit each IP to 50 requests per window
+  standardHeaders: true,
+  legacyHeaders: false,
+});
+
 async function publishToStripeQueue(payload: object): Promise<void> {
   let conn: Awaited<ReturnType<typeof amqp.connect>> | null = null;
   let ch: amqp.ConfirmChannel | null = null;
@@ -64,6 +72,7 @@
   "/billing/subscription",
   authMiddleware,
   apiRateLimiter,
+  subscriptionRateLimiter,
   async (req: Request, res: Response) => {
     const { rows } = await pool.query(
       `SELECT tb.stripe_customer_id,
EOF
billingRouter.post(
"/billing/checkout",
authMiddleware,
apiRateLimiter,

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

In general, to fix missing or insufficient rate limiting on an expensive route, you add a dedicated rate-limiting middleware that enforces an appropriate threshold for that specific endpoint, rather than relying only on a global or generic limiter. In Express, this is typically done using a library such as express-rate-limit, configured with a suitable windowMs and max for the operation in question.

For this specific /billing/checkout route in apps/gateway/src/routes/billing.ts, the best minimal fix without changing existing behavior is:

  • Import express-rate-limit at the top of the file.
  • Define a new limiter instance specifically for billing checkout, e.g. billingCheckoutRateLimiter, with stricter limits (for example, a low number of requests per minute per IP or per user).
  • Apply this new middleware in addition to the existing authMiddleware and apiRateLimiter in the /billing/checkout route’s middleware array.

Concretely:

  • In apps/gateway/src/routes/billing.ts, add an import for express-rate-limit.
  • Just after the queue/Stripe constants or near other middleware-related declarations, define const billingCheckoutRateLimiter = rateLimit({ ... }).
  • Update the billingRouter.post("/billing/checkout", ...) declaration so that its middleware chain is authMiddleware, apiRateLimiter, billingCheckoutRateLimiter, async (req, res) => { ... }.

This keeps existing functionality intact while adding an extra safeguard specifically tailored to this sensitive billing operation.

Suggested changeset 1
apps/gateway/src/routes/billing.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/gateway/src/routes/billing.ts b/apps/gateway/src/routes/billing.ts
--- a/apps/gateway/src/routes/billing.ts
+++ b/apps/gateway/src/routes/billing.ts
@@ -5,10 +5,18 @@
 import { authMiddleware } from "../lib/auth";
 import { writeAuditLog } from "../lib/audit";
 import { apiRateLimiter } from "../middleware/rateLimiting";
+import rateLimit from "express-rate-limit";
 
 const RABBITMQ_URL = process.env.RABBITMQ_URL || "amqp://grainguard:grainguard@rabbitmq:5672/grainguard";
 const STRIPE_BILLING_QUEUE = "grainguard.stripe.billing";
 
+const billingCheckoutRateLimiter = rateLimit({
+  windowMs: 1 * 60 * 1000, // 1 minute
+  max: 5, // limit each IP/user to 5 checkout requests per minute
+  standardHeaders: true,
+  legacyHeaders: false,
+});
+
 async function publishToStripeQueue(payload: object): Promise<void> {
   let conn: Awaited<ReturnType<typeof amqp.connect>> | null = null;
   let ch: amqp.ConfirmChannel | null = null;
@@ -124,6 +128,7 @@
   "/billing/checkout",
   authMiddleware,
   apiRateLimiter,
+  billingCheckoutRateLimiter,
   async (req: Request, res: Response) => {
     const { plan } = req.body as { plan: string };
 
EOF
billingRouter.post(
"/billing/portal",
authMiddleware,
apiRateLimiter,

Check failure

Code scanning / CodeQL

Missing rate limiting (High)

This route handler performs a database access, but is not rate-limited.

Copilot Autofix

AI 29 days ago

To fix the problem, the webhook endpoint should be protected by rate limiting just like the other billing routes. The existing apiRateLimiter middleware is already imported and used for /billing/checkout and /billing/portal, so the best fix is to apply the same middleware in front of the Stripe webhook handler without changing its core behavior.

Concretely, instead of exporting stripeWebhookHandler as a bare function, we should expose it as a route on billingRouter with apiRateLimiter in the middleware chain, for example via billingRouter.post("/billing/stripe-webhook", apiRateLimiter, stripeWebhookHandler). Since we cannot see the router setup here, and to avoid altering exports or wiring that may be used elsewhere, an even less intrusive change is to wrap the existing stripeWebhookHandler logic in a new, rate-limited handler and export that wrapper. This preserves the external interface (the exported symbol name remains stripeWebhookHandler) while adding rate limiting in-process.

Implementation steps within apps/gateway/src/routes/billing.ts:

  1. Leave the existing logic inside stripeWebhookHandler untouched but move it into a new internal function, e.g. async function stripeWebhookHandlerCore(req, res).
  2. Create a new exported stripeWebhookHandler that is an Express-style handler applying apiRateLimiter manually: invoke apiRateLimiter(req, res, next) and, in its callback, call stripeWebhookHandlerCore. This mimics how Express would apply the middleware without requiring router-level changes.
  3. Ensure imports remain unchanged; we already import apiRateLimiter, so no new imports are needed.

This adds rate limiting to the webhook handler while keeping its external behavior and signature compatible.

Suggested changeset 1
apps/gateway/src/routes/billing.ts

Autofix patch

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/gateway/src/routes/billing.ts b/apps/gateway/src/routes/billing.ts
--- a/apps/gateway/src/routes/billing.ts
+++ b/apps/gateway/src/routes/billing.ts
@@ -220,7 +220,7 @@
   }
 );
 
-export async function stripeWebhookHandler(req: Request, res: Response) {
+async function stripeWebhookHandlerCore(req: Request, res: Response) {
   const sig = req.headers["stripe-signature"] as string;
 
   let event: Stripe.Event;
@@ -248,3 +248,13 @@
     return res.status(500).json({ error: "queue_publish_failed" });
   }
 }
+
+export function stripeWebhookHandler(req: Request, res: Response) {
+  apiRateLimiter(req, res, (err?: any) => {
+    if (err) {
+      // If the rate limiter passes an error, delegate to Express-style error handling
+      return res.status(429).json({ error: "rate_limited" });
+    }
+    void stripeWebhookHandlerCore(req, res);
+  });
+}
EOF
@pahuldeepp pahuldeepp merged commit 8f90a73 into master Mar 29, 2026
20 of 23 checks passed
@pahuldeepp pahuldeepp deleted the codex/staging-bootstrap branch March 29, 2026 22:41