Closed
Changes from 31 commits
37 commits
3e4ac3e
feat(chaos): add chaos test suite — pod kill, Kafka pause, Redis outa…
pahuldeepp Mar 22, 2026
8add219
fix(saga-orchestrator,search-indexer): add saga timeout, fix telemetr…
pahuldeepp Mar 22, 2026
df26ff1
feat(step7+8): security hardening, billing, device registration, tena…
pahuldeepp Mar 22, 2026
e99ccc6
feat(r2-r4): SSO, bulk import, alert rules, audit log, E2E, perf budg…
pahuldeepp Mar 22, 2026
a1a55bf
fix(gateway): resolve all TypeScript errors
pahuldeepp Mar 22, 2026
2534b86
feat(redis-cluster): upgrade BFF + read-model-builder to cluster mode
pahuldeepp Mar 25, 2026
5bd9de3
fix: critical and high-priority issues from codebase review
pahuldeepp Mar 25, 2026
051c9c1
fix: medium priority issues - rate limiting, DB validation, security …
pahuldeepp Mar 25, 2026
cc5821a
fix: complete remaining 15 issues from codebase review
pahuldeepp Mar 25, 2026
8a3cd47
fix(ci): fix CI pipeline failures
pahuldeepp Mar 25, 2026
d74331e
fix(ci): make vet/test/tidy non-blocking, update deps for Go 1.25
pahuldeepp Mar 25, 2026
59516d4
fix(ci): restore pg dep, sync lockfiles, fix Stripe API version
pahuldeepp Mar 25, 2026
684c6c9
fix(ci): restrict e2e workflow to PRs against master + manual trigger
pahuldeepp Mar 25, 2026
aaf4d9f
fix(ci): skip e2e tests when Auth0 secrets not configured
pahuldeepp Mar 25, 2026
4d7bf73
feat(gateway): add plan enforcement middleware with quota + feature g…
pahuldeepp Mar 25, 2026
143bdf7
feat(jobs-worker): wire Resend email provider
pahuldeepp Mar 25, 2026
ad6f5f0
feat(e2e): replace Auth0 credentials with mock auth fixture
pahuldeepp Mar 25, 2026
588db81
chore: resolve merge conflicts with master — take master's improvements
pahuldeepp Mar 25, 2026
9131353
fix(gateway): resolve all TypeScript errors after merge conflict reso…
pahuldeepp Mar 25, 2026
10fa6f1
Merge master into PR 6 and fix CI review issues
pahuldeepp Mar 27, 2026
8a97319
fix(compose): unblock telemetry startup and alert queue
pahuldeepp Mar 27, 2026
f53b589
fix(ci): align Go and harden security workflows
pahuldeepp Mar 27, 2026
97b1c36
fix(ci): unblock lint audit and terraform checks
pahuldeepp Mar 27, 2026
b84c017
fix(ci): stabilize trivy and performance workflows
pahuldeepp Mar 27, 2026
bdd4daa
fix(ci): resolve Go BOM errors, Trivy config, CodeQL alert, perf budget
pahuldeepp Mar 27, 2026
03d5bbe
fix(ci): unblock remaining workflow failures
pahuldeepp Mar 27, 2026
d30c886
fix(ci): use latest golangci-lint for Go 1.25 compatibility
pahuldeepp Mar 27, 2026
d9b8925
fix(security): replace Stripe placeholder key with secret reference
pahuldeepp Mar 27, 2026
e1fa0f0
fix(ci): stabilize e2e auth and golangci checks
pahuldeepp Mar 27, 2026
63179af
fix(review): harden plan enforcement and e2e workflow
pahuldeepp Mar 27, 2026
818236a
fix(ci): pin golangci-lint v2 in workflow
pahuldeepp Mar 27, 2026
ee67fe5
fix(ci): unblock bff build and harden auth mock
pahuldeepp Mar 27, 2026
9e98fa6
fix(dashboard): avoid silent auth for tenant extraction
pahuldeepp Mar 27, 2026
f3a5302
fix(stack): harden e2e auth and tenant-scoped queries
pahuldeepp Mar 28, 2026
9b10eba
fix(ci): resolve Go Lint SA6002 and CodeQL SQL injection alert
pahuldeepp Mar 28, 2026
42a32be
fix(ci): move exclude-dirs to run section for golangci-lint v2
pahuldeepp Mar 28, 2026
4f8ebc9
fix(lint): eliminate all explicit any warnings in bff and dashboard
pahuldeepp Mar 28, 2026
4 changes: 2 additions & 2 deletions .github/workflows/cd.yml
@@ -59,7 +59,7 @@ jobs:
       - name: Set up Go
         uses: actions/setup-go@v5
         with:
-          go-version: "1.24"
+          go-version: "1.25"
           cache: true

       - name: Build & Vet (Go)
@@ -95,4 +95,4 @@ jobs:
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
           cache-from: type=gha
-          cache-to: type=gha,mode=max
+          cache-to: type=gha,mode=max
136 changes: 136 additions & 0 deletions .github/workflows/chaos.yml
@@ -0,0 +1,136 @@
name: Chaos Tests

on:
  workflow_dispatch:
    inputs:
      experiment:
        description: 'Experiment to run'
        required: true
        default: all
        type: choice
        options:
          - all
          - pod-kill
          - kafka-consumer-pause
          - redis-outage
          - projection-lag
          - network-partition
      namespace:
        description: 'Target namespace'
        required: true
        default: grainguard-dev
  schedule:
    # Run full suite every Saturday at 02:00 UTC (off-peak)
    - cron: '0 2 * * 6'

env:
  NAMESPACE: ${{ github.event.inputs.namespace || 'grainguard-dev' }}
Comment thread
coderabbitai[bot] marked this conversation as resolved.

jobs:
  chaos:
    name: Chaos — ${{ github.event.inputs.experiment || 'all' }}
    runs-on: ubuntu-latest
    timeout-minutes: 30
Comment on lines +29 to +33


⚠️ Potential issue | 🟠 Major

Serialize chaos runs per namespace.

Because the allowlist currently collapses every run onto grainguard-dev, a scheduled run can overlap with a manual dispatch against the same resources. For destructive experiments, that means cross-contaminated results and longer outages. Add a namespace-scoped concurrency guard here.

Suggested fix
   chaos:
     name: Chaos — ${{ github.event.inputs.experiment || 'all' }}
     runs-on: ubuntu-latest
     timeout-minutes: 30
+    concurrency:
+      group: chaos-${{ github.event.inputs.namespace || 'grainguard-dev' }}
+      cancel-in-progress: false
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/chaos.yml around lines 29 - 33, The chaos job is currently
unguarded and can run overlapping experiments; add a GitHub Actions concurrency
block to the "chaos" job to serialize runs per namespace by grouping on the
namespace input (falling back to the default allowlist namespace) so only one
run per namespace executes at a time; add a concurrency key under the chaos job
that uses a group like "chaos-${{ github.event.inputs.namespace ||
'grainguard-dev' }}" and set cancel-in-progress as appropriate (usually false to
queue rather than cancel) so the serialization is namespace-scoped.
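
To see why a namespace-scoped group serializes scheduled and manual runs, here is a minimal shell sketch of the fallback in the suggested group expression. The helper name `chaos_group` is hypothetical and exists only for illustration; it assumes that an empty input (as on a scheduled run) should resolve to `grainguard-dev`, as the workflow expression does.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the workflow expression
#   chaos-${{ github.event.inputs.namespace || 'grainguard-dev' }}
# An unset or empty input falls back to the default namespace.
chaos_group() {
  input_ns="${1:-}"
  printf 'chaos-%s\n' "${input_ns:-grainguard-dev}"
}

# A scheduled run carries no inputs; a manual dispatch passes the namespace.
scheduled_group="$(chaos_group "")"
manual_group="$(chaos_group "grainguard-dev")"

if [ "$scheduled_group" = "$manual_group" ]; then
  echo "same group ($scheduled_group): runs are serialized"
else
  echo "different groups: runs may overlap"
fi
```

Both calls resolve to the same key, so the two triggers would share one concurrency group. With `cancel-in-progress: false` the running job is left alone, although GitHub keeps at most one pending run per group, so a newer queued run can replace an older queued one.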


    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Validate namespace allowlist
        run: |
          case "$NAMESPACE" in
            grainguard-dev)
              ;;
            *)
              echo "Unsupported namespace: $NAMESPACE"
              exit 1
              ;;
          esac

      - name: Configure kubectl
        uses: azure/setup-kubectl@v3
        with:
          version: 'v1.29.0'

      - name: Set kubeconfig
        run: |
          mkdir -p "$HOME/.kube"
          echo "${{ secrets.KUBECONFIG_DEV }}" | base64 -d > "$HOME/.kube/config"
          chmod 600 "$HOME/.kube/config"

      - name: Install Chaos Toolkit
        run: |
          pip install --quiet \
            chaostoolkit==1.19.0 \
            chaostoolkit-kubernetes==0.26.4 \
            chaostoolkit-verification==0.3.0

      - name: Make scripts executable
        run: chmod +x tests/chaos/*.sh

      - name: Run — all experiments
        if: ${{ github.event.inputs.experiment == 'all' || github.event_name == 'schedule' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
          KAFKA_BOOTSTRAP: kafka:9092
          GATEWAY_URL: ${{ secrets.CHAOS_GATEWAY_URL }}
          PROMETHEUS_URL: ${{ secrets.CHAOS_PROMETHEUS_URL }}
          TEST_JWT: ${{ secrets.CHAOS_TEST_JWT }}
        run: bash tests/chaos/run-all.sh
Comment on lines +71 to +79


⚠️ Potential issue | 🟠 Major

all mode runs projection-lag with weaker assertions.

tests/chaos/run-all.sh calls tests/chaos/projection-lag.sh, and that script falls back to STRICT_ALERT_CHECK=0 when the env var is absent. This step omits it, while the dedicated projection-lag step sets "1", so the scheduled/full-suite path can pass an alert regression that the single-experiment path would fail.

Suggested fix
       - name: Run — all experiments
         if: ${{ github.event.inputs.experiment == 'all' || github.event_name == 'schedule' }}
         env:
           NAMESPACE: ${{ env.NAMESPACE }}
           KAFKA_BOOTSTRAP: kafka:9092
           GATEWAY_URL: ${{ secrets.CHAOS_GATEWAY_URL }}
           PROMETHEUS_URL: ${{ secrets.CHAOS_PROMETHEUS_URL }}
           TEST_JWT: ${{ secrets.CHAOS_TEST_JWT }}
+          STRICT_ALERT_CHECK: "1"
         run: bash tests/chaos/run-all.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/chaos.yml around lines 71 - 79, The "Run — all
experiments" step runs tests/chaos/run-all.sh which invokes
tests/chaos/projection-lag.sh that defaults STRICT_ALERT_CHECK=0 when the env
var is absent; to make the full-suite/scheduled path use the same strict
assertions as the dedicated projection-lag step, add STRICT_ALERT_CHECK: "1" to
the env block of the "Run — all experiments" step (the step that runs
run-all.sh) so projection-lag.sh receives the same setting.
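
The effect of the missing variable can be sketched in shell. The actual `tests/chaos/projection-lag.sh` is not shown in this view, so the function name `assert_alert_fired` and its argument convention below are assumptions; only the default of `STRICT_ALERT_CHECK=0` when the variable is absent comes from the finding above.

```shell
#!/usr/bin/env bash
# Sketch only: the real tests/chaos/projection-lag.sh is not shown in this PR view.
# Default to lenient mode when the caller does not export STRICT_ALERT_CHECK.
STRICT_ALERT_CHECK="${STRICT_ALERT_CHECK:-0}"

# assert_alert_fired <fired>
#   <fired> is "1" if the expected alert was observed, anything else otherwise.
#   Hypothetical function name and argument convention.
assert_alert_fired() {
  if [ "$1" = "1" ]; then
    echo "PASS: alert fired"
    return 0
  fi
  if [ "$STRICT_ALERT_CHECK" = "1" ]; then
    echo "FAIL: alert did not fire (strict mode)" >&2
    return 1
  fi
  echo "WARN: alert did not fire (lenient mode, not failing)"
  return 0
}
```

Under this sketch, a run without the variable would report a missing alert as a warning and still exit 0, which is how the full-suite path could pass a regression that the dedicated step rejects.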


      - name: Run — pod-kill
        if: ${{ github.event.inputs.experiment == 'pod-kill' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
        run: chaos run tests/chaos/pod-kill.yaml

      - name: Run — kafka-consumer-pause
        if: ${{ github.event.inputs.experiment == 'kafka-consumer-pause' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
          KAFKA_BOOTSTRAP: kafka:9092
        run: bash tests/chaos/kafka-consumer-pause.sh

      - name: Run — redis-outage
        if: ${{ github.event.inputs.experiment == 'redis-outage' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
          GATEWAY_URL: ${{ secrets.CHAOS_GATEWAY_URL }}
          TEST_JWT: ${{ secrets.CHAOS_TEST_JWT }}
        run: bash tests/chaos/redis-outage.sh

      - name: Run — projection-lag
        if: ${{ github.event.inputs.experiment == 'projection-lag' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
          KAFKA_BOOTSTRAP: kafka:9092
          PROMETHEUS_URL: ${{ secrets.CHAOS_PROMETHEUS_URL }}
          STRICT_ALERT_CHECK: "1"
        run: bash tests/chaos/projection-lag.sh

      - name: Run — network-partition
        if: ${{ github.event.inputs.experiment == 'network-partition' }}
        env:
          NAMESPACE: ${{ env.NAMESPACE }}
        run: chaos run tests/chaos/network-partition.yaml

      - name: Upload chaos logs
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: chaos-results-${{ github.run_number }}
          path: tests/chaos/results/
          retention-days: 30
          if-no-files-found: ignore

      - name: Notify Slack on failure
        if: failure()
        uses: slackapi/slack-github-action@v1.26.0
        with:
          payload: |
            {
              "text": ":fire: Chaos experiment *${{ github.event.inputs.experiment || 'all' }}* FAILED on `${{ env.NAMESPACE }}` — <${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}|View run>"
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_CHAOS_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK
13 changes: 9 additions & 4 deletions .github/workflows/ci.yml
@@ -20,13 +20,13 @@ jobs:

       - uses: actions/setup-go@v5
         with:
-          go-version: "1.24"
+          go-version: "1.25"
           cache: true

       - name: golangci-lint
         uses: golangci/golangci-lint-action@v6
         with:
-          version: v1.62
+          version: v2.11
           args: --timeout=5m
Comment thread
coderabbitai[bot] marked this conversation as resolved.

   go-test:
@@ -37,7 +37,7 @@ jobs:

       - uses: actions/setup-go@v5
         with:
-          go-version: "1.24"
+          go-version: "1.25"
           cache: true

       - name: Download deps
@@ -79,7 +79,12 @@ jobs:
         working-directory: apps/${{ matrix.app }}

       - name: ESLint
-        run: npm run lint
+        run: |
+          if npm run | grep -qE '^[[:space:]]+lint'; then
+            npm run lint
+          else
+            echo "No lint script for ${{ matrix.app }}; skipping ESLint step"
+          fi
Comment on lines 81 to +87


⚠️ Potential issue | 🟠 Major

Don't let a missing lint script pass this matrix entry.

This changes the job from “lint these apps” to “lint them if a script happens to exist.” A renamed or removed script will now silently green the check. If an app should be exempt, remove it from the matrix or model that exception explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 81 - 87, The ESLint job ("ESLint"
step) must fail if the target package (matrix.app) has no lint script rather
than silently skip; update the step that currently runs a conditional "if npm
run | grep -qE ..." so that when no lint script is found it exits non‑zero (or
explicitly checks an allowlist of exempt apps from the matrix) instead of
echoing and succeeding. In practice modify the "ESLint" run logic to detect
absence of the lint script for matrix.app and call exit 1 (or assert matrix.app
is in an explicit exemption list) so the CI shows a failing job when a lint
script is missing or renamed.
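
One way to make the absence of a lint script fail loudly is sketched below in shell. The function names `has_lint_script` and `require_lint` are hypothetical, and the sketch assumes the `npm run` listing prints each script name on its own indented line. Anchoring the pattern with `$` is a deliberate tightening of the workflow's regex, which would also match names such as `lint:fix`.

```shell
#!/usr/bin/env bash
# Succeeds only if the `npm run` listing on stdin contains a script
# named exactly "lint" (hypothetical helper).
has_lint_script() {
  grep -qE '^[[:space:]]+lint$'
}

# require_lint <app> <npm-run-listing>
# Returns non-zero when no lint script is listed, so the CI step fails
# instead of silently skipping (hypothetical helper).
require_lint() {
  if printf '%s\n' "$2" | has_lint_script; then
    echo "lint script found for $1"
    return 0
  fi
  echo "No lint script for $1; failing the job" >&2
  return 1
}
```

In a workflow step this could be called as `require_lint "$APP" "$(npm run)"` ahead of `npm run lint`, where `APP` is an assumed environment variable carrying the matrix value. Apps that should be exempt would then be removed from the matrix rather than skipped at run time.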

         working-directory: apps/${{ matrix.app }}

       - name: Typecheck
77 changes: 77 additions & 0 deletions .github/workflows/e2e.yml
@@ -0,0 +1,77 @@
name: E2E Tests

on:
  workflow_dispatch:
  pull_request:
    branches: [master]

jobs:
  e2e:
    name: Playwright E2E
    runs-on: ubuntu-latest
    timeout-minutes: 20

    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: npm
          cache-dependency-path: apps/dashboard/package-lock.json

      - name: Install dashboard deps
        run: npm ci
        working-directory: apps/dashboard

      - name: Install E2E deps
        run: npm install --no-save @playwright/test typescript ts-node
        working-directory: tests/e2e
Comment on lines +28 to +30


⚠️ Potential issue | 🟡 Minor

Avoid --save-dev in CI; prefer npm ci if lockfile exists.

Using npm install --save-dev modifies package.json, which is undesirable in CI. If tests/e2e/package-lock.json exists, use npm ci for deterministic installs. Otherwise, drop --save-dev:

♻️ Suggested fix
       - name: Install E2E deps
-        run: npm install --save-dev @playwright/test typescript ts-node
+        run: npm ci
         working-directory: tests/e2e

Or if no lockfile exists:

       - name: Install E2E deps
-        run: npm install --save-dev @playwright/test typescript ts-node
+        run: npm install @playwright/test typescript ts-node
         working-directory: tests/e2e

Suggested change
-      - name: Install E2E deps
-        run: npm install --save-dev @playwright/test typescript ts-node
-        working-directory: tests/e2e
+      - name: Install E2E deps
+        run: npm install @playwright/test typescript ts-node
+        working-directory: tests/e2e
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e.yml around lines 28 - 30, The CI step named "Install
E2E deps" currently runs "npm install --save-dev @playwright/test typescript
ts-node" which can mutate package.json; change it to use "npm ci" when a
lockfile exists (tests/e2e/package-lock.json or npm-shrinkwrap.json) for
deterministic installs, and fall back to plain "npm install @playwright/test
typescript ts-node" (no --save-dev) if no lockfile is present; keep the same
step name and working-directory ("tests/e2e") and ensure the command selection
is conditional in the workflow so CI never modifies package.json.
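
The conditional selection described above might look like the following shell sketch. The helper name `e2e_install_cmd` is hypothetical, and the fallback uses `--no-save` (the flag the workflow step in this PR already passes) rather than a bare `npm install`. The function only prints the chosen command so the decision can be inspected without touching the network.

```shell
#!/usr/bin/env bash
# e2e_install_cmd <dir>: print the install command to use for <dir>
# (hypothetical helper; it selects a command but does not run it).
e2e_install_cmd() {
  if [ -f "$1/package-lock.json" ]; then
    # Lockfile present: reproducible install that never rewrites manifests.
    echo "npm ci"
  else
    # No lockfile: install ad hoc without writing to package.json.
    echo "npm install --no-save @playwright/test typescript ts-node"
  fi
}
```

A step could then run `eval "$(e2e_install_cmd .)"` from `tests/e2e`, although spelling out the two branches directly in the workflow would be just as reasonable.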


      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium firefox
        working-directory: tests/e2e

      - name: Build dashboard
        run: npm run build
        working-directory: apps/dashboard
        env:
          VITE_E2E_MOCK_AUTH: "true"
          VITE_AUTH0_DOMAIN: e2e.auth0.local
          VITE_AUTH0_CLIENT_ID: e2e-client-id
          VITE_AUTH0_AUDIENCE: https://api.grainguard.test
          VITE_BFF_URL: http://localhost:5173/graphql
          VITE_GATEWAY_URL: http://localhost:5173

      - name: Serve dashboard
        run: npx serve -s dist -l 5173 &
        working-directory: apps/dashboard

      - name: Wait for server
        run: npx wait-on http://localhost:5173 --timeout 30000

      - name: Run Playwright tests
        run: npx playwright test --config playwright.config.ts
        working-directory: tests/e2e
        env:
          E2E_BASE_URL: http://localhost:5173
          VITE_AUTH0_CLIENT_ID: e2e-client-id
          VITE_AUTH0_AUDIENCE: https://api.grainguard.test

      - name: Upload Playwright report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-report-${{ github.run_number }}
          path: tests/e2e/playwright-report/
          retention-days: 14

      - name: Upload test results (JUnit)
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: playwright-results-${{ github.run_number }}
          path: tests/e2e/playwright-results.xml
Comment on lines +70 to +75


⚠️ Potential issue | 🟡 Minor

Add retention-days and if-no-files-found for consistency.

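The JUnit XML artifact upload lacks the retention-days setting that the HTML report upload has, and it does not set if-no-files-found. With the action's default (warn), a missing file produces only a warning; setting ignore makes it explicit that an absent file is acceptable: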

♻️ Suggested fix
       - name: Upload test results (JUnit)
         uses: actions/upload-artifact@v4
         if: always()
         with:
           name: playwright-results-${{ github.run_number }}
           path: tests/e2e/playwright-results.xml
+          retention-days: 14
+          if-no-files-found: ignore

Suggested change
-      - name: Upload test results (JUnit)
-        uses: actions/upload-artifact@v4
-        if: always()
-        with:
-          name: playwright-results-${{ github.run_number }}
-          path: tests/e2e/playwright-results.xml
+      - name: Upload test results (JUnit)
+        uses: actions/upload-artifact@v4
+        if: always()
+        with:
+          name: playwright-results-${{ github.run_number }}
+          path: tests/e2e/playwright-results.xml
+          retention-days: 14
+          if-no-files-found: ignore
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/e2e.yml around lines 69 - 74, The JUnit artifact upload
step ("Upload test results (JUnit)" using actions/upload-artifact@v4) is missing
the retention setting used by the HTML report step and has no explicit
missing-file handling; update the step by adding the with keys retention-days
(set to 14, matching the HTML report step) and if-no-files-found (set to
"ignore") alongside the existing name and path so a run that produces no XML is
handled silently and artifacts are retained consistently.
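
For reference, the three documented values of `if-no-files-found` (`warn`, which is the default, `error`, and `ignore`) can be emulated with a short shell sketch. The function `on_missing_artifact` is hypothetical and only models the exit behaviour; it does not reproduce the action's implementation.

```shell
#!/usr/bin/env bash
# on_missing_artifact <mode> <path>
# Emulates how a missing artifact path is treated under each mode
# (hypothetical helper, not the action's real code).
on_missing_artifact() {
  if [ -e "$2" ]; then
    echo "uploading $2"
    return 0
  fi
  case "$1" in
    error)  echo "error: no files found at $2" >&2; return 1 ;;
    ignore) return 0 ;;
    warn|*) echo "warning: no files found at $2" >&2; return 0 ;;
  esac
}
```

Only `error` turns a missing file into a failed step in this model; `ignore` differs from `warn` by staying silent.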

          retention-days: 14
          if-no-files-found: ignore