
feat(bff): add PgBouncer pooling, Redis cache with stampede protectio… #5

Merged
pahuldeepp merged 12 commits into master from feat/r1-closeout on Mar 22, 2026

Conversation

pahuldeepp (Owner) commented Mar 21, 2026

What this PR does

Closes out R1 infrastructure gaps by adding PgBouncer connection pooling
and Redis cache-aside pattern with stampede protection to the BFF.

Changes

PgBouncer — Connection Pooling

  • Added infra/docker/pgbouncer/pgbouncer.ini — pool config with
    transaction mode, 200 max client connections, 20 default pool size
  • Added infra/docker/pgbouncer/userlist.txt — auth credentials
  • Added PgBouncer service to infra/docker/docker-compose.yml
  • BFF now connects via PgBouncer instead of directly to Postgres
  • Increased BFF pool max from 10 to 50 (PgBouncer enforces the real Postgres limit)

Redis Cache — Dashboard Queries

  • Wired cache.get/set from existing datasources/redis.ts into
    3 high-frequency dashboard queries:
    • getAllDevices — 30s TTL, keyed by limit
    • getAllTelemetry — 30s TTL, keyed by tenantId + limit
    • getAllDevicesWithTelemetry — 30s TTL, keyed by tenantId + limit

Stampede Protection

  • All 3 cached functions use cache.acquireLock on cache miss
  • Only 1 request queries Postgres at a time per cache key
  • Concurrent requests wait 100ms then hit warm cache
  • Lock TTL = 5s, auto-releases via releaseLock in finally block
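
In code, each cached read follows roughly this shape (a simplified sketch of the shared flow; `cache` comes from datasources/redis.ts and `cbQuery` is the existing query helper):

```ts
// Sketch of the cache-aside + lock flow shared by the three dashboard queries.
// Assumes `cache` exposes get/set/acquireLock/releaseLock as described above.
async function cachedRows(
  cacheKey: string,
  runQuery: () => Promise<any[]>,
): Promise<any[]> {
  const cached = await cache.get<any[]>(cacheKey);
  if (cached) return cached; // warm cache: no DB round trip

  const locked = await cache.acquireLock(cacheKey, 5); // 5s lock TTL
  if (!locked) {
    // Someone else is already refilling this key: wait, then read the warm cache.
    await new Promise((r) => setTimeout(r, 100));
    return (await cache.get<any[]>(cacheKey)) || [];
  }

  try {
    const rows = await runQuery(); // only the lock holder hits Postgres
    await cache.set(cacheKey, rows, 30); // 30s TTL
    return rows;
  } finally {
    await cache.releaseLock(cacheKey); // releases even if the query throws
  }
}
```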

Cleanup

  • Removed duplicate apps/bff/src/lib/cache.ts
  • Single Redis client — datasources/redis.ts is source of truth for BFF

Why

  • PgBouncer: prevents connection exhaustion under load — 50 services
    sharing 20 real Postgres connections instead of 50 direct connections
  • Cache: dashboard load goes from N DB queries to 1 query per 30s
  • Stampede protection: prevents thundering herd on cache expiry
  • Documented in ADR-006 (Redis locks) and blueprint caching strategy

Testing

  • Verify BFF connects via pgbouncer container on startup
  • Verify cache hit on second dashboard request (Redis MONITOR)
  • Verify single DB query on concurrent cache miss (check Postgres logs)
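
A rough smoke test for the last check (a hypothetical sketch for Node 18+; the endpoint and query shape are assumptions, substitute the real dashboard route):

```ts
// Fire concurrent requests against a cold cache key; the Postgres logs should
// then show a single query for the window. URL and body are placeholders.
const URL = "http://localhost:4000/graphql"; // assumed BFF address

async function main() {
  const body = JSON.stringify({ query: "{ devices { deviceId } }" }); // assumed shape
  const statuses = await Promise.all(
    Array.from({ length: 20 }, () =>
      fetch(URL, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body,
      }).then((r) => r.status),
    ),
  );
  console.log("statuses:", statuses); // expect all 200s and one DB query in the logs
}

main().catch(console.error);
```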

Summary by CodeRabbit

  • Performance Improvements
    • Added caching for device and telemetry queries and increased the DB connection pool for higher throughput.
  • New Features
    • Requests now accept and return a correlation ID alongside the request ID for improved tracing.
  • Observability
    • Logs include the correlation ID; added centralized logging, searching, metrics, and message-broker components to improve diagnostics.
  • Chores
    • Added developer automation (Makefile, bootstrap/seed scripts), local infra orchestration, and connection-pooling utilities.


coderabbitai Bot commented Mar 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Redis-backed caching with distributed-lock fallback to BFF Postgres queries, introduces correlation IDs and logging changes in the gateway, adds a project Makefile and infra/config files (PgBouncer, Loki, Promtail), and updates load-generator event payload and DB connection defaults/pool size.

Changes

  • BFF Postgres datasource (apps/bff/src/datasources/postgres.ts)
    Added Redis cache reads/writes with tenant/limit-scoped keys and distributed-lock flow (acquire → query → set TTL → release); 100ms fallback when lock not acquired; changed READ_DB_HOST default to pgbouncer; increased pg pool max 10 → 50.
  • Gateway middleware & server logging (apps/gateway/src/middleware/requestId.ts, apps/gateway/src/server.ts)
    Middleware now accepts/validates x-correlation-id, assigns req.correlationId, echoes x-correlation-id; ID helper renamed; request logger includes correlation_id in structured logs.
  • Developer Makefile (Makefile)
    New Makefile with compose orchestration (up/down/restart/logs), image build, seed/migrate, test targets (Go, React, load), lint (go/ts), cleanup, status, and help target.
  • Compose and infra additions (infra/docker/docker-compose.yml)
    Adds/rewires services: Elasticsearch, Kibana, RabbitMQ, Loki, Promtail, Memcached, PgBouncer; reorders services, adjusts env/volumes, and updates declared volumes.
  • Loki config (infra/docker/loki-config.yaml)
    New Loki config: HTTP on 3100, in-memory ring, boltdb-shipper filesystem storage, chunk/index dirs under /loki, limits/lookback settings.
  • PgBouncer config (infra/docker/pgbouncer/pgbouncer.ini)
    New PgBouncer config mapping grainguard/grainguard_read, listening on 0.0.0.0:5432, transaction pooling, md5 auth, reserve pool settings.
  • Promtail config (infra/docker/promtail-config.yaml)
    New Promtail config: HTTP server on 9080, Docker service discovery, relabels Docker metadata to log labels, forwards to http://loki:3100/loki/api/v1/push.
  • Load generator (apps/load-generator/main.go)
    TelemetryEvent payload updated (data.recordedAt added), adjusted temperature/humidity generation ranges, uses UTC recordedAt, returns from main on shutdown instead of os.Exit.
  • Bootstrap script (scripts/bootstrap.sh)
    New bootstrap script that verifies tooling, installs npm deps, brings up compose infra, waits for Postgres, seeds the DB, and starts services.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant BFF as BFF
    participant Redis as Redis (cache)
    participant Postgres as Postgres
    participant Lock as Redis Lock

    Client->>BFF: request (getAll...)
    BFF->>Redis: GET cacheKey
    alt cache hit
        Redis-->>BFF: cached data
        BFF-->>Client: return cached data
    else cache miss
        BFF->>Lock: TRY acquire lock(cacheKey)
        alt lock acquired
            Lock-->>BFF: lock granted
            BFF->>Postgres: execute query
            Postgres-->>BFF: result
            BFF->>Redis: SET cacheKey = result (TTL)
            Redis-->>BFF: OK
            BFF->>Lock: RELEASE lock
            Lock-->>BFF: released
            BFF-->>Client: return fresh result
        else lock not acquired
            BFF-->>BFF: wait 100ms
            BFF->>Redis: GET cacheKey (fallback)
            Redis-->>BFF: stale/empty
            BFF-->>Client: return fallback result
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly summarizes the main changes: PgBouncer pooling and Redis caching with stampede protection, which are the primary infrastructure additions in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

coderabbitai (Bot) left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/bff/src/datasources/postgres.ts`:
- Around line 49-67: The cache key and query are tenant-agnostic: update the
call site and this function to accept a tenantId (or an explicit isAdmin flag),
include tenantId in the cache key (e.g.,
`devices:tenant:${tenantId}:all:${limit}`) and add a WHERE tenant_id = $2 (or
equivalent) to the cbQuery so the SQL filters by tenant; if this endpoint must
be admin-only instead, document and enforce that by checking an isAdmin guard
before using the shared key. Ensure the cache.get/set/acquireLock use the new
tenant-scoped key and thread tenantId through any callers that invoke this code
so RLS/tenant isolation is preserved.
- Around line 49-70: The cache logic currently returns [] on contested misses
and lets any cache (get/set/releaseLock) errors surface instead of falling back
to Postgres; update the flow around cacheKey, cache.get, cache.acquireLock,
cache.set and cache.releaseLock so that any cache error or a contested-miss
timeout triggers a direct cbQuery to Postgres and returns those rows, and ensure
cache.releaseLock errors are caught/logged and do not override the DB result.
Concretely: on cache miss, attempt acquireLock; if not locked, sleep then try
cache.get again and if still empty call cbQuery(...) and return its rows; wrap
all cache.get/set/releaseLock calls in try/catch to fall back to cbQuery on
failure; after a successful cbQuery, attempt cache.set but ignore/set-log any
errors so the DB result is still returned. This same pattern should be applied
to the other similar blocks indicated (lines ~85-120 and ~165-205).
- Around line 96-105: The branch is calling cbQuery directly despite having a
tenantId, bypassing the module's tenant/RLS pathway; replace the cbQuery call in
this block (the query selecting from device_telemetry_latest joined to
device_projections) with the module's tenant-scoped query helper (e.g., the
tenantScopedQuery/tenantQuery helper used elsewhere) and pass tenantId and limit
through that helper while keeping the explicit WHERE d.tenant_id = $1 predicate
for defense-in-depth; apply the same replacement for the analogous block around
lines 176-187 so all tenant-aware reads use the tenant-scoped helper rather than
cbQuery.
- Around line 53-70: The current lock handling uses a fixed value and
unconditional DEL which can remove another request's lock; modify
cache.acquireLock to return a unique token (e.g., UUID) when the lock is
obtained and change cache.releaseLock to accept that token and perform a
compare-and-delete (atomic check-and-del, e.g., Redis EVAL script) or
alternatively implement a lock-renewal mechanism (extend TTL while the query
runs). Update all callers in this file that call
cache.acquireLock/cache.releaseLock (the blocks around the current SELECT, and
the other similar blocks noted) to store the returned token and pass it into
releaseLock instead of calling releaseLock unconditionally. Ensure callers also
handle the case where acquireLock returns null/false (miss) and only attempt
release when they actually hold the token.
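
For the last finding, the compare-and-delete release looks roughly like this (an ioredis-flavored sketch; the client, the `lock:` key prefix, and the function names are assumptions, not the repo's actual cache module):

```ts
import { randomUUID } from "crypto";
import Redis from "ioredis"; // assumption: adapt to the project's Redis client

const redis = new Redis();

// Acquire returns a unique token when the lock is obtained, null otherwise.
async function acquireLock(key: string, ttlSec: number): Promise<string | null> {
  const token = randomUUID();
  const ok = await redis.set(`lock:${key}`, token, "EX", ttlSec, "NX");
  return ok === "OK" ? token : null;
}

// Compare-and-delete: only the holder of `token` can release the lock, so a
// slow query that outlives its TTL cannot delete another request's lock.
const RELEASE_SCRIPT = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end`;

async function releaseLock(key: string, token: string): Promise<void> {
  await redis.eval(RELEASE_SCRIPT, 1, `lock:${key}`, token);
}
```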

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31fb8a5e-4335-4231-858b-5a7a0223ad78

📥 Commits

Reviewing files that changed from the base of the PR and between 5b30d81 and 5c9b1a4.

📒 Files selected for processing (1)
  • apps/bff/src/datasources/postgres.ts

Comment on lines +49 to +67
const cacheKey = `devices:all:${limit}`;
const cached = await cache.get<any[]>(cacheKey);
if (cached) return cached;

const locked = await cache.acquireLock(cacheKey, 5);
if (!locked) {
  await new Promise(r => setTimeout(r, 100));
  return await cache.get<any[]>(cacheKey) || [];
}

try {
  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
     FROM device_projections
     ORDER BY created_at DESC
     LIMIT $1`,
    [limit]
  );
  await cache.set(cacheKey, result.rows, 30);

⚠️ Potential issue | 🟠 Major

Make this cache path tenant-aware or explicitly admin-only.

devices:all:${limit} is a shared cross-tenant key, and the query still returns rows for every tenant. If this method is reachable from normal tenant-scoped resolver flows, one caller can warm a payload that every other caller reuses. Thread tenantId through this API and key/query on it, or split it into an explicitly admin-only path. As per coding guidelines, apps/bff/src/**/*.ts: Check all queries use tenantId for RLS. Verify cache keys include tenantId.
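
A minimal sketch of the tenant-scoped variant (the changed signature and key shape follow the suggestion above; they are not in the PR as written):

```ts
// Tenant-scoped key and predicate; stampede-lock handling elided for brevity.
async function getAllDevices(tenantId: string, limit: number): Promise<any[]> {
  const cacheKey = `devices:tenant:${tenantId}:all:${limit}`; // per-tenant key
  const cached = await cache.get<any[]>(cacheKey);
  if (cached) return cached;

  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
       FROM device_projections
      WHERE tenant_id = $1
      ORDER BY created_at DESC
      LIMIT $2`,
    [tenantId, limit],
  );
  await cache.set(cacheKey, result.rows, 30);
  return result.rows;
}
```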

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 49 - 67, The cache key and
query are tenant-agnostic: update the call site and this function to accept a
tenantId (or an explicit isAdmin flag), include tenantId in the cache key (e.g.,
`devices:tenant:${tenantId}:all:${limit}`) and add a WHERE tenant_id = $2 (or
equivalent) to the cbQuery so the SQL filters by tenant; if this endpoint must
be admin-only instead, document and enforce that by checking an isAdmin guard
before using the shared key. Ensure the cache.get/set/acquireLock use the new
tenant-scoped key and thread tenantId through any callers that invoke this code
so RLS/tenant isolation is preserved.

Comment on lines +49 to +70
const cacheKey = `devices:all:${limit}`;
const cached = await cache.get<any[]>(cacheKey);
if (cached) return cached;

const locked = await cache.acquireLock(cacheKey, 5);
if (!locked) {
  await new Promise(r => setTimeout(r, 100));
  return await cache.get<any[]>(cacheKey) || [];
}

try {
  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
     FROM device_projections
     ORDER BY created_at DESC
     LIMIT $1`,
    [limit]
  );
  await cache.set(cacheKey, result.rows, 30);
  return result.rows;
} finally {
  await cache.releaseLock(cacheKey);

⚠️ Potential issue | 🟠 Major

Do not let cache contention or Redis failures turn into empty/failed reads.

On a contested miss, the loser sleeps 100ms and then returns [] if the winner has not populated Redis yet. Separately, any cache.get / set / releaseLock error will reject the request, even when Postgres is healthy or the DB query already succeeded. These paths should fall back to the underlying DB query whenever the cache cannot serve a value.

Also applies to: 85-120, 165-205
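
One way to wire that fallback (a sketch under the same assumptions as the snippet above; error logging elided):

```ts
// Sketch: any cache failure or contested-miss timeout degrades to Postgres.
async function getAllDevicesSafe(limit: number): Promise<any[]> {
  const cacheKey = `devices:all:${limit}`;
  const queryDb = async () =>
    (
      await cbQuery(
        `SELECT device_id, tenant_id, serial_number, created_at
           FROM device_projections
          ORDER BY created_at DESC
          LIMIT $1`,
        [limit],
      )
    ).rows;

  let locked = false;
  try {
    const cached = await cache.get<any[]>(cacheKey);
    if (cached) return cached;

    locked = await cache.acquireLock(cacheKey, 5);
    if (!locked) {
      await new Promise((r) => setTimeout(r, 100));
      const warm = await cache.get<any[]>(cacheKey);
      if (warm) return warm;
      return queryDb(); // contested miss: go to the DB instead of returning []
    }
  } catch {
    return queryDb(); // Redis unavailable: serve straight from Postgres
  }

  try {
    const rows = await queryDb();
    cache.set(cacheKey, rows, 30).catch(() => {}); // best-effort cache fill
    return rows;
  } finally {
    if (locked) cache.releaseLock(cacheKey).catch(() => {}); // never mask DB result
  }
}
```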

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 49 - 70, The cache logic
currently returns [] on contested misses and lets any cache
(get/set/releaseLock) errors surface instead of falling back to Postgres; update
the flow around cacheKey, cache.get, cache.acquireLock, cache.set and
cache.releaseLock so that any cache error or a contested-miss timeout triggers a
direct cbQuery to Postgres and returns those rows, and ensure cache.releaseLock
errors are caught/logged and do not override the DB result. Concretely: on cache
miss, attempt acquireLock; if not locked, sleep then try cache.get again and if
still empty call cbQuery(...) and return its rows; wrap all
cache.get/set/releaseLock calls in try/catch to fall back to cbQuery on failure;
after a successful cbQuery, attempt cache.set but ignore/set-log any errors so
the DB result is still returned. This same pattern should be applied to the
other similar blocks indicated (lines ~85-120 and ~165-205).

Comment thread apps/bff/src/datasources/postgres.ts
Comment on lines +96 to +105
if (tenantId) {
  const result = await cbQuery(
    `SELECT t.device_id, t.temperature, t.humidity, t.recorded_at, t.updated_at, t.version
     FROM device_telemetry_latest t
     INNER JOIN device_projections d ON d.device_id = t.device_id
     WHERE d.tenant_id = $1
     ORDER BY t.updated_at DESC
     LIMIT $2`,
    [tenantId, limit]
  );

⚠️ Potential issue | 🟠 Major

Use the tenant-scoped query path on these branches.

These branches already have a tenantId, but they still call cbQuery, so the new cached reads are bypassing the module's tenant/RLS path and relying only on ad hoc predicates. Route them through the tenant-scoped helper here, keeping the explicit tenant_id filter if you want defense in depth. As per coding guidelines, apps/bff/src/**/*.ts: Check all queries use tenantId for RLS.

Also applies to: 176-187
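
Sketched against the suggestion (the helper name `tenantScopedQuery` comes from the review's wording, not the actual module):

```ts
// Route tenant-aware reads through the module's tenant/RLS pathway while
// keeping the explicit predicate. Helper name and signature are assumed.
if (tenantId) {
  const result = await tenantScopedQuery(
    tenantId, // e.g. sets the RLS tenant context before running the query
    `SELECT t.device_id, t.temperature, t.humidity, t.recorded_at, t.updated_at, t.version
       FROM device_telemetry_latest t
       INNER JOIN device_projections d ON d.device_id = t.device_id
      WHERE d.tenant_id = $1        -- defense in depth alongside RLS
      ORDER BY t.updated_at DESC
      LIMIT $2`,
    [tenantId, limit],
  );
  // ... cache result.rows as before
}
```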

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 96 - 105, The branch is
calling cbQuery directly despite having a tenantId, bypassing the module's
tenant/RLS pathway; replace the cbQuery call in this block (the query selecting
from device_telemetry_latest joined to device_projections) with the module's
tenant-scoped query helper (e.g., the tenantScopedQuery/tenantQuery helper used
elsewhere) and pass tenantId and limit through that helper while keeping the
explicit WHERE d.tenant_id = $1 predicate for defense-in-depth; apply the same
replacement for the analogous block around lines 176-187 so all tenant-aware
reads use the tenant-scoped helper rather than cbQuery.

coderabbitai (Bot) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/middleware/requestId.ts`:
- Around line 29-38: The middleware currently generates a new correlationId when
x-correlation-id is missing/invalid instead of reusing the existing requestId
and it fails to inject generated IDs back into req.headers for downstream
propagation; update the logic in the requestId middleware (variables: requestId,
correlationId, generateId, VALID_REQUEST_ID) so that if incomingCorrelationId is
absent/invalid you set correlationId = requestId (not a new random ID), and
ensure both x-request-id and x-correlation-id are written into res.setHeader and
also injected into req.headers['x-request-id'] and
req.headers['x-correlation-id'] (after trimming/validation) so downstream
requests that forward req.headers preserve tracing continuity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 987e320c-1df2-4917-9b28-a31664360425

📥 Commits

Reviewing files that changed from the base of the PR and between 5c9b1a4 and 1544192.

📒 Files selected for processing (2)
  • apps/gateway/src/middleware/requestId.ts
  • apps/gateway/src/server.ts

Comment on lines +29 to +38
const incomingCorrelationId = req.header("x-correlation-id");
const correlationId = (incomingCorrelationId && VALID_REQUEST_ID.test(incomingCorrelationId.trim()))
  ? incomingCorrelationId.trim()
  : generateId();

req.requestId = requestId;
req.correlationId = correlationId;

// Echo back for client correlation and debugging
res.setHeader("x-request-id", requestId);
res.setHeader("x-correlation-id", correlationId);

⚠️ Potential issue | 🟠 Major

Correlation fallback and propagation are inconsistent with cross-service tracing intent.

When x-correlation-id is absent/invalid, a new random ID is generated instead of reusing requestId, and generated IDs are not injected into req.headers. That breaks continuity and can prevent propagation to downstream calls that forward inbound headers.

🔧 Proposed fix
   const incomingCorrelationId = req.header("x-correlation-id");
   const correlationId = (incomingCorrelationId && VALID_REQUEST_ID.test(incomingCorrelationId.trim()))
     ? incomingCorrelationId.trim()
-    : generateId();
+    : requestId;

   req.requestId = requestId;
   req.correlationId = correlationId;
+  req.headers["x-request-id"] = requestId;
+  req.headers["x-correlation-id"] = correlationId;

   res.setHeader("x-request-id", requestId);
   res.setHeader("x-correlation-id", correlationId);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/middleware/requestId.ts` around lines 29 - 38, The
middleware currently generates a new correlationId when x-correlation-id is
missing/invalid instead of reusing the existing requestId and it fails to inject
generated IDs back into req.headers for downstream propagation; update the logic
in the requestId middleware (variables: requestId, correlationId, generateId,
VALID_REQUEST_ID) so that if incomingCorrelationId is absent/invalid you set
correlationId = requestId (not a new random ID), and ensure both x-request-id
and x-correlation-id are written into res.setHeader and also injected into
req.headers['x-request-id'] and req.headers['x-correlation-id'] (after
trimming/validation) so downstream requests that forward req.headers preserve
tracing continuity.

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Makefile`:
- Line 1: Update the .PHONY declaration to include the missing targets and add a
default all target: extend the existing .PHONY list to include all,
logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react, test-load,
lint-go, lint-ts, clean-cache, ps, and health; then add a top-level all target
(e.g., all: up or all: build up — choose the appropriate default behavior for
this Makefile) so running make with no args uses the standard default; ensure
the new all name doesn't collide with any file and that the added phony names
match the actual targets (logs-gateway, logs-bff, logs-kafka, migrate, test-go,
test-react, test-load, lint-go, lint-ts, clean-cache, ps, health).
- Around line 91-93: The Makefile target clean-cache currently runs redis-cli
FLUSHALL which wipes all Redis databases; update the target (clean-cache) to use
redis-cli FLUSHDB so only the current DB is cleared, i.e., replace the FLUSHALL
argument with FLUSHDB in the docker compose exec command that invokes redis-cli.
- Around line 101-103: The health Makefile target uses silent curl commands that
don't fail on HTTP errors; update the two curl invocations in the "health"
target to use curl -fsS (and optionally add --max-time 5) so the command will
exit non‑zero on 4xx/5xx responses and time out reliably; modify the curl lines
under the health target (the two curl commands calling
http://localhost:8086/health and http://localhost:4000/health) to include -fsS
(and --max-time 5 if desired).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 37acc8ef-7ebf-4b00-b0b6-39a59d15e802

📥 Commits

Reviewing files that changed from the base of the PR and between 1544192 and c0891ea.

📒 Files selected for processing (1)
  • Makefile

Comment thread Makefile
@@ -0,0 +1,103 @@
.PHONY: up down restart logs build seed test lint clean help

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Targets defined in Makefile =="
rg -n '^[a-zA-Z_-]+:.*$' Makefile

echo
echo "== .PHONY declaration =="
rg -n '^\.(PHONY):' Makefile

echo
echo "== Check whether required default target exists =="
rg -n '^all:' Makefile || true

Repository: pahuldeepp/GrainGuard-

Length of output: 1021


Add a default all target and complete the .PHONY list.

The .PHONY declaration on line 1 is incomplete, missing 12 targets that are actually defined later in the Makefile: logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react, test-load, lint-go, lint-ts, clean-cache, ps, and health. Additionally, the all target is missing entirely, which is a standard Makefile convention. These omissions can cause target/file name collisions and potential CI lint failures.

Proposed fix
-.PHONY: up down restart logs build seed test lint clean help
+.PHONY: all up down restart logs logs-gateway logs-bff logs-kafka build seed migrate test test-go test-react test-load lint lint-go lint-ts clean clean-cache ps health help
+
+all: up ## Default target
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-.PHONY: up down restart logs build seed test lint clean help
+.PHONY: all up down restart logs logs-gateway logs-bff logs-kafka build seed migrate test test-go test-react test-load lint lint-go lint-ts clean clean-cache ps health help
+all: up ## Default target
🧰 Tools
🪛 checkmake (0.2.2)

[warning] 1-1: Missing required phony target "all"

(minphony)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 1, Update the .PHONY declaration to include the missing
targets and add a default all target: extend the existing .PHONY list to include
all, logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react,
test-load, lint-go, lint-ts, clean-cache, ps, and health; then add a top-level
all target (e.g., all: up or all: build up — choose the appropriate default
behavior for this Makefile) so running make with no args uses the standard
default; ensure the new all name doesn't collide with any file and that the
added phony names match the actual targets (logs-gateway, logs-bff, logs-kafka,
migrate, test-go, test-react, test-load, lint-go, lint-ts, clean-cache, ps,
health).

Comment thread Makefile
Comment on lines +91 to +93
clean-cache: ## Clear Redis cache
	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and verify the Makefile exists
fd -type f -name "Makefile" | head -20

Repository: pahuldeepp/GrainGuard-

Length of output: 236


🏁 Script executed:

# Once located, read the specific lines mentioned (91-93)
cat -n Makefile | sed -n '85,100p'

Repository: pahuldeepp/GrainGuard-

Length of output: 660


🏁 Script executed:

# Check the docker-compose configuration for Redis setup
fd -type f -name "docker-compose.yml" | head -10

Repository: pahuldeepp/GrainGuard-

Length of output: 236


🏁 Script executed:

# Find docker-compose.yml with correct syntax
find . -name "docker-compose.yml" -type f

Repository: pahuldeepp/GrainGuard-

Length of output: 99


🏁 Script executed:

# Look at the docker-compose configuration
cat infra/docker/docker-compose.yml

Repository: pahuldeepp/GrainGuard-

Length of output: 14710


🏁 Script executed:

# Check if there are any references to Redis DB selection in the codebase
rg "redis|Redis|REDIS" --max-count=20 -i

Repository: pahuldeepp/GrainGuard-

Length of output: 28172


Use FLUSHDB instead of FLUSHALL in the clean-cache target.

Line 92 executes FLUSHALL, which clears all Redis databases. Multiple services (bff, gateway, read-model-builder) depend on this Redis instance for caching, rate-limiting, and idempotency locks. Use FLUSHDB to clear only the current database, limiting the scope of cache cleanup.

Fix
 clean-cache: ## Clear Redis cache
-	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL
+	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHDB
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 clean-cache: ## Clear Redis cache
-	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL
+	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHDB
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 91 - 93, The Makefile target clean-cache currently
runs redis-cli FLUSHALL which wipes all Redis databases; update the target
(clean-cache) to use redis-cli FLUSHDB so only the current DB is cleared, i.e.,
replace the FLUSHALL argument with FLUSHDB in the docker compose exec command
that invokes redis-cli.

Comment thread Makefile
Comment on lines +101 to +103
health: ## Check health of all services
	curl -s http://localhost:8086/health | jq
	curl -s http://localhost:4000/health | jq

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the Makefile
find . -name "Makefile" -type f | head -5

Repository: pahuldeepp/GrainGuard-

Length of output: 76


🏁 Script executed:

# Once we have the Makefile, check the health target
sed -n '95,110p' Makefile

Repository: pahuldeepp/GrainGuard-

Length of output: 332


🌐 Web query:

curl -s flag behavior HTTP error codes 4xx 5xx exit status

💡 Result:

  • -s / --silent only affects output: it suppresses the progress meter and curl’s error messages; it does not change how HTTP status codes map to curl’s exit status. Use -S / --show-error with -s if you want “silent except show errors”. [1][2]
  • For HTTP 4xx/5xx: by default, curl can still exit 0 even if the server responds 404, 500, etc., because the transfer itself succeeded at the HTTP layer. If you want HTTP >= 400 to be treated as a failure (non-zero exit), use -f / --fail (or --fail-with-body). [3]
  • With -f/--fail, curl uses exit code 22 for “HTTP error returned” (4xx/5xx). [4]

Common script pattern:

curl -fsS https://example.com/resource   # silent, show errors, fail (exit!=0) on 4xx/5xx

Sources: [1] [2] [3] [4]


Add -f flag to curl commands so the health target fails on HTTP errors.

Lines 102–103 use curl -s which exits 0 even on HTTP 4xx/5xx responses (the transfer itself succeeds). Without the -f flag, make health will report success when services are unhealthy. Use curl -fsS instead to fail fast on HTTP errors; optionally add --max-time 5 to prevent hangs on stalled connections.

Proposed fix
 health: ## Check health of all services
-	curl -s http://localhost:8086/health | jq
-	curl -s http://localhost:4000/health | jq
+	curl -fsS --max-time 5 http://localhost:8086/health | jq
+	curl -fsS --max-time 5 http://localhost:4000/health | jq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 101 - 103, The health Makefile target uses silent curl
commands that don't fail on HTTP errors; update the two curl invocations in the
"health" target to use curl -fsS (and optionally add --max-time 5) so the
command will exit non‑zero on 4xx/5xx responses and time out reliably; modify
the curl lines under the health target (the two curl commands calling
http://localhost:8086/health and http://localhost:4000/health) to include -fsS
(and --max-time 5 if desired).

coderabbitai (Bot) left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infra/docker/docker-compose.yml (1)

437-438: ⚠️ Potential issue | 🟠 Major

BFF is not connecting through PgBouncer as stated in PR objectives.

The PR description states "BFF now connects to Postgres via PgBouncer," but the BFF service still connects directly to postgres-read:5432. To route through PgBouncer, update the connection string.

🐛 Proposed fix to use PgBouncer
     environment:
-      - READ_DATABASE_URL=postgres://postgres:postgres@postgres-read:5432/grainguard_read
+      - READ_DATABASE_URL=postgres://postgres:postgres@pgbouncer:5432/grainguard_read

Also ensure depends_on includes pgbouncer:

     depends_on:
-      - postgres-read
-      - redis
+      pgbouncer:
+        condition: service_started
+      redis:
+        condition: service_started
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 437 - 438, The BFF service is
still using READ_DATABASE_URL pointing at postgres-read:5432 instead of
PgBouncer; update the BFF service's READ_DATABASE_URL to use the pgbouncer host
and port (e.g., pgbouncer:6432 or the project's configured pgbouncer port) and
add pgbouncer to the BFF service's depends_on so the BFF waits for pgbouncer to
start; locate the READ_DATABASE_URL environment entry in the BFF service block
and the service's depends_on list to make these changes (ensure the connection
user/db in READ_DATABASE_URL remain correct for PgBouncer).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/load-generator/main.go`:
- Around line 47-51: The current parsing of deviceIDs only trims entries but
does not remove empty strings, causing empty deviceId/aggregateId and empty
Kafka keys; update the code that builds devices from deviceIDs (the devices
slice population where deviceIDs is split and trimmed) to filter out any entries
that are empty after TrimSpace, and apply the same filtering logic wherever
device lists or splits are performed (the other parsing sites around the device
creation/iteration that produce deviceId/aggregateId and the Kafka key
generation) so only non-empty device IDs are used.
- Around line 37-39: Parse and validate the numeric environment vars instead of
ignoring errors: when converting WORKERS, RATE, and DURATION with strconv.Atoi
(the variables workers, ratePerSec, durationSec), check the error and ensure
each value is > 0; if parsing fails or the value is non-positive, either assign
a safe default or log and exit (use log.Fatalf or similar) so you never pass 0
into concurrency/rate logic. Update the initialization around
workers/ratePerSec/durationSec and add guards before using them to create
goroutines or compute ticker intervals (the code paths that compute ticker
durations and start worker loops referenced by workers and ratePerSec) so
divide-by-zero and zero-worker conditions cannot occur.
- Line 143: The json.Marshal call that creates b from evt currently ignores its
error; change the code around the json.Marshal(evt) invocation to capture the
error (e.g., b, err := json.Marshal(evt)), check err, and handle it explicitly
(log the error with context including evt and skip/return/continue as
appropriate for the surrounding function). Update any surrounding logic that
assumes b is valid to only run when err == nil so marshaling failures are not
silently ignored; reference the json.Marshal(evt) call and the variable b/evt
when making the change.

In `@infra/docker/docker-compose.yml`:
- Around line 53-54: Add missing anonymous volume declarations for the
referenced volumes so compose can create them: declare both esdata and
rabbitmq-data under the top-level volumes: section (the same volumes block that
should contain other named volumes) so they match the references used in the
service definitions (volume names "esdata" and "rabbitmq-data"); update the
volumes: section to include those two entries (no special options required
unless persistence/driver/config is needed).
- Around line 38-60: The docker-compose Elasticsearch service currently sets
xpack.security.enabled=false in the elasticsearch environment, which disables
authentication; change this for non-dev use by enabling security (remove or set
xpack.security.enabled=true) and configure required credentials and TLS in the
elasticsearch service configuration (update the elasticsearch environment and
volumes to mount certs/keystore and provide ELASTIC_PASSWORD or use a secrets
mechanism), and ensure any clients/other services that talk to the elasticsearch
service (ports 9200 mapping and healthcheck curl) are updated to use the secured
endpoint and credentials.
- Around line 156-165: The current PgBouncer environment only points to the
primary via POSTGRESQL_HOST and POSTGRESQL_DATABASE (grainguard), so PgBouncer
will not route connections to the read replica; either update the pgbouncer.ini
to declare both databases (e.g., add a mapping for grainguard_read to the
read-replica host) and reload/use that custom pgbouncer.ini, or deploy a second
PgBouncer instance configured for the replica; specifically modify the
pgbouncer.ini to include an entry for grainguard_read (pointing at the replica
host) and ensure the docker-compose environment or service that starts PgBouncer
references that custom pgbouncer.ini so connections for grainguard_read are
pooled through PgBouncer.

In `@infra/docker/loki-config.yaml`:
- Around line 43-45: Update the Loki table_manager retention settings so logs
are pruned instead of accumulating indefinitely: change
retention_deletes_enabled from false to true and set retention_period to a
sensible duration (e.g., days/weeks/months) appropriate for your environment;
modify the table_manager block (retention_deletes_enabled / retention_period) to
enable deletion and prevent unbounded growth under /loki/chunks.

In `@infra/docker/pgbouncer/pgbouncer.ini`:
- Around line 16-17: The pgbouncer config currently disables connection event
logging via the settings log_connections = 0 and log_disconnections = 0; change
these to enabled (set both to 1) or make them configurable so
connection/disconnection events are logged for debugging and pool exhaustion
investigation, ensuring the updated values are applied in the pgbouncer.ini
load/restart flow.
- Around line 1-18: The pgbouncer.ini file is unused because the
bitnami/pgbouncer image is configured via environment variables in
docker-compose.yml; either remove the unused
infra/docker/pgbouncer/pgbouncer.ini file or mount it and stop using the
conflicting env vars (e.g., POSTGRESQL_HOST, POSTGRESQL_PORT) so the container
reads the mounted pgbouncer.ini; locate references to pgbouncer.ini and to the
environment-based config in docker-compose.yml, then choose one approach: (A)
delete infra/docker/pgbouncer/pgbouncer.ini and keep the env-based config, or
(B) add a bind mount for infra/docker/pgbouncer/pgbouncer.ini into the service
and remove/override the environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT,
etc.) that would conflict so PgBouncer uses the file.

In `@infra/docker/promtail-config.yaml`:
- Around line 16-20: The relabel_configs only map __meta_docker_container_name
and __meta_docker_container_log_stream to container and stream; add relabel
rules to capture Docker Compose service labels (e.g.,
__meta_docker_container_label_com_docker_compose_service and/or
__meta_docker_container_label_com_docker_compose_project) and map them to
descriptive target labels such as service or compose_project so Grafana/Loki
queries can filter by compose service; update the relabel_configs block where
source_labels and target_label are defined to include these additional mappings
alongside the existing __meta_docker_container_name -> container and
__meta_docker_container_log_stream -> stream entries.
- Around line 5-6: The positions file is configured as filename:
/tmp/positions.yaml which will be lost on container restart; update
promtail-config.yaml to point positions.filename to a path on a persistent
volume (e.g., /data/positions.yaml) and then update your docker-compose service
for promtail (the promtail service) to mount a named volume at that container
path and declare that volume in the compose file so the positions file persists
across restarts.

---

Outside diff comments:
In `@infra/docker/docker-compose.yml`:
- Around line 437-438: The BFF service is still using READ_DATABASE_URL pointing
at postgres-read:5432 instead of PgBouncer; update the BFF service's
READ_DATABASE_URL to use the pgbouncer host and port (e.g., pgbouncer:6432 or
the project's configured pgbouncer port) and add pgbouncer to the BFF service's
depends_on so the BFF waits for pgbouncer to start; locate the READ_DATABASE_URL
environment entry in the BFF service block and the service's depends_on list to
make these changes (ensure the connection user/db in READ_DATABASE_URL remain
correct for PgBouncer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bded32f-3006-4a1b-a42a-0c68ec326424

📥 Commits

Reviewing files that changed from the base of the PR and between c0891ea and 562ad09.

📒 Files selected for processing (5)
  • apps/load-generator/main.go
  • infra/docker/docker-compose.yml
  • infra/docker/loki-config.yaml
  • infra/docker/pgbouncer/pgbouncer.ini
  • infra/docker/promtail-config.yaml

Comment on lines 37 to 39
workers, _ := strconv.Atoi(getenv("WORKERS", "10"))
ratePerSec, _ := strconv.Atoi(getenv("RATE", "1000"))
durationSec, _ := strconv.Atoi(getenv("DURATION", "30"))

⚠️ Potential issue | 🔴 Critical

Validate numeric env vars before using them in concurrency/rate control.

Line 37–39 ignore parse errors, which can make workers/ratePerSec become 0 on bad input. That can panic at Line 82/83 (divide-by-zero or non-positive ticker interval) and can stall processing when no workers are started.

Proposed fix
-	workers, _ := strconv.Atoi(getenv("WORKERS", "10"))
-	ratePerSec, _ := strconv.Atoi(getenv("RATE", "1000"))
-	durationSec, _ := strconv.Atoi(getenv("DURATION", "30"))
+	workers := mustPositiveIntEnv("WORKERS", "10")
+	ratePerSec := mustPositiveIntEnv("RATE", "1000")
+	durationSec := mustPositiveIntEnv("DURATION", "30")
...
-	interval := time.Second / time.Duration(ratePerSec)
+	interval := time.Second / time.Duration(ratePerSec)
 	ticker := time.NewTicker(interval)
 	defer ticker.Stop()
 func getenv(k, def string) string {
 	v := os.Getenv(k)
 	if v == "" {
 		return def
 	}
 	return v
 }
+
+func mustPositiveIntEnv(k, def string) int {
+	raw := strings.TrimSpace(getenv(k, def))
+	n, err := strconv.Atoi(raw)
+	if err != nil || n <= 0 {
+		log.Fatalf("invalid %s=%q: must be a positive integer", k, raw)
+	}
+	return n
+}

Also applies to: 82-83, 101-113

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` around lines 37 - 39, Parse and validate the
numeric environment vars instead of ignoring errors: when converting WORKERS,
RATE, and DURATION with strconv.Atoi (the variables workers, ratePerSec,
durationSec), check the error and ensure each value is > 0; if parsing fails or
the value is non-positive, either assign a safe default or log and exit (use
log.Fatalf or similar) so you never pass 0 into concurrency/rate logic. Update
the initialization around workers/ratePerSec/durationSec and add guards before
using them to create goroutines or compute ticker intervals (the code paths that
compute ticker durations and start worker loops referenced by workers and
ratePerSec) so divide-by-zero and zero-worker conditions cannot occur.

Comment on lines 47 to 51
if deviceIDs != "" {
devices = strings.Split(deviceIDs, ",")
for i, d := range devices {
devices[i] = strings.TrimSpace(d)
for i := range devices {
devices[i] = strings.TrimSpace(devices[i])
}

⚠️ Potential issue | 🟠 Major

Filter empty DEVICE_IDS entries before use.

Line 47–51 only trims; it doesn’t remove blank values. Inputs like ",, " generate empty deviceId/aggregateId and empty Kafka keys at Line 145–147.

Proposed fix
 	if deviceIDs != "" {
-		devices = strings.Split(deviceIDs, ",")
-		for i := range devices {
-			devices[i] = strings.TrimSpace(devices[i])
-		}
+		raw := strings.Split(deviceIDs, ",")
+		devices = make([]string, 0, len(raw))
+		for _, d := range raw {
+			d = strings.TrimSpace(d)
+			if d != "" {
+				devices = append(devices, d)
+			}
+		}
+		if len(devices) == 0 {
+			log.Fatal("DEVICE_IDS is set but contains no valid device IDs")
+		}
 	} else {

Also applies to: 128-133, 145-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` around lines 47 - 51, The current parsing of
deviceIDs only trims entries but does not remove empty strings, causing empty
deviceId/aggregateId and empty Kafka keys; update the code that builds devices
from deviceIDs (the devices slice population where deviceIDs is split and
trimmed) to filter out any entries that are empty after TrimSpace, and apply the
same filtering logic wherever device lists or splits are performed (the other
parsing sites around the device creation/iteration that produce
deviceId/aggregateId and the Kafka key generation) so only non-empty device IDs
are used.

evt.Data.RecordedAt = time.Now().UTC()
evt.Data.Humidity = 40 + float64(i%10)

b, _ := json.Marshal(evt)

🧹 Nitpick | 🔵 Trivial

Handle marshaling failures explicitly.

Ignoring json.Marshal errors at Line 143 makes bad payload states silent and harder to diagnose.

Proposed fix
-			b, _ := json.Marshal(evt)
+			b, err := json.Marshal(evt)
+			if err != nil {
+				errors.Add(1)
+				continue
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` at line 143, The json.Marshal call that creates
b from evt currently ignores its error; change the code around the
json.Marshal(evt) invocation to capture the error (e.g., b, err :=
json.Marshal(evt)), check err, and handle it explicitly (log the error with
context including evt and skip/return/continue as appropriate for the
surrounding function). Update any surrounding logic that assumes b is valid to
only run when err == nil so marshaling failures are not silently ignored;
reference the json.Marshal(evt) call and the variable b/evt when making the
change.

Comment on lines +38 to +60
  elasticsearch:
    image: elasticsearch:8.13.0
    container_name: grainguard-elasticsearch
    restart: unless-stopped
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m

    ports:
      - "9200:9200"

    networks:
      - grainguard-net

    volumes:
      - esdata:/usr/share/elasticsearch/data

    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

🧹 Nitpick | 🔵 Trivial

Elasticsearch security disabled — acceptable for dev only.

xpack.security.enabled=false disables authentication. Ensure this is not used in production environments where Elasticsearch may contain sensitive log/search data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 38 - 60, The docker-compose
Elasticsearch service currently sets xpack.security.enabled=false in the
elasticsearch environment, which disables authentication; change this for
non-dev use by enabling security (remove or set xpack.security.enabled=true) and
configure required credentials and TLS in the elasticsearch service
configuration (update the elasticsearch environment and volumes to mount
certs/keystore and provide ELASTIC_PASSWORD or use a secrets mechanism), and
ensure any clients/other services that talk to the elasticsearch service (ports
9200 mapping and healthcheck curl) are updated to use the secured endpoint and
credentials.

Comment on lines +53 to +54
    volumes:
      - esdata:/usr/share/elasticsearch/data

⚠️ Potential issue | 🔴 Critical

Missing volume declarations will cause compose errors.

esdata (line 54) and rabbitmq-data (line 89) are referenced but not declared in the volumes: section at the bottom of the file. Docker Compose will fail to create these services.

🐛 Add missing volume declarations
 volumes:
   pgdata:
   pgdata_read:
   kafka-data:
   keycloak_data:
   grafana-data:
   loki-data:
+  esdata:
+  rabbitmq-data:

Also applies to: 88-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 53 - 54, Add missing anonymous
volume declarations for the referenced volumes so compose can create them:
declare both esdata and rabbitmq-data under the top-level volumes: section (the
same volumes block that should contain other named volumes) so they match the
references used in the service definitions (volume names "esdata" and
"rabbitmq-data"); update the volumes: section to include those two entries (no
special options required unless persistence/driver/config is needed).

Comment on lines +43 to +45
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

🧹 Nitpick | 🔵 Trivial

Retention is disabled — logs will accumulate indefinitely.

With retention_deletes_enabled: false and retention_period: 0s, Loki will never prune old logs. In a long-running environment, /loki/chunks will grow unbounded, eventually exhausting disk space.

Consider enabling retention for production:

♻️ Suggested retention configuration
 table_manager:
-  retention_deletes_enabled: false
-  retention_period: 0s
+  retention_deletes_enabled: true
+  retention_period: 168h  # 7 days
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 table_manager:
-  retention_deletes_enabled: false
-  retention_period: 0s
+  retention_deletes_enabled: true
+  retention_period: 168h  # 7 days
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/loki-config.yaml` around lines 43 - 45, Update the Loki
table_manager retention settings so logs are pruned instead of accumulating
indefinitely: change retention_deletes_enabled from false to true and set
retention_period to a sensible duration (e.g., days/weeks/months) appropriate
for your environment; modify the table_manager block (retention_deletes_enabled
/ retention_period) to enable deletion and prevent unbounded growth under
/loki/chunks.

Comment on lines +1 to +18
[databases]
grainguard = host=postgres port=5432 dbname=grainguard
grainguard_read = host=postgres-read port=5432 dbname=grainguard_read

[pgbouncer]
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
log_connections = 0
log_disconnections = 0
server_reset_query = DISCARD ALL

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify if pgbouncer.ini is mounted in docker-compose.yml
rg -n "pgbouncer.ini" infra/docker/docker-compose.yml

# Check what volumes are defined for pgbouncer service
rg -A 10 "pgbouncer:" infra/docker/docker-compose.yml | head -20

Repository: pahuldeepp/GrainGuard-

Length of output: 401


Remove or mount the unused pgbouncer.ini file.

The docker-compose.yml configures PgBouncer entirely via environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT, etc.) using the bitnami/pgbouncer image, which does not mount infra/docker/pgbouncer/pgbouncer.ini. The bitnami image generates its own configuration from environment variables.

Either:

  1. Delete this unused file if it's not needed, or
  2. Mount it in docker-compose.yml and remove the conflicting environment variables to use it instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/pgbouncer/pgbouncer.ini` around lines 1 - 18, The pgbouncer.ini
file is unused because the bitnami/pgbouncer image is configured via environment
variables in docker-compose.yml; either remove the unused
infra/docker/pgbouncer/pgbouncer.ini file or mount it and stop using the
conflicting env vars (e.g., POSTGRESQL_HOST, POSTGRESQL_PORT) so the container
reads the mounted pgbouncer.ini; locate references to pgbouncer.ini and to the
environment-based config in docker-compose.yml, then choose one approach: (A)
delete infra/docker/pgbouncer/pgbouncer.ini and keep the env-based config, or
(B) add a bind mount for infra/docker/pgbouncer/pgbouncer.ini into the service
and remove/override the environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT,
etc.) that would conflict so PgBouncer uses the file.

Comment on lines +16 to +17
log_connections = 0
log_disconnections = 0

🧹 Nitpick | 🔵 Trivial

Connection logging disabled — consider enabling for debugging.

log_connections = 0 and log_disconnections = 0 suppress connection events. While this reduces log noise, it can hinder troubleshooting connection issues or pool exhaustion problems.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/pgbouncer/pgbouncer.ini` around lines 16 - 17, The pgbouncer
config currently disables connection event logging via the settings
log_connections = 0 and log_disconnections = 0; change these to enabled (set
both to 1) or make them configurable so connection/disconnection events are
logged for debugging and pool exhaustion investigation, ensuring the updated
values are applied in the pgbouncer.ini load/restart flow.

Comment on lines +5 to +6
positions:
  filename: /tmp/positions.yaml

⚠️ Potential issue | 🟡 Minor

Positions file in /tmp will be lost on restart.

Using /tmp/positions.yaml means Promtail loses track of its read position when the container restarts, potentially causing duplicate log ingestion or missed logs. Mount a persistent volume for the positions file.

♻️ Suggested fix with persistent volume

In promtail-config.yaml:

 positions:
-  filename: /tmp/positions.yaml
+  filename: /var/promtail/positions.yaml

In docker-compose.yml, add a volume to the promtail service:

   promtail:
     volumes:
       - ./promtail-config.yaml:/etc/promtail/config.yaml
       - /var/run/docker.sock:/var/run/docker.sock
       - /var/lib/docker/containers:/var/lib/docker/containers:ro
+      - promtail-positions:/var/promtail

And declare the volume:

 volumes:
   ...
   loki-data:
+  promtail-positions:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/promtail-config.yaml` around lines 5 - 6, The positions file is
configured as filename: /tmp/positions.yaml which will be lost on container
restart; update promtail-config.yaml to point positions.filename to a path on a
persistent volume (e.g., /data/positions.yaml) and then update your
docker-compose service for promtail (the promtail service) to mount a named
volume at that container path and declare that volume in the compose file so the
positions file persists across restarts.

Comment on lines +16 to +20
relabel_configs:
  - source_labels: [__meta_docker_container_name]
    target_label: container
  - source_labels: [__meta_docker_container_log_stream]
    target_label: stream

🧹 Nitpick | 🔵 Trivial

Consider adding compose service labels for better log filtering.

The current relabels only capture container name and stream. Adding compose-specific labels enables filtering by service name in Grafana/Loki queries.

♻️ Suggested additional relabel configs
     relabel_configs:
       - source_labels: [__meta_docker_container_name]
         target_label: container
       - source_labels: [__meta_docker_container_log_stream]
         target_label: stream
+      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
+        target_label: service
+      - source_labels: [__meta_docker_container_label_com_docker_compose_project]
+        target_label: project
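
With these labels applied, Loki queries can filter on the compose service directly, for example (assuming a compose service named bff):

  {service="bff"} |= "error"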
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/promtail-config.yaml` around lines 16 - 20, The relabel_configs
only map __meta_docker_container_name and __meta_docker_container_log_stream to
container and stream; add relabel rules to capture Docker Compose service labels
(e.g., __meta_docker_container_label_com_docker_compose_service and/or
__meta_docker_container_label_com_docker_compose_project) and map them to
descriptive target labels such as service or compose_project so Grafana/Loki
queries can filter by compose service; update the relabel_configs block where
source_labels and target_label are defined to include these additional mappings
alongside the existing __meta_docker_container_name -> container and
__meta_docker_container_log_stream -> stream entries.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/bootstrap.sh`:
- Line 2: Replace the loose error handling in the bootstrap script by changing
the shell options set in the top-level invocation: update the existing set -e to
set -euo pipefail so that the script also fails on unset variables and pipeline
errors; locate the initialization that contains set -e in scripts/bootstrap.sh
and make this single-line replacement (no other logic changes needed).
- Around line 16-19: The script uses relative cd commands (e.g., the four lines
with "cd apps/bff && npm install && cd ../..", "cd apps/gateway ...", etc.)
which assume the current working directory is the repo root; prepend a robust
anchor to the repository root at the top of scripts/bootstrap.sh — for example
resolve the repo root with git rev-parse --show-toplevel (or compute the script
dir via dirname "$0" and cd to its parent) and cd there or exit on failure
before running those cd/install lines so all subsequent relative paths (apps/*,
infra/*, scripts/seed/*) are executed from the repo root.
- Around line 37-39: Replace the fragile fixed delay (the "sleep 10" used after
the echo "→ Waiting for Kafka...") with a readiness probe loop that polls the
Kafka broker and only proceeds when it responds or a timeout is reached;
implement a loop in the script that repeatedly checks the broker (for example by
invoking kafka-topics.sh --bootstrap-server "$KAFKA_BOOTSTRAP" --list or a TCP
probe to the Kafka host:port) and breaks when successful, logging retries, and
exit with a non-zero status after a configurable timeout instead of
unconditionally sleeping.
- Around line 43-45: The seeding command currently swallows failures by
appending "|| true" but still prints "✅ Data seeded"; remove the "|| true" and
change the flow so the docker-compose psql command's exit code is checked and
only print "✅ Data seeded" when that command succeeds, otherwise exit non-zero
(or print an error) to surface failures; locate the docker compose psql
invocation (the line that redirects scripts/seed/seed-postgres.sh into psql) and
the subsequent echo "✅ Data seeded" and update them so success is conditional on
the psql command's actual exit status.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a51f954-cf2f-4176-bfde-acc98c65f240

📥 Commits

Reviewing files that changed from the base of the PR and between 562ad09 and 51b7622.

📒 Files selected for processing (1)
  • scripts/bootstrap.sh

Comment thread scripts/bootstrap.sh
@@ -0,0 +1,64 @@
#!/usr/bin/env bash
set -e

🧹 Nitpick | 🔵 Trivial

Use stricter shell mode for safer bootstrap behavior.

set -e alone misses unset-variable and pipeline failure cases. Prefer set -euo pipefail for bootstrap scripts.

Proposed fix
-set -e
+set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` at line 2, Replace the loose error handling in the
bootstrap script by changing the shell options set in the top-level invocation:
update the existing set -e to set -euo pipefail so that the script also fails on
unset variables and pipeline errors; locate the initialization that contains set
-e in scripts/bootstrap.sh and make this single-line replacement (no other logic
changes needed).

Comment thread scripts/bootstrap.sh
Comment on lines +16 to +19
cd apps/bff && npm install && cd ../..
cd apps/gateway && npm install && cd ../..
cd apps/dashboard && npm install && cd ../..
cd apps/jobs-worker && npm install && cd ../..

⚠️ Potential issue | 🟠 Major

Anchor execution to repository root before using relative paths.

cd apps/... assumes the caller runs the script from the repo root. If invoked from elsewhere, bootstrap fails early (and the same root cause affects later relative paths like infra/... and scripts/seed/...).

Proposed fix
 #!/usr/bin/env bash
-set -e
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+cd "${REPO_ROOT}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 16 - 19, The script uses relative cd
commands (e.g., the four lines with "cd apps/bff && npm install && cd ../..",
"cd apps/gateway ...", etc.) which assume the current working directory is the
repo root; prepend a robust anchor to the repository root at the top of
scripts/bootstrap.sh — for example resolve the repo root with git rev-parse
--show-toplevel (or compute the script dir via dirname "$0" and cd to its
parent) and cd there or exit on failure before running those cd/install lines so
all subsequent relative paths (apps/*, infra/*, scripts/seed/*) are executed
from the repo root.

Comment thread scripts/bootstrap.sh
Comment on lines +37 to +39
echo "→ Waiting for Kafka..."
sleep 10
echo "✅ Kafka ready"

⚠️ Potential issue | 🟠 Major

Replace fixed Kafka delay with an actual readiness probe.

A hardcoded sleep 10 is flaky; on slower startups, dependent services can still race and fail. Line 38 should poll for broker readiness instead of elapsed time.

Proposed fix
 # Wait for Kafka to be ready
 echo "→ Waiting for Kafka..."
-sleep 10
+until docker compose -f infra/docker/docker-compose.yml exec -T kafka \
+  kafka-topics --bootstrap-server kafka:9092 --list >/dev/null 2>&1; do
+  sleep 2
+done
 echo "✅ Kafka ready"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 37 - 39, Replace the fragile fixed delay
(the "sleep 10" used after the echo "→ Waiting for Kafka...") with a readiness
probe loop that polls the Kafka broker and only proceeds when it responds or a
timeout is reached; implement a loop in the script that repeatedly checks the
broker (for example by invoking kafka-topics.sh --bootstrap-server
"$KAFKA_BOOTSTRAP" --list or a TCP probe to the Kafka host:port) and breaks when
successful, logging retries, and exit with a non-zero status after a
configurable timeout instead of unconditionally sleeping.

Comment thread scripts/bootstrap.sh
Comment on lines +43 to +45
docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
-U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh || true
echo "✅ Data seeded"

⚠️ Potential issue | 🟠 Major

Do not suppress seeding failures while printing success.

Line 44 uses || true, so failed seed runs still report ✅ Data seeded, which can hide broken local environments.

Proposed fix
 echo "→ Seeding dev data..."
-docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
-  -U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh || true
-echo "✅ Data seeded"
+if docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
+  -U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh; then
+  echo "✅ Data seeded"
+else
+  echo "❌ Data seeding failed"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 43 - 45, The seeding command currently
swallows failures by appending "|| true" but still prints "✅ Data seeded";
remove the "|| true" and change the flow so the docker-compose psql command's
exit code is checked and only print "✅ Data seeded" when that command succeeds,
otherwise exit non-zero (or print an error) to surface failures; locate the
docker compose psql invocation (the line that redirects
scripts/seed/seed-postgres.sh into psql) and the subsequent echo "✅ Data seeded"
and update them so success is conditional on the psql command's actual exit
status.

…s, DB fallback on cache miss, token-based Redis lock ownership