
feat(bff): add PgBouncer pooling, Redis cache with stampede protectio… #5

Merged
pahuldeepp merged 12 commits into master from feat/r1-closeout on Mar 22, 2026

Conversation

pahuldeepp (Owner) commented Mar 21, 2026

What this PR does

Closes out R1 infrastructure gaps by adding PgBouncer connection pooling
and Redis cache-aside pattern with stampede protection to the BFF.

Changes

PgBouncer — Connection Pooling

  • Added infra/docker/pgbouncer/pgbouncer.ini — pool config with
    transaction mode, 200 max client connections, 20 default pool size
  • Added infra/docker/pgbouncer/userlist.txt — auth credentials
  • Added PgBouncer service to infra/docker/docker-compose.yml
  • BFF now connects via PgBouncer instead of directly to Postgres
  • Increased BFF pool max from 10 to 50 (PgBouncer enforces the real Postgres limit)

Redis Cache — Dashboard Queries

  • Wired cache.get/set from existing datasources/redis.ts into
    3 high-frequency dashboard queries:
    • getAllDevices — 30s TTL, keyed by limit
    • getAllTelemetry — 30s TTL, keyed by tenantId + limit
    • getAllDevicesWithTelemetry — 30s TTL, keyed by tenantId + limit

Stampede Protection

  • All 3 cached functions use cache.acquireLock on cache miss
  • Only 1 request queries Postgres at a time per cache key
  • Concurrent requests wait 100ms then hit warm cache
  • Lock TTL = 5s, auto-releases via releaseLock in finally block
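
In code, each cached read follows roughly this shape (a simplified sketch of the shared flow; `cache` comes from datasources/redis.ts and `cbQuery` is the existing query helper):

```ts
// Sketch of the cache-aside + lock flow shared by the three dashboard queries.
// Assumes `cache` exposes get/set/acquireLock/releaseLock as described above.
async function cachedRows(
  cacheKey: string,
  runQuery: () => Promise<any[]>,
): Promise<any[]> {
  const cached = await cache.get<any[]>(cacheKey);
  if (cached) return cached; // warm cache: no DB round trip

  const locked = await cache.acquireLock(cacheKey, 5); // 5s lock TTL
  if (!locked) {
    // Someone else is already refilling this key: wait, then read the warm cache.
    await new Promise((r) => setTimeout(r, 100));
    return (await cache.get<any[]>(cacheKey)) || [];
  }

  try {
    const rows = await runQuery(); // only the lock holder hits Postgres
    await cache.set(cacheKey, rows, 30); // 30s TTL
    return rows;
  } finally {
    await cache.releaseLock(cacheKey); // releases even if the query throws
  }
}
```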

Cleanup

  • Removed duplicate apps/bff/src/lib/cache.ts
  • Single Redis client — datasources/redis.ts is source of truth for BFF

Why

  • PgBouncer: prevents connection exhaustion under load — 50 services
    sharing 20 real Postgres connections instead of 50 direct connections
  • Cache: dashboard load goes from N DB queries to 1 query per 30s
  • Stampede protection: prevents thundering herd on cache expiry
  • Documented in ADR-006 (Redis locks) and blueprint caching strategy

Testing

  • Verify BFF connects via pgbouncer container on startup
  • Verify cache hit on second dashboard request (Redis MONITOR)
  • Verify single DB query on concurrent cache miss (check Postgres logs)
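
A rough smoke test for the last check (a hypothetical sketch for Node 18+; the endpoint and query shape are assumptions, substitute the real dashboard route):

```ts
// Fire concurrent requests against a cold cache key; the Postgres logs should
// then show a single query for the window. URL and body are placeholders.
const URL = "http://localhost:4000/graphql"; // assumed BFF address

async function main() {
  const body = JSON.stringify({ query: "{ devices { deviceId } }" }); // assumed shape
  const statuses = await Promise.all(
    Array.from({ length: 20 }, () =>
      fetch(URL, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body,
      }).then((r) => r.status),
    ),
  );
  console.log("statuses:", statuses); // expect all 200s and one DB query in the logs
}

main().catch(console.error);
```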

Summary by CodeRabbit

  • Performance Improvements
    • Added caching for device and telemetry queries and increased the DB connection pool for higher throughput.
  • New Features
    • Requests now accept and return a correlation ID alongside the request ID for improved tracing.
  • Observability
    • Logs include the correlation ID; added centralized logging, searching, metrics, and message-broker components to improve diagnostics.
  • Chores
    • Added developer automation (Makefile, bootstrap/seed scripts), local infra orchestration, and connection-pooling utilities.


coderabbitai Bot commented Mar 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds Redis-backed caching with distributed-lock fallback to BFF Postgres queries, introduces correlation IDs and logging changes in the gateway, adds a project Makefile and infra/config files (PgBouncer, Loki, Promtail), and updates load-generator event payload and DB connection defaults/pool size.

Changes

  • BFF Postgres datasource (apps/bff/src/datasources/postgres.ts)
    Added Redis cache reads/writes with tenant/limit-scoped keys and distributed-lock flow (acquire → query → set TTL → release); 100ms fallback when lock not acquired; changed READ_DB_HOST default to pgbouncer; increased pg pool max 10 → 50.
  • Gateway middleware & server logging (apps/gateway/src/middleware/requestId.ts, apps/gateway/src/server.ts)
    Middleware now accepts/validates x-correlation-id, assigns req.correlationId, echoes x-correlation-id; ID helper renamed; request logger includes correlation_id in structured logs.
  • Developer Makefile (Makefile)
    New Makefile with compose orchestration (up/down/restart/logs), image build, seed/migrate, test targets (Go, React, load), lint (go/ts), cleanup, status, and help target.
  • Compose and infra additions (infra/docker/docker-compose.yml)
    Adds/rewires services: Elasticsearch, Kibana, RabbitMQ, Loki, Promtail, Memcached, PgBouncer; reorders services, adjusts env/volumes, and updates declared volumes.
  • Loki config (infra/docker/loki-config.yaml)
    New Loki config: HTTP on 3100, in-memory ring, boltdb-shipper filesystem storage, chunk/index dirs under /loki, limits/lookback settings.
  • PgBouncer config (infra/docker/pgbouncer/pgbouncer.ini)
    New PgBouncer config mapping grainguard/grainguard_read, listening on 0.0.0.0:5432, transaction pooling, md5 auth, reserve pool settings.
  • Promtail config (infra/docker/promtail-config.yaml)
    New Promtail config: HTTP server on 9080, Docker service discovery, relabels Docker metadata to log labels, forwards to http://loki:3100/loki/api/v1/push.
  • Load generator (apps/load-generator/main.go)
    TelemetryEvent payload updated (data.recordedAt added), adjusted temperature/humidity generation ranges, uses UTC recordedAt, returns from main on shutdown instead of os.Exit.
  • Bootstrap script (scripts/bootstrap.sh)
    New bootstrap script that verifies tooling, installs npm deps, brings up compose infra, waits for Postgres, seeds the DB, and starts services.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant BFF as BFF
    participant Redis as Redis (cache)
    participant Postgres as Postgres
    participant Lock as Redis Lock

    Client->>BFF: request (getAll...)
    BFF->>Redis: GET cacheKey
    alt cache hit
        Redis-->>BFF: cached data
        BFF-->>Client: return cached data
    else cache miss
        BFF->>Lock: TRY acquire lock(cacheKey)
        alt lock acquired
            Lock-->>BFF: lock granted
            BFF->>Postgres: execute query
            Postgres-->>BFF: result
            BFF->>Redis: SET cacheKey = result (TTL)
            Redis-->>BFF: OK
            BFF->>Lock: RELEASE lock
            Lock-->>BFF: released
            BFF-->>Client: return fresh result
        else lock not acquired
            BFF-->>BFF: wait 100ms
            BFF->>Redis: GET cacheKey (fallback)
            Redis-->>BFF: stale/empty
            BFF-->>Client: return fallback result
        end
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 25.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title clearly summarizes the main changes: PgBouncer pooling and Redis caching with stampede protection, which are the primary infrastructure additions in this PR.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can suggest fixes for GitHub Check annotations.

Configure the reviews.tools.github-checks setting to adjust the time to wait for GitHub Checks to complete.

coderabbitai (Bot) left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/bff/src/datasources/postgres.ts`:
- Around line 49-67: The cache key and query are tenant-agnostic: update the
call site and this function to accept a tenantId (or an explicit isAdmin flag),
include tenantId in the cache key (e.g.,
`devices:tenant:${tenantId}:all:${limit}`) and add a WHERE tenant_id = $2 (or
equivalent) to the cbQuery so the SQL filters by tenant; if this endpoint must
be admin-only instead, document and enforce that by checking an isAdmin guard
before using the shared key. Ensure the cache.get/set/acquireLock use the new
tenant-scoped key and thread tenantId through any callers that invoke this code
so RLS/tenant isolation is preserved.
- Around line 49-70: The cache logic currently returns [] on contested misses
and lets any cache (get/set/releaseLock) errors surface instead of falling back
to Postgres; update the flow around cacheKey, cache.get, cache.acquireLock,
cache.set and cache.releaseLock so that any cache error or a contested-miss
timeout triggers a direct cbQuery to Postgres and returns those rows, and ensure
cache.releaseLock errors are caught/logged and do not override the DB result.
Concretely: on cache miss, attempt acquireLock; if not locked, sleep then try
cache.get again and if still empty call cbQuery(...) and return its rows; wrap
all cache.get/set/releaseLock calls in try/catch to fall back to cbQuery on
failure; after a successful cbQuery, attempt cache.set but ignore/set-log any
errors so the DB result is still returned. This same pattern should be applied
to the other similar blocks indicated (lines ~85-120 and ~165-205).
- Around line 96-105: The branch is calling cbQuery directly despite having a
tenantId, bypassing the module's tenant/RLS pathway; replace the cbQuery call in
this block (the query selecting from device_telemetry_latest joined to
device_projections) with the module's tenant-scoped query helper (e.g., the
tenantScopedQuery/tenantQuery helper used elsewhere) and pass tenantId and limit
through that helper while keeping the explicit WHERE d.tenant_id = $1 predicate
for defense-in-depth; apply the same replacement for the analogous block around
lines 176-187 so all tenant-aware reads use the tenant-scoped helper rather than
cbQuery.
- Around line 53-70: The current lock handling uses a fixed value and
unconditional DEL which can remove another request's lock; modify
cache.acquireLock to return a unique token (e.g., UUID) when the lock is
obtained and change cache.releaseLock to accept that token and perform a
compare-and-delete (atomic check-and-del, e.g., Redis EVAL script) or
alternatively implement a lock-renewal mechanism (extend TTL while the query
runs). Update all callers in this file that call
cache.acquireLock/cache.releaseLock (the blocks around the current SELECT, and
the other similar blocks noted) to store the returned token and pass it into
releaseLock instead of calling releaseLock unconditionally. Ensure callers also
handle the case where acquireLock returns null/false (miss) and only attempt
release when they actually hold the token.
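
For the last finding, the compare-and-delete release looks roughly like this (an ioredis-flavored sketch; the client, the `lock:` key prefix, and the function names are assumptions, not the repo's actual cache module):

```ts
import { randomUUID } from "crypto";
import Redis from "ioredis"; // assumption: adapt to the project's Redis client

const redis = new Redis();

// Acquire returns a unique token when the lock is obtained, null otherwise.
async function acquireLock(key: string, ttlSec: number): Promise<string | null> {
  const token = randomUUID();
  const ok = await redis.set(`lock:${key}`, token, "EX", ttlSec, "NX");
  return ok === "OK" ? token : null;
}

// Compare-and-delete: only the holder of `token` can release the lock, so a
// slow query that outlives its TTL cannot delete another request's lock.
const RELEASE_SCRIPT = `
if redis.call("GET", KEYS[1]) == ARGV[1] then
  return redis.call("DEL", KEYS[1])
else
  return 0
end`;

async function releaseLock(key: string, token: string): Promise<void> {
  await redis.eval(RELEASE_SCRIPT, 1, `lock:${key}`, token);
}
```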

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 31fb8a5e-4335-4231-858b-5a7a0223ad78

📥 Commits

Reviewing files that changed from the base of the PR and between 5b30d81 and 5c9b1a4.

📒 Files selected for processing (1)
  • apps/bff/src/datasources/postgres.ts

Comment on lines +49 to +67
const cacheKey = `devices:all:${limit}`;
const cached = await cache.get<any[]>(cacheKey);
if (cached) return cached;

const locked = await cache.acquireLock(cacheKey, 5);
if (!locked) {
  await new Promise(r => setTimeout(r, 100));
  return await cache.get<any[]>(cacheKey) || [];
}

try {
  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
     FROM device_projections
     ORDER BY created_at DESC
     LIMIT $1`,
    [limit]
  );
  await cache.set(cacheKey, result.rows, 30);

⚠️ Potential issue | 🟠 Major

Make this cache path tenant-aware or explicitly admin-only.

devices:all:${limit} is a shared cross-tenant key, and the query still returns rows for every tenant. If this method is reachable from normal tenant-scoped resolver flows, one caller can warm a payload that every other caller reuses. Thread tenantId through this API and key/query on it, or split it into an explicitly admin-only path. As per coding guidelines, apps/bff/src/**/*.ts: Check all queries use tenantId for RLS. Verify cache keys include tenantId.
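
A minimal sketch of the tenant-scoped variant (the changed signature and key shape follow the suggestion above; they are not in the PR as written):

```ts
// Tenant-scoped key and predicate; stampede-lock handling elided for brevity.
async function getAllDevices(tenantId: string, limit: number): Promise<any[]> {
  const cacheKey = `devices:tenant:${tenantId}:all:${limit}`; // per-tenant key
  const cached = await cache.get<any[]>(cacheKey);
  if (cached) return cached;

  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
       FROM device_projections
      WHERE tenant_id = $1
      ORDER BY created_at DESC
      LIMIT $2`,
    [tenantId, limit],
  );
  await cache.set(cacheKey, result.rows, 30);
  return result.rows;
}
```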

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 49 - 67, The cache key and
query are tenant-agnostic: update the call site and this function to accept a
tenantId (or an explicit isAdmin flag), include tenantId in the cache key (e.g.,
`devices:tenant:${tenantId}:all:${limit}`) and add a WHERE tenant_id = $2 (or
equivalent) to the cbQuery so the SQL filters by tenant; if this endpoint must
be admin-only instead, document and enforce that by checking an isAdmin guard
before using the shared key. Ensure the cache.get/set/acquireLock use the new
tenant-scoped key and thread tenantId through any callers that invoke this code
so RLS/tenant isolation is preserved.

Comment on lines +49 to +70
const cacheKey = `devices:all:${limit}`;
const cached = await cache.get<any[]>(cacheKey);
if (cached) return cached;

const locked = await cache.acquireLock(cacheKey, 5);
if (!locked) {
  await new Promise(r => setTimeout(r, 100));
  return await cache.get<any[]>(cacheKey) || [];
}

try {
  const result = await cbQuery(
    `SELECT device_id, tenant_id, serial_number, created_at
     FROM device_projections
     ORDER BY created_at DESC
     LIMIT $1`,
    [limit]
  );
  await cache.set(cacheKey, result.rows, 30);
  return result.rows;
} finally {
  await cache.releaseLock(cacheKey);

⚠️ Potential issue | 🟠 Major

Do not let cache contention or Redis failures turn into empty/failed reads.

On a contested miss, the loser sleeps 100ms and then returns [] if the winner has not populated Redis yet. Separately, any cache.get / set / releaseLock error will reject the request, even when Postgres is healthy or the DB query already succeeded. These paths should fall back to the underlying DB query whenever the cache cannot serve a value.

Also applies to: 85-120, 165-205
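
One way to wire that fallback (a sketch under the same assumptions as the snippet above; error logging elided):

```ts
// Sketch: any cache failure or contested-miss timeout degrades to Postgres.
async function getAllDevicesSafe(limit: number): Promise<any[]> {
  const cacheKey = `devices:all:${limit}`;
  const queryDb = async () =>
    (
      await cbQuery(
        `SELECT device_id, tenant_id, serial_number, created_at
           FROM device_projections
          ORDER BY created_at DESC
          LIMIT $1`,
        [limit],
      )
    ).rows;

  let locked = false;
  try {
    const cached = await cache.get<any[]>(cacheKey);
    if (cached) return cached;

    locked = await cache.acquireLock(cacheKey, 5);
    if (!locked) {
      await new Promise((r) => setTimeout(r, 100));
      const warm = await cache.get<any[]>(cacheKey);
      if (warm) return warm;
      return queryDb(); // contested miss: go to the DB instead of returning []
    }
  } catch {
    return queryDb(); // Redis unavailable: serve straight from Postgres
  }

  try {
    const rows = await queryDb();
    cache.set(cacheKey, rows, 30).catch(() => {}); // best-effort cache fill
    return rows;
  } finally {
    if (locked) cache.releaseLock(cacheKey).catch(() => {}); // never mask DB result
  }
}
```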

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 49 - 70, The cache logic
currently returns [] on contested misses and lets any cache
(get/set/releaseLock) errors surface instead of falling back to Postgres; update
the flow around cacheKey, cache.get, cache.acquireLock, cache.set and
cache.releaseLock so that any cache error or a contested-miss timeout triggers a
direct cbQuery to Postgres and returns those rows, and ensure cache.releaseLock
errors are caught/logged and do not override the DB result. Concretely: on cache
miss, attempt acquireLock; if not locked, sleep then try cache.get again and if
still empty call cbQuery(...) and return its rows; wrap all
cache.get/set/releaseLock calls in try/catch to fall back to cbQuery on failure;
after a successful cbQuery, attempt cache.set but ignore/set-log any errors so
the DB result is still returned. This same pattern should be applied to the
other similar blocks indicated (lines ~85-120 and ~165-205).

Comment thread apps/bff/src/datasources/postgres.ts
Comment on lines +96 to +105
if (tenantId) {
  const result = await cbQuery(
    `SELECT t.device_id, t.temperature, t.humidity, t.recorded_at, t.updated_at, t.version
     FROM device_telemetry_latest t
     INNER JOIN device_projections d ON d.device_id = t.device_id
     WHERE d.tenant_id = $1
     ORDER BY t.updated_at DESC
     LIMIT $2`,
    [tenantId, limit]
  );

⚠️ Potential issue | 🟠 Major

Use the tenant-scoped query path on these branches.

These branches already have a tenantId, but they still call cbQuery, so the new cached reads are bypassing the module's tenant/RLS path and relying only on ad hoc predicates. Route them through the tenant-scoped helper here, keeping the explicit tenant_id filter if you want defense in depth. As per coding guidelines, apps/bff/src/**/*.ts: Check all queries use tenantId for RLS.

Also applies to: 176-187
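
Sketched against the suggestion (the helper name `tenantScopedQuery` comes from the review's wording, not the actual module):

```ts
// Route tenant-aware reads through the module's tenant/RLS pathway while
// keeping the explicit predicate. Helper name and signature are assumed.
if (tenantId) {
  const result = await tenantScopedQuery(
    tenantId, // e.g. sets the RLS tenant context before running the query
    `SELECT t.device_id, t.temperature, t.humidity, t.recorded_at, t.updated_at, t.version
       FROM device_telemetry_latest t
       INNER JOIN device_projections d ON d.device_id = t.device_id
      WHERE d.tenant_id = $1        -- defense in depth alongside RLS
      ORDER BY t.updated_at DESC
      LIMIT $2`,
    [tenantId, limit],
  );
  // ... cache result.rows as before
}
```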

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/bff/src/datasources/postgres.ts` around lines 96 - 105, The branch is
calling cbQuery directly despite having a tenantId, bypassing the module's
tenant/RLS pathway; replace the cbQuery call in this block (the query selecting
from device_telemetry_latest joined to device_projections) with the module's
tenant-scoped query helper (e.g., the tenantScopedQuery/tenantQuery helper used
elsewhere) and pass tenantId and limit through that helper while keeping the
explicit WHERE d.tenant_id = $1 predicate for defense-in-depth; apply the same
replacement for the analogous block around lines 176-187 so all tenant-aware
reads use the tenant-scoped helper rather than cbQuery.

coderabbitai (Bot) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/gateway/src/middleware/requestId.ts`:
- Around line 29-38: The middleware currently generates a new correlationId when
x-correlation-id is missing/invalid instead of reusing the existing requestId
and it fails to inject generated IDs back into req.headers for downstream
propagation; update the logic in the requestId middleware (variables: requestId,
correlationId, generateId, VALID_REQUEST_ID) so that if incomingCorrelationId is
absent/invalid you set correlationId = requestId (not a new random ID), and
ensure both x-request-id and x-correlation-id are written into res.setHeader and
also injected into req.headers['x-request-id'] and
req.headers['x-correlation-id'] (after trimming/validation) so downstream
requests that forward req.headers preserve tracing continuity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 987e320c-1df2-4917-9b28-a31664360425

📥 Commits

Reviewing files that changed from the base of the PR and between 5c9b1a4 and 1544192.

📒 Files selected for processing (2)
  • apps/gateway/src/middleware/requestId.ts
  • apps/gateway/src/server.ts

Comment on lines +29 to +38
const incomingCorrelationId = req.header("x-correlation-id");
const correlationId = (incomingCorrelationId && VALID_REQUEST_ID.test(incomingCorrelationId.trim()))
  ? incomingCorrelationId.trim()
  : generateId();

req.requestId = requestId;
req.correlationId = correlationId;

// Echo back for client correlation and debugging
res.setHeader("x-request-id", requestId);
res.setHeader("x-correlation-id", correlationId);

⚠️ Potential issue | 🟠 Major

Correlation fallback and propagation are inconsistent with cross-service tracing intent.

When x-correlation-id is absent/invalid, a new random ID is generated instead of reusing requestId, and generated IDs are not injected into req.headers. That breaks continuity and can prevent propagation to downstream calls that forward inbound headers.

🔧 Proposed fix
   const incomingCorrelationId = req.header("x-correlation-id");
   const correlationId = (incomingCorrelationId && VALID_REQUEST_ID.test(incomingCorrelationId.trim()))
     ? incomingCorrelationId.trim()
-    : generateId();
+    : requestId;

   req.requestId = requestId;
   req.correlationId = correlationId;
+  req.headers["x-request-id"] = requestId;
+  req.headers["x-correlation-id"] = correlationId;

   res.setHeader("x-request-id", requestId);
   res.setHeader("x-correlation-id", correlationId);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/gateway/src/middleware/requestId.ts` around lines 29 - 38, The
middleware currently generates a new correlationId when x-correlation-id is
missing/invalid instead of reusing the existing requestId and it fails to inject
generated IDs back into req.headers for downstream propagation; update the logic
in the requestId middleware (variables: requestId, correlationId, generateId,
VALID_REQUEST_ID) so that if incomingCorrelationId is absent/invalid you set
correlationId = requestId (not a new random ID), and ensure both x-request-id
and x-correlation-id are written into res.setHeader and also injected into
req.headers['x-request-id'] and req.headers['x-correlation-id'] (after
trimming/validation) so downstream requests that forward req.headers preserve
tracing continuity.

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Makefile`:
- Line 1: Update the .PHONY declaration to include the missing targets and add a
default all target: extend the existing .PHONY list to include all,
logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react, test-load,
lint-go, lint-ts, clean-cache, ps, and health; then add a top-level all target
(e.g., all: up or all: build up — choose the appropriate default behavior for
this Makefile) so running make with no args uses the standard default; ensure
the new all name doesn't collide with any file and that the added phony names
match the actual targets (logs-gateway, logs-bff, logs-kafka, migrate, test-go,
test-react, test-load, lint-go, lint-ts, clean-cache, ps, health).
- Around line 91-93: The Makefile target clean-cache currently runs redis-cli
FLUSHALL which wipes all Redis databases; update the target (clean-cache) to use
redis-cli FLUSHDB so only the current DB is cleared, i.e., replace the FLUSHALL
argument with FLUSHDB in the docker compose exec command that invokes redis-cli.
- Around line 101-103: The health Makefile target uses silent curl commands that
don't fail on HTTP errors; update the two curl invocations in the "health"
target to use curl -fsS (and optionally add --max-time 5) so the command will
exit non‑zero on 4xx/5xx responses and time out reliably; modify the curl lines
under the health target (the two curl commands calling
http://localhost:8086/health and http://localhost:4000/health) to include -fsS
(and --max-time 5 if desired).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 37acc8ef-7ebf-4b00-b0b6-39a59d15e802

📥 Commits

Reviewing files that changed from the base of the PR and between 1544192 and c0891ea.

📒 Files selected for processing (1)
  • Makefile

Comment thread Makefile
@@ -0,0 +1,103 @@
.PHONY: up down restart logs build seed test lint clean help

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Targets defined in Makefile =="
rg -n '^[a-zA-Z_-]+:.*$' Makefile

echo
echo "== .PHONY declaration =="
rg -n '^\.(PHONY):' Makefile

echo
echo "== Check whether required default target exists =="
rg -n '^all:' Makefile || true

Repository: pahuldeepp/GrainGuard-

Length of output: 1021


Add a default all target and complete the .PHONY list.

The .PHONY declaration on line 1 is incomplete, missing 12 targets that are actually defined later in the Makefile: logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react, test-load, lint-go, lint-ts, clean-cache, ps, and health. Additionally, the all target is missing entirely, which is a standard Makefile convention. These omissions can cause target/file name collisions and potential CI lint failures.

Proposed fix
-.PHONY: up down restart logs build seed test lint clean help
+.PHONY: all up down restart logs logs-gateway logs-bff logs-kafka build seed migrate test test-go test-react test-load lint lint-go lint-ts clean clean-cache ps health help
+
+all: up ## Default target
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-.PHONY: up down restart logs build seed test lint clean help
+.PHONY: all up down restart logs logs-gateway logs-bff logs-kafka build seed migrate test test-go test-react test-load lint lint-go lint-ts clean clean-cache ps health help
+all: up ## Default target
🧰 Tools
🪛 checkmake (0.2.2)

[warning] 1-1: Missing required phony target "all"

(minphony)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` at line 1, Update the .PHONY declaration to include the missing
targets and add a default all target: extend the existing .PHONY list to include
all, logs-gateway, logs-bff, logs-kafka, migrate, test-go, test-react,
test-load, lint-go, lint-ts, clean-cache, ps, and health; then add a top-level
all target (e.g., all: up or all: build up — choose the appropriate default
behavior for this Makefile) so running make with no args uses the standard
default; ensure the new all name doesn't collide with any file and that the
added phony names match the actual targets (logs-gateway, logs-bff, logs-kafka,
migrate, test-go, test-react, test-load, lint-go, lint-ts, clean-cache, ps,
health).

Comment thread Makefile
Comment on lines +91 to +93
clean-cache: ## Clear Redis cache
	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find and verify the Makefile exists
fd -type f -name "Makefile" | head -20

Repository: pahuldeepp/GrainGuard-

Length of output: 236


🏁 Script executed:

# Once located, read the specific lines mentioned (91-93)
cat -n Makefile | sed -n '85,100p'

Repository: pahuldeepp/GrainGuard-

Length of output: 660


🏁 Script executed:

# Check the docker-compose configuration for Redis setup
fd -type f -name "docker-compose.yml" | head -10

Repository: pahuldeepp/GrainGuard-

Length of output: 236


🏁 Script executed:

# Find docker-compose.yml with correct syntax
find . -name "docker-compose.yml" -type f

Repository: pahuldeepp/GrainGuard-

Length of output: 99


🏁 Script executed:

# Look at the docker-compose configuration
cat infra/docker/docker-compose.yml

Repository: pahuldeepp/GrainGuard-

Length of output: 14710


🏁 Script executed:

# Check if there are any references to Redis DB selection in the codebase
rg "redis|Redis|REDIS" --max-count=20 -i

Repository: pahuldeepp/GrainGuard-

Length of output: 28172


Use FLUSHDB instead of FLUSHALL in the clean-cache target.

Line 92 executes FLUSHALL, which clears all Redis databases. Multiple services (bff, gateway, read-model-builder) depend on this Redis instance for caching, rate-limiting, and idempotency locks. Use FLUSHDB to clear only the current database, limiting the scope of cache cleanup.

Fix
 clean-cache: ## Clear Redis cache
-	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL
+	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHDB
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 clean-cache: ## Clear Redis cache
-	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHALL
+	docker compose -f infra/docker/docker-compose.yml exec redis redis-cli FLUSHDB
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 91 - 93, The Makefile target clean-cache currently
runs redis-cli FLUSHALL which wipes all Redis databases; update the target
(clean-cache) to use redis-cli FLUSHDB so only the current DB is cleared, i.e.,
replace the FLUSHALL argument with FLUSHDB in the docker compose exec command
that invokes redis-cli.

Comment thread Makefile
Comment on lines +101 to +103
health: ## Check health of all services
	curl -s http://localhost:8086/health | jq
	curl -s http://localhost:4000/health | jq

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the Makefile
find . -name "Makefile" -type f | head -5

Repository: pahuldeepp/GrainGuard-

Length of output: 76


🏁 Script executed:

# Once we have the Makefile, check the health target
sed -n '95,110p' Makefile

Repository: pahuldeepp/GrainGuard-

Length of output: 332


🌐 Web query:

curl -s flag behavior HTTP error codes 4xx 5xx exit status

💡 Result:

  • -s / --silent only affects output: it suppresses the progress meter and curl’s error messages; it does not change how HTTP status codes map to curl’s exit status. Use -S / --show-error with -s if you want “silent except show errors”. [1][2]
  • For HTTP 4xx/5xx: by default, curl can still exit 0 even if the server responds 404, 500, etc., because the transfer itself succeeded at the HTTP layer. If you want HTTP >= 400 to be treated as a failure (non-zero exit), use -f / --fail (or --fail-with-body). [3]
  • With -f/--fail, curl uses exit code 22 for “HTTP error returned” (4xx/5xx). [4]

Common script pattern:

curl -fsS https://example.com/resource   # silent, show errors, fail (exit!=0) on 4xx/5xx

Sources: [1] [2] [3] [4]


Add -f flag to curl commands so the health target fails on HTTP errors.

Lines 102–103 use curl -s which exits 0 even on HTTP 4xx/5xx responses (the transfer itself succeeds). Without the -f flag, make health will report success when services are unhealthy. Use curl -fsS instead to fail fast on HTTP errors; optionally add --max-time 5 to prevent hangs on stalled connections.

Proposed fix
 health: ## Check health of all services
-	curl -s http://localhost:8086/health | jq
-	curl -s http://localhost:4000/health | jq
+	curl -fsS --max-time 5 http://localhost:8086/health | jq
+	curl -fsS --max-time 5 http://localhost:4000/health | jq
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 101 - 103, The health Makefile target uses silent curl
commands that don't fail on HTTP errors; update the two curl invocations in the
"health" target to use curl -fsS (and optionally add --max-time 5) so the
command will exit non‑zero on 4xx/5xx responses and time out reliably; modify
the curl lines under the health target (the two curl commands calling
http://localhost:8086/health and http://localhost:4000/health) to include -fsS
(and --max-time 5 if desired).

coderabbitai (Bot) left a comment

Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
infra/docker/docker-compose.yml (1)

437-438: ⚠️ Potential issue | 🟠 Major

BFF is not connecting through PgBouncer as stated in PR objectives.

The PR description states "BFF now connects to Postgres via PgBouncer," but the BFF service still connects directly to postgres-read:5432. To route through PgBouncer, update the connection string.

🐛 Proposed fix to use PgBouncer
     environment:
-      - READ_DATABASE_URL=postgres://postgres:postgres@postgres-read:5432/grainguard_read
+      - READ_DATABASE_URL=postgres://postgres:postgres@pgbouncer:5432/grainguard_read

Also ensure depends_on includes pgbouncer:

     depends_on:
-      - postgres-read
-      - redis
+      pgbouncer:
+        condition: service_started
+      redis:
+        condition: service_started
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 437 - 438, The BFF service is
still using READ_DATABASE_URL pointing at postgres-read:5432 instead of
PgBouncer; update the BFF service's READ_DATABASE_URL to use the pgbouncer host
and port (e.g., pgbouncer:6432 or the project's configured pgbouncer port) and
add pgbouncer to the BFF service's depends_on so the BFF waits for pgbouncer to
start; locate the READ_DATABASE_URL environment entry in the BFF service block
and the service's depends_on list to make these changes (ensure the connection
user/db in READ_DATABASE_URL remain correct for PgBouncer).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/load-generator/main.go`:
- Around line 47-51: The current parsing of deviceIDs only trims entries but
does not remove empty strings, causing empty deviceId/aggregateId and empty
Kafka keys; update the code that builds devices from deviceIDs (the devices
slice population where deviceIDs is split and trimmed) to filter out any entries
that are empty after TrimSpace, and apply the same filtering logic wherever
device lists or splits are performed (the other parsing sites around the device
creation/iteration that produce deviceId/aggregateId and the Kafka key
generation) so only non-empty device IDs are used.
- Around line 37-39: Parse and validate the numeric environment vars instead of
ignoring errors: when converting WORKERS, RATE, and DURATION with strconv.Atoi
(the variables workers, ratePerSec, durationSec), check the error and ensure
each value is > 0; if parsing fails or the value is non-positive, either assign
a safe default or log and exit (use log.Fatalf or similar) so you never pass 0
into concurrency/rate logic. Update the initialization around
workers/ratePerSec/durationSec and add guards before using them to create
goroutines or compute ticker intervals (the code paths that compute ticker
durations and start worker loops referenced by workers and ratePerSec) so
divide-by-zero and zero-worker conditions cannot occur.
- Line 143: The json.Marshal call that creates b from evt currently ignores its
error; change the code around the json.Marshal(evt) invocation to capture the
error (e.g., b, err := json.Marshal(evt)), check err, and handle it explicitly
(log the error with context including evt and skip/return/continue as
appropriate for the surrounding function). Update any surrounding logic that
assumes b is valid to only run when err == nil so marshaling failures are not
silently ignored; reference the json.Marshal(evt) call and the variable b/evt
when making the change.

In `@infra/docker/docker-compose.yml`:
- Around line 53-54: Add missing anonymous volume declarations for the
referenced volumes so compose can create them: declare both esdata and
rabbitmq-data under the top-level volumes: section (the same volumes block that
should contain other named volumes) so they match the references used in the
service definitions (volume names "esdata" and "rabbitmq-data"); update the
volumes: section to include those two entries (no special options required
unless persistence/driver/config is needed).
- Around line 38-60: The docker-compose Elasticsearch service currently sets
xpack.security.enabled=false in the elasticsearch environment, which disables
authentication; change this for non-dev use by enabling security (remove or set
xpack.security.enabled=true) and configure required credentials and TLS in the
elasticsearch service configuration (update the elasticsearch environment and
volumes to mount certs/keystore and provide ELASTIC_PASSWORD or use a secrets
mechanism), and ensure any clients/other services that talk to the elasticsearch
service (ports 9200 mapping and healthcheck curl) are updated to use the secured
endpoint and credentials.
- Around line 156-165: The current PgBouncer environment only points to the
primary via POSTGRESQL_HOST and POSTGRESQL_DATABASE (grainguard), so PgBouncer
will not route connections to the read replica; either update the pgbouncer.ini
to declare both databases (e.g., add a mapping for grainguard_read to the
read-replica host) and reload/use that custom pgbouncer.ini, or deploy a second
PgBouncer instance configured for the replica; specifically modify the
pgbouncer.ini to include an entry for grainguard_read (pointing at the replica
host) and ensure the docker-compose environment or service that starts PgBouncer
references that custom pgbouncer.ini so connections for grainguard_read are
pooled through PgBouncer.

In `@infra/docker/loki-config.yaml`:
- Around line 43-45: Update the Loki table_manager retention settings so logs
are pruned instead of accumulating indefinitely: change
retention_deletes_enabled from false to true and set retention_period to a
sensible duration (e.g., days/weeks/months) appropriate for your environment;
modify the table_manager block (retention_deletes_enabled / retention_period) to
enable deletion and prevent unbounded growth under /loki/chunks.

In `@infra/docker/pgbouncer/pgbouncer.ini`:
- Around line 16-17: The pgbouncer config currently disables connection event
logging via the settings log_connections = 0 and log_disconnections = 0; change
these to enabled (set both to 1) or make them configurable so
connection/disconnection events are logged for debugging and pool exhaustion
investigation, ensuring the updated values are applied in the pgbouncer.ini
load/restart flow.
- Around line 1-18: The pgbouncer.ini file is unused because the
bitnami/pgbouncer image is configured via environment variables in
docker-compose.yml; either remove the unused
infra/docker/pgbouncer/pgbouncer.ini file or mount it and stop using the
conflicting env vars (e.g., POSTGRESQL_HOST, POSTGRESQL_PORT) so the container
reads the mounted pgbouncer.ini; locate references to pgbouncer.ini and to the
environment-based config in docker-compose.yml, then choose one approach: (A)
delete infra/docker/pgbouncer/pgbouncer.ini and keep the env-based config, or
(B) add a bind mount for infra/docker/pgbouncer/pgbouncer.ini into the service
and remove/override the environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT,
etc.) that would conflict so PgBouncer uses the file.

In `@infra/docker/promtail-config.yaml`:
- Around line 16-20: The relabel_configs only map __meta_docker_container_name
and __meta_docker_container_log_stream to container and stream; add relabel
rules to capture Docker Compose service labels (e.g.,
__meta_docker_container_label_com_docker_compose_service and/or
__meta_docker_container_label_com_docker_compose_project) and map them to
descriptive target labels such as service or compose_project so Grafana/Loki
queries can filter by compose service; update the relabel_configs block where
source_labels and target_label are defined to include these additional mappings
alongside the existing __meta_docker_container_name -> container and
__meta_docker_container_log_stream -> stream entries.
- Around line 5-6: The positions file is configured as filename:
/tmp/positions.yaml which will be lost on container restart; update
promtail-config.yaml to point positions.filename to a path on a persistent
volume (e.g., /data/positions.yaml) and then update your docker-compose service
for promtail (the promtail service) to mount a named volume at that container
path and declare that volume in the compose file so the positions file persists
across restarts.

---

Outside diff comments:
In `@infra/docker/docker-compose.yml`:
- Around line 437-438: The BFF service is still using READ_DATABASE_URL pointing
at postgres-read:5432 instead of PgBouncer; update the BFF service's
READ_DATABASE_URL to use the pgbouncer host and port (e.g., pgbouncer:6432 or
the project's configured pgbouncer port) and add pgbouncer to the BFF service's
depends_on so the BFF waits for pgbouncer to start; locate the READ_DATABASE_URL
environment entry in the BFF service block and the service's depends_on list to
make these changes (ensure the connection user/db in READ_DATABASE_URL remain
correct for PgBouncer).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bded32f-3006-4a1b-a42a-0c68ec326424

📥 Commits

Reviewing files that changed from the base of the PR and between c0891ea and 562ad09.

📒 Files selected for processing (5)
  • apps/load-generator/main.go
  • infra/docker/docker-compose.yml
  • infra/docker/loki-config.yaml
  • infra/docker/pgbouncer/pgbouncer.ini
  • infra/docker/promtail-config.yaml

Comment on lines 37 to 39
workers, _ := strconv.Atoi(getenv("WORKERS", "10"))
ratePerSec, _ := strconv.Atoi(getenv("RATE", "1000"))
durationSec, _ := strconv.Atoi(getenv("DURATION", "30"))

⚠️ Potential issue | 🔴 Critical

Validate numeric env vars before using them in concurrency/rate control.

Line 37–39 ignore parse errors, which can make workers/ratePerSec become 0 on bad input. That can panic at Line 82/83 (divide-by-zero or non-positive ticker interval) and can stall processing when no workers are started.

Proposed fix
-	workers, _ := strconv.Atoi(getenv("WORKERS", "10"))
-	ratePerSec, _ := strconv.Atoi(getenv("RATE", "1000"))
-	durationSec, _ := strconv.Atoi(getenv("DURATION", "30"))
+	workers := mustPositiveIntEnv("WORKERS", "10")
+	ratePerSec := mustPositiveIntEnv("RATE", "1000")
+	durationSec := mustPositiveIntEnv("DURATION", "30")
...
-	interval := time.Second / time.Duration(ratePerSec)
+	interval := time.Second / time.Duration(ratePerSec)
 	ticker := time.NewTicker(interval)
 	defer ticker.Stop()
 func getenv(k, def string) string {
 	v := os.Getenv(k)
 	if v == "" {
 		return def
 	}
 	return v
 }
+
+func mustPositiveIntEnv(k, def string) int {
+	raw := strings.TrimSpace(getenv(k, def))
+	n, err := strconv.Atoi(raw)
+	if err != nil || n <= 0 {
+		log.Fatalf("invalid %s=%q: must be a positive integer", k, raw)
+	}
+	return n
+}

Also applies to: 82-83, 101-113

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` around lines 37 - 39, Parse and validate the
numeric environment vars instead of ignoring errors: when converting WORKERS,
RATE, and DURATION with strconv.Atoi (the variables workers, ratePerSec,
durationSec), check the error and ensure each value is > 0; if parsing fails or
the value is non-positive, either assign a safe default or log and exit (use
log.Fatalf or similar) so you never pass 0 into concurrency/rate logic. Update
the initialization around workers/ratePerSec/durationSec and add guards before
using them to create goroutines or compute ticker intervals (the code paths that
compute ticker durations and start worker loops referenced by workers and
ratePerSec) so divide-by-zero and zero-worker conditions cannot occur.

Comment on lines 47 to 51
if deviceIDs != "" {
devices = strings.Split(deviceIDs, ",")
for i, d := range devices {
devices[i] = strings.TrimSpace(d)
for i := range devices {
devices[i] = strings.TrimSpace(devices[i])
}

⚠️ Potential issue | 🟠 Major

Filter empty DEVICE_IDS entries before use.

Line 47–51 only trims; it doesn’t remove blank values. Inputs like ",, " generate empty deviceId/aggregateId and empty Kafka keys at Line 145–147.

Proposed fix
 	if deviceIDs != "" {
-		devices = strings.Split(deviceIDs, ",")
-		for i := range devices {
-			devices[i] = strings.TrimSpace(devices[i])
-		}
+		raw := strings.Split(deviceIDs, ",")
+		devices = make([]string, 0, len(raw))
+		for _, d := range raw {
+			d = strings.TrimSpace(d)
+			if d != "" {
+				devices = append(devices, d)
+			}
+		}
+		if len(devices) == 0 {
+			log.Fatal("DEVICE_IDS is set but contains no valid device IDs")
+		}
 	} else {

Also applies to: 128-133, 145-147

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` around lines 47 - 51, The current parsing of
deviceIDs only trims entries but does not remove empty strings, causing empty
deviceId/aggregateId and empty Kafka keys; update the code that builds devices
from deviceIDs (the devices slice population where deviceIDs is split and
trimmed) to filter out any entries that are empty after TrimSpace, and apply the
same filtering logic wherever device lists or splits are performed (the other
parsing sites around the device creation/iteration that produce
deviceId/aggregateId and the Kafka key generation) so only non-empty device IDs
are used.

evt.Data.RecordedAt = time.Now().UTC()
evt.Data.Humidity = 40 + float64(i%10)

b, _ := json.Marshal(evt)

🧹 Nitpick | 🔵 Trivial

Handle marshaling failures explicitly.

Ignoring json.Marshal errors at Line 143 makes bad payload states silent and harder to diagnose.

Proposed fix
-			b, _ := json.Marshal(evt)
+			b, err := json.Marshal(evt)
+			if err != nil {
+				errors.Add(1)
+				continue
+			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/load-generator/main.go` at line 143, The json.Marshal call that creates
b from evt currently ignores its error; change the code around the
json.Marshal(evt) invocation to capture the error (e.g., b, err :=
json.Marshal(evt)), check err, and handle it explicitly (log the error with
context including evt and skip/return/continue as appropriate for the
surrounding function). Update any surrounding logic that assumes b is valid to
only run when err == nil so marshaling failures are not silently ignored;
reference the json.Marshal(evt) call and the variable b/evt when making the
change.

Comment on lines +38 to +60
  elasticsearch:
    image: elasticsearch:8.13.0
    container_name: grainguard-elasticsearch
    restart: unless-stopped
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms512m -Xmx512m

    ports:
      - "9200:9200"

    networks:
      - grainguard-net

    volumes:
      - esdata:/usr/share/elasticsearch/data

    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

🧹 Nitpick | 🔵 Trivial

Elasticsearch security disabled — acceptable for dev only.

xpack.security.enabled=false disables authentication. Ensure this is not used in production environments where Elasticsearch may contain sensitive log/search data.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 38 - 60, The docker-compose
Elasticsearch service currently sets xpack.security.enabled=false in the
elasticsearch environment, which disables authentication; change this for
non-dev use by enabling security (remove or set xpack.security.enabled=true) and
configure required credentials and TLS in the elasticsearch service
configuration (update the elasticsearch environment and volumes to mount
certs/keystore and provide ELASTIC_PASSWORD or use a secrets mechanism), and
ensure any clients/other services that talk to the elasticsearch service (ports
9200 mapping and healthcheck curl) are updated to use the secured endpoint and
credentials.

Comment on lines +53 to +54
    volumes:
      - esdata:/usr/share/elasticsearch/data

⚠️ Potential issue | 🔴 Critical

Missing volume declarations will cause compose errors.

esdata (line 54) and rabbitmq-data (line 89) are referenced but not declared in the volumes: section at the bottom of the file. Docker Compose will fail to create these services.

🐛 Add missing volume declarations
 volumes:
   pgdata:
   pgdata_read:
   kafka-data:
   keycloak_data:
   grafana-data:
   loki-data:
+  esdata:
+  rabbitmq-data:

Also applies to: 88-89

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/docker-compose.yml` around lines 53 - 54, Add missing anonymous
volume declarations for the referenced volumes so compose can create them:
declare both esdata and rabbitmq-data under the top-level volumes: section (the
same volumes block that should contain other named volumes) so they match the
references used in the service definitions (volume names "esdata" and
"rabbitmq-data"); update the volumes: section to include those two entries (no
special options required unless persistence/driver/config is needed).

Comment on lines +43 to +45
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

🧹 Nitpick | 🔵 Trivial

Retention is disabled — logs will accumulate indefinitely.

With retention_deletes_enabled: false and retention_period: 0s, Loki will never prune old logs. In a long-running environment, /loki/chunks will grow unbounded, eventually exhausting disk space.

Consider enabling retention for production:

♻️ Suggested retention configuration
 table_manager:
-  retention_deletes_enabled: false
-  retention_period: 0s
+  retention_deletes_enabled: true
+  retention_period: 168h  # 7 days
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
 table_manager:
-  retention_deletes_enabled: false
-  retention_period: 0s
+  retention_deletes_enabled: true
+  retention_period: 168h  # 7 days
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/loki-config.yaml` around lines 43 - 45, Update the Loki
table_manager retention settings so logs are pruned instead of accumulating
indefinitely: change retention_deletes_enabled from false to true and set
retention_period to a sensible duration (e.g., days/weeks/months) appropriate
for your environment; modify the table_manager block (retention_deletes_enabled
/ retention_period) to enable deletion and prevent unbounded growth under
/loki/chunks.

Comment on lines +1 to +18
[databases]
grainguard = host=postgres port=5432 dbname=grainguard
grainguard_read = host=postgres-read port=5432 dbname=grainguard_read

[pgbouncer]
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 200
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
log_connections = 0
log_disconnections = 0
server_reset_query = DISCARD ALL

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify if pgbouncer.ini is mounted in docker-compose.yml
rg -n "pgbouncer.ini" infra/docker/docker-compose.yml

# Check what volumes are defined for pgbouncer service
rg -A 10 "pgbouncer:" infra/docker/docker-compose.yml | head -20

Repository: pahuldeepp/GrainGuard-

Length of output: 401


Remove or mount the unused pgbouncer.ini file.

The docker-compose.yml configures PgBouncer entirely via environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT, etc.) using the bitnami/pgbouncer image, which does not mount infra/docker/pgbouncer/pgbouncer.ini. The bitnami image generates its own configuration from environment variables.

Either:

  1. Delete this unused file if it's not needed, or
  2. Mount it in docker-compose.yml and remove the conflicting environment variables to use it instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/pgbouncer/pgbouncer.ini` around lines 1 - 18, The pgbouncer.ini
file is unused because the bitnami/pgbouncer image is configured via environment
variables in docker-compose.yml; either remove the unused
infra/docker/pgbouncer/pgbouncer.ini file or mount it and stop using the
conflicting env vars (e.g., POSTGRESQL_HOST, POSTGRESQL_PORT) so the container
reads the mounted pgbouncer.ini; locate references to pgbouncer.ini and to the
environment-based config in docker-compose.yml, then choose one approach: (A)
delete infra/docker/pgbouncer/pgbouncer.ini and keep the env-based config, or
(B) add a bind mount for infra/docker/pgbouncer/pgbouncer.ini into the service
and remove/override the environment variables (POSTGRESQL_HOST, POSTGRESQL_PORT,
etc.) that would conflict so PgBouncer uses the file.

Comment on lines +16 to +17
log_connections = 0
log_disconnections = 0

🧹 Nitpick | 🔵 Trivial

Connection logging disabled — consider enabling for debugging.

log_connections = 0 and log_disconnections = 0 suppress connection events. While this reduces log noise, it can hinder troubleshooting connection issues or pool exhaustion problems.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/pgbouncer/pgbouncer.ini` around lines 16 - 17, The pgbouncer
config currently disables connection event logging via the settings
log_connections = 0 and log_disconnections = 0; change these to enabled (set
both to 1) or make them configurable so connection/disconnection events are
logged for debugging and pool exhaustion investigation, ensuring the updated
values are applied in the pgbouncer.ini load/restart flow.

Comment on lines +5 to +6
positions:
  filename: /tmp/positions.yaml

⚠️ Potential issue | 🟡 Minor

Positions file in /tmp will be lost on restart.

Using /tmp/positions.yaml means Promtail loses track of its read position when the container restarts, potentially causing duplicate log ingestion or missed logs. Mount a persistent volume for the positions file.

♻️ Suggested fix with persistent volume

In promtail-config.yaml:

 positions:
-  filename: /tmp/positions.yaml
+  filename: /var/promtail/positions.yaml

In docker-compose.yml, add a volume to the promtail service:

   promtail:
     volumes:
       - ./promtail-config.yaml:/etc/promtail/config.yaml
       - /var/run/docker.sock:/var/run/docker.sock
       - /var/lib/docker/containers:/var/lib/docker/containers:ro
+      - promtail-positions:/var/promtail

And declare the volume:

 volumes:
   ...
   loki-data:
+  promtail-positions:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/promtail-config.yaml` around lines 5 - 6, The positions file is
configured as filename: /tmp/positions.yaml which will be lost on container
restart; update promtail-config.yaml to point positions.filename to a path on a
persistent volume (e.g., /data/positions.yaml) and then update your
docker-compose service for promtail (the promtail service) to mount a named
volume at that container path and declare that volume in the compose file so the
positions file persists across restarts.

Comment on lines +16 to +20
relabel_configs:
  - source_labels: [__meta_docker_container_name]
    target_label: container
  - source_labels: [__meta_docker_container_log_stream]
    target_label: stream

🧹 Nitpick | 🔵 Trivial

Consider adding compose service labels for better log filtering.

The current relabels only capture container name and stream. Adding compose-specific labels enables filtering by service name in Grafana/Loki queries.

♻️ Suggested additional relabel configs
     relabel_configs:
       - source_labels: [__meta_docker_container_name]
         target_label: container
       - source_labels: [__meta_docker_container_log_stream]
         target_label: stream
+      - source_labels: [__meta_docker_container_label_com_docker_compose_service]
+        target_label: service
+      - source_labels: [__meta_docker_container_label_com_docker_compose_project]
+        target_label: project
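
With these labels applied, Loki queries can filter on the compose service directly, for example (assuming a compose service named bff):

  {service="bff"} |= "error"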
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@infra/docker/promtail-config.yaml` around lines 16 - 20, The relabel_configs
only map __meta_docker_container_name and __meta_docker_container_log_stream to
container and stream; add relabel rules to capture Docker Compose service labels
(e.g., __meta_docker_container_label_com_docker_compose_service and/or
__meta_docker_container_label_com_docker_compose_project) and map them to
descriptive target labels such as service or compose_project so Grafana/Loki
queries can filter by compose service; update the relabel_configs block where
source_labels and target_label are defined to include these additional mappings
alongside the existing __meta_docker_container_name -> container and
__meta_docker_container_log_stream -> stream entries.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/bootstrap.sh`:
- Line 2: Replace the loose error handling in the bootstrap script by changing
the shell options set in the top-level invocation: update the existing set -e to
set -euo pipefail so that the script also fails on unset variables and pipeline
errors; locate the initialization that contains set -e in scripts/bootstrap.sh
and make this single-line replacement (no other logic changes needed).
- Around line 16-19: The script uses relative cd commands (e.g., the four lines
with "cd apps/bff && npm install && cd ../..", "cd apps/gateway ...", etc.)
which assume the current working directory is the repo root; prepend a robust
anchor to the repository root at the top of scripts/bootstrap.sh — for example
resolve the repo root with git rev-parse --show-toplevel (or compute the script
dir via dirname "$0" and cd to its parent) and cd there or exit on failure
before running those cd/install lines so all subsequent relative paths (apps/*,
infra/*, scripts/seed/*) are executed from the repo root.
- Around line 37-39: Replace the fragile fixed delay (the "sleep 10" used after
the echo "→ Waiting for Kafka...") with a readiness probe loop that polls the
Kafka broker and only proceeds when it responds or a timeout is reached;
implement a loop in the script that repeatedly checks the broker (for example by
invoking kafka-topics.sh --bootstrap-server "$KAFKA_BOOTSTRAP" --list or a TCP
probe to the Kafka host:port) and breaks when successful, logging retries, and
exit with a non-zero status after a configurable timeout instead of
unconditionally sleeping.
- Around line 43-45: The seeding command currently swallows failures by
appending "|| true" but still prints "✅ Data seeded"; remove the "|| true" and
change the flow so the docker-compose psql command's exit code is checked and
only print "✅ Data seeded" when that command succeeds, otherwise exit non-zero
(or print an error) to surface failures; locate the docker compose psql
invocation (the line that redirects scripts/seed/seed-postgres.sh into psql) and
the subsequent echo "✅ Data seeded" and update them so success is conditional on
the psql command's actual exit status.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7a51f954-cf2f-4176-bfde-acc98c65f240

📥 Commits

Reviewing files that changed from the base of the PR and between 562ad09 and 51b7622.

📒 Files selected for processing (1)
  • scripts/bootstrap.sh

Comment thread scripts/bootstrap.sh
@@ -0,0 +1,64 @@
#!/usr/bin/env bash
set -e

🧹 Nitpick | 🔵 Trivial

Use stricter shell mode for safer bootstrap behavior.

set -e alone misses unset-variable and pipeline failure cases. Prefer set -euo pipefail for bootstrap scripts.

Proposed fix
-set -e
+set -euo pipefail
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` at line 2, Replace the loose error handling in the
bootstrap script by changing the shell options set in the top-level invocation:
update the existing set -e to set -euo pipefail so that the script also fails on
unset variables and pipeline errors; locate the initialization that contains set
-e in scripts/bootstrap.sh and make this single-line replacement (no other logic
changes needed).

Comment thread scripts/bootstrap.sh
Comment on lines +16 to +19
cd apps/bff && npm install && cd ../..
cd apps/gateway && npm install && cd ../..
cd apps/dashboard && npm install && cd ../..
cd apps/jobs-worker && npm install && cd ../..

⚠️ Potential issue | 🟠 Major

Anchor execution to repository root before using relative paths.

cd apps/... assumes the caller runs the script from the repo root. If invoked from elsewhere, bootstrap fails early (and the same root cause affects later relative paths like infra/... and scripts/seed/...).

Proposed fix
 #!/usr/bin/env bash
-set -e
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+REPO_ROOT="$(cd "${SCRIPT_DIR}/.." && pwd)"
+cd "${REPO_ROOT}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 16 - 19, The script uses relative cd
commands (e.g., the four lines with "cd apps/bff && npm install && cd ../..",
"cd apps/gateway ...", etc.) which assume the current working directory is the
repo root; prepend a robust anchor to the repository root at the top of
scripts/bootstrap.sh — for example resolve the repo root with git rev-parse
--show-toplevel (or compute the script dir via dirname "$0" and cd to its
parent) and cd there or exit on failure before running those cd/install lines so
all subsequent relative paths (apps/*, infra/*, scripts/seed/*) are executed
from the repo root.

Comment thread scripts/bootstrap.sh
Comment on lines +37 to +39
echo "→ Waiting for Kafka..."
sleep 10
echo "✅ Kafka ready"

⚠️ Potential issue | 🟠 Major

Replace fixed Kafka delay with an actual readiness probe.

A hardcoded sleep 10 is flaky; on slower startups, dependent services can still race and fail. Line 38 should poll for broker readiness instead of elapsed time.

Proposed fix
 # Wait for Kafka to be ready
 echo "→ Waiting for Kafka..."
-sleep 10
+until docker compose -f infra/docker/docker-compose.yml exec -T kafka \
+  kafka-topics --bootstrap-server kafka:9092 --list >/dev/null 2>&1; do
+  sleep 2
+done
 echo "✅ Kafka ready"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 37 - 39, Replace the fragile fixed delay
(the "sleep 10" used after the echo "→ Waiting for Kafka...") with a readiness
probe loop that polls the Kafka broker and only proceeds when it responds or a
timeout is reached; implement a loop in the script that repeatedly checks the
broker (for example by invoking kafka-topics.sh --bootstrap-server
"$KAFKA_BOOTSTRAP" --list or a TCP probe to the Kafka host:port) and breaks when
successful, logging retries, and exit with a non-zero status after a
configurable timeout instead of unconditionally sleeping.

Comment thread scripts/bootstrap.sh
Comment on lines +43 to +45
docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
-U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh || true
echo "✅ Data seeded"

⚠️ Potential issue | 🟠 Major

Do not suppress seeding failures while printing success.

Line 44 uses || true, so failed seed runs still report ✅ Data seeded, which can hide broken local environments.

Proposed fix
 echo "→ Seeding dev data..."
-docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
-  -U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh || true
-echo "✅ Data seeded"
+if docker compose -f infra/docker/docker-compose.yml exec -T postgres psql \
+  -U postgres -d grainguard -f /dev/stdin < scripts/seed/seed-postgres.sh; then
+  echo "✅ Data seeded"
+else
+  echo "❌ Data seeding failed"
+  exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/bootstrap.sh` around lines 43 - 45, The seeding command currently
swallows failures by appending "|| true" but still prints "✅ Data seeded";
remove the "|| true" and change the flow so the docker-compose psql command's
exit code is checked and only print "✅ Data seeded" when that command succeeds,
otherwise exit non-zero (or print an error) to surface failures; locate the
docker compose psql invocation (the line that redirects
scripts/seed/seed-postgres.sh into psql) and the subsequent echo "✅ Data seeded"
and update them so success is conditional on the psql command's actual exit
status.

…s, DB fallback on cache miss, token-based Redis lock ownership