perf: eliminate N+1 queries, blocking sleep, and Cassandra/Kafka bottlenecks#13

Merged
pahuldeepp merged 3 commits into master from perf/fix-bottlenecks
Mar 28, 2026
Conversation

@pahuldeepp pahuldeepp commented Mar 28, 2026

What changed and why

BFF resolvers.ts

| Before | After |
| --- | --- |
| `manyDeviceTelemetry`: O(N) sequential DB queries (2 per missed device) | One batched `SELECT … ANY($1::uuid[])` for all misses |
| `deviceTelemetry`: 2 DB queries (telemetry + device_projections for tenant check) | 1 query — `tenant_id` now returned by `getDeviceTelemetry` |
| Lock contention → `sleep(100ms)` → all waiters block | Immediate non-blocking re-check via `cacheGetOrLock` |
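The non-blocking re-check pattern can be sketched as follows. The real `cacheGetOrLock` signature is not shown in this PR, so the helper's shape is an assumption, and an in-process `Map`/`Set` stands in for Redis GET and SET NX:

```typescript
// Sketch of a cacheGetOrLock-style helper (names and signature assumed;
// a Map/Set stand in for Redis GET and SET NX). On lock contention the
// caller re-checks the cache once instead of sleeping 100 ms.
const cache = new Map<string, unknown>();
const locks = new Set<string>();

async function cacheGetOrLock<T>(
  key: string,
  load: () => Promise<T>,
): Promise<T | undefined> {
  const hit = cache.get(key);
  if (hit !== undefined) return hit as T; // fast path: cache hit

  if (locks.has(key)) {
    // Another request is already loading this key. Previously the code
    // slept 100 ms here; now it re-checks the cache immediately and
    // returns whatever is present (possibly undefined, so the caller
    // can fall back instead of blocking).
    return cache.get(key) as T | undefined;
  }

  locks.add(key); // "SET key NX" equivalent
  try {
    const value = await load(); // single DB round-trip on miss
    cache.set(key, value);
    return value;
  } finally {
    locks.delete(key); // release the lock
  }
}
```

The design point is that contended waiters never park: they either serve a value that the lock holder has already written, or return a miss and let the caller degrade gracefully.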

BFF postgres.ts

  • New getManyDeviceTelemetry(ids, tenantId?) — batch query, replaces the resolver loop
  • getDeviceTelemetry now includes tenant_id in SELECT
  • COUNT(*) in cursor pagination cached in Redis for 60 s — eliminates full table scan per page
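A minimal sketch of how the batched statement could be assembled. The table name `device_telemetry_latest` appears in the migration notes below, but the exact column list and the query-builder shape here are assumptions, not the PR's actual code:

```typescript
// Sketch of building the batched lookup that replaces the resolver loop.
// Column names (payload, updated_at) are assumed for illustration; the
// result would be passed to pg's pool.query(text, values) as a single
// statement instead of N per-device queries.
function buildManyDeviceTelemetryQuery(
  ids: string[],
  tenantId?: string,
): { text: string; values: unknown[] } {
  let text =
    "SELECT device_id, tenant_id, payload, updated_at " +
    "FROM device_telemetry_latest WHERE device_id = ANY($1::uuid[])";
  const values: unknown[] = [ids]; // pg serializes a JS array as a Postgres array
  if (tenantId !== undefined) {
    text += " AND tenant_id = $2"; // optional tenant-isolation filter
    values.push(tenantId);
  }
  return { text, values };
}
```

Passing the whole `ids` array as one parameter is what collapses the O(N) round-trips into a single indexed scan.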

ingest-service/main.go

  • DB pool default 10 → 50 (3-tier key cache means Postgres is rarely hit; old limit caused exhaustion under modest concurrency)
  • Kafka BatchSize 200 → 500, BatchTimeout 5 → 10 ms, added WriteBackoffMin/Max and MaxAttempts 3
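To illustrate what WriteBackoffMin/Max bound, here is a sketch of a clamped exponential backoff schedule. The concrete min/max values are assumptions for illustration; the PR does not state them:

```typescript
// Illustration of a bounded exponential backoff schedule, the pattern
// that WriteBackoffMin/Max and MaxAttempts express in the Kafka writer
// config. Example values (100 ms min, 1 s max) are assumptions.
function backoffSchedule(minMs: number, maxMs: number, attempts: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    // Double the delay each attempt, clamped to the [minMs, maxMs] window.
    delays.push(Math.min(maxMs, minMs * 2 ** i));
  }
  return delays;
}
```

With 3 attempts the writer gives up quickly and surfaces the error rather than retrying unboundedly.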

cassandra-writer/main.go

  • LocalQuorum → One for telemetry writes — immutable time-series data is safe at One, ~3× lower write latency
  • Kafka commit now retries 3× before logging; safe because event_id in Cassandra PRIMARY KEY makes writes idempotent
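The bounded commit retry described above can be sketched like this. The actual code is Go; the helper name and logging here are assumptions made for illustration:

```typescript
// Sketch of the bounded offset-commit retry (the real implementation is
// in cassandra-writer/main.go in Go; this helper is illustrative only).
// Giving up after the final attempt is safe: the message is redelivered,
// and because event_id is in the Cassandra PRIMARY KEY, re-writing the
// same event is idempotent.
async function commitWithRetry(
  commit: () => Promise<void>,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await commit();
      return true; // committed successfully
    } catch (err) {
      if (attempt === maxAttempts) {
        console.error("offset commit failed after retries", err);
        return false; // give up; the message will be redelivered
      }
      console.warn(`offset commit attempt ${attempt} failed, retrying`);
    }
  }
  return false;
}
```

Distinguishing the retry log from the final-failure log (as the PR does) keeps transient broker hiccups from paging anyone while still surfacing persistent failures.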

Migration 000004_add_perf_indexes

Three CONCURRENTLY-built indexes covering the hottest query patterns:

  • idx_device_telemetry_tenant_device (tenant_id, device_id)
  • idx_device_projections_tenant_created (tenant_id, created_at DESC)
  • idx_device_projections_device_tenant (device_id, tenant_id)

Estimated impact

| Metric | Before | After |
| --- | --- | --- |
| manyDeviceTelemetry (10 devices, cache miss) | ~20 DB queries, ~40-100 ms | 1 query, ~5 ms |
| deviceTelemetry (non-superadmin, cache miss) | 2 queries, ~10 ms | 1 query, ~5 ms |
| Lock-contention wait | 100 ms fixed | 0 ms (immediate re-check) |
| Ingest DB pool exhaustion threshold | ~10 concurrent misses | ~50 concurrent misses |
| Cassandra write P99 | ~30 ms (LocalQuorum) | ~10 ms (One) |
| Paginated device list COUNT(*) | Per-request full scan | Redis-cached, refreshed every 60 s |

Test plan

  • CI / TS Checks — bff
  • CI / Go Test
  • CI / Go Lint
  • CI Gate

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Chores
    • Optimized database performance with new query indexes to improve response times for device telemetry and device listing operations.
    • Enhanced system reliability with improved connection pooling and automated retry mechanisms for critical data operations.
    • Updated CI/CD pipeline and build configurations.

…lenecks

BFF resolvers.ts
- N+1 fix: manyDeviceTelemetry now issues one batched SELECT…ANY($1::uuid[])
  for all cache misses instead of O(N) sequential queries + O(N) tenant checks
- Double-query fix: deviceTelemetry no longer issues a second round-trip to
  device_projections for tenant isolation — tenant_id is now included in the
  getDeviceTelemetry SELECT
- Blocking sleep fix: removed sleep(100ms) on lock contention; replaced with
  an immediate non-blocking re-check via cacheGetOrLock helper

BFF postgres.ts
- Add getManyDeviceTelemetry(ids, tenantId?) — batch query using ANY($1::uuid[])
  with optional tenant filter; used by manyDeviceTelemetry resolver
- getDeviceTelemetry now SELECTs tenant_id (eliminates the double round-trip)
- COUNT(*) fix: total device count is now cached in Redis for 60 s instead of
  running a full table scan on every paginated request

ingest-service main.go
- DB pool default raised from 10 → 50 connections (3-tier key cache means DB
  is rarely hit; pool was exhausting under modest load)
- Kafka BatchSize 200 → 500, BatchTimeout 5 ms → 10 ms, added WriteBackoffMin/
  Max and MaxAttempts — amortises per-batch RTT across more concurrent writers

cassandra-writer main.go
- Consistency LocalQuorum → One: immutable time-series data is safe at One and
  ~3× faster (no quorum wait across replicas)
- Kafka commit now retries up to 3 times before logging; redelivery is safe
  because event_id is in the Cassandra PRIMARY KEY (idempotent writes)

migrations
- 000004_add_perf_indexes: three CONCURRENTLY-built indexes covering the most
  common query patterns (tenant+device composite, tenant+created_at, device+tenant)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
coderabbitai Bot commented Mar 28, 2026

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough


This PR introduces performance optimizations and resilience improvements across the data pipeline: batched telemetry queries with Redis caching in the BFF, adjusted Cassandra write consistency and Kafka offset commit retry logic, tuned Kafka and Postgres connection pools, and added three performance database indexes.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **BFF Datasource & Resolvers**<br>`apps/bff/src/datasources/postgres.ts`, `apps/bff/src/resolvers.ts` | Updated `getDeviceTelemetry()` to include `tenant_id` in the projection; added `getManyDeviceTelemetry()` for batch telemetry fetching; replaced per-request device count queries with Redis-cached values (60 s TTL). Refactored resolvers to use the `cacheGetOrLock` helper for cache-miss handling; rewrote `manyDeviceTelemetry` to batch-fetch results instead of issuing sequential queries; corrected snake_case field mappings in the telemetry transformation. |
| **Cassandra Writer**<br>`apps/cassandra-writer/main.go` | Changed the write consistency level from LocalQuorum to One; replaced the single Kafka offset commit with a bounded retry loop (up to 3 attempts) with distinct logging for retry vs. final failure. |
| **Ingest Service Configuration**<br>`apps/ingest-service/main.go` | Increased Kafka producer BatchSize (200 → 500) and BatchTimeout (5 ms → 10 ms); added write backoff configuration and 3 max retry attempts; increased the Postgres connection pool (MaxConns 10 → 50, MinConns 2 → 5). |
| **Database Migrations**<br>`apps/read-model-builder/migrations/000004_add_perf_indexes.up.sql`, `apps/read-model-builder/migrations/000004_add_perf_indexes.down.sql` | Added three composite indexes on `device_telemetry_latest(tenant_id, device_id)`, `device_projections(tenant_id, created_at DESC)`, and `device_projections(device_id, tenant_id)` to optimize tenant-scoped and device-lookup queries. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Resolver
    participant Cache as Redis Cache
    participant Lock as Distributed Lock
    participant DB as PostgreSQL

    Client->>Resolver: Query manyDeviceTelemetry(deviceIds)
    Resolver->>Cache: Check cache for deviceIds
    alt Cache Hit
        Cache-->>Resolver: Return cached results
    else Cache Miss
        Resolver->>Lock: Attempt to acquire lock
        alt Lock Acquired
            Lock-->>Resolver: Lock granted
            Resolver->>DB: Batch fetch (getManyDeviceTelemetry)
            DB-->>Resolver: Telemetry rows
            Resolver->>Cache: Store results (TTL)
            Resolver->>Lock: Release lock
        else Lock Not Acquired
            Lock-->>Resolver: Lock unavailable
            Resolver->>Cache: Immediate re-check (non-blocking)
            Cache-->>Resolver: Return results
        end
    end
    Resolver-->>Client: Return telemetry data
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the three main performance improvements: eliminating N+1 queries, removing blocking sleep, and addressing Cassandra/Kafka bottlenecks. |


pahuldeepp and others added 2 commits March 28, 2026 11:04
…lint v2.11.0)

vendor/node_modules/dist contain no Go source files so the exclusion
is a no-op in practice; dropping it unblocks config verify.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pahuldeepp pahuldeepp merged commit a6017ff into master Mar 28, 2026
58 checks passed
@pahuldeepp pahuldeepp deleted the perf/fix-bottlenecks branch March 28, 2026 16:10
pahuldeepp added a commit that referenced this pull request Mar 28, 2026
HIGH PRIORITY:
- #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual
- #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods)
- #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying

MEDIUM PRIORITY:
- #11: Add webhook idempotency check (dedup by endpoint_id + event_type)
- #12: Assert Stripe webhook body is Buffer before signature verification
- #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query)
- #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results
- #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow

LOW PRIORITY:
- #20: Add 10s query timeout to all saga repository DB operations
- #23: Add IsValidStatus validator for saga status constants
- #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF
- #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers
- #7: Fix remaining saga JSON marshal error check (initialErr)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>