perf: eliminate N+1 queries, blocking sleep, and Cassandra/Kafka bottlenecks#13
Merged
pahuldeepp merged 3 commits intomasterfrom Mar 28, 2026
Merged
perf: eliminate N+1 queries, blocking sleep, and Cassandra/Kafka bottlenecks#13pahuldeepp merged 3 commits intomasterfrom
pahuldeepp merged 3 commits intomasterfrom
Conversation
…lenecks BFF resolvers.ts - N+1 fix: manyDeviceTelemetry now issues one batched SELECT…ANY($1::uuid[]) for all cache misses instead of O(N) sequential queries + O(N) tenant checks - Double-query fix: deviceTelemetry no longer issues a second round-trip to device_projections for tenant isolation — tenant_id is now included in the getDeviceTelemetry SELECT - Blocking sleep fix: removed sleep(100ms) on lock contention; replaced with an immediate non-blocking re-check via cacheGetOrLock helper BFF postgres.ts - Add getManyDeviceTelemetry(ids, tenantId?) — batch query using ANY($1::uuid[]) with optional tenant filter; used by manyDeviceTelemetry resolver - getDeviceTelemetry now SELECTs tenant_id (eliminates the double round-trip) - COUNT(*) fix: total device count is now cached in Redis for 60 s instead of running a full table scan on every paginated request ingest-service main.go - DB pool default raised from 10 → 50 connections (3-tier key cache means DB is rarely hit; pool was exhausting under modest load) - Kafka BatchSize 200 → 500, BatchTimeout 5 ms → 10 ms, added WriteBackoffMin/ Max and MaxAttempts — amortises per-batch RTT across more concurrent writers cassandra-writer main.go - Consistency LocalQuorum → One: immutable time-series data is safe at One and ~3× faster (no quorum wait across replicas) - Kafka commit now retries up to 3 times before logging; redelivery is safe because event_id is in the Cassandra PRIMARY KEY (idempotent writes) migrations - 000004_add_perf_indexes: three CONCURRENTLY-built indexes covering the most common query patterns (tenant+device composite, tenant+created_at, device+tenant) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Caution Review failedPull request was closed or merged during review 📝 WalkthroughWalkthroughThis PR introduces performance optimizations and resilience improvements across the data pipeline: batched telemetry queries with Redis caching in the BFF, adjusted Cassandra write consistency and Kafka offset commit retry logic, tuned Kafka and Postgres connection pools, and added three performance database indexes. Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant Resolver
participant Cache as Redis Cache
participant Lock as Distributed Lock
participant DB as PostgreSQL
Client->>Resolver: Query manyDeviceTelemetry(deviceIds)
Resolver->>Cache: Check cache for deviceIds
alt Cache Hit
Cache-->>Resolver: Return cached results
else Cache Miss
Resolver->>Lock: Attempt to acquire lock
alt Lock Acquired
Lock-->>Resolver: Lock granted
Resolver->>DB: Batch fetch (getManyDeviceTelemetry)
DB-->>Resolver: Telemetry rows
Resolver->>Cache: Store results (TTL)
Resolver->>Lock: Release lock
else Lock Not Acquired
Lock-->>Resolver: Lock unavailable
Resolver->>Cache: Immediate re-check (non-blocking)
Cache-->>Resolver: Return results
end
end
Resolver-->>Client: Return telemetry data
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
…lint v2.11.0) vendor/node_modules/dist contain no Go source files so the exclusion is a no-op in practice; dropping it unblocks config verify. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pahuldeepp
added a commit
that referenced
this pull request
Mar 28, 2026
HIGH PRIORITY: - #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual - #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods) - #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying MEDIUM PRIORITY: - #11: Add webhook idempotency check (dedup by endpoint_id + event_type) - #12: Assert Stripe webhook body is Buffer before signature verification - #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query) - #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results - #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow LOW PRIORITY: - #20: Add 10s query timeout to all saga repository DB operations - #23: Add IsValidStatus validator for saga status constants - #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF - #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers - #7: Fix remaining saga JSON marshal error check (initialErr) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed and why
BFF
resolvers.tsmanyDeviceTelemetry: O(N) sequential DB queries (2 per missed device)SELECT … ANY($1::uuid[])for all missesdeviceTelemetry: 2 DB queries (telemetry + device_projections for tenant check)tenant_idnow returned bygetDeviceTelemetrysleep(100ms)→ all waiters blockcacheGetOrLockBFF
postgres.tsgetManyDeviceTelemetry(ids, tenantId?)— batch query, replaces the resolver loopgetDeviceTelemetrynow includestenant_idin SELECTCOUNT(*)in cursor pagination cached in Redis for 60 s — eliminates full table scan per pageingest-service/main.go10 → 50(3-tier key cache means Postgres is rarely hit; old limit caused exhaustion under modest concurrency)BatchSize 200 → 500,BatchTimeout 5 → 10 ms, addedWriteBackoffMin/MaxandMaxAttempts 3cassandra-writer/main.goLocalQuorum → Onefor telemetry writes — immutable time-series data is safe atOne, ~3× lower write latencyevent_idin Cassandra PRIMARY KEY makes writes idempotentMigration
000004_add_perf_indexesThree
CONCURRENTLY-built indexes covering the hottest query patterns:idx_device_telemetry_tenant_device (tenant_id, device_id)idx_device_projections_tenant_created (tenant_id, created_at DESC)idx_device_projections_device_tenant (device_id, tenant_id)Estimated impact
manyDeviceTelemetry(10 devices, cache miss)deviceTelemetry(non-superadmin, cache miss)Test plan
🤖 Generated with Claude Code
Summary by CodeRabbit