Codex/phase1 hardening by pahuldeepp · Pull Request #8 · pahuldeepp/GrainGuard-

pahuldeepp · 2026-03-28T02:31:31Z

Summary by CodeRabbit

Bug Fixes
- More robust error handling and explicit logging around transaction commit/rollback and message processing.
- Added read header timeouts to internal HTTP servers to improve resiliency.
Chores
- CI/workflow updates (JS Node setting, linting job tweaks, security scan changes).
- Bumped several Go module dependencies and security scanning tool versions.
Refactor
- Formatting and minor API-internal cleanups across services.
- Small performance/memory optimizations in ingest handling.
Tests
- Test container cleanup made explicit.

coderabbitai · 2026-03-28T02:31:52Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3bea64c6-f583-4c70-95d7-6affe61398d4

📥 Commits

Reviewing files that changed from the base of the PR and between d4be229 and 5451012.

📒 Files selected for processing (2)

.github/workflows/ci.yml
.github/workflows/security.yml

📝 Walkthrough

Walkthrough

Workflow and CI tweaks, Go linter and module version bumps, multiple Go files refactored to ignore rollback errors and improve error-variable scoping, HTTP servers given ReadHeaderTimeout, minor runtime/formatting fixes, and targeted runtime/behavior adjustments in several services.

Changes

Cohort / File(s)	Summary
GitHub Actions Workflows `.github/workflows/ci.yml`, `.github/workflows/security.yml`	Added `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24` env var; changed golangci-lint step to run a specific go module invocation; renamed and conditionalized TS checks matrix; pinned `aquasecurity/trivy-action` to `v0.33.1`; changed Trivy build context to repository root.
Linter & Module Configuration `.golangci.yml`, `go.mod`	Removed `exportloopref` linter, added gocritic disabled checks and gosec G404 exclusion; bumped `github.com/redis/go-redis/v9`, `google.golang.org/grpc`, and several `golang.org/x/*` indirect deps.
Transaction Rollback & Error Scoping `apps/read-model-builder/.../device_projection.go`, `apps/read-model-builder/.../telemetry_projection.go`, `apps/read-model-builder/.../telemetry_projection_integration_test.go`, `apps/telemetry-service/internal/application/create_device.go`, `apps/telemetry-service/internal/application/record_telemetry.go`, `apps/telemetry-service/internal/worker/outbox_worker.go`	Replaced direct `defer tx.Rollback(...)` with deferred closures that explicitly ignore rollback errors; added scoped error variables (e.g., `commitErr`, `scanErr`, `saveErr`) to avoid shadowing/overwrites.
HTTP Server & I/O handling `apps/saga-orchestrator/internal/health/handler.go`, `apps/telemetry-service/cmd/main.go`, `libs/health/health.go`, `libs/observability/metrics.go`	Switched from `http.ListenAndServe` to `http.Server` instances and added `ReadHeaderTimeout: 5 * time.Second`; made write/encode return values explicitly discarded (`_, _ =` / `_ =`).
Kafka, Consumer & Repository formatting `apps/saga-orchestrator/internal/consumer/kafka_consumer.go`, `apps/saga-orchestrator/internal/repository/postgres_saga_repository.go`	Removed leading UTF-8 BOM from package declaration and normalized formatting/indentation (tabs) — no logic changes.
Service-specific runtime fixes `apps/cassandra-writer/main.go`, `apps/ingest-service/main.go`, `apps/saga-orchestrator/internal/recovery/recovery_worker.go`	Refactored Kafka message error handling to use `unmarshalErr` and log commit errors; added `getenvInt32`; changed body pool to store `*[]byte`; explicit discard of write returns; skipped recovery actions for specific saga statuses.
Migration & utilities formatting `libs/migrate/migrate.go`, `libs/migrate/...`	Reformatted with tabs, simplified DSN construction logic, and renamed migration error variable to `migrateErr`.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Codex/phase1 hardening #8: Overlapping multi-file changes (workflows, .golangci.yml, go.mod, and numerous Go sources) that appear to apply the same rollback-ignore, server timeout, and CI adjustments.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'Codex/phase1 hardening' is vague and does not clearly summarize the main changes in the pull request, which include linter updates, workflow configuration changes, error handling improvements, and dependency updates across multiple files.	Use a more specific title that captures the primary objective, such as 'Refactor error handling and improve linting configuration' or 'Update workflows and dependencies for Phase 1 hardening'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch codex/phase1-hardening

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

.github/workflows/ci.yml (1)
97-99: 🧹 Nitpick | 🔵 Trivial

Typecheck fallback causes inconsistent behavior and unnecessary full builds.

The fallback npm run typecheck || npm run build produces different results per app:

gateway: Has typecheck script → runs tsc --noEmit (lightweight)

bff: No typecheck script → falls back to tsc (full build)

jobs-worker: No typecheck script → falls back to tsc (full build)

dashboard: No typecheck script → falls back to tsc -b && vite build (heavy, includes full Vite bundle)

For bff, jobs-worker, and especially dashboard, the fallback runs unnecessary full builds instead of type checking alone, increasing CI time. Add explicit typecheck scripts to each app for consistent and faster CI runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 97 - 99, The CI step named "Typecheck"
currently falls back to running a full build when an app lacks a typecheck
script; add explicit "typecheck" scripts to the package.json of the bff,
jobs-worker, and dashboard apps so the workflow's Typecheck step runs
lightweight type checks only. For each app add a "typecheck" script that invokes
TypeScript without emitting (e.g., "tsc --noEmit") or, if the repo uses
composite/project refs, "tsc -b --noEmit" for that project; ensure the script
names are exactly "typecheck" so the existing workflow step (Typecheck) picks
them up and no changes to .github/workflows/ci.yml are required.
apps/read-model-builder/internal/projection/telemetry_projection.go (2)
185-190: ⚠️ Potential issue | 🟡 Minor

Remove duplicate redisClient nil check.

Lines 185-190 contain a duplicate nil check. The second check at line 188 is dead code since if redisClient == nil, the function already returned at line 186.
🧹 Proposed fix
 		if redisClient == nil {
 			return nil
 		}
-		if redisClient == nil {
-			return nil
-		}
 		pipe := redisClient.Pipeline()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 185 - 190, Remove the duplicate nil check for redisClient in
telemetry_projection.go: locate the function/block that contains the two
consecutive "if redisClient == nil { return nil }" checks and delete the second
one so there is only a single guard that returns when redisClient is nil
(preserve the first check and surrounding logic unchanged).
305-311: 🧹 Nitpick | 🔵 Trivial

Missing rows.Err() check after first rows.Close().

Per coding guidelines, rows.Err() should be checked after rows.Close(). The second query result (lines 377-387) correctly checks rows.Err(), but this first iteration does not. While the impact is lower here (only scanning event IDs), it's good practice to check consistently.
♻️ Proposed fix
 		for rows.Next() {
 			var id string
 			if scanErr := rows.Scan(&id); scanErr == nil {
 				newEventIDs[id] = struct{}{}
 			}
 		}
 		rows.Close()
+		if err := rows.Err(); err != nil {
+			observability.EventsRetry.Inc()
+			return err
+		}
As per coding guidelines: "Check pgx rows.Err() after rows.Close()."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 305 - 311, The rows returned for populating newEventIDs are not checking
rows.Err() after rows.Close(); update the block that iterates rows.Next()
filling newEventIDs to call rows.Close() then check if err := rows.Err(); handle
that error the same way as the later query (e.g., return the error or log and
exit), ensuring you reference the same variables (rows, newEventIDs) and follow
the error-handling pattern used in the second query that already checks
rows.Err().

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 30-32: The CI currently pairs golangci/golangci-lint-action@v8
with golangci-lint version v1.62.0 which is incompatible; update the workflow to
either change the action to a v6 or v7 tag (e.g.,
golangci/golangci-lint-action@v7) to keep using golangci-lint v1.x, or keep
golangci/golangci-lint-action@v8 and upgrade the referenced golangci-lint
`version:` to a v2.x release (e.g., v2.1.0+) so the action and linter versions
are compatible.

In @.github/workflows/security.yml:
- Line 47: Replace both occurrences of the compromised action string "uses:
aquasecurity/trivy-action@0.33.1" with the safe pinned tag "uses:
aquasecurity/trivy-action@v0.35.0" (update the two locations in the workflow
where that exact string appears) and then trigger a secrets audit for any tokens
or credentials that were available to runs during March 19–20, 2026 to
rotate/revoke any possibly exposed secrets.

In `@apps/cassandra-writer/main.go`:
- Around line 330-335: In the branch that handles non-matching event types (when
event.EventType != "telemetry.recorded"), increment the skip counter by calling
skipped.Add(1) before continuing; specifically, update the block that commits
the message via reader.CommitMessages(ctx, msg) to also call skipped.Add(1)
(ensure this happens regardless of commitErr) so the skip path for this
condition matches the other skip paths (which call skipped.Add(1)).

In `@apps/saga-orchestrator/internal/health/handler.go`:
- Around line 59-63: The http.Server returned in the health server (the return
block creating &http.Server with Addr, Handler, ReadHeaderTimeout) only sets
ReadHeaderTimeout; add ReadTimeout, WriteTimeout, and IdleTimeout on that same
&http.Server to harden against resource exhaustion (e.g., set small
ReadTimeout/WriteTimeout like a few seconds and an IdleTimeout for keep-alives);
update the return in handler.go where the server is constructed so the server
config includes ReadTimeout, WriteTimeout, and IdleTimeout alongside
ReadHeaderTimeout.

In `@apps/telemetry-service/internal/application/record_telemetry.go`:
- Around line 73-75: The defer currently swallows tx.Rollback(ctx) and uses the
request ctx which may be cancelled; change the defer to call tx.Rollback using a
non-request cleanup context (e.g., context.Background() or a short-lived context
with timeout) and check its error, logging any failure instead of discarding it.
Locate the rollback defer around tx.Rollback and replace it with a guarded defer
that does something like: create a cleanupCtx, if err :=
tx.Rollback(cleanupCtx); err != nil && err != sql.ErrTxDone { logger.Errorf("tx
rollback failed: %v", err) } so rollback failures are observed and the request
context cancellation cannot silently prevent cleanup.

In `@apps/telemetry-service/internal/worker/outbox_worker.go`:
- Around line 268-270: The deferred rollback currently discards errors and uses
the incoming ctx which may be canceled; instead update the defer that calls
tx.Rollback(ctx) to call Rollback with a fresh non-canceled context (e.g.,
context.Background() or a short timeout context) and check the returned
error—log it or wrap it (do not ignore) unless it is sql.ErrTxDone; locate the
defer around tx.Rollback in outbox_worker.go (the tx variable and its deferred
rollback) and replace the silent discard `_ = tx.Rollback(ctx)` with explicit
error handling that distinguishes a harmless "transaction already finished"
error from real rollback failures and records the failure via the component
logger/error reporter used in this file.

---

Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 97-99: The CI step named "Typecheck" currently falls back to
running a full build when an app lacks a typecheck script; add explicit
"typecheck" scripts to the package.json of the bff, jobs-worker, and dashboard
apps so the workflow's Typecheck step runs lightweight type checks only. For
each app add a "typecheck" script that invokes TypeScript without emitting
(e.g., "tsc --noEmit") or, if the repo uses composite/project refs, "tsc -b
--noEmit" for that project; ensure the script names are exactly "typecheck" so
the existing workflow step (Typecheck) picks them up and no changes to
.github/workflows/ci.yml are required.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go`:
- Around line 185-190: Remove the duplicate nil check for redisClient in
telemetry_projection.go: locate the function/block that contains the two
consecutive "if redisClient == nil { return nil }" checks and delete the second
one so there is only a single guard that returns when redisClient is nil
(preserve the first check and surrounding logic unchanged).
- Around line 305-311: The rows returned for populating newEventIDs are not
checking rows.Err() after rows.Close(); update the block that iterates
rows.Next() filling newEventIDs to call rows.Close() then check if err :=
rows.Err(); handle that error the same way as the later query (e.g., return the
error or log and exit), ensuring you reference the same variables (rows,
newEventIDs) and follow the error-handling pattern used in the second query that
already checks rows.Err().

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 98f1c8d9-4e4c-4385-ab24-c541daf76db3

📥 Commits

Reviewing files that changed from the base of the PR and between ce306f8 and d4be229.

⛔ Files ignored due to path filters (1)

go.sum is excluded by !**/*.sum

📒 Files selected for processing (20)

.github/workflows/ci.yml
.github/workflows/security.yml
.golangci.yml
apps/cassandra-writer/main.go
apps/ingest-service/main.go
apps/read-model-builder/internal/projection/device_projection.go
apps/read-model-builder/internal/projection/telemetry_projection.go
apps/read-model-builder/internal/projection/telemetry_projection_integration_test.go
apps/saga-orchestrator/internal/consumer/kafka_consumer.go
apps/saga-orchestrator/internal/health/handler.go
apps/saga-orchestrator/internal/recovery/recovery_worker.go
apps/saga-orchestrator/internal/repository/postgres_saga_repository.go
apps/telemetry-service/cmd/main.go
apps/telemetry-service/internal/application/create_device.go
apps/telemetry-service/internal/application/record_telemetry.go
apps/telemetry-service/internal/worker/outbox_worker.go
go.mod
libs/health/health.go
libs/migrate/migrate.go
libs/observability/metrics.go

coderabbitai · 2026-03-28T02:38:05Z

      - name: Run Trivy vulnerability scan
        continue-on-error: true
-        uses: aquasecurity/trivy-action@master
+        uses: aquasecurity/trivy-action@0.33.1


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify trivy-action version 0.33.1 exists and check for newer versions gh api repos/aquasecurity/trivy-action/releases --jq '[.[] | {tag: .tag_name, published: .published_at}] | .[0:10]'

Repository: pahuldeepp/GrainGuard-

Length of output: 226

🌐 Web query:

trivy-action 0.33.1 release aquasecurity

💡 Result:

The trivy-action v0.33.1 release by Aqua Security was a legitimate version of the GitHub Action for running Trivy vulnerability scans in CI pipelines. It is referenced in the official repository README and example workflows as "aquasecurity/trivy-action@0.33.1". However, on March 19, 2026 (~17:43 UTC), the repository suffered a supply chain attack where attackers with compromised credentials force-pushed 76 out of 77 version tags (all except v0.35.0) to malicious commits using a credential stealer payload. The tag v0.33.1 was specifically force-pushed to commit SHA 91e7c2c36dcad14149d8e455b960af62a2ffb275, which contains malicious code that reads GitHub Actions runner memory and exfiltrates data. The malicious tags were active until remediation around March 20, 2026 (05:40-07:12 UTC). As of 2026-03-28, Aqua Security recommends pinning to the verified safe version v0.35.0 or specific safe commit SHAs. Avoid using v0.33.1 or other affected tags; update workflows immediately and check for exposure during the attack window.

Citations:

1: https://github.com/aquasecurity/trivy-action/releases

2: https://www.aquasec.com/blog/trivy-supply-chain-attack-what-you-need-to-know/

3: https://github.com/aquasecurity/trivy-action/tree/v0.35.0

4: https://www.stepsecurity.io/blog/trivy-compromised-a-second-time---malicious-v0-69-4-release

5: https://github.com/aquasecurity/trivy-action

6: https://noxalis-lab.io/trivy-under-attack-again-widespread-github-actions-tag-compromise-secrets/

🏁 Script executed:

# Check the workflow file and verify both lines 47 and 84 have the version pinning cat -n .github/workflows/security.yml | sed -n '40,90p'

Repository: pahuldeepp/GrainGuard-

Length of output: 2349

CRITICAL: trivy-action v0.33.1 is compromised—update to v0.35.0 immediately.

The pinned version v0.33.1 was force-pushed to malicious code during a supply chain attack on March 19, 2026. The compromised tags contain a credential stealer payload that exfiltrates data from GitHub Actions runner memory. Update both lines 47 and 84 to the verified safe version aquasecurity/trivy-action@v0.35.0 and audit any secrets potentially exposed if this workflow ran during the attack window (March 19-20, 2026).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/security.yml at line 47, Replace both occurrences of the compromised action string "uses: aquasecurity/trivy-action@0.33.1" with the safe pinned tag "uses: aquasecurity/trivy-action@v0.35.0" (update the two locations in the workflow where that exact string appears) and then trigger a secrets audit for any tokens or credentials that were available to runs during March 19–20, 2026 to rotate/revoke any possibly exposed secrets.

coderabbitai · 2026-03-28T02:38:05Z

 			if event.EventType != "telemetry.recorded" {
-				reader.CommitMessages(ctx, msg)
+				if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {
+					log.Printf("commit error after skipping event type: %v", commitErr)
+				}
 				continue
 			}


⚠️ Potential issue | 🟡 Minor

Missing skipped.Add(1) for non-matching event types.

When event.EventType != "telemetry.recorded", the message is committed and skipped but skipped.Add(1) is not called, unlike other skip paths (lines 317, 323, 340, 349). This causes undercounting in the stats output.

🔧 Proposed fix

if event.EventType != "telemetry.recorded" { + skipped.Add(1) if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil { log.Printf("commit error after skipping event type: %v", commitErr) } continue }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if event.EventType != "telemetry.recorded" {

reader.CommitMessages(ctx, msg)

if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {

log.Printf("commit error after skipping event type: %v", commitErr)

}

continue

}

if event.EventType != "telemetry.recorded" {

skipped.Add(1)

if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {

log.Printf("commit error after skipping event type: %v", commitErr)

}

continue

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/cassandra-writer/main.go` around lines 330 - 335, In the branch that handles non-matching event types (when event.EventType != "telemetry.recorded"), increment the skip counter by calling skipped.Add(1) before continuing; specifically, update the block that commits the message via reader.CommitMessages(ctx, msg) to also call skipped.Add(1) (ensure this happens regardless of commitErr) so the skip path for this condition matches the other skip paths (which call skipped.Add(1)).

coderabbitai · 2026-03-28T02:38:05Z

+	return &http.Server{
+		Addr:              addr,
+		Handler:           mux,
+		ReadHeaderTimeout: 5 * time.Second,
+	}


🧹 Nitpick | 🔵 Trivial

Consider adding additional server timeouts for defense-in-depth.

While ReadHeaderTimeout protects against Slowloris attacks on header reads, the server lacks ReadTimeout, WriteTimeout, and IdleTimeout. For a health check server with minimal traffic, this is lower risk, but adding these provides better protection against resource exhaustion.

♻️ Suggested timeout configuration

return &http.Server{ Addr: addr, Handler: mux, ReadHeaderTimeout: 5 * time.Second, + ReadTimeout: 10 * time.Second, + WriteTimeout: 10 * time.Second, + IdleTimeout: 60 * time.Second, }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

return &http.Server{

Addr: addr,

Handler: mux,

ReadHeaderTimeout: 5 * time.Second,

}

return &http.Server{

Addr: addr,

Handler: mux,

ReadHeaderTimeout: 5 * time.Second,

ReadTimeout: 10 * time.Second,

WriteTimeout: 10 * time.Second,

IdleTimeout: 60 * time.Second,

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/saga-orchestrator/internal/health/handler.go` around lines 59 - 63, The http.Server returned in the health server (the return block creating &http.Server with Addr, Handler, ReadHeaderTimeout) only sets ReadHeaderTimeout; add ReadTimeout, WriteTimeout, and IdleTimeout on that same &http.Server to harden against resource exhaustion (e.g., set small ReadTimeout/WriteTimeout like a few seconds and an IdleTimeout for keep-alives); update the return in handler.go where the server is constructed so the server config includes ReadTimeout, WriteTimeout, and IdleTimeout alongside ReadHeaderTimeout.

coderabbitai · 2026-03-28T02:38:06Z

+	defer func() {
+		_ = tx.Rollback(ctx)
+	}()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n apps/telemetry-service/internal/application/record_telemetry.go | head -100

Repository: pahuldeepp/GrainGuard-

Length of output: 3251

🏁 Script executed:

cat -n apps/telemetry-service/internal/application/record_telemetry.go | sed -n '100,130p'

Repository: pahuldeepp/GrainGuard-

Length of output: 423

Use robust rollback handling instead of swallowing rollback failures.

The defer rollback at lines 73–75 ignores rollback errors and uses the request ctx; if the request context is cancelled, cleanup may fail silently without any logging or observability.

🔧 Proposed fix

import ( "context" "encoding/json" + "errors" "fmt" "time" "github.com/google/uuid" + "github.com/jackc/pgx/v5" "github.com/jackc/pgx/v5/pgxpool" @@ - defer func() { - _ = tx.Rollback(ctx) - }() + defer func() { + rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) { + // replace with project logger if available + fmt.Printf("tx rollback error: %v\n", rbErr) + } + }()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

defer func() {

_ = tx.Rollback(ctx)

}()

defer func() {

rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

defer cancel()

if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {

// replace with project logger if available

fmt.Printf("tx rollback error: %v\n", rbErr)

}

}()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/telemetry-service/internal/application/record_telemetry.go` around lines 73 - 75, The defer currently swallows tx.Rollback(ctx) and uses the request ctx which may be cancelled; change the defer to call tx.Rollback using a non-request cleanup context (e.g., context.Background() or a short-lived context with timeout) and check its error, logging any failure instead of discarding it. Locate the rollback defer around tx.Rollback and replace it with a guarded defer that does something like: create a cleanupCtx, if err := tx.Rollback(cleanupCtx); err != nil && err != sql.ErrTxDone { logger.Errorf("tx rollback failed: %v", err) } so rollback failures are observed and the request context cancellation cannot silently prevent cleanup.

coderabbitai · 2026-03-28T02:38:06Z

+	defer func() {
+		_ = tx.Rollback(ctx)
+	}()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash rg -nP --type=go 'defer\s+func\(\)\s*\{\s*_\s*=\s*tx\.Rollback\(ctx\)\s*\}\(\)'

Repository: pahuldeepp/GrainGuard-

Length of output: 48

🏁 Script executed:

find . -path "*apps/telemetry-service/internal/worker/outbox_worker.go" -type f

Repository: pahuldeepp/GrainGuard-

Length of output: 123

🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '260,280p'

Repository: pahuldeepp/GrainGuard-

Length of output: 647

🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '280,330p'

Repository: pahuldeepp/GrainGuard-

Length of output: 1334

🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '330,370p'

Repository: pahuldeepp/GrainGuard-

Length of output: 973

🏁 Script executed:

find ./apps/telemetry-service -name "*.go" -type f -exec grep -l "tx.Rollback" {} \;

Repository: pahuldeepp/GrainGuard-

Length of output: 252

🏁 Script executed:

rg -nP --type=go 'defer.*tx\.Rollback' ./apps/telemetry-service

Repository: pahuldeepp/GrainGuard-

Length of output: 48

🏁 Script executed:

grep -n "Rollback" ./apps/telemetry-service/internal/application/create_device.go ./apps/telemetry-service/internal/application/record_telemetry.go ./apps/telemetry-service/internal/worker/outbox_worker.go

Repository: pahuldeepp/GrainGuard-

Length of output: 331

🏁 Script executed:

cat -n ./apps/telemetry-service/internal/application/create_device.go | sed -n '50,70p'

Repository: pahuldeepp/GrainGuard-

Length of output: 714

🏁 Script executed:

cat -n ./apps/telemetry-service/internal/application/record_telemetry.go | sed -n '65,85p'

Repository: pahuldeepp/GrainGuard-

Length of output: 606

Handle rollback errors explicitly and avoid canceled request context for rollback.

At line 269, rollback errors are silently discarded with _, and the rollback uses ctx which may already be canceled at error boundaries. This can mask transaction cleanup failures and is a systemic pattern across device creation, telemetry recording, and outbox publishing paths in the service.

🔧 Proposed fix

import ( "context" "encoding/json" + "errors" "fmt" "log" "os" "strings" "time" + "github.com/jackc/pgx/v5" "github.com/jackc/pgx/v5/pgxpool" "github.com/segmentio/kafka-go" "google.golang.org/protobuf/proto" @@ - defer func() { - _ = tx.Rollback(ctx) - }() + defer func() { + rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second) + defer cancel() + if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) { + log.Println("tx rollback error:", rbErr) + } + }()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

defer func() {

_ = tx.Rollback(ctx)

}()

import (

"context"

"encoding/json"

"errors"

"fmt"

"log"

"os"

"strings"

"time"

"github.com/jackc/pgx/v5"

"github.com/jackc/pgx/v5/pgxpool"

"github.com/segmentio/kafka-go"

"google.golang.org/protobuf/proto"

)

// ... (other code)

defer func() {

rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)

defer cancel()

if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {

log.Println("tx rollback error:", rbErr)

}

}()

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@apps/telemetry-service/internal/worker/outbox_worker.go` around lines 268 - 270, The deferred rollback currently discards errors and uses the incoming ctx which may be canceled; instead update the defer that calls tx.Rollback(ctx) to call Rollback with a fresh non-canceled context (e.g., context.Background() or a short timeout context) and check the returned error—log it or wrap it (do not ignore) unless it is sql.ErrTxDone; locate the defer around tx.Rollback in outbox_worker.go (the tx variable and its deferred rollback) and replace the silent discard `_ = tx.Rollback(ctx)` with explicit error handling that distinguishes a harmless "transaction already finished" error from real rollback failures and records the failure via the component logger/error reporter used in this file.

HIGH PRIORITY: - #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual - #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods) - #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying MEDIUM PRIORITY: - #11: Add webhook idempotency check (dedup by endpoint_id + event_type) - #12: Assert Stripe webhook body is Buffer before signature verification - #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query) - #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results - #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow LOW PRIORITY: - #20: Add 10s query timeout to all saga repository DB operations - #23: Add IsValidStatus validator for saga status constants - #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF - #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers - #7: Fix remaining saga JSON marshal error check (initialErr) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

pahuldeepp added 2 commits March 27, 2026 21:15

fix(phase1): complete hardening baseline

cbb3218

fix(ci): resolve hosted phase1 drift

d4be229

fix(ci): unblock final lint and trivy jobs

5451012

pahuldeepp merged commit 21f76e8 into master Mar 28, 2026
58 checks passed

coderabbitai Bot requested changes Mar 28, 2026

View reviewed changes

pahuldeepp deleted the codex/phase1-hardening branch March 28, 2026 03:39

This was referenced Mar 28, 2026

fix(ci): golangci-lint v2, SQL injection fix, BFF/dashboard lint cleanup #10

Merged

feat: platform features — SSO, billing, plan enforcement, audit log, chaos tests, Redis cluster #14

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Codex/phase1 hardening#8

Codex/phase1 hardening#8
pahuldeepp merged 3 commits intomasterfrom
codex/phase1-hardening

pahuldeepp commented Mar 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 28, 2026 •

edited

Loading

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 28, 2026

Uh oh!

coderabbitai Bot Mar 28, 2026

Uh oh!

coderabbitai Bot Mar 28, 2026

Uh oh!

coderabbitai Bot Mar 28, 2026

Uh oh!

coderabbitai Bot Mar 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-	defer func() {
-		_ = tx.Rollback(ctx)
-	}()
+	defer func() {
+		rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
+			// replace with project logger if available
+			fmt.Printf("tx rollback error: %v\n", rbErr)
+		}
+	}()

-	defer func() {
-		_ = tx.Rollback(ctx)
-	}()
+import (
+	"context"
+	"encoding/json"
+	"errors"
+	"fmt"
+	"log"
+	"os"
+	"strings"
+	"time"
+	"github.com/jackc/pgx/v5"
+	"github.com/jackc/pgx/v5/pgxpool"
+	"github.com/segmentio/kafka-go"
+	"google.golang.org/protobuf/proto"
+)
+// ... (other code)
+	defer func() {
+		rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
+			log.Println("tx rollback error:", rbErr)
+		}
+	}()

Conversation

pahuldeepp commented Mar 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pahuldeepp commented Mar 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 28, 2026 •

edited

Loading