Skip to content

Codex/phase1 hardening#8

Merged
pahuldeepp merged 3 commits intomasterfrom
codex/phase1-hardening
Mar 28, 2026
Merged

Codex/phase1 hardening#8
pahuldeepp merged 3 commits intomasterfrom
codex/phase1-hardening

Conversation

@pahuldeepp
Copy link
Copy Markdown
Owner

@pahuldeepp pahuldeepp commented Mar 28, 2026

Summary by CodeRabbit

  • Bug Fixes

    • More robust error handling and explicit logging around transaction commit/rollback and message processing.
    • Added read header timeouts to internal HTTP servers to improve resiliency.
  • Chores

    • CI/workflow updates (JS Node setting, linting job tweaks, security scan changes).
    • Bumped several Go module dependencies and security scanning tool versions.
  • Refactor

    • Formatting and minor API-internal cleanups across services.
    • Small performance/memory optimizations in ingest handling.
  • Tests

    • Test container cleanup made explicit.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Mar 28, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3bea64c6-f583-4c70-95d7-6affe61398d4

📥 Commits

Reviewing files that changed from the base of the PR and between d4be229 and 5451012.

📒 Files selected for processing (2)
  • .github/workflows/ci.yml
  • .github/workflows/security.yml

📝 Walkthrough

Walkthrough

Workflow and CI tweaks, Go linter and module version bumps, multiple Go files refactored to ignore rollback errors and improve error-variable scoping, HTTP servers given ReadHeaderTimeout, minor runtime/formatting fixes, and targeted runtime/behavior adjustments in several services.

Changes

Cohort / File(s) Summary
GitHub Actions Workflows
.github/workflows/ci.yml, .github/workflows/security.yml
Added FORCE_JAVASCRIPT_ACTIONS_TO_NODE24 env var; changed golangci-lint step to run a specific go module invocation; renamed and conditionalized TS checks matrix; pinned aquasecurity/trivy-action to v0.33.1; changed Trivy build context to repository root.
Linter & Module Configuration
.golangci.yml, go.mod
Removed exportloopref linter, added gocritic disabled checks and gosec G404 exclusion; bumped github.com/redis/go-redis/v9, google.golang.org/grpc, and several golang.org/x/* indirect deps.
Transaction Rollback & Error Scoping
apps/read-model-builder/.../device_projection.go, apps/read-model-builder/.../telemetry_projection.go, apps/read-model-builder/.../telemetry_projection_integration_test.go, apps/telemetry-service/internal/application/create_device.go, apps/telemetry-service/internal/application/record_telemetry.go, apps/telemetry-service/internal/worker/outbox_worker.go
Replaced direct defer tx.Rollback(...) with deferred closures that explicitly ignore rollback errors; added scoped error variables (e.g., commitErr, scanErr, saveErr) to avoid shadowing/overwrites.
HTTP Server & I/O handling
apps/saga-orchestrator/internal/health/handler.go, apps/telemetry-service/cmd/main.go, libs/health/health.go, libs/observability/metrics.go
Switched from http.ListenAndServe to http.Server instances and added ReadHeaderTimeout: 5 * time.Second; made write/encode return values explicitly discarded (_, _ = / _ =).
Kafka, Consumer & Repository formatting
apps/saga-orchestrator/internal/consumer/kafka_consumer.go, apps/saga-orchestrator/internal/repository/postgres_saga_repository.go
Removed leading UTF-8 BOM from package declaration and normalized formatting/indentation (tabs) — no logic changes.
Service-specific runtime fixes
apps/cassandra-writer/main.go, apps/ingest-service/main.go, apps/saga-orchestrator/internal/recovery/recovery_worker.go
Refactored Kafka message error handling to use unmarshalErr and log commit errors; added getenvInt32; changed body pool to store *[]byte; explicit discard of write returns; skipped recovery actions for specific saga statuses.
Migration & utilities formatting
libs/migrate/migrate.go, libs/migrate/...
Reformatted with tabs, simplified DSN construction logic, and renamed migration error variable to migrateErr.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Codex/phase1 hardening #8: Overlapping multi-file changes (workflows, .golangci.yml, go.mod, and numerous Go sources) that appear to apply the same rollback-ignore, server timeout, and CI adjustments.
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 13.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'Codex/phase1 hardening' is vague and does not clearly summarize the main changes in the pull request, which include linter updates, workflow configuration changes, error handling improvements, and dependency updates across multiple files. Use a more specific title that captures the primary objective, such as 'Refactor error handling and improve linting configuration' or 'Update workflows and dependencies for Phase 1 hardening'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/phase1-hardening

Comment @coderabbitai help to get the list of available commands and usage tips.

@pahuldeepp pahuldeepp merged commit 21f76e8 into master Mar 28, 2026
58 checks passed
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
.github/workflows/ci.yml (1)

97-99: 🧹 Nitpick | 🔵 Trivial

Typecheck fallback causes inconsistent behavior and unnecessary full builds.

The fallback npm run typecheck || npm run build produces different results per app:

  • gateway: Has typecheck script → runs tsc --noEmit (lightweight)
  • bff: No typecheck script → falls back to tsc (full build)
  • jobs-worker: No typecheck script → falls back to tsc (full build)
  • dashboard: No typecheck script → falls back to tsc -b && vite build (heavy, includes full Vite bundle)

For bff, jobs-worker, and especially dashboard, the fallback runs unnecessary full builds instead of type checking alone, increasing CI time. Add explicit typecheck scripts to each app for consistent and faster CI runs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci.yml around lines 97 - 99, The CI step named "Typecheck"
currently falls back to running a full build when an app lacks a typecheck
script; add explicit "typecheck" scripts to the package.json of the bff,
jobs-worker, and dashboard apps so the workflow's Typecheck step runs
lightweight type checks only. For each app add a "typecheck" script that invokes
TypeScript without emitting (e.g., "tsc --noEmit") or, if the repo uses
composite/project refs, "tsc -b --noEmit" for that project; ensure the script
names are exactly "typecheck" so the existing workflow step (Typecheck) picks
them up and no changes to .github/workflows/ci.yml are required.
apps/read-model-builder/internal/projection/telemetry_projection.go (2)

185-190: ⚠️ Potential issue | 🟡 Minor

Remove duplicate redisClient nil check.

Lines 185-190 contain a duplicate nil check. The second check at line 188 is dead code since if redisClient == nil, the function already returned at line 186.

🧹 Proposed fix
 		if redisClient == nil {
 			return nil
 		}
-		if redisClient == nil {
-			return nil
-		}
 		pipe := redisClient.Pipeline()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 185 - 190, Remove the duplicate nil check for redisClient in
telemetry_projection.go: locate the function/block that contains the two
consecutive "if redisClient == nil { return nil }" checks and delete the second
one so there is only a single guard that returns when redisClient is nil
(preserve the first check and surrounding logic unchanged).

305-311: 🧹 Nitpick | 🔵 Trivial

Missing rows.Err() check after first rows.Close().

Per coding guidelines, rows.Err() should be checked after rows.Close(). The second query result (lines 377-387) correctly checks rows.Err(), but this first iteration does not. While the impact is lower here (only scanning event IDs), it's good practice to check consistently.

♻️ Proposed fix
 		for rows.Next() {
 			var id string
 			if scanErr := rows.Scan(&id); scanErr == nil {
 				newEventIDs[id] = struct{}{}
 			}
 		}
 		rows.Close()
+		if err := rows.Err(); err != nil {
+			observability.EventsRetry.Inc()
+			return err
+		}

As per coding guidelines: "Check pgx rows.Err() after rows.Close()."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go` around
lines 305 - 311, The rows returned for populating newEventIDs are not checking
rows.Err() after rows.Close(); update the block that iterates rows.Next()
filling newEventIDs to call rows.Close() then check if err := rows.Err(); handle
that error the same way as the later query (e.g., return the error or log and
exit), ensuring you reference the same variables (rows, newEventIDs) and follow
the error-handling pattern used in the second query that already checks
rows.Err().
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 30-32: The CI currently pairs golangci/golangci-lint-action@v8
with golangci-lint version v1.62.0 which is incompatible; update the workflow to
either change the action to a v6 or v7 tag (e.g.,
golangci/golangci-lint-action@v7) to keep using golangci-lint v1.x, or keep
golangci/golangci-lint-action@v8 and upgrade the referenced golangci-lint
`version:` to a v2.x release (e.g., v2.1.0+) so the action and linter versions
are compatible.

In @.github/workflows/security.yml:
- Line 47: Replace both occurrences of the compromised action string "uses:
aquasecurity/trivy-action@0.33.1" with the safe pinned tag "uses:
aquasecurity/trivy-action@v0.35.0" (update the two locations in the workflow
where that exact string appears) and then trigger a secrets audit for any tokens
or credentials that were available to runs during March 19–20, 2026 to
rotate/revoke any possibly exposed secrets.

In `@apps/cassandra-writer/main.go`:
- Around line 330-335: In the branch that handles non-matching event types (when
event.EventType != "telemetry.recorded"), increment the skip counter by calling
skipped.Add(1) before continuing; specifically, update the block that commits
the message via reader.CommitMessages(ctx, msg) to also call skipped.Add(1)
(ensure this happens regardless of commitErr) so the skip path for this
condition matches the other skip paths (which call skipped.Add(1)).

In `@apps/saga-orchestrator/internal/health/handler.go`:
- Around line 59-63: The http.Server returned in the health server (the return
block creating &http.Server with Addr, Handler, ReadHeaderTimeout) only sets
ReadHeaderTimeout; add ReadTimeout, WriteTimeout, and IdleTimeout on that same
&http.Server to harden against resource exhaustion (e.g., set small
ReadTimeout/WriteTimeout like a few seconds and an IdleTimeout for keep-alives);
update the return in handler.go where the server is constructed so the server
config includes ReadTimeout, WriteTimeout, and IdleTimeout alongside
ReadHeaderTimeout.

In `@apps/telemetry-service/internal/application/record_telemetry.go`:
- Around line 73-75: The defer currently swallows tx.Rollback(ctx) and uses the
request ctx which may be cancelled; change the defer to call tx.Rollback using a
non-request cleanup context (e.g., context.Background() or a short-lived context
with timeout) and check its error, logging any failure instead of discarding it.
Locate the rollback defer around tx.Rollback and replace it with a guarded defer
that does something like: create a cleanupCtx, if err :=
tx.Rollback(cleanupCtx); err != nil && err != sql.ErrTxDone { logger.Errorf("tx
rollback failed: %v", err) } so rollback failures are observed and the request
context cancellation cannot silently prevent cleanup.

In `@apps/telemetry-service/internal/worker/outbox_worker.go`:
- Around line 268-270: The deferred rollback currently discards errors and uses
the incoming ctx which may be canceled; instead update the defer that calls
tx.Rollback(ctx) to call Rollback with a fresh non-canceled context (e.g.,
context.Background() or a short timeout context) and check the returned
error—log it or wrap it (do not ignore) unless it is sql.ErrTxDone; locate the
defer around tx.Rollback in outbox_worker.go (the tx variable and its deferred
rollback) and replace the silent discard `_ = tx.Rollback(ctx)` with explicit
error handling that distinguishes a harmless "transaction already finished"
error from real rollback failures and records the failure via the component
logger/error reporter used in this file.

---

Outside diff comments:
In @.github/workflows/ci.yml:
- Around line 97-99: The CI step named "Typecheck" currently falls back to
running a full build when an app lacks a typecheck script; add explicit
"typecheck" scripts to the package.json of the bff, jobs-worker, and dashboard
apps so the workflow's Typecheck step runs lightweight type checks only. For
each app add a "typecheck" script that invokes TypeScript without emitting
(e.g., "tsc --noEmit") or, if the repo uses composite/project refs, "tsc -b
--noEmit" for that project; ensure the script names are exactly "typecheck" so
the existing workflow step (Typecheck) picks them up and no changes to
.github/workflows/ci.yml are required.

In `@apps/read-model-builder/internal/projection/telemetry_projection.go`:
- Around line 185-190: Remove the duplicate nil check for redisClient in
telemetry_projection.go: locate the function/block that contains the two
consecutive "if redisClient == nil { return nil }" checks and delete the second
one so there is only a single guard that returns when redisClient is nil
(preserve the first check and surrounding logic unchanged).
- Around line 305-311: The rows returned for populating newEventIDs are not
checking rows.Err() after rows.Close(); update the block that iterates
rows.Next() filling newEventIDs to call rows.Close() then check if err :=
rows.Err(); handle that error the same way as the later query (e.g., return the
error or log and exit), ensuring you reference the same variables (rows,
newEventIDs) and follow the error-handling pattern used in the second query that
already checks rows.Err().
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 98f1c8d9-4e4c-4385-ab24-c541daf76db3

📥 Commits

Reviewing files that changed from the base of the PR and between ce306f8 and d4be229.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (20)
  • .github/workflows/ci.yml
  • .github/workflows/security.yml
  • .golangci.yml
  • apps/cassandra-writer/main.go
  • apps/ingest-service/main.go
  • apps/read-model-builder/internal/projection/device_projection.go
  • apps/read-model-builder/internal/projection/telemetry_projection.go
  • apps/read-model-builder/internal/projection/telemetry_projection_integration_test.go
  • apps/saga-orchestrator/internal/consumer/kafka_consumer.go
  • apps/saga-orchestrator/internal/health/handler.go
  • apps/saga-orchestrator/internal/recovery/recovery_worker.go
  • apps/saga-orchestrator/internal/repository/postgres_saga_repository.go
  • apps/telemetry-service/cmd/main.go
  • apps/telemetry-service/internal/application/create_device.go
  • apps/telemetry-service/internal/application/record_telemetry.go
  • apps/telemetry-service/internal/worker/outbox_worker.go
  • go.mod
  • libs/health/health.go
  • libs/migrate/migrate.go
  • libs/observability/metrics.go

Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/security.yml Outdated
- name: Run Trivy vulnerability scan
continue-on-error: true
uses: aquasecurity/trivy-action@master
uses: aquasecurity/trivy-action@0.33.1
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify trivy-action version 0.33.1 exists and check for newer versions
gh api repos/aquasecurity/trivy-action/releases --jq '[.[] | {tag: .tag_name, published: .published_at}] | .[0:10]'

Repository: pahuldeepp/GrainGuard-

Length of output: 226


🌐 Web query:

trivy-action 0.33.1 release aquasecurity

💡 Result:

The trivy-action v0.33.1 release by Aqua Security was a legitimate version of the GitHub Action for running Trivy vulnerability scans in CI pipelines. It is referenced in the official repository README and example workflows as "aquasecurity/trivy-action@0.33.1". However, on March 19, 2026 (~17:43 UTC), the repository suffered a supply chain attack where attackers with compromised credentials force-pushed 76 out of 77 version tags (all except v0.35.0) to malicious commits using a credential stealer payload. The tag v0.33.1 was specifically force-pushed to commit SHA 91e7c2c36dcad14149d8e455b960af62a2ffb275, which contains malicious code that reads GitHub Actions runner memory and exfiltrates data. The malicious tags were active until remediation around March 20, 2026 (05:40-07:12 UTC). As of 2026-03-28, Aqua Security recommends pinning to the verified safe version v0.35.0 or specific safe commit SHAs. Avoid using v0.33.1 or other affected tags; update workflows immediately and check for exposure during the attack window.

Citations:


🏁 Script executed:

# Check the workflow file and verify both lines 47 and 84 have the version pinning
cat -n .github/workflows/security.yml | sed -n '40,90p'

Repository: pahuldeepp/GrainGuard-

Length of output: 2349


CRITICAL: trivy-action v0.33.1 is compromised—update to v0.35.0 immediately.

The pinned version v0.33.1 was force-pushed to malicious code during a supply chain attack on March 19, 2026. The compromised tags contain a credential stealer payload that exfiltrates data from GitHub Actions runner memory. Update both lines 47 and 84 to the verified safe version aquasecurity/trivy-action@v0.35.0 and audit any secrets potentially exposed if this workflow ran during the attack window (March 19-20, 2026).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/security.yml at line 47, Replace both occurrences of the
compromised action string "uses: aquasecurity/trivy-action@0.33.1" with the safe
pinned tag "uses: aquasecurity/trivy-action@v0.35.0" (update the two locations
in the workflow where that exact string appears) and then trigger a secrets
audit for any tokens or credentials that were available to runs during March
19–20, 2026 to rotate/revoke any possibly exposed secrets.

Comment on lines 330 to 335
if event.EventType != "telemetry.recorded" {
reader.CommitMessages(ctx, msg)
if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {
log.Printf("commit error after skipping event type: %v", commitErr)
}
continue
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Missing skipped.Add(1) for non-matching event types.

When event.EventType != "telemetry.recorded", the message is committed and skipped but skipped.Add(1) is not called, unlike other skip paths (lines 317, 323, 340, 349). This causes undercounting in the stats output.

🔧 Proposed fix
 		if event.EventType != "telemetry.recorded" {
+			skipped.Add(1)
 			if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {
 				log.Printf("commit error after skipping event type: %v", commitErr)
 			}
 			continue
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if event.EventType != "telemetry.recorded" {
reader.CommitMessages(ctx, msg)
if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {
log.Printf("commit error after skipping event type: %v", commitErr)
}
continue
}
if event.EventType != "telemetry.recorded" {
skipped.Add(1)
if commitErr := reader.CommitMessages(ctx, msg); commitErr != nil {
log.Printf("commit error after skipping event type: %v", commitErr)
}
continue
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/cassandra-writer/main.go` around lines 330 - 335, In the branch that
handles non-matching event types (when event.EventType != "telemetry.recorded"),
increment the skip counter by calling skipped.Add(1) before continuing;
specifically, update the block that commits the message via
reader.CommitMessages(ctx, msg) to also call skipped.Add(1) (ensure this happens
regardless of commitErr) so the skip path for this condition matches the other
skip paths (which call skipped.Add(1)).

Comment on lines +59 to +63
return &http.Server{
Addr: addr,
Handler: mux,
ReadHeaderTimeout: 5 * time.Second,
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider adding additional server timeouts for defense-in-depth.

While ReadHeaderTimeout protects against Slowloris attacks on header reads, the server lacks ReadTimeout, WriteTimeout, and IdleTimeout. For a health check server with minimal traffic, this is lower risk, but adding these provides better protection against resource exhaustion.

♻️ Suggested timeout configuration
 return &http.Server{
     Addr:              addr,
     Handler:           mux,
     ReadHeaderTimeout: 5 * time.Second,
+    ReadTimeout:       10 * time.Second,
+    WriteTimeout:      10 * time.Second,
+    IdleTimeout:       60 * time.Second,
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
return &http.Server{
Addr: addr,
Handler: mux,
ReadHeaderTimeout: 5 * time.Second,
}
return &http.Server{
Addr: addr,
Handler: mux,
ReadHeaderTimeout: 5 * time.Second,
ReadTimeout: 10 * time.Second,
WriteTimeout: 10 * time.Second,
IdleTimeout: 60 * time.Second,
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/saga-orchestrator/internal/health/handler.go` around lines 59 - 63, The
http.Server returned in the health server (the return block creating
&http.Server with Addr, Handler, ReadHeaderTimeout) only sets ReadHeaderTimeout;
add ReadTimeout, WriteTimeout, and IdleTimeout on that same &http.Server to
harden against resource exhaustion (e.g., set small ReadTimeout/WriteTimeout
like a few seconds and an IdleTimeout for keep-alives); update the return in
handler.go where the server is constructed so the server config includes
ReadTimeout, WriteTimeout, and IdleTimeout alongside ReadHeaderTimeout.

Comment on lines +73 to +75
defer func() {
_ = tx.Rollback(ctx)
}()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n apps/telemetry-service/internal/application/record_telemetry.go | head -100

Repository: pahuldeepp/GrainGuard-

Length of output: 3251


🏁 Script executed:

cat -n apps/telemetry-service/internal/application/record_telemetry.go | sed -n '100,130p'

Repository: pahuldeepp/GrainGuard-

Length of output: 423


Use robust rollback handling instead of swallowing rollback failures.

The defer rollback at lines 73–75 ignores rollback errors and uses the request ctx; if the request context is cancelled, cleanup may fail silently without any logging or observability.

🔧 Proposed fix
 import (
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"time"

 	"github.com/google/uuid"
+	"github.com/jackc/pgx/v5"
 	"github.com/jackc/pgx/v5/pgxpool"
@@
-	defer func() {
-		_ = tx.Rollback(ctx)
-	}()
+	defer func() {
+		rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
+			// replace with project logger if available
+			fmt.Printf("tx rollback error: %v\n", rbErr)
+		}
+	}()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
defer func() {
_ = tx.Rollback(ctx)
}()
defer func() {
rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
// replace with project logger if available
fmt.Printf("tx rollback error: %v\n", rbErr)
}
}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/telemetry-service/internal/application/record_telemetry.go` around lines
73 - 75, The defer currently swallows tx.Rollback(ctx) and uses the request ctx
which may be cancelled; change the defer to call tx.Rollback using a non-request
cleanup context (e.g., context.Background() or a short-lived context with
timeout) and check its error, logging any failure instead of discarding it.
Locate the rollback defer around tx.Rollback and replace it with a guarded defer
that does something like: create a cleanupCtx, if err :=
tx.Rollback(cleanupCtx); err != nil && err != sql.ErrTxDone { logger.Errorf("tx
rollback failed: %v", err) } so rollback failures are observed and the request
context cancellation cannot silently prevent cleanup.

Comment on lines +268 to +270
defer func() {
_ = tx.Rollback(ctx)
}()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -nP --type=go 'defer\s+func\(\)\s*\{\s*_\s*=\s*tx\.Rollback\(ctx\)\s*\}\(\)'

Repository: pahuldeepp/GrainGuard-

Length of output: 48


🏁 Script executed:

find . -path "*apps/telemetry-service/internal/worker/outbox_worker.go" -type f

Repository: pahuldeepp/GrainGuard-

Length of output: 123


🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '260,280p'

Repository: pahuldeepp/GrainGuard-

Length of output: 647


🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '280,330p'

Repository: pahuldeepp/GrainGuard-

Length of output: 1334


🏁 Script executed:

cat -n apps/telemetry-service/internal/worker/outbox_worker.go | sed -n '330,370p'

Repository: pahuldeepp/GrainGuard-

Length of output: 973


🏁 Script executed:

find ./apps/telemetry-service -name "*.go" -type f -exec grep -l "tx.Rollback" {} \;

Repository: pahuldeepp/GrainGuard-

Length of output: 252


🏁 Script executed:

rg -nP --type=go 'defer.*tx\.Rollback' ./apps/telemetry-service

Repository: pahuldeepp/GrainGuard-

Length of output: 48


🏁 Script executed:

grep -n "Rollback" ./apps/telemetry-service/internal/application/create_device.go ./apps/telemetry-service/internal/application/record_telemetry.go ./apps/telemetry-service/internal/worker/outbox_worker.go

Repository: pahuldeepp/GrainGuard-

Length of output: 331


🏁 Script executed:

cat -n ./apps/telemetry-service/internal/application/create_device.go | sed -n '50,70p'

Repository: pahuldeepp/GrainGuard-

Length of output: 714


🏁 Script executed:

cat -n ./apps/telemetry-service/internal/application/record_telemetry.go | sed -n '65,85p'

Repository: pahuldeepp/GrainGuard-

Length of output: 606


Handle rollback errors explicitly and avoid canceled request context for rollback.

At line 269, rollback errors are silently discarded with _, and the rollback uses ctx which may already be canceled at error boundaries. This can mask transaction cleanup failures and is a systemic pattern across device creation, telemetry recording, and outbox publishing paths in the service.

🔧 Proposed fix
 import (
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"log"
 	"os"
 	"strings"
 	"time"

+	"github.com/jackc/pgx/v5"
 	"github.com/jackc/pgx/v5/pgxpool"
 	"github.com/segmentio/kafka-go"
 	"google.golang.org/protobuf/proto"
@@
-	defer func() {
-		_ = tx.Rollback(ctx)
-	}()
+	defer func() {
+		rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+		defer cancel()
+		if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
+			log.Println("tx rollback error:", rbErr)
+		}
+	}()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
defer func() {
_ = tx.Rollback(ctx)
}()
import (
"context"
"encoding/json"
"errors"
"fmt"
"log"
"os"
"strings"
"time"
"github.com/jackc/pgx/v5"
"github.com/jackc/pgx/v5/pgxpool"
"github.com/segmentio/kafka-go"
"google.golang.org/protobuf/proto"
)
// ... (other code)
defer func() {
rbCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
if rbErr := tx.Rollback(rbCtx); rbErr != nil && !errors.Is(rbErr, pgx.ErrTxClosed) {
log.Println("tx rollback error:", rbErr)
}
}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/telemetry-service/internal/worker/outbox_worker.go` around lines 268 -
270, The deferred rollback currently discards errors and uses the incoming ctx
which may be canceled; instead update the defer that calls tx.Rollback(ctx) to
call Rollback with a fresh non-canceled context (e.g., context.Background() or a
short timeout context) and check the returned error—log it or wrap it (do not
ignore) unless it is sql.ErrTxDone; locate the defer around tx.Rollback in
outbox_worker.go (the tx variable and its deferred rollback) and replace the
silent discard `_ = tx.Rollback(ctx)` with explicit error handling that
distinguishes a harmless "transaction already finished" error from real rollback
failures and records the failure via the component logger/error reporter used in
this file.

@pahuldeepp pahuldeepp deleted the codex/phase1-hardening branch March 28, 2026 03:39
pahuldeepp added a commit that referenced this pull request Mar 28, 2026
HIGH PRIORITY:
- #9: Fix CSRF timing attack — pad buffers to fixed length before timingSafeEqual
- #10: Upgrade circuit breaker to distributed (Redis-backed state sharing across pods)
- #8: Fix saga recovery infinite loop — mark corrupted payloads as FAILED instead of retrying

MEDIUM PRIORITY:
- #11: Add webhook idempotency check (dedup by endpoint_id + event_type)
- #12: Assert Stripe webhook body is Buffer before signature verification
- #13: Eliminate N+1 query in deviceTelemetry resolver (single JOIN query)
- #15: Add ORDER BY + LIMIT to saga FindByCorrelationID for deterministic results
- #17: Make critical audit events (auth, admin) throw on failure instead of silent swallow

LOW PRIORITY:
- #20: Add 10s query timeout to all saga repository DB operations
- #23: Add IsValidStatus validator for saga status constants
- #24: Set httpServer.timeout (30s) and keepAliveTimeout (65s) on BFF
- #25: Add RabbitMQ heartbeat (30s) and connection error/close handlers
- #7: Fix remaining saga JSON marshal error check (initialErr)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant