
feat: subprocess SBOM creation using moby/sys/reexec + os.Pipe #333

Closed
kooomix wants to merge 6 commits into main from feature/SUB-7103-subprocess-sbom-reexec

Conversation

@kooomix
Contributor

@kooomix kooomix commented Mar 5, 2026

Summary

Isolate Syft SBOM generation in a child process so that OOM kills don't crash the kubevuln pod. Uses moby/sys/reexec for clean subprocess dispatch and os.Pipe for IPC (keeping stdout/stderr free for logging).

Key improvements over naive self-re-exec:

  • reexec.Register() in init() — no env var checks in main()
  • os.Pipe (FD 3/4) — child can log to stdout/stderr without corrupting the JSON data channel
  • Shutdown() propagates SIGTERM to active children during graceful shutdown
  • Child tracking via mutex-protected slice for safe concurrent access

Feature flag: subprocessSBOM: false (default off, zero overhead)

Changes

  • adapters/v1/subprocess_sbom.go – Full reexec implementation: parent orchestration + child handler
  • adapters/v1/subprocess_sbom_test.go – 11 tests including OOM simulation via real SIGKILL
  • adapters/v1/SUBPROCESS_SBOM.md – Architecture docs, config reference, error sentinels
  • cmd/http/main.go – InitReexec() at top + Shutdown() on graceful exit
  • config/config.go – subprocessSBOM + subprocessSBOMMemoryLimit fields
  • go.mod – adds github.com/moby/sys/reexec

Test plan

  • TestHandleChildError_OOMKilled — child SIGKILLs itself, parent returns ErrChildOOMKilled
  • TestParentSurvivesChildOOMKill — parent does work + spawns new child after OOM
  • TestHandleChildError_NonZeroExit — exit code 1 not misclassified as OOM
  • TestHandleChildError_Timeout — expired context returns ErrChildTimeout
  • TestSubprocessSBOMCreator_ChildSucceeds — valid SBOM response parsed from child
  • TestSubprocessSBOMCreator_Shutdown — no panic on empty children list
  • TestSubprocessSBOMCreator_TrackUntrack — child tracking bookkeeping
  • JSON roundtrip tests for request/response types
  • Disabled mode falls through to SyftAdapter

Jira: SUB-7103

Summary by CodeRabbit

  • New Features

    • SBOM generation can run in isolated subprocess workers with timeouts, memory limits, and signal-aware error reporting; falls back to the original path when disabled.
  • Configuration

    • New options to enable subprocess SBOM creation and set per-worker memory limits and scan timeouts.
  • Shutdown / Lifecycle

    • Subprocess workers are tracked and terminated during server graceful shutdown.
  • Tests

    • Comprehensive tests for subprocess behaviors, IPC encoding/decoding, error propagation (OOM, signals, timeouts), and lifecycle.
  • Documentation

    • Added operational guidance and investigation notes around subprocess SBOM behavior, OOM handling, and recommended mitigations.

Isolate SBOM generation in a child process to survive OOM kills.

- Use reexec.Register("sbom-worker") in init() for clean dispatch
- IPC via os.Pipe (FD 3/4), keeping stdout/stderr free for logging
- SIGKILL detection → ErrChildOOMKilled sentinel error
- RLIMIT_AS memory limit on child via env var
- SIGTERM propagation: Shutdown() forwards to tracked children
- Temp dir cleanup after child failure (stereoscope* dirs)
- Feature flag: subprocessSBOM (default false, zero overhead when off)

Signed-off-by: kooomix <eranm@armosec.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a subprocess-based SBOM creation path using reexec and IPC (pipes), with per-child temp dirs, RLIMIT_AS memory limits, timeouts, child lifecycle tracking, shutdown, tests, docs, config flags, and integration into HTTP server startup/shutdown.

Changes

Cohort / File(s) Summary
Design Doc
adapters/v1/SUBPROCESS_SBOM.md
New design doc describing subprocess SBOM creation, parent/child control flow, activation, config fields, sentinel errors, and related files.
Subprocess SBOM Implementation
adapters/v1/subprocess_sbom.go
New exported SubprocessSBOMCreator, constructor, CreateSBOM/Version/Shutdown, reexec worker registration (InitReexec), parent↔child IPC via pipes, per-child temp dirs, RLIMIT_AS memory limit, timeout handling, child tracking/cleanup, and sentinel errors (ErrChildOOMKilled, ErrChildSignaled, ErrChildTimeout).
Tests
adapters/v1/subprocess_sbom_test.go
Extensive tests for JSON request/response round-trips, child behaviors (OOM/signal/exit/timeout), subprocess integration, lifecycle tracking, Shutdown semantics, and disabled-fallback behavior.
Integration (HTTP / Main)
cmd/http/main.go
Adds early reexec path, wraps Syft adapter with NewSubprocessSBOMCreator(...), logs subprocess config when enabled, and calls Shutdown() on SBOM adapter during server shutdown.
Configuration
config/config.go
Adds SubprocessSBOM bool and SubprocessSBOMMemoryLimit int64 to config with defaults; reorders struct fields accordingly.
Deps
go.mod
Adds dependency github.com/moby/sys/reexec v0.1.0 for reexec worker support.
OOM Investigation / Fix Docs
oom-investigation-findings.md, oom-killer-fix.md, reproducing-oom.md
New documentation describing OOM investigation, reproduction steps, and proposed fixes (oom_score_adj strategies, monitoring, and deployment options).

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Parent as Parent Process (Server)
    participant PipeReq as Pipe FD (req)
    participant PipeResp as Pipe FD (resp)
    participant Child as Child Process (reexec worker)
    participant Syft as SyftAdapter

    Parent->>Parent: Marshal sbomWorkerRequest (JSON)
    Parent->>Parent: Create temp dir, create two pipes (req/resp)
    Parent->>Child: Spawn reexec child (ExtraFiles -> FD, env memlimit)
    Parent->>PipeReq: Write request JSON to child's FD (req)
    Parent->>Parent: Wait for response with timeout

    Child->>Child: Apply RLIMIT_AS from env (optional)
    Child->>PipeReq: Read request JSON from FD (req)
    Child->>Syft: Call CreateSBOM(...)
    alt Success
        Syft-->>Child: SBOM result
        Child->>PipeResp: Write sbomWorkerResponse JSON to FD (resp)
        PipeResp-->>Parent: Parent reads and unmarshals response
        Parent->>Parent: Return SBOM to caller
    else Child error / non-zero exit / signal
        Child-->>Parent: Exit with code/signal
        Parent->>Parent: Detect OOM/signal/exit -> map to sentinel error
    else Parent timeout
        Parent->>Child: Send SIGTERM (then SIGKILL if needed)
        Parent->>Parent: Return ErrChildTimeout
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped through pipes and tiny forks,

spawned a child to mind the SBOM corks.
With JSON crumbs and a memory string,
I time the hops and guard the ring.
Hooray — tidy SBOMs ready to spring!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.36% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly and clearly describes the main technical change: implementing subprocess SBOM creation using specific technologies (moby/sys/reexec and os.Pipe), which aligns with the core objective of isolating SBOM generation in a child process.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (2)
config/config.go (1)

49-60: Validate subprocessSBOMMemoryLimit after unmarshal.

A negative value is currently accepted and silently treated as “no limit” later. It’s safer to fail fast on invalid config.

🔧 Proposed patch
 import (
+	"fmt"
 	"path/filepath"
 	"time"
@@
 	var config Config
 	err = viper.Unmarshal(&config)
+	if err != nil {
+		return Config{}, err
+	}
+	if config.SubprocessSBOMMemoryLimit < 0 {
+		return Config{}, fmt.Errorf("subprocessSBOMMemoryLimit must be >= 0")
+	}
-	return config, err
+	return config, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/config.go` around lines 49 - 60, After unmarshalling into the Config
struct (the variable config produced by viper.Unmarshal in config/config.go),
validate the subprocessSBOMMemoryLimit field and fail fast: if
config.SubprocessSBOMMemoryLimit is negative, return a non-nil error (e.g.,
fmt.Errorf("invalid subprocessSBOMMemoryLimit: %d",
config.SubprocessSBOMMemoryLimit)); do this immediately before returning config
so callers receive the validation error instead of silently treating negatives
as "no limit".
adapters/v1/subprocess_sbom_test.go (1)

175-196: Timeout test bypasses the real subprocess timeout path.

This test doesn’t exercise createSBOMInSubprocess, so it won’t catch regressions in child timeout enforcement or kill/reap behavior. Add an integration case that drives the real method and asserts ErrChildTimeout.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/v1/subprocess_sbom_test.go` around lines 175 - 196, The test
TestHandleChildError_Timeout is only exercising handleChildError directly and
thus misses regressions in the real subprocess timeout path; update it (or add a
new integration test) to call SubprocessSBOMCreator.createSBOMInSubprocess
instead of only handleChildError, arrange a child context that will expire
(e.g., very short timeout) while the subprocess work is running, and assert the
returned error equals ErrChildTimeout; reference SubprocessSBOMCreator,
createSBOMInSubprocess, ErrChildTimeout and keep the existing dummy
image/workload such that the test actually drives the subprocess timeout and
validates kill/reap behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@adapters/v1/subprocess_sbom.go`:
- Around line 184-185: childCtx is created but not used to enforce timeout: wrap
the blocking io.ReadAll and cmd.Wait in a goroutine and use select on
childCtx.Done() vs a done channel so a hung child triggers timeout handling; on
childCtx.Done() call cmd.Process.Kill() (or cmd.Process.Signal) and return
ErrChildTimeout. Update the code paths that call io.ReadAll and cmd.Wait
(references: childCtx, cancel, cmd, io.ReadAll, cmd.Wait, ErrChildTimeout,
s.timeout, parentTimeoutBuffer) to perform the start/read/wait in a goroutine,
select for completion or timeout, ensure cancel() is still deferred, and
replicate the same pattern for the other similar block noted in the comment.
- Around line 294-311: cleanupOrphanedTempDirs is too broad and may delete other
processes' stereoscope* temp dirs; instead, change createSBOMInSubprocess to
create a unique per-child temp directory (use ioutil.TempDir or os.MkdirTemp),
set TMPDIR in the subprocess env to that directory, and remove only that
specific directory on subprocess exit/failure (do not scan os.TempDir for
stereo* entries). Remove or limit use of cleanupOrphanedTempDirs to only target
directories you explicitly created/tracked by the process (e.g., return/record
the temp path from createSBOMInSubprocess and use that exact path in defered
cleanup).
- Around line 260-262: The current check returns success when resp.SBOM is nil
(the block that returns domain.SBOM{}, nil), which hides protocol/worker bugs;
change this to return a non-nil error instead of an empty SBOM so callers know
the payload is missing. Locate the nil-check for resp.SBOM in subprocess_sbom.go
(the if resp.SBOM == nil branch) and replace the silent success with an error
return (include a clear message like "missing SBOM in subprocess response" and
wrap any underlying context if available) so consumers receive a failure rather
than an empty SBOM.

In `@adapters/v1/SUBPROCESS_SBOM.md`:
- Around line 18-41: The fenced diagram block at the top uses ``` without a
language and triggers MD040; update the opening fence for that ASCII diagram
(the triple-backtick that precedes the "┌────────────────..." block) to include
a language hint such as text (e.g., change ``` to ```text) so markdownlint sees
a language, leaving the closing fence unchanged and preserving the diagram
content (target the fenced code block containing the reexec/child-parent
diagram).

In `@cmd/http/main.go`:
- Around line 144-148: Reorder the shutdown sequence so the controller's queue
is drained before terminating SBOM subprocesses: call controller.Shutdown()
prior to sbomAdapter.Shutdown() to avoid sending SIGTERM to SBOM children while
queued controller jobs may still start; update the shutdown order in the
function where sbomAdapter.Shutdown() and controller.Shutdown() are invoked so
controller finishes draining before SBOM processes are killed.

---

Nitpick comments:
In `@adapters/v1/subprocess_sbom_test.go`:
- Around line 175-196: The test TestHandleChildError_Timeout is only exercising
handleChildError directly and thus misses regressions in the real subprocess
timeout path; update it (or add a new integration test) to call
SubprocessSBOMCreator.createSBOMInSubprocess instead of only handleChildError,
arrange a child context that will expire (e.g., very short timeout) while the
subprocess work is running, and assert the returned error equals
ErrChildTimeout; reference SubprocessSBOMCreator, createSBOMInSubprocess,
ErrChildTimeout and keep the existing dummy image/workload such that the test
actually drives the subprocess timeout and validates kill/reap behavior.

In `@config/config.go`:
- Around line 49-60: After unmarshalling into the Config struct (the variable
config produced by viper.Unmarshal in config/config.go), validate the
subprocessSBOMMemoryLimit field and fail fast: if
config.SubprocessSBOMMemoryLimit is negative, return a non-nil error (e.g.,
fmt.Errorf("invalid subprocessSBOMMemoryLimit: %d",
config.SubprocessSBOMMemoryLimit)); do this immediately before returning config
so callers receive the validation error instead of silently treating negatives
as "no limit".

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4d4595ab-97ff-4400-acb2-9403129c5920

📥 Commits

Reviewing files that changed from the base of the PR and between 736e887 and 1eae4ed.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (6)
  • adapters/v1/SUBPROCESS_SBOM.md
  • adapters/v1/subprocess_sbom.go
  • adapters/v1/subprocess_sbom_test.go
  • cmd/http/main.go
  • config/config.go
  • go.mod

@github-actions

github-actions Bot commented Mar 5, 2026

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

…eorder shutdown

- Use goroutine+select to enforce parent timeout on hung child (Critical)
- Return error when worker response has nil SBOM (Major)
- Create per-child TMPDIR instead of broad stereoscope* cleanup (Major)
- Reorder shutdown: drain controller queue before killing SBOM workers (Major)
- Add language hint to markdown fenced code block (Minor)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: kooomix <eranm@armosec.io>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 735f05ac-9229-4d44-9972-d3f054dbdc35

📥 Commits

Reviewing files that changed from the base of the PR and between 1eae4ed and 47ff4a2.

📒 Files selected for processing (3)
  • adapters/v1/SUBPROCESS_SBOM.md
  • adapters/v1/subprocess_sbom.go
  • cmd/http/main.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • adapters/v1/SUBPROCESS_SBOM.md

@github-actions

github-actions Bot commented Mar 5, 2026

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

When childCtx.Done() fires, check whether it was DeadlineExceeded
(return ErrChildTimeout) or context.Canceled from parent shutdown
(return wrapped context.Canceled). This prevents misclassifying
graceful shutdown as a scan timeout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: kooomix <eranm@armosec.io>

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
adapters/v1/subprocess_sbom.go (3)

390-398: Silent failure on memory limit parse error reduces observability.

If the KUBEVULN_SBOM_WORKER_MEMLIMIT environment variable contains an invalid value, the function silently returns without applying any limit. While invalid values are unlikely (parent controls the env var), a warning message would aid debugging misconfiguration scenarios.

🔧 Suggested improvement
 func applyMemoryLimit() {
 	limitStr := os.Getenv("KUBEVULN_SBOM_WORKER_MEMLIMIT")
 	if limitStr == "" {
 		return
 	}
 	var limit int64
 	if _, err := fmt.Sscanf(limitStr, "%d", &limit); err != nil || limit <= 0 {
+		fmt.Fprintf(os.Stderr, "warning: invalid memory limit value %q, skipping\n", limitStr)
 		return
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/v1/subprocess_sbom.go` around lines 390 - 398, The applyMemoryLimit
function silently returns when parsing KUBEVULN_SBOM_WORKER_MEMLIMIT fails;
update applyMemoryLimit to log a warning via your logger (or
fmt.Println/processLogger) when fmt.Sscanf on KUBEVULN_SBOM_WORKER_MEMLIMIT
returns an error or a non-positive limit before returning so invalid env values
are observable; keep the existing early-return behavior but ensure the warning
references the env var name and the parse error/value for debugging.

167-178: Pipe file descriptors are closed twice on the success path.

The defers at lines 171 and 178 close reqReader and respWriter, but lines 229-230 also explicitly close them after cmd.Start() succeeds. While os.File.Close() is safe to call multiple times (returns EINVAL on the second call), this is a code smell that could mask real issues.

🔧 Suggested fix: nil-guard the defers
 	reqReader, reqWriter, err := os.Pipe()
 	if err != nil {
 		return domain.SBOM{}, fmt.Errorf("failed to create request pipe: %w", err)
 	}
-	defer reqReader.Close()
+	defer func() {
+		if reqReader != nil {
+			reqReader.Close()
+		}
+	}()

 	respReader, respWriter, err := os.Pipe()
 	if err != nil {
 		reqWriter.Close()
 		return domain.SBOM{}, fmt.Errorf("failed to create response pipe: %w", err)
 	}
-	defer respWriter.Close()
+	defer func() {
+		if respWriter != nil {
+			respWriter.Close()
+		}
+	}()

Then at lines 229-230:

 	// Close the parent's copy of the child's read/write ends.
 	reqReader.Close()
+	reqReader = nil
 	respWriter.Close()
+	respWriter = nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/v1/subprocess_sbom.go` around lines 167 - 178, The pipe file
descriptors (reqReader, reqWriter, respReader, respWriter) are being closed
twice because there are unconditional defers and then explicit Close calls after
cmd.Start(); update the cleanup to avoid double-closing by either (a) changing
the defers to guard with nil checks (e.g., defer func() { if reqReader != nil {
reqReader.Close() } }()), or (b) keep the defers but set the variables to nil
immediately after you explicitly Close() them post cmd.Start() so the deferred
closures become no-ops; apply this change around the reqReader/reqWriter and
respReader/respWriter usage near the cmd.Start() call in subprocess_sbom.go.

377-380: Response includes SBOM pointer even when there's an error.

When CreateSBOM fails, resp.SBOM is still set to &sbom (which may be a zero-value). While the parent correctly checks resp.Error first (line 296-298), it's cleaner to only set SBOM on success to make the contract explicit.

🔧 Suggested clarification
-	resp := sbomWorkerResponse{SBOM: &sbom}
+	var resp sbomWorkerResponse
 	if err != nil {
 		resp.Error = err.Error()
+	} else {
+		resp.SBOM = &sbom
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/v1/subprocess_sbom.go` around lines 377 - 380, The response
currently sets resp := sbomWorkerResponse{SBOM: &sbom} before checking err, so
even on CreateSBOM failure the SBOM pointer is non-nil; change the flow in the
sbom worker (where sbomWorkerResponse, resp and sbom are used) to only assign
resp.SBOM = &sbom when err == nil (or explicitly set resp.SBOM = nil when err !=
nil) after the CreateSBOM call and error check, ensuring resp.Error contains the
error string and SBOM is only populated on success.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@adapters/v1/subprocess_sbom.go`:
- Around line 390-398: The applyMemoryLimit function silently returns when
parsing KUBEVULN_SBOM_WORKER_MEMLIMIT fails; update applyMemoryLimit to log a
warning via your logger (or fmt.Println/processLogger) when fmt.Sscanf on
KUBEVULN_SBOM_WORKER_MEMLIMIT returns an error or a non-positive limit before
returning so invalid env values are observable; keep the existing early-return
behavior but ensure the warning references the env var name and the parse
error/value for debugging.
- Around line 167-178: The pipe file descriptors (reqReader, reqWriter,
respReader, respWriter) are being closed twice because there are unconditional
defers and then explicit Close calls after cmd.Start(); update the cleanup to
avoid double-closing by either (a) changing the defers to guard with nil checks
(e.g., defer func() { if reqReader != nil { reqReader.Close() } }()), or (b)
keep the defers but set the variables to nil immediately after you explicitly
Close() them post cmd.Start() so the deferred closures become no-ops; apply this
change around the reqReader/reqWriter and respReader/respWriter usage near the
cmd.Start() call in subprocess_sbom.go.
- Around line 377-380: The response currently sets resp :=
sbomWorkerResponse{SBOM: &sbom} before checking err, so even on CreateSBOM
failure the SBOM pointer is non-nil; change the flow in the sbom worker (where
sbomWorkerResponse, resp and sbom are used) to only assign resp.SBOM = &sbom
when err == nil (or explicitly set resp.SBOM = nil when err != nil) after the
CreateSBOM call and error check, ensuring resp.Error contains the error string
and SBOM is only populated on success.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9aaf533-75ba-4451-9428-3b30d56f0157

📥 Commits

Reviewing files that changed from the base of the PR and between 47ff4a2 and 07d2dca.

📒 Files selected for processing (1)
  • adapters/v1/subprocess_sbom.go

@github-actions

github-actions Bot commented Mar 5, 2026

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

… subprocess

The parent's childCtx was derived from the HTTP request context. When the
handler returned 200, the request context was canceled, immediately killing
the child process before it could do any work.

Switch to context.Background() so the subprocess has its own independent
lifecycle with timeout enforcement via the creator's configured timeout +
parentTimeoutBuffer. Graceful shutdown is handled separately via Shutdown().

Add 4 unit tests covering the pre-canceled context bug, own-timeout
enforcement, disabled-mode delegation, and generic error fallthrough.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
adapters/v1/subprocess_sbom.go (1)

171-178: Double-close of pipe file descriptors on success path.

reqReader and respWriter are closed twice: once by the defers at lines 171/178, and again explicitly at lines 229/230 after cmd.Start() succeeds. While os.File.Close() is safe to call multiple times (returns an error on second call that's discarded), this is a minor code smell.

Consider removing the defers and consolidating cleanup in a helper or restructuring so each close happens exactly once.

♻️ One possible fix pattern
 	reqReader, reqWriter, err := os.Pipe()
 	if err != nil {
 		return domain.SBOM{}, fmt.Errorf("failed to create request pipe: %w", err)
 	}
-	defer reqReader.Close()
 
 	respReader, respWriter, err := os.Pipe()
 	if err != nil {
 		reqWriter.Close()
+		reqReader.Close()
 		return domain.SBOM{}, fmt.Errorf("failed to create response pipe: %w", err)
 	}
-	defer respWriter.Close()

Then ensure all early-return error paths close the appropriate pipe ends.

Also applies to: 228-230

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@adapters/v1/subprocess_sbom.go` around lines 171 - 178, The defers closing
reqReader and respWriter cause double-closes because those same descriptors are
explicitly closed after cmd.Start() (see reqReader, respWriter, reqWriter and
the cmd.Start() success path); remove the redundant defers and consolidate pipe
cleanup into a single place (e.g., a small helper or a single defer that runs
after successful cmd.Start()) so each os.Pipe end is closed exactly once, and
ensure all early-return/error paths still close the appropriate pipe ends (close
reqWriter on pipe-creation errors, close respReader/respWriter as needed) to
avoid leaks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@adapters/v1/subprocess_sbom.go`:
- Around line 171-178: The defers closing reqReader and respWriter cause
double-closes because those same descriptors are explicitly closed after
cmd.Start() (see reqReader, respWriter, reqWriter and the cmd.Start() success
path); remove the redundant defers and consolidate pipe cleanup into a single
place (e.g., a small helper or a single defer that runs after successful
cmd.Start()) so each os.Pipe end is closed exactly once, and ensure all
early-return/error paths still close the appropriate pipe ends (close reqWriter
on pipe-creation errors, close respReader/respWriter as needed) to avoid leaks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba5e622e-c78b-4871-8499-d3d7854dab11

📥 Commits

Reviewing files that changed from the base of the PR and between 07d2dca and 91c0d46.

📒 Files selected for processing (2)
  • adapters/v1/subprocess_sbom.go
  • adapters/v1/subprocess_sbom_test.go

@github-actions

github-actions Bot commented Mar 9, 2026

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

Comment on lines +1 to +22
# Subprocess SBOM Creation

## Problem

When kubevuln scans large container images, the Syft SBOM generation step can consume
unbounded memory. If the Go runtime or the kernel OOM-killer terminates the process,
the entire kubevuln pod crashes — losing all in-flight scans and requiring a restart.

## Solution

`SubprocessSBOMCreator` isolates SBOM generation in a **child process** using
[moby/sys/reexec](https://pkg.go.dev/github.com/moby/sys/reexec). The parent
(kubevuln main process) survives even if the child is OOM-killed, and reports a
structured error instead of crashing.

## How It Works

```text
┌─────────────────────────────────────────────────────┐
│ kubevuln (parent) │
│ │
│ 1. Serialize scan request as JSON │
```
Contributor


@kooomix based on this design, will we have multiple child processes in the same time? (parallel child process scans). I am ok with the option, but to be on the safe side I wouldn't run parallel scans for now (and since it is not time sensitive)

Signed-off-by: Ben <ben@armosec.io>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
oom-killer-fix.md (1)

37-47: Show error-checked examples for /proc/*/oom_score_adj writes.

The sample code writes to procfs without handling errors. In practice these writes can fail (permissions/capabilities), so the example should model explicit error checks and logging to avoid copy-paste unsafe patterns.

Also applies to: 52-54

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@oom-killer-fix.md` around lines 37 - 47, The examples call os.WriteFile for
"/proc/self/oom_score_adj" and fmt.Sprintf("/proc/%d/oom_score_adj", childPID)
without checking errors; update both uses so the return error is captured and
handled (e.g., check err != nil and log or return it) instead of discarding it.
Locate the os.WriteFile calls (the self-adjustment and the per-child write using
childPID) and wrap each write with proper error handling: assign the error, log
a descriptive message including the target path and error, and decide whether to
continue or fail based on the context (e.g., log.Warn on inability to set child
OOM target, log.Error/exit if protecting the orchestrator must be enforced).
Ensure you reference the same os.WriteFile calls and childPID formatting when
adding the checks.
reproducing-oom.md (1)

32-39: Add a rollback step for memory limit changes.

Line 32 modifies deployment resources but the doc does not include a revert command. Adding a short rollback step will prevent accidental long-lived degraded limits in shared environments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@reproducing-oom.md` around lines 32 - 39, Add a brief rollback step after the
memory-limit change that tells operators how to revert the kubevuln Deployment
in the kubescape namespace back to its original memory limits or to undo the
change; reference the kubevuln Deployment and suggest using either kubectl
rollout undo for the Deployment or re-patching the
/spec/template/spec/containers/0/resources/limits/memory and /requests/memory
paths to their prior values so the temporary 512Mi change is not left in shared
environments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9e97ebee-3353-482d-8a72-dc33eeb4e6a4

📥 Commits

Reviewing files that changed from the base of the PR and between 91c0d46 and 3e841e9.

📒 Files selected for processing (3)
  • oom-investigation-findings.md
  • oom-killer-fix.md
  • reproducing-oom.md

Comment on lines +27 to +30
```
Pod-level cgroup: memory.oom.group = 0 (per-process selection)
Container-level cgroup: memory.oom.group = 1 (kills ALL processes together)
```

⚠️ Potential issue | 🟡 Minor

Specify fence languages for code blocks.

These fenced blocks are missing language identifiers (MD040), which will keep markdownlint failing. Please label them (text, bash, etc.).

Also applies to: 40-48, 56-58, 62-68

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 27-27: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@oom-investigation-findings.md` around lines 27 - 30, The Markdown fenced code
blocks shown (e.g., the block containing "Pod-level cgroup: memory.oom.group =
0" and the other blocks at the ranges called out) lack language identifiers
causing MD040 failures; update each triple-backtick fence to include an
appropriate language tag (for example `text` or `bash`) for the blocks at the
cited ranges (27-30, 40-48, 56-58, 62-68) so markdownlint passes and the blocks
render with correct syntax highlighting.

Comment thread oom-killer-fix.md
Comment on lines +20 to +27
The Linux kernel's OOM killer **is** correctly killing only the child process with the highest RSS. The pod restart happens because of a chain reaction:

1. Kernel OOM killer targets the sbom-worker (highest `oom_score` due to highest RSS)
2. sbom-worker is killed with SIGKILL
3. kubevuln (PID 1) either: crashes when it detects the child died, exits because it doesn't handle child death gracefully, or propagates the error and terminates
4. kubelet detects PID 1 has exited → restarts the pod per `restartPolicy`

The cgroup itself is **not** being killed as a whole — the orchestrator is simply not surviving the child's death.

⚠️ Potential issue | 🟠 Major

Root-cause guidance here contradicts current investigation findings.

This section asserts single-process OOM selection and memory.oom.group default behavior that conflicts with the findings documented in oom-investigation-findings.md (container-level group kill behavior and capability constraints). As written, this can drive engineers toward a fix path that won’t work in the current deployment model.

Also applies to: 31-58, 98-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@oom-killer-fix.md` around lines 20 - 27, The paragraph incorrectly states
that the kernel OOM selects a single process and that memory.oom.group defaults
to per-process behavior, which contradicts the investigation in
oom-investigation-findings.md showing container-level group kills and capability
constraints; update oom-killer-fix.md to remove or reword the misleading claims,
explicitly state the correct sequence (kernel kills highest-RSS process like
sbom-worker, but PID 1 behavior—kubevuln—can cause the pod to be restarted and
the cgroup to be reaped by the orchestrator) and reference memory.oom.group
behavior and capability limitations as documented in
oom-investigation-findings.md (mention sbom-worker, kubevuln, PID 1, and
oom_score) so readers are guided to the actual root cause and not toward fixes
that assume per-process OOM behavior.

@github-actions

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

Signed-off-by: Ben <ben@armosec.io>
@github-actions

Summary:

  • License scan: failure
  • Credentials scan: failure
  • Vulnerabilities scan: failure
  • Unit test: success
  • Go linting: failure

@matthyx matthyx moved this to WIP in KS PRs tracking Mar 17, 2026
@matthyx
Contributor

matthyx commented Mar 29, 2026

superseded by #335

@matthyx matthyx closed this Mar 29, 2026
@matthyx matthyx moved this from WIP to To Archive in KS PRs tracking Mar 29, 2026
@matthyx matthyx deleted the feature/SUB-7103-subprocess-sbom-reexec branch April 20, 2026 08:09
