[Security Solution] Batched Attack Discovery with hierarchical merge #257831

Closed
patrykkopycinski wants to merge 2 commits into elastic:main from patrykkopycinski:batched-attack-discovery-16182


Conversation

@patrykkopycinski
Contributor

Summary

Removes the alert count ceiling from Attack Discovery by implementing batch processing with LLM-based hierarchical merge. This enables Attack Discovery to process arbitrarily large alert sets by splitting them into manageable batches, running the existing AD graph on each batch in parallel, then consolidating discoveries across batches using a dedicated LLM merge pass.

Ref: elastic/security-team#16339 (Task 0B — Remove alert count ceiling)

Architecture

Alerts (N) → [Adaptive split into K batches] → [Parallel AD graph on each batch]
                                                         ↓
                                                  [Collect batch results]
                                                         ↓
                                                  [LLM Merge Pass: consolidate related discoveries]
                                                         ↓
                                                  [Return merged discoveries + quality metrics]

Key Components

| Module | Purpose |
| --- | --- |
| batch/split.ts | Adaptive batch sizing (context window → optimal batch size), alert splitting |
| batch/merge.ts | Hierarchical merge with LLM consolidation pass, quality metrics |
| batch/orchestrator.ts | Batch orchestration with configurable concurrency control |
| batch/types.ts | Interfaces, constants, known context window map |
| invoke_attack_discovery_graph | Routing: batched path when alert count > adaptive batch size |

Adaptive Batch Sizing

Batch size is computed from the LLM connector's context window:

available_tokens = context_window × 0.7 − 8000 (reserved for prompt/output)
batch_size = floor(available_tokens / 800 tokens per alert)
clamped to [10, 500]

Supports known model lookups (GPT-4o, Claude 3.x, Gemini) with partial matching, explicit context window override, and graceful fallback to default (50).
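
As a sketch, the sizing rule above can be expressed directly. The function and constant names here are illustrative, not the PR's actual exports from `batch/split.ts`:

```typescript
// Illustrative constants matching the formula in the description.
const MIN_BATCH_SIZE = 10;
const MAX_BATCH_SIZE = 500;
const DEFAULT_BATCH_SIZE = 50; // graceful fallback for unknown models
const RESERVED_TOKENS = 8000; // reserved for prompt/output
const TOKENS_PER_ALERT = 800;
const USABLE_FRACTION = 0.7;

// Hypothetical helper: derive a batch size from the connector's context window.
function computeBatchSize(contextWindow?: number): number {
  if (!contextWindow || contextWindow <= 0) {
    return DEFAULT_BATCH_SIZE;
  }
  const availableTokens = contextWindow * USABLE_FRACTION - RESERVED_TOKENS;
  const batchSize = Math.floor(availableTokens / TOKENS_PER_ALERT);
  // Clamp to [10, 500] so tiny or huge windows stay manageable.
  return Math.min(MAX_BATCH_SIZE, Math.max(MIN_BATCH_SIZE, batchSize));
}
```

For example, a 128k-token window (GPT-4o) yields `floor((128000 × 0.7 − 8000) / 800) = 102` alerts per batch, while a window too small to leave usable tokens clamps up to the minimum of 10.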

Hierarchical Merge Strategy

  • Single batch: No merge pass needed — direct passthrough
  • Multiple batches: LLM consolidation pass that:
    • Identifies discoveries describing the same attack across batches
    • Merges related discoveries (combines alert IDs, MITRE tactics, details)
    • Preserves genuinely distinct attacks unchanged
    • Guarantees no alert ID loss (every input alert ID appears in output)
  • Merge failure: Graceful degradation — returns unmerged discoveries with warning
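
A minimal sketch of that decision flow, assuming a simplified `Discovery` shape and a hypothetical `llmConsolidate` callback standing in for the PR's merge pass:

```typescript
// Hypothetical, simplified discovery shape for illustration.
interface Discovery {
  title: string;
  alertIds: string[];
}

// Sketch of the merge strategy: passthrough for a single batch, LLM
// consolidation for multiple batches, graceful degradation on failure.
async function mergeBatchResults(
  batches: Discovery[][],
  llmConsolidate: (all: Discovery[]) => Promise<Discovery[]>
): Promise<{ discoveries: Discovery[]; warning?: string }> {
  // Single batch: no merge pass needed, direct passthrough.
  if (batches.length <= 1) {
    return { discoveries: batches[0] ?? [] };
  }
  const flat = batches.flat();
  try {
    const merged = await llmConsolidate(flat);
    // One way to enforce the no-alert-ID-loss guarantee: if the LLM output
    // drops any input alert ID, fall back to the unmerged discoveries.
    const outputIds = new Set(merged.flatMap((d) => d.alertIds));
    for (const id of flat.flatMap((d) => d.alertIds)) {
      if (!outputIds.has(id)) {
        return { discoveries: flat, warning: `merge dropped alert ${id}; returning unmerged results` };
      }
    }
    return { discoveries: merged };
  } catch (e) {
    // Merge failure: degrade to unmerged discoveries with a warning.
    return { discoveries: flat, warning: `merge pass failed: ${e}` };
  }
}
```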

Quality Metrics

Every batched run produces MergeQualityMetrics:

  • consolidationRatio — ratio of post-merge to pre-merge discovery counts (1.0 = nothing merged, lower = more consolidation)
  • alertCoverage — ratio of alert IDs preserved after merge (should be 1.0)
  • batchesProcessed / batchesFailed — batch success tracking
  • totalDurationMs / mergeDurationMs — performance tracking
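
A plausible shape for these metrics, with the two ratio metrics computed from pre/post-merge counts; the interface mirrors the field names above, but the helper functions are illustrative assumptions:

```typescript
// Metrics emitted by every batched run (field names from the PR description).
interface MergeQualityMetrics {
  consolidationRatio: number; // 1.0 = nothing merged; lower = more consolidation
  alertCoverage: number; // ratio of alert IDs preserved after merge (should be 1.0)
  batchesProcessed: number;
  batchesFailed: number;
  totalDurationMs: number;
  mergeDurationMs: number;
}

// Hypothetical helper: post-merge discovery count over pre-merge count.
function consolidationRatio(before: number, after: number): number {
  return before === 0 ? 1 : after / before;
}

// Hypothetical helper: fraction of input alert IDs still present after merge.
function alertCoverage(inputIds: string[], outputIds: string[]): number {
  if (inputIds.length === 0) return 1;
  const out = new Set(outputIds);
  return inputIds.filter((id) => out.has(id)).length / inputIds.length;
}
```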

Error Handling

  • Individual batch failures don't block other batches (Promise.allSettled)
  • Failed batches recorded with empty discoveries and error details
  • LLM merge pass failure returns unmerged results (no data loss)
  • Empty alert retrieval returns early
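
The failure-isolation pattern above can be sketched with `Promise.allSettled` (concurrency limiting omitted for brevity); `runBatch` and the result shape are stand-ins, not the PR's actual API:

```typescript
// Per-batch result: failed batches carry empty discoveries plus error details.
interface BatchResult<T> {
  discoveries: T[];
  error?: string;
}

// Run every batch; Promise.allSettled ensures one batch's failure
// does not block or discard the results of the others.
async function runBatches<A, T>(
  batches: A[][],
  runBatch: (batch: A[]) => Promise<T[]>
): Promise<Array<BatchResult<T>>> {
  const settled = await Promise.allSettled(batches.map((b) => runBatch(b)));
  return settled.map((r) =>
    r.status === 'fulfilled'
      ? { discoveries: r.value }
      : { discoveries: [], error: String(r.reason) }
  );
}
```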

Configuration

| Parameter | Default | Description |
| --- | --- | --- |
| batchSize | Adaptive | Max alerts per batch (auto-calculated from context window) |
| maxBatches | 20 | Max batches to process (0 = unlimited) |
| concurrency | 2 | Max parallel batch executions |
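
One way these parameters could interact, as a hedged sketch (the interface and helper names are hypothetical): `maxBatches` caps how many batches are actually processed, with `0` meaning no cap.

```typescript
// Hypothetical config shape mirroring the table above.
interface BatchConfig {
  batchSize?: number; // adaptive (computed from context window) when omitted
  maxBatches: number; // 0 = unlimited
  concurrency: number; // max parallel batch executions
}

const DEFAULT_BATCH_CONFIG: BatchConfig = { maxBatches: 20, concurrency: 2 };

// Illustrative helper: how many batches a run will actually execute.
function effectiveBatchCount(
  totalAlerts: number,
  batchSize: number,
  maxBatches: number
): number {
  const needed = Math.ceil(totalAlerts / batchSize);
  return maxBatches === 0 ? needed : Math.min(needed, maxBatches);
}
```

For example, 5000 alerts at a batch size of 50 would need 100 batches, but the default `maxBatches: 20` caps the run at 20.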

Testing

  • 29 unit tests across 3 test suites:
    • split.test.ts — batch splitting, adaptive sizing, model lookup, edge cases
    • merge.test.ts — single/multi-batch merge, metrics, error handling, replacement combining
    • orchestrator.test.ts — single/multi-batch orchestration, concurrency, failure resilience

Test plan

  • All 29 unit tests pass (yarn test:jest ...batch/)
  • ESLint passes on all changed files
  • No lint errors (ReadLints)
  • Type check passes (CI)
  • Existing AD tests still pass (CI)
  • Manual test with connector: < batch size alerts → single pass (no merge)
  • Manual test with connector: > batch size alerts → batched with merge
  • Verify merge metrics logged correctly
  • Verify partial batch failure doesn't lose other batches' results

Made with Cursor

Removes the alert count ceiling from Attack Discovery by implementing
batch processing with LLM-based hierarchical merge. Large alert sets
are split into batches, processed in parallel through the existing AD
graph, then consolidated via a dedicated merge LLM pass that identifies
and combines related attacks across batches.

Key changes:
- batch/split.ts: adaptive batch sizing from LLM context window, alert splitting
- batch/merge.ts: hierarchical merge with LLM consolidation pass and quality metrics
- batch/orchestrator.ts: batch orchestration with concurrency control
- batch/types.ts: interfaces, constants, known context windows
- invoke_attack_discovery_graph: routing to batched path when alerts exceed batch size

Ref: elastic/security-team#16339
@elasticmachine
Contributor

🤖 Jobs for this PR can be triggered through checkboxes. 🚧

ℹ️ To trigger the CI, please tick the checkbox below 👇

  • Click to trigger kibana-pull-request for this PR!
  • Click to trigger kibana-deploy-project-from-pr for this PR!
  • Click to trigger kibana-deploy-cloud-from-pr for this PR!
  • Click to trigger kibana-entity-store-performance-from-pr for this PR!
  • Click to trigger kibana-storybooks-from-pr for this PR!

@patrykkopycinski
Contributor Author

/ci

1 similar comment
@patrykkopycinski
Contributor Author

/ci

…known[]

Fixes a TS2322 error caused by spreading unknown[] tracers into a callbacks
parameter that expects (BaseCallbackHandler | BaseCallbackHandlerMethodsClass)[].
@patrykkopycinski
Contributor Author

/ci

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Metrics [docs]

✅ unchanged

