Conversation

@w0rk3r
Contributor

@w0rk3r w0rk3r commented Dec 26, 2025

Proposed commit message

windows: refine PowerShell script entropy pipeline

Replace code-point HashMap counting with a fixed 65k UTF-16 char histogram
and skip truncated signature fragments before entropy is computed. Add a
normalized entropy field scaled by script length (0–1).

Summary

Related issue:

This PR:

  • Replaces code‑point HashMap counting with a fixed 65k UTF‑16 char histogram for script entropy, cutting the script processor's average time per document from 172.8µs to 53.2µs and raising throughput from 2924 to 4873 eps in the warm run.
  • Skips truncated signature fragments before entropy is computed.
  • Adds powershell.file.script_block_entropy_normalized = entropy_bits / log2(script_block_length) (0–1).
  • Adds benchmark fixtures to track performance regressions during our research.
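
The histogram counting and the normalization listed above could be sketched roughly as follows. This is a minimal Java illustration, not the pipeline's actual script: the class and method names (ScriptEntropySketch, entropyBits, normalizedEntropy) are hypothetical, and it assumes counting is done per UTF-16 code unit via String.charAt.

```java
// Hypothetical sketch; names are illustrative and not taken from the PR.
class ScriptEntropySketch {
    private static final double INV_LOG2 = 1.0 / Math.log(2.0);

    /** Shannon entropy in bits per UTF-16 code unit, using a fixed 65,536-slot histogram. */
    static double entropyBits(String s) {
        int n = s.length();
        if (n == 0) {
            return 0.0;
        }
        int[] counts = new int[65536];      // one slot per possible UTF-16 code unit
        for (int i = 0; i < n; i++) {
            counts[s.charAt(i)]++;          // char widens to an int in 0..65535
        }
        double h = 0.0;
        for (int c : counts) {
            if (c == 0) {
                continue;                   // unseen code units contribute nothing
            }
            double p = (double) c / n;
            h -= p * Math.log(p) * INV_LOG2;
        }
        return h;
    }

    /** entropy_bits / log2(length), clamped to [0, 1]; 0 for strings of length <= 1. */
    static double normalizedEntropy(String s) {
        int n = s.length();
        if (n <= 1) {
            return 0.0;
        }
        double maxEntropy = Math.log((double) n) * INV_LOG2; // max bits if every character is unique
        double v = entropyBits(s) / maxEntropy;
        if (v < 0.0) {
            v = 0.0;
        } else if (v > 1.0) {
            v = 1.0;
        }
        return v;
    }
}
```

Indexing an int array by code unit avoids hashing and boxing on every character, which is one plausible reason the array-based version is faster than a HashMap; the sketch makes no claim about how the real script iterates or resets the histogram.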

Old pipeline:

image

Improved pipeline:

image
Complete benchmark output

Old:

PS C:\Users\Jonhnathan\Documents\Github\integrations\packages\windows> .\..\..\elastic-package.exe benchmark pipeline --data-streams powershell_operational --use-test-samples=false
Run pipeline benchmarks for the package
--- Benchmark results for package: windows - START ---
╭─────────────────────────╮
│ parameters              │
├──────────────────┬──────┤
│ source_doc_count │   11 │
│ doc_count        │ 2500 │
╰──────────────────┴──────╯
╭───────────────────────────╮
│ pipeline_performance      │
├─────────────────┬─────────┤
│ processing_time │   1.10s │
│ eps             │ 2278.94 │
╰─────────────────┴─────────╯
╭────────────────────────────────────────╮
│ procs_by_total_time                    │
├───────────────────────────────┬────────┤
│ script @ default.yml:322      │ 47.49% │
│ gsub @ default.yml:305        │ 30.36% │
│ fingerprint @ default.yml:311 │  3.19% │
│ set @ default.yml:60          │  2.10% │
│ script @ default.yml:13       │  1.82% │
│ gsub @ default.yml:316        │  1.09% │
│ script @ default.yml:30       │  1.00% │
│ remove @ default.yml:575      │  0.55% │
│ rename @ default.yml:290      │  0.18% │
│ trim @ default.yml:302        │  0.18% │
╰───────────────────────────────┴────────╯
╭─────────────────────────────────────────╮
│ procs_by_avg_time_per_doc               │
├───────────────────────────────┬─────────┤
│ script @ default.yml:322      │ 208.4µs │
│ gsub @ default.yml:305        │ 133.2µs │
│ fingerprint @ default.yml:311 │    14µs │
│ set @ default.yml:60          │   9.2µs │
│ script @ default.yml:13       │     8µs │
│ gsub @ default.yml:316        │   4.8µs │
│ script @ default.yml:30       │   4.4µs │
│ remove @ default.yml:575      │   2.4µs │
│ rename @ default.yml:290      │   800ns │
│ trim @ default.yml:302        │   800ns │
╰───────────────────────────────┴─────────╯

--- Benchmark results for package: windows - END   ---
Done
--- Benchmark results for package: windows - START ---
╭─────────────────────────╮
│ parameters              │
├──────────────────┬──────┤
│ source_doc_count │   11 │
│ doc_count        │ 2500 │
╰──────────────────┴──────╯
╭───────────────────────────╮
│ pipeline_performance      │
├─────────────────┬─────────┤
│ processing_time │   0.85s │
│ eps             │ 2923.98 │
╰─────────────────┴─────────╯
╭────────────────────────────────────────╮
│ procs_by_total_time                    │
├───────────────────────────────┬────────┤
│ script @ default.yml:322      │ 50.53% │
│ gsub @ default.yml:305        │ 34.15% │
│ fingerprint @ default.yml:311 │  2.57% │
│ gsub @ default.yml:316        │  1.17% │
│ script @ default.yml:13       │  0.70% │
│ set @ default.yml:60          │  0.58% │
│ remove @ default.yml:575      │  0.35% │
│ script @ default.yml:30       │  0.35% │
│ rename @ default.yml:290      │  0.12% │
╰───────────────────────────────┴────────╯
╭─────────────────────────────────────────╮
│ procs_by_avg_time_per_doc               │
├───────────────────────────────┬─────────┤
│ script @ default.yml:322      │ 172.8µs │
│ gsub @ default.yml:305        │ 116.8µs │
│ fingerprint @ default.yml:311 │   8.8µs │
│ gsub @ default.yml:316        │     4µs │
│ script @ default.yml:13       │   2.4µs │
│ set @ default.yml:60          │     2µs │
│ remove @ default.yml:575      │   1.2µs │
│ script @ default.yml:30       │   1.2µs │
│ rename @ default.yml:290      │   400ns │
╰───────────────────────────────┴─────────╯

--- Benchmark results for package: windows - END   ---
Done

Improved:

PS C:\Users\Jonhnathan\Documents\Github\integrations\packages\windows> .\..\..\elastic-package.exe benchmark pipeline --data-streams powershell_operational --use-test-samples=false
Run pipeline benchmarks for the package
--- Benchmark results for package: windows - START ---
╭─────────────────────────╮
│ parameters              │
├──────────────────┬──────┤
│ source_doc_count │   11 │
│ doc_count        │ 2500 │
╰──────────────────┴──────╯
╭───────────────────────────╮
│ pipeline_performance      │
├─────────────────┬─────────┤
│ processing_time │   0.51s │
│ eps             │ 4892.37 │
╰─────────────────┴─────────╯
╭────────────────────────────────────────╮
│ procs_by_total_time                    │
├───────────────────────────────┬────────┤
│ gsub @ default.yml:305        │ 55.19% │
│ script @ default.yml:322      │ 28.18% │
│ fingerprint @ default.yml:311 │  4.11% │
│ gsub @ default.yml:316        │  1.96% │
│ script @ default.yml:13       │  0.59% │
│ remove @ default.yml:657      │  0.39% │
│ rename @ default.yml:290      │  0.20% │
│ set @ default.yml:60          │  0.20% │
╰───────────────────────────────┴────────╯
╭─────────────────────────────────────────╮
│ procs_by_avg_time_per_doc               │
├───────────────────────────────┬─────────┤
│ gsub @ default.yml:305        │ 112.8µs │
│ script @ default.yml:322      │  57.6µs │
│ fingerprint @ default.yml:311 │   8.4µs │
│ gsub @ default.yml:316        │     4µs │
│ script @ default.yml:13       │   1.2µs │
│ remove @ default.yml:657      │   800ns │
│ rename @ default.yml:290      │   400ns │
│ set @ default.yml:60          │   400ns │
╰───────────────────────────────┴─────────╯

--- Benchmark results for package: windows - END   ---
Done
--- Benchmark results for package: windows - START ---
╭─────────────────────────╮
│ parameters              │
├──────────────────┬──────┤
│ source_doc_count │   11 │
│ doc_count        │ 2500 │
╰──────────────────┴──────╯
╭───────────────────────────╮
│ pipeline_performance      │
├─────────────────┬─────────┤
│ processing_time │   0.51s │
│ eps             │ 4873.29 │
╰─────────────────┴─────────╯
╭────────────────────────────────────────╮
│ procs_by_total_time                    │
├───────────────────────────────┬────────┤
│ gsub @ default.yml:305        │ 57.89% │
│ script @ default.yml:322      │ 25.93% │
│ fingerprint @ default.yml:311 │  3.51% │
│ gsub @ default.yml:316        │  1.95% │
│ script @ default.yml:13       │  0.78% │
│ remove @ default.yml:657      │  0.39% │
│ set @ default.yml:60          │  0.19% │
╰───────────────────────────────┴────────╯
╭─────────────────────────────────────────╮
│ procs_by_avg_time_per_doc               │
├───────────────────────────────┬─────────┤
│ gsub @ default.yml:305        │ 118.8µs │
│ script @ default.yml:322      │  53.2µs │
│ fingerprint @ default.yml:311 │   7.2µs │
│ gsub @ default.yml:316        │     4µs │
│ script @ default.yml:13       │   1.6µs │
│ remove @ default.yml:657      │   800ns │
│ set @ default.yml:60          │   400ns │
╰───────────────────────────────┴─────────╯

--- Benchmark results for package: windows - END   ---
Done

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

@w0rk3r w0rk3r self-assigned this Dec 26, 2025
@w0rk3r w0rk3r requested review from a team as code owners December 26, 2025 22:21
@w0rk3r w0rk3r added enhancement New feature or request Integration:windows Windows Team:Security-Windows Platform Security Windows Platform team [elastic/sec-windows-platform] labels Dec 26, 2025
@w0rk3r w0rk3r requested review from faec and mauri870 December 26, 2025 22:21
@elasticmachine

Pinging @elastic/sec-windows-platform (Team:Security-Windows Platform)

@pierrehilbert pierrehilbert added the Team:Elastic-Agent-Data-Plane Agent Data Plane team [elastic/elastic-agent-data-plane] label Jan 4, 2026
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@mauri870 mauri870 self-requested a review January 5, 2026 12:15
Member

@mauri870 mauri870 left a comment

LGTM, but I'm not very proficient with PowerShell. The code looks fine, but it needs a deeper look from the Windows team.

@andrewkroh andrewkroh added the documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. label Jan 8, 2026
double normalizedEntropy = 0.0;
if (length > 1) {
double maxEntropy = Math.log((double) length) * invLog2; // max bits if every character is unique
Contributor

I think the normalized entropy calculation looks good 👍

Few notes for posterity:

  • For the line double maxEntropy = Math.log((double) length) * invLog2; // max bits if every character is unique, I think it makes sense to use length here. Typical normalized entropy calculations (like that for R/Posterior ref) would use something akin to seenCount instead of length. However, those expect the input to behave like categories, where a and a are equivalent regardless of their position in the script block. In our case, I think we want the position to matter as well, so each value is by definition unique, making length the correct number to use here (as is correctly done in the code).
  • The pre-output check else if (normalizedEntropy > 1.0) normalizedEntropy = 1.0; is, I think, technically not necessary, as this should not occur. However, I think we should keep it, as it could catch floating-point rounding issues without impacting the integrity of the result (code is correct as is).
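
To make the difference between the two denominators concrete, here is a small Java sketch that normalizes the same entropy value by log2(length) and by log2(distinct symbols). The class and method names (NormalizationChoice, byLength, byDistinct) are hypothetical, and the distinct-symbol variant is only my reading of what "something akin to seenCount" would mean; it is not code from this PR.

```java
// Hypothetical comparison; not the pipeline's script.
class NormalizationChoice {
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Histogram of UTF-16 code units. */
    static int[] histogram(String s) {
        int[] counts = new int[65536];
        for (int i = 0; i < s.length(); i++) {
            counts[s.charAt(i)]++;
        }
        return counts;
    }

    /** Shannon entropy in bits per code unit. */
    static double entropyBits(String s) {
        int n = s.length();
        if (n == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : histogram(s)) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * log2(p);
            }
        }
        return h;
    }

    /** Clamp guards against floating-point results slightly outside [0, 1]. */
    static double clamp01(double v) {
        return v < 0.0 ? 0.0 : (v > 1.0 ? 1.0 : v);
    }

    /** Denominator is log2(length): only an all-unique string reaches 1. */
    static double byLength(String s) {
        return s.length() > 1 ? clamp01(entropyBits(s) / log2(s.length())) : 0.0;
    }

    /** Denominator is log2(distinct symbols): any uniform mix reaches 1. */
    static double byDistinct(String s) {
        int distinct = 0;
        for (int c : histogram(s)) {
            if (c > 0) {
                distinct++;
            }
        }
        return distinct > 1 ? clamp01(entropyBits(s) / log2(distinct)) : 0.0;
    }
}
```

For the input "aabb" the entropy is 1 bit, so byLength gives roughly 0.5 while byDistinct gives roughly 1.0. Under the length-based denominator, a script that alternates between two characters is not scored the same as one whose characters are all different.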

@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report, comment with /test benchmark fullreport

@elasticmachine

💚 Build Succeeded

History

cc @w0rk3r

Labels

documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. enhancement New feature or request Integration:windows Windows Team:Elastic-Agent-Data-Plane Agent Data Plane team [elastic/elastic-agent-data-plane] Team:Security-Windows Platform Security Windows Platform team [elastic/sec-windows-platform]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants