explore-analyze/ai-features.md (2 changes: 1 addition & 1 deletion)

@@ -102,7 +102,7 @@ The [Model Context Protocol (MCP)](/solutions/search/mcp.md) lets you connect AI
* [Partitioning](/solutions/observability/streams/management/partitioning.md): Use AI to suggest logical groupings and child streams based on your data when using wired streams.
* [Advanced settings](/solutions/observability/streams/management/advanced.md): Use AI to generate a [stream description](/solutions/observability/streams/management/advanced.md#streams-advanced-description) and a [feature identification](/solutions/observability/streams/management/advanced.md#streams-advanced-features) that other AI features, like significant events, use when generating suggestions.

-## AI-powered features in {{elastic-sec}}
+## AI-powered features in {{elastic-sec}} [security-features]

{{elastic-sec}}'s AI-powered features all require an [LLM connector](/explore-analyze/ai-features/llm-guides/llm-connectors.md). When you use one of these features, you can select any LLM connector that's configured in your environment. The connector you select for one feature does not affect which connector any other feature uses. For specific configuration instructions, refer to each feature's documentation.
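As a rough illustration of the paragraph above, connectors can also be created programmatically through Kibana's connector HTTP API (`POST /api/actions/connector`). The sketch below is a minimal, hypothetical example: the host, credentials, and config values are placeholders, and exact config fields vary by provider and Kibana version, so treat it as a sketch rather than a definitive recipe.

```python
# Hypothetical sketch: create an OpenAI LLM connector via Kibana's
# connector API so it becomes selectable by the AI features above.
# Endpoint and payload shape follow Kibana's public actions/connector
# API; all concrete values here are placeholders.
import requests

KIBANA_URL = "https://my-kibana.example.com:5601"  # placeholder host
API_KEY = "REDACTED"                               # placeholder credential

response = requests.post(
    f"{KIBANA_URL}/api/actions/connector",
    headers={
        "kbn-xsrf": "true",                  # required by Kibana's HTTP API
        "Authorization": f"ApiKey {API_KEY}",
        "Content-Type": "application/json",
    },
    json={
        "name": "my-openai-connector",
        "connector_type_id": ".gen-ai",      # OpenAI connector type
        "config": {
            "apiProvider": "OpenAI",
            "apiUrl": "https://api.openai.com/v1/chat/completions",
        },
        "secrets": {"apiKey": "<openai-api-key>"},  # placeholder secret
    },
)
response.raise_for_status()
print(response.json()["id"])  # connector id, selectable by any AI feature
```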

solutions/security/ai/large-language-model-performance-matrix.md (41 changes: 21 additions & 20 deletions)

@@ -13,37 +13,38 @@

# Large language model performance matrix

-This page describes the performance of various large language models (LLMs) for different use cases in {{elastic-sec}}, based on our internal testing. To learn more about these use cases, refer to [Attack discovery](/solutions/security/ai/attack-discovery.md) or [AI Assistant](/solutions/security/ai/ai-assistant.md).
+This page describes the performance of various large language models (LLMs) for different use cases in {{elastic-sec}}, based on our internal testing. To learn more about these use cases, refer to [AI-powered features](/explore-analyze/ai-features.md#security-features).

::::{important}
-`Excellent` is the best rating, followed by `Great`, then by `Good`, and finally by `Poor`. Models rated `Excellent` or `Great` should produce quality results. Models rated `Good` or `Poor` are not recommended for that use case.
+Higher scores indicate better performance. A score of 100 on a task means the model met or exceeded all task-specific benchmarks.
+
+Models with a score of "Not recommended" failed testing. This could be due to various issues, including context window constraints.
::::
> **Review comment (Contributor):** It could be helpful to include a brief explanation of how to interpret the average score. Maybe something general like "models that score above [this threshold] might provide better performance for AI-powered features. We don't recommend using models that score below [this threshold], as they won't perform as well."
>
> **Reply (Author):** I'll ask the product team if we can provide some more guidance on this. Thank you for the idea.

## Proprietary models [_proprietary_models]

Models from third-party LLM providers.

-| **Feature** | - | **Assistant - General** | **Assistant - {{esql}} generation** | **Assistant - Alert questions** | **Assistant - Knowledge retrieval** | **Attack Discovery** | **Automatic Migration** |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| **Model** | **Claude Opus 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
-| | **Claude Sonnet 4** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
-| | **Claude Sonnet 3.7** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
-| | **GPT-4.1** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
-| | **Gemini 2.0 Flash 001** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |
-| | **Gemini 2.5 Pro** | Excellent | Excellent | Excellent | Excellent | Excellent | Excellent |

+| **Model** | **Alerts** | **{{esql}} Query Generation** | **Knowledge Base Retrieval** | **Attack Discovery** | **General Security** | **Automatic Migration** | **Average Score** |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| **GPT 5 Chat** | 91 | 92 | 100 | 85 | 92 | 99 | **93** |

Check notice on line 31 in solutions/security/ai/large-language-model-performance-matrix.md

View workflow job for this annotation

GitHub Actions / preview / vale

Elastic.Acronyms: 'GPT' has no definition.
+| **Sonnet 4.5** | 90 | 90 | 100 | 80 | 90 | 100 | **92** |
+| **GPT 5.1** | 93 | 95 | 100 | 95 | 65 | 98 | **91** |
+| **Sonnet 3.7** | 89 | 90 | 100 | 70 | 90 | 97 | **89** |
+| **Elastic Managed LLM** | 89 | 90 | 100 | 70 | 90 | 97 | **89** |
+| **Opus 4.5** | 86 | 86 | 100 | 85 | 90 | 73 | **87** |
+| **Gemini 2.5 Pro** | 89 | 86 | 100 | 87 | 90 | 63 | **86** |
+| **Opus 4.1** | 92 | 93 | 100 | 70 | 90 | 70 | **86** |
+| **Sonnet 4** | 89 | 92 | 100 | 70 | 88 | 75 | **86** |
+| **GPT 4.1** | 87 | 88 | 100 | 80 | 88 | 31 | **79** |
+| **Gemini 2.5 Flash** | 87 | 90 | Not recommended | Not recommended | 90 | Not recommended | **45** |
+| **Haiku 4.5** | 84 | 80 | Not recommended | Not recommended | 88 | Not recommended | **42** |
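
The new table introduces an **Average Score** column without stating how it is derived, and the review thread above asks for exactly this kind of interpretation guidance. The published rows are consistent with a plain mean of the six task scores, rounded half up, in which "Not recommended" counts as 0. The sketch below is our inference from the numbers, not a documented formula:

```python
import math

# Inferred (not documented): the Average Score column matches a plain mean
# of the six task scores, rounded half up, with "Not recommended" scored as 0.
def average_score(scores: list) -> int:
    numeric = [s if isinstance(s, int) else 0 for s in scores]  # "Not recommended" -> 0
    return math.floor(sum(numeric) / len(numeric) + 0.5)        # round half up

# Sanity checks against rows in the table above:
assert average_score([91, 92, 100, 85, 92, 99]) == 93           # GPT 5 Chat
assert average_score([87, 90, "Not recommended", "Not recommended",
                      90, "Not recommended"]) == 45             # Gemini 2.5 Flash
```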

## Open-source models [_open_source_models]

Models you can [deploy yourself](/explore-analyze/ai-features/llm-guides/local-llms-overview.md).

-| **Feature** | - | **Assistant - General** | **Assistant - {{esql}} generation** | **Assistant - Alert questions** | **Assistant - Knowledge retrieval** | **Attack Discovery** | **Automatic Migration** |
-| --- | --- | --- | --- | --- | --- | --- | --- |
-| **Model** | **Mistral-Small-3.2-24B-Instruct-2506** | Excellent | Good | Excellent | Excellent | Good | N/A |
-| | **Mistral-Small-3.1-24B-Instruct-2503** | Excellent | Good | Excellent | Excellent | Good | N/A |
-| | **Mistral Nemo** | Good | Good | Great | Good | Poor | Poor |
-| | **LLama 3.2** | Good | Poor | Good | Poor | Poor | Good |
-| | **LLama 3.1 405b** | Good | Great | Good | Good | Poor | Poor |
-| | **LLama 3.1 70b** | Good | Good | Poor | Poor | Poor | Good |
+| **Model** | **Alerts** | **{{esql}} Query Generation** | **Knowledge Base Retrieval** | **Attack Discovery** | **General Security** | **Automatic Migration** | **Average Score** |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| **GPT OSS 20b** | 82 | 25 | Not recommended | Not recommended | 10 | Not recommended | **20** |

Check notice on line 50 in solutions/security/ai/large-language-model-performance-matrix.md

View workflow job for this annotation

GitHub Actions / preview / vale

Elastic.Acronyms: 'GPT' has no definition.
Loading