-
Notifications
You must be signed in to change notification settings - Fork 2.5k
Add ML-based prompt injection detection #5623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ML-based prompt injection detection #5623
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR standardizes security configuration keys from snake_case to UPPER_CASE convention and introduces ML-based prompt injection detection using BERT models through a new Gondola provider. The changes enable more accurate security threat detection while maintaining backward compatibility through pattern-based scanning as a fallback.
Key changes:
- Configuration key standardization:
security_prompt_*→SECURITY_PROMPT_* - New ML detection infrastructure with Gondola provider for BERT-based model inference
- Enhanced scanner to combine pattern-based and ML-based detection results
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| ui/desktop/src/utils/configUtils.ts | Updated config labels to use new UPPER_CASE keys and added ML detection settings |
| ui/desktop/src/components/settings/security/SecurityToggle.tsx | Added ML detection toggle UI with model selection dropdown |
| documentation/docs/guides/security/prompt-injection-detection.md | Updated config examples to use new UPPER_CASE keys |
| documentation/docs/guides/config-files.md | Updated configuration reference table with new ML detection settings and standardized keys |
| crates/goose/src/security/scanner.rs | Refactored to support optional ML detection, enhanced conversation context scanning, and simplified tool content extraction |
| crates/goose/src/security/prompt_ml_detector.rs | New ML detector implementation with model registry and Gondola provider integration |
| crates/goose/src/security/mod.rs | Updated to initialize scanner with ML detection when enabled and handle fallback scenarios |
| crates/goose/src/providers/mod.rs | Added gondola module to provider list |
| crates/goose/src/providers/gondola.rs | New Gondola provider implementation for batch inference with BERT models |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
1f5fd06 to
9fe268c
Compare
459eb56 to
a0d6d5c
Compare
eff6fbe to
ba5feee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
ef65a16 to
fce20ae
Compare
|
|
||
| ## Overview | ||
|
|
||
| Goose requires a classification endpoint that can analyze text and return a score indicating the likelihood of prompt injection. This API follows the **HuggingFace Inference API format** for text classification, making it compatible with HuggingFace Inference Endpoints to allow for easy usage in OSS goose. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I understand, but we should write the documentation for the OSS people. I would just drop anything after ", making"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| :::info Automatic Multi-Model Configuration | ||
| The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting. | ||
| ::: | ||
|
|
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The info box about AutoPilot configuration appears to be incorrectly placed here, as it's unrelated to the security configuration settings being documented. This content should either be moved to the appropriate section about multi-model configuration or removed from this location.
| :::info Automatic Multi-Model Configuration | |
| The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting. | |
| ::: |
| let max_confidence = stream::iter(user_messages) | ||
| .map(|msg| async move { self.scan_with_classifier(&msg).await }) | ||
| .buffer_unordered(ML_SCAN_CONCURRENCY) | ||
| .fold(0.0_f32, |acc, result| async move { | ||
| result.unwrap_or(0.0).max(acc) | ||
| }) | ||
| .await; |
Copilot
AI
Jan 5, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The concurrent message scanning could create a performance bottleneck when processing 10 user messages with concurrent HTTP requests. The ML_SCAN_CONCURRENCY of 3 means up to 3 simultaneous HTTP requests, but if each request takes 5 seconds (the default timeout), this could add up to 17 seconds in the worst case for tool execution. Consider adding a circuit breaker or reducing the timeout for conversation scans specifically, or making the scan asynchronous to avoid blocking tool execution.
…that is evaluated for potential prompt injection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| | `otel_exporter_otlp_timeout` | Export timeout in milliseconds for [observability](/docs/guides/environment-variables#opentelemetry-protocol-otlp) | Integer (ms) | 10000 | No | | ||
| | `SECURITY_PROMPT_ENABLED` | Enable [prompt injection detection](/docs/guides/security/prompt-injection-detection) to identify potentially harmful commands | true/false | false | No | | ||
| | `SECURITY_PROMPT_THRESHOLD` | Sensitivity threshold for [prompt injection detection](/docs/guides/security/prompt-injection-detection) (higher = stricter) | Float between 0.01 and 1.0 | 0.7 | No | | ||
| | `SECURITY_PROMPT_ENABLED` | Enable prompt injection detection to identify potentially harmful commands | true/false | false | No | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dorien-koelemeijer Can you please revert the changes to this existing topic for now? Otherwise the docs will go out before this feature is available in a release.
We'll add them back in after the feature is released (except the note below because autopilot has been removed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments! Do you mean it's best to revert all changes to this file for now?
| @@ -0,0 +1,89 @@ | |||
| # Classification API Specification | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # Classification API Specification | |
| --- | |
| title: Classification API Specification | |
| unlisted: true | |
| --- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
| unlisted: true | ||
| --- | ||
|
|
||
| This document defines the API that Goose uses for ML-based prompt injection detection. |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.
| **Goose's Usage:** | ||
| - Goose looks for the label with the highest score |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.
| **Goose's Usage:** | ||
| - Goose looks for the label with the highest score | ||
| - If the top label is "INJECTION" (or "LABEL_1"), the score is used as the injection confidence | ||
| - If the top label is "SAFE" (or "LABEL_0"), Goose uses `1.0 - score` as the injection confidence |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.
dianed-square
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
* 'main' of github.com:block/goose: Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374)
* main: (31 commits) added validation and debug for invalid call tool result (#6368) Update MCP apps tutorial: fix _meta structure and version prereq (#6404) Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374) fix: Show platform-specific keyboard shortcuts in UI (#6323) fix: we load extensions when agent starts so don't do it up front (#6350) docs: credit HumanLayer in RPI tutorial (#6365) Blog: Goose Lands MCP Apps (#6172) Claude 3.7 is out. we had some harcoded stuff (#6197) ...
* main: (89 commits) fix(google): treat signed text as regular content in streaming (#6400) Add frameDomains and baseUriDomains CSP support for MCP Apps (#6399) fix(ci): add missing dependencies to openapi-schema-check job (#6367) feat: http proxy support Add support for changing working dir and extensions in same window/session (#6057) Sort keys in canonical models (#6403) added validation and debug for invalid call tool result (#6368) Update MCP apps tutorial: fix _meta structure and version prereq (#6404) Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) ...
* main: (89 commits) fix(google): treat signed text as regular content in streaming (#6400) Add frameDomains and baseUriDomains CSP support for MCP Apps (#6399) fix(ci): add missing dependencies to openapi-schema-check job (#6367) feat: http proxy support Add support for changing working dir and extensions in same window/session (#6057) Sort keys in canonical models (#6403) added validation and debug for invalid call tool result (#6368) Update MCP apps tutorial: fix _meta structure and version prereq (#6404) Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) ...
Summary
This PR adds BERT model prompt injection evaluation alongside the existing pattern-based approach.
Context
https://docs.google.com/document/d/1GNvriNWLAaJUMpWE1heBapxFshQEED5ZYfn1ixJvGCE/edit?tab=t.0#heading=h.rj4p6llxqvrq
Key Changes
BERTin the name of the variables so we can create a follow-up PR to also allow LLM-as-a-judge type prompt injection detection by re-using people's providers, and hopefully prevent confusing names of variables.Planned follow-up PRs / Related PRs
Type of Change
Screenshots of changes
Internal:

External:
