Add ML-based prompt injection detection #5623

dorien-koelemeijer · 2025-11-07T01:21:09Z

Summary

This PR adds BERT model prompt injection evaluation alongside the existing pattern-based approach.

Context

https://docs.google.com/document/d/1GNvriNWLAaJUMpWE1heBapxFshQEED5ZYfn1ixJvGCE/edit?tab=t.0#heading=h.rj4p6llxqvrq

Key Changes

Updated PromptInjectionScanner to support both pattern-based + ML-based scanning, using the highest confidence score from either method to determine threats. ML-based scanning also looks at recent user messages in addition to tool call content.
Added ClassificationClient - Generic HTTP client that is compatible with HuggingFace's text classification API, supporting both internal users (Gondola hosted BERT models) and external/oss users (direct endpoints, such as using the hugging face API). The ML inference team is creating an additional wrapper endpoint that allows us to use their BatchInfer API with the same inputs and outputs as the Hugging Face text classification API.
Updated UI settings to allow users to configure what models to use if they decide to enable ML-based prompt injection detection. Have included BERT in the name of the variables so we can create a follow-up PR to also allow LLM-as-a-judge type prompt injection detection by re-using people's providers, and hopefully prevent confusing names of variables.

Planned follow-up PRs / Related PRs

Add a bunch of variables to goose-releases (will open PR shortly)
Re-use provider to do ML-based prompt injection detection rather than users requiring setup of the BERT models (mostly for oss/external users)
Provide reference implementation for users who want to run their BERT models locally (I've already prepared this but didn't want to make this PR super huge + I am not sure where best to store this info in the repo)
Waiting for this PR to get merged: https://github.com/squareup/gondola/pull/1085

Type of Change

Screenshots of changes

Internal:

External:

Copilot

Pull Request Overview

This PR standardizes security configuration keys from snake_case to UPPER_CASE convention and introduces ML-based prompt injection detection using BERT models through a new Gondola provider. The changes enable more accurate security threat detection while maintaining backward compatibility through pattern-based scanning as a fallback.

Key changes:

Configuration key standardization: security_prompt_* → SECURITY_PROMPT_*
New ML detection infrastructure with Gondola provider for BERT-based model inference
Enhanced scanner to combine pattern-based and ML-based detection results

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
ui/desktop/src/utils/configUtils.ts	Updated config labels to use new UPPER_CASE keys and added ML detection settings
ui/desktop/src/components/settings/security/SecurityToggle.tsx	Added ML detection toggle UI with model selection dropdown
documentation/docs/guides/security/prompt-injection-detection.md	Updated config examples to use new UPPER_CASE keys
documentation/docs/guides/config-files.md	Updated configuration reference table with new ML detection settings and standardized keys
crates/goose/src/security/scanner.rs	Refactored to support optional ML detection, enhanced conversation context scanning, and simplified tool content extraction
crates/goose/src/security/prompt_ml_detector.rs	New ML detector implementation with model registry and Gondola provider integration
crates/goose/src/security/mod.rs	Updated to initialize scanner with ML detection when enabled and handle fallback scenarios
crates/goose/src/providers/mod.rs	Added gondola module to provider list
crates/goose/src/providers/gondola.rs	New Gondola provider implementation for batch inference with BERT models

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

documentation/docs/guides/config-files.md

crates/goose/src/security/scanner.rs

github-actions · 2025-11-07T01:24:07Z

PR Preview Action v1.6.3
Preview removed because the pull request was closed.
2026-01-08 01:59 UTC

Copilot

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

ui/desktop/src/components/settings/security/SecurityToggle.tsx

crates/goose/src/security/scanner.rs

crates/goose/src/security/prompt_ml_detector.rs

ui/desktop/src/components/settings/security/SecurityToggle.tsx

Copilot

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

ui/desktop/src/components/settings/security/SecurityToggle.tsx

crates/goose/src/providers/gondola.rs

crates/goose/src/security/scanner.rs

ui/desktop/src/components/settings/security/SecurityToggle.tsx

DOsinga · 2025-12-16T00:51:35Z

documentation/docs/guides/security/classification-api-spec.md

+
+## Overview
+
+Goose requires a classification endpoint that can analyze text and return a score indicating the likelihood of prompt injection. This API follows the **HuggingFace Inference API format** for text classification, making it compatible with HuggingFace Inference Endpoints to allow for easy usage in OSS goose. 


yeah, I understand, but we should write the documentation for the OSS people. I would just drop anything after ", making"

crates/goose/src/security/scanner.rs

Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Copilot · 2026-01-05T01:59:44Z

documentation/docs/guides/config-files.md

+:::info Automatic Multi-Model Configuration
+The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting.
+:::



The info box about AutoPilot configuration appears to be incorrectly placed here, as it's unrelated to the security configuration settings being documented. This content should either be moved to the appropriate section about multi-model configuration or removed from this location.

Suggested change

:::info Automatic Multi-Model Configuration

The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting.

:::

Copilot · 2026-01-05T01:59:45Z

crates/goose/src/security/scanner.rs

+        let max_confidence = stream::iter(user_messages)
+            .map(|msg| async move { self.scan_with_classifier(&msg).await })
+            .buffer_unordered(ML_SCAN_CONCURRENCY)
+            .fold(0.0_f32, |acc, result| async move {
+                result.unwrap_or(0.0).max(acc)
+            })
+            .await;


The concurrent message scanning could create a performance bottleneck when processing 10 user messages with concurrent HTTP requests. The ML_SCAN_CONCURRENCY of 3 means up to 3 simultaneous HTTP requests, but if each request takes 5 seconds (the default timeout), this could add up to 17 seconds in the worst case for tool execution. Consider adding a circuit breaker or reducing the timeout for conversation scans specifically, or making the scan asynchronous to avoid blocking tool execution.

…that is evaluated for potential prompt injection

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

dianed-square · 2026-01-06T01:11:28Z

documentation/docs/guides/config-files.md

 | `otel_exporter_otlp_timeout` | Export timeout in milliseconds for [observability](/docs/guides/environment-variables#opentelemetry-protocol-otlp) | Integer (ms) | 10000 | No |
-| `SECURITY_PROMPT_ENABLED` | Enable [prompt injection detection](/docs/guides/security/prompt-injection-detection) to identify potentially harmful commands | true/false | false | No |
-| `SECURITY_PROMPT_THRESHOLD` | Sensitivity threshold for [prompt injection detection](/docs/guides/security/prompt-injection-detection) (higher = stricter) | Float between 0.01 and 1.0 | 0.7 | No |
+| `SECURITY_PROMPT_ENABLED` | Enable prompt injection detection to identify potentially harmful commands | true/false | false | No |


@dorien-koelemeijer Can you please revert the changes to this existing topic for now? Otherwise the docs will go out before this feature is available in a release.

We'll add them back in after the feature is released (except the note below because autopilot has been removed)

Thanks for the comments! Do you mean it's best to revert all changes to this file for now?

dianed-square · 2026-01-06T01:14:05Z

documentation/docs/guides/security/classification-api-spec.md

@@ -0,0 +1,89 @@
+# Classification API Specification


Suggested change

# Classification API Specification

---

title: Classification API Specification

unlisted: true

---

Copilot

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Copilot · 2026-01-06T22:54:30Z

documentation/docs/guides/security/classification-api-spec.md

+unlisted: true
+---
+
+This document defines the API that Goose uses for ML-based prompt injection detection.


The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

Copilot · 2026-01-06T22:54:30Z

documentation/docs/guides/security/classification-api-spec.md

+**Goose's Usage:**
+- Goose looks for the label with the highest score


The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

Copilot · 2026-01-06T22:54:30Z

documentation/docs/guides/security/classification-api-spec.md

+**Goose's Usage:**
+- Goose looks for the label with the highest score
+- If the top label is "INJECTION" (or "LABEL_1"), the score is used as the injection confidence
+- If the top label is "SAFE" (or "LABEL_0"), Goose uses `1.0 - score` as the injection confidence


The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

dianed-square

Thanks!

* 'main' of github.com:block/goose: Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374)

* main: (31 commits) added validation and debug for invalid call tool result (#6368) Update MCP apps tutorial: fix _meta structure and version prereq (#6404) Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374) fix: Show platform-specific keyboard shortcuts in UI (#6323) fix: we load extensions when agent starts so don't do it up front (#6350) docs: credit HumanLayer in RPI tutorial (#6365) Blog: Goose Lands MCP Apps (#6172) Claude 3.7 is out. we had some harcoded stuff (#6197) ...

* main: (89 commits) fix(google): treat signed text as regular content in streaming (#6400) Add frameDomains and baseUriDomains CSP support for MCP Apps (#6399) fix(ci): add missing dependencies to openapi-schema-check job (#6367) feat: http proxy support Add support for changing working dir and extensions in same window/session (#6057) Sort keys in canonical models (#6403) added validation and debug for invalid call tool result (#6368) Update MCP apps tutorial: fix _meta structure and version prereq (#6404) Fixed fonts (#6389) Update confidence levels prompt injection detection to reduce false positive rates (#6390) Add ML-based prompt injection detection (#5623) docs: update custom extensions tutorial (#6388) fix ResultsFormat error when loading old sessions (#6385) docs: add MCP Apps tutorial and documentation updates (#6384) changed z-index to make sure the search highlighter does not appear on modal overlay (#6386) Handling special claude model response in github copilot provider (#6369) fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378) fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372) feat(providers): add streaming support for Google Gemini provider (#6191) Blog: edit links in mcp apps post (#6371) ...

dorien-koelemeijer requested a review from a team as a code owner November 7, 2025 01:21

Copilot AI review requested due to automatic review settings November 7, 2025 01:21

dorien-koelemeijer marked this pull request as draft November 7, 2025 01:21

Copilot AI reviewed Nov 7, 2025

View reviewed changes

documentation/docs/guides/config-files.md Outdated Show resolved Hide resolved

crates/goose/src/security/scanner.rs Outdated Show resolved Hide resolved

crates/goose/src/security/scanner.rs Outdated Show resolved Hide resolved

Add ML-based prompt injection detection using gondola-hosted BERT model

9fe268c

dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from 1f5fd06 to 9fe268c Compare November 7, 2025 01:29

cleanup

a0d6d5c

dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from 459eb56 to a0d6d5c Compare November 7, 2025 06:07

dorien-koelemeijer added 5 commits November 10, 2025 12:56

fix

cee8517

fix

62c2017

model config update

717b26d

fix

3582634

fix

137981f

dorien-koelemeijer changed the title ~~Add ML-based prompt injection detection using gondola-hosted BERT model [WIP]~~ Add ML-based prompt injection detection using gondola-hosted BERT model Nov 11, 2025

cleanup

ba5feee

dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from eff6fbe to ba5feee Compare November 11, 2025 05:34

dorien-koelemeijer marked this pull request as ready for review November 11, 2025 05:35

Copilot AI review requested due to automatic review settings November 11, 2025 05:35

Copilot started reviewing on behalf of dorien-koelemeijer November 11, 2025 05:36 View session

Copilot finished reviewing on behalf of dorien-koelemeijer November 11, 2025 05:41

Copilot AI reviewed Nov 11, 2025

View reviewed changes

fmt

a5ee45c

block deleted a comment from Copilot AI Nov 11, 2025

Copilot AI review requested due to automatic review settings November 11, 2025 05:57

Copilot started reviewing on behalf of dorien-koelemeijer November 11, 2025 05:58 View session

Copilot finished reviewing on behalf of dorien-koelemeijer November 11, 2025 06:00

Copilot AI reviewed Nov 11, 2025

View reviewed changes

ui/desktop/src/components/settings/security/SecurityToggle.tsx Outdated Show resolved Hide resolved

crates/goose/src/providers/gondola.rs Outdated Show resolved Hide resolved

crates/goose/src/security/scanner.rs Show resolved Hide resolved

fix

fce20ae

dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from ef65a16 to fce20ae Compare November 11, 2025 06:10

DOsinga approved these changes Dec 16, 2025

View reviewed changes

block deleted a comment from Copilot AI Jan 5, 2026

Remove unnecessary console.warn in securityToggle UI settings

19c49ba

Copilot AI review requested due to automatic review settings January 5, 2026 01:55

Copilot started reviewing on behalf of dorien-koelemeijer January 5, 2026 01:56 View session

Copilot AI reviewed Jan 5, 2026

View reviewed changes

dorien-koelemeijer added 2 commits January 5, 2026 12:10

Use 'effective_role' to exclude tool calls from conversation context …

1019506

…that is evaluated for potential prompt injection

minor update documentation/guide for classification api spec

04dad80

Copilot AI review requested due to automatic review settings January 5, 2026 02:17

Copilot started reviewing on behalf of dorien-koelemeijer January 5, 2026 02:18 View session

Copilot AI reviewed Jan 5, 2026

View reviewed changes

block deleted a comment from Copilot AI Jan 5, 2026

dianed-square reviewed Jan 6, 2026

View reviewed changes

dorien-koelemeijer added 2 commits January 7, 2026 08:34

Update to documentation

8af623a

fix

2e39959

Copilot AI review requested due to automatic review settings January 6, 2026 22:50

Copilot started reviewing on behalf of dorien-koelemeijer January 6, 2026 22:51 View session

Copilot AI reviewed Jan 6, 2026

View reviewed changes

dianed-square approved these changes Jan 7, 2026

View reviewed changes

dorien-koelemeijer merged commit 9dc548e into main Jan 8, 2026
26 checks passed

dorien-koelemeijer deleted the feat/ml-based-prompt-injection-detection branch January 8, 2026 01:56

github-actions bot mentioned this pull request Jan 13, 2026

chore(release): release version 1.20.0 (minor) #6457

Merged

fbalicchia pushed a commit to fbalicchia/goose that referenced this pull request Jan 13, 2026

Add ML-based prompt injection detection (block#5623)

31dc8a0


		## Overview

		Goose requires a classification endpoint that can analyze text and return a score indicating the likelihood of prompt injection. This API follows the HuggingFace Inference API format for text classification, making it compatible with HuggingFace Inference Endpoints to allow for easy usage in OSS goose.

	:::info Automatic Multi-Model Configuration
	The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting.
	:::

		Goose's Usage:
		- Goose looks for the label with the highest score

Add ML-based prompt injection detection #5623

Add ML-based prompt injection detection #5623

Conversation

dorien-koelemeijer commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Context

Key Changes

Planned follow-up PRs / Related PRs

Type of Change

Screenshots of changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Nov 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

DOsinga Dec 16, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

dianed-square Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

dorien-koelemeijer Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

dianed-square Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

Uh oh!

dorien-koelemeijer commented Nov 7, 2025 •

edited

Loading

github-actions bot commented Nov 7, 2025 •

edited

Loading