@dorien-koelemeijer
Collaborator

@dorien-koelemeijer dorien-koelemeijer commented Nov 7, 2025

Summary

This PR adds BERT-based prompt injection detection alongside the existing pattern-based approach.

Context

https://docs.google.com/document/d/1GNvriNWLAaJUMpWE1heBapxFshQEED5ZYfn1ixJvGCE/edit?tab=t.0#heading=h.rj4p6llxqvrq

Key Changes

  • Updated PromptInjectionScanner to support both pattern-based and ML-based scanning; the higher confidence score of the two methods determines whether content is treated as a threat. ML-based scanning also looks at recent user messages in addition to tool call content.
  • Added ClassificationClient, a generic HTTP client compatible with Hugging Face's text classification API. It supports both internal users (Gondola-hosted BERT models) and external/OSS users (direct endpoints, such as the Hugging Face API). The ML inference team is creating an additional wrapper endpoint that lets us use their BatchInfer API with the same inputs and outputs as the Hugging Face text classification API.
  • Updated UI settings so users can configure which models to use if they enable ML-based prompt injection detection. The variable names include BERT so that a follow-up PR can add LLM-as-a-judge prompt injection detection (re-using people's providers) without introducing confusing variable names.
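
As a rough illustration of the "highest confidence score wins" rule described above, here is a minimal sketch. The names `ScanVerdict` and `combine_scores` are hypothetical and not taken from this PR, and treating a missing ML score as 0.0 and comparing with `>=` against the threshold are assumptions.

```rust
/// Hypothetical result type for illustration; the scanner's real types are
/// not shown in this PR description.
#[derive(Debug, PartialEq)]
pub struct ScanVerdict {
    pub confidence: f32,
    pub is_threat: bool,
}

/// Combine the pattern-based score with an optional ML score by taking the
/// maximum of the two. `ml_score` is `None` when ML detection is disabled or
/// the classifier could not be reached (assumed behaviour).
pub fn combine_scores(pattern_score: f32, ml_score: Option<f32>, threshold: f32) -> ScanVerdict {
    let confidence = pattern_score.max(ml_score.unwrap_or(0.0));
    ScanVerdict {
        confidence,
        // Assumption: a score equal to the threshold counts as a threat.
        is_threat: confidence >= threshold,
    }
}
```

With a threshold of 0.7, `combine_scores(0.2, Some(0.9), 0.7)` would flag the content, while `combine_scores(0.2, None, 0.7)` would fall back to the pattern score alone and not flag it.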

Planned follow-up PRs / Related PRs

  • Add a bunch of variables to goose-releases (will open PR shortly)
  • Re-use the user's existing provider for ML-based prompt injection detection so that users do not have to set up BERT models themselves (mostly for OSS/external users)
  • Provide a reference implementation for users who want to run their BERT models locally (I've already prepared this, but I didn't want to make this PR huge, and I'm not sure where best to store this info in the repo)
  • Waiting for this PR to get merged: https://github.com/squareup/gondola/pull/1085

Type of Change

  • Feature
  • Bug fix
  • Refactor / Code quality
  • Performance improvement
  • Documentation
  • Tests
  • Security fix
  • Build / Release
  • Other (specify below)

Screenshots of changes

Internal:
Screenshot 2025-11-26 at 11 51 36 am

Screenshot 2025-11-26 at 11 54 04 am

External:
Screenshot 2025-11-26 at 11 57 58 am

Screenshot 2025-11-26 at 11 58 46 am

@dorien-koelemeijer dorien-koelemeijer requested a review from a team as a code owner November 7, 2025 01:21
Copilot AI review requested due to automatic review settings November 7, 2025 01:21
@dorien-koelemeijer dorien-koelemeijer marked this pull request as draft November 7, 2025 01:21
Contributor

Copilot AI left a comment

Pull Request Overview

This PR standardizes security configuration keys from snake_case to UPPER_CASE convention and introduces ML-based prompt injection detection using BERT models through a new Gondola provider. The changes enable more accurate security threat detection while maintaining backward compatibility through pattern-based scanning as a fallback.

Key changes:

  • Configuration key standardization: security_prompt_* → SECURITY_PROMPT_*
  • New ML detection infrastructure with Gondola provider for BERT-based model inference
  • Enhanced scanner to combine pattern-based and ML-based detection results

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `ui/desktop/src/utils/configUtils.ts` | Updated config labels to use new UPPER_CASE keys and added ML detection settings |
| `ui/desktop/src/components/settings/security/SecurityToggle.tsx` | Added ML detection toggle UI with model selection dropdown |
| `documentation/docs/guides/security/prompt-injection-detection.md` | Updated config examples to use new UPPER_CASE keys |
| `documentation/docs/guides/config-files.md` | Updated configuration reference table with new ML detection settings and standardized keys |
| `crates/goose/src/security/scanner.rs` | Refactored to support optional ML detection, enhanced conversation context scanning, and simplified tool content extraction |
| `crates/goose/src/security/prompt_ml_detector.rs` | New ML detector implementation with model registry and Gondola provider integration |
| `crates/goose/src/security/mod.rs` | Updated to initialize scanner with ML detection when enabled and handle fallback scenarios |
| `crates/goose/src/providers/mod.rs` | Added gondola module to provider list |
| `crates/goose/src/providers/gondola.rs` | New Gondola provider implementation for batch inference with BERT models |

@github-actions
Contributor

github-actions bot commented Nov 7, 2025

PR Preview Action v1.6.3
Preview removed because the pull request was closed.
2026-01-08 01:59 UTC

@dorien-koelemeijer dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from 1f5fd06 to 9fe268c November 7, 2025 01:29
@dorien-koelemeijer dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from 459eb56 to a0d6d5c November 7, 2025 06:07
@dorien-koelemeijer dorien-koelemeijer changed the title from "Add ML-based prompt injection detection using gondola-hosted BERT model" to "[WIP] Add ML-based prompt injection detection using gondola-hosted BERT model" Nov 11, 2025
@dorien-koelemeijer dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from eff6fbe to ba5feee November 11, 2025 05:34
@dorien-koelemeijer dorien-koelemeijer marked this pull request as ready for review November 11, 2025 05:35
Copilot AI review requested due to automatic review settings November 11, 2025 05:35
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

@block block deleted a comment from Copilot AI Nov 11, 2025
Copilot AI review requested due to automatic review settings November 11, 2025 05:57
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

@dorien-koelemeijer dorien-koelemeijer force-pushed the feat/ml-based-prompt-injection-detection branch from ef65a16 to fce20ae November 11, 2025 06:10

## Overview

Goose requires a classification endpoint that can analyze text and return a score indicating the likelihood of prompt injection. This API follows the **HuggingFace Inference API format** for text classification, making it compatible with HuggingFace Inference Endpoints to allow for easy usage in OSS goose.
Collaborator

yeah, I understand, but we should write the documentation for the OSS people. I would just drop anything after ", making"

@block block deleted a comment from Copilot AI Jan 5, 2026
Copilot AI review requested due to automatic review settings January 5, 2026 01:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment on lines 53 to 56
:::info Automatic Multi-Model Configuration
The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting.
:::

Copilot AI Jan 5, 2026

The info box about AutoPilot configuration appears to be incorrectly placed here, as it's unrelated to the security configuration settings being documented. This content should either be moved to the appropriate section about multi-model configuration or removed from this location.

Suggested change
:::info Automatic Multi-Model Configuration
The experimental [AutoPilot](/docs/guides/multi-model/autopilot) feature provides intelligent, context-aware model switching. Configure models for different roles using the `x-advanced-models` setting.
:::

Comment on lines +157 to +163
let max_confidence = stream::iter(user_messages)
    .map(|msg| async move { self.scan_with_classifier(&msg).await })
    .buffer_unordered(ML_SCAN_CONCURRENCY)
    .fold(0.0_f32, |acc, result| async move {
        result.unwrap_or(0.0).max(acc)
    })
    .await;
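
One detail of the fold in this excerpt is that a failed scan contributes 0.0, so errors never raise the resulting confidence. The synchronous sketch below mirrors only that reduction step; `max_confidence` as a free function and its generic error type are illustrative assumptions, not code from the PR.

```rust
/// Reduce a sequence of per-message scan results to the highest confidence.
/// A failed scan (`Err`) is treated as 0.0, mirroring `result.unwrap_or(0.0)`
/// in the excerpt above, so errors can never increase the final score.
pub fn max_confidence<E>(results: impl IntoIterator<Item = Result<f32, E>>) -> f32 {
    results
        .into_iter()
        .fold(0.0_f32, |acc, result| result.unwrap_or(0.0).max(acc))
}
```

An empty input yields 0.0, and a sequence in which every scan fails is indistinguishable from one in which every message scored 0.0.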

Copilot AI Jan 5, 2026

Scanning user messages concurrently could become a performance bottleneck. With ML_SCAN_CONCURRENCY set to 3, at most 3 HTTP requests are in flight at once, so scanning 10 user messages takes up to four waves of requests. If each request runs to the 5-second default timeout, this could add up to 20 seconds to tool execution in the worst case. Consider adding a circuit breaker, reducing the timeout for conversation scans specifically, or making the scan asynchronous to avoid blocking tool execution.
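
For the numbers quoted in this comment (10 messages, a concurrency of 3, a 5-second timeout), a simple upper bound on the added latency can be sketched as follows. The function name is hypothetical, and the bound assumes every request runs to the full timeout with no retries or connection overhead.

```rust
use std::time::Duration;

/// Upper bound on wall-clock time when `messages` requests run with at most
/// `concurrency` in flight and each request takes at most `timeout`.
/// Illustrative only: ignores retries, connection setup, and scheduling overhead.
pub fn worst_case_scan_time(messages: usize, concurrency: usize, timeout: Duration) -> Duration {
    assert!(concurrency > 0, "concurrency must be at least 1");
    // Number of "waves" needed: ceil(messages / concurrency).
    let waves = (messages + concurrency - 1) / concurrency;
    timeout * waves as u32
}
```

Under these assumptions, 10 messages at a concurrency of 3 need four waves, so a 5-second timeout gives a bound of 20 seconds.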

Copilot AI review requested due to automatic review settings January 5, 2026 02:17
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

@block block deleted a comment from Copilot AI Jan 5, 2026
@block block deleted a comment from Copilot AI Jan 5, 2026
@block block deleted a comment from Copilot AI Jan 5, 2026
| `otel_exporter_otlp_timeout` | Export timeout in milliseconds for [observability](/docs/guides/environment-variables#opentelemetry-protocol-otlp) | Integer (ms) | 10000 | No |
| `SECURITY_PROMPT_ENABLED` | Enable [prompt injection detection](/docs/guides/security/prompt-injection-detection) to identify potentially harmful commands | true/false | false | No |
| `SECURITY_PROMPT_THRESHOLD` | Sensitivity threshold for [prompt injection detection](/docs/guides/security/prompt-injection-detection) (higher = stricter) | Float between 0.01 and 1.0 | 0.7 | No |
| `SECURITY_PROMPT_ENABLED` | Enable prompt injection detection to identify potentially harmful commands | true/false | false | No |
Contributor

@dorien-koelemeijer Can you please revert the changes to this existing topic for now? Otherwise the docs will go out before this feature is available in a release.

We'll add them back in after the feature is released (except the note below because autopilot has been removed)

Collaborator Author

Thanks for the comments! Do you mean it's best to revert all changes to this file for now?

@@ -0,0 +1,89 @@
# Classification API Specification
Contributor

Suggested change
# Classification API Specification
---
title: Classification API Specification
unlisted: true
---

Copilot AI review requested due to automatic review settings January 6, 2026 22:50
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

unlisted: true
---

This document defines the API that Goose uses for ML-based prompt injection detection.

Copilot AI Jan 6, 2026

The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

Copilot generated this review using guidance from repository custom instructions.
Comment on lines +71 to +72
**Goose's Usage:**
- Goose looks for the label with the highest score

Copilot AI Jan 6, 2026

The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

Copilot generated this review using guidance from repository custom instructions.
**Goose's Usage:**
- Goose looks for the label with the highest score
- If the top label is "INJECTION" (or "LABEL_1"), the score is used as the injection confidence
- If the top label is "SAFE" (or "LABEL_0"), Goose uses `1.0 - score` as the injection confidence
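
The three rules quoted above can be sketched as a small function. The `LabelScore` struct and the function name are hypothetical, the response is assumed to have already been parsed into label/score pairs, and returning `None` for an unrecognised top label is an assumption rather than documented behaviour.

```rust
/// One entry of a text-classification response: a label and its score.
/// Hypothetical type for illustration; parsing the HTTP response is omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct LabelScore {
    pub label: String,
    pub score: f32,
}

/// Map a classification result to an injection confidence in [0, 1].
/// Returns `None` for an empty response or an unrecognised top label
/// (assumed handling; the spec excerpt does not cover these cases).
pub fn injection_confidence(results: &[LabelScore]) -> Option<f32> {
    // Pick the label with the highest score.
    let top = results
        .iter()
        .max_by(|a, b| a.score.total_cmp(&b.score))?;
    match top.label.as_str() {
        "INJECTION" | "LABEL_1" => Some(top.score),
        "SAFE" | "LABEL_0" => Some(1.0 - top.score),
        _ => None,
    }
}
```

For example, a response whose top entry is "SAFE" with score 0.9 maps to an injection confidence of roughly 0.1, while a top entry of "LABEL_1" with score 0.8 maps to 0.8.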

Copilot AI Jan 6, 2026

The reference to "Goose" should be lowercase "goose" according to the project naming convention documented in HOWTOAI.md.

Copilot generated this review using guidance from repository custom instructions.
Contributor

@dianed-square dianed-square left a comment

Thanks!

@dorien-koelemeijer dorien-koelemeijer merged commit 9dc548e into main Jan 8, 2026
26 checks passed
@dorien-koelemeijer dorien-koelemeijer deleted the feat/ml-based-prompt-injection-detection branch January 8, 2026 01:56
zanesq added a commit that referenced this pull request Jan 8, 2026
* 'main' of github.com:block/goose:
  Fixed fonts (#6389)
  Update confidence levels prompt injection detection to reduce false positive rates (#6390)
  Add ML-based prompt injection detection  (#5623)
  docs: update custom extensions tutorial (#6388)
  fix ResultsFormat error when loading old sessions (#6385)
  docs: add MCP Apps tutorial and documentation updates (#6384)
  changed z-index to make sure the search highlighter does not appear on modal overlay (#6386)
  Handling special claude model response in github copilot provider (#6369)
  fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378)
  fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372)
  feat(providers): add streaming support for Google Gemini provider (#6191)
  Blog: edit links in mcp apps post (#6371)
  fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374)
michaelneale added a commit that referenced this pull request Jan 8, 2026
* main: (31 commits)
  added validation and debug for invalid call tool result (#6368)
  Update MCP apps tutorial: fix _meta structure and version prereq (#6404)
  Fixed fonts (#6389)
  Update confidence levels prompt injection detection to reduce false positive rates (#6390)
  Add ML-based prompt injection detection  (#5623)
  docs: update custom extensions tutorial (#6388)
  fix ResultsFormat error when loading old sessions (#6385)
  docs: add MCP Apps tutorial and documentation updates (#6384)
  changed z-index to make sure the search highlighter does not appear on modal overlay (#6386)
  Handling special claude model response in github copilot provider (#6369)
  fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378)
  fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372)
  feat(providers): add streaming support for Google Gemini provider (#6191)
  Blog: edit links in mcp apps post (#6371)
  fix: prevent infinite loop of tool-input notifications in MCP Apps (#6374)
  fix: Show platform-specific keyboard shortcuts in UI (#6323)
  fix: we load extensions when agent starts so don't do it up front (#6350)
  docs: credit HumanLayer in RPI tutorial (#6365)
  Blog: Goose Lands MCP Apps (#6172)
  Claude 3.7 is out. we had some harcoded stuff (#6197)
  ...
wpfleger96 added a commit that referenced this pull request Jan 9, 2026
* main: (89 commits)
  fix(google): treat signed text as regular content in streaming (#6400)
  Add frameDomains and baseUriDomains CSP support for MCP Apps (#6399)
  fix(ci): add missing dependencies to openapi-schema-check job (#6367)
  feat: http proxy support
  Add support for changing working dir and extensions in same window/session (#6057)
  Sort keys in canonical models (#6403)
  added validation and debug for invalid call tool result (#6368)
  Update MCP apps tutorial: fix _meta structure and version prereq (#6404)
  Fixed fonts (#6389)
  Update confidence levels prompt injection detection to reduce false positive rates (#6390)
  Add ML-based prompt injection detection  (#5623)
  docs: update custom extensions tutorial (#6388)
  fix ResultsFormat error when loading old sessions (#6385)
  docs: add MCP Apps tutorial and documentation updates (#6384)
  changed z-index to make sure the search highlighter does not appear on modal overlay (#6386)
  Handling special claude model response in github copilot provider (#6369)
  fix: prevent duplicate rendering when tool returns both mcp-ui and mcp-apps resources (#6378)
  fix: update MCP Apps _meta.ui.resourceUri to use nested format (SEP-1865) (#6372)
  feat(providers): add streaming support for Google Gemini provider (#6191)
  Blog: edit links in mcp apps post (#6371)
  ...
fbalicchia pushed a commit to fbalicchia/goose that referenced this pull request Jan 13, 2026