feat: Prompt injection detection #4021

dorien-koelemeijer · 2025-08-12T01:16:28Z

Overview

This PR implements prompt injection detection, providing users with security warnings when potentially malicious tool calls are detected. The way prompt injection is implemented re-uses as much of the existing infrastructure as possible.

Working design doc with requirements

These notes are a result of pairing sessions with Douwe: https://docs.google.com/document/d/1OWh7Ab_eu8STaoplPA6AKF6AnXQNqXc8JYiAwmUYDTs/edit?tab=t.0

Implementation approach

Added security scanning at the check_tool_permissions() stage - a natural choke point in the agent workflow.
Security scanning now joins existing permission checking and tool monitoring in one consolidated location
Security scanner receives both proposed tool calls and full conversation history for comprehensive analysis
Leveraged existing ToolCall confirmation workflow instead of building custom UI components + security decisions flow through existing approval/denial logging infrastructure
Users define what models they want to use in the Goose config file - currently we support ONNX models, but in the future we want to expand this scope. The model that is chosen as the default (see Examples below) can be downloaded from HuggingFace directly in ONNX format.

Key changes

agent.rs: Added apply_security_results_to_permissions() to move security-flagged tools from approved to needs_approval
tool_execution.rs: Enhanced handle_approval_tool_requests_with_security() to include security warnings in confirmation prompts
scanner.rs: Simplified security messages for user-friendly explanations
ToolCallConfirmation.tsx: Added support for custom security warning prompts
GooseMessage.tsx: Updated to pass security context from backend to UI components

Examples

Goose config file

opt-in security scanning
model selection and configuration

security:
  enabled: true
  models:
  - model: protectai/deberta-v3-base-prompt-injection-v2
    threshold: 0.7
    weight: 2.0

Examples of goose sessions (same tool call but different outcomes based on what user has provided in their message(s))

Thanks to Michael Rand for the test below:

Testing instructions

Add this to the goose config file:

security:
  enabled: true
  models:
  - model: protectai/deberta-v3-base-prompt-injection-v2
    threshold: 0.7
    weight: 2.0

or this:

security:
  enabled: true
  models:
  - model: meta-llama/Llama-Prompt-Guard-2-86M
    threshold: 0.7
    weight: 2.0

Then run the following:

just run-ui

…ssions

…ls was done - initial working version

…think this pattern based scanning adds a lot

…e if available + small updates in scanning approach - use both pattern based scanning and bert models (they don't always catch command injection)

dorien-koelemeijer · 2025-08-12T01:20:21Z

crates/goose/src/security/model_downloader.rs

+        let script_content = format!(
+            r#"#!/usr/bin/env python3
+"""
+Runtime model conversion script for Goose security models


Just to clarify: this only here as a fallback option and is not currently used

…', only give option to 'Allow once' or 'Deny'

…ut a better solution for this

…e certain tools are not automatically approved (even though they are flagged as security findings)

DOsinga

I think my proposal would be to split this into two things; get a proof of concept in that only uses the pattern matching, but has the right plumbing in for tool call checking and play with that for a bit. And then add the clever bit

DOsinga · 2025-08-12T16:19:41Z

crates/goose/Cargo.toml

+# ML inference backends for security scanning
+ort = "2.0.0-rc.10"  # ONNX Runtime - use latest RC
+tokenizers = { version = "0.20.4", default-features = false, features = ["onig"] }  # HuggingFace tokenizers
+


how much does this add to the executable? do we need the huggingface tokenizer? can we not get this done using tiktoken?

Apologies, didn't realise you'd left feedback - having a look now. Thanks for the feedback 🙏

DOsinga · 2025-08-12T16:19:56Z

crates/goose/src/agents/agent.rs

                                        ).await;

+                                    // DEBUG: Log tool categorization
+                                    println!("🔍 DEBUG: Tool categorization results:");


yeah, let's

DOsinga · 2025-08-14T15:29:30Z

crates/goose/src/agents/agent.rs

+                                        yield AgentEvent::Message(Message::assistant().with_text(download_message));
+                                    }
+
+                                    // SECURITY FIX: Scan tools for prompt injection BEFORE permission checking


this is probably fine for testing, but we should refactor this so we have a generic way to check whether we want to run a tool that says yes/no/ask the user with prompt and then take the least permissive version of that, i.e. if the counter says no, don't even run the security thing if that makes sense

DOsinga · 2025-08-14T15:32:18Z

crates/goose/src/agents/tool_execution.rs

+                            // Log user decision, especially for security-flagged tools
+                            if let Some(security_result) = security_context {
+                                match confirmation.permission {
+                                    Permission::AllowOnce | Permission::AlwaysAllow => {


what does allows allow mean in this context?

It means the request gets processed further.

yeah, I mean, AlwaysAllow doesn't make that much sense in this context but I guess you do want to catch all the arms of the match even though the client shouldn't return that here. so maybe unreachable is better?

DOsinga · 2025-08-14T15:32:53Z

crates/goose/src/agents/tool_execution.rs

+                                    permission = ?confirmation.permission,
+                                    "🔒 User decision for tool execution"
+                                );
+                            }


we should be able to simplify this logging block I think and reduce the code duplication

DOsinga · 2025-08-14T16:01:12Z