
Conversation

dorien-koelemeijer (Collaborator) commented on Aug 12, 2025

Overview

This PR implements prompt injection detection, providing users with security warnings when potentially malicious tool calls are detected. The implementation reuses as much of the existing infrastructure as possible.

Working design doc with requirements

These notes are a result of pairing sessions with Douwe: https://docs.google.com/document/d/1OWh7Ab_eu8STaoplPA6AKF6AnXQNqXc8JYiAwmUYDTs/edit?tab=t.0

Implementation approach

  • Added security scanning at the check_tool_permissions() stage, a natural choke point in the agent workflow (see the sketch after this list).
  • Security scanning now joins the existing permission checking and tool monitoring in one consolidated location.
  • The security scanner receives both the proposed tool calls and the full conversation history for comprehensive analysis.
  • Leveraged the existing ToolCall confirmation workflow instead of building custom UI components, and security decisions flow through the existing approval/denial logging infrastructure.
  • Users define which models they want to use in the Goose config file. Currently we support ONNX models, but in the future we want to expand this scope. The default model (see Examples below) can be downloaded from HuggingFace directly in ONNX format.
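
To make the flow concrete, here is a rough sketch of that choke point; ToolCall, Message, SecurityScanner, ScanResult and scan_at_permission_check are illustrative stand-ins rather than the PR's actual types:

// Illustrative sketch only; all names here are stand-ins for the real types
// around check_tool_permissions().
struct ToolCall {
    name: String,
    arguments: String,
}

struct Message {
    role: String,
    text: String,
}

struct ScanResult {
    tool_name: String,
    flagged: bool,
    explanation: String,
}

trait SecurityScanner {
    // Receives both the proposed tool calls and the full conversation history.
    fn scan(&self, tool_calls: &[ToolCall], history: &[Message]) -> Vec<ScanResult>;
}

// Security scanning runs at the same choke point as permission checks and
// tool monitoring, so every proposed tool call passes through it once.
fn scan_at_permission_check(
    scanner: Option<&dyn SecurityScanner>,
    tool_calls: &[ToolCall],
    history: &[Message],
) -> Vec<ScanResult> {
    match scanner {
        Some(scanner) => scanner.scan(tool_calls, history),
        None => Vec::new(), // scanning is opt-in via the config shown below
    }
}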

Key changes

  • agent.rs: Added apply_security_results_to_permissions() to move security-flagged tools from approved to needs_approval (a rough sketch follows below this list).
  • tool_execution.rs: Enhanced handle_approval_tool_requests_with_security() to include security warnings in confirmation prompts.
  • scanner.rs: Simplified security messages into user-friendly explanations.
  • ToolCallConfirmation.tsx: Added support for custom security warning prompts.
  • GooseMessage.tsx: Updated to pass security context from the backend to UI components.
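
A minimal sketch of the idea behind apply_security_results_to_permissions(), using stand-in types (the real structs live in agent.rs and differ in detail):

#[derive(Debug)]
struct ToolRequest {
    id: String,
}

struct ScanResult {
    tool_request_id: String,
    flagged: bool,
}

#[derive(Default)]
struct PermissionCheckResult {
    approved: Vec<ToolRequest>,
    needs_approval: Vec<ToolRequest>,
}

fn apply_security_results_to_permissions(
    permissions: &mut PermissionCheckResult,
    scan_results: &[ScanResult],
) {
    let mut still_approved = Vec::new();
    for request in permissions.approved.drain(..) {
        let flagged = scan_results
            .iter()
            .any(|r| r.flagged && r.tool_request_id == request.id);
        if flagged {
            // Flagged calls are re-routed through the existing confirmation
            // workflow instead of being executed silently.
            permissions.needs_approval.push(request);
        } else {
            still_approved.push(request);
        }
    }
    permissions.approved = still_approved;
}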

Examples

Goose config file

  • opt-in security scanning
  • model selection and configuration
security:
  enabled: true
  models:
  - model: protectai/deberta-v3-base-prompt-injection-v2
    threshold: 0.7
    weight: 2.0

Examples of Goose sessions (the same tool call, but different outcomes depending on what the user has provided in their message(s))

[Screenshot 2025-08-12 at 10:52:11 am] [Screenshot 2025-08-12 at 10:52:49 am]

Thanks to Michael Rand for the test below:
[Screenshot 2025-08-14 at 11:54:06 am]

Testing instructions

Add this to the goose config file:

security:
  enabled: true
  models:
  - model: protectai/deberta-v3-base-prompt-injection-v2
    threshold: 0.7
    weight: 2.0

or this:

security:
  enabled: true
  models:
  - model: meta-llama/Llama-Prompt-Guard-2-86M
    threshold: 0.7
    weight: 2.0

Then run the following:

just run-ui

dorien-koelemeijer marked this pull request as a draft on August 12, 2025 at 01:16
let script_content = format!(
r#"#!/usr/bin/env python3
"""
Runtime model conversion script for Goose security models

dorien-koelemeijer (Collaborator, author) commented:

Just to clarify: this is only here as a fallback option and is not currently used.

dorien-koelemeijer marked this pull request as ready for review on August 12, 2025 at 22:13
dorien-koelemeijer changed the title from "Prompt injection detection" to "feat: Prompt injection detection" on Aug 13, 2025

DOsinga (Collaborator) left a comment:

My proposal would be to split this into two things: get a proof of concept in that only uses pattern matching but has the right plumbing for tool-call checking, play with that for a bit, and then add the clever bit.

# ML inference backends for security scanning
ort = "2.0.0-rc.10" # ONNX Runtime - use latest RC
tokenizers = { version = "0.20.4", default-features = false, features = ["onig"] } # HuggingFace tokenizers

Collaborator:

How much does this add to the executable? Do we need the HuggingFace tokenizer? Can we not get this done using tiktoken?

dorien-koelemeijer (author):

Apologies, didn't realise you'd left feedback - having a look now. Thanks for the feedback 🙏

).await;

// DEBUG: Log tool categorization
println!("🔍 DEBUG: Tool categorization results:");

Collaborator:

yeah, let's

yield AgentEvent::Message(Message::assistant().with_text(download_message));
}

// SECURITY FIX: Scan tools for prompt injection BEFORE permission checking

Collaborator:

This is probably fine for testing, but we should refactor it so we have a generic way to check whether we want to run a tool, one that says yes / no / ask-the-user-with-a-prompt, and then take the least permissive of those results, i.e. if the counter says no, don't even run the security check, if that makes sense.
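
A possible shape for that refactor, with illustrative names (ToolDecision and decide_for_tool are not existing Goose types):

// Every check (permission rules, monitoring, security scanner) returns one of
// these; the final outcome is the least permissive of them all.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum ToolDecision {
    Allow,
    AskUser { prompt: String },
    Deny,
}

fn decide_for_tool(
    permission_check: ToolDecision,
    run_security_scan: impl FnOnce() -> ToolDecision,
) -> ToolDecision {
    // If something already said Deny, skip the (more expensive) security scan.
    if permission_check == ToolDecision::Deny {
        return ToolDecision::Deny;
    }
    // Deny > AskUser > Allow by the derived Ord on variant order, so `max`
    // picks the least permissive of the two results.
    permission_check.max(run_security_scan())
}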

// Log user decision, especially for security-flagged tools
if let Some(security_result) = security_context {
match confirmation.permission {
Permission::AllowOnce | Permission::AlwaysAllow => {

Collaborator:

What does AlwaysAllow mean in this context?

dorien-koelemeijer (author):

It means the request gets processed further.

Collaborator:

Yeah, I mean, AlwaysAllow doesn't make that much sense in this context, but I guess you do want to cover all the arms of the match even though the client shouldn't return that here. So maybe unreachable is better?
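
A tiny sketch of the unreachable! suggestion; the enum is a stand-in, and only AllowOnce and AlwaysAllow come from the snippet above (DenyOnce is hypothetical):

enum Permission {
    AllowOnce,
    AlwaysAllow,
    DenyOnce,
}

fn handle_security_confirmation(permission: Permission) -> bool {
    match permission {
        Permission::AllowOnce => true,
        Permission::DenyOnce => false,
        // The client shouldn't return AlwaysAllow for a security-flagged tool,
        // so fail loudly instead of quietly treating it as an approval.
        Permission::AlwaysAllow => {
            unreachable!("unexpected AlwaysAllow for a security confirmation")
        }
    }
}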

permission = ?confirmation.permission,
"🔒 User decision for tool execution"
);
}

Collaborator:

We should be able to simplify this logging block, I think, and reduce the code duplication.
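
One way the logging could be collapsed into a single structured call (sketch only; the helper name is made up and the field names are taken from the snippet above):

fn log_user_decision(permission: &impl std::fmt::Debug, security_flagged: bool) {
    // A single structured log line replaces the duplicated per-arm logging.
    tracing::info!(
        permission = ?permission,
        security_flagged,
        "🔒 User decision for tool execution"
    );
}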

}

/// Check if models are available and trigger download if needed
fn check_and_prepare_models() -> bool {

Collaborator:

This function always returns true. If the model is not around, it spawns a task to download it, but that won't be finished by the time we need the model, I think?
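
A sketch of one way to address this, assuming a hypothetical download_models helper: either await the download before reporting success, or return false so the caller falls back to pattern-based scanning.

use std::path::Path;

async fn check_and_prepare_models(model_dir: &Path) -> bool {
    if model_dir.join("model.onnx").exists() {
        return true;
    }
    // Await the download so the model is actually ready when we return true,
    // instead of spawning a fire-and-forget task.
    match download_models(model_dir).await {
        Ok(()) => true,
        Err(err) => {
            tracing::warn!(error = %err, "model download failed, falling back to pattern-based scanning");
            false
        }
    }
}

async fn download_models(_model_dir: &Path) -> Result<(), std::io::Error> {
    // Placeholder for the real HuggingFace download logic.
    Ok(())
}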

}

tracing::info!("🔒 ONNX model not available, will use pattern-based scanning");
Ok(None)

Collaborator:

This function says it might return an error, but it doesn't; instead it returns None or a model. Let's make it return an error.
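
A sketch of the suggested signature change with stand-in types (ScannerError and OnnxScanner are illustrative): a missing model becomes an Err rather than Ok(None), so callers can't silently ignore it.

use std::path::{Path, PathBuf};

#[derive(Debug)]
enum ScannerError {
    ModelNotAvailable(PathBuf),
}

struct OnnxScanner; // stand-in for the real ONNX-backed scanner

fn load_onnx_scanner(model_path: &Path) -> Result<OnnxScanner, ScannerError> {
    if !model_path.exists() {
        return Err(ScannerError::ModelNotAvailable(model_path.to_path_buf()));
    }
    Ok(OnnxScanner)
}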

tracing::info!("🔒 ML model predict returned successfully");
// Get threshold from config
let threshold = self.get_threshold_from_config();
let ml_is_malicious = ml_confidence > threshold;

Collaborator:

we do this twice

"base64 -d" | "eval " | "exec " => 0.60,

_ => 0.50,
};

Collaborator:

We're mentioning the patterns here twice; it would be better to just group them with a score in the initial const. I'm also not sure about these patterns: rm -rf ~/images/tmp seems fine, for example, and network access is also fine.
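
One possible shape for grouping pattern and score in a single const, using only the patterns and scores visible in the snippet above:

// Keep each pattern and its score together so the list isn't repeated in a
// separate match.
const SUSPICIOUS_PATTERNS: &[(&str, f32)] = &[
    ("base64 -d", 0.60),
    ("eval ", 0.60),
    ("exec ", 0.60),
];

const DEFAULT_PATTERN_SCORE: f32 = 0.50;

fn pattern_score(command: &str) -> f32 {
    SUSPICIOUS_PATTERNS
        .iter()
        .find(|&&(pattern, _)| command.contains(pattern))
        .map(|(_, score)| *score)
        .unwrap_or(DEFAULT_PATTERN_SCORE)
}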

dorien-koelemeijer (author):

Closing this, as we discussed creating a simplified version with only pattern-based matching and leaving the model scanning for a follow-up PR. New PR: #4237
