feat: Prompt injection detection #4021
Conversation
…ls was done - initial working version
…think this pattern based scanning adds a lot
…e if available + small updates in scanning approach - use both pattern based scanning and bert models (they don't always catch command injection)
let script_content = format!(
    r#"#!/usr/bin/env python3
"""
Runtime model conversion script for Goose security models
Just to clarify: this is only here as a fallback option and is not currently used
…', only give option to 'Allow once' or 'Deny'
…ut a better solution for this
…e certain tools are not automatically approved (even though they are flagged as security findings)
DOsinga
left a comment
I think my proposal would be to split this into two things: get a proof of concept in that only uses the pattern matching but has the right plumbing in for tool-call checking, play with that for a bit, and then add the clever bit
# ML inference backends for security scanning
ort = "2.0.0-rc.10" # ONNX Runtime - use latest RC
tokenizers = { version = "0.20.4", default-features = false, features = ["onig"] } # HuggingFace tokenizers
how much does this add to the executable? do we need the huggingface tokenizer? can we not get this done using tiktoken?
Apologies, didn't realise you'd left feedback - having a look now. Thanks for the feedback 🙏
crates/goose/src/agents/agent.rs
Outdated
).await;

// DEBUG: Log tool categorization
println!("🔍 DEBUG: Tool categorization results:");
yeah, let's
yield AgentEvent::Message(Message::assistant().with_text(download_message));
}

// SECURITY FIX: Scan tools for prompt injection BEFORE permission checking
this is probably fine for testing, but we should refactor this so we have a generic way to check whether we want to run a tool that returns yes / no / ask-the-user-with-a-prompt, and then take the least permissive of those, i.e. if the counter says no, don't even run the security thing, if that makes sense
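A rough sketch of what that generic check could look like (the `ToolDecision` type and `combine` helper are hypothetical names, not code from this PR):

```rust
/// Hypothetical outcome of any pre-execution check
/// (permission counter, security scanner, user config, ...).
#[derive(Debug)]
enum ToolDecision {
    Deny,
    AskUser { prompt: String },
    Allow,
}

impl ToolDecision {
    /// Combine two decisions, keeping the least permissive one.
    fn combine(self, other: ToolDecision) -> ToolDecision {
        use ToolDecision::*;
        match (self, other) {
            (Deny, _) | (_, Deny) => Deny,
            (AskUser { prompt }, _) | (_, AskUser { prompt }) => AskUser { prompt },
            (Allow, Allow) => Allow,
        }
    }
}
```

A caller could then evaluate the cheap checks first and short-circuit: if the permission counter already returns `Deny`, the more expensive security scan never runs.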
// Log user decision, especially for security-flagged tools
if let Some(security_result) = security_context {
    match confirmation.permission {
        Permission::AllowOnce | Permission::AlwaysAllow => {
what does AlwaysAllow mean in this context?
It means the request gets processed further.
yeah, I mean, AlwaysAllow doesn't make that much sense in this context but I guess you do want to catch all the arms of the match even though the client shouldn't return that here. so maybe unreachable is better?
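For illustration, the suggestion would look something like this (only `Permission::AllowOnce` and `Permission::AlwaysAllow` come from the snippet above; the remaining arms are sketched):

```rust
match confirmation.permission {
    Permission::AllowOnce => {
        // proceed with the tool call
    }
    // The client shouldn't send AlwaysAllow for a security-flagged tool,
    // so state that assumption instead of silently treating it as AllowOnce.
    Permission::AlwaysAllow => {
        unreachable!("client returned AlwaysAllow for a security-flagged tool")
    }
    // ... remaining (deny/cancel) arms unchanged
    _ => {}
}
```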
            permission = ?confirmation.permission,
            "🔒 User decision for tool execution"
        );
    }
we should be able to simplify this logging block I think and reduce the code duplication
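One way to cut the duplication would be to log the decision once and let the permission field carry the outcome (a sketch; `findings` is an assumed field name, not necessarily what the PR uses):

```rust
if let Some(security_result) = security_context {
    // One log statement instead of one per match arm; the permission value
    // itself says whether the user allowed or denied the tool.
    tracing::info!(
        permission = ?confirmation.permission,
        security_findings = security_result.findings.len(),
        "🔒 User decision for security-flagged tool"
    );
}
```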
}

/// Check if models are available and trigger download if needed
fn check_and_prepare_models() -> bool {
this function always returns true. if the model is not around it spawns a task to download it, but that won't be finished when we need the model I think?
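A sketch of the direction this points in (the helpers `model_files_present` and `spawn_model_download` are placeholders standing in for whatever the PR actually does): report whether the model is usable for the current scan, and let the caller fall back to pattern-based scanning while the download runs in the background.

```rust
/// Check if models are available; kick off a background download if not.
/// Returns whether the model can actually be used for *this* scan.
fn check_and_prepare_models() -> bool {
    if model_files_present() {
        return true;
    }
    // The download won't finish in time for the current tool call, so report
    // the model as unavailable and let the caller use pattern-based scanning.
    spawn_model_download();
    false
}

// Placeholder stubs standing in for the real logic in this PR.
fn model_files_present() -> bool {
    false // e.g. check that the cached ONNX file exists on disk
}
fn spawn_model_download() {
    // e.g. tokio::spawn(download_models())
}
```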
}

tracing::info!("🔒 ONNX model not available, will use pattern-based scanning");
Ok(None)
this function says it might return an error, but it doesn't. instead it returns None or a model. let's make it return an error
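Something along these lines, assuming the `thiserror` crate; the `ModelError` name and variants are made up for illustration:

```rust
use std::path::Path;
use thiserror::Error;

/// Illustrative error type; the real name and variants would live in the scanner module.
#[derive(Debug, Error)]
enum ModelError {
    #[error("ONNX model not available")]
    NotAvailable,
    #[error("failed to read ONNX model: {0}")]
    Io(#[from] std::io::Error),
}

/// Sketch: return the model bytes or an error instead of Ok(None),
/// so "no model" is an explicit case the caller has to handle.
fn load_onnx_model(path: &Path) -> Result<Vec<u8>, ModelError> {
    if !path.exists() {
        return Err(ModelError::NotAvailable);
    }
    Ok(std::fs::read(path)?)
}
```

A caller that wants the pattern-based fallback can match on `ModelError::NotAvailable` and degrade gracefully instead of treating `None` as success.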
tracing::info!("🔒 ML model predict returned successfully");
// Get threshold from config
let threshold = self.get_threshold_from_config();
let ml_is_malicious = ml_confidence > threshold;
we do this twice
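If the comparison really does appear twice, it could be hoisted into one helper (a self-contained sketch; the struct and method names are assumptions based on the snippet above):

```rust
struct Scanner {
    threshold: f32,
}

impl Scanner {
    fn get_threshold_from_config(&self) -> f32 {
        self.threshold
    }

    /// Single place for the confidence-vs-threshold decision,
    /// so both call sites share it.
    fn is_malicious(&self, confidence: f32) -> bool {
        confidence > self.get_threshold_from_config()
    }
}
```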
| "base64 -d" | "eval " | "exec " => 0.60, | ||
|
|
||
| _ => 0.50, | ||
| }; |
we're mentioning the patterns here twice; it would be better to just group them with a score in the initial const. I'm also not sure about these patterns: rm -rf ~/images/tmp seems fine, for example, and network access is also fine
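A sketch of grouping each pattern with its score in a single const (only the patterns and scores visible in this snippet are shown):

```rust
/// Each suspicious pattern paired with its risk score, defined once.
const SUSPICIOUS_PATTERNS: &[(&str, f32)] = &[
    ("base64 -d", 0.60),
    ("eval ", 0.60),
    ("exec ", 0.60),
];

const DEFAULT_PATTERN_SCORE: f32 = 0.50;

/// Look up the score for a matched pattern, falling back to the default.
fn pattern_score(pattern: &str) -> f32 {
    SUSPICIOUS_PATTERNS
        .iter()
        .find(|(p, _)| *p == pattern)
        .map(|(_, score)| *score)
        .unwrap_or(DEFAULT_PATTERN_SCORE)
}
```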
Closing this as we discussed creating a simplified version of this with only pattern-based matching and leaving the model scanning for a follow-up PR. New PR: #4237
Overview
This PR implements prompt injection detection, providing users with security warnings when potentially malicious tool calls are detected. The implementation re-uses as much of the existing infrastructure as possible.
Working design doc with requirements
These notes are a result of pairing sessions with Douwe: https://docs.google.com/document/d/1OWh7Ab_eu8STaoplPA6AKF6AnXQNqXc8JYiAwmUYDTs/edit?tab=t.0
Implementation approach
Key changes
- agent.rs: Added apply_security_results_to_permissions() to move security-flagged tools from approved to needs_approval
- tool_execution.rs: Enhanced handle_approval_tool_requests_with_security() to include security warnings in confirmation prompts
- scanner.rs: Simplified security messages for user-friendly explanations
- ToolCallConfirmation.tsx: Added support for custom security warning prompts
- GooseMessage.tsx: Updated to pass security context from backend to UI components

Examples
Goose config file
Examples of goose sessions (the same tool call, but different outcomes based on what the user has provided in their message(s))
Thanks to Michael Rand for the test below:

Testing instructions
Add this to the goose config file:
or this:
Then run the following: