Prompt injection detection (simplified - only pattern matching) #4237
DOsinga
left a comment
sorry got to this late - already day for you so sending you what I have
DOsinga
left a comment
running out of time again, here are some remarks I had so far. mostly I think we need to integrate more tool checking into this and I would expect to see some deleted code coming out on the other hand. but I like the infra to do the checking
sorry @dorien-koelemeijer, there are some conflicts to fix. I would fix them myself, but I'm not sure I would preserve the correct intention
just a thought @dorien-koelemeijer, but would you consider the patterns.rs to instead load from a local embedded yaml, then a cached yaml, and then be able to fetch updated yaml on the fly as it starts, as new patterns emerge, something like:
threat_patterns:
  - name: "rm_rf_root"
    pattern: "rm\\s+(-[rf]*[rf][rf]*|--recursive|--force).*[/\\\\]"
    description: "Recursive file deletion with rm -rf"
    risk_level: "Critical"
    category: "FileSystemDestruction"
  - name: "rm_rf_system"
    pattern: "rm\\s+(-[rf]*[rf][rf]*|--recursive|--force).*(bin|etc|usr|var|sys|proc|dev|boot|lib|opt|srv|tmp)"
    description: "Recursive deletion of system directories"
    risk_level: "Critical"
    category: "FileSystemDestruction"
  - name: "dd_destruction"
    pattern: "dd\\s+.*if=/dev/(zero|random|urandom).*of=/dev/[sh]d[a-z]"
    description: "Disk destruction using dd command"
    risk_level: "Critical"
    category: "FileSystemDestruction"
  - name: "format_drive"
    pattern: "(format|mkfs\\.[a-z]+)\\s+[/\\\\]dev[/\\\\][sh]d[a-z]"
    description: "Formatting system drives"
    risk_level: "Critical"
    category: "FileSystemDestruction"
...where it works exactly as now - but it can also look up some CDN based content we provide which we can centrally manage and update. If it finds one there, then it downloads that to a ~/.config/goose/patterns.yaml - and then loads a superset of that vs baked in (baked in take precedence?). Then each time it can look for more to update to the config dir.
If you like I can try to add this @dorien-koelemeijer?
this means we can author more rules, perhaps 1000s of them and make it smarter and more dynamic. Would want to check, but even 20K threat patterns I expect could both load fast and eval fast if we do it right (maybe yaml not right format but you get the idea)
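To make the precedence idea concrete, here is a minimal sketch of the merge step only, assuming patterns are keyed by name and that baked-in entries win on a name collision (the comment above leaves that open). The `ThreatPattern` struct and `merge_patterns` function are hypothetical names for illustration, not code from this PR, and YAML parsing and the CDN fetch are left out.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down pattern record; the real patterns.rs may differ.
#[derive(Debug, Clone, PartialEq)]
struct ThreatPattern {
    name: String,
    pattern: String,
    risk_level: String,
}

/// Merge cached (downloaded) patterns with baked-in ones.
/// Assumption: when two patterns share a name, the baked-in one wins.
fn merge_patterns(baked_in: &[ThreatPattern], cached: &[ThreatPattern]) -> Vec<ThreatPattern> {
    let mut by_name: HashMap<String, ThreatPattern> = HashMap::new();
    // Insert cached patterns first...
    for p in cached {
        by_name.insert(p.name.clone(), p.clone());
    }
    // ...so that baked-in patterns overwrite any entry with the same name.
    for p in baked_in {
        by_name.insert(p.name.clone(), p.clone());
    }
    // Sort by name so the result is deterministic.
    let mut merged: Vec<ThreatPattern> = by_name.into_values().collect();
    merged.sort_by(|a, b| a.name.cmp(&b.name));
    merged
}
```

If the opposite precedence is preferred (centrally managed updates override baked-in rules), swapping the two loops is enough.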
@dorien-koelemeijer odd - the DCO check is missing for your account, can you sign off your commits so it passes that check?
…canning will be added in follow-up PR Signed-off-by: Dorien Koelemeijer <[email protected]>
… respected + add some more detail to ToolCall pop up for user Signed-off-by: Dorien Koelemeijer <[email protected]>
…an - hopefully correctly understanding the main comment on this PR Signed-off-by: Dorien Koelemeijer <[email protected]>
…s for goose-cli Signed-off-by: Dorien Koelemeijer <[email protected]>
…rmation Signed-off-by: Dorien Koelemeijer <[email protected]>
ok tested it as of latest, and seems good with defaults... now about to test it with it enabled
@dorien-koelemeijer what is the best way to trigger this when I enable it?
I usually import this recipe and then use it:
thanks @dorien-koelemeijer - I tried that recipe and clicked "trust", but then it showed up. I pressed enter to submit, but didn't see any errors/warnings about it, using this config:
security:
  enabled: true
  confidence_threshold: 0.5
is that expected?
{/* TODO(alexhancock): Re-enable link previews once styled well again */}
{urls.length > 0 && (
{/* TEMPORARILY DISABLED (dorien-koelemeijer): This is causing issues in properly "generating" tool calls
@alexhancock will need your eyeballs to see if this is ok (this feature needs it, it looks like)
this is fine for me. i'm actually surprised to see the code was enabled in the way it was. i don't think we've rendered link previews for a long time.
if we want them again I can bring them back in a separate change.
this is looking good - will see what @alexhancock says
…k#4237) Signed-off-by: Dorien Koelemeijer <[email protected]> merging as looks good, lets keep an eye on it. Signed-off-by: Matt Donovan <[email protected]>
…k#4237) Signed-off-by: Dorien Koelemeijer <[email protected]> merging as looks good, lets keep an eye on it. Signed-off-by: HikaruEgashira <[email protected]>


Overview
This PR implements prompt injection detection using pattern-matching techniques, providing users with security warnings when potentially malicious tool calls are detected.
Working design doc with requirements
These notes are a result of pairing sessions with Douwe: https://docs.google.com/document/d/1OWh7Ab_eu8STaoplPA6AKF6AnXQNqXc8JYiAwmUYDTs/edit?tab=t.0
Implementation approach
Implemented pattern-based detection for prompt injection, scanning tool calls for potentially dangerous commands. Pattern-matching provides immediate protection while laying groundwork for future ML-based scanning with BERT models, which will be the next iteration of this work.
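As a rough illustration of the scanning step, the sketch below checks a command string against a small set of named patterns and returns the highest-risk match. It is not the PR's implementation: to stay dependency-free it uses hand-written matcher functions as a stand-in for compiled regexes, and the `ThreatPattern`, `RiskLevel`, `scan`, and `default_patterns` names are hypothetical. The pattern names and risk levels echo the examples discussed in the review thread.

```rust
// Hypothetical risk ordering; later variants compare as more severe.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum RiskLevel {
    Low,
    Medium,
    High,
    Critical,
}

struct ThreatPattern {
    name: &'static str,
    risk_level: RiskLevel,
    // Stand-in for a compiled regex: a plain predicate over the command text.
    matches: fn(&str) -> bool,
}

// Rough approximation of "rm with a recursive or force flag".
fn is_recursive_rm(command: &str) -> bool {
    let mut tokens = command.split_whitespace();
    if tokens.next() != Some("rm") {
        return false;
    }
    tokens.any(|t| {
        t == "--recursive"
            || t == "--force"
            || (t.starts_with('-')
                && !t.starts_with("--")
                && t.contains(|c: char| c == 'r' || c == 'f'))
    })
}

// Rough approximation of "dd writing to a device file".
fn is_dd_to_device(command: &str) -> bool {
    command.split_whitespace().next() == Some("dd") && command.contains("of=/dev/")
}

fn default_patterns() -> Vec<ThreatPattern> {
    vec![
        ThreatPattern { name: "rm_rf_root", risk_level: RiskLevel::Critical, matches: is_recursive_rm },
        ThreatPattern { name: "dd_destruction", risk_level: RiskLevel::Critical, matches: is_dd_to_device },
    ]
}

/// Return the highest-risk pattern that matches the command, if any.
fn scan<'a>(command: &str, patterns: &'a [ThreatPattern]) -> Option<&'a ThreatPattern> {
    patterns
        .iter()
        .filter(|p| (p.matches)(command))
        .max_by_key(|p| p.risk_level)
}
```

With these assumed matchers, `scan("rm -rf /", &default_patterns())` yields the `rm_rf_root` pattern, while `scan("ls -la", &default_patterns())` yields `None`.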
Security scanning was added in the reply_internal() method after permission checking, leveraging the existing tool approval workflow. The security scanner receives both proposed tool calls and full conversation history for contextual threat assessment (using BERT models and potentially other tools). In this iteration, the full conversation history is not considered, but we will do that once model scanning is integrated.
Security can be enabled or disabled via the security.enabled config parameter, with a configurable confidence threshold.
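The flow described above (scan after permission checking, then send flagged tools back through the approval workflow) can be sketched roughly as follows. This is a simplified illustration under assumptions: the struct shapes, the `flag_for_approval` name, and the rule that a finding counts only when its confidence is at or above the threshold are all hypothetical, and the PR's actual `apply_security_results_to_permissions()` may differ.

```rust
// Hypothetical, simplified view of the permission check outcome,
// tracking tool requests by id.
#[derive(Debug, Default, PartialEq)]
struct PermissionCheckResult {
    approved: Vec<String>,
    needs_approval: Vec<String>,
}

// Hypothetical scanner finding for a single tool request.
struct SecurityResult {
    tool_request_id: String,
    is_malicious: bool,
    confidence: f32,
}

/// Move security-flagged tool requests from `approved` to `needs_approval`.
/// Assumption: a finding counts only if `confidence >= confidence_threshold`.
fn flag_for_approval(
    mut permissions: PermissionCheckResult,
    results: &[SecurityResult],
    enabled: bool,
    confidence_threshold: f32,
) -> PermissionCheckResult {
    // With security disabled, the permission result passes through unchanged.
    if !enabled {
        return permissions;
    }
    for result in results {
        if !result.is_malicious || result.confidence < confidence_threshold {
            continue;
        }
        // Only previously approved requests need to be moved.
        if let Some(pos) = permissions
            .approved
            .iter()
            .position(|id| id == &result.tool_request_id)
        {
            let id = permissions.approved.remove(pos);
            permissions.needs_approval.push(id);
        }
    }
    permissions
}
```

Reusing the existing approval path means a flagged tool call is not blocked outright; the user still decides, with the security warning shown in the confirmation prompt.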
Key changes
- crates/goose/src/security/: New security module with scanner setup and patterns. Can be extended to support model-based scanning.
- agent.rs: Added apply_security_results_to_permissions() to move security-flagged tools from approved to needs_approval
- tool_execution.rs: Enhanced handle_approval_tool_requests_with_security() to include security warnings in confirmation prompts
- ToolCallConfirmation.tsx: Added support for custom security warning prompts
- GooseMessage.tsx: Updated to pass security context from backend to UI components
Tool Inspection Manager
Moved some of the code around to include the following in the tool inspection manager:
Configuration
Future work
ToolMonitor