Different approach to determining final confidence level of prompt injection evaluation outcomes #6729
Pull request overview
This PR adjusts how the prompt-injection scanner combines the tool-level and conversation-context classifier outputs to derive a final security confidence score and logging details. The goal appears to be a more nuanced confidence-combination heuristic and richer logging while keeping the external ScanResult interface stable.
Changes:
- Replace context-aware result selection (select_result_with_context_awareness) with a new numeric combination heuristic in combine_confidences, using both tool and context confidences.
- Update logging in analyze_tool_call_with_context to structured fields, including per-signal confidences, presence of ML and pattern matches, and the effective malicious decision.
- Build the final ScanResult from a synthetic DetailedScanResult that uses the combined confidence along with the tool's pattern matches and ML confidence.
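The last two bullets can be pictured with the minimal Rust sketch below. It is not the PR's code: the struct fields, the MALICIOUS_THRESHOLD value, and the build_final_result helper are hypothetical, and println! stands in for whatever structured logger the scanner actually uses. It shows a synthetic DetailedScanResult that carries the combined confidence plus the tool's own pattern matches and ML confidence, collapsed into a ScanResult, with the per-signal values logged as key=value fields.

```rust
// Hypothetical stand-ins for the scanner's types; the real ones likely differ.
#[derive(Debug, Clone)]
struct DetailedScanResult {
    confidence: f32,
    pattern_matches: Vec<String>,
    ml_confidence: Option<f32>,
}

#[derive(Debug)]
struct ScanResult {
    is_malicious: bool,
    confidence: f32,
    explanation: String,
}

// Hypothetical cut-off for the "effective malicious decision".
const MALICIOUS_THRESHOLD: f32 = 0.7;

fn build_final_result(
    tool: &DetailedScanResult,
    context_confidence: f32,
    combined_confidence: f32,
) -> ScanResult {
    // Synthetic result: combined confidence, but the tool's own evidence.
    let synthetic = DetailedScanResult {
        confidence: combined_confidence,
        pattern_matches: tool.pattern_matches.clone(),
        ml_confidence: tool.ml_confidence,
    };
    let is_malicious = synthetic.confidence >= MALICIOUS_THRESHOLD;

    // Structured key=value log line (println! stands in for a real logger).
    println!(
        "tool_confidence={:.2} context_confidence={:.2} combined_confidence={:.2} has_ml={} has_patterns={} malicious={}",
        tool.confidence,
        context_confidence,
        synthetic.confidence,
        synthetic.ml_confidence.is_some(),
        !synthetic.pattern_matches.is_empty(),
        is_malicious
    );

    let explanation = if synthetic.pattern_matches.is_empty() {
        "no pattern matches".to_string()
    } else {
        format!("pattern matches: {}", synthetic.pattern_matches.join(", "))
    };

    ScanResult {
        is_malicious,
        confidence: synthetic.confidence,
        explanation,
    }
}

fn main() {
    let tool = DetailedScanResult {
        confidence: 0.9,
        pattern_matches: vec!["ignore previous instructions".to_string()],
        ml_confidence: Some(0.85),
    };
    let result = build_final_result(&tool, 0.2, 0.75);
    println!(
        "malicious={} confidence={:.2} explanation={}",
        result.is_malicious, result.confidence, result.explanation
    );
}
```

Keeping the tool's pattern matches and ML confidence while swapping in only the combined score means the external ScanResult shape does not change, which matches the overview's note that the interface stays stable.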
Branch updated from c098b4e to 95465f8.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
michaelneale left a comment
seems good
Summary
Simplifies the logic for combining tool and context confidence scores. The previous approach was slightly too aggressive in suppressing false positives: in certain cases it zeroed out the confidence entirely. This refactor instead uses a more balanced, threshold-based approach with dampening and boosting rules to reduce false positives.
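To make "threshold-based dampening/boosting" concrete, here is a minimal Rust sketch of what a combine_confidences function in that spirit could look like. The thresholds (HIGH, LOW), DAMPEN_FACTOR, BOOST, and the exact rule set are hypothetical illustrations, not the values or rules in this PR. The key contrast with a zeroing approach is the second branch: when the tool flags something but the context looks benign, the score is scaled down rather than set to 0.0.

```rust
// Hypothetical thresholds and factors; the PR's actual values are not shown.
const HIGH: f32 = 0.8;
const LOW: f32 = 0.3;
const DAMPEN_FACTOR: f32 = 0.5;
const BOOST: f32 = 0.1;

/// Combine tool-level and conversation-context confidences (both in 0.0..=1.0).
fn combine_confidences(tool: f32, context: f32) -> f32 {
    let combined = if tool >= HIGH && context >= HIGH {
        // Both signals agree the call looks malicious: boost (clamped below).
        tool.max(context) + BOOST
    } else if tool >= HIGH && context < LOW {
        // Tool flags it but the context looks benign: dampen instead of
        // zeroing out, so a strong tool signal still carries some weight.
        tool * DAMPEN_FACTOR
    } else {
        // No strong agreement or disagreement: keep the tool's confidence.
        tool
    };
    combined.clamp(0.0, 1.0)
}

fn main() {
    let cases: [(f32, f32); 3] = [(0.9, 0.95), (0.9, 0.1), (0.5, 0.9)];
    for (tool, ctx) in cases {
        println!(
            "tool={tool:.2} context={ctx:.2} -> combined={:.2}",
            combine_confidences(tool, ctx)
        );
    }
}
```

Because the dampened case never reaches zero, a downstream threshold can still catch a very confident tool-level detection, while moderate tool scores in a benign context fall below it.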
We'll need some user testing to determine whether this approach is solid or whether it needs slight tweaks later on.
Type of Change
AI Assistance
Testing
Quick local testing only. Local testing can cover only so much, so broader user testing will be needed to determine whether this is actually an improvement or whether things need further tweaking.