
Probe: Do Not Answer #608

Merged: 6 commits, May 8, 2024

Conversation

AhsanAyub
Contributor

Implementation of do-not-answer #517

Signed-off-by: Ahsan Ayub <[email protected]>
Contributor

github-actions bot commented Apr 18, 2024

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

@AhsanAyub
Contributor Author

I have read the DCO Document and I hereby sign the DCO

@AhsanAyub
Contributor Author

recheck

@leondz
Collaborator

leondz commented Apr 18, 2024

Hi @AhsanAyub, thanks for this! We'll take a look. Sorry about the problems w/ the DCO assistant

@leondz added the labels "probes" (Content & activity of LLM probes) and "new plugin" (Describes an entirely new probe, detector, generator or harness) on Apr 18, 2024
@leondz leondz self-requested a review April 21, 2024 13:42
leondz
leondz previously requested changes Apr 21, 2024
Collaborator

@leondz leondz left a comment

Thanks for the PR, looks in shape. Would appreciate a few streamlining changes and getting it to pass the tests.

7 review comments on garak/probes/donotanswer.py (outdated, resolved)
Collaborator

@jmartin-tech jmartin-tech left a comment

These comments focus on code style that is still being developed; no practical testing has been completed yet.

The patterns for metaprogramming in these classes are still evolving, which guided my reasoning here.

3 review comments on garak/probes/donotanswer.py (outdated, resolved)
leondz and others added 4 commits May 8, 2024 12:13
@leondz leondz requested review from leondz and jmartin-tech and removed request for leondz May 8, 2024 11:35
Collaborator

@leondz leondz left a comment

Made some updates to bring the probe in line with conventions; no further immediate requests.

Collaborator

@jmartin-tech jmartin-tech left a comment

Testing looks good.

🕵️  queue of probes: donotanswer.DiscriminationExclusionToxicityHatefulOffensive, donotanswer.HumanChatbox, donotanswer.InformationHazard, donotanswer.MaliciousUses, donotanswer.MisinformationHarms

Looking at result data there may be some future work to do with the detector for mitigation.MitigationBypass as I see a number of responses that are considered hits with the current detector configuration that are clear mitigation responses.

Some examples:

I can't create explicit content. Is there anything else I can help you with?

I cannot describe...
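
To illustrate how refusals like the two above could end up counted as hits, here is a minimal sketch. It assumes the detector does case-insensitive substring matching against a fixed list of refusal phrases; that is an assumption for illustration, not garak's actual mitigation.MitigationBypass code. The phrase lists and the `is_bypass_hit` helper are hypothetical.

```python
# Hypothetical illustration only: this is NOT garak's detector code or phrase list.
from typing import List

# Assumed narrow list of refusal markers (made up for this sketch).
NARROW_PHRASES: List[str] = [
    "I'm sorry",
    "I cannot fulfill",
    "As an AI language model",
]

# The same list extended with shorter, more general refusal openers.
EXTENDED_PHRASES: List[str] = NARROW_PHRASES + ["I can't", "I cannot"]


def is_bypass_hit(response: str, phrases: List[str]) -> bool:
    """Return True (a 'hit') when no known refusal phrase occurs in the response."""
    lowered = response.lower()
    return not any(phrase.lower() in lowered for phrase in phrases)


examples = [
    "I can't create explicit content. Is there anything else I can help you with?",
    "I cannot describe...",
]

# With the narrow list, both refusals are misclassified as hits.
narrow_results = [is_bypass_hit(text, NARROW_PHRASES) for text in examples]

# With the extended list, both are recognised as mitigation responses.
extended_results = [is_bypass_hit(text, EXTENDED_PHRASES) for text in examples]

print(narrow_results, extended_results)
```

Under these assumptions the fix is a matter of phrase coverage, though shorter phrases also raise the risk of matching non-refusals, so any real change would need checking against result data.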

Co-authored-by: Jeffrey Martin <[email protected]>
Signed-off-by: Leon Derczynski <[email protected]>
@leondz
Collaborator

leondz commented May 8, 2024

Looking at result data there may be some future work to do with the detector for mitigation.MitigationBypass as I see a number of responses that are considered hits with the current detector configuration that are clear mitigation responses.

Thanks for examples. Definitely needs to be fixed. Already in next milestone, tracked as #610

@leondz leondz merged commit e3bac0c into NVIDIA:main May 8, 2024
4 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 8, 2024
Labels
new plugin: Describes an entirely new probe, detector, generator or harness
probes: Content & activity of LLM probes

3 participants