diff --git a/docs/benchmarking/NSFW_roc_curve.png b/docs/benchmarking/NSFW_roc_curve.png
index 0a8d394..eb1cd92 100644
Binary files a/docs/benchmarking/NSFW_roc_curve.png and b/docs/benchmarking/NSFW_roc_curve.png differ
diff --git a/docs/benchmarking/nsfw.md b/docs/benchmarking/nsfw.md
deleted file mode 100644
index df331e3..0000000
--- a/docs/benchmarking/nsfw.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# NSFW Text Check Benchmark Results
-
-## Dataset Description
-
-This benchmark evaluates model performance on a balanced set of social media posts:
-
-- Open Source [Toxicity dataset](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv)
-- 500 NSFW (true) and 500 non-NSFW (false) samples
-- All samples are sourced from real social media platforms
-
-**Total n = 1,000; positive class prevalence = 500 (50.0%)**
-
-## Results
-
-### ROC Curve
-
-![ROC Curve](./NSFW_roc_curve.png)
-
-### Metrics Table
-
-| Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
-|--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
-
-#### Notes
-- ROC AUC: Area under the ROC curve (higher is better)
-- Prec@R: Precision at the specified recall threshold
-- Recall@FPR=0.01: Recall when the false positive rate is 1%
diff --git a/docs/ref/checks/nsfw.md b/docs/ref/checks/nsfw.md
index 0700d94..2341096 100644
--- a/docs/ref/checks/nsfw.md
+++ b/docs/ref/checks/nsfw.md
@@ -82,10 +82,12 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini (default) | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
+| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
+| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
+| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
+| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
+| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
+| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
 
 **Notes:**
diff --git a/mkdocs.yml b/mkdocs.yml
index e6e370a..51da93f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -38,13 +38,14 @@ nav:
       - "Streaming vs Blocking": streaming_output.md
       - Tripwires: tripwires.md
       - Checks:
-          - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
           - Contains PII: ref/checks/pii.md
           - Custom Prompt Check: ref/checks/custom_prompt_check.md
           - Hallucination Detection: ref/checks/hallucination_detection.md
           - Jailbreak Detection: ref/checks/jailbreak.md
           - Moderation: ref/checks/moderation.md
+          - NSFW: ref/checks/nsfw.md
          - Off Topic Prompts: ref/checks/off_topic_prompts.md
+          - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
           - URL Filter: ref/checks/urls.md
       - Evaluation Tool: evals.md
   - API Reference: