diff --git a/docs/benchmarking/NSFW_roc_curve.png b/docs/benchmarking/NSFW_roc_curve.png
index 0a8d394..eb1cd92 100644
Binary files a/docs/benchmarking/NSFW_roc_curve.png and b/docs/benchmarking/NSFW_roc_curve.png differ
diff --git a/docs/benchmarking/nsfw.md b/docs/benchmarking/nsfw.md
deleted file mode 100644
index df331e3..0000000
--- a/docs/benchmarking/nsfw.md
+++ /dev/null
@@ -1,31 +0,0 @@
-# NSFW Text Check Benchmark Results
-
-## Dataset Description
-
-This benchmark evaluates model performance on a balanced set of social media posts:
-
-- Open Source [Toxicity dataset](https://github.com/surge-ai/toxicity/blob/main/toxicity_en.csv)
-- 500 NSFW (true) and 500 non-NSFW (false) samples
-- All samples are sourced from real social media platforms
-
-**Total n = 1,000; positive class prevalence = 500 (50.0%)**
-
-## Results
-
-### ROC Curve
-
-![ROC Curve](./NSFW_roc_curve.png)
-
-### Metrics Table
-
-| Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
-|--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
-
-#### Notes
-- ROC AUC: Area under the ROC curve (higher is better)
-- Prec@R: Precision at the specified recall threshold
-- Recall@FPR=0.01: Recall when the false positive rate is 1%
diff --git a/docs/ref/checks/nsfw.md b/docs/ref/checks/nsfw.md
index 0700d94..2341096 100644
--- a/docs/ref/checks/nsfw.md
+++ b/docs/ref/checks/nsfw.md
@@ -82,10 +82,12 @@ This benchmark evaluates model performance on a balanced set of social media pos
 
 | Model | ROC AUC | Prec@R=0.80 | Prec@R=0.90 | Prec@R=0.95 | Recall@FPR=0.01 |
 |--------------|---------|-------------|-------------|-------------|-----------------|
-| gpt-4.1 | 0.989 | 0.976 | 0.962 | 0.962 | 0.717 |
-| gpt-4.1-mini (default) | 0.984 | 0.977 | 0.977 | 0.943 | 0.653 |
-| gpt-4.1-nano | 0.952 | 0.972 | 0.823 | 0.823 | 0.429 |
-| gpt-4o-mini | 0.965 | 0.977 | 0.955 | 0.945 | 0.842 |
+| gpt-5 | 0.9532 | 0.9195 | 0.9096 | 0.9068 | 0.0339 |
+| gpt-5-mini | 0.9629 | 0.9321 | 0.9168 | 0.9149 | 0.0998 |
+| gpt-5-nano | 0.9600 | 0.9297 | 0.9216 | 0.9175 | 0.1078 |
+| gpt-4.1 | 0.9603 | 0.9312 | 0.9249 | 0.9192 | 0.0439 |
+| gpt-4.1-mini (default) | 0.9520 | 0.9180 | 0.9130 | 0.9049 | 0.0459 |
+| gpt-4.1-nano | 0.9502 | 0.9262 | 0.9094 | 0.9043 | 0.0379 |
 
 **Notes:**
diff --git a/mkdocs.yml b/mkdocs.yml
index e6e370a..51da93f 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -38,13 +38,14 @@ nav:
       - "Streaming vs Blocking": streaming_output.md
       - Tripwires: tripwires.md
       - Checks:
-          - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
           - Contains PII: ref/checks/pii.md
           - Custom Prompt Check: ref/checks/custom_prompt_check.md
           - Hallucination Detection: ref/checks/hallucination_detection.md
           - Jailbreak Detection: ref/checks/jailbreak.md
           - Moderation: ref/checks/moderation.md
+          - NSFW: ref/checks/nsfw.md
          - Off Topic Prompts: ref/checks/off_topic_prompts.md
+          - Prompt Injection Detection: ref/checks/prompt_injection_detection.md
           - URL Filter: ref/checks/urls.md
       - Evaluation Tool: evals.md
   - API Reference: