From 5d440e766097b3b873634215d440e5f94fbb6466 Mon Sep 17 00:00:00 2001
From: Nihit <nihit@cs.stanford.edu>
Date: Thu, 15 Jun 2023 06:58:54 -0700
Subject: [PATCH] typeform link in readme

---
 README.md                     | 10 ++++++++++
 docs/guide/llms/benchmarks.md | 10 ++++++++++
 2 files changed, 20 insertions(+)

diff --git a/README.md b/README.md
index ea4090d1..f9c49fa6 100644
--- a/README.md
+++ b/README.md
@@ -152,6 +152,16 @@ output_df.head()
 4. [Confidence estimation](https://docs.refuel.ai/guide/accuracy/confidence/) and explanations out of the box for every single output label
 5. [Caching and state management](https://docs.refuel.ai/guide/reliability/state-management/) to minimize costs and experimentation time
 
+## Access to Refuel hosted LLMs
+
+Refuel provides access to hosted open source LLMs for labeling and for estimating confidence. This is helpful because you can calibrate a confidence threshold for your labeling task and route less confident labels to humans, while still getting the benefits of auto-labeling for the confident examples.
+
+To use Refuel hosted LLMs, [request access here](https://refuel-ai.typeform.com/llm-access).
+
+## Benchmark
+
+Check out our [technical report](https://www.refuel.ai/blog-posts/llm-labeling-technical-report) to learn more about how various LLMs and human annotators compare on label quality, turnaround time and cost.
+
 ## 🛠️ Roadmap
 Our goal is to allow users to label, create or enrich any dataset, with any LLM - easily and quickly.
 
diff --git a/docs/guide/llms/benchmarks.md b/docs/guide/llms/benchmarks.md
index e69de29b..0061084e 100644
--- a/docs/guide/llms/benchmarks.md
+++ b/docs/guide/llms/benchmarks.md
@@ -0,0 +1,10 @@
+
+## Benchmarking LLMs for data labeling
+
+
+Key takeaways from our [technical report](https://www.refuel.ai/blog-posts/llm-labeling-technical-report):
+
+* State-of-the-art LLMs can label text datasets at the same or better quality than skilled human annotators, **but ~20x faster and ~7x cheaper**.
+* For achieving the highest quality labels, GPT-4 is the best choice among out-of-the-box LLMs (88.4% agreement with ground truth, compared to 86% for skilled human annotators).
+* For achieving the best tradeoff between label quality and cost, GPT-3.5-turbo, PaLM-2 and open source models like FLAN-T5-XXL are compelling.
+* Confidence-based thresholding can be a very effective way to mitigate the impact of hallucinations and ensure high label quality.