diff --git a/README.md b/README.md
index ea4090d1..f9c49fa6 100644
--- a/README.md
+++ b/README.md
@@ -152,6 +152,16 @@ output_df.head()
 4. [Confidence estimation](https://docs.refuel.ai/guide/accuracy/confidence/) and explanations out of the box for every single output label
 5. [Caching and state management](https://docs.refuel.ai/guide/reliability/state-management/) to minimize costs and experimentation time
 
+## Access to Refuel hosted LLMs
+
+Refuel provides access to hosted open source LLMs for labeling and for estimating confidence. This is helpful because you can calibrate a confidence threshold for your labeling task and route less confident labels to humans, while still getting the benefits of auto-labeling for the confident examples.
+
+To use Refuel hosted LLMs, you can [request access here](https://refuel-ai.typeform.com/llm-access).
+
+## Benchmark
+
+Check out our [technical report](https://www.refuel.ai/blog-posts/llm-labeling-technical-report) to learn more about how various LLMs and human annotators perform on label quality, turnaround time and cost.
+
 ## 🛠️ Roadmap
 
 Our goal is to allow users to label, create or enrich any dataset, with any LLM - easily and quickly.
diff --git a/docs/guide/llms/benchmarks.md b/docs/guide/llms/benchmarks.md
index e69de29b..0061084e 100644
--- a/docs/guide/llms/benchmarks.md
+++ b/docs/guide/llms/benchmarks.md
@@ -0,0 +1,10 @@
+
+## Benchmarking LLMs for data labeling
+
+
+Key takeaways from our [technical report](https://www.refuel.ai/blog-posts/llm-labeling-technical-report):
+
+* State-of-the-art LLMs can label text datasets at the same or better quality compared to skilled human annotators, **but ~20x faster and ~7x cheaper**.
+* For achieving the highest quality labels, GPT-4 is the best choice among out-of-the-box LLMs (88.4% agreement with ground truth, compared to 86% for skilled human annotators).
+* For achieving the best tradeoff between label quality and cost, GPT-3.5-turbo, PaLM-2 and open source models like FLAN-T5-XXL are compelling.
+* Confidence-based thresholding can be a very effective way to mitigate the impact of hallucinations and ensure high label quality.
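
The confidence-based routing described in this change (auto-accept confident labels, send the rest to humans) might look roughly like the sketch below. It is plain Python and does not use the library's API: `LabeledExample`, `route_by_confidence`, the sample data and the `0.8` threshold are all hypothetical names and values chosen for illustration, and confidence scores are assumed to lie in `[0, 1]`.

```python
from typing import List, NamedTuple, Tuple


class LabeledExample(NamedTuple):
    """Hypothetical container for one LLM-labeled row."""
    text: str
    label: str
    confidence: float  # assumed to be a score in [0, 1]


def route_by_confidence(
    examples: List[LabeledExample], threshold: float
) -> Tuple[List[LabeledExample], List[LabeledExample]]:
    """Split examples into auto-accepted labels and labels needing human review."""
    auto_accepted = [ex for ex in examples if ex.confidence >= threshold]
    needs_review = [ex for ex in examples if ex.confidence < threshold]
    return auto_accepted, needs_review


# Made-up sample data, for illustration only.
examples = [
    LabeledExample("great product", "positive", 0.97),
    LabeledExample("it was fine I guess", "neutral", 0.55),
    LabeledExample("never buying again", "negative", 0.91),
]

# The threshold is an assumption; in practice it would be calibrated per task.
auto_accepted, needs_review = route_by_confidence(examples, threshold=0.8)
print(len(auto_accepted), "auto-labeled;", len(needs_review), "sent to human review")
```

Raising the threshold sends more rows to human review in exchange for fewer low-confidence labels being accepted automatically, so the value is best chosen against a small labeled validation set.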