34 changes: 21 additions & 13 deletions docs/docs/examples/cookbooks/cleanlab_tlm_rag.ipynb
@@ -1,5 +1,12 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/cookbooks/cleanlab_tlm_rag.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -8,9 +15,11 @@
"# Trustworthy RAG with the Trustworthy Language Model\n",
"\n",
"This tutorial demonstrates how to use Cleanlab's [Trustworthy Language Model](https://cleanlab.ai/blog/trustworthy-language-model/) (TLM) in any RAG system, to score the trustworthiness of answers and improve overall reliability of the RAG system.\n",
"We recommend first completing the [TLM example tutorial](https://docs.llamaindex.ai/en/stable/examples/llm/cleanlab/).\n",
"We recommend first completing the [TLM example tutorial](https://docs.llamaindex.ai/en/stable/examples/llm/cleanlab/). <br />\n",
"If you're interested in using Cleanlab as a real-time Evaluator (which can also work as a Guardrail), check out [this tutorial](https://docs.llamaindex.ai/en/stable/examples/evaluation/Cleanlab/).\n",
"\n",
"**Retrieval-Augmented Generation (RAG)** has become popular for building LLM-based Question-Answer systems in domains where LLMs alone suffer from: hallucination, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses, because they depend on LLMs that are fundamentally unreliable. Cleanlab's Trustworthy Language Model (TLM) offers a solution by providing trustworthiness scores to assess and improve response quality, **independent of your RAG architecture or retrieval and indexing processes**. \n",
"\n",
"**Retrieval-Augmented Generation (RAG)** has become popular for building LLM-based Question-Answer systems in domains where LLMs alone suffer from: hallucination, knowledge gaps, and factual inaccuracies. However, RAG systems often still produce unreliable responses, because they depend on LLMs that are fundamentally unreliable. Cleanlab’s Trustworthy Language Model scores the trustworthiness of every LLM response in real-time, using state-of-the-art uncertainty estimates for LLMs, **independent of your RAG architecture or retrieval and indexing processes**. \n",
"\n",
"To diagnose when RAG answers cannot be trusted, simply swap your existing LLM that is generating answers based on the retrieved context with TLM. This notebook showcases this for a standard RAG system, based off a tutorial in the popular [LlamaIndex](https://docs.llamaindex.ai/) framework. Here we merely replace the LLM used in the LlamaIndex tutorial with TLM, and showcase some of the benefits. TLM can be similarly inserted into *any* other RAG framework.\n",
"\n",
@@ -51,9 +60,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We then initialize Cleanlab's TLM. Here we initialize a CleanlabTLM object with default settings. \n",
"\n",
"You can get your Cleanlab API key here: https://app.cleanlab.ai/account after creating an account. For detailed instructions, refer to [this guide](https://help.cleanlab.ai/guide/quickstart/api/#api-key)."
"We then initialize Cleanlab's TLM. Here we initialize a CleanlabTLM object with default settings. "
]
},
{
@@ -65,6 +72,7 @@
"from llama_index.llms.cleanlab import CleanlabTLM\n",
"\n",
"# set api key in env or in llm\n",
"# get free API key from: https://cleanlab.ai/\n",
"# import os\n",
"# os.environ[\"CLEANLAB_API_KEY\"] = \"your api key\"\n",
"\n",
@@ -77,14 +85,14 @@
"source": [
"Note: If you encounter `ValidationError` during the above import, please upgrade your python version to >= 3.11\n",
"\n",
"You can achieve better results by playing with the TLM configurations outlined in this [advanced TLM tutorial](https://help.cleanlab.ai/tutorials/tlm_advanced/).\n",
"You can achieve better results by playing with the TLM configurations outlined in this [advanced TLM tutorial](https://help.cleanlab.ai/tlm/tutorials/tlm_advanced/).\n",
"\n",
"For example, if your application requires OpenAI's GPT-4 model and restrict the output tokens to 256, you can configure it using the `options` argument:\n",
"\n",
"```python\n",
"options = {\n",
" \"model\": \"gpt-4\",\n",
" \"max_tokens\": 128,\n",
" \"max_tokens\": 256,\n",
"}\n",
"llm = CleanlabTLM(api_key=\"your_api_key\", options=options)\n",
"```\n",
@@ -200,9 +208,8 @@
"documents = SimpleDirectoryReader(\"data\").load_data()\n",
"# Optional step since we're loading just one data file\n",
"for doc in documents:\n",
" doc.excluded_llm_metadata_keys.append(\n",
" \"file_path\"\n",
" ) # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file\n",
" # file_path wouldn't be a useful metadata to add to LLM's context since our datasource contains just 1 file\n",
" doc.excluded_llm_metadata_keys.append(\"file_path\")\n",
"index = VectorStoreIndex.from_documents(documents)"
]
},
@@ -231,7 +238,7 @@
"In addition, you can just use TLM's trustworthiness score in an existing custom-built RAG pipeline (using any other LLM generator, streaming or not). <br>\n",
"To achieve this, you'd need to fetch the prompt sent to LLM (including system instructions, retrieved context, user query, etc.) and the returned response. TLM requires both to predict trustworthiness.\n",
"\n",
"Detailed information about this approach, along with example code, is available [here](https://help.cleanlab.ai/tlm/use-cases/tlm_rag/#alternate-low-latencystreaming-approach-use-tlm-to-assess-responses-from-an-existing-rag-system)."
"Detailed information about this approach, along with example code, is available [here](https://help.cleanlab.ai/tlm/tutorials/tlm/)."
]
},
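A rough sketch of that alternate approach, assuming the standalone `cleanlab-tlm` package and its `TLM.get_trustworthiness_score()` method (the prompt and response strings here are placeholders):

```python
from cleanlab_tlm import TLM  # assumed: pip install cleanlab-tlm

tlm = TLM(api_key="your_api_key")

# Everything your RAG system sent to its own LLM:
# system instructions + retrieved context + user query
full_prompt = "Answer based on the context.\nContext: ...\nQuestion: ..."
response_text = "..."  # the answer your existing LLM returned

score = tlm.get_trustworthiness_score(full_prompt, response_text)
print(score["trustworthiness_score"])
```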
{
@@ -240,7 +247,7 @@
"source": [
"### Extract Trustworthiness Score from LLM response\n",
"\n",
"As we see above, Cleanlab's TLM also provides the `trustworthiness_score` in addition to the text, in its response to the prompt. \n",
"As we saw earlier, Cleanlab's TLM also provides the `trustworthiness_score` in addition to the text, in its response to the prompt. \n",
"\n",
"To get this score out when TLM is used in a RAG pipeline, Llamaindex provides an [instrumentation](https://docs.llamaindex.ai/en/stable/module_guides/observability/instrumentation/#instrumentation) tool that allows us to observe the events running behind the scenes in RAG. <br> \n",
"We can utilise this tooling to extract `trustworthiness_score` from LLM's response.\n",
@@ -674,7 +681,8 @@
"\n",
"With TLM, you can easily increase trust in any RAG system! \n",
"\n",
"Feel free to check [TLM's performance benchmarks](https://cleanlab.ai/blog/trustworthy-language-model/) for more details."
"Feel free to check [TLM's performance benchmarks](https://cleanlab.ai/blog/trustworthy-language-model/) for more details. <br />\n",
"If you're interested in using Cleanlab as a real-time Evaluator (which can also work as a Guardrail), check out [this tutorial](https://docs.llamaindex.ai/en/stable/examples/evaluation/Cleanlab/)."
]
}
],
62 changes: 35 additions & 27 deletions docs/docs/examples/evaluation/Cleanlab.ipynb
@@ -1,5 +1,12 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/Cleanlab.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -417,7 +424,7 @@
" prompt=full_prompt,\n",
" )\n",
" # Evaluate the response using TrustworthyRAG\n",
" print(\"Evaluation results:\")\n",
" print(\"### Evaluation results:\")\n",
" for metric, value in eval_result.items():\n",
" print(f\"{metric}: {value['score']}\")\n",
"\n",
@@ -427,9 +434,9 @@
" response = query_engine.query(query)\n",
"\n",
" print(\n",
" f\"Query:\\n{query}\\n\\nTrimmed Context:\\n{get_retrieved_context(response)[:300]}...\"\n",
" f\"### Query:\\n{query}\\n\\n### Trimmed Context:\\n{get_retrieved_context(response)[:300]}...\"\n",
" )\n",
" print(f\"\\nGenerated response:\\n{response.response}\\n\")\n",
" print(f\"\\n### Generated response:\\n{response.response}\\n\")\n",
"\n",
" get_eval(query, response, event_handler, evaluator)"
]
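For context on this helper: `evaluator` is a `TrustworthyRAG` instance and `eval_result` comes from its `score()` call. A hypothetical reconstruction consistent with the visible lines, not necessarily the notebook's exact code (it assumes the `cleanlab-tlm` package plus the notebook's `get_retrieved_context` helper and `event_handler.PROMPT_TEMPLATE`, both shown further below):

```python
from cleanlab_tlm import TrustworthyRAG  # assumed: pip install cleanlab-tlm

# Default evals include trustworthiness, context_sufficiency,
# response_groundedness, and more
evaluator = TrustworthyRAG(api_key="your_api_key")


def get_eval(query, response, event_handler, evaluator):
    # Recover the retrieved context and the exact prompt sent to the LLM
    context = get_retrieved_context(response)
    full_prompt = event_handler.PROMPT_TEMPLATE.format(
        context_str=context, query_str=query
    )
    # Evaluate the response using TrustworthyRAG
    eval_result = evaluator.score(
        response=response.response,
        query=query,
        context=context,
        prompt=full_prompt,
    )
    print("### Evaluation results:")
    for metric, value in eval_result.items():
        print(f"{metric}: {value['score']}")
```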
@@ -443,7 +450,7 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Evaluation results:\n",
"### Evaluation results:\n",
"trustworthiness: 1.0\n",
"context_sufficiency: 0.9975124377856721\n",
"response_groundedness: 0.9975124378045552\n",
@@ -479,20 +486,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"### Query:\n",
"How does the report explain why NVIDIA's Gaming revenue decreased year over year?\n",
"\n",
"Trimmed Context:\n",
"### Trimmed Context:\n",
"# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
"\n",
"NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
"\n",
"- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
"\n",
"Generated response:\n",
"### Generated response:\n",
"The report indicates that NVIDIA's Gaming revenue decreased year over year by 38%, which is attributed to a combination of factors, although specific reasons are not detailed. The context highlights that the revenue for the first quarter was $2.24 billion, down from the previous year, while it did show an increase of 22% from the previous quarter. This suggests that while there may have been a seasonal or cyclical recovery, the overall year-over-year decline reflects challenges in the gaming segment during that period.\n",
"\n",
"Evaluation results:\n",
"### Evaluation results:\n",
"trustworthiness: 0.8018049078305449\n",
"context_sufficiency: 0.26134514055082803\n",
"response_groundedness: 0.8147481620994604\n",
@@ -530,20 +537,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"### Query:\n",
"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\n",
"\n",
"Trimmed Context:\n",
"### Trimmed Context:\n",
"# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
"\n",
"NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
"\n",
"- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
"\n",
"Generated response:\n",
"### Generated response:\n",
"NVIDIA's revenue decreased by $1.10 billion this quarter compared to the last quarter.\n",
"\n",
"Evaluation results:\n",
"### Evaluation results:\n",
"trustworthiness: 0.572441384819641\n",
"context_sufficiency: 0.9974990573223977\n",
"response_groundedness: 0.006136548076912901\n",
@@ -578,17 +585,17 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"### Query:\n",
"If NVIDIA's Data Center segment maintains its Q1 FY2024 quarter-over-quarter growth rate for the next four quarters, what would be its projected annual revenue?\n",
"\n",
"Trimmed Context:\n",
"### Trimmed Context:\n",
"# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
"\n",
"NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
"\n",
"- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
"\n",
"Generated response:\n",
"### Generated response:\n",
"If NVIDIA's Data Center segment maintains its quarter-over-quarter growth rate of 18% from Q1 FY2024 for the next four quarters, the projected revenue for the next four quarters can be calculated as follows:\n",
"\n",
"1. Q1 FY2024 revenue: $4.28 billion\n",
@@ -603,7 +610,7 @@
"\n",
"Therefore, the projected annual revenue for the Data Center segment would be approximately $30.57 billion.\n",
"\n",
"Evaluation results:\n",
"### Evaluation results:\n",
"trustworthiness: 0.23124932848015411\n",
"context_sufficiency: 0.9299227307108295\n",
"response_groundedness: 0.31247206392894905\n",
@@ -664,20 +671,20 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"### Query:\n",
"What significant transitions did Jensen comment on?\n",
"\n",
"Trimmed Context:\n",
"### Trimmed Context:\n",
"# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
"\n",
"NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
"\n",
"- **Quarterly revenue** of $7.19 billion, up 19% from the pre...\n",
"\n",
"Generated response:\n",
"### Generated response:\n",
"Jensen Huang commented on the significant transitions the computer industry is undergoing, particularly in the areas of accelerated computing and generative AI.\n",
"\n",
"Evaluation results:\n",
"### Evaluation results:\n",
"trustworthiness: 0.9810004109697261\n",
"context_sufficiency: 0.9902170786836257\n",
"response_groundedness: 0.9975123614036665\n",
@@ -701,6 +708,7 @@
"### Replace your LLM with Cleanlab's\n",
"\n",
"Beyond evaluating responses already generated from your LLM, Cleanlab can also generate responses and evaluate them simultaneously (using one of many [supported models](https://help.cleanlab.ai/tlm/api/python/tlm/#class-tlmoptions)). <br />\n",
"You can do this by calling `trustworthy_rag.generate(query=query, context=context, prompt=full_prompt)` <br />\n",
"This replaces your own LLM within your RAG system and can be more convenient/accurate/faster.\n",
"\n",
"Let's replace our OpenAI LLM to call Cleanlab's endpoint instead:"
@@ -715,22 +723,22 @@
"name": "stdout",
"output_type": "stream",
"text": [
"Query:\n",
"### Query:\n",
"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\n",
"\n",
"Trimmed Context:\n",
"### Trimmed Context:\n",
"# NVIDIA Announces Financial Results for First Quarter Fiscal 2024\n",
"\n",
"NVIDIA (NASDAQ: NVDA) today reported revenue for the first quarter ended April 30, 2023, of $7.19 billion, down 13% from a year ago and up 19% from the previous quarter.\n",
"\n",
"- **Quarterly revenue** of $7.19 billion, up 19% from the pre\n",
"\n",
"Generated Response:\n",
"### Generated Response:\n",
"NVIDIA's revenue for the first quarter of fiscal 2024 was $7.19 billion, and for the previous quarter (Q4 FY23), it was $6.05 billion. Therefore, the revenue increased by $1.14 billion from the previous quarter, not decreased. \n",
"\n",
"So, the revenue did not decrease this quarter vs last quarter; it actually increased by $1.14 billion.\n",
"\n",
"Evaluation Scores:\n",
"### Evaluation Scores:\n",
"trustworthiness: 0.6810414232214796\n",
"context_sufficiency: 0.9974887437375295\n",
"response_groundedness: 0.9975116791816968\n",
Expand All @@ -743,16 +751,16 @@
"query = \"How much did Nvidia's revenue decrease this quarter vs last quarter, in dollars?\"\n",
"relevant_chunks = query_engine.retrieve(query)\n",
"context = get_retrieved_context(relevant_chunks)\n",
"print(f\"Query:\\n{query}\\n\\nTrimmed Context:\\n{context[:300]}\")\n",
"print(f\"### Query:\\n{query}\\n\\n### Trimmed Context:\\n{context[:300]}\")\n",
"\n",
"pt = event_handler.PROMPT_TEMPLATE\n",
"full_prompt = pt.format(context_str=context, query_str=query)\n",
"\n",
"result = trustworthy_rag.generate(\n",
" query=query, context=context, prompt=full_prompt\n",
")\n",
"print(f\"\\nGenerated Response:\\n{result['response']}\\n\")\n",
"print(\"Evaluation Scores:\")\n",
"print(f\"\\n### Generated Response:\\n{result['response']}\\n\")\n",
"print(\"### Evaluation Scores:\")\n",
"for metric, value in result.items():\n",
" if metric != \"response\":\n",
" print(f\"{metric}: {value['score']}\")"