Commit
* fix evaluations
* Configure Azure Developer Pipeline
* remove env var from chat.prompty
* Configure Azure Developer Pipeline
1 parent b0863c5 · commit a95d01c
Showing 13 changed files with 397 additions and 344 deletions.
@@ -1,3 +1,3 @@
.git*
.venv/
**/*.pyc
**/*.pyc
@@ -0,0 +1,178 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import json\n",
    "import prompty\n",
    "from evaluators.custom_evals.coherence import coherence_evaluation\n",
    "from evaluators.custom_evals.relevance import relevance_evaluation\n",
    "from evaluators.custom_evals.fluency import fluency_evaluation\n",
    "from evaluators.custom_evals.groundedness import groundedness_evaluation\n",
    "import jsonlines\n",
    "import pandas as pd\n",
    "from contoso_chat.chat_request import get_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get output from data and save to results jsonl file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_data():\n",
    "    data_path = \"./evaluators/data.jsonl\"\n",
    "\n",
    "    df = pd.read_json(data_path, lines=True)\n",
    "    df.head()\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "def create_response_data(df):\n",
    "    results = []\n",
    "\n",
    "    for index, row in df.iterrows():\n",
    "        customerId = row['customerId']\n",
    "        question = row['question']\n",
    "\n",
    "        # Run contoso-chat/chat_request flow to get response\n",
    "        response = get_response(customerId=customerId, question=question, chat_history=[])\n",
    "        print(response)\n",
    "\n",
    "        # Add results to list\n",
    "        result = {\n",
    "            'question': question,\n",
    "            'context': response[\"context\"],\n",
    "            'answer': response[\"answer\"]\n",
    "        }\n",
    "        results.append(result)\n",
    "\n",
    "    # Save results to a JSONL file\n",
    "    with open('result.jsonl', 'w') as file:\n",
    "        for result in results:\n",
    "            file.write(json.dumps(result) + '\\n')\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def evaluate():\n",
    "    # Evaluate results from results file\n",
    "    results_path = 'result.jsonl'\n",
    "    results = []\n",
    "    with open(results_path, 'r') as file:\n",
    "        for line in file:\n",
    "            print(line)\n",
    "            results.append(json.loads(line))\n",
    "\n",
    "    for result in results:\n",
    "        question = result['question']\n",
    "        context = result['context']\n",
    "        answer = result['answer']\n",
    "\n",
    "        groundedness_score = groundedness_evaluation(question=question, answer=answer, context=context)\n",
    "        fluency_score = fluency_evaluation(question=question, answer=answer, context=context)\n",
    "        coherence_score = coherence_evaluation(question=question, answer=answer, context=context)\n",
    "        relevance_score = relevance_evaluation(question=question, answer=answer, context=context)\n",
    "\n",
    "        result['groundedness'] = groundedness_score\n",
    "        result['fluency'] = fluency_score\n",
    "        result['coherence'] = coherence_score\n",
    "        result['relevance'] = relevance_score\n",
    "\n",
    "    # Save results to a JSONL file\n",
    "    with open('result_evaluated.jsonl', 'w') as file:\n",
    "        for result in results:\n",
    "            file.write(json.dumps(result) + '\\n')\n",
    "\n",
    "    with jsonlines.open('eval_results.jsonl', 'w') as writer:\n",
    "        writer.write(results)\n",
    "    # Print results\n",
    "\n",
    "    df = pd.read_json('result_evaluated.jsonl', lines=True)\n",
    "    df.head()\n",
    "\n",
    "    return df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def create_summary(df):\n",
    "    print(\"Evaluation summary:\\n\")\n",
    "    print(df)\n",
    "    # drop question, context and answer\n",
    "    mean_df = df.drop([\"question\", \"context\", \"answer\"], axis=1).mean()\n",
    "    print(\"\\nAverage scores:\")\n",
    "    print(mean_df)\n",
    "    df.to_markdown('eval_results.md')\n",
    "    with open('eval_results.md', 'a') as file:\n",
    "        file.write(\"\\n\\nAverage scores:\\n\\n\")\n",
    "    mean_df.to_markdown('eval_results.md', 'a')\n",
    "\n",
    "    print(\"Results saved to result_evaluated.jsonl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# create main function for python script\n",
    "if __name__ == \"__main__\":\n",
    "\n",
    "    test_data_df = load_data()\n",
    "    response_results = create_response_data(test_data_df)\n",
    "    result_evaluated = evaluate()\n",
    "    create_summary(result_evaluated)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pf-prompty",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
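For reference, the evaluation run above reads its test questions from ./evaluators/data.jsonl and expects one JSON object per line with at least a customerId and a question field, the only columns create_response_data consumes. Below is a minimal sketch of such a file; the sample values are purely hypothetical, and the customer IDs would need to correspond to records that get_response can look up, which is backing data not included in this diff.

# Hypothetical sketch of the input file expected at ./evaluators/data.jsonl.
# Each line is one JSON object with the two fields the notebook reads.
import json

sample_rows = [
    {"customerId": "1", "question": "What tent would you recommend for cold weather?"},
    {"customerId": "2", "question": "Do you sell hiking boots in wide sizes?"},
]

with open("./evaluators/data.jsonl", "w") as f:
    for row in sample_rows:
        f.write(json.dumps(row) + "\n")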
@@ -0,0 +1,115 @@
# %%
import os
import json
import prompty
from evaluators.custom_evals.coherence import coherence_evaluation
from evaluators.custom_evals.relevance import relevance_evaluation
from evaluators.custom_evals.fluency import fluency_evaluation
from evaluators.custom_evals.groundedness import groundedness_evaluation
import jsonlines
import pandas as pd
from contoso_chat.chat_request import get_response


# %% [markdown]
# ## Get output from data and save to results jsonl file

# %%
def load_data():
    data_path = "./evaluators/data.jsonl"

    df = pd.read_json(data_path, lines=True)
    df.head()
    return df


# %%

def create_response_data(df):
    results = []

    for index, row in df.iterrows():
        customerId = row['customerId']
        question = row['question']

        # Run contoso-chat/chat_request flow to get response
        response = get_response(customerId=customerId, question=question, chat_history=[])
        print(response)

        # Add results to list
        result = {
            'question': question,
            'context': response["context"],
            'answer': response["answer"]
        }
        results.append(result)

    # Save results to a JSONL file
    with open('result.jsonl', 'w') as file:
        for result in results:
            file.write(json.dumps(result) + '\n')
    return results


# %%
def evaluate():
    # Evaluate results from results file
    results_path = 'result.jsonl'
    results = []
    with open(results_path, 'r') as file:
        for line in file:
            print(line)
            results.append(json.loads(line))

    for result in results:
        question = result['question']
        context = result['context']
        answer = result['answer']

        groundedness_score = groundedness_evaluation(question=question, answer=answer, context=context)
        fluency_score = fluency_evaluation(question=question, answer=answer, context=context)
        coherence_score = coherence_evaluation(question=question, answer=answer, context=context)
        relevance_score = relevance_evaluation(question=question, answer=answer, context=context)

        result['groundedness'] = groundedness_score
        result['fluency'] = fluency_score
        result['coherence'] = coherence_score
        result['relevance'] = relevance_score

    # Save results to a JSONL file
    with open('result_evaluated.jsonl', 'w') as file:
        for result in results:
            file.write(json.dumps(result) + '\n')

    with jsonlines.open('eval_results.jsonl', 'w') as writer:
        writer.write(results)
    # Print results

    df = pd.read_json('result_evaluated.jsonl', lines=True)
    df.head()

    return df


# %%
def create_summary(df):
    print("Evaluation summary:\n")
    print(df)
    # drop question, context and answer
    mean_df = df.drop(["question", "context", "answer"], axis=1).mean()
    print("\nAverage scores:")
    print(mean_df)
    df.to_markdown('eval_results.md')
    with open('eval_results.md', 'a') as file:
        file.write("\n\nAverage scores:\n\n")
    mean_df.to_markdown('eval_results.md', 'a')

    print("Results saved to result_evaluated.jsonl")


# %%
# create main function for python script
if __name__ == "__main__":

    test_data_df = load_data()
    response_results = create_response_data(test_data_df)
    result_evaluated = evaluate()
    create_summary(result_evaluated)
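One detail worth noting in evaluate(): calling jsonlines.Writer.write(results) with a list serializes the whole list as a single JSON array on one line, whereas write_all emits one JSON object per line, which is what the .jsonl extension usually implies. The snippet below is not part of this commit; it is just a small illustration of the jsonlines library behavior, using hypothetical file names.

import jsonlines

records = [{"id": 1}, {"id": 2}]

# write() on a list produces a single line containing a JSON array:
# [{"id": 1}, {"id": 2}]
with jsonlines.open("as_one_array.jsonl", "w") as writer:
    writer.write(records)

# write_all() produces one JSON object per line, the conventional JSON Lines layout:
# {"id": 1}
# {"id": 2}
with jsonlines.open("one_per_line.jsonl", "w") as writer:
    writer.write_all(records)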