[15/N][RAG Demo fixes] [dev] benchmarking {gpt3.5,gpt4} X {generate_V1, generate_V2} (#1206)


5 trials each. With GPT4, generate prompt v2 does slightly better on the
benchmark.
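The 2×2 sweep described above (two models × two generate prompts, 5 trials each) could be driven by a small harness along these lines. This is a hypothetical sketch, not part of the commit: the model identifiers, the `run_grid` helper, and the `run_prompt` callback (a caller-supplied function that executes one aiconfig prompt and grades the output) are all assumptions.

```python
from itertools import product

MODELS = ["gpt-3.5-turbo", "gpt-4"]       # assumed model identifiers
PROMPTS = ["generate_v1", "generate_v2"]  # prompt names from the aiconfig
TRIALS = 5

def run_grid(run_prompt):
    """Average benchmark score per (model, prompt) cell.

    run_prompt(model, prompt, trial) -> float score; supplied by the
    caller, e.g. a wrapper that runs the aiconfig prompt and grades it.
    """
    results = {}
    for model, prompt in product(MODELS, PROMPTS):
        scores = [run_prompt(model, prompt, t) for t in range(TRIALS)]
        results[(model, prompt)] = sum(scores) / TRIALS
    return results
```

Averaging over the 5 trials per cell smooths out run-to-run variance before comparing prompt versions.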

GPT3.5, V1
<img width="873" alt="v1-gpt3 5"
src="https://github.com/lastmile-ai/aiconfig/assets/148090348/22b40ef8-b015-4185-8fcd-00dee4e295f3">


GPT3.5, V2
<img width="862" alt="v2-gpt-3 5"
src="https://github.com/lastmile-ai/aiconfig/assets/148090348/73dca37d-c696-4d31-b604-febcfc89fbce">


GPT4, V1
<img width="879" alt="v1-gpt4"
src="https://github.com/lastmile-ai/aiconfig/assets/148090348/8b5b7f0c-cefb-4332-90fe-d26a2a0bbf2d">


GPT4, V2

<img width="870" alt="v2-gpt4"
src="https://github.com/lastmile-ai/aiconfig/assets/148090348/3a7f1825-a5ce-4c11-9c7e-04135b22b94a">

---
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/lastmile-ai/aiconfig/pull/1206).
* __->__ #1206
* #1204
* #1203
* #1202
* #1200
saqadri committed Feb 9, 2024
2 parents f8d802d + da52f4a commit e6193b3
Showing 2 changed files with 1,883 additions and 125 deletions.
73 changes: 67 additions & 6 deletions cookbooks/RAG-with-Model-Graded-Eval-v2/rag.aiconfig.yaml
@@ -76,16 +76,61 @@ metadata:
remember_chat_context: false
name: Rag Demo With Model-graded Eval
prompts:
- input: "Answer the following question using the provided context. Question: {{query}}\
\ \nContext: {{context}}"
- input: "Answer the following question using the provided context. \n\nQuestion:\
\ {{query}} \nContext: {{context}}"
metadata:
model: gpt-3.5-turbo
model:
name: gpt-4
settings: {}
parameters: {}
remember_chat_context: false
name: generate_v1
outputs:
- data: The price of flour sold in Boston was $11.87 a barrel in August.
execution_count: 0
metadata:
created: 1707439800
id: chatcmpl-8q9NY0YI6dDvzeFEebWIQZaOyLPwQ
model: gpt-3.5-turbo-16k-0613
object: chat.completion.chunk
raw_response:
content: The price of flour sold in Boston was $11.87 a barrel in August.
role: assistant
role: assistant
output_type: execute_result
- input: "Answer the following question using the provided context. \n\nQuestion:\
\ {{query}} \nContext: {{context}}\n\nReview your answer and the context. Is the\
\ answer strictly justified by the context? If necessary, revise your answer.\n\
\nReview your best answer so far. Is it succinct, or does it contain extra information?\
\ If necessary, revise your answer.\n\nReview your best answer so far again. Does\
\ it actually answer the question? If necessary, revise your answer again.\n\n\
Review your best answer so far one last time. Is it easy to understand? If necessary,
\ revise your answer a final time.\n\n\nOutput ONLY YOUR BEST ANSWER.\n\nBEST
\ ANSWER:"
metadata:
model:
name: gpt-3.5-turbo-16k-0613
settings: {}
parameters: {}
name: generate_v2
outputs: []
- input: "Answer the following question using the provided context. \n\nQuestion:\
\ {{query}} \nContext: {{context}}\n\nReview your answer and the context. Is the\
\ answer strictly justified by the context? If necessary, revise your answer.\n\
\nReview your best answer so far. Is it succinct, or does it contain extra information?\
\ If necessary, revise your answer.\n\nReview your best answer so far again. Does\
\ it actually answer the question? If necessary, revise your answer again.\n\n\
Review your best answer so far one last time. Is it easy to understand? If necessary,
\ revise your answer a final time.\n\n\nOutput ONLY YOUR BEST ANSWER.\n\nBEST
\ ANSWER:"
metadata:
model: gpt-4
parameters: {}
name: generate
outputs: []
- input: "Given the following question, and answer, does the answer satisfactorily\
\ answer the question? \n\nQuestion: {{query}}\nAnswer: {{generate.output}}"
\ answer the question? \n\nGive a relevance verdict (YES or NO) with an explanation.\n\
\nQuestion: {{query}}\nAnswer: {{generate.output}}\n\nVerdict:\nExplanation:"
metadata:
model: gpt-4
parameters: {}
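The prompts in this diff use `{{query}}`, `{{context}}`, and `{{generate.output}}` placeholders that aiconfig resolves at run time. As a minimal illustration of that substitution (the `render_prompt` helper here is hypothetical, not the library's API):

```python
def render_prompt(template: str, params: dict) -> str:
    """Naively substitute {{name}} placeholders with parameter values."""
    out = template
    for key, value in params.items():
        out = out.replace("{{" + key + "}}", str(value))
    return out

# Example with the generate prompt's two parameters:
prompt = render_prompt(
    "Answer the following question using the provided context. \n\n"
    "Question: {{query}} \nContext: {{context}}",
    {"query": "What was the price of flour?",
     "context": "(retrieved passages would go here)"},
)
```

In the real cookbook the `context` parameter is filled from retrieval results and `generate.output` chains one prompt's output into the eval prompts.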
@@ -103,15 +148,31 @@ prompts:
name: evaluate_faithfulness
outputs: []
- input: 'Given the following answer, is the answer self-consistent and easy to understand?
Answer: {{generate.output}}'
Answer: {{generate.output}}
Give a relevance verdict (YES or NO) with an explanation.
Verdict:
Explanation:'
metadata:
model: gpt-4
parameters: {}
remember_chat_context: false
name: evaluate_coherence
outputs: []
- input: 'Given the following answer, is the answer self-consistent and easy to understand?
Answer: {{generate.output}}'
Answer: {{generate.output}}
Give a succinctness verdict (YES or NO) with an explanation.
Verdict:
Explanation:'
metadata:
model: gpt-4
parameters: {}
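Each eval prompt above asks the grading model for a YES/NO verdict followed by an explanation. A response in that shape could be parsed with a sketch like the following; the exact response wording is an assumption, and real model outputs may deviate from it:

```python
import re

def parse_verdict(text: str):
    """Extract (verdict, explanation) from a model-graded eval response
    formatted as 'Verdict: YES|NO' then 'Explanation: ...'."""
    m = re.search(r"Verdict:\s*(YES|NO)", text, re.IGNORECASE)
    verdict = m.group(1).upper() if m else None
    e = re.search(r"Explanation:\s*(.*)", text, re.DOTALL)
    explanation = e.group(1).strip() if e else ""
    return verdict, explanation
```

Normalizing to an explicit YES/NO makes the 5-trial benchmark scores above straightforward to aggregate per cell.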
