Adding first version of AgentEval -- a framework for assessing task utility for LLM-powered applications #681

Merged 109 commits on Nov 21, 2023

Commits
8e87fad
add agenteval-notebook for math problems and the blog post about it
julianakiseleva Nov 14, 2023
12c8c9b
update gitignore
julianakiseleva Nov 14, 2023
30e54db
updates to notebook
julianakiseleva Nov 14, 2023
b5451cc
adding folder for the logs
julianakiseleva Nov 14, 2023
a5332ab
adding math problems logs
julianakiseleva Nov 14, 2023
567798f
adding folder for alfworld logs
julianakiseleva Nov 14, 2023
c8c2caa
added limitiation and future work to blog post
julianakiseleva Nov 14, 2023
b1088f2
minor edits blog post
julianakiseleva Nov 14, 2023
1175cb0
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
88002cb
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
3c75e00
adding changes
julianakiseleva Nov 14, 2023
da97789
reorg
Narabzad Nov 14, 2023
be317d3
modify the main notebook
Narabzad Nov 14, 2023
5154214
modification of the main notebook
Narabzad Nov 14, 2023
4d22587
remove wrong notebook
Narabzad Nov 14, 2023
a4fe5b0
uploading new notebook
Narabzad Nov 14, 2023
67caaba
update agenteval notebook
Narabzad Nov 14, 2023
97398dd
change the sample
Narabzad Nov 14, 2023
c13817a
Update agenteval_cq_math.ipynb
Narabzad Nov 14, 2023
6c6b5b1
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
6870056
adding final changes to notebook
julianakiseleva Nov 14, 2023
be8cf9f
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
7829a65
updated framework picture
julianakiseleva Nov 14, 2023
a43a981
Update index.mdx
Narabzad Nov 14, 2023
06c53a2
Update index.md
Narabzad Nov 14, 2023
1dec052
Add files via upload
Narabzad Nov 14, 2023
12baceb
updates to notebool
Nov 15, 2023
dfddf29
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
850b908
revise the blog
Narabzad Nov 15, 2023
8ff719f
revise the blog
Narabzad Nov 15, 2023
71826f1
update the agent img
Narabzad Nov 15, 2023
d3195e5
Merge branch 'agenteval' of https://github.com/julianakiseleva/autoge…
Narabzad Nov 15, 2023
6de3986
revise the blog
Narabzad Nov 15, 2023
a2f156b
revise the blog
Narabzad Nov 15, 2023
3ac137d
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 14, 2023
d1f9a6d
Merge pull request #1 from julianakiseleva/agenteval
julianakiseleva Nov 15, 2023
16f8ae2
Excluded model logs from the main branch, you can find them in agente…
julianakiseleva Nov 14, 2023
cf38ebc
Merge branch 'main' into main
afourney Nov 15, 2023
5cacc90
Fixed pre-commit formatting.
afourney Nov 15, 2023
ebd4918
Update website/blog/2023-11-11-AgentEval/index.mdx
julianakiseleva Nov 16, 2023
af97e51
update gitignore
julianakiseleva Nov 16, 2023
5017042
update index.mdx
julianakiseleva Nov 16, 2023
e21d885
update authors.yml by adding Negar and Julia
julianakiseleva Nov 16, 2023
75bd08a
remove md file
julianakiseleva Nov 16, 2023
884e01c
remove md file
julianakiseleva Nov 16, 2023
4dee12a
update gitignore
julianakiseleva Nov 16, 2023
cced94e
update authors file
julianakiseleva Nov 16, 2023
669f62d
pre-commit checks
julianakiseleva Nov 16, 2023
28f8e54
pre-commit checks on authors.yml
julianakiseleva Nov 16, 2023
8fa57f9
pre-commit checks on authors.yml
julianakiseleva Nov 16, 2023
4d8896a
update index.mdx
julianakiseleva Nov 16, 2023
adea8bd
update authors.yml by adding Negar and Julia
julianakiseleva Nov 16, 2023
3e41984
Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…
julianakiseleva Nov 16, 2023
006b5dd
updated the blog-post version 1
julianakiseleva Nov 16, 2023
2f73ad7
updated the blog-post: TL;DR is ready
julianakiseleva Nov 16, 2023
97ea7c5
updated the blog-post: first part of introduction is ready
julianakiseleva Nov 16, 2023
8a11e8b
updated figures: typos on fig 1, changed terminology on the fig 2
julianakiseleva Nov 16, 2023
84561f8
upadated the Framework part
julianakiseleva Nov 17, 2023
e3661eb
fixed redering issues
julianakiseleva Nov 17, 2023
27b3c5c
upload zip file instead of single samples
Narabzad Nov 17, 2023
758f935
update prealgebra.zip
Narabzad Nov 17, 2023
f1aa64c
update
Narabzad Nov 17, 2023
c02fcbc
upload
Narabzad Nov 17, 2023
0f4685d
update z
Narabzad Nov 17, 2023
ddabd4f
update naming
Narabzad Nov 17, 2023
cfc6eef
update zip
Narabzad Nov 17, 2023
c9eaacb
update the agenteval notebook
Narabzad Nov 17, 2023
d6ac667
update the notebook - removing unmercenary logs
Narabzad Nov 17, 2023
9781d24
updated fig 1 and references to it
julianakiseleva Nov 17, 2023
44c9183
updated fig 1
julianakiseleva Nov 17, 2023
38eb944
incorporated PR comments
julianakiseleva Nov 17, 2023
14f9224
Merge branch 'agenteval' into main
julianakiseleva Nov 17, 2023
d18be08
merged agenteval branch
julianakiseleva Nov 17, 2023
104e702
Merge branch 'agenteval' into main
julianakiseleva Nov 17, 2023
361c5fb
final changes to the blog
julianakiseleva Nov 17, 2023
0109541
updated taxonomy
julianakiseleva Nov 17, 2023
8afdd76
update notebook
Narabzad Nov 17, 2023
367c56e
minor changes to the blog
julianakiseleva Nov 17, 2023
771ed07
Merge pull request #2 from julianakiseleva/main
Narabzad Nov 17, 2023
eb8dd96
Merge branch 'microsoft:main' into agenteval
Narabzad Nov 17, 2023
19f1b56
Fixed formatting
afourney Nov 18, 2023
9a9d4c2
Merge branch 'main' into main
qingyun-wu Nov 18, 2023
91d28e3
Update the link in agenteval_cq_math.ipynb
Narabzad Nov 20, 2023
8f9f2cf
Merge branch 'microsoft:main' into agenteval
Narabzad Nov 20, 2023
6710956
update the blog and link in notebook
Narabzad Nov 20, 2023
a41fdc3
Merge branch 'main' into agenteval
Narabzad Nov 20, 2023
0ca7eb4
Merge pull request #3 from julianakiseleva/agenteval
Narabzad Nov 20, 2023
963f2bb
Update index.mdx
Narabzad Nov 20, 2023
047c7f0
Merge remote-tracking branch 'origin/main'
Narabzad Nov 20, 2023
994ca36
change folder name
Narabzad Nov 20, 2023
221f123
Merge branch 'agenteval'
Narabzad Nov 20, 2023
f400aca
Changes to be committed:
Narabzad Nov 20, 2023
ea1b9ce
Merge branch 'agenteval'
Narabzad Nov 20, 2023
aa2e95a
add sample OAI file
Narabzad Nov 20, 2023
411ebc3
Merge branch 'agenteval'
Narabzad Nov 20, 2023
fc90c56
fix the url link to colab and typos
Narabzad Nov 21, 2023
be5186b
fix the url link to colab and typos
Narabzad Nov 21, 2023
f0ce406
Merge branch 'agenteval'
Narabzad Nov 21, 2023
f0aa317
add authors
Narabzad Nov 21, 2023
d3c3790
Merge branch 'agenteval'
Narabzad Nov 21, 2023
24e0347
Merge branch 'main' into main
Narabzad Nov 21, 2023
2cf8f0a
update profile pic
Narabzad Nov 21, 2023
4014365
Merge branch 'main' of https://github.com/julianakiseleva/autogen
Narabzad Nov 21, 2023
de6f62f
"update authors"
Narabzad Nov 21, 2023
b786f6e
fixing the problem in test_groupchat.py
Narabzad Nov 21, 2023
10b9756
update the title lower case
Narabzad Nov 21, 2023
9581acf
reverting changes in setup.py
Narabzad Nov 21, 2023
b2d98ef
rerun pre-commit
julianakiseleva Nov 21, 2023
af99079
Merge branch 'main' into main
sonichi Nov 21, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -167,3 +167,6 @@ wolfram.txt

# DB on disk for TeachableAgent
tmp/

# Storage for the AgentEval output
test/test_files/agenteval-in-out/out/
789 changes: 789 additions & 0 deletions notebook/agenteval_cq_math.ipynb

Large diffs are not rendered by default.

26 changes: 26 additions & 0 deletions test/test_files/agenteval-in-out/samples/sample_math_criteria.json
@@ -0,0 +1,26 @@
{
"Problem Interpretation": {
"description": "Ability to correctly interpret the problem.",
"accepted_values": ["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
},
"Mathematical Methodology": {
"description": "Adequacy of the chosen mathematical or algorithmic methodology for the question",
"accepted_values": ["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"]
},
"Calculation Correctness": {
"description": "Accuracy of calculations made and solutions given",
"accepted_values": ["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]
},
"Explanation Clarity": {
"description": "Clarity and comprehensibility of explanations, including language use and structure",
"accepted_values": ["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"]
},
"Code Efficiency": {
"description": "Quality of code in terms of efficiency and elegance",
"accepted_values": ["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"]
},
"Code Correctness": {
"description": "Correctness of the provided code",
"accepted_values": ["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"]
}
}
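
Each criterion above pairs a description with a closed list of `accepted_values`, so an assessment can be checked mechanically before it is stored. The following is a minimal sketch under that assumption, not code from this PR: `invalid_labels` is a hypothetical helper, and only two of the six criteria are copied in to keep the example short.

```python
import json

# Two criteria copied from sample_math_criteria.json above (the file has six).
CRITERIA_JSON = """
{
  "Calculation Correctness": {
    "description": "Accuracy of calculations made and solutions given",
    "accepted_values": ["completely incorrect", "mostly incorrect", "neither",
                        "mostly correct", "completely correct"]
  },
  "Explanation Clarity": {
    "description": "Clarity and comprehensibility of explanations, including language use and structure",
    "accepted_values": ["not at all clear", "slightly clear", "moderately clear",
                        "very clear", "completely clear"]
  }
}
"""


def invalid_labels(criteria, assessment):
    """Hypothetical helper: return {criterion: label} for every entry whose
    criterion is unknown or whose label is not in that criterion's accepted_values."""
    problems = {}
    for name, label in assessment.items():
        spec = criteria.get(name)
        if spec is None or label not in spec["accepted_values"]:
            problems[name] = label
    return problems


criteria = json.loads(CRITERIA_JSON)
good = {"Calculation Correctness": "completely correct", "Explanation Clarity": "very clear"}
bad = {"Calculation Correctness": "perfect", "Code Style": "fine"}

print(invalid_labels(criteria, good))  # no problems expected
print(invalid_labels(criteria, bad))   # both entries should be flagged
```

An empty result means every label came from the criterion's own scale; anything else points at a label or criterion name the file does not define.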
@@ -0,0 +1,13 @@
{
"sample_algebra.json": {
"actual_success": "true",
"estimated_performance": {
"Problem Interpretation": "completely accurate",
"Mathematical Methodology": "completely effective",
"Calculation Correctness": "completely correct",
"Explanation Clarity": "very clear",
"Code Efficiency": "very efficient",
"Code Correctness": "completely correct"
}
}
}
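
The `estimated_performance` labels above are categorical, and `actual_success` is stored as the string `"true"` rather than a JSON boolean. One way to make such a record comparable across samples is to map each label to its position in the criterion's `accepted_values` list. This is only a sketch: it assumes the lists are ordered from worst to best (which the sample criteria file appears to follow), and `to_ordinal` is a hypothetical name.

```python
# Scales copied from the criteria file for two of the six criteria.
ACCEPTED = {
    "Explanation Clarity": ["not at all clear", "slightly clear", "moderately clear",
                            "very clear", "completely clear"],
    "Code Efficiency": ["not at all efficient", "slightly efficient", "moderately efficient",
                        "very efficient", "extremely efficient"],
}

# Abridged copy of the record above.
record = {
    "actual_success": "true",
    "estimated_performance": {
        "Explanation Clarity": "very clear",
        "Code Efficiency": "very efficient",
    },
}


def to_ordinal(accepted, estimated):
    """Hypothetical helper: map each label to its 0-based index on its scale.
    Assumes every scale is ordered from worst to best."""
    return {name: accepted[name].index(label) for name, label in estimated.items()}


scores = to_ordinal(ACCEPTED, record["estimated_performance"])
success = record["actual_success"] == "true"  # the file stores a string, not a boolean

print(scores)   # {'Explanation Clarity': 3, 'Code Efficiency': 3}
print(success)  # True
```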
@@ -0,0 +1,32 @@
{
"problem": "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.",
"level": "Level 5",
"type": "Algebra",
"solution": "We have \\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\quad \\Rightarrow\n\\\\ (2x+10-(3x+24))(x+3)&<0 \\quad \\Rightarrow\n\\\\ (-x-14)(x+3)&<0 \\quad \\Rightarrow\n\\\\ (x+14)(x+3)&>0.\n\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative. Both factors are positive for $x>-3$ and both factors are negative for $x<-14$. When $-14<x<-3$, one factor is positive and the other negative, so their product is negative. Therefore, the range of $x$ that satisfies the inequality is $ \\boxed{(-\\infty, -14)\\cup(-3,\\infty)} $.",
"correct_ans": "(-\\infty, -14)\\cup(-3,\\infty)",
"problem_id": "0",
"response_with_ans": "I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\n\n```python\nfrom sympy import symbols, factor\r\nx = symbols('x')\r\nfactorized = factor(x**2 + 17*x + 42)\r\nfactorized\r\n```\n\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.",
"round": 1,
"messages": [
{
"content": "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.",
"role": "user"
},
{
"content": "First, let's simplify and isolate the inequality to solve for $x$. For this, we need to expand the expressions on both sides and then move everything to one side.\n\nStep 1: Expand the expressions on both sides\n$(2x+10)(x+3) = 2x^2 + 6x + 10x + 30 = 2x^2 + 16x + 30$\n$(3x+9)(x+8) = 3x^2 + 24x + 9x + 72 = 3x^2 + 33x + 72$\n\nStep 2: Move everything to one side\n$0 < 3x^2 + 33x + 72 - (2x^2 + 16x + 30) = x^2 + 17x + 42$\n\nNow we need to find the factorized form of the quadratic inequality $x^2 + 17x + 42 > 0$ and then find intervals where the inequality holds true.\n\nLet's first find the factorized form of the quadratic:",
"role": "assistant"
},
{
"content": "",
"role": "user"
},
{
"content": "I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\n\n```python\nfrom sympy import symbols, factor\r\nx = symbols('x')\r\nfactorized = factor(x**2 + 17*x + 42)\r\nfactorized\r\n```\n\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.",
"role": "assistant"
}
],
"time": 24.91333508491516,
"check_result": "the reply doesn't contain an answer.",
"is_correct": false,
"trial": -1
}
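
In the sample above the assistant reduces the inequality to `x^2 + 17x + 42 > 0` but stops before stating an interval, which is why `check_result` reports no answer and `is_correct` is false. The reference answer can be checked independently. The sketch below is not part of the PR; it uses exact rational arithmetic on a grid of points (an arbitrary choice of range and step) to compare the original inequality with the reference interval.

```python
from fractions import Fraction


def satisfies_inequality(x):
    """The inequality from the problem statement."""
    return (2 * x + 10) * (x + 3) < (3 * x + 9) * (x + 8)


def in_reference_answer(x):
    """Membership in the reference answer (-inf, -14) U (-3, inf)."""
    return x < -14 or x > -3


# Exact rational grid from -20 to 5 in steps of 1/4, including both endpoints.
points = [Fraction(n, 4) for n in range(-80, 21)]

# The difference of the two sides should equal (x + 14)(x + 3) = x^2 + 17x + 42.
factorization_holds = all(
    (3 * x + 9) * (x + 8) - (2 * x + 10) * (x + 3) == (x + 14) * (x + 3)
    for x in points
)
mismatches = [x for x in points if satisfies_inequality(x) != in_reference_answer(x)]

print(factorization_holds)
print(mismatches)
```

A grid check is evidence rather than proof, but together with the factorization it confirms that the boundary points -14 and -3 are excluded, as the interval notation in `correct_ans` says.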
@@ -0,0 +1,24 @@
{
"problem": "What is the sum of all the distinct positive two-digit factors of 144?",
"level": "Level 5",
"type": "Number Theory",
"solution": "Prime factorize $144=2^4\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\cdot3^2+2^2\\cdot3+2^2\\cdot3^2+2^3\\cdot3+2^3\\cdot3^2+2^4\\cdot3=\\boxed{226}.$",
"correct_ans": "226",
"problem_id": "0",
"response_with_ans": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.",
"round": 0,
"messages": [
{
"content": "What is the sum of all the distinct positive two-digit factors of 144?",
"role": "user"
},
{
"content": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.",
"role": "assistant"
}
],
"time": 11.140539407730103,
"check_result": "the answer is not directly provided in the reply, but i can extract it by running the given python code:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"the sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nafter running this code, the result is:\nthe sum of all the distinct positive two-digit factors of 144 is: 226\n\nthe answer is 226. the ground truth answer is 226. therefore,\n\nthe answer is correct.",
"is_correct": true,
"trial": -1
}
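
The sample logs above share one shape: problem metadata, a `messages` list of `role`/`content` turns, and an `is_correct` flag. A reader who wants a quick overview of a folder of such logs could reduce each record to a few fields. The sketch below is illustrative only: the records are abridged stand-ins for the two samples shown so far (message contents omitted), and `summarize` is a hypothetical helper, not a function from the notebook.

```python
# Abridged stand-ins for the algebra and number-theory samples above.
records = [
    {
        "type": "Algebra",
        "is_correct": False,
        "messages": [{"role": "user"}, {"role": "assistant"},
                     {"role": "user"}, {"role": "assistant"}],
    },
    {
        "type": "Number Theory",
        "is_correct": True,
        "messages": [{"role": "user"}, {"role": "assistant"}],
    },
]


def summarize(record):
    """Hypothetical helper: keep the problem type, the correctness flag,
    and the number of assistant turns in the conversation."""
    turns = sum(1 for message in record["messages"] if message["role"] == "assistant")
    return {
        "type": record["type"],
        "is_correct": record["is_correct"],
        "assistant_turns": turns,
    }


summaries = [summarize(record) for record in records]
accuracy = sum(s["is_correct"] for s in summaries) / len(summaries)

for summary in summaries:
    print(summary)
print(accuracy)  # 0.5 for these two records
```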
24 changes: 24 additions & 0 deletions test/test_files/agenteval-in-out/samples/sample_test_case.json
@@ -0,0 +1,24 @@
{
"problem": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
"level": "Level 5",
"type": "Number Theory",
"solution": "Since $5 \\times 24 = 120 = 121 - 1$, it follows that $-5 \\times 24 \\equiv 1 \\pmod{121}$. Adding 121 to $-5$ to make it positive, we find $(-5 + 121) \\times 24 \\equiv 116 \\times 24 \\equiv 1 \\pmod{121}$, so it follows that the modular inverse of $24$ is $\\boxed{116}$ when taken modulo $121$.",
"correct_ans": "116",
"problem_id": "5",
"response_with_ans": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n    g, x, _ = extended_gcd(a, m)\n    if g != 1:\n        raise Exception(f\"{a} and {m} are not coprime.\")\n    return x % m\n\ndef extended_gcd(a, b):\n    if a == 0:\n        return b, 0, 1\n    else:\n        g, x, y = extended_gcd(b % a, a)\n        return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.",
"round": 0,
"messages": [
{
"content": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
"role": "user"
},
{
"content": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n    g, x, _ = extended_gcd(a, m)\n    if g != 1:\n        raise Exception(f\"{a} and {m} are not coprime.\")\n    return x % m\n\ndef extended_gcd(a, b):\n    if a == 0:\n        return b, 0, 1\n    else:\n        g, x, y = extended_gcd(b % a, a)\n        return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.",
"role": "assistant"
}
],
"time": 13.481226921081543,
"check_result": "let me verify the modular inverse of 24 modulo 11^2.\n\nafter executing the provided python code:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nthe answer is found to be: 116.\n\nthe answer is correct.",
"is_correct": true,
"trial": -1
}
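
The logged answer of 116 in the test case above can be cross-checked without the hand-written Extended Euclidean code. As a small sketch separate from the PR, Python's built-in three-argument `pow` accepts a negative exponent for modular inverses (this requires Python 3.8 or later):

```python
a = 24
m = 11**2  # 121

# Built-in modular inverse; raises ValueError if a and m are not coprime.
inverse = pow(a, -1, m)

print(inverse)            # 116
print((a * inverse) % m)  # 1, confirming 24 * 116 is congruent to 1 mod 121
```

This agrees with both the reference `correct_ans` and the value reported in `check_result`.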