Adding first version of AgentEval -- a framework for assessing task utility for LLM-powered applications #681

julianakiseleva · 2023-11-15T02:44:55Z

We introduce AgentEval, the first version of a framework for automatically assessing task utility for an arbitrary application. It suggests criteria that explain task utility and then quantifies those criteria against the logs of your system. AgentEval consists of two key components:
- CriticAgent: This is an LLM-based agent that generates criteria to evaluate a given task.
- QuantifierAgent: This agent quantifies the performance of any sample task based on the criteria designed by the CriticAgent.
We demonstrate the framework on the Math Problems dataset in a notebook that supports running on single logs as well as plotting the quantified performance estimates for a set of problems.
The model logs are available in the agenteval branch.
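
For readers who want a feel for the two-step flow before opening the notebook, here is a minimal sketch of the critic/quantifier idea. It is not the notebook's code and does not use the AutoGen API: `suggest_criteria`, `quantify`, `fake_llm`, the prompt wording, and the JSON shapes are all hypothetical names and formats chosen for illustration, and the model call is replaced by canned replies so the example runs offline.

```python
import json
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: takes a prompt, returns the reply text.
LLM = Callable[[str], str]


def suggest_criteria(llm: LLM, task: str) -> List[Dict]:
    """Critic step: ask the model for evaluation criteria as a JSON list."""
    prompt = (
        "CRITIC: suggest criteria for evaluating solutions to the task below. "
        'Reply with a JSON list of objects with the keys "name", '
        '"description" and "accepted_values".\n\nTask: ' + task
    )
    return json.loads(llm(prompt))


def quantify(llm: LLM, criteria: List[Dict], task: str, log: str) -> Dict[str, str]:
    """Quantifier step: rate one logged sample against each criterion."""
    prompt = (
        "QUANTIFIER: rate the log below on each criterion. Reply with a JSON "
        "object mapping criterion name to one of its accepted values.\n\n"
        f"Task: {task}\nCriteria: {json.dumps(criteria)}\nLog: {log}"
    )
    ratings = json.loads(llm(prompt))
    allowed = {c["name"]: c["accepted_values"] for c in criteria}
    # Drop ratings for unknown criteria or values outside the accepted set.
    return {k: v for k, v in ratings.items() if v in allowed.get(k, [])}


def fake_llm(prompt: str) -> str:
    """Canned replies (assumption) so the sketch runs without a model or API key."""
    if prompt.startswith("CRITIC"):
        return json.dumps([
            {"name": "accuracy", "description": "Is the final answer right?",
             "accepted_values": ["incorrect", "correct"]},
            {"name": "clarity", "description": "Is the reasoning easy to follow?",
             "accepted_values": ["unclear", "clear"]},
        ])
    return json.dumps({"accuracy": "correct", "clarity": "clear", "speed": "fast"})


task = "Solve a prealgebra problem and explain the steps."
criteria = suggest_criteria(fake_llm, task)
scores = quantify(fake_llm, criteria, task, log="2x = 6, so x = 3.")
print(scores)  # {'accuracy': 'correct', 'clarity': 'clear'}
```

Passing the model in as a plain callable keeps the sketch independent of any particular client; in practice the two steps would be backed by LLM agents as described above, and the quantified values for many logs could then be aggregated and plotted.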

This PR adds:

- notebook/agenteval_cq_math.ipynb, which demonstrates AgentEval on the math problems
- sample files required to run the notebook, in test/test_files/agenteval-in-out
- website/blog/2023-11-11-AgentEval, the blog post explaining AgentEval

Related issue number

Checks

I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://microsoft.github.io/autogen/docs/Contribute#documentation to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.


Narabzad · 2023-11-21T00:42:09Z

e.g., Previous Work is capitalized but other section titles use lower case

@gagb done.

…tility for LLM-powered applications (microsoft#681) * add agenteval-notebook for math problems and the blog post about it * update gitignore * updates to notebook * adding folder for the logs * adding math problems logs * adding folder for alfworld logs * added limitiation and future work to blog post * minor edits blog post * adding changes * reorg * modify the main notebook * modification of the main notebook * remove wrong notebook * uploading new notebook * update agenteval notebook * change the sample * Update agenteval_cq_math.ipynb * adding final changes to notebook * updated framework picture * Update index.mdx * Update index.md * Add files via upload * updates to notebool * revise the blog * revise the blog * update the agent img * revise the blog * revise the blog * Excluded model logs from the main branch, you can find them in agenteval branch * Fixed pre-commit formatting. * Update website/blog/2023-11-11-AgentEval/index.mdx Co-authored-by: Chi Wang <[email protected]> * update gitignore * update index.mdx * update authors.yml by adding Negar and Julia * remove md file * remove md file * update gitignore * update authors file * pre-commit checks * pre-commit checks on authors.yml * pre-commit checks on authors.yml * update index.mdx * update authors.yml by adding Negar and Julia * updated the blog-post version 1 * updated the blog-post: TL;DR is ready * updated the blog-post: first part of introduction is ready * updated figures: typos on fig 1, changed terminology on the fig 2 * upadated the Framework part * fixed redering issues * upload zip file instead of single samples * update prealgebra.zip * update * upload * update z * update naming * update zip * update the agenteval notebook * update the notebook - removing unmercenary logs * updated fig 1 and references to it * updated fig 1 * incorporated PR comments * merged agenteval branch * final changes to the blog * updated taxonomy * update notebook * minor changes to the blog * Fixed formatting * 
Update the link in agenteval_cq_math.ipynb * update the blog and link in notebook * Update index.mdx * change folder name * Changes to be committed: modified: OAI_CONFIG_LIST_sample.txt * add sample OAI file * fix the url link to colab and typos * fix the url link to colab and typos * add authors * update profile pic * "update authors" * fixing the problem in test_groupchat.py * update the title lower case * reverting changes in setup.py * rerun pre-commit --------- Co-authored-by: Negar Arabzadeh <[email protected]> Co-authored-by: Julia Kiseleva <[email protected]> Co-authored-by: afourney <[email protected]> Co-authored-by: Chi Wang <[email protected]> Co-authored-by: Qingyun Wu <[email protected]>

julianakiseleva and others added 30 commits November 14, 2023 01:15

add agenteval-notebook for math problems and the blog post about it

8e87fad

update gitignore

12c8c9b

updates to notebook

30e54db

adding folder for the logs

b5451cc

adding math problems logs

a5332ab

adding folder for alfworld logs

567798f

added limitiation and future work to blog post

c8c2caa

minor edits blog post

b1088f2

Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…

1175cb0

…genteval merging

Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…

88002cb

…genteval merging with agenteval branch'

adding changes

3c75e00

reorg

da97789

modify the main notebook

be317d3

modification of the main notebook

5154214

remove wrong notebook

4d22587

uploading new notebook

a4fe5b0

update agenteval notebook

67caaba

change the sample

97398dd

Update agenteval_cq_math.ipynb

c13817a

Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…

6c6b5b1

…genteval merging with agenteval

adding final changes to notebook

6870056

Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…

be8cf9f

…genteval merging on agenteval

updated framework picture

7829a65

Update index.mdx

a43a981

Update index.md

06c53a2

Add files via upload

1dec052

updates to notebool

12baceb

Merge branch 'agenteval' of github.com:julianakiseleva/autogen into a…

dfddf29

…genteval

revise the blog

850b908

revise the blog

8ff719f

sonichi enabled auto-merge November 21, 2023 01:40

reverting changes in setup.py

9581acf

auto-merge was automatically disabled November 21, 2023 01:47
Head branch was pushed to by a user without write access

Narabzad had a problem deploying to openai1 November 21, 2023 01:47 — with GitHub Actions Failure

rerun pre-commit

b2d98ef

julianakiseleva had a problem deploying to openai1 November 21, 2023 03:23 — with GitHub Actions Failure

sonichi enabled auto-merge November 21, 2023 03:24

sonichi approved these changes Nov 21, 2023


sonichi added this pull request to the merge queue Nov 21, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 21, 2023

sonichi added this pull request to the merge queue Nov 21, 2023

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 21, 2023

sonichi mentioned this pull request Nov 21, 2023

Support custom text formats and recursive #496

Merged

3 tasks

Merge branch 'main' into main

af99079

sonichi had a problem deploying to openai1 November 21, 2023 04:04 — with GitHub Actions Failure

sonichi enabled auto-merge November 21, 2023 04:05

sonichi added this pull request to the merge queue Nov 21, 2023

Merged via the queue into microsoft:main with commit 19c7da2 Nov 21, 2023
16 of 19 checks passed
