[Roadmap]: Integrating AgentEval #2162
Labels
0.2 (pre-0.4 codebase), needs-triage, roadmap (AutoGen roadmap)
Describe the issue
Tip
Want to get involved?
We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.
Background:
AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Naturally, we want to understand how the systems we develop perform, how useful they are to users, and, perhaps most crucially, how we can improve them. Directly evaluating multi-agent systems is challenging because current approaches predominantly rely on success metrics, essentially whether the agent accomplishes the task. However, understanding how users interact with a system involves far more than success alone. Take math problems, for instance: it is not merely about the agent solving the problem. Equally important is its ability to convey solutions according to various criteria, including completeness, conciseness, and the clarity of the explanation. Furthermore, success isn't always clearly defined for every task.
Rapid advances in LLMs and multi-agent systems have brought forth many emerging capabilities that we're keen on translating into tangible utilities for end users. We introduce the first version of the AgentEval framework, a tool crafted to help developers swiftly gauge the utility of LLM-powered applications designed to help end users accomplish their desired tasks.
Here is the blog post for a short description.
Here is the first paper on AgentEval for more details.
The goal of this issue is to integrate AgentEval into the AutoGen library (and, further, into AutoGenStudio).
The roadmap involves:
1. Improvements to the AgentEval schema [black parts are tested, and the blue part is ongoing work]: @siqingh
2. Usage of AgentEval:
Complete Offline mode: @jluey1 @SeunRomiluyi
This is how AgentEval is used today: a system designer provides the triple <task description, successful task execution, failed task execution> and gets back a list of criteria (see the illustrative sketch after this roadmap).
Then, the system designer can ask AgentEval to quantify input data points against those criteria, where the input data points are logs of agent interactions (currently we mostly use AutoGen logs).
Online mode: @lalo @chinganc
Here, we envision that AgentEval can be used as part of an Optimizer/Manager/Controller. The figure below provides an example of how AgentEval could be used at each step of a pipeline execution.
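To make the offline mode concrete, here is an illustrative sketch of what the generated criteria and a quantified data point could look like. The field names and values below are assumptions made for illustration, not the exact schema; the authoritative format is in the linked blog post, paper, and PRs.

```python
# Illustrative only: an assumed shape for AgentEval criteria and quantified results.
# The actual schema is defined in the AgentEval PRs linked below.

# Criteria a system designer might get back for a math-problem task.
example_criteria = [
    {
        "name": "completeness",
        "description": "Does the agent address every part of the problem?",
        "accepted_values": ["not at all", "partially", "mostly", "fully"],
    },
    {
        "name": "clarity",
        "description": "Is the explanation easy for the end user to follow?",
        "accepted_values": ["poor", "fair", "good", "excellent"],
    },
]

# Quantification of one data point (a log of agent interactions)
# against those criteria.
example_quantified_datapoint = {
    "completeness": "mostly",
    "clarity": "good",
}
```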
Function Signatures:
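The signatures themselves are not reproduced here; below is a minimal sketch of what an offline-mode API could look like, assuming two entry points for generating and quantifying criteria. All names, types, and parameters in this sketch are assumptions for discussion, not the final interface; the authoritative signatures are in the PRs referenced below.

```python
# A hypothetical sketch of an offline-mode AgentEval API (names and parameters
# are assumptions; see the PRs below for the actual implementation).
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    """The task the multi-agent system is being evaluated on."""
    name: str
    description: str
    successful_response: str  # an example of a successful task execution
    failed_response: str      # an example of a failed task execution


@dataclass
class Criterion:
    """A single evaluation criterion proposed by AgentEval."""
    name: str
    description: str
    accepted_values: List[str] = field(default_factory=list)


def generate_criteria(llm_config: dict, task: Task) -> List[Criterion]:
    """Ask a critic agent to propose evaluation criteria for the given task."""
    raise NotImplementedError("sketch only")


def quantify_criteria(
    llm_config: dict,
    criteria: List[Criterion],
    task: Task,
    test_case: str,  # e.g. an AutoGen conversation log for one data point
) -> Dict[str, str]:
    """Score one data point against each criterion."""
    raise NotImplementedError("sketch only")
```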
Update 1: PR #2156
Contributor: @jluey1
Update 2: PR #2526
Contributor: @lalo