Describe the issue
Tip
Want to get involved?
We'd love it if you did! Please get in contact with the people assigned to this issue, or leave a comment. See general contributing advice here too.
Background:
AutoGen aims to simplify the development of LLM-powered multi-agent systems for various applications, ultimately making end users' lives easier by assisting with their tasks. Naturally, we want to understand how our developed systems perform, how useful they are to users, and, perhaps most crucially, how we can improve them. Directly evaluating multi-agent systems poses challenges, as current approaches predominantly rely on success metrics, i.e., whether the agent accomplishes the task. However, understanding how users interact with a system involves far more than success alone. Take math problems, for instance: it is not merely about the agent solving the problem. Equally important is its ability to convey solutions according to various criteria, including completeness, conciseness, and the clarity of the provided explanation. Furthermore, success isn't always clearly defined for every task.
Rapid advances in LLMs and multi-agent systems have brought forth many emerging capabilities that we're keen on translating into tangible utilities for end users. We introduce the first version of the AgentEval framework: a tool crafted to help developers quickly gauge the utility of LLM-powered applications designed to help end users accomplish their desired tasks.
Here is the blog post for a short description.
Here is the first paper on AgentEval for more details.
The goal of this issue is to integrate AgentEval into the AutoGen library (and further into AutoGenStudio).
The roadmap involves:
1. Improvements to the AgentEval schema [black parts are tested; blue parts are ongoing work]: @siqingh
2. Usage of AgentEval:
Complete Offline mode: @jluey1 @SeunRomiluyi
This is how it is used now: a system designer provides the triple <task description, successful task execution, failed task execution> and gets as output a list of criteria, where each criterion looks like:
```json
[
  {
    "name": "Problem Interpretation",
    "description": "Ability to correctly interpret the problem.",
    "accepted_values": ["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
  }
]
```
Then, the system designer can ask AgentEval to quantify input datapoints, where input datapoints are logs of agent interactions (currently we mostly use AutoGen logs).
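Since the criteria output is plain JSON, the shape of the quantification result can be sketched as a mapping from each criterion's name to one of its accepted values. A minimal self-contained sketch of the data shapes only (the assessed value below is illustrative, not produced by AgentEval):

```python
import json

# The criterion list in the format shown above.
criteria_json = """
[
  {
    "name": "Problem Interpretation",
    "description": "Ability to correctly interpret the problem.",
    "accepted_values": ["completely off", "slightly relevant", "relevant",
                        "mostly accurate", "completely accurate"]
  }
]
"""

criteria = json.loads(criteria_json)

# Quantification maps each criterion name to one of its accepted values.
# Here we just pick a value to illustrate the output shape.
assessment = {c["name"]: c["accepted_values"][-1] for c in criteria}
print(assessment)  # {'Problem Interpretation': 'completely accurate'}
```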
Online mode: @lalo @chinganc
Here, we envision AgentEval being used as part of an Optimizer/Manager/Controller. The figure below provides an example of how AgentEval can be used at each step of pipeline execution.
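One way to picture the online mode is a controller that scores each intermediate output and decides whether to continue. The sketch below is entirely hypothetical: `quantify_step` is a stand-in stub, not AgentEval's actual quantifier, and the control policy is just one possible design.

```python
def quantify_step(output: str) -> dict[str, str]:
    # Stand-in for quantifying a single step's log against one criterion.
    return {"Clarity": "mostly accurate" if output else "completely off"}

def run_pipeline(steps):
    log = []
    for step in steps:
        out = step()
        scores = quantify_step(out)
        log.append((out, scores))
        if scores["Clarity"] == "completely off":
            break  # a real controller could retry or re-plan here
    return log

log = run_pipeline([lambda: "step-1 output", lambda: "step-2 output"])
print(len(log))  # 2
```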
Function Signatures:
```python
def generate_criteria(
    llm_config: dict | bool,       # llm inference configuration
    task: Task,                    # the task to evaluate
    additional_instructions: str,  # additional instructions for the criteria agent
    max_round: int,                # the maximum number of rounds to run the conversation
    use_subcritic: bool,           # whether to use the subcritic agent to generate subcriteria
) -> list[Criterion]:
    ...

def quantify_criteria(
    llm_config: dict | bool,   # llm inference configuration
    criteria: list[Criterion], # a list of criteria for evaluating the utility of a given task
    task: Task,                # the task to evaluate
    test_case: str,            # the test case to evaluate
    ground_truth: str,         # the ground truth for the test case
) -> dict:
    # Returns a dictionary where the keys are the criteria and the values are
    # the assessed performance based on the accepted values for each criterion.
    ...
```
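Put together, the two calls compose as follows. The function bodies here are stand-in stubs so the call pattern and return shapes can be exercised; the real implementations would drive LLM agents.

```python
# Stubs mirroring the proposed signatures; not the real AgentEval logic.
def generate_criteria(llm_config, task, additional_instructions="",
                      max_round=2, use_subcritic=False):
    return [{"name": "Completeness",
             "accepted_values": ["poor", "fair", "good"]}]

def quantify_criteria(llm_config, criteria, task, test_case, ground_truth):
    # Illustrative: always returns the top accepted value per criterion.
    return {c["name"]: c["accepted_values"][-1] for c in criteria}

task = {"name": "math_problem_solving",
        "description": "Solve a grade-school math word problem.",
        "successful_response": "...", "failed_response": "..."}

criteria = generate_criteria(llm_config=False, task=task)
result = quantify_criteria(False, criteria, task,
                           test_case="agent interaction log ...",
                           ground_truth="")
print(result)  # {'Completeness': 'good'}
```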
```python
class Criterion:
    name: str                        # name of the criterion
    description: str                 # description of the criterion
    accepted_values: list[str]       # list of possible values for the criterion
                                     # (could this also be a range of values?)
    sub_criteria: list["Criterion"]  # list of sub-criteria

class Task:
    name: str                 # name of the task to be evaluated
    description: str          # description of the task
    successful_response: str  # chat message example of a successful response
    failed_response: str      # chat message example of a failed response
```
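As a concrete sketch, the two classes above could be expressed as dataclasses; the field names follow this issue, while the dataclass choice and the example values are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Criterion:
    name: str
    description: str
    accepted_values: list[str]
    sub_criteria: list["Criterion"] = field(default_factory=list)

@dataclass
class Task:
    name: str
    description: str
    successful_response: str
    failed_response: str

# Illustrative instances only.
task = Task(
    name="math_problem_solving",
    description="Solve a grade-school math word problem.",
    successful_response="The answer is 42 because ...",
    failed_response="I don't know.",
)
criterion = Criterion(
    name="Problem Interpretation",
    description="Ability to correctly interpret the problem.",
    accepted_values=["completely off", "slightly relevant", "relevant",
                     "mostly accurate", "completely accurate"],
)
print(criterion.name, len(criterion.accepted_values))  # Problem Interpretation 5
```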

