Merged
45 commits
23c9157
refactor: restructure rai_bench
jmatejcz Apr 8, 2025
4e417c2
refactor: separate tool_calling tasks
jmatejcz Apr 8, 2025
86b4878
feat: add subtasks, validators
jmatejcz Apr 9, 2025
f584186
refactor: renamed benchmark dirs
jmatejcz Apr 9, 2025
f4cd614
refactor: base class for basic tasks
jmatejcz Apr 9, 2025
351893a
feat: add more subtasks to cover basic tests
jmatejcz Apr 9, 2025
5c845a5
feat: add pydantic models for messages
jmatejcz Apr 9, 2025
04a3f6f
feat: add mock tools for topics, services and actions
jmatejcz Apr 10, 2025
85a0a2d
feat: refactor manipulation tasks to suit new frame
jmatejcz Apr 10, 2025
5187370
refactor: changed subtasks to more generic classes
jmatejcz Apr 10, 2025
e622591
refactor: changed errors and validation logging
jmatejcz Apr 10, 2025
1f07832
fix: raising errors in SubTask
jmatejcz Apr 11, 2025
105519c
fix: validator properly iterates when extra tools
jmatejcz Apr 11, 2025
21c8e20
fix: proper logging in manipulation tasks
jmatejcz Apr 11, 2025
b57c01b
feat: add and integrate custom interface tasks
jmatejcz Apr 11, 2025
6e31524
feat: GetROS2MessageInterfaceTool now return output as string with al…
jmatejcz Apr 1, 2025
f744940
fix: import adjustment after rebase
jmatejcz Apr 14, 2025
b6f7b0a
feat: migrated spatial reasoning tasks
MagdalenaKotynia Mar 28, 2025
8e8679d
feat: add action models
jmatejcz Apr 14, 2025
4637462
refactor: changed available models in mock action tool
jmatejcz Apr 14, 2025
29d1fc9
feat: add mock of GetDistanceToObjectsTool
jmatejcz Apr 14, 2025
7f790fa
feat: add action toolkit mock
jmatejcz Apr 15, 2025
136281b
feat: check fields action subtask added
jmatejcz Apr 15, 2025
8718728
feat: migrated and refactored navigation tasks
jmatejcz Apr 15, 2025
6fe1cb4
chore: deleted unused code
jmatejcz Apr 15, 2025
bb7ee73
fix: handling recursion limit error
jmatejcz Apr 15, 2025
27fa402
refactor: all message models now forbid extra params
jmatejcz Apr 15, 2025
ba751ee
feat: declared and moved sample tasks to one file
jmatejcz Apr 15, 2025
178bfb7
test: add unittests for subtasks
jmatejcz Apr 15, 2025
35032db
test: add unittest for validators
jmatejcz Apr 15, 2025
db35baf
feat: add args model_type and vendor
jmatejcz Apr 15, 2025
b0ef7d1
refactor: tests naming change
jmatejcz Apr 15, 2025
fd80394
docs: added README and small validation docs
jmatejcz Apr 16, 2025
de861ad
style: added docstrings
jmatejcz Apr 16, 2025
7037b4d
fix: fixes after rebase
jmatejcz Apr 16, 2025
3e19e97
chore: add licenses to tests
jmatejcz Apr 16, 2025
0d6a9f1
refactor: renamed interface parser file and moved it to generic folder
jmatejcz Apr 16, 2025
247c2f1
docs: moved docs to rai_bench
jmatejcz Apr 16, 2025
d9609c3
refactor: save to results errors from extra calls
jmatejcz Apr 16, 2025
fc2c1f1
fix: removed unnecessary configs from action models
jmatejcz Apr 16, 2025
b70777d
refactor: removed default empty dict from subtasks
jmatejcz Apr 17, 2025
8dbadbd
feat: adjusted recursion limits
jmatejcz Apr 17, 2025
2510e7e
feat: added script for testing multiple models in one run
jmatejcz Apr 17, 2025
2b87ca5
docs: adjusted README to new changes
jmatejcz Apr 17, 2025
3fb30ca
fix: fixes after rebase
jmatejcz Apr 17, 2025
144 changes: 98 additions & 46 deletions src/rai_bench/README.md
@@ -1,20 +1,10 @@
## RAI Benchmark

### Description
# RAI Benchmarks

The RAI Bench is a package that includes benchmarks and provides a framework for creating new benchmarks

### Frame Components
## Manipulation O3DE Benchmark

- `Task`
- `Scenario`
- `Benchmark`

For more information about these classes go to -> [benchmark_model](./rai_bench/benchmark_model.py)

### O3DE Test Benchmark

The O3DE Test Benchmark [o3de_test_benchmark_module](./rai_bench/o3de_test_bench/) provides tasks and scene configurations for robotic arm manipulation task. The tasks use a common `ManipulationTask` logic and can be parameterized, which allows for many task variants. The current tasks include:
The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench/manipulation_o3de/) provides tasks and scene configurations for robotic arm manipulation simulation in O3DE. The tasks use a common `ManipulationTask` logic and can be parameterized, which allows for many task variants. The current tasks include:

- **MoveObjectToLeftTask**
- **GroupObjectsTask**
@@ -24,33 +14,17 @@ The O3DE Test Benchmark [o3de_test_benchmark_module](./rai_bench/o3de_test_bench

The result of a task is a value between 0 and 1, calculated as initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
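
A minimal sketch of that calculation (the function name and the handling of scenes with nothing misplaced are illustrative assumptions, not the actual `ManipulationTask` API):

```python
def manipulation_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Fraction of initially misplaced objects that ended up correctly placed (0.0-1.0)."""
    if initially_misplaced == 0:
        return 1.0  # assumption: a scene with nothing misplaced counts as a perfect run
    return initially_misplaced_now_correct / initially_misplaced
```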

Current O3DE simulation binaries:

### Running

1. Download O3DE simulation binary and unzip it.

- [ros2-humble](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_jammyhumble.zip)
- [ros2-jazzy](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_noblejazzy.zip)

2. Follow step 2 from [Manipulation demo Setup section](../../docs/demos/manipulation.md#setup)

3. Adjust the path to the binary in: [o3de_config.yaml](./rai_bench/o3de_test_bench/configs/o3de_config.yaml)
4. Run benchmark with:
### Frame Components

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/o3de_test_benchmark.py
```
- `Task`
- `Scenario`
- `Benchmark`

> [!NOTE]
> For now the benchmark runs all available scenarios (~160). See the [Example usage](#example-usage)
> section for details.
For more information about these classes, see [benchmark](./rai_bench/manipulation_o3de/benchmark.py) and [Task](./rai_bench/manipulation_o3de/interfaces.py).

### Example usage

Example of how to load scenes, define scenarios and run benchmark can be found in [o3de_test_benchmark_example](./rai_bench/examples/o3de_test_benchmark.py)
An example of how to load scenes, define scenarios and run the benchmark can be found in [manipulation_o3de_benchmark_example](rai_bench/examples/manipulation_o3de/main.py)

Scenarios can be loaded manually like:

@@ -73,7 +47,7 @@ scenarios = Benchmark.create_scenarios(

which will result in a list of scenarios combining every possible task and scene (each task decides if a scene config is suitable for it).

or can be imported from existing packets [scenarios_packets](./rai_bench/o3de_test_bench/scenarios.py):
or can be imported from existing packets [scenarios_packets](rai_bench/examples/manipulation_o3de/scenarios.py):

```python
t_scenarios = trivial_scenarios(
@@ -94,28 +68,81 @@ vh_scenarios = very_hard_scenarios(
```

which are grouped by their subjective difficulty. For now there are 10 trivial, 42 easy, 23 medium, 38 hard and 47 very hard scenarios.
Check docstrings and code in [scenarios_packets](./rai_bench/o3de_test_bench/scenarios.py) if you want to know how scenarios are assigned to a difficulty level.
Check docstrings and code in [scenarios_packets](rai_bench/examples/manipulation_o3de/scenarios.py) if you want to know how scenarios are assigned to a difficulty level.

### Running

1. Download O3DE simulation binary and unzip it.

- [ros2-humble](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_jammyhumble.zip)
- [ros2-jazzy](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_noblejazzy.zip)

2. Follow step 2 from [Manipulation demo Setup section](../../docs/demos/manipulation.md#setup)

3. Adjust the path to the binary in: [o3de_config.yaml](./rai_bench/examples/manipulation_o3de/configs/o3de_config.yaml)
4. Choose the model you want to run and a vendor.
> [!NOTE]
> The vendor configs are defined in [config.toml](../../config.toml). Change them if needed.
5. Run the benchmark with:

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name llama3.2 --vendor ollama
```

> [!NOTE]
> For now the benchmark runs all available scenarios (~160). See the [Example usage](#example-usage)
> section for details.

### Development

When creating new tasks or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/).
When creating new tasks or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/manipulation_o3de/tasks/).
This also applies when you are adding or changing helper methods in `Task` or `ManipulationTask`.

The number of scenarios can be easily extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

### Tool Calling Agent Benchmark
## Tool Calling Agent Benchmark

The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling agents. It includes a set of tasks and a benchmark that evaluates the performance of the agent on those tasks by verifying the correctness of the tool calls requested by the agent. The benchmark is integrated with LangSmith and Langfuse tracing backends to easily track the performance of the agents.

#### Frame Components
### Frame Components

- [Tool Calling Agent Benchmark](rai_bench/tool_calling_agent_bench/agent_bench.py) - Benchmark for LangChain tool calling agents
- [Tasks Interfaces](rai_bench/tool_calling_agent_bench/agent_tasks_interfaces.py) - Interfaces for tool calling agent tasks
- [Tool Calling Agent Benchmark](rai_bench/tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
- [Scores tracing](rai_bench/tool_calling_agent_bench/scores_tracing.py) - Component handling sending scores to tracing backends
- [Interfaces](rai_bench/tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask

For a detailed description of validation, see [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md).

[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent/main.py) - Script providing a benchmark on tasks based on ROS2 tools usage.

### Example Usage

#### Benchmark Example with ROS2 Tools
Validators can be constructed from any SubTasks, and Tasks can be validated by any number of Validators, which makes the whole validation process very versatile.

```python
# subtasks
get_topics_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_topics_names_and_types"
)
color_image_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image", expected_args={"topic": "/camera_image_color"}
)
# validators - consist of subtasks
topics_ord_val = OrderedCallsValidator(subtasks=[get_topics_subtask])
color_image_ord_val = OrderedCallsValidator(subtasks=[color_image_subtask])
topics_and_color_image_ord_val = OrderedCallsValidator(
    subtasks=[
        get_topics_subtask,
        color_image_subtask,
    ]
)
# tasks - validated by list of validators
GetROS2TopicsTask(validators=[topics_ord_val])
GetROS2RGBCameraTask(validators=[topics_and_color_image_ord_val])
GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val])
```
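
Note how the last two entries validate the same `GetROS2RGBCameraTask` in two different ways: a single ordered validator scores listing the topics and fetching the image as one step, while two separate validators score each call as its own step, allowing partial credit when only one of them succeeds.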

[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent_test_bench.py) - Script providing benchmark on tasks based on the ROS2 tools usage.
### Running

To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.

@@ -124,8 +151,33 @@ To run the benchmark:
```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/tool_calling_agent_test_bench.py
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py
```

There are also flags to declare the model type and vendor:

```bash
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama
```

> [!NOTE]
> The `simple_model` from [config.toml](../../config.toml) is currently set up in the example benchmark script. Change it to `complex_model` in the script if needed.
> The vendor configs are defined in [config.toml](../../config.toml). Change them if needed.

## Testing Models

To test multiple models, different benchmarks, or several repeats in one go, use the [test_models](./rai_bench/examples/test_models.py) script.

Modify these params:

```python
models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1
```

to your liking and run the script!

```bash
python src/rai_bench/rai_bench/examples/test_models.py
```
17 changes: 17 additions & 0 deletions src/rai_bench/rai_bench/docs/tool_calling_agent_benchmark.md
@@ -0,0 +1,17 @@
# Tool Calling Agent Benchmark

## Validation

Our validation flow consists of:

- SubTask - the smallest block, responsible for validating a single tool call (e.g. ListTopics)
- Validator - consists of subtasks. Based on the validator type, it checks whether all subtasks were completed in a certain way
- Task - consists of validators. Every Validator can be treated as a single step that is scored atomically. Visit examples/tool_calling_agent/tasks.py for more intuition. Every Task always has the same prompt and available tools; only the validation methods can be parameterized. On top of validators, you can pass the extra_tool_calls param to allow the model to correct itself (see the sketch below).
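
A minimal sketch of that parametrization, reusing names from the README example (the exact constructor signature is an assumption based on this description, not a verified API):

```python
# Hypothetical parametrization - names follow the README example and the
# description above; the exact signature is an assumption.
task = GetROS2TopicsTask(
    validators=[topics_ord_val],  # each validator is one atomically scored step
    extra_tool_calls=2,  # assumption: allow up to 2 surplus calls so the model can correct itself
)
```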

![Tool calling agent validation schema](imgs/tool_calling_agent_valid_schema.png)

Implementations can be found in:

- [Validators](../tool_calling_agent/validators.py)
- [SubTasks](../tool_calling_agent/tasks/subtasks.py)
- [Tasks](../tool_calling_agent/tasks/), including navigation, spatial reasoning, custom interfaces and others
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

########### EXAMPLE USAGE ###########
import argparse
import logging
import time
from datetime import datetime
@@ -23,7 +23,7 @@
from langchain.tools import BaseTool
from rai.agents.conversational_agent import create_conversational_agent
from rai.communication.ros2.connectors import ROS2Connector
from rai.initialization import get_llm_model
from rai.initialization import get_llm_model_direct
from rai.tools.ros2 import (
    GetObjectPositionsTool,
    GetROS2ImageTool,
@@ -32,26 +32,41 @@
)
from rai_open_set_vision.tools import GetGrabbingPointTool

from rai_bench.benchmark_model import Benchmark
from rai_bench.o3de_test_bench.scenarios import (
from rai_bench.examples.manipulation_o3de.scenarios import (
    easy_scenarios,
    hard_scenarios,
    medium_scenarios,
    trivial_scenarios,
    very_hard_scenarios,
)
from rai_bench.manipulation_o3de.benchmark import Benchmark
from rai_sim.o3de.o3de_bridge import (
    O3DEngineArmManipulationBridge,
)

if __name__ == "__main__":

def parse_args():
    parser = argparse.ArgumentParser(description="Run the Tool Calling Agent Benchmark")
    parser.add_argument(
        "--model-name",
        type=str,
        help="Model name to use for benchmarking",
    )
    parser.add_argument(
        "--vendor", type=str, default=None, help="Vendor of the model (optional)"
    )
    return parser.parse_args()


def run_benchmark(model_name: str, vendor: str):
    rclpy.init()
    connector = ROS2Connector()
    node = connector.node
    node.declare_parameter("conversion_ratio", 1.0)

    # define model
    llm = get_llm_model(model_type="complex_model", streaming=True)

    llm = get_llm_model_direct(model_name=model_name, vendor=vendor)

    system_prompt = """
You are a robotic arm with interfaces to detect and manipulate objects.
@@ -78,9 +93,7 @@
    ]
    # define loggers
    now = datetime.now()
    experiment_dir = (
        f"src/rai_bench/rai_bench/experiments/{now.strftime('%Y-%m-%d_%H-%M-%S')}"
    )
    experiment_dir = f"src/rai_bench/rai_bench/experiments/o3de_manipulation/{now.strftime('%Y-%m-%d_%H-%M-%S')}"
    Path(experiment_dir).mkdir(parents=True, exist_ok=True)
    log_file = f"{experiment_dir}/benchmark.log"
    file_handler = logging.FileHandler(log_file)
@@ -99,7 +112,7 @@
    agent_logger.setLevel(logging.INFO)
    agent_logger.addHandler(file_handler)

    configs_dir = "src/rai_bench/rai_bench/o3de_test_bench/configs/"
    configs_dir = "src/rai_bench/rai_bench/examples/manipulation_o3de/configs/"
    connector_path = configs_dir + "o3de_config.yaml"
    #### Create scenarios manually
    # load different scenes
@@ -171,7 +184,7 @@
        logger=bench_logger,
        results_filename=results_filename,
    )
    for i in range(len(all_scenarios)):
    for _ in range(len(all_scenarios)):
        agent = create_conversational_agent(
            llm, tools, system_prompt, logger=agent_logger
        )
@@ -190,3 +203,8 @@
    connector.shutdown()
    o3de.shutdown()
    rclpy.shutdown()


if __name__ == "__main__":
    args = parse_args()
    run_benchmark(model_name=args.model_name, vendor=args.vendor)
@@ -18,12 +18,9 @@

from rclpy.impl.rcutils_logger import RcutilsLogger

from rai_bench.benchmark_model import (
    Benchmark,
    Scenario,
    Task,
)
from rai_bench.o3de_test_bench.tasks import (
from rai_bench.manipulation_o3de.benchmark import Benchmark, Scenario
from rai_bench.manipulation_o3de.interfaces import Task
from rai_bench.manipulation_o3de.tasks import (
    BuildCubeTowerTask,
    GroupObjectsTask,
    MoveObjectsToLeftTask,
43 changes: 43 additions & 0 deletions src/rai_bench/rai_bench/examples/test_models.py
@@ -0,0 +1,43 @@
# Copyright (C) 2025 Robotec.AI
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import rai_bench.examples.manipulation_o3de.main as manipulation_o3de_bench
import rai_bench.examples.tool_calling_agent.main as tool_calling_agent_bench

if __name__ == "__main__":
    models_name = ["llama3.2", "qwen2.5:7b"]
    vendors = ["ollama", "ollama"]
    benchmarks = ["tool_calling_agent"]
    repeats = 1
    if len(models_name) != len(vendors):
        raise ValueError("Number of passed models must match number of passed vendors")
    else:
        for benchmark in benchmarks:
            for i, model_name in enumerate(models_name):
                for u in range(repeats):
                    try:
                        if benchmark == "tool_calling_agent":
                            tool_calling_agent_bench.run_benchmark(
                                model_name=model_name, vendor=vendors[i]
                            )
                        elif benchmark == "manipulation_o3de":
                            manipulation_o3de_bench.run_benchmark(
                                model_name=model_name, vendor=vendors[i]
                            )
                        else:
                            print(f"No benchmark named: {benchmark}")
                    except Exception as e:
                        print(
                            f"Failed to run {benchmark} benchmark for {model_name}, vendor: {vendors[i]}, execution number: {u + 1}, because: {str(e)}"
                        )