Merged
45 commits
23c9157
refactor: restructure rai_bench
jmatejcz Apr 8, 2025
4e417c2
refactor: separate tool_calling tasks
jmatejcz Apr 8, 2025
86b4878
feat: add subtasks, validators
jmatejcz Apr 9, 2025
f584186
refactor: renamed benchmark dirs
jmatejcz Apr 9, 2025
f4cd614
refactor: base class for basic tasks
jmatejcz Apr 9, 2025
351893a
feat: add more subtasks to cover basic tests
jmatejcz Apr 9, 2025
5c845a5
feat: add pydantic models for messages
jmatejcz Apr 9, 2025
04a3f6f
feat: add mock tools for topics, services and actions
jmatejcz Apr 10, 2025
85a0a2d
feat: refactor manipulation tasks to suit new frame
jmatejcz Apr 10, 2025
5187370
refactor: changed subtasks to more generic classes
jmatejcz Apr 10, 2025
e622591
refactor: changed errors and validation logging
jmatejcz Apr 10, 2025
1f07832
fix: raising errors in SubTask
jmatejcz Apr 11, 2025
105519c
fix: validator properly iterates when extra tools
jmatejcz Apr 11, 2025
21c8e20
fix: proper logging in manipulation tasks
jmatejcz Apr 11, 2025
b57c01b
feat: add and integrate custom interface tasks
jmatejcz Apr 11, 2025
6e31524
feat: GetROS2MessageInterfaceTool now return output as string with al…
jmatejcz Apr 1, 2025
f744940
fix: import adjustment after rebase
jmatejcz Apr 14, 2025
b6f7b0a
feat: migrated spatial reasoning tasks
MagdalenaKotynia Mar 28, 2025
8e8679d
feat: add action models
jmatejcz Apr 14, 2025
4637462
refactor: changed available models in mock action tool
jmatejcz Apr 14, 2025
29d1fc9
feat: add mock of GetDistanceToObjectsTool
jmatejcz Apr 14, 2025
7f790fa
feat: add action toolkit mock
jmatejcz Apr 15, 2025
136281b
feat: check fields action subtask added
jmatejcz Apr 15, 2025
8718728
feat: migrated and refactored navigation tasks
jmatejcz Apr 15, 2025
6fe1cb4
chore: deleted unused code
jmatejcz Apr 15, 2025
bb7ee73
fix: handling recursion limit error
jmatejcz Apr 15, 2025
27fa402
refactor: all message models now forbid extra params
jmatejcz Apr 15, 2025
ba751ee
feat: declared and moved sample tasks to one file
jmatejcz Apr 15, 2025
178bfb7
test: add unittests for subtasks
jmatejcz Apr 15, 2025
35032db
test: add unittest for validators
jmatejcz Apr 15, 2025
db35baf
feat: add args model_type and vendor
jmatejcz Apr 15, 2025
b0ef7d1
refactor: tests naming change
jmatejcz Apr 15, 2025
fd80394
docs: added README and small validation docs
jmatejcz Apr 16, 2025
de861ad
style: added docstrings
jmatejcz Apr 16, 2025
7037b4d
fix: fixes after rebase
jmatejcz Apr 16, 2025
3e19e97
chore: add licenses to tests
jmatejcz Apr 16, 2025
0d6a9f1
refactor: renamed interface parser file and moved it to generic folder
jmatejcz Apr 16, 2025
247c2f1
docs: moved docs to rai_bench
jmatejcz Apr 16, 2025
d9609c3
refactor: save to results errors from extra calls
jmatejcz Apr 16, 2025
fc2c1f1
fix: removed unnecessary configs from action models
jmatejcz Apr 16, 2025
b70777d
refactor: removed default empty dict from subtasks
jmatejcz Apr 17, 2025
8dbadbd
feat: adjusted recursion limits
jmatejcz Apr 17, 2025
2510e7e
feat: added script for testing multiple models in one run
jmatejcz Apr 17, 2025
2b87ca5
docs: adjusted README to new changes
jmatejcz Apr 17, 2025
3fb30ca
fix: fixes after rebase
jmatejcz Apr 17, 2025
144 changes: 98 additions & 46 deletions src/rai_bench/README.md
@@ -1,20 +1,10 @@
## RAI Benchmark

### Description
# RAI Benchmarks

The RAI Bench is a package that includes benchmarks and provides a framework for creating new benchmarks

### Frame Components
## Manipulation O3DE Benchmark

- `Task`
- `Scenario`
- `Benchmark`

For more information about these classes go to -> [benchmark_model](./rai_bench/benchmark_model.py)

### O3DE Test Benchmark

The O3DE Test Benchmark [o3de_test_benchmark_module](./rai_bench/o3de_test_bench/) provides tasks and scene configurations for robotic arm manipulation task. The tasks use a common `ManipulationTask` logic and can be parameterized, which allows for many task variants. The current tasks include:
The Manipulation O3DE Benchmark [manipulation_o3de_benchmark_module](./rai_bench/manipulation_o3de/) provides tasks and scene configurations for robotic arm manipulation simulation in O3DE. The tasks use a common `ManipulationTask` logic and can be parameterized, which allows for many task variants. The current tasks include:

- **MoveObjectToLeftTask**
- **GroupObjectsTask**
@@ -24,33 +14,17 @@ The O3DE Test Benchmark [o3de_test_benchmark_module](./rai_bench/o3de_test_bench

The result of a task is a value between 0 and 1, calculated as initially_misplaced_now_correct / initially_misplaced. This score is calculated at the end of each scenario.
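
A minimal sketch of that calculation (the function name and the handling of scenes with nothing misplaced are illustrative assumptions, not the actual `ManipulationTask` API):

```python
def manipulation_score(initially_misplaced: int, initially_misplaced_now_correct: int) -> float:
    """Fraction of initially misplaced objects that ended up correctly placed (0.0-1.0)."""
    if initially_misplaced == 0:
        return 1.0  # assumption: a scene with nothing misplaced counts as a perfect run
    return initially_misplaced_now_correct / initially_misplaced
```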

Current O3DE simulation binaries:

### Running

1. Download O3DE simulation binary and unzip it.

- [ros2-humble](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_jammyhumble.zip)
- [ros2-jazzy](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_noblejazzy.zip)

2. Follow step 2 from [Manipulation demo Setup section](../../docs/demos/manipulation.md#setup)

3. Adjust the path to the binary in: [o3de_config.yaml](./rai_bench/o3de_test_bench/configs/o3de_config.yaml)
4. Run benchmark with:
### Frame Components

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/o3de_test_benchmark.py
```
- `Task`
- `Scenario`
- `Benchmark`

> [!NOTE]
> For now the benchmark runs all available scenarios (~160). See the [Example usage](#example-usage)
> section for details.
For more information about these classes, see [benchmark](./rai_bench/manipulation_o3de/benchmark.py) and [Task](./rai_bench/manipulation_o3de/interfaces.py).

### Example usage

Example of how to load scenes, define scenarios and run benchmark can be found in [o3de_test_benchmark_example](./rai_bench/examples/o3de_test_benchmark.py)
An example of how to load scenes, define scenarios and run the benchmark can be found in [manipulation_o3de_benchmark_example](rai_bench/examples/manipulation_o3de/main.py)

Scenarios can be loaded manually like:

@@ -73,7 +47,7 @@ scenarios = Benchmark.create_scenarios(

which will result in a list of scenarios combining every possible task and scene (each task decides if a scene config is suitable for it).

or can be imported from existing packets [scenarios_packets](./rai_bench/o3de_test_bench/scenarios.py):
or can be imported from existing packets [scenarios_packets](rai_bench/examples/manipulation_o3de/scenarios.py):

```python
t_scenarios = trivial_scenarios(
@@ -94,28 +68,81 @@ vh_scenarios = very_hard_scenarios(
```

which are grouped by their subjective difficulty. For now there are 10 trivial, 42 easy, 23 medium, 38 hard and 47 very hard scenarios.
Check docstrings and code in [scenarios_packets](./rai_bench/o3de_test_bench/scenarios.py) if you want to know how scenarios are assigned to a difficulty level.
Check docstrings and code in [scenarios_packets](rai_bench/examples/manipulation_o3de/scenarios.py) if you want to know how scenarios are assigned to a difficulty level.

### Running

1. Download O3DE simulation binary and unzip it.

- [ros2-humble](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_jammyhumble.zip)
- [ros2-jazzy](https://robotec-ml-rai-public.s3.eu-north-1.amazonaws.com/RAIManipulationDemo_noblejazzy.zip)

2. Follow step 2 from [Manipulation demo Setup section](../../docs/demos/manipulation.md#setup)

3. Adjust the path to the binary in: [o3de_config.yaml](./rai_bench/examples/manipulation_o3de/configs/o3de_config.yaml)
4. Choose the model you want to run and a vendor.
> [!NOTE]
> The vendor configs are defined in [config.toml](../../config.toml). Change them if needed.
5. Run the benchmark with:

```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/manipulation_o3de/main.py --model-name llama3.2 --vendor ollama
```

> [!NOTE]
> For now the benchmark runs all available scenarios (~160). See the [Example usage](#example-usage)
> section for details.

### Development

When creating new tasks or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/).
When creating new tasks or changing existing ones, make sure to add unit tests for score calculation in [rai_bench_tests](../../tests/rai_bench/manipulation_o3de/tasks/).
This also applies when you are adding or changing helper methods in `Task` or `ManipulationTask`.

The number of scenarios can be easily extended without writing new tasks, by increasing the number of variants of the same task and adding more simulation configs, but this won't improve the variety of scenarios as much as creating new tasks.

### Tool Calling Agent Benchmark
## Tool Calling Agent Benchmark

The Tool Calling Agent Benchmark is the benchmark for LangChain tool calling agents. It includes a set of tasks and a benchmark that evaluates the performance of the agent on those tasks by verifying the correctness of the tool calls requested by the agent. The benchmark is integrated with LangSmith and Langfuse tracing backends to easily track the performance of the agents.

#### Frame Components
### Frame Components

- [Tool Calling Agent Benchmark](rai_bench/tool_calling_agent_bench/agent_bench.py) - Benchmark for LangChain tool calling agents
- [Tasks Interfaces](rai_bench/tool_calling_agent_bench/agent_tasks_interfaces.py) - Interfaces for tool calling agent tasks
- [Tool Calling Agent Benchmark](rai_bench/tool_calling_agent/benchmark.py) - Benchmark for LangChain tool calling agents
- [Scores tracing](rai_bench/tool_calling_agent_bench/scores_tracing.py) - Component handling sending scores to tracing backends
- [Interfaces](rai_bench/tool_calling_agent/interfaces.py) - Interfaces for validation classes - Task, Validator, SubTask

For a detailed description of validation, see [Validation](./rai_bench/docs/tool_calling_agent_benchmark.md).

[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent/main.py) - Script providing a benchmark on tasks based on ROS2 tools usage.

### Example Usage

#### Benchmark Example with ROS2 Tools
Validators can be constructed from any SubTasks, and Tasks can be validated by any number of Validators, which makes the whole validation process very versatile.

```python
# subtasks
get_topics_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_topics_names_and_types"
)
color_image_subtask = CheckArgsToolCallSubTask(
    expected_tool_name="get_ros2_image", expected_args={"topic": "/camera_image_color"}
)
# validators - consist of subtasks
topics_ord_val = OrderedCallsValidator(subtasks=[get_topics_subtask])
color_image_ord_val = OrderedCallsValidator(subtasks=[color_image_subtask])
topics_and_color_image_ord_val = OrderedCallsValidator(
    subtasks=[
        get_topics_subtask,
        color_image_subtask,
    ]
)
# tasks - validated by list of validators
GetROS2TopicsTask(validators=[topics_ord_val])
GetROS2RGBCameraTask(validators=[topics_and_color_image_ord_val])
GetROS2RGBCameraTask(validators=[topics_ord_val, color_image_ord_val])
```
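
Note how the last two entries validate the same `GetROS2RGBCameraTask` in two different ways: a single ordered validator scores listing the topics and fetching the image as one step, while two separate validators score each call as its own step, allowing partial credit when only one of them succeeds.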

[tool_calling_agent_test_bench.py](rai_bench/examples/tool_calling_agent_test_bench.py) - Script providing benchmark on tasks based on the ROS2 tools usage.
### Running

To set up tracing backends, please follow the instructions in the [tracing.md](../../docs/tracing.md) document.

@@ -124,8 +151,33 @@ To run the benchmark:
```bash
cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/tool_calling_agent_test_bench.py
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py
```

There are also flags to declare the model type and vendor:

```bash
python src/rai_bench/rai_bench/examples/tool_calling_agent/main.py --model-name llama3.2 --vendor ollama
```

> [!NOTE]
> The `simple_model` from [config.toml](../../config.toml) is currently set up in the example benchmark script. Change it to `complex_model` in the script if needed.
> The vendor configs are defined in [config.toml](../../config.toml). Change them if needed.

## Testing Models

To test multiple models, different benchmarks, or several repeats in one go, use the [test_models](./rai_bench/examples/test_models.py) script.

Modify these params:

```python
models_name = ["llama3.2", "qwen2.5:7b"]
vendors = ["ollama", "ollama"]
benchmarks = ["tool_calling_agent"]
repeats = 1
```

to your liking and run the script!

```bash
python src/rai_bench/rai_bench/examples/test_models.py
```
17 changes: 17 additions & 0 deletions src/rai_bench/rai_bench/docs/tool_calling_agent_benchmark.md
@@ -0,0 +1,17 @@
# Tool Calling Agent Benchmark

## Validation

Our validation flow consists of:

- SubTask - the smallest block, responsible for validating a single tool call (e.g. ListTopics)
- Validator - consists of subtasks. Based on the validator type, it checks whether all subtasks were completed in a certain way
- Task - consists of validators. Every Validator can be treated as a single step that is scored atomically. Visit examples/tool_calling_agent/tasks.py for more intuition. Every Task always has the same prompt and available tools; only the validation methods can be parameterized. On top of validators, you can pass the extra_tool_calls param to allow the model to correct itself (see the sketch below).
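
A minimal sketch of that parametrization, reusing names from the README example (the exact constructor signature is an assumption based on this description, not a verified API):

```python
# Hypothetical parametrization - names follow the README example and the
# description above; the exact signature is an assumption.
task = GetROS2TopicsTask(
    validators=[topics_ord_val],  # each validator is one atomically scored step
    extra_tool_calls=2,  # assumption: allow up to 2 surplus calls so the model can correct itself
)
```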

![Tool calling agent validation schema](imgs/tool_calling_agent_valid_schema.png)

Implementations can be found in:

- [Validators](../tool_calling_agent/validators.py)
- [SubTasks](../tool_calling_agent/tasks/subtasks.py)
- [Tasks](../tool_calling_agent/tasks/), including navigation, spatial reasoning, custom interfaces and others
@@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

########### EXAMPLE USAGE ###########
import argparse
import logging
import time
from datetime import datetime
@@ -23,7 +23,7 @@
from langchain.tools import BaseTool
from rai.agents.conversational_agent import create_conversational_agent
from rai.communication.ros2.connectors import ROS2Connector
from rai.initialization import get_llm_model
from rai.initialization import get_llm_model_direct
from rai.tools.ros2 import (
    GetObjectPositionsTool,
    GetROS2ImageTool,
@@ -32,26 +32,41 @@
)
from rai_open_set_vision.tools import GetGrabbingPointTool

from rai_bench.benchmark_model import Benchmark
from rai_bench.o3de_test_bench.scenarios import (
from rai_bench.examples.manipulation_o3de.scenarios import (
    easy_scenarios,
    hard_scenarios,
    medium_scenarios,
    trivial_scenarios,
    very_hard_scenarios,
)
from rai_bench.manipulation_o3de.benchmark import Benchmark
from rai_sim.o3de.o3de_bridge import (
    O3DEngineArmManipulationBridge,
)

if __name__ == "__main__":

def parse_args():
    parser = argparse.ArgumentParser(description="Run the Tool Calling Agent Benchmark")
    parser.add_argument(
        "--model-name",
        type=str,
        help="Model name to use for benchmarking",
    )
    parser.add_argument(
        "--vendor", type=str, default=None, help="Vendor of the model (optional)"
    )
    return parser.parse_args()


def run_benchmark(model_name: str, vendor: str):
    rclpy.init()
    connector = ROS2Connector()
    node = connector.node
    node.declare_parameter("conversion_ratio", 1.0)

    # define model
    llm = get_llm_model(model_type="complex_model", streaming=True)

    llm = get_llm_model_direct(model_name=model_name, vendor=vendor)

    system_prompt = """
You are a robotic arm with interfaces to detect and manipulate objects.
@@ -78,9 +93,7 @@
    ]
    # define loggers
    now = datetime.now()
    experiment_dir = (
        f"src/rai_bench/rai_bench/experiments/{now.strftime('%Y-%m-%d_%H-%M-%S')}"
    )
    experiment_dir = f"src/rai_bench/rai_bench/experiments/o3de_manipulation/{now.strftime('%Y-%m-%d_%H-%M-%S')}"
    Path(experiment_dir).mkdir(parents=True, exist_ok=True)
    log_file = f"{experiment_dir}/benchmark.log"
    file_handler = logging.FileHandler(log_file)
@@ -99,7 +112,7 @@
    agent_logger.setLevel(logging.INFO)
    agent_logger.addHandler(file_handler)

    configs_dir = "src/rai_bench/rai_bench/o3de_test_bench/configs/"
    configs_dir = "src/rai_bench/rai_bench/examples/manipulation_o3de/configs/"
    connector_path = configs_dir + "o3de_config.yaml"
    #### Create scenarios manually
    # load different scenes
@@ -171,7 +184,7 @@
        logger=bench_logger,
        results_filename=results_filename,
    )
    for i in range(len(all_scenarios)):
    for _ in range(len(all_scenarios)):
        agent = create_conversational_agent(
            llm, tools, system_prompt, logger=agent_logger
        )
@@ -190,3 +203,8 @@
    connector.shutdown()
    o3de.shutdown()
    rclpy.shutdown()


if __name__ == "__main__":
    args = parse_args()
    run_benchmark(model_name=args.model_name, vendor=args.vendor)
@@ -18,12 +18,9 @@

from rclpy.impl.rcutils_logger import RcutilsLogger

from rai_bench.benchmark_model import (
    Benchmark,
    Scenario,
    Task,
)
from rai_bench.o3de_test_bench.tasks import (
from rai_bench.manipulation_o3de.benchmark import Benchmark, Scenario
from rai_bench.manipulation_o3de.interfaces import Task
from rai_bench.manipulation_o3de.tasks import (
    BuildCubeTowerTask,
    GroupObjectsTask,
    MoveObjectsToLeftTask,
43 changes: 43 additions & 0 deletions src/rai_bench/rai_bench/examples/test_models.py
@@ -0,0 +1,43 @@
# Copyright (C) 2025 Robotec.AI
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import rai_bench.examples.manipulation_o3de.main as manipulation_o3de_bench
import rai_bench.examples.tool_calling_agent.main as tool_calling_agent_bench

if __name__ == "__main__":
    models_name = ["llama3.2", "qwen2.5:7b"]
    vendors = ["ollama", "ollama"]
    benchmarks = ["tool_calling_agent"]
    repeats = 1
    if len(models_name) != len(vendors):
        raise ValueError("Number of passed models must match number of passed vendors")
    else:
        for benchmark in benchmarks:
            for i, model_name in enumerate(models_name):
                for u in range(repeats):
                    try:
                        if benchmark == "tool_calling_agent":
                            tool_calling_agent_bench.run_benchmark(
                                model_name=model_name, vendor=vendors[i]
                            )
                        elif benchmark == "manipulation_o3de":
                            manipulation_o3de_bench.run_benchmark(
                                model_name=model_name, vendor=vendors[i]
                            )
                        else:
                            print(f"No benchmark named: {benchmark}")
                    except Exception as e:
                        print(
                            f"Failed to run {benchmark} benchmark for {model_name}, vendor: {vendors[i]}, execution number: {u + 1}, because: {str(e)}"
                        )