64 changes: 59 additions & 5 deletions resources_servers/terminus_judge/README.md
@@ -1,11 +1,65 @@
# Description

This is a resource server for verifying terminal-based agent actions. It evaluates agent responses that represent terminal command sequences against expected answers. The server supports two different schema formats (`terminus_1` and `terminus_2`) for terminal interaction tasks.

For each verification request, the agent's JSON output is validated through multiple checks:
1. **JSON Parsing**: The model output must be valid JSON
2. **Schema Validation**: The response must conform to the specified harness schema (`terminus_1` or `terminus_2`)
3. **Task Completion**: If the expected answer requires task completion, the agent must also indicate completion
4. **Command Correctness**: The predicted keystrokes must match the expected keystrokes, in order
   - This is evaluated via string similarity and an LLM-as-judge equivalence check
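
The checks above can be sketched in a few lines of Python. This is a minimal illustration, not the server's implementation: `sketch_verify` is a hypothetical name, schema validation is reduced to a structural check, `difflib.SequenceMatcher` stands in for whatever similarity measure the server uses, and the LLM-as-judge step is omitted entirely.

```python
import json
from difflib import SequenceMatcher

# Completion flag per harness, taken from the schema lists in this README.
COMPLETE_KEY = {"terminus_1": "is_task_complete", "terminus_2": "task_complete"}


def sketch_verify(model_output, expected_answer, harness, threshold):
    """Return 1.0 if the output passes every check in this sketch, else 0.0."""
    # 1. JSON parsing: the model output must be valid JSON.
    try:
        predicted = json.loads(model_output)
    except json.JSONDecodeError:
        return 0.0
    expected = json.loads(expected_answer)

    # 2. Schema validation, reduced here to "an object holding a list of
    #    command objects, each with string keystrokes" (assumption).
    commands = predicted.get("commands") if isinstance(predicted, dict) else None
    if not isinstance(commands, list) or not all(
        isinstance(c, dict) and isinstance(c.get("keystrokes"), str)
        for c in commands
    ):
        return 0.0

    # 3. Task completion: required only when the expected answer sets the flag.
    key = COMPLETE_KEY[harness]
    if expected.get(key) and not predicted.get(key):
        return 0.0

    # 4. Command correctness: same number of commands, compared in order.
    pred_keys = [c["keystrokes"] for c in commands]
    exp_keys = [c["keystrokes"] for c in expected["commands"]]
    if len(pred_keys) != len(exp_keys):
        return 0.0
    for pred, exp in zip(pred_keys, exp_keys):
        if SequenceMatcher(None, pred, exp).ratio() < threshold:
            return 0.0
    return 1.0
```

How the real server combines the similarity score with the judge verdict is not described here, so treat the all-or-nothing reward above as one possible reading.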


## Supported Schemas

### terminus_1
- `state_analysis`: Description of the current terminal state
- `explanation`: Brief explanation of what the commands will do
- `commands`: List of command objects with `keystrokes`, `is_blocking`, and `timeout_sec`
- `is_task_complete`: Boolean indicating if the task is complete

### terminus_2
- `analysis`: Analysis of the current state based on terminal output
- `plan`: Description of the plan for next steps
- `commands`: List of command objects with `keystrokes` and optional `duration`
- `task_complete`: Boolean indicating if the task is complete (optional)
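
To make the two schemas concrete, the snippet below builds one illustrative response per harness. The field names come from the lists above; the values, and the assumption that `is_blocking` is a boolean and `timeout_sec` / `duration` are numbers, are placeholders chosen for the example.

```python
import json

# Illustrative terminus_1 response (values are made up for the example).
terminus_1_response = {
    "state_analysis": "The shell is at the project root; nothing has been run yet.",
    "explanation": "List the directory contents.",
    "commands": [
        {"keystrokes": "ls -la\n", "is_blocking": True, "timeout_sec": 5.0}
    ],
    "is_task_complete": False,
}

# Illustrative terminus_2 response; `duration` and `task_complete` are optional.
terminus_2_response = {
    "analysis": "The shell is at the project root; nothing has been run yet.",
    "plan": "List the directory contents, then inspect the README.",
    "commands": [
        {"keystrokes": "ls -la\n", "duration": 0.1},
        {"keystrokes": "cat README.md\n"},
    ],
    "task_complete": False,
}

# The agent is expected to emit these as JSON text.
terminus_1_json = json.dumps(terminus_1_response, indent=2)
terminus_2_json = json.dumps(terminus_2_response, indent=2)
```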


# Example usage

## Running servers

The following command can be used to run this resource server, along with the simple agent and a policy model:

```bash
config_paths="resources_servers/terminus_judge/configs/terminus_judge.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml"

ng_run "+config_paths=[$config_paths]" \
+terminus_judge_resources_server.resources_servers.terminus_judge.judge_responses_create_params.max_output_tokens=512
```

Then, rollouts can be collected using a command such as the following:

```bash
ng_collect_rollouts +agent_name=terminus_judge_simple_agent \
+input_jsonl_fpath=resources_servers/terminus_judge/data/example.jsonl \
+output_jsonl_fpath=resources_servers/terminus_judge/example_rollouts.jsonl
```

## Expected Data Format

Each data sample should include:
- `expected_answer`: A JSON string containing the expected terminal commands
- `metadata.harness`: Either `"terminus_1"` or `"terminus_2"` to specify the schema format
- `threshold`: String-similarity threshold used to calculate the reward
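
As a rough sketch, a sample carrying these three fields could be serialized to a JSONL line as follows. The top-level placement of `expected_answer` and `threshold`, the value `0.9`, and the absence of any other fields (such as the model input) are assumptions made for illustration; consult `data/example.jsonl` for the actual layout.

```python
import json

# The expected answer is itself a JSON string, here in the terminus_2 schema.
expected_answer = json.dumps(
    {
        "analysis": "The working directory has not been inspected yet.",
        "plan": "List the directory contents.",
        "commands": [{"keystrokes": "ls -la\n", "duration": 0.1}],
        "task_complete": False,
    }
)

sample = {
    "expected_answer": expected_answer,      # JSON string, not a nested object
    "metadata": {"harness": "terminus_2"},   # selects the schema format
    "threshold": 0.9,                        # hypothetical similarity threshold
}

# One sample per line in the input JSONL file.
jsonl_line = json.dumps(sample)
```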

# Licensing information
Code: Apache 2.0<br>
Data: TBD

## Dependencies

- nemo_gym: Apache 2.0
- openapi-schema-validator: BSD-3-Clause