
Conversation


@asamal4 asamal4 commented Sep 2, 2025

Quick workaround to check dynamic argument values for tool eval.

Note: Long-term, we should make this configurable to support either pattern or exact matching (requires a config change).

Summary by CodeRabbit

  • New Features

    • Tool call evaluation now supports regex-based matching for argument values, replacing exact matches.
    • Clearer mismatch messages indicating when a pattern is not found.
    • Invalid regex patterns are detected and reported gracefully.
  • Documentation

    • Updated guidance to emphasize regex-based argument checks.
    • Added a detailed example demonstrating regex usage in evaluations.
    • Noted LiteLLM integration in the Features section.
    • Clarified sample configuration for expected tool calls.
  • Tests

    • Added tests for regex matching and invalid patterns.
    • Updated expectations for mismatch logging.
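
The regex-based comparison described above can be sketched roughly as follows. This is an illustrative standalone version, not the actual implementation: the function name, signature, and exact log wording here are assumptions; the real logic lives in `_compare_tool_arguments` in `tool_call_eval.py`.

```python
import logging
import re
from typing import Any

logger = logging.getLogger(__name__)


def compare_argument_values(expected: dict[str, Any], actual: dict[str, Any]) -> bool:
    """Return True when every expected value's regex is found in the actual value."""
    # Reject missing or extra keys up front.
    if set(expected) != set(actual):
        logger.debug(
            "Argument key mismatch: expected %s, got %s",
            sorted(expected), sorted(actual),
        )
        return False
    for key, pattern in expected.items():
        value = str(actual[key])  # non-string values are stringified before matching
        try:
            if not re.search(str(pattern), value):
                logger.debug(
                    "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
                    key, pattern, value,
                )
                return False
        except re.error as exc:
            # An invalid regex fails the comparison rather than raising.
            logger.debug("Invalid regex pattern '%s' for key '%s': %s", pattern, key, exc)
            return False
    return True
```

For example, `compare_argument_values({"pod": r"abc-\w+"}, {"pod": "abc-123"})` returns `True`, while an unparseable pattern such as `"["` simply fails the evaluation.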


coderabbitai bot commented Sep 2, 2025

Walkthrough

Implements regex-based argument matching in tool call evaluation, updates corresponding tests, and revises documentation to state and exemplify regex usage for argument verification.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation<br>`lsc_agent_eval/README.md` | Clarifies that `tool_eval` checks argument values via regex (case-sensitive), updates examples and the feature list, and adds an example (`eval3`) demonstrating pattern matching. |
| Core logic: regex argument matching<br>`lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py` | Replaces exact equality with regex-based value matching using `re.search`; handles missing/extra keys; logs mismatches and invalid regex errors; updates docstrings accordingly. |
| Tests<br>`lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py` | Adjusts log expectations for regex mismatch wording; adds tests for valid regex matching across multiple args and for invalid regex pattern failure. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  autonumber
  actor Tester
  participant Evaluator as ToolCallEvaluator
  participant Matcher as _compare_tool_arguments

  Tester->>Evaluator: evaluate(expected_tool_call, actual_tool_call)
  Evaluator->>Matcher: compare args (expected_args, actual_args)

  rect rgba(230, 240, 255, 0.5)
    note right of Matcher: Key presence checks
    Matcher->>Matcher: verify all expected keys exist
    Matcher->>Matcher: detect extra keys (if any)
  end

  loop For each expected key
    rect rgba(235, 255, 235, 0.5)
      note right of Matcher: Regex-based value match
      Matcher->>Matcher: compile/search pattern with re.search
      alt Pattern matches
        Matcher-->>Evaluator: continue
      else Invalid regex
        Matcher-->>Evaluator: return False (log regex error)
      else No match
        Matcher-->>Evaluator: return False (log "pattern not found")
      end
    end
  end

  Evaluator-->>Tester: return True/False
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I twitch my nose at patterns fine,
Regex carrots on the vine—
pod-\d+ hops into view,
Logs align like morning dew.
If brackets slip, I tap my paw:
“Invalid!”—such regex law.
Thump! The tests all pass anew. 🥕🐇




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)

71-79: Docstring: clarify stringification behavior

State explicitly that non-string values are coerced to strings before regex matching.

Apply:

```diff
-"""Compare tool arguments name & value (regex pattern for the value)."""
+"""Compare tool argument names and values.
+
+Values are matched by applying the expected regex pattern against str(actual_value).
+Non-string expected values are also stringified unless handled specially."""
```

89-109: Avoid regex on non-strings; add typed and recursive matching

Stringifying lists/dicts and applying regex can produce false positives/negatives and brittle behavior. Recommend:

  • Use regex only when expected is a string.
  • For lists, require same length and compare elements (regex for strings, equality otherwise).
  • For dicts nested as values, recurse.

Apply:

```diff
-        expected_str = str(expected_value)
-        actual_str = str(actual_value)
-
-        # Use regex search for flexible matching
-        # This is a quick work-around, enhance this to use both regex & exact match.
-        try:
-            if not re.search(expected_str, actual_str):
-                logger.debug(
-                    "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
-                    key,
-                    expected_str,
-                    actual_str,
-                )
-                return False
-        except re.error as e:
-            logger.debug(
-                "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
-            )
-            # If regex is invalid, fail the comparison
-            return False
+        # Typed/recursive comparison
+        def _match_value(exp: Any, act: Any, path: str) -> bool:
+            if isinstance(exp, str):
+                try:
+                    if not re.search(exp, str(act)):
+                        logger.debug(
+                            "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
+                            path, exp, str(act),
+                        )
+                        return False
+                    return True
+                except re.error as e:
+                    logger.debug(
+                        "Invalid regex pattern '%s' for key '%s': %s", exp, path, e
+                    )
+                    return False
+            if isinstance(exp, list):
+                if not isinstance(act, list) or len(exp) != len(act):
+                    logger.debug("List length/type mismatch for '%s': expected %s, got %s",
+                                 path, type(exp).__name__, type(act).__name__)
+                    return False
+                for i, (e_i, a_i) in enumerate(zip(exp, act)):
+                    if not _match_value(e_i, a_i, f"{path}[{i}]"):
+                        return False
+                return True
+            if isinstance(exp, dict):
+                if not isinstance(act, dict):
+                    logger.debug("Type mismatch for '%s': expected dict, got %s",
+                                 path, type(act))
+                    return False
+                # Compare only expected keys; extras handled by outer check
+                for k_i, v_i in exp.items():
+                    if k_i not in act:
+                        logger.debug("Missing nested key '%s' under '%s'", k_i, path)
+                        return False
+                    if not _match_value(v_i, act[k_i], f"{path}.{k_i}"):
+                        return False
+                return True
+            # Fallback to exact equality for numbers, bool, None, etc.
+            if exp != act:
+                logger.debug("Value mismatch for '%s': expected %r, got %r", path, exp, act)
+                return False
+            return True
+
+        if not _match_value(expected_value, actual_value, key):
+            return False
```

95-103: Consider fullmatch or explicit anchoring for stricter semantics

re.search permits partial matches; many users expect entire-value matching by default. Optionally switch to re.fullmatch or document recommending ^...$ in patterns.
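
To illustrate the difference with plain `re` (no project code assumed):

```python
import re

# re.search succeeds on a partial match anywhere in the value...
print(bool(re.search(r"pod-\d+", "my-pod-42-sidecar")))    # True

# ...while re.fullmatch requires the entire value to match.
print(bool(re.fullmatch(r"pod-\d+", "my-pod-42-sidecar")))  # False

# Anchoring the pattern gives re.search the same strict semantics.
print(bool(re.search(r"^pod-\d+$", "my-pod-42-sidecar")))   # False
print(bool(re.search(r"^pod-\d+$", "pod-42")))              # True
```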

lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)

129-156: Regex test coverage — good; add a list-element regex case

Nice case. Recommend adding a list-valued argument with per-item regex to ensure nested handling (e.g., oc_get_args list).

Example:

```diff
+    def test_list_arguments_with_item_regex(self):
+        expected = [[{
+            "tool_name": "oc_get",
+            "arguments": {"oc_get_args": ["namespaces", "ns-\\w+"]}
+        }]]
+        actual = [[{
+            "tool_name": "oc_get",
+            "arguments": {"oc_get_args": ["namespaces", "ns-dev"]}
+        }]]
+        assert compare_tool_calls(expected, actual)
```
lsc_agent_eval/README.md (3)

13-13: Grammar and clarity

Tighten wording and hyphenation.

Apply:

```diff
-  - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments, Only regex pattern check (case sensitive) is done for argument value
+  - `tool_eval`: Tool call evaluation comparing expected vs. actual tool calls with arguments. Only a regex pattern check (case-sensitive) is performed for argument values.
```

69-69: Grammar consistency

Pluralize “value” and mirror phrasing above.

Apply:

```diff
-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval), Regex pattern check is done for argument value
+- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval). A regex pattern check is used for argument values.
```

136-143: YAML example: clarify list vs string and recommend anchoring

If oc_get_args is commonly a list, show a list with a regex item. Also nudge users to anchor when exact matching is desired.

Apply:

```diff
-    - eval_id: eval3
-      eval_query: get the log for the abc-pod
-      eval_types: [tool_eval]
-      expected_tool_calls:
-        - - tool_name: get_logs
-            arguments:
-              oc_get_args: abc-\\w+
+    - eval_id: eval3
+      eval_query: get the log for the abc-pod
+      eval_types: [tool_eval]
+      expected_tool_calls:
+        - - tool_name: get_logs
+            arguments:
+              # Use a regex for the pod name; anchor with ^$ for exact matches
+              oc_get_args: ["^abc-\\w+$"]
```

Also consider adding a short note after the example:

  • All values are matched via regex against their string representation.
  • Use ^ and $ to enforce whole-value matches.
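
Assuming the stringify-then-search behavior described in this PR, two gotchas are worth noting; this is a standalone `re` demonstration, not project code:

```python
import re

# Non-string values are coerced with str() before matching, which combined
# with re.search's partial-match semantics can surprise:
print(bool(re.search("5", str(15))))     # True — "5" is found inside "15"
print(bool(re.search("^5$", str(15))))   # False — anchoring restores exactness

# A list argument is stringified wholesale, so the pattern must tolerate
# Python's repr formatting (brackets and quotes):
print(bool(re.search(r"abc-\w+", str(["namespaces", "abc-pod"]))))  # True
```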


<details>
<summary>📜 Review details</summary>

**Configuration used**: CodeRabbit UI

**Review profile**: CHILL

**Plan**: Pro

**💡 Knowledge Base configuration:**

- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 2bbafdd7190df1c749fddea4031405c31ed8b8c9 and b888ebfb83f859057558f2668f27ce5705a15789.

</details>

<details>
<summary>📒 Files selected for processing (3)</summary>

* `lsc_agent_eval/README.md` (3 hunks)
* `lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py` (3 hunks)
* `lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py` (2 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🧠 Learnings (6)</summary>

<details>
<summary>📚 Learning: 2025-08-26T11:17:48.640Z</summary>

Learnt from: asamal4
PR: #28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.


**Applied to files:**
- `lsc_agent_eval/README.md`

</details>
<details>
<summary>📚 Learning: 2025-07-16T13:20:45.006Z</summary>

Learnt from: asamal4
PR: #19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).


**Applied to files:**
- `lsc_agent_eval/README.md`

</details>
<details>
<summary>📚 Learning: 2025-07-16T13:20:40.632Z</summary>

Learnt from: asamal4
PR: #19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).


**Applied to files:**
- `lsc_agent_eval/README.md`

</details>
<details>
<summary>📚 Learning: 2025-07-16T13:21:53.225Z</summary>

Learnt from: asamal4
PR: #19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:190-214
Timestamp: 2025-07-16T13:21:53.225Z
Learning: In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results:

  • FAIL: evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) - result.error is None
  • ERROR: error running eval (e.g., setup script failed, agent API error) - result.error contains error message

**Applied to files:**
- `lsc_agent_eval/README.md`

</details>
<details>
<summary>📚 Learning: 2025-07-16T10:41:09.399Z</summary>

Learnt from: asamal4
PR: #19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:274-297
Timestamp: 2025-07-16T10:41:09.399Z
Learning: In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.


**Applied to files:**
- `lsc_agent_eval/README.md`

</details>
<details>
<summary>📚 Learning: 2025-07-28T14:26:03.119Z</summary>

Learnt from: asamal4
PR: #22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.


**Applied to files:**
- `lsc_agent_eval/README.md`

</details>

</details><details>
<summary>🧬 Code graph analysis (1)</summary>

<details>
<summary>lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)</summary><blockquote>

<details>
<summary>lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (1)</summary>

* `compare_tool_calls` (10-19)

</details>

</blockquote></details>

</details><details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>lsc_agent_eval/README.md</summary>

[grammar] ~69-~69: There might be a mistake here.
Context: ...pattern check is done for argument value - `eval_verify_script`: Verification script (for action_eval e...

(QB_NEW_EN)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)</summary>

* GitHub Check: mypy

</details>

<details>
<summary>🔇 Additional comments (4)</summary><blockquote>

<details>
<summary>lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (2)</summary><blockquote>

`4-4`: **Import re — OK**

Needed for regex matching. No issues.

---

`110-116`: **Extra-key rejection — OK**

Strictness here is good; keeps evaluations predictable.

</blockquote></details>
<details>
<summary>lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (2)</summary><blockquote>

`95-101`: **Updated mismatch assertion — OK**

Asserts new regex-oriented log message; looks correct.

---

`157-163`: **Invalid regex handling — OK**

Covers the failure path for bad patterns.

</blockquote></details>

</blockquote></details>

</details>



@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit dbac173 into lightspeed-core:main Sep 2, 2025
15 checks passed