switch to regex check for tool arg value #41
Conversation
**Walkthrough**

Implements regex-based argument matching in tool call evaluation, updates corresponding tests, and revises documentation to state and exemplify regex usage for argument verification.
**Sequence Diagram(s)**

```mermaid
sequenceDiagram
    autonumber
    actor Tester
    participant Evaluator as ToolCallEvaluator
    participant Matcher as _compare_tool_arguments
    Tester->>Evaluator: evaluate(expected_tool_call, actual_tool_call)
    Evaluator->>Matcher: compare args (expected_args, actual_args)
    rect rgba(230, 240, 255, 0.5)
        note right of Matcher: Key presence checks
        Matcher->>Matcher: verify all expected keys exist
        Matcher->>Matcher: detect extra keys (if any)
    end
    loop For each expected key
        rect rgba(235, 255, 235, 0.5)
            note right of Matcher: Regex-based value match
            Matcher->>Matcher: compile/search pattern with re.search
            alt Pattern matches
                Matcher-->>Evaluator: continue
            else Invalid regex
                Matcher-->>Evaluator: return False (log regex error)
            else No match
                Matcher-->>Evaluator: return False (log "pattern not found")
            end
        end
    end
    Evaluator-->>Tester: return True/False
```
**Estimated code review effort**: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 0
🧹 Nitpick comments (7)
lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py (3)
**71-79: Docstring: clarify stringification behavior**

State explicitly that non-string values are coerced to strings before regex matching.

Apply:

```diff
-"""Compare tool arguments name & value (regex pattern for the value)."""
+"""Compare tool argument names and values.
+
+Values are matched by applying the expected regex pattern against str(actual_value).
+Non-string expected values are also stringified unless handled specially."""
```
**89-109: Avoid regex on non-strings; add typed and recursive matching**

Stringifying lists/dicts and applying regex can produce false positives/negatives and brittle behavior. Recommend:

- Use regex only when expected is a string.
- For lists, require same length and compare elements (regex for strings, equality otherwise).
- For dicts nested as values, recurse.

Apply:

```diff
-        expected_str = str(expected_value)
-        actual_str = str(actual_value)
-
-        # Use regex search for flexible matching
-        # This is a quick work-around, enhance this to use both regex & exact match.
-        try:
-            if not re.search(expected_str, actual_str):
-                logger.debug(
-                    "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
-                    key,
-                    expected_str,
-                    actual_str,
-                )
-                return False
-        except re.error as e:
-            logger.debug(
-                "Invalid regex pattern '%s' for key '%s': %s", expected_str, key, e
-            )
-            # If regex is invalid, fail the comparison
-            return False
+        # Typed/recursive comparison
+        def _match_value(exp: Any, act: Any, path: str) -> bool:
+            if isinstance(exp, str):
+                try:
+                    if not re.search(exp, str(act)):
+                        logger.debug(
+                            "Argument value mismatch for '%s': pattern '%s' not found in '%s'",
+                            path, exp, str(act),
+                        )
+                        return False
+                    return True
+                except re.error as e:
+                    logger.debug(
+                        "Invalid regex pattern '%s' for key '%s': %s", exp, path, e
+                    )
+                    return False
+            if isinstance(exp, list):
+                if not isinstance(act, list) or len(exp) != len(act):
+                    logger.debug(
+                        "List length/type mismatch for '%s': expected %s, got %s",
+                        path, type(exp).__name__, type(act).__name__,
+                    )
+                    return False
+                for i, (e_i, a_i) in enumerate(zip(exp, act)):
+                    if not _match_value(e_i, a_i, f"{path}[{i}]"):
+                        return False
+                return True
+            if isinstance(exp, dict):
+                if not isinstance(act, dict):
+                    logger.debug(
+                        "Type mismatch for '%s': expected dict, got %s", path, type(act)
+                    )
+                    return False
+                # Compare only expected keys; extras handled by outer check
+                for k_i, v_i in exp.items():
+                    if k_i not in act:
+                        logger.debug("Missing nested key '%s' under '%s'", k_i, path)
+                        return False
+                    if not _match_value(v_i, act[k_i], f"{path}.{k_i}"):
+                        return False
+                return True
+            # Fall back to exact equality for numbers, bool, None, etc.
+            if exp != act:
+                logger.debug("Value mismatch for '%s': expected %r, got %r", path, exp, act)
+                return False
+            return True
+
+        if not _match_value(expected_value, actual_value, key):
+            return False
```
**95-103: Consider fullmatch or explicit anchoring for stricter semantics**

re.search permits partial matches; many users expect entire-value matching by default. Optionally switch to re.fullmatch or document recommending ^...$ in patterns.
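The difference in strictness can be shown in a few lines (a standalone illustration, not project code):

```python
import re

# re.search: succeeds on a partial match anywhere in the value
assert re.search("abc-\\w+", "my-abc-pod-0") is not None

# re.fullmatch: the entire value must match the pattern
assert re.fullmatch("abc-\\w+", "my-abc-pod-0") is None
assert re.fullmatch("abc-\\w+", "abc-pod") is not None

# Explicit anchors give re.search the same strictness as re.fullmatch
assert re.search("^abc-\\w+$", "my-abc-pod-0") is None
```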
lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py (1)

**129-156: Regex test coverage — good; add a list-element regex case**

Nice case. Recommend adding a list-valued argument with per-item regex to ensure nested handling (e.g., oc_get_args list).

Example:

```diff
+    def test_list_arguments_with_item_regex(self):
+        expected = [[{
+            "tool_name": "oc_get",
+            "arguments": {"oc_get_args": ["namespaces", "ns-\\w+"]}
+        }]]
+        actual = [[{
+            "tool_name": "oc_get",
+            "arguments": {"oc_get_args": ["namespaces", "ns-dev"]}
+        }]]
+        assert compare_tool_calls(expected, actual)
```

lsc_agent_eval/README.md (3)
**13-13: Grammar and clarity**

Tighten wording and hyphenation.

Apply:

```diff
-  - `tool_eval`: Tool call evaluation comparing expected vs actual tool calls with arguments, Only regex pattern check (case sensitive) is done for argument value
+  - `tool_eval`: Tool call evaluation comparing expected vs. actual tool calls with arguments. Only a regex pattern check (case-sensitive) is performed for argument values.
```
**69-69: Grammar consistency**

Pluralize "value" and mirror phrasing above.

Apply:

```diff
-- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval), Regex pattern check is done for argument value
+- `expected_tool_calls`: Expected tool call sequences (list of lists) with tool_name and arguments (for tool_eval). A regex pattern check is used for argument values.
```
**136-143: YAML example: clarify list vs string and recommend anchoring**

If oc_get_args is commonly a list, show a list with a regex item. Also nudge users to anchor when exact matching is desired.

Apply:

```diff
-  - eval_id: eval3
-    eval_query: get the log for the abc-pod
-    eval_types: [tool_eval]
-    expected_tool_calls:
-      - - tool_name: get_logs
-          arguments:
-            oc_get_args: abc-\\w+
+  - eval_id: eval3
+    eval_query: get the log for the abc-pod
+    eval_types: [tool_eval]
+    expected_tool_calls:
+      - - tool_name: get_logs
+          arguments:
+            # Use a regex for the pod name; anchor with ^$ for exact matches
+            oc_get_args: ["^abc-\\w+$"]
```

Also consider adding a short note after the example:

- All values are matched via regex against their string representation.
- Use ^ and $ to enforce whole-value matches.
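Assuming element-wise matching for list-valued arguments (as recommended earlier in this review), the anchored pattern from the revised YAML would behave roughly like this. `match_list` is a hypothetical helper for illustration only:

```python
import re

expected = ["^abc-\\w+$"]    # anchored regex items, as in the revised YAML example
actual_ok = ["abc-pod"]
actual_bad = ["my-abc-pod"]  # extra prefix should now be rejected


def match_list(patterns: list[str], values: list[object]) -> bool:
    # Element-wise regex match against stringified values; lengths must agree
    return len(patterns) == len(values) and all(
        re.search(p, str(v)) for p, v in zip(patterns, values)
    )


assert match_list(expected, actual_ok)
assert not match_list(expected, actual_bad)
```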
📜 Review details

**Configuration used**: CodeRabbit UI
**Review profile**: CHILL
**Plan**: Pro

**💡 Knowledge Base configuration:**

- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

**📥 Commits**

Reviewing files that changed from the base of the PR and between 2bbafdd7190df1c749fddea4031405c31ed8b8c9 and b888ebfb83f859057558f2668f27ce5705a15789.

**📒 Files selected for processing (3)**

* `lsc_agent_eval/README.md` (3 hunks)
* `lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py` (3 hunks)
* `lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py` (2 hunks)

**🧠 Learnings (6)** — all applied to `lsc_agent_eval/README.md`

- 2025-08-26T11:17:48.640Z (asamal4, PR #28, `lsc_eval/runner.py`): The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
- 2025-07-16T13:20:45.006Z (asamal4, PR #19, `test_evaluator.py`): In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
- 2025-07-16T13:20:40.632Z (asamal4, PR #19, `test_evaluator.py`): In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
- 2025-07-16T13:21:53.225Z (asamal4, PR #19, `test_evaluator.py`): In the lsc_agent_eval framework, there's a distinction between FAIL and ERROR results: FAIL means the evaluation failed (e.g., script verification failed, agent response doesn't match expected criteria) and result.error is None; ERROR means an error running the eval (e.g., setup script failed, agent API error) and result.error contains an error message.
- 2025-07-16T10:41:09.399Z (asamal4, PR #19, `test_evaluator.py`): In the lsc_agent_eval package, the team prefers to focus on core functionality testing first and considers testing cleanup script execution after setup failure as early optimization, noting that there's no guarantee cleanup scripts will run successfully anyway.
- 2025-07-28T14:26:03.119Z (asamal4, PR #22, `eval_data.py`): In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as a warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

**🧬 Code graph analysis (1)**

`lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py` uses `compare_tool_calls` (lines 10-19) from `lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py`.

**🪛 LanguageTool**

`lsc_agent_eval/README.md`, [grammar] ~69: There might be a mistake here. Context: "...pattern check is done for argument value - `eval_verify_script`: Verification script (for action_eval e..." (QB_NEW_EN)

**⏰ Context from checks skipped due to timeout of 90000ms (1)**

* GitHub Check: mypy

**🔇 Additional comments (4)**

`lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/tool_call_eval.py` (2)

- `4-4`: **Import re — OK.** Needed for regex matching. No issues.
- `110-116`: **Extra-key rejection — OK.** Strictness here is good; keeps evaluations predictable.

`lsc_agent_eval/tests/core/agent_goal_eval/test_tool_call_eval.py` (2)

- `95-101`: **Updated mismatch assertion — OK.** Asserts new regex-oriented log message; looks correct.
- `157-163`: **Invalid regex handling — OK.** Covers the failure path for bad patterns.
**tisnik** left a comment:
LGTM
Quick work-around to check dynamic argument values for tool eval.
Note: long-term we should make this configurable to do either a pattern or an exact match (requires a config change).
**Summary by CodeRabbit**

- New Features
- Documentation
- Tests