Add keyword eval metric #93
Conversation
Walkthrough

Adds a new per-turn custom metric "custom:keywords_eval" and the TurnData field `expected_keywords`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant evaluate_keywords
    participant Validator
    participant Matcher
    Caller->>evaluate_keywords: evaluate_keywords(turn_data)
    evaluate_keywords->>Validator: validate inputs (turn_data, response, expected_keywords)
    alt Validation Fails
        Validator-->>evaluate_keywords: error reason
        evaluate_keywords-->>Caller: Return (None, reason)
    else Validation Passes
        Validator-->>evaluate_keywords: ok
        evaluate_keywords->>Matcher: normalize response (lowercase)
        loop for each expected_keywords option (Option N)
            Matcher->>Matcher: check each keyword in option against response
            alt all keywords matched
                Matcher-->>evaluate_keywords: option matched
                evaluate_keywords-->>Caller: Return (1.0, "Keywords eval successful: Option X — matched: [...]")
                Note over evaluate_keywords: short-circuit on first full match
            else not all matched
                Matcher-->>evaluate_keywords: option failed (matched/unmatched details)
            end
        end
        alt no options matched
            evaluate_keywords->>evaluate_keywords: aggregate per-option matched/unmatched details
            evaluate_keywords-->>Caller: Return (0.0, "Keywords eval failed: All options failed — details: ...")
        end
    end
```
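The flow above maps cleanly onto a small pure function. The sketch below is only an illustration of the described behavior, not the code under review: the function name, the exact reason strings, and the decision to pass `response` and `expected_keywords` directly (rather than a `TurnData` object) are assumptions made for brevity.

```python
from typing import Optional


def evaluate_keywords_sketch(
    response: Optional[str],
    expected_keywords: Optional[list[list[str]]],
) -> tuple[Optional[float], str]:
    """Illustrative sketch of the described flow; not the PR's implementation.

    Returns (None, reason) for setup errors, (1.0, reason) when any alternative
    has ALL of its keywords in the response, and (0.0, reason) otherwise.
    """
    # Missing keyword configuration is a setup error (None maps to ERROR).
    if not expected_keywords:
        return None, "expected_keywords is required for custom:keywords_eval"
    # A missing/empty response is a legitimate evaluation failure (FAIL), not an error.
    if not response:
        return 0.0, "Keywords eval failed: empty response"

    normalized = response.lower()  # case-insensitive matching, lowercased once
    failure_details = []

    for i, option in enumerate(expected_keywords, start=1):
        matched = [kw for kw in option if kw.lower() in normalized]
        unmatched = [kw for kw in option if kw.lower() not in normalized]
        if not unmatched:
            # Short-circuit on the first alternative whose keywords ALL match.
            return 1.0, f"Keywords eval successful: Option {i} matched: {matched}"
        failure_details.append(f"Option {i}: matched={matched}, unmatched={unmatched}")

    return 0.0, "Keywords eval failed: All options failed. " + "; ".join(failure_details)
```

Within one option all keywords must match (AND); across options the first fully matching alternative wins (OR with short-circuit), which is why the failure reason can aggregate details for every option.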
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (2)
config/system.yaml (1)
79-81: Add explicit threshold for consistency with binary metrics. For binary metrics that return 0 or 1, an explicit threshold should be specified for consistent behavior. The `custom:intent_eval` metric (line 87) uses `threshold: 1` for similar boolean evaluation. Based on learnings.

Apply this diff:

```diff
   # Custom metrics
   "custom:keywords_eval": # boolean eval (either 0 or 1)
+    threshold: 1
     description: "Keywords (ALL) matching evaluation with alternative sets"
```

README.md (1)
314-314: Fix hyphenation for compound adjective. The phrase "case insensitive" should be hyphenated when used as a compound adjective modifying "matching".

Apply this diff:

```diff
-> - `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
+> - `expected_keywords`: Required for `custom:keywords_eval` (case-insensitive matching)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (5 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/system/validator.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
config/system.yaml
📄 CodeRabbit inference engine (AGENTS.md)
config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Files:
config/system.yaml
config/**/*.{yaml,yml}
📄 CodeRabbit inference engine (AGENTS.md)
Use YAML for configuration files under config/
Files:
config/system.yaml
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/models/data.py
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
🧠 Learnings (16)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics
Applied to files:
config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Applied to files:
config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Applied to files:
- config/system.yaml
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
- config/system.yaml
- README.md
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.pysrc/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
- src/lightspeed_evaluation/core/models/data.py
- README.md
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.
Applied to files:
README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).
Applied to files:
README.md
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).
Applied to files:
README.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
evaluate_keywords(82-129)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
evaluate_keywords(82-129)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
TurnData(35-295)
🪛 LanguageTool
README.md
[grammar] ~314-~314: Use a hyphen to join words.
Context: ...equired for custom:keywords_eval (case insensitive matching) > - `verify_script...
(QB_NEW_EN_HYPHEN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
🔇 Additional comments (16)
README.md (4)
89-89: LGTM! The metric description accurately reflects the implementation logic, correctly noting that ALL keywords must match (AND logic) with case-insensitive matching. Based on learnings.

234-234: LGTM! The `expected_keywords` example correctly demonstrates the list-of-lists format for alternative keyword sets, matching the data model structure.

241-241: LGTM! The metric is correctly positioned in the turn_metrics list, consistent with its turn-level-only implementation.

301-301: LGTM! The field documentation accurately describes the expected_keywords structure and its conditional requirement based on metric usage.
src/lightspeed_evaluation/core/metrics/custom/custom.py (2)
12-12: LGTM! The import statement correctly follows the established pattern for custom metric imports.

32-32: LGTM! The metric is correctly registered in the `supported_metrics` dictionary, following the established pattern for custom metrics. As per coding guidelines.

src/lightspeed_evaluation/core/metrics/custom/__init__.py (2)

4-4: LGTM! The import correctly exposes the new `evaluate_keywords` function from the keywords_eval module.

13-13: LGTM! The function is correctly added to `__all__` to expose it as part of the package's public API.

src/lightspeed_evaluation/core/system/validator.py (1)

43-46: LGTM! The metric requirements correctly specify the necessary fields (`response` and `expected_keywords`) for keyword evaluation, consistent with the implementation and data model.

src/lightspeed_evaluation/core/models/data.py (2)

65-68: LGTM! The `expected_keywords` field is correctly defined with appropriate type annotation and description, matching the evaluation requirements.

96-124: LGTM! The validator thoroughly checks the structure and content of `expected_keywords`, ensuring each alternative group is a non-empty list of non-empty, non-whitespace strings. Error messages include helpful index information for debugging.

src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (5)

8-24: LGTM! The input validation correctly handles all error cases. The distinction between returning `None` (ERROR) for invalid setup and `0.0` (FAIL) for missing response is semantically appropriate: no response means keywords cannot be matched, which is a legitimate evaluation failure rather than a configuration error.

27-40: LGTM! The keyword matching logic correctly implements case-insensitive substring matching with ALL-keywords-must-match semantics (AND logic), consistent with the framework's requirements. Based on learnings.

43-52: LGTM! The success result is well-formatted with a score of 1.0 and a clear, informative reason showing which option matched and listing all matched keywords.

55-79: LGTM! The failure result provides comprehensive debugging information, showing both matched and unmatched keywords for each alternative. This detailed feedback helps users understand why the evaluation failed and how to fix their data or responses.

82-129: LGTM! The main evaluation function is well-implemented with:
- Clear documentation explaining the sequential alternative-checking logic
- Proper type hints for all parameters and return values
- Efficient case-insensitive matching by lowercasing once
- Correct short-circuit behavior on first successful match
- Comprehensive error handling
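To make that contract concrete, here is a hypothetical call against the illustrative sketch from the walkthrough section above (not the project's actual API); the sample response and keyword sets are invented for this example.

```python
# Hypothetical data: option 2 ("deploy" AND "olm") fully matches, so the score is 1.0.
score, reason = evaluate_keywords_sketch(
    response="Use OLM to deploy the operator.",
    expected_keywords=[["install", "operator"], ["deploy", "olm"]],
)
assert score == 1.0

# An empty response is a FAIL (0.0); missing configuration is an ERROR (None).
assert evaluate_keywords_sketch("", [["kubectl"]])[0] == 0.0
assert evaluate_keywords_sketch("some text", None)[0] is None
```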
ab6a85a to 5f49ecd (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
README.md (1)
154-155: Verify threshold consistency with system config and consider hyphenation. Two notes:

1. Threshold documentation: A previous review flagged that the README example includes `threshold: 1` for `custom:keywords_eval`, but the actual `config/system.yaml` may be missing this field. Ensure consistency between the documentation and implementation.
2. Minor style: Consider using "case-insensitive" (with hyphen) in the description for grammatical consistency.

For the hyphenation:

```diff
-  "custom:keywords_eval": # Binary evaluation (0 or 1)
-    description: "Keywords evaluation (ALL match) with sequential alternate checking (case insensitive)"
+  "custom:keywords_eval":
+    threshold: 1 # Binary evaluation (0 or 1)
+    description: "Keywords evaluation (ALL match) with sequential alternate checking (case-insensitive)"
```
🧹 Nitpick comments (2)
README.md (2)
89-89: Consider hyphenating compound modifier for style consistency. The phrase "case insensitive" should be "case-insensitive" when used as a compound adjective modifying "matching".

Apply this diff:

```diff
-  - [`keywords_eval`](src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py) - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
+  - [`keywords_eval`](src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py) - Keywords evaluation with alternatives (ALL keywords must match, case-insensitive)
```

310-310: Minor: Use hyphenated compound adjective. For grammatical consistency, "case insensitive" should be hyphenated when used as a compound adjective.

Apply this diff:

```diff
-> - `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
+> - `expected_keywords`: Required for `custom:keywords_eval` (case-insensitive matching)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- README.md (5 hunks)
- config/system.yaml (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom/__init__.py (2 hunks)
- src/lightspeed_evaluation/core/metrics/custom/custom.py (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/system/validator.py (1 hunks)
- tests/unit/core/metrics/test_keywords_eval.py (1 hunks)
- tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- src/lightspeed_evaluation/core/system/validator.py
- config/system.yaml
- tests/unit/core/metrics/test_keywords_eval.py
- src/lightspeed_evaluation/core/metrics/custom/custom.py
- tests/unit/core/models/test_data.py
🧰 Additional context used
📓 Path-based instructions (4)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/models/data.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/custom/__init__.py
- src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**
📄 CodeRabbit inference engine (AGENTS.md)
Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Files:
src/lightspeed_evaluation/core/metrics/custom/__init__.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/custom/__init__.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
README.md
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Applied to files:
README.md
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.
Applied to files:
README.md
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.
Applied to files:
- README.md
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
evaluate_keywords(82-129)
🪛 LanguageTool
README.md
[grammar] ~310-~310: Use a hyphen to join words.
Context: ...equired for custom:keywords_eval (case insensitive matching) > - `expected_resp...
(QB_NEW_EN_HYPHEN)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: pydocstyle
- GitHub Check: mypy
🔇 Additional comments (4)
src/lightspeed_evaluation/core/models/data.py (2)
56-59: LGTM! Field definition is clear and well-structured. The `expected_keywords` field is properly typed and documented. The nested list structure `list[list[str]]` clearly represents alternatives with keyword groups.

96-124: LGTM! Validator is comprehensive and follows established patterns. The validation logic is thorough:
- Validates nested list structure at each level
- Ensures no empty groups or whitespace-only keywords
- Provides specific error messages with indices for debugging
- Follows the same validation pattern as `expected_tool_calls`

Based on learnings: The validator correctly enforces structure. The logical AND requirement (ALL keywords must match) is appropriately handled at evaluation time, not during validation.
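A minimal Pydantic sketch of the kind of validation described above; the field names follow the PR, but the class name, structure, and exact error wording here are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, field_validator


class TurnDataSketch(BaseModel):
    """Illustrative subset of TurnData focused on expected_keywords (not the real model)."""

    response: Optional[str] = None
    expected_keywords: Optional[list[list[str]]] = None

    @field_validator("expected_keywords")
    @classmethod
    def check_expected_keywords(
        cls, value: Optional[list[list[str]]]
    ) -> Optional[list[list[str]]]:
        if value is None:
            return value
        for i, group in enumerate(value):
            # Each alternative group must be a non-empty list.
            if not group:
                raise ValueError(f"expected_keywords[{i}] must be a non-empty list")
            for j, keyword in enumerate(group):
                # Each keyword must be a non-empty, non-whitespace string.
                if not keyword.strip():
                    raise ValueError(
                        f"expected_keywords[{i}][{j}] must be a non-empty, non-whitespace string"
                    )
        return value
```

Running this with an empty group (for example `expected_keywords=[[]]`) raises a ValidationError, matching the "non-empty list of non-empty, non-whitespace strings" requirement noted in the comment.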
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
4-4: LGTM! Correctly exposes the new metric in the public API. The import and export follow the established pattern for custom metrics and are correctly positioned alphabetically.
As per coding guidelines: The metric is properly added under the custom metrics package structure.
Also applies to: 13-13
README.md (1)
233-233: LGTM! Clear example of the alternatives structure. The example effectively demonstrates the nested list format for keyword alternatives.
5f49ecd to 4e6ba3e (Compare)
@VladimirKadlec @tisnik PTAL
VladimirKadlec left a comment
LGTM, nice!
Note for the future refactor: I'd break down CustomMetrics into subclasses (like CustomKeywords, CustomCorrectness, ...) overloading `evaluate`.
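A rough sketch of the kind of refactor being suggested here, with hypothetical class names and a simplified signature (the real CustomMetrics interface may differ):

```python
from abc import ABC, abstractmethod
from typing import Optional


class CustomMetric(ABC):
    """Hypothetical base class; each metric subclass implements its own evaluate()."""

    @abstractmethod
    def evaluate(self, turn_data: object) -> tuple[Optional[float], str]:
        """Return (score, reason); a None score signals an evaluation error."""


class CustomKeywords(CustomMetric):
    def evaluate(self, turn_data: object) -> tuple[Optional[float], str]:
        # Would delegate to the keyword-matching logic added in this PR.
        raise NotImplementedError("illustrative stub")


class CustomCorrectness(CustomMetric):
    def evaluate(self, turn_data: object) -> tuple[Optional[float], str]:
        # Would delegate to the existing correctness evaluation.
        raise NotImplementedError("illustrative stub")
```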
tisnik left a comment
LGTM
Add new keyword eval metric:
Summary by CodeRabbit
New Features
Documentation
Tests
Validation