
Conversation

@asamal4
Collaborator

@asamal4 asamal4 commented Nov 6, 2025

Add a new keyword eval metric:

  • provide a set of keywords (with alternatives) to be matched against the response
  • case-insensitive substring matching; all expected keywords must be present

Summary by CodeRabbit

  • New Features

    • Added a new custom metric (keywords_eval) to evaluate per-turn keyword matching against alternative keyword sets.
    • Added per-turn expected_keywords field to specify groups of alternative keywords (case-insensitive, all-matching logic).
  • Documentation

    • Updated docs and examples to describe the new metric, expected_keywords usage, required positioning, and sample data.
  • Tests

    • Added comprehensive unit tests covering evaluation logic and expected_keywords validation.
  • Validation

    • Metric now requires response and expected_keywords fields for per-turn evaluation.

@coderabbitai
Contributor

coderabbitai bot commented Nov 6, 2025

Walkthrough

Adds a new per-turn custom metric "custom:keywords_eval" and the TurnData field expected_keywords; implements evaluate_keywords with validation and matching logic, registers the metric in config and validator requirements, updates docs/examples, and adds unit tests for metric behavior and data validation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Documentation**<br>`README.md` | Documents new `expected_keywords` per-turn field and `custom:keywords_eval` metric, updates examples and evaluation data structure to show required positioning and usage. |
| **Configuration & Metadata**<br>`config/system.yaml` | Inserts `custom:keywords_eval` into turn-level metrics metadata with description "Keywords (ALL) matching evaluation with alternative sets". |
| **Custom metrics package**<br>`src/lightspeed_evaluation/core/metrics/custom/__init__.py`, `src/lightspeed_evaluation/core/metrics/custom/custom.py` | Exposes and registers the new `evaluate_keywords` function by importing it in package `__init__` and adding `"keywords_eval": evaluate_keywords` to `CustomMetrics.supported_metrics`. |
| **Metric implementation**<br>`src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py` | New module implementing `evaluate_keywords` with input validation, case-insensitive keyword/substring matching across alternative keyword lists, short-circuit success on first full match, and detailed success/failure reason formatting. |
| **Data model**<br>`src/lightspeed_evaluation/core/models/data.py` | Adds `TurnData.expected_keywords: Optional[list[list[str]]]` and a `validate_expected_keywords` field validator enforcing a list-of-non-empty-lists-of-non-empty-strings structure. |
| **System validator**<br>`src/lightspeed_evaluation/core/system/validator.py` | Adds `METRIC_REQUIREMENTS` entry for `custom:keywords_eval` requiring `["response", "expected_keywords"]`. |
| **Unit tests**<br>`tests/unit/core/metrics/test_keywords_eval.py`, `tests/unit/core/models/test_data.py` | Adds tests for `evaluate_keywords` covering success/failure paths, case/substring handling, missing data and conversation-level errors; adds `TestTurnDataKeywordsValidation` suite validating `expected_keywords` acceptance and rejection cases. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant evaluate_keywords
    participant Validator
    participant Matcher

    Caller->>evaluate_keywords: evaluate_keywords(turn_data)
    evaluate_keywords->>Validator: validate inputs (turn_data, response, expected_keywords)
    alt Validation Fails
        Validator-->>evaluate_keywords: error reason
        evaluate_keywords-->>Caller: Return (None, reason)
    else Validation Passes
        Validator-->>evaluate_keywords: ok
        evaluate_keywords->>Matcher: normalize response (lowercase)
        loop for each expected_keywords option (Option N)
            Matcher->>Matcher: check each keyword in option against response
            alt all keywords matched
                Matcher-->>evaluate_keywords: option matched
                evaluate_keywords-->>Caller: Return (1.0, "Keywords eval successful: Option X — matched: [...]")
                Note over evaluate_keywords: short-circuit on first full match
            else not all matched
                Matcher-->>evaluate_keywords: option failed (matched/unmatched details)
            end
        end
        alt no options matched
            evaluate_keywords->>evaluate_keywords: aggregate per-option matched/unmatched details
            evaluate_keywords-->>Caller: Return (0.0, "Keywords eval failed: All options failed — details: ...")
        end
    end
```
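
The matching flow in the diagram could be approximated by the sketch below. This is only an illustration of the described behavior, not the actual `keywords_eval.py` code; the helper name and exact reason strings are assumptions.

```python
# Illustrative approximation of the ALL-match-with-alternatives loop above.
# Not the PR's implementation; function name and reason strings are assumed.
from typing import Optional


def match_keyword_options(
    response: str, expected_keywords: list[list[str]]
) -> tuple[Optional[float], str]:
    """Case-insensitive substring matching; ALL keywords in one option must match."""
    response_lower = response.lower()
    failures: list[str] = []
    for idx, option in enumerate(expected_keywords, start=1):
        matched = [kw for kw in option if kw.lower() in response_lower]
        if len(matched) == len(option):
            # Short-circuit on the first option whose keywords all match.
            return 1.0, f"Keywords eval successful: Option {idx} matched: {matched}"
        unmatched = [kw for kw in option if kw.lower() not in response_lower]
        failures.append(f"Option {idx}: matched={matched}, unmatched={unmatched}")
    return 0.0, "Keywords eval failed: All options failed: " + "; ".join(failures)
```

For example, a response containing "oc scale" would score 1.0 against `[["oc scale"], ["kubectl scale"]]`, because the first alternative set matches in full.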

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Reviewers should focus on:
    • correctness and edge cases in keywords_eval.py (validation, matching semantics, substring vs token matching, reason formatting).
    • validate_expected_keywords in data.py for strictness and clear error messages.
    • test coverage in tests/unit/core/metrics/test_keywords_eval.py and tests/unit/core/models/test_data.py to ensure all branches and invalid inputs are exercised.
    • registration consistency: imports/exports and METRIC_REQUIREMENTS alignment.

Possibly related PRs

  • Turn metric override #55 — Modifies TurnData model with per-turn fields and validators; likely overlaps with TurnData additions and validator patterns.
  • API integration & refactoring #47 — Changes custom metrics wiring and TurnData/validator metadata; relevant to metric registration and exports.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add keyword eval metric' directly aligns with the PR's main objective of introducing a new keyword evaluation metric, clearly summarizing the primary change. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5f49ecd and 4e6ba3e.

📒 Files selected for processing (9)
  • README.md (5 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (1 hunks)
  • tests/unit/core/metrics/test_keywords_eval.py (1 hunks)
  • tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • config/system.yaml
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • tests/unit/core/metrics/test_keywords_eval.py
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • README.md
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • README.md
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
  • evaluate_keywords (82-129)
tests/unit/core/models/test_data.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-295)
🪛 LanguageTool
README.md

[grammar] ~310-~310: Use a hyphen to join words.
Context: ...equired for custom:keywords_eval (case insensitive matching) > - `expected_resp...

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
  • GitHub Check: tests (3.11)
  • GitHub Check: mypy
🔇 Additional comments (14)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)

4-4: LGTM! Properly exposes the new keyword evaluation metric.

The import and export follow the established pattern for other custom metrics in this package.

Also applies to: 13-13

src/lightspeed_evaluation/core/system/validator.py (1)

43-46: LGTM! Metric requirements properly defined.

The entry correctly specifies the required fields for the keywords evaluation metric and follows the established pattern for other custom metrics.
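
For orientation, the added entry presumably has a shape like the sketch below; only the `custom:keywords_eval` line is taken from this PR, the surrounding dictionary content is an assumption.

```python
# Sketch of the requirements mapping; only the keywords_eval entry is from the PR.
METRIC_REQUIREMENTS: dict[str, list[str]] = {
    # ... existing per-metric requirements ...
    "custom:keywords_eval": ["response", "expected_keywords"],
}
```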

src/lightspeed_evaluation/core/models/data.py (2)

56-59: LGTM! Field declaration is well-structured.

The expected_keywords field type and description appropriately represent a list of keyword alternatives for evaluation.


96-124: LGTM! Comprehensive validation with clear error messages.

The validator thoroughly checks the structure of expected_keywords at each level and provides descriptive error messages including indices for easy debugging. The validation logic appropriately focuses on data structure rather than evaluation semantics.
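
As a rough picture of the structure being enforced, a validator of this kind might look like the sketch below (assumes Pydantic v2 style; field names match the PR, everything else is approximate).

```python
# Approximate sketch of TurnData.expected_keywords and its validator; not the PR code.
from typing import Optional

from pydantic import BaseModel, field_validator


class TurnData(BaseModel):
    response: Optional[str] = None
    expected_keywords: Optional[list[list[str]]] = None

    @field_validator("expected_keywords")
    @classmethod
    def validate_expected_keywords(
        cls, value: Optional[list[list[str]]]
    ) -> Optional[list[list[str]]]:
        """Require a list of non-empty lists of non-empty strings, or None."""
        if value is None:
            return value
        for i, group in enumerate(value):
            if not isinstance(group, list) or not group:
                raise ValueError(f"expected_keywords[{i}] must be a non-empty list")
            for j, keyword in enumerate(group):
                if not isinstance(keyword, str) or not keyword.strip():
                    raise ValueError(
                        f"expected_keywords[{i}][{j}] must be a non-empty string"
                    )
        return value
```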

README.md (5)

89-89: LGTM! Clear documentation of the new metric.

The description accurately conveys that ALL keywords must match (logical AND) and mentions case-insensitive matching and alternatives support.


154-155: LGTM! Metric metadata is well-documented.

The description clearly explains the evaluation behavior with sequential alternate checking and case-insensitive matching. The comment noting binary evaluation (0 or 1) is helpful context.


233-233: LGTM! Clear example demonstrating proper usage.

The example shows the expected_keywords structure with alternative groups and properly includes the metric in turn_metrics configuration.

Also applies to: 240-240
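
In code terms, the nested alternatives structure that example demonstrates is roughly the following (a sketch; any other required `TurnData` fields are omitted, and the import path is assumed from the PR's file list).

```python
# Hypothetical construction of a turn with keyword alternatives; other required
# TurnData fields (if any) are omitted for brevity.
from lightspeed_evaluation.core.models.data import TurnData

turn = TurnData(
    response="Use `oc scale deployment/my-app --replicas=3` to scale it.",
    expected_keywords=[
        ["oc scale", "replicas"],       # Option 1: ALL keywords must appear
        ["kubectl scale", "replicas"],  # Option 2: alternative keyword set
    ],
)
```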


298-298: LGTM! Table entry is complete and accurate.

The field documentation properly describes the type, requirement status, and purpose.


310-310: LGTM! Requirement is clearly documented.

The note correctly states that expected_keywords is required when using the custom:keywords_eval metric and mentions case-insensitive matching.

tests/unit/core/models/test_data.py (5)

337-339: LGTM! Well-organized test class.

The test class is properly structured to focus on expected_keywords validation with a clear docstring.


340-366: LGTM! Valid cases are well-tested.

The tests cover both single and multiple keyword groups, verifying that valid data structures pass validation unchanged.


367-373: LGTM! None handling is properly tested.

The test correctly verifies that None is a valid value for the optional field.


375-441: LGTM! Comprehensive validation error testing.

The tests thoroughly cover all invalid input scenarios with appropriate error message assertions, ensuring the validator catches all edge cases.
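
A hedged sketch of one such rejection case (the real tests live in `tests/unit/core/models/test_data.py` and may use different field values):

```python
# Illustrative shape of a rejection test; not copied from the PR's test suite.
import pytest
from pydantic import ValidationError

from lightspeed_evaluation.core.models.data import TurnData


def test_expected_keywords_rejects_empty_group():
    with pytest.raises(ValidationError):
        TurnData(
            response="OpenShift is a Kubernetes platform.",
            expected_keywords=[[]],  # an empty alternative group must be rejected
        )
```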


443-467: LGTM! Complex scenario is well-tested.

The test verifies the validator handles realistic complex structures with multiple groups and multi-word keywords, ensuring robustness for real-world use cases.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
config/system.yaml (1)

79-81: Add explicit threshold for consistency with binary metrics.

For binary metrics that return 0 or 1, an explicit threshold should be specified for consistent behavior. The custom:intent_eval metric (line 87) uses threshold: 1 for similar boolean evaluation.

Based on learnings

Apply this diff:

```diff
     # Custom metrics
     "custom:keywords_eval":  # boolean eval (either 0 or 1)
+      threshold: 1
       description: "Keywords (ALL) matching evaluation with alternative sets"
```
README.md (1)

314-314: Fix hyphenation for compound adjective.

The phrase "case insensitive" should be hyphenated when used as a compound adjective modifying "matching".

Apply this diff:

```diff
-> - `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
+> - `expected_keywords`: Required for `custom:keywords_eval` (case-insensitive matching)
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64f8892 and 2b4fec9.

📒 Files selected for processing (7)
  • README.md (5 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
config/system.yaml

📄 CodeRabbit inference engine (AGENTS.md)

config/system.yaml: Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)
Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics

Files:

  • config/system.yaml
config/**/*.{yaml,yml}

📄 CodeRabbit inference engine (AGENTS.md)

Use YAML for configuration files under config/

Files:

  • config/system.yaml
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/lightspeed_evaluation/core/metrics/custom/**

📄 CodeRabbit inference engine (AGENTS.md)

Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
🧠 Learnings (16)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Add metric metadata to the metrics_metadata section in config/system.yaml when introducing new metrics

Applied to files:

  • config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/system.yaml : Keep system configuration in config/system.yaml (LLM, API, metrics metadata, output, logging)

Applied to files:

  • config/system.yaml
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Applied to files:

  • config/system.yaml
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • config/system.yaml
  • README.md
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
  • README.md
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
📚 Learning: 2025-07-16T13:20:45.006Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:45.006Z
Learning: In the lsc_agent_eval package, evaluation results use distinct values: "FAIL" means the evaluation ran successfully but the result was negative, while "ERROR" means there was an issue executing the evaluation itself (e.g., setup script failed, API connection failed).

Applied to files:

  • README.md
📚 Learning: 2025-07-16T13:20:40.632Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_evaluator.py:0-0
Timestamp: 2025-07-16T13:20:40.632Z
Learning: In the lsc_agent_eval package, evaluation results use "FAIL" for evaluations that ran but didn't pass the criteria, and "ERROR" for errors in the evaluation process itself (like setup script failures, API errors, etc.).

Applied to files:

  • README.md
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • src/lightspeed_evaluation/core/system/validator.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
  • evaluate_keywords (82-129)
src/lightspeed_evaluation/core/metrics/custom/custom.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
  • evaluate_keywords (82-129)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-295)
🪛 LanguageTool
README.md

[grammar] ~314-~314: Use a hyphen to join words.
Context: ...equired for custom:keywords_eval (case insensitive matching) > - `verify_script...

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.12)
🔇 Additional comments (16)
README.md (4)

89-89: LGTM!

The metric description accurately reflects the implementation logic, correctly noting that ALL keywords must match (AND logic) with case-insensitive matching. Based on learnings.


234-234: LGTM!

The expected_keywords example correctly demonstrates the list-of-lists format for alternative keyword sets, matching the data model structure.


241-241: LGTM!

The metric is correctly positioned in the turn_metrics list, consistent with its turn-level-only implementation.


301-301: LGTM!

The field documentation accurately describes the expected_keywords structure and its conditional requirement based on metric usage.

src/lightspeed_evaluation/core/metrics/custom/custom.py (2)

12-12: LGTM!

The import statement correctly follows the established pattern for custom metric imports.


32-32: LGTM!

The metric is correctly registered in the supported_metrics dictionary, following the established pattern for custom metrics. As per coding guidelines.
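
For context, the registration presumably ends up looking something like this (a sketch; entries other than `keywords_eval` are placeholders).

```python
# Sketch of CustomMetrics.supported_metrics after this change; only the
# keywords_eval entry is taken from the PR, the rest is placeholder.
from lightspeed_evaluation.core.metrics.custom.keywords_eval import evaluate_keywords

supported_metrics = {
    # ... existing custom metrics such as "tool_eval" and "intent_eval" ...
    "keywords_eval": evaluate_keywords,
}
```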

src/lightspeed_evaluation/core/metrics/custom/__init__.py (2)

4-4: LGTM!

The import correctly exposes the new evaluate_keywords function from the keywords_eval module.


13-13: LGTM!

The function is correctly added to __all__ to expose it as part of the package's public API.

src/lightspeed_evaluation/core/system/validator.py (1)

43-46: LGTM!

The metric requirements correctly specify the necessary fields (response and expected_keywords) for keyword evaluation, consistent with the implementation and data model.

src/lightspeed_evaluation/core/models/data.py (2)

65-68: LGTM!

The expected_keywords field is correctly defined with appropriate type annotation and description, matching the evaluation requirements.


96-124: LGTM!

The validator thoroughly checks the structure and content of expected_keywords, ensuring each alternative group is a non-empty list of non-empty, non-whitespace strings. Error messages include helpful index information for debugging.

src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (5)

8-24: LGTM!

The input validation correctly handles all error cases. The distinction between returning None (ERROR) for invalid setup and 0.0 (FAIL) for missing response is semantically appropriate—no response means keywords cannot be matched, which is a legitimate evaluation failure rather than a configuration error.
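
A minimal sketch of that ERROR-vs-FAIL split, under the assumption that validation returns an early (score, reason) result or lets matching proceed (names are illustrative):

```python
# Illustrative only: a None score signals ERROR (bad setup), 0.0 signals FAIL.
from typing import Optional

EarlyResult = Optional[tuple[Optional[float], str]]


def check_inputs(turn_data: Optional[object]) -> EarlyResult:
    """Return an early (score, reason) result, or None if matching can proceed."""
    if turn_data is None or not getattr(turn_data, "expected_keywords", None):
        # Invalid setup: surfaced as ERROR (score None), not a failed evaluation.
        return (None, "keywords_eval requires expected_keywords")
    if not getattr(turn_data, "response", None):
        # Missing response: evaluation ran but nothing can match, so FAIL (0.0).
        return (0.0, "No response available to match keywords against")
    return None
```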


27-40: LGTM!

The keyword matching logic correctly implements case-insensitive substring matching with ALL-keywords-must-match semantics (AND logic), consistent with the framework's requirements. Based on learnings.


43-52: LGTM!

The success result is well-formatted with a score of 1.0 and a clear, informative reason showing which option matched and listing all matched keywords.


55-79: LGTM!

The failure result provides comprehensive debugging information, showing both matched and unmatched keywords for each alternative. This detailed feedback helps users understand why the evaluation failed and how to fix their data or responses.


82-129: LGTM!

The main evaluation function is well-implemented with:

  • Clear documentation explaining the sequential alternative-checking logic
  • Proper type hints for all parameters and return values
  • Efficient case-insensitive matching by lowercasing once
  • Correct short-circuit behavior on first successful match
  • Comprehensive error handling

@asamal4 asamal4 force-pushed the keyword-eval branch 2 times, most recently from ab6a85a to 5f49ecd on November 6, 2025 at 14:02
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
README.md (1)

154-155: Verify threshold consistency with system config and consider hyphenation.

Two notes:

  1. Threshold documentation: A previous review flagged that the README example includes threshold: 1 for custom:keywords_eval, but the actual config/system.yaml may be missing this field. Ensure consistency between the documentation and implementation.

  2. Minor style: Consider using "case-insensitive" (with hyphen) in the description for grammatical consistency.

For the hyphenation:

-    "custom:keywords_eval":  # Binary evaluation (0 or 1)
-      description: "Keywords evaluation (ALL match) with sequential alternate checking (case insensitive)"
+    "custom:keywords_eval":
+      threshold: 1  # Binary evaluation (0 or 1)
+      description: "Keywords evaluation (ALL match) with sequential alternate checking (case-insensitive)"
🧹 Nitpick comments (2)
README.md (2)

89-89: Consider hyphenating compound modifier for style consistency.

The phrase "case insensitive" should be "case-insensitive" when used as a compound adjective modifying "matching".

Apply this diff:

```diff
-    - [`keywords_eval`](src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py) - Keywords evaluation with alternatives (ALL keywords must match, case insensitive)
+    - [`keywords_eval`](src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py) - Keywords evaluation with alternatives (ALL keywords must match, case-insensitive)
```

310-310: Minor: Use hyphenated compound adjective.

For grammatical consistency, "case insensitive" should be hyphenated when used as a compound adjective.

Apply this diff:

```diff
-> - `expected_keywords`: Required for `custom:keywords_eval` (case insensitive matching)
+> - `expected_keywords`: Required for `custom:keywords_eval` (case-insensitive matching)
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab6a85a and 5f49ecd.

📒 Files selected for processing (8)
  • README.md (5 hunks)
  • config/system.yaml (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/__init__.py (2 hunks)
  • src/lightspeed_evaluation/core/metrics/custom/custom.py (2 hunks)
  • src/lightspeed_evaluation/core/models/data.py (2 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (1 hunks)
  • tests/unit/core/metrics/test_keywords_eval.py (1 hunks)
  • tests/unit/core/models/test_data.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • src/lightspeed_evaluation/core/system/validator.py
  • config/system.yaml
  • tests/unit/core/metrics/test_keywords_eval.py
  • src/lightspeed_evaluation/core/metrics/custom/custom.py
  • tests/unit/core/models/test_data.py
🧰 Additional context used
📓 Path-based instructions (4)
src/lightspeed_evaluation/**

📄 CodeRabbit inference engine (AGENTS.md)

Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
  • src/lightspeed_evaluation/core/models/data.py
src/lightspeed_evaluation/core/metrics/custom/**

📄 CodeRabbit inference engine (AGENTS.md)

Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
src/lightspeed_evaluation/core/metrics/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Register new metrics in MetricManager’s supported_metrics dictionary

Files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
🧠 Learnings (11)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
📚 Learning: 2025-07-29T05:15:39.782Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/evaluator.py:87-100
Timestamp: 2025-07-29T05:15:39.782Z
Learning: In the lsc_agent_eval framework, the substring evaluation logic in the `_evaluate_substring` method requires ALL expected keywords to be present in the agent response (logical AND), not just any keyword (logical OR). This is a stricter evaluation condition that was intentionally changed and may be subject to future modifications.

Applied to files:

  • README.md
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • README.md
📚 Learning: 2025-08-13T14:07:44.195Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 24
File: lsc_agent_eval/README.md:116-136
Timestamp: 2025-08-13T14:07:44.195Z
Learning: In the lsc_agent_eval framework, the expected_tool_calls configuration uses "tool_name" as the key for tool names, not "name". The tool call evaluation implementation specifically looks for the "tool_name" field when comparing expected vs actual tool calls.

Applied to files:

  • README.md
📚 Learning: 2025-10-31T11:54:59.126Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 90
File: src/lightspeed_evaluation/core/models/data.py:198-208
Timestamp: 2025-10-31T11:54:59.126Z
Learning: In the lightspeed_evaluation framework, the expected_tool_calls validator intentionally rejects a single empty set `[[]]` as the only alternative. This is by design: if no tool calls are expected, the tool_eval metric should not be configured for that turn. Empty sets are only valid as fallback alternatives (e.g., `[[[tool_call]], [[]]]`), representing optional tool call scenarios, not as primary or sole expectations.

Applied to files:

  • README.md
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/custom/** : Add new custom metrics under src/lightspeed_evaluation/core/metrics/custom/

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)

Applied to files:

  • src/lightspeed_evaluation/core/metrics/custom/__init__.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)
src/lightspeed_evaluation/core/metrics/custom/keywords_eval.py (1)
  • evaluate_keywords (82-129)
🪛 LanguageTool
README.md

[grammar] ~310-~310: Use a hyphen to join words.
Context: ...equired for custom:keywords_eval (case insensitive matching) > - `expected_resp...

(QB_NEW_EN_HYPHEN)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: pydocstyle
  • GitHub Check: mypy
🔇 Additional comments (4)
src/lightspeed_evaluation/core/models/data.py (2)

56-59: LGTM! Field definition is clear and well-structured.

The expected_keywords field is properly typed and documented. The nested list structure list[list[str]] clearly represents alternatives with keyword groups.


96-124: LGTM! Validator is comprehensive and follows established patterns.

The validation logic is thorough:

  • Validates nested list structure at each level
  • Ensures no empty groups or whitespace-only keywords
  • Provides specific error messages with indices for debugging
  • Follows the same validation pattern as expected_tool_calls

Based on learnings: The validator correctly enforces structure. The logical AND requirement (ALL keywords must match) is appropriately handled at evaluation time, not during validation.

src/lightspeed_evaluation/core/metrics/custom/__init__.py (1)

4-4: LGTM! Correctly exposes the new metric in the public API.

The import and export follow the established pattern for custom metrics and are correctly positioned alphabetically.

As per coding guidelines: The metric is properly added under the custom metrics package structure.

Also applies to: 13-13

README.md (1)

233-233: LGTM! Clear example of the alternatives structure.

The example effectively demonstrates the nested list format for keyword alternatives.

@asamal4
Collaborator Author

asamal4 commented Nov 17, 2025

@VladimirKadlec @tisnik PTAL

Contributor

@VladimirKadlec VladimirKadlec left a comment


LGTM, nice!

Note for a future refactor: I'd break CustomMetrics down into subclasses (like CustomKeywords, CustomCorrectness, ...) overriding evaluate.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit f31911f into lightspeed-core:main Nov 20, 2025
15 checks passed