
@asamal4 asamal4 commented Sep 18, 2025

  • Use applicable default metrics when no override is given.
  • Allow overriding turn-level metrics so different metrics can be used where required (with de-duplication).
  • Minor refactoring.

Summary by CodeRabbit

  • New Features

    • Per-turn metric configuration with optional per-turn thresholds, explicit "skip turn" support, and a per-turn faithfulness threshold option (e.g., 0.99).
  • Bug Fixes

    • Improved error messages with conversation/turn context and a new error summary view.
  • Refactor

    • Centralized metric resolution and threshold handling; evaluation pipeline wired to use the centralized metric manager.
  • Configuration

    • Metrics reordered, default flags added for several turn/conversation metrics.
  • Documentation

    • Docs and examples updated to explain per-turn overrides, defaults, and behavior.
  • Tests

    • Fixtures and unit tests updated for per-turn metric model.


coderabbitai bot commented Sep 18, 2025

Walkthrough

Moves metric configuration from the conversation level to individual turns, adds MetricManager and MetricLevel for centralized metric/threshold resolution, updates models and validation for per-turn metrics, and refactors the pipeline, evaluator, processor, error handling, tests, and docs to use per-turn metric resolution and framework routing.
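
A minimal, self-contained Python sketch of the resolution rule described here (None → system defaults, [] → explicit skip, non-empty list → use as-is). The enum values and the metadata dict mirror the config summaries below but are assumptions for illustration, not the PR's actual MetricManager:

from enum import Enum
from typing import Optional


class MetricLevel(Enum):
    # Assumed values; the PR's enum may differ.
    TURN = "turn_level"
    CONVERSATION = "conversation_level"


# Assumed system metadata shaped like config/system.yaml: only metrics flagged
# default=true are used when no per-turn/conversation override is given.
METRICS_METADATA = {
    "turn_level": {
        "ragas:response_relevancy": {"threshold": 0.8, "default": True},
        "ragas:faithfulness": {"threshold": 0.8, "default": False},
    },
    "conversation_level": {
        "deepeval:conversation_completeness": {"threshold": 0.8, "default": False},
    },
}


def resolve_metrics(metrics: Optional[list[str]], level: MetricLevel) -> list[str]:
    """None -> system defaults, [] -> explicit skip, non-empty list -> use as-is."""
    if metrics is None:
        metadata = METRICS_METADATA[level.value]
        return [name for name, meta in metadata.items() if meta.get("default", False)]
    return metrics  # an explicit [] falls through here and means "skip"


print(resolve_metrics(None, MetricLevel.TURN))                    # ['ragas:response_relevancy']
print(resolve_metrics([], MetricLevel.TURN))                      # [] -> skip this turn
print(resolve_metrics(["ragas:faithfulness"], MetricLevel.TURN))  # explicit override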

Changes

  • Configs: Evaluation data (config/evaluation_data.yaml): Remove group-level turn_metrics/turn_metrics_metadata; add per-turn turn_metrics and turn_metrics_metadata entries (explicit lists and empty [] to skip); add a per-turn faithfulness threshold in one group.
  • Configs: System defaults (config/system.yaml): Add default flags to turn-level metrics (ragas:response_relevancy default true, ragas:faithfulness default false), reorder ragas:context_relevance, and set deepeval:conversation_completeness default false.
  • Core models (src/.../core/models/data.py): Add _validate_and_deduplicate_metrics; introduce TurnData.turn_metrics and turn_metrics_metadata with validators; remove turn_metrics/turn_metrics_metadata from EvaluationData; add conversation_metrics/conversation_metrics_metadata.
  • Metrics manager (new; src/.../core/metrics/manager.py): New MetricLevel enum and MetricManager for resolving metrics (defaults vs explicit vs skip), threshold precedence (turn → conv → system), and counting metrics per conversation; a threshold-precedence sketch follows this list.
  • System validator (src/.../core/system/validator.py): Validate metrics per turn by iterating data.turns; improve error messages to include conversation/turn context; enforce per-turn metric requirements.
  • Evaluator (src/.../pipeline/evaluation/evaluator.py): Inject MetricManager/MetricLevel; route metrics by framework to handlers; delegate threshold resolution to MetricManager; centralize error handling; update the constructor signature.
  • Processor & wiring (src/.../pipeline/evaluation/pipeline.py, src/.../pipeline/evaluation/processor.py): Add a ProcessorComponents dataclass; wire MetricManager into the pipeline; ConversationProcessor now accepts ProcessorComponents, resolves per-turn/conversation metrics up front, skips when resolved metrics are empty, and delegates summary counting to MetricManager.
  • Error handling (src/.../pipeline/evaluation/errors.py): mark_all_metrics_as_error now accepts resolved per-turn and conversation metric lists and emits ERROR results per resolved metric; add get_error_summary.
  • Tests & fixtures (tests/conftest.py, tests/unit/core/config/test_models.py): Move sample metrics from the conversation level into per-turn TurnData; update fixtures/tests to use TurnData.turn_metrics and turn_metrics_metadata; adjust to the new system metadata.
  • Docs (README.md): Update docs to show per-turn metric overrides, default semantics (null/empty/explicit), examples, and Turn Data Fields additions.
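
As a companion to the metrics manager entry in the list above, here is a free-standing sketch of the turn → conversation → system threshold precedence. The dict-based signature is an illustration only; the real get_effective_threshold works against the pipeline's data models:

from typing import Any, Optional


def get_effective_threshold(
    metric_id: str,
    system_metadata: dict[str, dict[str, Any]],
    conversation_metadata: Optional[dict[str, dict[str, Any]]] = None,
    turn_metadata: Optional[dict[str, dict[str, Any]]] = None,
) -> Optional[float]:
    """Most specific source wins: per-turn, then per-conversation, then system config."""
    for metadata in (turn_metadata, conversation_metadata, system_metadata):
        if metadata and metric_id in metadata:
            threshold = metadata[metric_id].get("threshold")
            if threshold is not None:
                return threshold
    return None  # caller decides the fallback (e.g. a global default, or score > 0 for tool_eval)


system = {"ragas:faithfulness": {"threshold": 0.8}}
per_turn = {"ragas:faithfulness": {"threshold": 0.99}}  # e.g. the override in evaluation_data.yaml
print(get_effective_threshold("ragas:faithfulness", system, turn_metadata=per_turn))  # 0.99
print(get_effective_threshold("ragas:faithfulness", system))                          # 0.8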

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner as Runner
  participant Pipeline as EvaluationPipeline
  participant Proc as ConversationProcessor
  participant MM as MetricManager
  participant Eval as MetricsEvaluator
  participant H as Handler

  Runner->>Pipeline: run(config, evaluation_data)
  Pipeline->>Proc: start with ProcessorComponents (incl. MetricManager)
  Proc->>MM: resolve_metrics(conv.conversation_metrics, CONVERSATION)
  MM-->>Proc: resolved_conversation_metrics
  loop for each turn
    Proc->>MM: resolve_metrics(turn.turn_metrics, TURN)
    MM-->>Proc: resolved_turn_metrics_for_turn
    alt resolved_turn_metrics non-empty
      Proc->>Eval: evaluate_turn(turn, resolved_turn_metrics_for_turn)
      Eval->>MM: get_effective_threshold(metric_id, TURN, conv, turn)
      MM-->>Eval: threshold
      Eval->>H: route(framework, metric_name)
      H-->>Eval: score/result
      Eval-->>Proc: EvaluationResult
    else skip turn
      Note right of Proc: turn_metrics == [] → skip evaluation
    end
  end
  alt resolved_conversation_metrics non-empty
    Proc->>Eval: evaluate_conversation(conv, resolved_conversation_metrics)
    Eval->>MM: get_effective_threshold(metric_id, CONVERSATION, conv)
    MM-->>Eval: threshold
    Eval->>H: route(framework, metric_name)
    H-->>Eval: score/result
    Eval-->>Proc: EvaluationResult
  else skip conversation
    Note right of Proc: no conversation metrics → skip
  end
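
For readers who prefer code to the diagram, the loop below is a minimal, runnable rendering of the same per-turn routing. The SimpleNamespace turns and the hard-coded default list stand in for the real TurnData objects and system config, so treat it as a sketch rather than the PR's processor:

from types import SimpleNamespace
from typing import Optional

DEFAULT_TURN_METRICS = ["ragas:response_relevancy"]  # assumed system default


def resolve(metrics: Optional[list[str]]) -> list[str]:
    """None -> defaults, [] -> skip, explicit list -> as-is (per the walkthrough)."""
    return DEFAULT_TURN_METRICS if metrics is None else metrics


turns = [
    SimpleNamespace(turn_id="1", turn_metrics=None),                    # falls back to defaults
    SimpleNamespace(turn_id="2", turn_metrics=[]),                      # explicit skip
    SimpleNamespace(turn_id="3", turn_metrics=["ragas:faithfulness"]),  # per-turn override
]

for turn in turns:
    metrics = resolve(turn.turn_metrics)
    if not metrics:
        print(f"turn {turn.turn_id}: skipped")
        continue
    for metric_id in metrics:
        # The real pipeline would route to the framework handler here and
        # compare the returned score against the effective threshold.
        print(f"turn {turn.turn_id}: evaluating {metric_id}")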

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

A nibble of metrics, a hop through the flow,
Defaults in clover, thresholds set just so.
Each turn a tunnel with rules to explore,
Manager thumps thrice — evaluate once more.
I twitch my whiskers — tests pass, carrots galore! 🥕🐇

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Turn metric override" succinctly captures the primary change: per-turn metric overrides with fallback to applicable defaults. It matches the PR objectives and is clear and concise for a reviewer scanning history.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

78-85: API error path currently misses default metrics in ERROR results.

Because EvaluationErrorHandler iterates only explicit metrics on data, default metrics are not marked as ERROR after amendment failure. Align with the pipeline’s default resolution.

See proposed essential refactor in pipeline comment (inject MetricManager into EvaluationErrorHandler and resolve defaults there).

🧹 Nitpick comments (6)
src/lightspeed_evaluation/core/models/data.py (2)

54-65: Good format validation for turn metrics

Optional enhancement: dedupe metrics to avoid double-evaluating the same metric.

Example:

 @field_validator("turn_metrics")
 @classmethod
 def validate_turn_metrics(cls, v: Optional[list[str]]) -> Optional[list[str]]:
@@
-        return v
+        if v:
+            seen, uniq = set(), []
+            for m in v:
+                if m not in seen:
+                    seen.add(m); uniq.append(m)
+            return uniq
+        return v

141-153: Conversation metrics validator reads well

Consider applying the same dedupe idea here for symmetry.

tests/conftest.py (1)

127-135: Test system defaults may be empty; set one default=true

MetricManager picks defaults via metadata.default==true. In this fixture, no turn metric has default=true, so turns with turn_metrics=None will resolve to no metrics (skip). If you intend to test default fallback, mark at least one metric as default.

Patch:

         "metrics_metadata": {
             "turn_level": {
                 "ragas:faithfulness": {
                     "threshold": 0.8,
                     "description": "How faithful the response is to the provided context",
                     "default": False,
                 },
                 "ragas:response_relevancy": {
                     "threshold": 0.8,
+                    "default": True,
                 },

Also applies to: 140-142, 167-169, 184-186, 195-197

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

21-27: Integrating MetricManager is solid; handle custom:tool_eval threshold None

Per our prior learning, when custom:tool_eval has no threshold, passing should be score > 0 (not >= 0.5). Set a metric-specific fallback before determining status.

Minimal change:

             threshold = self.metric_manager.get_effective_threshold(
                 request.metric_identifier, level, request.conv_data, request.turn_data
             )
-            status = self._determine_status(score, threshold)
+            # Metric-specific default thresholds
+            if threshold is None and request.metric_identifier == "custom:tool_eval":
+                threshold = 0.0
+            status = self._determine_status(score, threshold)

Nit: clarify response expression with parentheses for readability:

-                response=request.turn_data.response or "" if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 43-44, 87-95

src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

33-40: Drop redundant self.components to avoid duplication.

You store both self.components and individual aliases. Keeping only the aliases reduces indirection.

Apply:

-        self.components = components
-
-        # Keep direct references for easier access
-        self.metrics_evaluator = components.metrics_evaluator
+        # Keep direct references for easier access
+        self.metrics_evaluator = components.metrics_evaluator
         self.api_amender = components.api_amender
         self.error_handler = components.error_handler
         self.metric_manager = components.metric_manager
src/lightspeed_evaluation/core/metrics/manager.py (1)

105-114: Minor tidy: list comprehension for defaults.

Equivalent, a bit tighter.

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        return [
+            name for name, meta in metrics_metadata.items()
+            if meta.get("default", False)
+        ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afa045b and f97e01c.

📒 Files selected for processing (11)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (112-153)
  • TurnData (13-109)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-77)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (13-109)
  • EvaluationData (112-153)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • get_effective_threshold (50-80)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationRequest (212-265)
  • EvaluationResult (156-195)
  • EvaluationScope (198-209)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-148)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
🪛 GitHub Actions: Ruff
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: F821 Undefined name 'api_client' in APIDataAmender construction. Use 'self.api_client' instead of 'api_client'.

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: Command 'make pylint' failed with exit code 2. Pylint: Undefined variable 'api_client' (undefined-variable).

🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: Step 'make pyright' failed: Pyright error: 'api_client' is not defined (reportUndefinedVariable).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (14)
config/system.yaml (1)

37-38: Defaults look correct; confirm intent for no conversation-level defaults

  • Turn default enabling response_relevancy is good.
  • Faithfulness explicitly non-default and context_relevance reorder are fine (no semantic change).
  • conversation_completeness default=false means if conversation_metrics is None, no conversation-level metrics will run. Confirm this is intentional.

Also applies to: 39-42, 57-60, 75-75

config/evaluation_data.yaml (1)

18-26: YAML updates align with per-turn overrides and defaults

  • conv_group_1: per-turn metrics + per-turn threshold override look good.
  • conv_group_2: swapped metric set is consistent with requirements.
  • conv_group_3: [] to skip turn 1 and None to use system defaults for turn 2 are correct.

Please confirm that system defaults include at least one turn metric so conv_group_3 turn 2 actually evaluates.

Also applies to: 27-44, 58-59, 63-64

src/lightspeed_evaluation/core/models/data.py (2)

44-53: Per-turn metrics fields are well-defined


126-135: Conversation-level metadata relocation is consistent

tests/conftest.py (1)

72-74: Fixture now models per-turn metrics correctly

tests/unit/core/config/test_models.py (3)

75-81: Tests reflect per-turn metrics move

Also applies to: 92-92, 103-103


118-124: Invalid metric format check correctly shifted to TurnData


127-133: Per-turn metadata path covered; assertions look good

Also applies to: 141-141

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

171-172: LGTM: clean API client teardown.

Gracefully closes the client if present.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)

51-63: LGTM: fast-skip check correctly respects defaults and explicit skips.

Using MetricManager to resolve per-turn and conversation-level metrics before work is a good optimization.


86-95: Turn evaluation routing looks solid.

Resolves metrics per turn, logs, and delegates to evaluator. Clean separation.


96-104: Conversation-level routing is correct and consistent with turn-level path.

No issues.

src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: LGTM: resolve_metrics semantics are clear and predictable.

None → defaults, [] → skip, list → as-is. Matches the PR goals.


115-136: LGTM: counting uses resolved metrics, so totals include defaults and explicit skips.

Good for summaries and preflight checks.

@asamal4 asamal4 force-pushed the turn-metric-override branch from f97e01c to 955fef2 on September 19, 2025 at 01:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

78-85: API-error path does not mark default metrics as ERROR.

mark_all_metrics_as_error only enumerates explicitly defined metrics (turn_data.turn_metrics / conv_data.conversation_metrics). When defaults apply (None), no ERROR results are produced, contradicting the log “marking all metrics as ERROR” and skewing summaries.

Proposed fix: pass resolved metrics to the error handler and mark those.

Apply this diff to use the precomputed values from the earlier refactor:

-        if api_error_occurred:
-            logger.error("API error detected - marking all metrics as ERROR")
-            error_results = self.error_handler.mark_all_metrics_as_error(
-                conv_data, "API error during data amendment"
-            )
-            return error_results
+        if api_error_occurred:
+            logger.error("API error detected - marking all metrics as ERROR")
+            # Build turn_id -> metrics map from precomputed resolution
+            turn_metrics_by_id = {
+                turn.turn_id: metrics for _, turn, metrics in resolved_turn_metrics
+            }
+            error_results = self.error_handler.mark_resolved_metrics_as_error(  # new API
+                conv_data,
+                turn_metrics_by_id,
+                conversation_metrics_resolved,
+                "API error during data amendment",
+            )
+            return error_results

Add this method to EvaluationErrorHandler (src/lightspeed_evaluation/pipeline/evaluation/errors.py):

def mark_resolved_metrics_as_error(
    self,
    conv_data: EvaluationData,
    resolved_turn_metrics: dict[str, list[str]],
    resolved_conversation_metrics: list[str],
    error_reason: str,
) -> list[EvaluationResult]:
    """Mark the provided resolved metrics as ERROR (separation of concerns: resolution happens upstream)."""
    results: list[EvaluationResult] = []

    # Turn-level
    for turn in conv_data.turns:
        for metric_identifier in resolved_turn_metrics.get(turn.turn_id, []) or []:
            results.append(
                EvaluationResult(
                    conversation_group_id=conv_data.conversation_group_id,
                    turn_id=turn.turn_id,
                    metric_identifier=metric_identifier,
                    result="ERROR",
                    score=None,
                    threshold=None,
                    reason=error_reason,
                    query=turn.query,
                    response="",
                    execution_time=0.0,
                )
            )

    # Conversation-level
    for metric_identifier in resolved_conversation_metrics or []:
        results.append(
            EvaluationResult(
                conversation_group_id=conv_data.conversation_group_id,
                turn_id=None,
                metric_identifier=metric_identifier,
                result="ERROR",
                score=None,
                threshold=None,
                reason=error_reason,
                query="",
                response="",
                execution_time=0.0,
            )
        )

    self.results.extend(results)
    return results

Note: This aligns with the retrieved learning that resolution occurs upstream; the handler remains a pure marker.

♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

60-61: Fix undefined variable 'api_client' in APIDataAmender construction.

Use the instance attribute you just set. This is causing the test failures.

Apply this diff:

         self.api_client = self._create_api_client()
-        api_amender = APIDataAmender(api_client)
+        api_amender = APIDataAmender(self.api_client)

62-62: Pass metric_manager to EvaluationErrorHandler for proper default metric handling.

Based on the retrieved learning, metric resolution happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(). However, when API errors occur, the error handler needs to know which metrics to mark as ERROR, including defaults.

Update the error handler instantiation to accept the metric_manager:

-        error_handler = EvaluationErrorHandler()
+        error_handler = EvaluationErrorHandler(metric_manager)

Note: You'll also need to update src/lightspeed_evaluation/pipeline/evaluation/errors.py to accept and use the MetricManager for resolving metrics when marking errors.

🧹 Nitpick comments (10)
tests/conftest.py (1)

195-196: Consider testing all three metric override patterns.

The fixtures test explicit metrics (line 184) and None (line 195), but don't include a test case for empty list [] which should skip evaluation.

Consider adding a test fixture with turn_metrics: [] to ensure the skip behavior is properly tested:

                },
+               {
+                   "turn_id": "3",
+                   "query": "Skip this turn",
+                   "response": "This turn should be skipped",
+                   "turn_metrics": [],  # Test skip behavior
+                   "turn_metrics_metadata": {},
+               },
            ],
src/lightspeed_evaluation/core/metrics/manager.py (1)

105-113: Optional: stable, de-duplicated defaults.

If config ever contains duplicates or order needs to be stable, dedupe while preserving order.

Apply this diff:

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        default_metrics: list[str] = []
+        for metric_name, metadata in metrics_metadata.items():
+            if metadata.get("default", False):  # default=false if not specified
+                if metric_name not in default_metrics:
+                    default_metrics.append(metric_name)
+        return default_metrics
src/lightspeed_evaluation/pipeline/evaluation/processor.py (8)

52-62: Avoid double-resolving metrics; precompute once.

resolve_metrics is called here and again below during processing. Precompute per-turn resolved metrics and reuse for both the “any()” check and evaluation.

Apply this diff:

-        has_turn_metrics = any(
-            self.metric_manager.resolve_metrics(
-                turn_data.turn_metrics, MetricLevel.TURN
-            )
-            for turn_data in conv_data.turns
-        )
+        resolved_turn_metrics = [
+            (idx, turn, self.metric_manager.resolve_metrics(turn.turn_metrics, MetricLevel.TURN))
+            for idx, turn in enumerate(conv_data.turns)
+        ]
+        has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics)
-        has_conversation_metrics = bool(
-            self.metric_manager.resolve_metrics(
-                conv_data.conversation_metrics, MetricLevel.CONVERSATION
-            )
-        )
+        conversation_metrics_resolved = self.metric_manager.resolve_metrics(
+            conv_data.conversation_metrics, MetricLevel.CONVERSATION
+        )
+        has_conversation_metrics = bool(conversation_metrics_resolved)

86-95: Reuse the pre-resolved turn metrics; avoid resolving again.

Minor perf/readability tidy-up.

Apply this diff:

-        for turn_idx, turn_data in enumerate(conv_data.turns):
-            turn_metrics = self.metric_manager.resolve_metrics(
-                turn_data.turn_metrics, MetricLevel.TURN
-            )
-            if turn_metrics:
-                logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
-                turn_results = self._evaluate_turn(conv_data, turn_idx, turn_data)
+        for turn_idx, turn_data, turn_metrics in resolved_turn_metrics:
+            if turn_metrics:
+                logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
+                turn_results = self._evaluate_turn(
+                    conv_data, turn_idx, turn_data, turn_metrics
+                )
                 results.extend(turn_results)

And change _evaluate_turn signature to accept pre-resolved metrics (see comment at Lines 110-121).


96-107: Reuse the pre-resolved conversation metrics.

Same duplication as turns; pass them through.

Apply this diff:

-        conversation_metrics = self.metric_manager.resolve_metrics(
-            conv_data.conversation_metrics, MetricLevel.CONVERSATION
-        )
-        if conversation_metrics:
+        if conversation_metrics_resolved:
             logger.debug(
                 "Processing conversation-level metrics: %s",
-                conversation_metrics,
+                conversation_metrics_resolved,
             )
-            conv_results = self._evaluate_conversation(conv_data)
+            conv_results = self._evaluate_conversation(conv_data, conversation_metrics_resolved)
             results.extend(conv_results)

And update _evaluate_conversation to accept the list (see comment at Lines 129-137).


110-121: Simplify: drop redundant resolve and pass metrics in.

Avoid re-resolving and remove the “or []” guard (resolve_metrics already returns a list).

Apply this diff:

-    def _evaluate_turn(
-        self, conv_data: EvaluationData, turn_idx: int, turn_data: TurnData
-    ) -> list[EvaluationResult]:
+    def _evaluate_turn(
+        self,
+        conv_data: EvaluationData,
+        turn_idx: int,
+        turn_data: TurnData,
+        turn_metrics: list[str],
+    ) -> list[EvaluationResult]:
@@
-        turn_metrics = self.metric_manager.resolve_metrics(
-            turn_data.turn_metrics, MetricLevel.TURN
-        )
-
-        for metric_identifier in turn_metrics or []:
+        for metric_identifier in turn_metrics:
             request = EvaluationRequest.for_turn(
                 conv_data, metric_identifier, turn_idx, turn_data
             )

129-137: Mirror change for conversation-level path.

Pass pre-resolved metrics; remove redundant resolve.

Apply this diff:

-    def _evaluate_conversation(
-        self, conv_data: EvaluationData
-    ) -> list[EvaluationResult]:
+    def _evaluate_conversation(
+        self, conv_data: EvaluationData, conversation_metrics: list[str]
+    ) -> list[EvaluationResult]:
@@
-        conversation_metrics = self.metric_manager.resolve_metrics(
-            conv_data.conversation_metrics, MetricLevel.CONVERSATION
-        )
-
-        for metric_identifier in conversation_metrics or []:
+        for metric_identifier in conversation_metrics:
             request = EvaluationRequest.for_conversation(conv_data, metric_identifier)

52-62: Nit: prefer explicit boolean in any() for readability.

any(list_obj) relies on list truthiness. Using bool(...) makes intent clearer.

After the precompute refactor, this becomes:

has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics)

80-83: Log message is misleading today.

It says “marking all metrics as ERROR,” but only explicitly defined metrics are marked. After implementing the resolved-metrics error path, the message becomes accurate.


70-77: Guard is correct but add a clearer failure message.

Raising ValueError on missing SystemConfig is right. Consider mentioning the expected call flow (ConfigLoader.load_system_config) to speed up diagnosis.

-            raise ValueError("SystemConfig must be loaded")
+            raise ValueError("SystemConfig must be loaded via ConfigLoader.load_system_config(...) before processing conversations")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f97e01c and 955fef2.

📒 Files selected for processing (12)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/core/config/test_models.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-148)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-77)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (112-153)
  • TurnData (13-109)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (12)
config/system.yaml (2)

37-38: LGTM! Clear default metric configuration.

The addition of default: true for ragas:response_relevancy and default: false for ragas:faithfulness provides clear semantics for turn-level metric resolution when turn_metrics is null. This aligns well with the MetricManager.resolve_metrics() implementation.

Also applies to: 39-42, 57-59


75-75: LGTM! Consistent with turn-level defaults.

Adding default: false for conversation-level metrics maintains consistency with the turn-level approach.

config/evaluation_data.yaml (2)

18-26: LGTM! Per-turn metrics correctly migrated.

The metrics configuration has been properly moved from conversation group level to individual turns, implementing the per-turn override feature as intended.


40-43: resolve_metrics correctly handles None (defaults), [] (skip), and explicit lists — no changes required.

Verified in src/lightspeed_evaluation/core/metrics/manager.py: None -> self._extract_default_metrics(level); [] -> return []; otherwise return metrics.

README.md (2)

183-196: Excellent documentation of the metric override behavior!

The documentation clearly explains the three override patterns with a helpful table and concrete examples. This will be very useful for users understanding how to configure per-turn metrics.

Also applies to: 240-247


233-234: Documentation correctly reflects the new TurnData fields.

The addition of turn_metrics and turn_metrics_metadata to the Turn Data Fields table accurately documents the schema changes.

tests/conftest.py (1)

72-73: Test fixtures correctly updated for per-turn metrics.

The test fixtures have been properly updated to place turn_metrics and turn_metrics_metadata at the turn level instead of the conversation level, aligning with the new data model.

Also applies to: 167-168, 184-185, 195-196

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

56-73: Clean integration of MetricManager into the pipeline.

The addition of MetricManager and its integration through ProcessorComponents provides a clean architecture for handling per-turn metric overrides and defaults.


171-172: Good resource cleanup for API client.

Properly closing the API client when present ensures resources are cleaned up.

src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: Resolve logic looks correct and matches PR intent.

None -> defaults, [] -> skip, list -> use as-is. Clear and predictable.


115-136: Counting logic is sound.

Accurately reflects effective metrics (including defaults) across turns and conversation.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

146-149: Summary delegation is correct.

Using MetricManager.count_metrics_for_conversation keeps single source of truth for counts.

@asamal4 asamal4 force-pushed the turn-metric-override branch from 955fef2 to 74f0ba7 on September 19, 2025 at 07:13

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (9)
README.md (3)

118-130: Clarify defaults and threshold semantics for custom metrics in system config.

  • Consider documenting default/threshold for custom:tool_eval; currently only ragas metrics show default flags.
  • If tool_eval passes when score > 0 (no threshold), call that out explicitly to avoid confusion with the global 0.5 fallback.

Apply within the YAML example:

   metrics_metadata:
     turn_level:
       "ragas:response_relevancy":
         threshold: 0.8
         description: "How relevant the response is to the question"
         default: true   # Used by default when turn_metrics is null

       "ragas:faithfulness":
         threshold: 0.8
         description: "How faithful the response is to the provided context"
         default: false  # Only used when explicitly specified
+
+      "custom:tool_eval":
+        # If threshold is omitted, tool_eval passes when score > 0 (documented behavior)
+        # Set threshold to null to use that behavior; otherwise provide an explicit value.
+        threshold: null
+        description: "Tool call evaluation comparing expected vs actual tool calls (regex for arguments)"
+        default: false

181-192: Note deduplication behavior and valid identifier format for turn_metrics.

Mention that duplicate metrics are deduplicated (order-preserving) and identifiers must be framework:metric_name; helps users debugging YAML.

Add after the snippet:

  • turn_metrics entries must be in the form framework:metric_name (e.g., ragas:faithfulness).
  • Duplicates are removed while preserving the first occurrence’s order.

240-247: Nice override behavior table; add a note on conversation-level defaults.

Add a line indicating the same None/[] rules apply to conversation_metrics.

src/lightspeed_evaluation/core/models/data.py (2)

13-31: Harden metric ID validation and normalize inputs.

Currently only checks for ":" and not empty string; whitespace and empty sides can slip through, and duplicates with whitespace won’t dedup.

Apply:

 def _validate_and_deduplicate_metrics(
     metrics: list[str], metric_type: str = "metric"
 ) -> list[str]:
     """Validate format and deduplicate metrics while preserving order."""
-    # Validate format first
-    for metric in metrics:
-        if not metric or ":" not in metric:
-            raise ValueError(
-                f'{metric_type} "{metric}" must be in format "framework:metric_name"'
-            )
-
-    # Deduplicate while preserving order
-    seen = set()
-    deduplicated = []
-    for metric in metrics:
-        if metric not in seen:
-            deduplicated.append(metric)
-            seen.add(metric)
-    return deduplicated
+    # Normalize, validate format, and deduplicate while preserving order
+    seen: set[str] = set()
+    deduped: list[str] = []
+    for raw in metrics:
+        metric = raw.strip()
+        if not metric or ":" not in metric:
+            raise ValueError(
+                f'{metric_type} "{raw}" must be in format "framework:metric_name"'
+            )
+        framework, name = metric.split(":", 1)
+        if not framework.strip() or not name.strip():
+            raise ValueError(
+                f'{metric_type} "{raw}" must include both framework and metric name'
+            )
+        if metric not in seen:
+            seen.add(metric)
+            deduped.append(metric)
+    return deduped

83-127: Ensure arguments is a dict in expected_tool_calls.

If provided, arguments may currently be non-dict and get passed through silently.

Apply:

             for j, tool_call in enumerate(sequence):
                 if not isinstance(tool_call, dict):
                     raise ValueError(
                         f"Tool call {j} in sequence {i} must be a dictionary"
                     )
@@
-                # Validate required keys
+                # Validate required keys
                 if "tool_name" not in tool_call:
                     raise ValueError(
                         f"Tool call {j} in sequence {i} missing required 'tool_name' field"
                     )
@@
-                # Ensure arguments field exists (can be empty dict)
+                # Ensure arguments field exists and is a dict (can be empty)
+                if "arguments" in tool_call and not isinstance(
+                    tool_call["arguments"], dict
+                ):
+                    raise ValueError(
+                        f"Tool call {j} in sequence {i} field 'arguments' must be a dictionary"
+                    )
                 validated_tool_call = {
                     "tool_name": tool_call["tool_name"],
                     "arguments": tool_call.get("arguments", {}),
                 }
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

34-41: Optional: accept resolved metrics to avoid duplicate resolution.

Processor already resolves metrics; you could optionally pass them in to reduce duplication and keep resolution logic in one place.

Example signature change:

-def mark_all_metrics_as_error(self, conv_data: EvaluationData, error_reason: str) -> list[EvaluationResult]:
+def mark_all_metrics_as_error(
+    self,
+    conv_data: EvaluationData,
+    error_reason: str,
+    resolved_turn_metrics: list[list[str]] | None = None,
+    resolved_conversation_metrics: list[str] | None = None,
+) -> list[EvaluationResult]:

Then use the provided lists if not None; fall back to self.metric_manager.resolve_metrics otherwise.

Also applies to: 55-59

config/evaluation_data.yaml (3)

23-25: Faithfulness threshold 0.99 is likely too brittle; consider relaxing.

A 0.99 cut-off often flags acceptable answers as failures. Recommend 0.85–0.95 depending on your baseline; start at 0.90.

Apply this diff if acceptable:

       turn_metrics_metadata:
         "ragas:faithfulness":
-          threshold: 0.99
+          threshold: 0.90  # Less brittle; tune based on baseline runs

27-29: Group header change acknowledged. Ensure docs/examples match.

Since README updates are pending, please mirror this per‑turn configuration in the docs’ sample config.


40-44: LGTM on conv_group_2 per‑turn metrics.

These metrics require contexts and a reference; both are provided here.

If other turns reuse this set, consider a YAML anchor to avoid duplication. Example:

-      turn_metrics:
-        - "ragas:context_recall"
-        - "ragas:context_relevance"
-        - "ragas:context_precision_with_reference"
+      turn_metrics: &with_ref_metrics
+        - "ragas:context_recall"
+        - "ragas:context_relevance"
+        - "ragas:context_precision_with_reference"
# Reuse elsewhere with:
# turn_metrics: *with_ref_metrics
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 955fef2 and 74f0ba7.

📒 Files selected for processing (7)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (10)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
src/lightspeed_evaluation/core/output/data_persistence.py (1)
  • save_evaluation_data (15-58)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/system/validator.py (1)
  • DataValidator (62-226)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (11-83)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-144)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (11-83)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (22)
README.md (2)

193-195: Good: explicit skip semantics documented.
Clear that [] skips evaluation for this turn; matches MetricManager.resolve_metrics behavior.


233-235: Turn metric fields table aligns with model changes.
Accurately reflects new TurnData fields.

src/lightspeed_evaluation/core/models/data.py (4)

65-74: Turn metric fields addition looks correct.
Matches README and processing flow.


75-82: Validation hook is good; relies on shared helper.
With the proposed helper tightening, this will handle whitespace and empty parts.


143-151: Conversation metrics fields align with the new model.
Good separation from per-turn config.


158-166: Validator reuse is appropriate.
Once helper is hardened, conversation metric IDs will be consistently normalized.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)

14-17: Constructor now accepts MetricManager — good change.
Enables resolving defaults on API failure.


35-41: Resolves defaults before marking ERROR — correct behavior.
Ensures turns with turn_metrics: null and conversations with conversation_metrics: null get ERROR rows for default metrics.

Also applies to: 54-71

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)

56-66: Wiring MetricManager into evaluator and error handler looks good.
Matches the new per-turn defaults/override flow.


60-63: Fixed APIDataAmender construction using self.api_client.
Resolves previous undefined variable issue.


67-73: ProcessorComponents container improves cohesion.
Nice consolidation of dependencies.


171-172: Graceful API client cleanup.
Close guarded by presence check; OK.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)

16-24: ProcessorComponents dataclass is clean and explicit.
Good dependency injection point.


29-40: Constructor stores component aliases — readable.
Keeps call sites terse.


50-58: Pre-resolving metrics per turn + conversation is correct.
Enables early skip and consistent behavior.


61-68: Early exit when no metrics — good guardrail.
Prevents unnecessary API calls.


86-95: Turn evaluation loops over resolved sets — correct with overrides and skips.
Matches desired semantics for None vs [] vs explicit list.


98-105: Conversation metrics evaluation correctly gated.
Aligned with resolved defaults.


142-144: Metrics summary delegates to MetricManager — nice single source of truth.

config/evaluation_data.yaml (3)

18-21: LGTM on per‑turn override for conv_group_1.

Metrics align with fields present (contexts and expected_response exist), so these should evaluate cleanly.


63-63: Confirmed: omitted turn_metrics (None) → system defaults; explicit [] → skip

ConversationProcessor calls MetricManager.resolve_metrics for each turn (src/lightspeed_evaluation/pipeline/evaluation/processor.py); turn_metrics is Optional[list] default None (src/lightspeed_evaluation/core/models/data.py); processor uses truthiness on resolved metrics so an explicit [] yields no metrics and is skipped; DataValidator only validates explicitly provided metrics (src/lightspeed_evaluation/core/system/validator.py).


57-58: Explicit turn-level skip supported (turn_metrics: [] → skip)

resolve_metrics returns [] for an explicit empty list; ConversationProcessor only evaluates turns when resolved turn metrics are truthy; EvaluationErrorHandler also resolves metrics before marking errors, so turns with turn_metrics: [] are neither evaluated nor marked ERROR — conversation-level metrics are handled separately and will still run if present. See MetricManager.resolve_metrics, ConversationProcessor.process_conversation, EvaluationErrorHandler.mark_all_metrics_as_error.

@asamal4 asamal4 force-pushed the turn-metric-override branch from 74f0ba7 to ed14fba on September 19, 2025 at 08:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)

50-81: Threshold type safety — confirm inputs are numeric (repeat of prior concern).

get_effective_threshold returns values as-is. If a YAML author quotes numbers ("0.8"), downstream score >= threshold may raise TypeError. You previously opted against defensive parsing for controlled configs; keeping as-is is fine. Please just verify no quoted thresholds exist in repo configs/tests.

#!/bin/bash
# Search for quoted numeric thresholds in YAML/JSON to avoid TypeError at runtime.
fd -t f -e yaml -e yml -e json | xargs rg -nP -C1 '\bthreshold:\s*"[0-9]+(\.[0-9]+)?"'
🧹 Nitpick comments (5)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

139-139: Consider documenting the float cast behavior.

The explicit float cast ensures consistent comparison behavior, but it would be helpful to add a comment explaining why this is necessary (e.g., to handle cases where threshold might be stored as an int in configs).

-        return "PASS" if score >= float(threshold) else "FAIL"
+        # Cast to float to ensure consistent numeric comparison (handles int thresholds)
+        return "PASS" if score >= float(threshold) else "FAIL"
README.md (2)

134-138: Conversation-level defaults not set; clarify intent.

As written, conversation_level has no default: true, so if conversation_metrics is omitted, no conversation metrics will run. If that’s intentional, add a brief note; otherwise, set a default in the example.

Apply one of the following doc tweaks:

   conversation_level:
     "deepeval:conversation_completeness":
       threshold: 0.8
       description: "How completely the conversation addresses user intentions"
+      # Note: Leave `default` unset if you want conversation metrics to be opt-in.
+      # default: true  # Uncomment to enable as a system default

240-247: Scope of “Metrics override behavior” — specify turn vs conversation.

The table reads as global but examples are turn-focused. Add a one-liner stating that the same semantics apply to conversation_metrics at conversation level.

 #### Metrics override behavior
@@
 | `["metric1", ...]`  | Use specified metrics only |
+
+Note: The same override semantics apply to conversation-level metrics via `conversation_metrics`
+and `conversation_metrics_metadata` on the conversation object.
src/lightspeed_evaluation/core/metrics/manager.py (2)

1-1: Tiny naming nit: “mapping” → “management.”

Module docstring can read “Metrics management for evaluation.” Not blocking.

-"""Metrics mapping for evaluation."""
+"""Metrics management for evaluation."""

105-114: Micro-simplification: list comprehension for defaults.

Functionally identical, a bit tighter.

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        return [
+            name for name, meta in metrics_metadata.items()
+            if meta.get("default", False)
+        ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74f0ba7 and ed14fba.

📒 Files selected for processing (10)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/conftest.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
  • README.md
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.603Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.603Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/config/test_models.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • TurnData (34-126)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/manager.py (2)
  • MetricLevel (10-14)
  • get_effective_threshold (50-80)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (34-126)
  • EvaluationData (129-166)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-147)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.12)
🔇 Additional comments (20)
tests/unit/core/config/test_models.py (6)

75-80: LGTM! Per-turn metrics properly wired.

The test correctly creates TurnData with per-turn metrics, aligning with the new data model where metrics are now specified at the turn level.


92-92: LGTM! Assertion correctly validates turn-level metrics.

The test properly asserts that turn_metrics are stored at the individual turn level, consistent with the shift from conversation-level to per-turn metrics.


103-103: LGTM! Test correctly handles None case for turn metrics.

The assertion properly validates that when turn_metrics is not provided, it remains None (indicating system defaults will be used).


118-123: LGTM! Test correctly validates turn-level metric validation.

The test properly validates that metrics without the required "framework:metric_name" format are rejected at the turn level.


127-133: LGTM! Turn metadata fields properly tested.

The test correctly validates both turn_metrics and turn_metrics_metadata at the turn level.


141-143: LGTM! Assertions properly validate per-turn metadata.

The test correctly asserts that turn_metrics_metadata is stored at the turn level within each turn object.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)

17-23: LGTM! Method signature properly updated for per-turn metrics.

The updated signature correctly accepts pre-resolved metrics lists, addressing the issue where defaults weren't being marked as ERROR during API failures. The separation of turn and conversation metrics aligns with the new per-turn architecture.


43-57: LGTM! Turn-level error marking properly handles per-turn metrics.

The implementation correctly iterates through resolved turn metrics and creates appropriate ERROR results with turn-specific data.


59-74: LGTM! Conversation-level error marking properly implemented.

The loop correctly handles conversation-level metrics by setting turn_id to None and empty query, properly distinguishing them from turn-level metrics.

config/evaluation_data.yaml (4)

18-26: LGTM! Per-turn metrics configuration properly structured.

The turn-specific metrics and metadata are correctly specified at the turn level, aligning with the new per-turn metrics architecture.


40-44: LGTM! Turn metrics properly specified for conv_group_2.

Different turn metrics are correctly applied to demonstrate the per-turn override capability.


58-58: LGTM! Skip semantics properly demonstrated.

The empty array [] correctly demonstrates the skip evaluation feature for specific turns.


63-63: LGTM! Default behavior properly documented.

The comment clearly explains that omitting turn_metrics (null) will use system defaults, which is valuable documentation.

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)

60-61: LGTM! API client properly stored as instance attribute.

The API client is correctly stored on self.api_client for use in the close() method.


56-73: LGTM! Components properly wired with MetricManager.

The MetricManager is correctly instantiated and passed to both the MetricsEvaluator and ProcessorComponents, ensuring consistent metric resolution across the pipeline.


171-172: LGTM! Resource cleanup properly implemented.

The close() method correctly checks for and closes the API client if it exists.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

21-26: LGTM! MetricManager properly integrated.

The MetricManager is correctly added as a constructor parameter and stored for threshold resolution.


87-95: LGTM! Threshold resolution properly uses MetricManager.

The code correctly determines the metric level based on the request type and delegates threshold resolution to the MetricManager with appropriate context (conv_data and turn_data).

src/lightspeed_evaluation/core/metrics/manager.py (1)

115-136: Counting semantics LGTM.

Turn-level resolution respects None → defaults and [] → skip; conversation-level counting is consistent with resolve_metrics. No issues.

README.md (1)

118-157: Mapping verified — README metrics_metadata.turn_level / conversation_level are correctly mapped to SystemConfig fields.

`loader.create_system_config` copies `config["metrics_metadata"].get("turn_level"/"conversation_level")` → `SystemConfig.default_turn_metrics_metadata` / `default_conversation_metrics_metadata` (src/lightspeed_evaluation/core/system/loader.py ~107–116); the loader also reconstructs `metrics_metadata` from `system_config.default*` before registering metrics (src/lightspeed_evaluation/core/system/loader.py ~30–43); and the metrics manager reads those fields to extract defaults (src/lightspeed_evaluation/core/metrics/manager.py ~101–113).

@asamal4
Collaborator Author

asamal4 commented Sep 19, 2025

@VladimirKadlec @tisnik PTAL

@asamal4 asamal4 force-pushed the turn-metric-override branch from ed14fba to ad15be3 on September 19, 2025 at 12:08
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

99-109: Fix None access for conversation-level metrics (AttributeError risk)

Accessing request.turn_data.response when turn_data is None will raise and incorrectly mark conversation metrics as ERROR.

-                query=request.turn_data.query if request.turn_data else "",
-                response=request.turn_data.response or "" if request.turn_data else "",
+                query=request.turn_data.query if request.turn_data else "",
+                response=(
+                    request.turn_data.response
+                    if (request.turn_data and request.turn_data.response)
+                    else ""
+                ),

118-133: Apply the same None-safe fix in error path

Mirror the safe access to avoid spurious errors while constructing ERROR results.

-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(
+                request.turn_data.response
+                if (request.turn_data and request.turn_data.response)
+                else ""
+            ),
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)

50-81: Threshold passthrough matches maintainer preference (no casting).

Previous suggestion to normalize threshold types was discussed and intentionally not adopted for this controlled config. No action.

🧹 Nitpick comments (12)
tests/conftest.py (1)

72-74: Set turn_metrics during construction to trigger validators

Assigning after model creation bypasses Pydantic validators unless validate_assignment=True. Set in the constructor instead.

-    # Add turn_metrics to the turn data
-    sample_turn_data.turn_metrics = ["ragas:faithfulness", "ragas:response_relevancy"]

Outside this hunk, update sample_turn_data fixture to include:

# in sample_turn_data() return
return TurnData(
    turn_id="1",
    query="What is Python?",
    response="Python is a high-level programming language.",
    contexts=[...],
    expected_response="Python is a high-level programming language used for various applications.",
    turn_metrics=["ragas:faithfulness", "ragas:response_relevancy"],
)
tests/unit/core/config/test_models.py (1)

84-85: Unify metric identifier naming for consistency

Elsewhere you use deepeval:conversation_completeness. Consider aligning here to reduce confusion in future refactors.

-            conversation_metrics=["deepeval:completeness"],
+            conversation_metrics=["deepeval:conversation_completeness"],
@@
-            conversation_metrics=["deepeval:completeness"],
-            conversation_metrics_metadata={"deepeval:completeness": {"threshold": 0.9}},
+            conversation_metrics=["deepeval:conversation_completeness"],
+            conversation_metrics_metadata={"deepeval:conversation_completeness": {"threshold": 0.9}},

Also applies to: 136-139

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

61-65: Guard against malformed metric identifiers

split(":", 1) will raise if the identifier lacks a colon, returning a generic error. Prefer a clear error message.

-            framework, metric_name = request.metric_identifier.split(":", 1)
+            if ":" not in request.metric_identifier:
+                execution_time = time.time() - start_time
+                return self._create_error_result(
+                    request,
+                    f"Invalid metric identifier '{request.metric_identifier}'. Expected 'framework:metric'.",
+                    execution_time,
+                )
+            framework, metric_name = request.metric_identifier.split(":", 1)

135-140: Status calc is fine; consider documenting 0.5 default

Defaulting threshold to 0.5 is a sensible fallback. Add a brief note in system docs to avoid surprises for binary metrics without thresholds.
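
As a reference point, a hedged sketch of the status rule being discussed, assuming the 0.5 fallback applies only when no threshold is configured at any level (names and values are illustrative, not the evaluator's actual code):

def status(score: float, threshold: float | None) -> str:
    """PASS/FAIL against the effective threshold; fall back to 0.5 when none is configured."""
    effective = 0.5 if threshold is None else float(threshold)
    return "PASS" if score >= effective else "FAIL"


print(status(1.0, None))   # PASS - binary metric with no threshold, 1.0 >= 0.5
print(status(0.0, None))   # FAIL - 0.0 < 0.5
print(status(0.85, 0.9))   # FAIL - an explicit threshold overrides the fallback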

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

43-45: Handle length mismatch between turns and resolved metrics

zip will silently drop extra turns or metrics. Add a guard to detect and log/raise on mismatch.

-        for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
+        if len(resolved_turn_metrics) != len(conv_data.turns):
+            logger.error(
+                "Resolved turn metrics length (%d) does not match number of turns (%d) "
+                "for conversation %s",
+                len(resolved_turn_metrics),
+                len(conv_data.turns),
+                conv_data.conversation_group_id,
+            )
+        for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

16-24: Good dependency containerization; consider immutability.

ProcessorComponents groups dependencies cleanly. Optionally freeze to prevent accidental runtime mutation.

Apply:

-@dataclass
+@dataclass(frozen=True)
 class ProcessorComponents:

50-55: Micro: avoid recomputing defaults per turn.

When many turns use defaults (turn_metrics is None), cache the default list once.

- resolved_turn_metrics = [
-     self.metric_manager.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     for turn_data in conv_data.turns
- ]
+ default_turn_metrics: list[str] | None = None
+ resolved_turn_metrics: list[list[str]] = []
+ for t in conv_data.turns:
+     if t.turn_metrics is None:
+         if default_turn_metrics is None:
+             default_turn_metrics = self.metric_manager.resolve_metrics(None, MetricLevel.TURN)
+         resolved_turn_metrics.append(default_turn_metrics)
+     else:
+         resolved_turn_metrics.append(
+             self.metric_manager.resolve_metrics(t.turn_metrics, MetricLevel.TURN)
+         )

71-77: Config guard is fine; nit on error message clarity.

Consider including config path or loader state in the message to aid debugging.

-raise ValueError("SystemConfig must be loaded")
+raise ValueError("SystemConfig must be loaded before processing conversations (call ConfigLoader.load_system_config).")

90-97: Improve turn log context with turn_id.

Turn index alone is less useful; include turn_id.

-logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
+logger.debug(
+    "Processing turn %d (%s) metrics: %s",
+    turn_idx,
+    turn_data.turn_id,
+    turn_metrics,
+)
src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: resolve_metrics semantics are solid; small doc tweak.

Docs say “use specified metrics from turn data” but this applies to both levels. Consider generic wording.

-        - [metrics...]: use specified metrics from turn data
+        - [metrics...]: use the explicitly provided metrics for the given level

115-137: Optional: precompute defaults during counting to avoid repeated extraction.

A micro-optimization for cases where many turns rely on defaults.

- total_turn_metrics = 0
- for turn_data in conv_data.turns:
-     turn_metrics = self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     total_turn_metrics += len(turn_metrics)
+ total_turn_metrics = 0
+ default_turn: list[str] | None = None
+ for turn_data in conv_data.turns:
+     if turn_data.turn_metrics is None:
+         if default_turn is None:
+             default_turn = self.resolve_metrics(None, MetricLevel.TURN)
+         total_turn_metrics += len(default_turn)
+     else:
+         total_turn_metrics += len(self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN))
src/lightspeed_evaluation/core/models/data.py (1)

13-31: Trim metric strings before validation/dedup.

Helps with YAML lists containing incidental spaces without increasing complexity.

-def _validate_and_deduplicate_metrics(
-    metrics: list[str], metric_type: str = "metric"
-) -> list[str]:
+def _validate_and_deduplicate_metrics(
+    metrics: list[str], metric_type: str = "metric"
+) -> list[str]:
     """Validate format and deduplicate metrics while preserving order."""
-    # Validate format first
-    for metric in metrics:
+    # Normalize and validate format first
+    normalized = [m.strip() for m in metrics]
+    for metric in normalized:
         if not metric or ":" not in metric:
             raise ValueError(
                 f'{metric_type} "{metric}" must be in format "framework:metric_name"'
             )
 
     # Deduplicate while preserving order
     seen = set()
     deduplicated = []
-    for metric in metrics:
+    for metric in normalized:
         if metric not in seen:
             deduplicated.append(metric)
             seen.add(metric)
     return deduplicated
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed14fba and ad15be3.

📒 Files selected for processing (12)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/core/system/validator.py
  • config/system.yaml
  • README.md
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/config/test_models.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.625Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.625Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (34-126)
  • EvaluationData (129-166)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • get_effective_threshold (50-80)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationRequest (225-278)
  • EvaluationResult (169-208)
  • EvaluationScope (211-222)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • TurnData (34-126)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-85)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (129-166)
  • TurnData (34-126)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: ruff
  • GitHub Check: black
🔇 Additional comments (24)
config/evaluation_data.yaml (4)

18-25: Per-turn metrics/metadata wired correctly; confirm 0.99 threshold intent

0.99 for ragas:faithfulness is extremely strict and will likely fail most turns. If this is intentional for a stress case, LGTM; otherwise consider 0.8–0.9.


27-44: Turn-level override migration looks good

Per-turn metrics on conv_group_2 are correctly scoped and will bypass system defaults for that turn. No issues spotted.


58-58: Explicit skip semantics are clear

turn_metrics: [] will skip evaluation for this turn as designed.


63-64: Turn-level default exists — no action needed

config/system.yaml defines a metric with default: true (line 37), so omitting turn_metrics will still evaluate that turn-level default.

tests/conftest.py (2)

167-169: Per-turn metrics in test data align with new model

Turn 1 explicitly overrides; Turn 2 uses None to test default resolution. LGTM assuming at least one default=true exists in system config.

Also applies to: 184-197


127-142: Incorrect — DataValidator intentionally skips turns without explicit turn_metrics

`Validator._check_metric_requirements` short-circuits when a turn has no turn_metrics (src/lightspeed_evaluation/core/system/validator.py: `if not turn_data.turn_metrics: continue`), and tests/conftest.py already includes a turn with `"turn_metrics": None` (around lines 195–196). Metrics-metadata fallback to system defaults is handled in src/lightspeed_evaluation/core/metrics/manager.py, but DataValidator is by design limited to validating only explicitly provided metrics, so changing the fixture's metrics_metadata defaults will not make the validator exercise a “defaulting” validation path.

Likely an incorrect or invalid review comment.

tests/unit/core/config/test_models.py (2)

75-80: TurnData per-turn metrics coverage is correct

Constructing with turn_metrics and asserting presence is aligned with the new API.

Also applies to: 92-93


118-123: Invalid metric format test is appropriate

This correctly targets the TurnData-level validator for "framework:metric_name" format.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

59-74: Conversation-level ERROR marking is correct

Generates ERROR rows for each resolved conversation metric with turn_id=None. Looks good.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

21-27: Constructor change — instantiations updated

MetricsEvaluator is instantiated with metric_manager in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py:63–65 (MetricsEvaluator(llm_manager, self.config_loader, metric_manager)).

src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)

50-59: Metric resolution moved upfront: LGTM.

Pre-resolving per-turn and conversation metrics centralizes logic and simplifies downstream code.


64-68: Confirm skipping also means no API amendment.

Behavior change: if no metrics are resolved (neither defaults nor explicit), the processor returns early and never amends data via the API. Ensure no downstream consumer expects amended responses even when evaluation is skipped.
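
For context, a minimal sketch of the behavior being flagged, assuming the early-return flow described here (the function and names are illustrative, not the actual ConversationProcessor):

def process_conversation(resolved_turn_metrics: list[list[str]],
                         resolved_conversation_metrics: list[str]) -> str:
    """Skip API amendment and evaluation entirely when nothing is left to evaluate."""
    nothing_to_evaluate = not resolved_conversation_metrics and all(
        not metrics for metrics in resolved_turn_metrics
    )
    if nothing_to_evaluate:
        return "skipped"   # early return: no API amendment happens on this path
    # ... amend data via the API, then evaluate per-turn and conversation metrics ...
    return "evaluated"


print(process_conversation([[], []], []))                      # skipped
print(process_conversation([["ragas:faithfulness"], []], []))  # evaluated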


81-86: API error path aligns with design (use pre-resolved metrics).

Passing resolved metrics into the error handler matches the upstream-resolution approach.


113-131: Turn evaluation loop is clean and correct.

Iterating resolved per-turn metrics and delegating to MetricsEvaluator is straightforward.


100-109: Conversation-level evaluation flow is correct.

Processes only when metrics present; consistent with per-turn logic.


145-148: Delegating summary to MetricManager is a good consolidation.

Removes duplicate counting logic from the processor.


29-39: Resolved — ConversationProcessor constructor call sites updated

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py now constructs ProcessorComponents and passes it into ConversationProcessor (lines 68–78). No other instantiations using the old multi-positional-arg signature were found.

src/lightspeed_evaluation/core/metrics/manager.py (2)

10-15: Enum introduction is clear and appropriately scoped.

TURN/CONVERSATION values are intuitive.


105-114: Default metric extraction is correct and order-preserving.

Relies on YAML insertion order; good for predictable evaluation ordering.

src/lightspeed_evaluation/core/models/data.py (5)

65-73: Per-turn metrics fields are well-scoped.

Aligns with per-turn override design.


75-81: Validator is concise and reuses helper.

Dedup + format check at the edge is good.


143-151: Conversation-level fields look correct.

Matches the new two-level metric model.


158-166: Conversation metrics validator mirrors turn-level: good consistency.

Keeps behavior uniform across levels.


225-279: Factory methods for EvaluationRequest are clean.

Turn/conversation helpers reduce boilerplate and set flags correctly.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit e25dec0 into lightspeed-core:main Sep 22, 2025
15 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Oct 22, 2025