
@asamal4 asamal4 commented Sep 18, 2025

  • Use applicable default metrics when no override is given.
  • Allow overriding turn-level metrics so different metrics can be used where required (with de-duplication).
  • Minor refactoring.

Summary by CodeRabbit

  • New Features

    • Per-turn metric configuration with optional per-turn thresholds, explicit "skip turn" support, and a per-turn faithfulness threshold option (e.g., 0.99).
  • Bug Fixes

    • Improved error messages with conversation/turn context and a new error summary view.
  • Refactor

    • Centralized metric resolution and threshold handling; evaluation pipeline wired to use the centralized metric manager.
  • Configuration

    • Metrics reordered, default flags added for several turn/conversation metrics.
  • Documentation

    • Docs and examples updated to explain per-turn overrides, defaults, and behavior.
  • Tests

    • Fixtures and unit tests updated for per-turn metric model.


coderabbitai bot commented Sep 18, 2025

Walkthrough

Moves metric configuration from the conversation level to individual turns, adds MetricManager and MetricLevel for centralized metric/threshold resolution, updates models and validation for per-turn metrics, and refactors the pipeline, evaluator, processor, error handling, tests, and docs to use per-turn metric resolution and framework routing.
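
A minimal, self-contained Python sketch of the resolution rule described here (None → system defaults, [] → explicit skip, non-empty list → use as-is). The enum values and the metadata dict mirror the config summaries below but are assumptions for illustration, not the PR's actual MetricManager:

from enum import Enum
from typing import Optional


class MetricLevel(Enum):
    # Assumed values; the PR's enum may differ.
    TURN = "turn_level"
    CONVERSATION = "conversation_level"


# Assumed system metadata shaped like config/system.yaml: only metrics flagged
# default=true are used when no per-turn/conversation override is given.
METRICS_METADATA = {
    "turn_level": {
        "ragas:response_relevancy": {"threshold": 0.8, "default": True},
        "ragas:faithfulness": {"threshold": 0.8, "default": False},
    },
    "conversation_level": {
        "deepeval:conversation_completeness": {"threshold": 0.8, "default": False},
    },
}


def resolve_metrics(metrics: Optional[list[str]], level: MetricLevel) -> list[str]:
    """None -> system defaults, [] -> explicit skip, non-empty list -> use as-is."""
    if metrics is None:
        metadata = METRICS_METADATA[level.value]
        return [name for name, meta in metadata.items() if meta.get("default", False)]
    return metrics  # an explicit [] falls through here and means "skip"


print(resolve_metrics(None, MetricLevel.TURN))                    # ['ragas:response_relevancy']
print(resolve_metrics([], MetricLevel.TURN))                      # [] -> skip this turn
print(resolve_metrics(["ragas:faithfulness"], MetricLevel.TURN))  # explicit override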

Changes

  • Configs: Evaluation data (config/evaluation_data.yaml): Remove group-level turn_metrics/turn_metrics_metadata; add per-turn turn_metrics and turn_metrics_metadata entries (explicit lists and empty [] to skip); add a per-turn faithfulness threshold in one group.
  • Configs: System defaults (config/system.yaml): Add default flags to turn-level metrics (ragas:response_relevancy default true, ragas:faithfulness default false), reorder ragas:context_relevance, and set deepeval:conversation_completeness default false.
  • Core models (src/.../core/models/data.py): Add _validate_and_deduplicate_metrics; introduce TurnData.turn_metrics and turn_metrics_metadata with validators; remove turn_metrics/turn_metrics_metadata from EvaluationData; add conversation_metrics/conversation_metrics_metadata.
  • Metrics manager (new; src/.../core/metrics/manager.py): New MetricLevel enum and MetricManager for resolving metrics (defaults vs explicit vs skip), threshold precedence (turn → conv → system), and counting metrics per conversation; a threshold-precedence sketch follows this list.
  • System validator (src/.../core/system/validator.py): Validate metrics per turn by iterating data.turns; improve error messages to include conversation/turn context; enforce per-turn metric requirements.
  • Evaluator (src/.../pipeline/evaluation/evaluator.py): Inject MetricManager/MetricLevel; route metrics by framework to handlers; delegate threshold resolution to MetricManager; centralize error handling; update the constructor signature.
  • Processor & wiring (src/.../pipeline/evaluation/pipeline.py, src/.../pipeline/evaluation/processor.py): Add a ProcessorComponents dataclass; wire MetricManager into the pipeline; ConversationProcessor now accepts ProcessorComponents, resolves per-turn/conversation metrics up front, skips when resolved metrics are empty, and delegates summary counting to MetricManager.
  • Error handling (src/.../pipeline/evaluation/errors.py): mark_all_metrics_as_error now accepts resolved per-turn and conversation metric lists and emits ERROR results per resolved metric; add get_error_summary.
  • Tests & fixtures (tests/conftest.py, tests/unit/core/config/test_models.py): Move sample metrics from the conversation level into per-turn TurnData; update fixtures/tests to use TurnData.turn_metrics and turn_metrics_metadata; adjust to the new system metadata.
  • Docs (README.md): Update docs to show per-turn metric overrides, default semantics (null/empty/explicit), examples, and Turn Data Fields additions.
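
As a companion to the metrics manager entry in the list above, here is a free-standing sketch of the turn → conversation → system threshold precedence. The dict-based signature is an illustration only; the real get_effective_threshold works against the pipeline's data models:

from typing import Any, Optional


def get_effective_threshold(
    metric_id: str,
    system_metadata: dict[str, dict[str, Any]],
    conversation_metadata: Optional[dict[str, dict[str, Any]]] = None,
    turn_metadata: Optional[dict[str, dict[str, Any]]] = None,
) -> Optional[float]:
    """Most specific source wins: per-turn, then per-conversation, then system config."""
    for metadata in (turn_metadata, conversation_metadata, system_metadata):
        if metadata and metric_id in metadata:
            threshold = metadata[metric_id].get("threshold")
            if threshold is not None:
                return threshold
    return None  # caller decides the fallback (e.g. a global default, or score > 0 for tool_eval)


system = {"ragas:faithfulness": {"threshold": 0.8}}
per_turn = {"ragas:faithfulness": {"threshold": 0.99}}  # e.g. the override in evaluation_data.yaml
print(get_effective_threshold("ragas:faithfulness", system, turn_metadata=per_turn))  # 0.99
print(get_effective_threshold("ragas:faithfulness", system))                          # 0.8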

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Runner as Runner
  participant Pipeline as EvaluationPipeline
  participant Proc as ConversationProcessor
  participant MM as MetricManager
  participant Eval as MetricsEvaluator
  participant H as Handler

  Runner->>Pipeline: run(config, evaluation_data)
  Pipeline->>Proc: start with ProcessorComponents (incl. MetricManager)
  Proc->>MM: resolve_metrics(conv.conversation_metrics, CONVERSATION)
  MM-->>Proc: resolved_conversation_metrics
  loop for each turn
    Proc->>MM: resolve_metrics(turn.turn_metrics, TURN)
    MM-->>Proc: resolved_turn_metrics_for_turn
    alt resolved_turn_metrics non-empty
      Proc->>Eval: evaluate_turn(turn, resolved_turn_metrics_for_turn)
      Eval->>MM: get_effective_threshold(metric_id, TURN, conv, turn)
      MM-->>Eval: threshold
      Eval->>H: route(framework, metric_name)
      H-->>Eval: score/result
      Eval-->>Proc: EvaluationResult
    else skip turn
      Note right of Proc: turn_metrics == [] → skip evaluation
    end
  end
  alt resolved_conversation_metrics non-empty
    Proc->>Eval: evaluate_conversation(conv, resolved_conversation_metrics)
    Eval->>MM: get_effective_threshold(metric_id, CONVERSATION, conv)
    MM-->>Eval: threshold
    Eval->>H: route(framework, metric_name)
    H-->>Eval: score/result
    Eval-->>Proc: EvaluationResult
  else skip conversation
    Note right of Proc: no conversation metrics → skip
  end
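
For readers who prefer code to the diagram, the loop below is a minimal, runnable rendering of the same per-turn routing. The SimpleNamespace turns and the hard-coded default list stand in for the real TurnData objects and system config, so treat it as a sketch rather than the PR's processor:

from types import SimpleNamespace
from typing import Optional

DEFAULT_TURN_METRICS = ["ragas:response_relevancy"]  # assumed system default


def resolve(metrics: Optional[list[str]]) -> list[str]:
    """None -> defaults, [] -> skip, explicit list -> as-is (per the walkthrough)."""
    return DEFAULT_TURN_METRICS if metrics is None else metrics


turns = [
    SimpleNamespace(turn_id="1", turn_metrics=None),                    # falls back to defaults
    SimpleNamespace(turn_id="2", turn_metrics=[]),                      # explicit skip
    SimpleNamespace(turn_id="3", turn_metrics=["ragas:faithfulness"]),  # per-turn override
]

for turn in turns:
    metrics = resolve(turn.turn_metrics)
    if not metrics:
        print(f"turn {turn.turn_id}: skipped")
        continue
    for metric_id in metrics:
        # The real pipeline would route to the framework handler here and
        # compare the returned score against the effective threshold.
        print(f"turn {turn.turn_id}: evaluating {metric_id}")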

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • VladimirKadlec
  • tisnik

Poem

A nibble of metrics, a hop through the flow,
Defaults in clover, thresholds set just so.
Each turn a tunnel with rules to explore,
Manager thumps thrice — evaluate once more.
I twitch my whiskers — tests pass, carrots galore! 🥕🐇

Pre-merge checks

✅ Passed checks (3 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title "Turn metric override" succinctly captures the primary change: per-turn metric overrides with fallback to applicable defaults. It matches the PR objectives and is clear and concise for a reviewer scanning history.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which meets the required threshold of 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

78-85: API error path currently misses default metrics in ERROR results.

Because EvaluationErrorHandler iterates only explicit metrics on data, default metrics are not marked as ERROR after amendment failure. Align with the pipeline’s default resolution.

See proposed essential refactor in pipeline comment (inject MetricManager into EvaluationErrorHandler and resolve defaults there).

🧹 Nitpick comments (6)
src/lightspeed_evaluation/core/models/data.py (2)

54-65: Good format validation for turn metrics

Optional enhancement: dedupe metrics to avoid double-evaluating the same metric.

Example:

 @field_validator("turn_metrics")
 @classmethod
 def validate_turn_metrics(cls, v: Optional[list[str]]) -> Optional[list[str]]:
@@
-        return v
+        if v:
+            seen, uniq = set(), []
+            for m in v:
+                if m not in seen:
+                    seen.add(m); uniq.append(m)
+            return uniq
+        return v

141-153: Conversation metrics validator reads well

Consider applying the same dedupe idea here for symmetry.

tests/conftest.py (1)

127-135: Test system defaults may be empty; set one default=true

MetricManager picks defaults via metadata.default==true. In this fixture, no turn metric has default=true, so turns with turn_metrics=None will resolve to no metrics (skip). If you intend to test default fallback, mark at least one metric as default.

Patch:

         "metrics_metadata": {
             "turn_level": {
                 "ragas:faithfulness": {
                     "threshold": 0.8,
                     "description": "How faithful the response is to the provided context",
                     "default": False,
                 },
                 "ragas:response_relevancy": {
                     "threshold": 0.8,
+                    "default": True,
                 },

Also applies to: 140-142, 167-169, 184-186, 195-197

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

21-27: Integrating MetricManager is solid; handle custom:tool_eval threshold None

Per our prior learning, when custom:tool_eval has no threshold, passing should be score > 0 (not >= 0.5). Set a metric-specific fallback before determining status.

Minimal change:

             threshold = self.metric_manager.get_effective_threshold(
                 request.metric_identifier, level, request.conv_data, request.turn_data
             )
-            status = self._determine_status(score, threshold)
+            # Metric-specific default thresholds
+            if threshold is None and request.metric_identifier == "custom:tool_eval":
+                threshold = 0.0
+            status = self._determine_status(score, threshold)

Nit: clarify response expression with parentheses for readability:

-                response=request.turn_data.response or "" if request.turn_data else "",
+                response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 43-44, 87-95

src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

33-40: Drop redundant self.components to avoid duplication.

You store both self.components and individual aliases. Keeping only the aliases reduces indirection.

Apply:

-        self.components = components
-
-        # Keep direct references for easier access
-        self.metrics_evaluator = components.metrics_evaluator
+        # Keep direct references for easier access
+        self.metrics_evaluator = components.metrics_evaluator
         self.api_amender = components.api_amender
         self.error_handler = components.error_handler
         self.metric_manager = components.metric_manager
src/lightspeed_evaluation/core/metrics/manager.py (1)

105-114: Minor tidy: list comprehension for defaults.

Equivalent, a bit tighter.

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        return [
+            name for name, meta in metrics_metadata.items()
+            if meta.get("default", False)
+        ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between afa045b and f97e01c.

📒 Files selected for processing (11)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (112-153)
  • TurnData (13-109)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-77)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (13-109)
  • EvaluationData (112-153)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • get_effective_threshold (50-80)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationRequest (212-265)
  • EvaluationResult (156-195)
  • EvaluationScope (198-209)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-148)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
🪛 GitHub Actions: Ruff
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: F821 Undefined name 'api_client' in APIDataAmender construction. Use 'self.api_client' instead of 'api_client'.

🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: Command 'make pylint' failed with exit code 2. Pylint: Undefined variable 'api_client' (undefined-variable).

🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py

[error] 61-61: Step 'make pyright' failed: Pyright error: 'api_client' is not defined (reportUndefinedVariable).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (14)
config/system.yaml (1)

37-38: Defaults look correct; confirm intent for no conversation-level defaults

  • Turn default enabling response_relevancy is good.
  • Faithfulness explicitly non-default and context_relevance reorder are fine (no semantic change).
  • conversation_completeness default=false means if conversation_metrics is None, no conversation-level metrics will run. Confirm this is intentional.

Also applies to: 39-42, 57-60, 75-75

config/evaluation_data.yaml (1)

18-26: YAML updates align with per-turn overrides and defaults

  • conv_group_1: per-turn metrics + per-turn threshold override look good.
  • conv_group_2: swapped metric set is consistent with requirements.
  • conv_group_3: [] to skip turn 1 and None to use system defaults for turn 2 are correct.

Please confirm that system defaults include at least one turn metric so conv_group_3 turn 2 actually evaluates.

Also applies to: 27-44, 58-59, 63-64

src/lightspeed_evaluation/core/models/data.py (2)

44-53: Per-turn metrics fields are well-defined


126-135: Conversation-level metadata relocation is consistent

tests/conftest.py (1)

72-74: Fixture now models per-turn metrics correctly

tests/unit/core/config/test_models.py (3)

75-81: Tests reflect per-turn metrics move

Also applies to: 92-92, 103-103


118-124: Invalid metric format check correctly shifted to TurnData


127-133: Per-turn metadata path covered; assertions look good

Also applies to: 141-141

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)

171-172: LGTM: clean API client teardown.

Gracefully closes the client if present.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)

51-63: LGTM: fast-skip check correctly respects defaults and explicit skips.

Using MetricManager to resolve per-turn and conversation-level metrics before work is a good optimization.


86-95: Turn evaluation routing looks solid.

Resolves metrics per turn, logs, and delegates to evaluator. Clean separation.


96-104: Conversation-level routing is correct and consistent with turn-level path.

No issues.

src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: LGTM: resolve_metrics semantics are clear and predictable.

None → defaults, [] → skip, list → as-is. Matches the PR goals.


115-136: LGTM: counting uses resolved metrics, so totals include defaults and explicit skips.

Good for summaries and preflight checks.

@asamal4 asamal4 force-pushed the turn-metric-override branch from f97e01c to 955fef2 on September 19, 2025 at 01:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

78-85: API-error path does not mark default metrics as ERROR.

mark_all_metrics_as_error only enumerates explicitly defined metrics (turn_data.turn_metrics / conv_data.conversation_metrics). When defaults apply (None), no ERROR results are produced, contradicting the log “marking all metrics as ERROR” and skewing summaries.

Proposed fix: pass resolved metrics to the error handler and mark those.

Apply this diff to use the precomputed values from the earlier refactor:

-        if api_error_occurred:
-            logger.error("API error detected - marking all metrics as ERROR")
-            error_results = self.error_handler.mark_all_metrics_as_error(
-                conv_data, "API error during data amendment"
-            )
-            return error_results
+        if api_error_occurred:
+            logger.error("API error detected - marking all metrics as ERROR")
+            # Build turn_id -> metrics map from precomputed resolution
+            turn_metrics_by_id = {
+                turn.turn_id: metrics for _, turn, metrics in resolved_turn_metrics
+            }
+            error_results = self.error_handler.mark_resolved_metrics_as_error(  # new API
+                conv_data,
+                turn_metrics_by_id,
+                conversation_metrics_resolved,
+                "API error during data amendment",
+            )
+            return error_results

Add this method to EvaluationErrorHandler (src/lightspeed_evaluation/pipeline/evaluation/errors.py):

def mark_resolved_metrics_as_error(
    self,
    conv_data: EvaluationData,
    resolved_turn_metrics: dict[str, list[str]],
    resolved_conversation_metrics: list[str],
    error_reason: str,
) -> list[EvaluationResult]:
    """Mark the provided resolved metrics as ERROR (separation of concerns: resolution happens upstream)."""
    results: list[EvaluationResult] = []

    # Turn-level
    for turn in conv_data.turns:
        for metric_identifier in resolved_turn_metrics.get(turn.turn_id, []) or []:
            results.append(
                EvaluationResult(
                    conversation_group_id=conv_data.conversation_group_id,
                    turn_id=turn.turn_id,
                    metric_identifier=metric_identifier,
                    result="ERROR",
                    score=None,
                    threshold=None,
                    reason=error_reason,
                    query=turn.query,
                    response="",
                    execution_time=0.0,
                )
            )

    # Conversation-level
    for metric_identifier in resolved_conversation_metrics or []:
        results.append(
            EvaluationResult(
                conversation_group_id=conv_data.conversation_group_id,
                turn_id=None,
                metric_identifier=metric_identifier,
                result="ERROR",
                score=None,
                threshold=None,
                reason=error_reason,
                query="",
                response="",
                execution_time=0.0,
            )
        )

    self.results.extend(results)
    return results

Note: This aligns with the retrieved learning that resolution occurs upstream; the handler remains a pure marker.

♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

60-61: Fix undefined variable 'api_client' in APIDataAmender construction.

Use the instance attribute you just set. This is causing the test failures.

Apply this diff:

         self.api_client = self._create_api_client()
-        api_amender = APIDataAmender(api_client)
+        api_amender = APIDataAmender(self.api_client)

62-62: Pass metric_manager to EvaluationErrorHandler for proper default metric handling.

Based on the retrieved learning, metric resolution happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(). However, when API errors occur, the error handler needs to know which metrics to mark as ERROR, including defaults.

Update the error handler instantiation to accept the metric_manager:

-        error_handler = EvaluationErrorHandler()
+        error_handler = EvaluationErrorHandler(metric_manager)

Note: You'll also need to update src/lightspeed_evaluation/pipeline/evaluation/errors.py to accept and use the MetricManager for resolving metrics when marking errors.

🧹 Nitpick comments (10)
tests/conftest.py (1)

195-196: Consider testing all three metric override patterns.

The fixtures test explicit metrics (line 184) and None (line 195), but don't include a test case for empty list [] which should skip evaluation.

Consider adding a test fixture with turn_metrics: [] to ensure the skip behavior is properly tested:

                },
+               {
+                   "turn_id": "3",
+                   "query": "Skip this turn",
+                   "response": "This turn should be skipped",
+                   "turn_metrics": [],  # Test skip behavior
+                   "turn_metrics_metadata": {},
+               },
            ],
src/lightspeed_evaluation/core/metrics/manager.py (1)

105-113: Optional: stable, de-duplicated defaults.

If config ever contains duplicates or order needs to be stable, dedupe while preserving order.

Apply this diff:

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        default_metrics: list[str] = []
+        for metric_name, metadata in metrics_metadata.items():
+            if metadata.get("default", False):  # default=false if not specified
+                if metric_name not in default_metrics:
+                    default_metrics.append(metric_name)
+        return default_metrics
src/lightspeed_evaluation/pipeline/evaluation/processor.py (8)

52-62: Avoid double-resolving metrics; precompute once.

resolve_metrics is called here and again below during processing. Precompute per-turn resolved metrics and reuse for both the “any()” check and evaluation.

Apply this diff:

-        has_turn_metrics = any(
-            self.metric_manager.resolve_metrics(
-                turn_data.turn_metrics, MetricLevel.TURN
-            )
-            for turn_data in conv_data.turns
-        )
+        resolved_turn_metrics = [
+            (idx, turn, self.metric_manager.resolve_metrics(turn.turn_metrics, MetricLevel.TURN))
+            for idx, turn in enumerate(conv_data.turns)
+        ]
+        has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics)
-        has_conversation_metrics = bool(
-            self.metric_manager.resolve_metrics(
-                conv_data.conversation_metrics, MetricLevel.CONVERSATION
-            )
-        )
+        conversation_metrics_resolved = self.metric_manager.resolve_metrics(
+            conv_data.conversation_metrics, MetricLevel.CONVERSATION
+        )
+        has_conversation_metrics = bool(conversation_metrics_resolved)

86-95: Reuse the pre-resolved turn metrics; avoid resolving again.

Minor perf/readability tidy-up.

Apply this diff:

-        for turn_idx, turn_data in enumerate(conv_data.turns):
-            turn_metrics = self.metric_manager.resolve_metrics(
-                turn_data.turn_metrics, MetricLevel.TURN
-            )
-            if turn_metrics:
-                logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
-                turn_results = self._evaluate_turn(conv_data, turn_idx, turn_data)
+        for turn_idx, turn_data, turn_metrics in resolved_turn_metrics:
+            if turn_metrics:
+                logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
+                turn_results = self._evaluate_turn(
+                    conv_data, turn_idx, turn_data, turn_metrics
+                )
                 results.extend(turn_results)

And change _evaluate_turn signature to accept pre-resolved metrics (see comment at Lines 110-121).


96-107: Reuse the pre-resolved conversation metrics.

Same duplication as turns; pass them through.

Apply this diff:

-        conversation_metrics = self.metric_manager.resolve_metrics(
-            conv_data.conversation_metrics, MetricLevel.CONVERSATION
-        )
-        if conversation_metrics:
+        if conversation_metrics_resolved:
             logger.debug(
                 "Processing conversation-level metrics: %s",
-                conversation_metrics,
+                conversation_metrics_resolved,
             )
-            conv_results = self._evaluate_conversation(conv_data)
+            conv_results = self._evaluate_conversation(conv_data, conversation_metrics_resolved)
             results.extend(conv_results)

And update _evaluate_conversation to accept the list (see comment at Lines 129-137).


110-121: Simplify: drop redundant resolve and pass metrics in.

Avoid re-resolving and remove the “or []” guard (resolve_metrics already returns a list).

Apply this diff:

-    def _evaluate_turn(
-        self, conv_data: EvaluationData, turn_idx: int, turn_data: TurnData
-    ) -> list[EvaluationResult]:
+    def _evaluate_turn(
+        self,
+        conv_data: EvaluationData,
+        turn_idx: int,
+        turn_data: TurnData,
+        turn_metrics: list[str],
+    ) -> list[EvaluationResult]:
@@
-        turn_metrics = self.metric_manager.resolve_metrics(
-            turn_data.turn_metrics, MetricLevel.TURN
-        )
-
-        for metric_identifier in turn_metrics or []:
+        for metric_identifier in turn_metrics:
             request = EvaluationRequest.for_turn(
                 conv_data, metric_identifier, turn_idx, turn_data
             )

129-137: Mirror change for conversation-level path.

Pass pre-resolved metrics; remove redundant resolve.

Apply this diff:

-    def _evaluate_conversation(
-        self, conv_data: EvaluationData
-    ) -> list[EvaluationResult]:
+    def _evaluate_conversation(
+        self, conv_data: EvaluationData, conversation_metrics: list[str]
+    ) -> list[EvaluationResult]:
@@
-        conversation_metrics = self.metric_manager.resolve_metrics(
-            conv_data.conversation_metrics, MetricLevel.CONVERSATION
-        )
-
-        for metric_identifier in conversation_metrics or []:
+        for metric_identifier in conversation_metrics:
             request = EvaluationRequest.for_conversation(conv_data, metric_identifier)

52-62: Nit: prefer explicit boolean in any() for readability.

any(list_obj) relies on list truthiness. Using bool(...) makes intent clearer.

After the precompute refactor, this becomes:

has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics)

80-83: Log message is misleading today.

It says “marking all metrics as ERROR,” but only explicitly defined metrics are marked. After implementing the resolved-metrics error path, the message becomes accurate.


70-77: Guard is correct but add a clearer failure message.

Raising ValueError on missing SystemConfig is right. Consider mentioning the expected call flow (ConfigLoader.load_system_config) to speed up diagnosis.

-            raise ValueError("SystemConfig must be loaded")
+            raise ValueError("SystemConfig must be loaded via ConfigLoader.load_system_config(...) before processing conversations")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f97e01c and 955fef2.

📒 Files selected for processing (12)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (3 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • tests/unit/core/config/test_models.py
  • src/lightspeed_evaluation/core/system/validator.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-148)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-77)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (112-153)
  • TurnData (13-109)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (12)
config/system.yaml (2)

37-38: LGTM! Clear default metric configuration.

The addition of default: true for ragas:response_relevancy and default: false for ragas:faithfulness provides clear semantics for turn-level metric resolution when turn_metrics is null. This aligns well with the MetricManager.resolve_metrics() implementation.

Also applies to: 39-42, 57-59


75-75: LGTM! Consistent with turn-level defaults.

Adding default: false for conversation-level metrics maintains consistency with the turn-level approach.

config/evaluation_data.yaml (2)

18-26: LGTM! Per-turn metrics correctly migrated.

The metrics configuration has been properly moved from conversation group level to individual turns, implementing the per-turn override feature as intended.


40-43: resolve_metrics correctly handles None (defaults), [] (skip), and explicit lists — no changes required.

Verified in src/lightspeed_evaluation/core/metrics/manager.py: None -> self._extract_default_metrics(level); [] -> return []; otherwise return metrics.

README.md (2)

183-196: Excellent documentation of the metric override behavior!

The documentation clearly explains the three override patterns with a helpful table and concrete examples. This will be very useful for users understanding how to configure per-turn metrics.

Also applies to: 240-247


233-234: Documentation correctly reflects the new TurnData fields.

The addition of turn_metrics and turn_metrics_metadata to the Turn Data Fields table accurately documents the schema changes.

tests/conftest.py (1)

72-73: Test fixtures correctly updated for per-turn metrics.

The test fixtures have been properly updated to place turn_metrics and turn_metrics_metadata at the turn level instead of the conversation level, aligning with the new data model.

Also applies to: 167-168, 184-185, 195-196

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)

56-73: Clean integration of MetricManager into the pipeline.

The addition of MetricManager and its integration through ProcessorComponents provides a clean architecture for handling per-turn metric overrides and defaults.


171-172: Good resource cleanup for API client.

Properly closing the API client when present ensures resources are cleaned up.

src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: Resolve logic looks correct and matches PR intent.

None -> defaults, [] -> skip, list -> use as-is. Clear and predictable.


115-136: Counting logic is sound.

Accurately reflects effective metrics (including defaults) across turns and conversation.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

146-149: Summary delegation is correct.

Using MetricManager.count_metrics_for_conversation keeps single source of truth for counts.

@asamal4 asamal4 force-pushed the turn-metric-override branch from 955fef2 to 74f0ba7 on September 19, 2025 at 07:13

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (9)
README.md (3)

118-130: Clarify defaults and threshold semantics for custom metrics in system config.

  • Consider documenting default/threshold for custom:tool_eval; currently only ragas metrics show default flags.
  • If tool_eval passes when score > 0 (no threshold), call that out explicitly to avoid confusion with the global 0.5 fallback.

Apply within the YAML example:

   metrics_metadata:
     turn_level:
       "ragas:response_relevancy":
         threshold: 0.8
         description: "How relevant the response is to the question"
         default: true   # Used by default when turn_metrics is null

       "ragas:faithfulness":
         threshold: 0.8
         description: "How faithful the response is to the provided context"
         default: false  # Only used when explicitly specified
+
+      "custom:tool_eval":
+        # If threshold is omitted, tool_eval passes when score > 0 (documented behavior)
+        # Set threshold to null to use that behavior; otherwise provide an explicit value.
+        threshold: null
+        description: "Tool call evaluation comparing expected vs actual tool calls (regex for arguments)"
+        default: false

181-192: Note deduplication behavior and valid identifier format for turn_metrics.

Mention that duplicate metrics are deduplicated (order-preserving) and identifiers must be framework:metric_name; helps users debugging YAML.

Add after the snippet:

  • turn_metrics entries must be in the form framework:metric_name (e.g., ragas:faithfulness).
  • Duplicates are removed while preserving the first occurrence’s order.

240-247: Nice override behavior table; add a note on conversation-level defaults.

Add a line indicating the same None/[] rules apply to conversation_metrics.

src/lightspeed_evaluation/core/models/data.py (2)

13-31: Harden metric ID validation and normalize inputs.

Currently only checks for ":" and not empty string; whitespace and empty sides can slip through, and duplicates with whitespace won’t dedup.

Apply:

 def _validate_and_deduplicate_metrics(
     metrics: list[str], metric_type: str = "metric"
 ) -> list[str]:
     """Validate format and deduplicate metrics while preserving order."""
-    # Validate format first
-    for metric in metrics:
-        if not metric or ":" not in metric:
-            raise ValueError(
-                f'{metric_type} "{metric}" must be in format "framework:metric_name"'
-            )
-
-    # Deduplicate while preserving order
-    seen = set()
-    deduplicated = []
-    for metric in metrics:
-        if metric not in seen:
-            deduplicated.append(metric)
-            seen.add(metric)
-    return deduplicated
+    # Normalize, validate format, and deduplicate while preserving order
+    seen: set[str] = set()
+    deduped: list[str] = []
+    for raw in metrics:
+        metric = raw.strip()
+        if not metric or ":" not in metric:
+            raise ValueError(
+                f'{metric_type} "{raw}" must be in format "framework:metric_name"'
+            )
+        framework, name = metric.split(":", 1)
+        if not framework.strip() or not name.strip():
+            raise ValueError(
+                f'{metric_type} "{raw}" must include both framework and metric name'
+            )
+        if metric not in seen:
+            seen.add(metric)
+            deduped.append(metric)
+    return deduped

83-127: Ensure arguments is a dict in expected_tool_calls.

If provided, arguments may currently be non-dict and get passed through silently.

Apply:

             for j, tool_call in enumerate(sequence):
                 if not isinstance(tool_call, dict):
                     raise ValueError(
                         f"Tool call {j} in sequence {i} must be a dictionary"
                     )
@@
-                # Validate required keys
+                # Validate required keys
                 if "tool_name" not in tool_call:
                     raise ValueError(
                         f"Tool call {j} in sequence {i} missing required 'tool_name' field"
                     )
@@
-                # Ensure arguments field exists (can be empty dict)
+                # Ensure arguments field exists and is a dict (can be empty)
+                if "arguments" in tool_call and not isinstance(
+                    tool_call["arguments"], dict
+                ):
+                    raise ValueError(
+                        f"Tool call {j} in sequence {i} field 'arguments' must be a dictionary"
+                    )
                 validated_tool_call = {
                     "tool_name": tool_call["tool_name"],
                     "arguments": tool_call.get("arguments", {}),
                 }
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

34-41: Optional: accept resolved metrics to avoid duplicate resolution.

Processor already resolves metrics; you could optionally pass them in to reduce duplication and keep resolution logic in one place.

Example signature change:

-def mark_all_metrics_as_error(self, conv_data: EvaluationData, error_reason: str) -> list[EvaluationResult]:
+def mark_all_metrics_as_error(
+    self,
+    conv_data: EvaluationData,
+    error_reason: str,
+    resolved_turn_metrics: list[list[str]] | None = None,
+    resolved_conversation_metrics: list[str] | None = None,
+) -> list[EvaluationResult]:

Then use the provided lists if not None; fall back to self.metric_manager.resolve_metrics otherwise.

Also applies to: 55-59

config/evaluation_data.yaml (3)

23-25: Faithfulness threshold 0.99 is likely too brittle; consider relaxing.

A 0.99 cut-off often flags acceptable answers as failures. Recommend 0.85–0.95 depending on your baseline; start at 0.90.

Apply this diff if acceptable:

       turn_metrics_metadata:
         "ragas:faithfulness":
-          threshold: 0.99
+          threshold: 0.90  # Less brittle; tune based on baseline runs

27-29: Group header change acknowledged. Ensure docs/examples match.

Since README updates are pending, please mirror this per‑turn configuration in the docs’ sample config.


40-44: LGTM on conv_group_2 per‑turn metrics.

These metrics require contexts and a reference; both are provided here.

If other turns reuse this set, consider a YAML anchor to avoid duplication. Example:

-      turn_metrics:
-        - "ragas:context_recall"
-        - "ragas:context_relevance"
-        - "ragas:context_precision_with_reference"
+      turn_metrics: &with_ref_metrics
+        - "ragas:context_recall"
+        - "ragas:context_relevance"
+        - "ragas:context_precision_with_reference"
# Reuse elsewhere with:
# turn_metrics: *with_ref_metrics
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 955fef2 and 74f0ba7.

📒 Files selected for processing (7)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • README.md
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (10)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
src/lightspeed_evaluation/core/output/data_persistence.py (1)
  • save_evaluation_data (15-58)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/system/validator.py (1)
  • DataValidator (62-226)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (11-83)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-144)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (11-83)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
src/lightspeed_evaluation/core/models/data.py (1)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (22)
README.md (2)

193-195: Good: explicit skip semantics documented.
Clear that [] skips evaluation for this turn; matches MetricManager.resolve_metrics behavior.


233-235: Turn metric fields table aligns with model changes.
Accurately reflects new TurnData fields.

src/lightspeed_evaluation/core/models/data.py (4)

65-74: Turn metric fields addition looks correct.
Matches README and processing flow.


75-82: Validation hook is good; relies on shared helper.
With the proposed helper tightening, this will handle whitespace and empty parts.


143-151: Conversation metrics fields align with the new model.
Good separation from per-turn config.


158-166: Validator reuse is appropriate.
Once helper is hardened, conversation metric IDs will be consistently normalized.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)

14-17: Constructor now accepts MetricManager — good change.
Enables resolving defaults on API failure.


35-41: Resolves defaults before marking ERROR — correct behavior.
Ensures turns with turn_metrics: null and conversations with conversation_metrics: null get ERROR rows for default metrics.

Also applies to: 54-71

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)

56-66: Wiring MetricManager into evaluator and error handler looks good.
Matches the new per-turn defaults/override flow.


60-63: Fixed APIDataAmender construction using self.api_client.
Resolves previous undefined variable issue.


67-73: ProcessorComponents container improves cohesion.
Nice consolidation of dependencies.


171-172: Graceful API client cleanup.
Close guarded by presence check; OK.

src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)

16-24: ProcessorComponents dataclass is clean and explicit.
Good dependency injection point.


29-40: Constructor stores component aliases — readable.
Keeps call sites terse.


50-58: Pre-resolving metrics per turn + conversation is correct.
Enables early skip and consistent behavior.


61-68: Early exit when no metrics — good guardrail.
Prevents unnecessary API calls.


86-95: Turn evaluation loops over resolved sets — correct with overrides and skips.
Matches desired semantics for None vs [] vs explicit list.


98-105: Conversation metrics evaluation correctly gated.
Aligned with resolved defaults.


142-144: Metrics summary delegates to MetricManager — nice single source of truth.

config/evaluation_data.yaml (3)

18-21: LGTM on per‑turn override for conv_group_1.

Metrics align with fields present (contexts and expected_response exist), so these should evaluate cleanly.


63-63: Confirmed: omitted turn_metrics (None) → system defaults; explicit [] → skip

ConversationProcessor calls MetricManager.resolve_metrics for each turn (src/lightspeed_evaluation/pipeline/evaluation/processor.py); turn_metrics is Optional[list] default None (src/lightspeed_evaluation/core/models/data.py); processor uses truthiness on resolved metrics so an explicit [] yields no metrics and is skipped; DataValidator only validates explicitly provided metrics (src/lightspeed_evaluation/core/system/validator.py).


57-58: Explicit turn-level skip supported (turn_metrics: [] → skip)

resolve_metrics returns [] for an explicit empty list; ConversationProcessor only evaluates turns when resolved turn metrics are truthy; EvaluationErrorHandler also resolves metrics before marking errors, so turns with turn_metrics: [] are neither evaluated nor marked ERROR — conversation-level metrics are handled separately and will still run if present. See MetricManager.resolve_metrics, ConversationProcessor.process_conversation, EvaluationErrorHandler.mark_all_metrics_as_error.

@asamal4 asamal4 force-pushed the turn-metric-override branch from 74f0ba7 to ed14fba on September 19, 2025 at 08:07

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)

50-81: Threshold type safety — confirm inputs are numeric (repeat of prior concern).

get_effective_threshold returns values as-is. If a YAML author quotes numbers ("0.8"), downstream score >= threshold may raise TypeError. You previously opted against defensive parsing for controlled configs; keeping as-is is fine. Please just verify no quoted thresholds exist in repo configs/tests.

#!/bin/bash
# Search for quoted numeric thresholds in YAML/JSON to avoid TypeError at runtime.
fd -t f -e yaml -e yml -e json | xargs rg -nP -C1 '\bthreshold:\s*"[0-9]+(\.[0-9]+)?"'
🧹 Nitpick comments (5)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

139-139: Consider documenting the float cast behavior.

The explicit float cast ensures consistent comparison behavior, but it would be helpful to add a comment explaining why this is necessary (e.g., to handle cases where threshold might be stored as an int in configs).

-        return "PASS" if score >= float(threshold) else "FAIL"
+        # Cast to float to ensure consistent numeric comparison (handles int thresholds)
+        return "PASS" if score >= float(threshold) else "FAIL"
README.md (2)

134-138: Conversation-level defaults not set; clarify intent.

As written, conversation_level has no default: true, so if conversation_metrics is omitted, no conversation metrics will run. If that’s intentional, add a brief note; otherwise, set a default in the example.

Apply one of the following doc tweaks:

   conversation_level:
     "deepeval:conversation_completeness":
       threshold: 0.8
       description: "How completely the conversation addresses user intentions"
+      # Note: Leave `default` unset if you want conversation metrics to be opt-in.
+      # default: true  # Uncomment to enable as a system default

240-247: Scope of “Metrics override behavior” — specify turn vs conversation.

The table reads as global but examples are turn-focused. Add a one-liner stating that the same semantics apply to conversation_metrics at conversation level.

 #### Metrics override behavior
@@
 | `["metric1", ...]`  | Use specified metrics only |
+
+Note: The same override semantics apply to conversation-level metrics via `conversation_metrics`
+and `conversation_metrics_metadata` on the conversation object.
src/lightspeed_evaluation/core/metrics/manager.py (2)

1-1: Tiny naming nit: “mapping” → “management.”

Module docstring can read “Metrics management for evaluation.” Not blocking.

-"""Metrics mapping for evaluation."""
+"""Metrics management for evaluation."""

105-114: Micro-simplification: list comprehension for defaults.

Functionally identical, a bit tighter.

-        default_metrics = []
-        for metric_name, metadata in metrics_metadata.items():
-            if metadata.get("default", False):  # default=false if not specified
-                default_metrics.append(metric_name)
-        return default_metrics
+        return [
+            name for name, meta in metrics_metadata.items()
+            if meta.get("default", False)
+        ]
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74f0ba7 and ed14fba.

📒 Files selected for processing (10)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tests/conftest.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
  • src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
  • README.md
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.603Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.603Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/config/test_models.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • TurnData (34-126)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/manager.py (2)
  • MetricLevel (10-14)
  • get_effective_threshold (50-80)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (34-126)
  • EvaluationData (129-166)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (1)
  • MetricManager (17-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
  • ConversationProcessor (26-147)
  • ProcessorComponents (17-23)
src/lightspeed_evaluation/core/api/client.py (1)
  • close (232-235)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: tests (3.12)
🔇 Additional comments (20)
tests/unit/core/config/test_models.py (6)

75-80: LGTM! Per-turn metrics properly wired.

The test correctly creates TurnData with per-turn metrics, aligning with the new data model where metrics are now specified at the turn level.


92-92: LGTM! Assertion correctly validates turn-level metrics.

The test properly asserts that turn_metrics are stored at the individual turn level, consistent with the shift from conversation-level to per-turn metrics.


103-103: LGTM! Test correctly handles None case for turn metrics.

The assertion properly validates that when turn_metrics is not provided, it remains None (indicating system defaults will be used).


118-123: LGTM! Test correctly validates turn-level metric validation.

The test properly validates that metrics without the required "framework:metric_name" format are rejected at the turn level.


127-133: LGTM! Turn metadata fields properly tested.

The test correctly validates both turn_metrics and turn_metrics_metadata at the turn level.


141-143: LGTM! Assertions properly validate per-turn metadata.

The test correctly asserts that turn_metrics_metadata is stored at the turn level within each turn object.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)

17-23: LGTM! Method signature properly updated for per-turn metrics.

The updated signature correctly accepts pre-resolved metrics lists, addressing the issue where defaults weren't being marked as ERROR during API failures. The separation of turn and conversation metrics aligns with the new per-turn architecture.


43-57: LGTM! Turn-level error marking properly handles per-turn metrics.

The implementation correctly iterates through resolved turn metrics and creates appropriate ERROR results with turn-specific data.


59-74: LGTM! Conversation-level error marking properly implemented.

The loop correctly handles conversation-level metrics by setting turn_id to None and empty query, properly distinguishing them from turn-level metrics.

config/evaluation_data.yaml (4)

18-26: LGTM! Per-turn metrics configuration properly structured.

The turn-specific metrics and metadata are correctly specified at the turn level, aligning with the new per-turn metrics architecture.


40-44: LGTM! Turn metrics properly specified for conv_group_2.

Different turn metrics are correctly applied to demonstrate the per-turn override capability.


58-58: LGTM! Skip semantics properly demonstrated.

The empty array [] correctly demonstrates the skip evaluation feature for specific turns.


63-63: LGTM! Default behavior properly documented.

The comment clearly explains that omitting turn_metrics (null) will use system defaults, which is valuable documentation.

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)

60-61: LGTM! API client properly stored as instance attribute.

The API client is correctly stored on self.api_client for use in the close() method.


56-73: LGTM! Components properly wired with MetricManager.

The MetricManager is correctly instantiated and passed to both the MetricsEvaluator and ProcessorComponents, ensuring consistent metric resolution across the pipeline.


171-172: LGTM! Resource cleanup properly implemented.

The close() method correctly checks for and closes the API client if it exists.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

21-26: LGTM! MetricManager properly integrated.

The MetricManager is correctly added as a constructor parameter and stored for threshold resolution.


87-95: LGTM! Threshold resolution properly uses MetricManager.

The code correctly determines the metric level based on the request type and delegates threshold resolution to the MetricManager with appropriate context (conv_data and turn_data).

src/lightspeed_evaluation/core/metrics/manager.py (1)

115-136: Counting semantics LGTM.

Turn-level resolution respects None → defaults and [] → skip; conversation-level counting is consistent with resolve_metrics. No issues.

README.md (1)

118-157: Mapping verified — README metrics_metadata.turn_level / conversation_level are correctly mapped to SystemConfig fields.

`loader.create_system_config` copies `config["metrics_metadata"].get("turn_level"/"conversation_level")` → `SystemConfig.default_turn_metrics_metadata` / `default_conversation_metrics_metadata` (src/lightspeed_evaluation/core/system/loader.py ~107–116); the loader also reconstructs `metrics_metadata` from `system_config.default*` before registering metrics (src/lightspeed_evaluation/core/system/loader.py ~30–43); and the metrics manager reads those fields to extract defaults (src/lightspeed_evaluation/core/metrics/manager.py ~101–113).

@asamal4
Collaborator Author

asamal4 commented Sep 19, 2025

@VladimirKadlec @tisnik PTAL

@asamal4 asamal4 force-pushed the turn-metric-override branch from ed14fba to ad15be3 on September 19, 2025 at 12:08
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

99-109: Fix None access for conversation-level metrics (AttributeError risk)

Accessing request.turn_data.response when turn_data is None will raise and incorrectly mark conversation metrics as ERROR.

-                query=request.turn_data.query if request.turn_data else "",
-                response=request.turn_data.response or "" if request.turn_data else "",
+                query=request.turn_data.query if request.turn_data else "",
+                response=(
+                    request.turn_data.response
+                    if (request.turn_data and request.turn_data.response)
+                    else ""
+                ),

118-133: Apply the same None-safe fix in error path

Mirror the safe access to avoid spurious errors while constructing ERROR results.

-            query=request.turn_data.query if request.turn_data else "",
-            response=request.turn_data.response or "" if request.turn_data else "",
+            query=request.turn_data.query if request.turn_data else "",
+            response=(
+                request.turn_data.response
+                if (request.turn_data and request.turn_data.response)
+                else ""
+            ),
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)

50-81: Threshold passthrough matches maintainer preference (no casting).

Previous suggestion to normalize threshold types was discussed and intentionally not adopted for this controlled config. No action.

🧹 Nitpick comments (12)
tests/conftest.py (1)

72-74: Set turn_metrics during construction to trigger validators

Assigning after model creation bypasses Pydantic validators unless validate_assignment=True. Set in the constructor instead.

-    # Add turn_metrics to the turn data
-    sample_turn_data.turn_metrics = ["ragas:faithfulness", "ragas:response_relevancy"]

Outside this hunk, update sample_turn_data fixture to include:

# in sample_turn_data() return
return TurnData(
    turn_id="1",
    query="What is Python?",
    response="Python is a high-level programming language.",
    contexts=[...],
    expected_response="Python is a high-level programming language used for various applications.",
    turn_metrics=["ragas:faithfulness", "ragas:response_relevancy"],
)
tests/unit/core/config/test_models.py (1)

84-85: Unify metric identifier naming for consistency

Elsewhere you use deepeval:conversation_completeness. Consider aligning here to reduce confusion in future refactors.

-            conversation_metrics=["deepeval:completeness"],
+            conversation_metrics=["deepeval:conversation_completeness"],
@@
-            conversation_metrics=["deepeval:completeness"],
-            conversation_metrics_metadata={"deepeval:completeness": {"threshold": 0.9}},
+            conversation_metrics=["deepeval:conversation_completeness"],
+            conversation_metrics_metadata={"deepeval:conversation_completeness": {"threshold": 0.9}},

Also applies to: 136-139

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)

61-65: Guard against malformed metric identifiers

split(":", 1) will raise if the identifier lacks a colon, returning a generic error. Prefer a clear error message.

-            framework, metric_name = request.metric_identifier.split(":", 1)
+            if ":" not in request.metric_identifier:
+                execution_time = time.time() - start_time
+                return self._create_error_result(
+                    request,
+                    f"Invalid metric identifier '{request.metric_identifier}'. Expected 'framework:metric'.",
+                    execution_time,
+                )
+            framework, metric_name = request.metric_identifier.split(":", 1)

135-140: Status calc is fine; consider documenting 0.5 default

Defaulting threshold to 0.5 is a sensible fallback. Add a brief note in system docs to avoid surprises for binary metrics without thresholds.
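
As a reference point, a hedged sketch of the status rule being discussed, assuming the 0.5 fallback applies only when no threshold is configured at any level (names and values are illustrative, not the evaluator's actual code):

def status(score: float, threshold: float | None) -> str:
    """PASS/FAIL against the effective threshold; fall back to 0.5 when none is configured."""
    effective = 0.5 if threshold is None else float(threshold)
    return "PASS" if score >= effective else "FAIL"


print(status(1.0, None))   # PASS - binary metric with no threshold, 1.0 >= 0.5
print(status(0.0, None))   # FAIL - 0.0 < 0.5
print(status(0.85, 0.9))   # FAIL - an explicit threshold overrides the fallback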

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

43-45: Handle length mismatch between turns and resolved metrics

zip will silently drop extra turns or metrics. Add a guard to detect and log/raise on mismatch.

-        for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
+        if len(resolved_turn_metrics) != len(conv_data.turns):
+            logger.error(
+                "Resolved turn metrics length (%d) does not match number of turns (%d) "
+                "for conversation %s",
+                len(resolved_turn_metrics),
+                len(conv_data.turns),
+                conv_data.conversation_group_id,
+            )
+        for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)

16-24: Good dependency containerization; consider immutability.

ProcessorComponents groups dependencies cleanly. Optionally freeze to prevent accidental runtime mutation.

Apply:

-@dataclass
+@dataclass(frozen=True)
 class ProcessorComponents:

50-55: Micro: avoid recomputing defaults per turn.

When many turns use defaults (turn_metrics is None), cache the default list once.

- resolved_turn_metrics = [
-     self.metric_manager.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     for turn_data in conv_data.turns
- ]
+ default_turn_metrics: list[str] | None = None
+ resolved_turn_metrics: list[list[str]] = []
+ for t in conv_data.turns:
+     if t.turn_metrics is None:
+         if default_turn_metrics is None:
+             default_turn_metrics = self.metric_manager.resolve_metrics(None, MetricLevel.TURN)
+         resolved_turn_metrics.append(default_turn_metrics)
+     else:
+         resolved_turn_metrics.append(
+             self.metric_manager.resolve_metrics(t.turn_metrics, MetricLevel.TURN)
+         )

71-77: Config guard is fine; nit on error message clarity.

Consider including config path or loader state in the message to aid debugging.

-raise ValueError("SystemConfig must be loaded")
+raise ValueError("SystemConfig must be loaded before processing conversations (call ConfigLoader.load_system_config).")

90-97: Improve turn log context with turn_id.

Turn index alone is less useful; include turn_id.

-logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
+logger.debug(
+    "Processing turn %d (%s) metrics: %s",
+    turn_idx,
+    turn_data.turn_id,
+    turn_metrics,
+)
src/lightspeed_evaluation/core/metrics/manager.py (2)

24-49: resolve_metrics semantics are solid; small doc tweak.

Docs say “use specified metrics from turn data” but this applies to both levels. Consider generic wording.

-        - [metrics...]: use specified metrics from turn data
+        - [metrics...]: use the explicitly provided metrics for the given level

115-137: Optional: precompute defaults during counting to avoid repeated extraction.

A micro-optimization for cases where many turns rely on defaults.

- total_turn_metrics = 0
- for turn_data in conv_data.turns:
-     turn_metrics = self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     total_turn_metrics += len(turn_metrics)
+ total_turn_metrics = 0
+ default_turn: list[str] | None = None
+ for turn_data in conv_data.turns:
+     if turn_data.turn_metrics is None:
+         if default_turn is None:
+             default_turn = self.resolve_metrics(None, MetricLevel.TURN)
+         total_turn_metrics += len(default_turn)
+     else:
+         total_turn_metrics += len(self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN))
src/lightspeed_evaluation/core/models/data.py (1)

13-31: Trim metric strings before validation/dedup.

Helps with YAML lists containing incidental spaces without increasing complexity.

-def _validate_and_deduplicate_metrics(
-    metrics: list[str], metric_type: str = "metric"
-) -> list[str]:
+def _validate_and_deduplicate_metrics(
+    metrics: list[str], metric_type: str = "metric"
+) -> list[str]:
     """Validate format and deduplicate metrics while preserving order."""
-    # Validate format first
-    for metric in metrics:
+    # Normalize and validate format first
+    normalized = [m.strip() for m in metrics]
+    for metric in normalized:
         if not metric or ":" not in metric:
             raise ValueError(
                 f'{metric_type} "{metric}" must be in format "framework:metric_name"'
             )
 
     # Deduplicate while preserving order
     seen = set()
     deduplicated = []
-    for metric in metrics:
+    for metric in normalized:
         if metric not in seen:
             deduplicated.append(metric)
             seen.add(metric)
     return deduplicated
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ed14fba and ad15be3.

📒 Files selected for processing (12)
  • README.md (3 hunks)
  • config/evaluation_data.yaml (3 hunks)
  • config/system.yaml (3 hunks)
  • src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/models/data.py (4 hunks)
  • src/lightspeed_evaluation/core/system/validator.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
  • tests/conftest.py (5 hunks)
  • tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
  • src/lightspeed_evaluation/core/system/validator.py
  • config/system.yaml
  • README.md
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
  • src/lightspeed_evaluation/core/metrics/manager.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • tests/unit/core/config/test_models.py
  • src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.625Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.625Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.

Applied to files:

  • src/lightspeed_evaluation/core/metrics/manager.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • EvaluationResult (169-208)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (34-126)
  • EvaluationData (129-166)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • get_effective_threshold (50-80)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationRequest (225-278)
  • EvaluationResult (169-208)
  • EvaluationScope (211-222)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (129-166)
  • TurnData (34-126)
src/lightspeed_evaluation/core/models/system.py (1)
  • SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
src/lightspeed_evaluation/core/metrics/manager.py (4)
  • MetricLevel (10-14)
  • MetricManager (17-136)
  • resolve_metrics (24-48)
  • count_metrics_for_conversation (115-136)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  • MetricsEvaluator (18-143)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • APIDataAmender (13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  • EvaluationErrorHandler (10-85)
src/lightspeed_evaluation/core/system/loader.py (1)
  • ConfigLoader (65-118)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (129-166)
  • TurnData (34-126)
  • EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: ruff
  • GitHub Check: black
🔇 Additional comments (24)
config/evaluation_data.yaml (4)

18-25: Per-turn metrics/metadata wired correctly; confirm 0.99 threshold intent

0.99 for ragas:faithfulness is extremely strict and will likely fail most turns. If this is intentional for a stress case, LGTM; otherwise consider 0.8–0.9.


27-44: Turn-level override migration looks good

Per-turn metrics on conv_group_2 are correctly scoped and will bypass system defaults for that turn. No issues spotted.


58-58: Explicit skip semantics are clear

turn_metrics: [] will skip evaluation for this turn as designed.


63-64: Turn-level default exists — no action needed

config/system.yaml defines a metric with default: true (line 37), so omitting turn_metrics will still evaluate that turn-level default.

tests/conftest.py (2)

167-169: Per-turn metrics in test data align with new model

Turn 1 explicitly overrides; Turn 2 uses None to test default resolution. LGTM assuming at least one default=true exists in system config.

Also applies to: 184-197


127-142: Incorrect — DataValidator intentionally skips turns without explicit turn_metrics

`Validator._check_metric_requirements` short-circuits when a turn has no turn_metrics (src/lightspeed_evaluation/core/system/validator.py: `if not turn_data.turn_metrics: continue`), and tests/conftest.py already includes a turn with `"turn_metrics": None` (around lines 195–196). Metrics-metadata fallback to system defaults is handled in src/lightspeed_evaluation/core/metrics/manager.py, but DataValidator is by design limited to validating only explicitly provided metrics, so changing the fixture's metrics_metadata defaults will not make the validator exercise a “defaulting” validation path.

Likely an incorrect or invalid review comment.

tests/unit/core/config/test_models.py (2)

75-80: TurnData per-turn metrics coverage is correct

Constructing with turn_metrics and asserting presence is aligned with the new API.

Also applies to: 92-93


118-123: Invalid metric format test is appropriate

This correctly targets the TurnData-level validator for "framework:metric_name" format.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

59-74: Conversation-level ERROR marking is correct

Generates ERROR rows for each resolved conversation metric with turn_id=None. Looks good.

src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)

21-27: Constructor change — instantiations updated

MetricsEvaluator is instantiated with metric_manager in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py:63–65 (MetricsEvaluator(llm_manager, self.config_loader, metric_manager)).

src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)

50-59: Metric resolution moved upfront: LGTM.

Pre-resolving per-turn and conversation metrics centralizes logic and simplifies downstream code.


64-68: Confirm skipping also means no API amendment.

Behavior change: if no metrics are resolved (neither defaults nor explicit), the processor returns early and never amends data via the API. Ensure no downstream consumer expects amended responses even when evaluation is skipped.
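
For context, a minimal sketch of the behavior being flagged, assuming the early-return flow described here (the function and names are illustrative, not the actual ConversationProcessor):

def process_conversation(resolved_turn_metrics: list[list[str]],
                         resolved_conversation_metrics: list[str]) -> str:
    """Skip API amendment and evaluation entirely when nothing is left to evaluate."""
    nothing_to_evaluate = not resolved_conversation_metrics and all(
        not metrics for metrics in resolved_turn_metrics
    )
    if nothing_to_evaluate:
        return "skipped"   # early return: no API amendment happens on this path
    # ... amend data via the API, then evaluate per-turn and conversation metrics ...
    return "evaluated"


print(process_conversation([[], []], []))                      # skipped
print(process_conversation([["ragas:faithfulness"], []], []))  # evaluated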


81-86: API error path aligns with design (use pre-resolved metrics).

Passing resolved metrics into the error handler matches the upstream-resolution approach.


113-131: Turn evaluation loop is clean and correct.

Iterating resolved per-turn metrics and delegating to MetricsEvaluator is straightforward.


100-109: Conversation-level evaluation flow is correct.

Processes only when metrics present; consistent with per-turn logic.


145-148: Delegating summary to MetricManager is a good consolidation.

Removes duplicate counting logic from the processor.


29-39: Resolved — ConversationProcessor constructor call sites updated

src/lightspeed_evaluation/pipeline/evaluation/pipeline.py now constructs ProcessorComponents and passes it into ConversationProcessor (lines 68–78). No other instantiations using the old multi-positional-arg signature were found.

src/lightspeed_evaluation/core/metrics/manager.py (2)

10-15: Enum introduction is clear and appropriately scoped.

TURN/CONVERSATION values are intuitive.


105-114: Default metric extraction is correct and order-preserving.

Relies on YAML insertion order; good for predictable evaluation ordering.

src/lightspeed_evaluation/core/models/data.py (5)

65-73: Per-turn metrics fields are well-scoped.

Aligns with per-turn override design.


75-81: Validator is concise and reuses helper.

Dedup + format check at the edge is good.


143-151: Conversation-level fields look correct.

Matches the new two-level metric model.


158-166: Conversation metrics validator mirrors turn-level: good consistency.

Keeps behavior uniform across levels.


225-279: Factory methods for EvaluationRequest are clean.

Turn/conversation helpers reduce boilerplate and set flags correctly.

Contributor

@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit e25dec0 into lightspeed-core:main Sep 22, 2025
15 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Oct 22, 2025