Turn metric override #55
Conversation
Walkthrough

Moves metric configuration from conversation-level to per-turn, adds MetricManager and MetricLevel for centralized metric/threshold resolution, updates models/validation to per-turn metrics, and refactors the pipeline, evaluator, processor, error handling, tests, and docs to use per-turn metric resolution and framework routing.
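The central rule the summary describes (per-turn metrics of None, [], or an explicit list) can be pictured with a small sketch; the class name and constructor shape here are assumptions, and the real MetricManager in core/metrics/manager.py derives its defaults from SystemConfig rather than taking them directly:

```python
from enum import Enum
from typing import Optional


class MetricLevel(Enum):
    """Level at which a metric applies (names follow the walkthrough above)."""
    TURN = "turn"
    CONVERSATION = "conversation"


class MetricManagerSketch:
    """Toy stand-in for MetricManager: None -> defaults, [] -> skip, list -> as-is."""

    def __init__(self, defaults: dict[MetricLevel, list[str]]):
        # The real manager derives these defaults from SystemConfig metrics_metadata.
        self.defaults = defaults

    def resolve_metrics(self, metrics: Optional[list[str]], level: MetricLevel) -> list[str]:
        if metrics is None:          # omitted in YAML -> system defaults
            return list(self.defaults.get(level, []))
        return metrics               # [] skips evaluation; an explicit list is used as-is


manager = MetricManagerSketch({MetricLevel.TURN: ["ragas:response_relevancy"]})
print(manager.resolve_metrics(None, MetricLevel.TURN))                    # defaults
print(manager.resolve_metrics([], MetricLevel.TURN))                      # skip
print(manager.resolve_metrics(["ragas:faithfulness"], MetricLevel.TURN))  # explicit override
```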
Sequence Diagram(s)

sequenceDiagram
autonumber
actor Runner as Runner
participant Pipeline as EvaluationPipeline
participant Proc as ConversationProcessor
participant MM as MetricManager
participant Eval as MetricsEvaluator
participant H as Handler
Runner->>Pipeline: run(config, evaluation_data)
Pipeline->>Proc: start with ProcessorComponents (incl. MetricManager)
Proc->>MM: resolve_metrics(conv.conversation_metrics, CONVERSATION)
MM-->>Proc: resolved_conversation_metrics
loop for each turn
Proc->>MM: resolve_metrics(turn.turn_metrics, TURN)
MM-->>Proc: resolved_turn_metrics_for_turn
alt resolved_turn_metrics non-empty
Proc->>Eval: evaluate_turn(turn, resolved_turn_metrics_for_turn)
Eval->>MM: get_effective_threshold(metric_id, TURN, conv, turn)
MM-->>Eval: threshold
Eval->>H: route(framework, metric_name)
H-->>Eval: score/result
Eval-->>Proc: EvaluationResult
else skip turn
Note right of Proc: turn_metrics == [] → skip evaluation
end
end
alt resolved_conversation_metrics non-empty
Proc->>Eval: evaluate_conversation(conv, resolved_conversation_metrics)
Eval->>MM: get_effective_threshold(metric_id, CONVERSATION, conv)
MM-->>Eval: threshold
Eval->>H: route(framework, metric_name)
H-->>Eval: score/result
Eval-->>Proc: EvaluationResult
else skip conversation
Note right of Proc: no conversation metrics → skip
end
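In code, the wiring the diagram implies is roughly a components container handed from the pipeline to the processor; the field names below mirror the ProcessorComponents references later in this review, but treat the exact shape as an assumption:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ProcessorComponents:
    """Dependencies the pipeline assembles and hands to ConversationProcessor."""
    metrics_evaluator: Any  # MetricsEvaluator: routes framework:metric to a handler
    api_amender: Any        # APIDataAmender: fills in API responses before evaluation
    error_handler: Any      # EvaluationErrorHandler: marks metrics as ERROR on API failure
    metric_manager: Any     # MetricManager: resolves per-turn/conversation metrics and thresholds


class ConversationProcessorSketch:
    def __init__(self, components: ProcessorComponents) -> None:
        # Direct aliases keep call sites terse, as one of the nitpicks below suggests.
        self.metrics_evaluator = components.metrics_evaluator
        self.api_amender = components.api_amender
        self.error_handler = components.error_handler
        self.metric_manager = components.metric_manager
```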
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
78-85: API error path currently misses default metrics in ERROR results.

Because EvaluationErrorHandler iterates only explicit metrics on data, default metrics are not marked as ERROR after amendment failure. Align with the pipeline’s default resolution.
See proposed essential refactor in pipeline comment (inject MetricManager into EvaluationErrorHandler and resolve defaults there).
🧹 Nitpick comments (6)
src/lightspeed_evaluation/core/models/data.py (2)
54-65: Good format validation for turn metrics.

Optional enhancement: dedupe metrics to avoid double-evaluating the same metric. Example:

    @field_validator("turn_metrics")
    @classmethod
    def validate_turn_metrics(cls, v: Optional[list[str]]) -> Optional[list[str]]:
    @@
    -        return v
    +        if v:
    +            seen, uniq = set(), []
    +            for m in v:
    +                if m not in seen:
    +                    seen.add(m); uniq.append(m)
    +            return uniq
    +        return v
141-153: Conversation metrics validator reads well.

Consider applying the same dedupe idea here for symmetry.
tests/conftest.py (1)
127-135: Test system defaults may be empty; set one default=true.

MetricManager picks defaults via metadata.default==true. In this fixture, no turn metric has default=true, so turns with turn_metrics=None will resolve to no metrics (skip). If you intend to test default fallback, mark at least one metric as default.
Patch:
"metrics_metadata": { "turn_level": { "ragas:faithfulness": { "threshold": 0.8, "description": "How faithful the response is to the provided context", "default": False, }, "ragas:response_relevancy": { "threshold": 0.8, + "default": True, },Also applies to: 140-142, 167-169, 184-186, 195-197
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
21-27: Integrating MetricManager is solid; handle custom:tool_eval threshold None.

Per our prior learning, when custom:tool_eval has no threshold, passing should be score > 0 (not >= 0.5). Set a metric-specific fallback before determining status.
Minimal change:
    threshold = self.metric_manager.get_effective_threshold(
        request.metric_identifier, level, request.conv_data, request.turn_data
    )
    - status = self._determine_status(score, threshold)
    + # Metric-specific default thresholds
    + if threshold is None and request.metric_identifier == "custom:tool_eval":
    +     threshold = 0.0
    + status = self._determine_status(score, threshold)

Nit: clarify the response expression with parentheses for readability:

    - response=request.turn_data.response or "" if request.turn_data else "",
    + response=(request.turn_data.response or "") if request.turn_data else "",

Also applies to: 43-44, 87-95
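As a standalone illustration of the "score > 0 when no threshold is configured" rule for custom:tool_eval: the function name and the 0.5 global fallback below are assumptions, and the inline diff above takes the slightly different route of setting threshold = 0.0.

```python
from typing import Optional


def determine_status(metric_id: str, score: float, threshold: Optional[float]) -> str:
    """Illustrative status rule: tool_eval passes on any positive score when unconfigured."""
    if threshold is None:
        if metric_id == "custom:tool_eval":
            return "PASS" if score > 0 else "FAIL"  # lenient rule from the cited learning
        threshold = 0.5  # assumed global fallback for other metrics
    return "PASS" if score >= threshold else "FAIL"


assert determine_status("custom:tool_eval", 0.25, None) == "PASS"
assert determine_status("custom:tool_eval", 0.0, None) == "FAIL"
assert determine_status("ragas:faithfulness", 0.7, 0.8) == "FAIL"
```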
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
33-40: Drop redundant self.components to avoid duplication.

You store both self.components and individual aliases. Keeping only the aliases reduces indirection.
Apply:
    - self.components = components
    -
    - # Keep direct references for easier access
    - self.metrics_evaluator = components.metrics_evaluator
    + # Keep direct references for easier access
    + self.metrics_evaluator = components.metrics_evaluator
      self.api_amender = components.api_amender
      self.error_handler = components.error_handler
      self.metric_manager = components.metric_manager

src/lightspeed_evaluation/core/metrics/manager.py (1)
105-114: Minor tidy: list comprehension for defaults.

Equivalent, a bit tighter.

    - default_metrics = []
    - for metric_name, metadata in metrics_metadata.items():
    -     if metadata.get("default", False):  # default=false if not specified
    -         default_metrics.append(metric_name)
    - return default_metrics
    + return [
    +     name for name, meta in metrics_metadata.items()
    +     if meta.get("default", False)
    + ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- config/evaluation_data.yaml (3 hunks)
- config/system.yaml (3 hunks)
- src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/system/validator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
- tests/conftest.py (5 hunks)
- tests/unit/core/config/test_models.py (3 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(112-153)TurnData(13-109)src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
MetricLevel(10-14)MetricManager(17-136)resolve_metrics(24-48)count_metrics_for_conversation(115-136)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
EvaluationErrorHandler(10-77)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(65-118)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
TurnData(13-109)EvaluationData(112-153)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (3)
MetricLevel(10-14)MetricManager(17-136)get_effective_threshold(50-80)src/lightspeed_evaluation/core/models/data.py (3)
EvaluationRequest(212-265)EvaluationResult(156-195)EvaluationScope(198-209)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(65-118)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
src/lightspeed_evaluation/core/metrics/manager.py (1)
MetricManager(17-136)src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
ConversationProcessor(26-148)ProcessorComponents(17-23)src/lightspeed_evaluation/core/api/client.py (1)
close(232-235)
🪛 GitHub Actions: Ruff
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
[error] 61-61: F821 Undefined name 'api_client' in APIDataAmender construction. Use 'self.api_client' instead of 'api_client'.
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
[error] 61-61: Command 'make pylint' failed with exit code 2. Pylint: Undefined variable 'api_client' (undefined-variable).
🪛 GitHub Actions: Pyright
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
[error] 61-61: Step 'make pyright' failed: Pyright error: 'api_client' is not defined (reportUndefinedVariable).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (14)
config/system.yaml (1)
37-38: Defaults look correct; confirm intent for no conversation-level defaults
- Turn default enabling response_relevancy is good.
- Faithfulness explicitly non-default and context_relevance reorder are fine (no semantic change).
- conversation_completeness default=false means if conversation_metrics is None, no conversation-level metrics will run. Confirm this is intentional.
Also applies to: 39-42, 57-60, 75-75
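The default: true / default: false flags discussed above drive the fallback: roughly, the default metrics are whatever metadata entries carry the flag. A sketch of that behavior (not the project code; the dicts mirror the system.yaml entries reviewed here):

```python
from typing import Any


def extract_default_metrics(metrics_metadata: dict[str, dict[str, Any]]) -> list[str]:
    """Metrics whose metadata sets default: true; default is treated as false when absent."""
    return [name for name, meta in metrics_metadata.items() if meta.get("default", False)]


turn_level = {
    "ragas:response_relevancy": {"threshold": 0.8, "default": True},
    "ragas:faithfulness": {"threshold": 0.8, "default": False},
    "ragas:context_relevance": {"threshold": 0.8},  # no flag -> not a default
}
conversation_level = {
    "deepeval:conversation_completeness": {"threshold": 0.8, "default": False},
}

print(extract_default_metrics(turn_level))          # ['ragas:response_relevancy']
print(extract_default_metrics(conversation_level))  # [] -> conversation metrics run only when listed
```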
config/evaluation_data.yaml (1)
18-26: YAML updates align with per-turn overrides and defaults
- conv_group_1: per-turn metrics + per-turn threshold override look good.
- conv_group_2: swapped metric set is consistent with requirements.
- conv_group_3: [] to skip turn 1 and None to use system defaults for turn 2 are correct.
Please confirm that system defaults include at least one turn metric so conv_group_3 turn 2 actually evaluates.
Also applies to: 27-44, 58-59, 63-64
src/lightspeed_evaluation/core/models/data.py (2)
44-53: Per-turn metrics fields are well-defined
126-135: Conversation-level metadata relocation is consistent

tests/conftest.py (1)

72-74: Fixture now models per-turn metrics correctly

tests/unit/core/config/test_models.py (3)

75-81: Tests reflect per-turn metrics move

Also applies to: 92-92, 103-103

118-124: Invalid metric format check correctly shifted to TurnData

127-133: Per-turn metadata path covered; assertions look good

Also applies to: 141-141
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
171-172: LGTM: clean API client teardown.Gracefully closes the client if present.
src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
51-63: LGTM: fast-skip check correctly respects defaults and explicit skips.Using MetricManager to resolve per-turn and conversation-level metrics before work is a good optimization.
86-95: Turn evaluation routing looks solid.Resolves metrics per turn, logs, and delegates to evaluator. Clean separation.
96-104: Conversation-level routing is correct and consistent with turn-level path.No issues.
src/lightspeed_evaluation/core/metrics/manager.py (2)
24-49: LGTM: resolve_metrics semantics are clear and predictable.None → defaults, [] → skip, list → as-is. Matches the PR goals.
115-136: LGTM: counting uses resolved metrics, so totals include defaults and explicit skips.Good for summaries and preflight checks.
Force-pushed: f97e01c to 955fef2 (Compare)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
78-85: API-error path does not mark default metrics as ERROR.

mark_all_metrics_as_error only enumerates explicitly defined metrics (turn_data.turn_metrics / conv_data.conversation_metrics). When defaults apply (None), no ERROR results are produced, contradicting the log “marking all metrics as ERROR” and skewing summaries.
Proposed fix: pass resolved metrics to the error handler and mark those.
Apply this diff to use the precomputed values from the earlier refactor:
- if api_error_occurred: - logger.error("API error detected - marking all metrics as ERROR") - error_results = self.error_handler.mark_all_metrics_as_error( - conv_data, "API error during data amendment" - ) - return error_results + if api_error_occurred: + logger.error("API error detected - marking all metrics as ERROR") + # Build turn_id -> metrics map from precomputed resolution + turn_metrics_by_id = { + turn.turn_id: metrics for _, turn, metrics in resolved_turn_metrics + } + error_results = self.error_handler.mark_resolved_metrics_as_error( # new API + conv_data, + turn_metrics_by_id, + conversation_metrics_resolved, + "API error during data amendment", + ) + return error_resultsAdd this method to EvaluationErrorHandler (src/lightspeed_evaluation/pipeline/evaluation/errors.py):
def mark_resolved_metrics_as_error( self, conv_data: EvaluationData, resolved_turn_metrics: dict[str, list[str]], resolved_conversation_metrics: list[str], error_reason: str, ) -> list[EvaluationResult]: """Mark the provided resolved metrics as ERROR (separation of concerns: resolution happens upstream).""" results: list[EvaluationResult] = [] # Turn-level for turn in conv_data.turns: for metric_identifier in resolved_turn_metrics.get(turn.turn_id, []) or []: results.append( EvaluationResult( conversation_group_id=conv_data.conversation_group_id, turn_id=turn.turn_id, metric_identifier=metric_identifier, result="ERROR", score=None, threshold=None, reason=error_reason, query=turn.query, response="", execution_time=0.0, ) ) # Conversation-level for metric_identifier in resolved_conversation_metrics or []: results.append( EvaluationResult( conversation_group_id=conv_data.conversation_group_id, turn_id=None, metric_identifier=metric_identifier, result="ERROR", score=None, threshold=None, reason=error_reason, query="", response="", execution_time=0.0, ) ) self.results.extend(results) return resultsNote: This aligns with the retrieved learning that resolution occurs upstream; the handler remains a pure marker.
♻️ Duplicate comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
60-61: Fix undefined variable 'api_client' in APIDataAmender construction.

Use the instance attribute you just set. This is causing the test failures.
Apply this diff:
      self.api_client = self._create_api_client()
    - api_amender = APIDataAmender(api_client)
    + api_amender = APIDataAmender(self.api_client)
62-62: Pass metric_manager to EvaluationErrorHandler for proper default metric handling.

Based on the retrieved learning, metric resolution happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(). However, when API errors occur, the error handler needs to know which metrics to mark as ERROR, including defaults.

Update the error handler instantiation to accept the metric_manager:

    - error_handler = EvaluationErrorHandler()
    + error_handler = EvaluationErrorHandler(metric_manager)

Note: You'll also need to update src/lightspeed_evaluation/pipeline/evaluation/errors.py to accept and use the MetricManager for resolving metrics when marking errors.
🧹 Nitpick comments (10)
tests/conftest.py (1)
195-196: Consider testing all three metric override patterns.

The fixtures test explicit metrics (line 184) and None (line 195), but don't include a test case for an empty list [], which should skip evaluation. Consider adding a test fixture with turn_metrics: [] to ensure the skip behavior is properly tested:

        },
    +   {
    +       "turn_id": "3",
    +       "query": "Skip this turn",
    +       "response": "This turn should be skipped",
    +       "turn_metrics": [],  # Test skip behavior
    +       "turn_metrics_metadata": {},
    +   },
    ],

src/lightspeed_evaluation/core/metrics/manager.py (1)
105-113: Optional: stable, de-duplicated defaults.

If config ever contains duplicates or order needs to be stable, dedupe while preserving order.
Apply this diff:
- default_metrics = [] - for metric_name, metadata in metrics_metadata.items(): - if metadata.get("default", False): # default=false if not specified - default_metrics.append(metric_name) - return default_metrics + default_metrics: list[str] = [] + for metric_name, metadata in metrics_metadata.items(): + if metadata.get("default", False): # default=false if not specified + if metric_name not in default_metrics: + default_metrics.append(metric_name) + return default_metricssrc/lightspeed_evaluation/pipeline/evaluation/processor.py (8)
52-62: Avoid double-resolving metrics; precompute once.

resolve_metrics is called here and again below during processing. Precompute per-turn resolved metrics and reuse them for both the any() check and evaluation; a consolidated sketch of the resulting flow appears after these nitpicks.
Apply this diff:
- has_turn_metrics = any( - self.metric_manager.resolve_metrics( - turn_data.turn_metrics, MetricLevel.TURN - ) - for turn_data in conv_data.turns - ) + resolved_turn_metrics = [ + (idx, turn, self.metric_manager.resolve_metrics(turn.turn_metrics, MetricLevel.TURN)) + for idx, turn in enumerate(conv_data.turns) + ] + has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics) - has_conversation_metrics = bool( - self.metric_manager.resolve_metrics( - conv_data.conversation_metrics, MetricLevel.CONVERSATION - ) - ) + conversation_metrics_resolved = self.metric_manager.resolve_metrics( + conv_data.conversation_metrics, MetricLevel.CONVERSATION + ) + has_conversation_metrics = bool(conversation_metrics_resolved)
86-95: Reuse the pre-resolved turn metrics; avoid resolving again.

Minor perf/readability tidy-up.
Apply this diff:
- for turn_idx, turn_data in enumerate(conv_data.turns): - turn_metrics = self.metric_manager.resolve_metrics( - turn_data.turn_metrics, MetricLevel.TURN - ) - if turn_metrics: - logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics) - turn_results = self._evaluate_turn(conv_data, turn_idx, turn_data) + for turn_idx, turn_data, turn_metrics in resolved_turn_metrics: + if turn_metrics: + logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics) + turn_results = self._evaluate_turn( + conv_data, turn_idx, turn_data, turn_metrics + ) results.extend(turn_results)And change _evaluate_turn signature to accept pre-resolved metrics (see comment at Lines 110-121).
96-107: Reuse the pre-resolved conversation metrics.

Same duplication as turns; pass them through.
Apply this diff:
- conversation_metrics = self.metric_manager.resolve_metrics( - conv_data.conversation_metrics, MetricLevel.CONVERSATION - ) - if conversation_metrics: + if conversation_metrics_resolved: logger.debug( "Processing conversation-level metrics: %s", - conversation_metrics, + conversation_metrics_resolved, ) - conv_results = self._evaluate_conversation(conv_data) + conv_results = self._evaluate_conversation(conv_data, conversation_metrics_resolved) results.extend(conv_results)And update _evaluate_conversation to accept the list (see comment at Lines 129-137).
110-121: Simplify: drop redundant resolve and pass metrics in.

Avoid re-resolving and remove the “or []” guard (resolve_metrics already returns a list).
Apply this diff:
- def _evaluate_turn( - self, conv_data: EvaluationData, turn_idx: int, turn_data: TurnData - ) -> list[EvaluationResult]: + def _evaluate_turn( + self, + conv_data: EvaluationData, + turn_idx: int, + turn_data: TurnData, + turn_metrics: list[str], + ) -> list[EvaluationResult]: @@ - turn_metrics = self.metric_manager.resolve_metrics( - turn_data.turn_metrics, MetricLevel.TURN - ) - - for metric_identifier in turn_metrics or []: + for metric_identifier in turn_metrics: request = EvaluationRequest.for_turn( conv_data, metric_identifier, turn_idx, turn_data )
129-137: Mirror change for conversation-level path.

Pass pre-resolved metrics; remove redundant resolve.
Apply this diff:
- def _evaluate_conversation( - self, conv_data: EvaluationData - ) -> list[EvaluationResult]: + def _evaluate_conversation( + self, conv_data: EvaluationData, conversation_metrics: list[str] + ) -> list[EvaluationResult]: @@ - conversation_metrics = self.metric_manager.resolve_metrics( - conv_data.conversation_metrics, MetricLevel.CONVERSATION - ) - - for metric_identifier in conversation_metrics or []: + for metric_identifier in conversation_metrics: request = EvaluationRequest.for_conversation(conv_data, metric_identifier)
52-62: Nit: prefer explicit boolean in any() for readability.

any(list_obj) relies on list truthiness. Using bool(...) makes intent clearer. After the precompute refactor, this becomes:

    has_turn_metrics = any(bool(metrics) for _, _, metrics in resolved_turn_metrics)

80-83: Log message is misleading today.

It says “marking all metrics as ERROR,” but only explicitly defined metrics are marked. After implementing the resolved-metrics error path, the message becomes accurate.

70-77: Guard is correct but add a clearer failure message.

Raising ValueError on missing SystemConfig is right. Consider mentioning the expected call flow (ConfigLoader.load_system_config) to speed up diagnosis.

    - raise ValueError("SystemConfig must be loaded")
    + raise ValueError("SystemConfig must be loaded via ConfigLoader.load_system_config(...) before processing conversations")
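Taken together, the precompute suggestions above give process_conversation roughly this shape. This is a sketch under the assumption that _evaluate_turn and _evaluate_conversation accept pre-resolved metric lists; the local MetricLevel enum stands in for the real one.

```python
from enum import Enum
from typing import Any


class MetricLevel(Enum):
    TURN = "turn"
    CONVERSATION = "conversation"


def process_conversation(processor: Any, conv_data: Any) -> list[Any]:
    """Resolve metrics once per conversation, then reuse them for the skip check and evaluation."""
    resolved_turns = [
        (idx, turn, processor.metric_manager.resolve_metrics(turn.turn_metrics, MetricLevel.TURN))
        for idx, turn in enumerate(conv_data.turns)
    ]
    conversation_metrics = processor.metric_manager.resolve_metrics(
        conv_data.conversation_metrics, MetricLevel.CONVERSATION
    )

    if not any(metrics for _, _, metrics in resolved_turns) and not conversation_metrics:
        return []  # nothing to evaluate for this conversation

    results: list[Any] = []
    for idx, turn, metrics in resolved_turns:
        if metrics:
            results.extend(processor._evaluate_turn(conv_data, idx, turn, metrics))
    if conversation_metrics:
        results.extend(processor._evaluate_conversation(conv_data, conversation_metrics))
    return results
```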
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- README.md (3 hunks)
- config/evaluation_data.yaml (3 hunks)
- config/system.yaml (3 hunks)
- src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (3 hunks)
- src/lightspeed_evaluation/core/system/validator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (7 hunks)
- tests/conftest.py (5 hunks)
- tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- tests/unit/core/config/test_models.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/pipeline.pysrc/lightspeed_evaluation/pipeline/evaluation/processor.pysrc/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
README.md
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (1)
MetricManager(17-136)src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
ConversationProcessor(26-148)ProcessorComponents(17-23)src/lightspeed_evaluation/core/api/client.py (1)
close(232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (5)
src/lightspeed_evaluation/core/metrics/manager.py (4)
MetricLevel(10-14)MetricManager(17-136)resolve_metrics(24-48)count_metrics_for_conversation(115-136)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
EvaluationErrorHandler(10-77)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(112-153)TurnData(13-109)src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(195-219)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (12)
config/system.yaml (2)
37-38: LGTM! Clear default metric configuration.The addition of
default: trueforragas:response_relevancyanddefault: falseforragas:faithfulnessprovides clear semantics for turn-level metric resolution whenturn_metricsisnull. This aligns well with theMetricManager.resolve_metrics()implementation.Also applies to: 39-42, 57-59
75-75: LGTM! Consistent with turn-level defaults.Adding
default: falsefor conversation-level metrics maintains consistency with the turn-level approach.config/evaluation_data.yaml (2)
18-26: LGTM! Per-turn metrics correctly migrated.The metrics configuration has been properly moved from conversation group level to individual turns, implementing the per-turn override feature as intended.
40-43: resolve_metrics correctly handles None (defaults), [] (skip), and explicit lists — no changes required.Verified in src/lightspeed_evaluation/core/metrics/manager.py: None -> self._extract_default_metrics(level); [] -> return []; otherwise return metrics.
README.md (2)
183-196: Excellent documentation of the metric override behavior!The documentation clearly explains the three override patterns with a helpful table and concrete examples. This will be very useful for users understanding how to configure per-turn metrics.
Also applies to: 240-247
233-234: Documentation correctly reflects the new TurnData fields.The addition of
turn_metricsandturn_metrics_metadatato the Turn Data Fields table accurately documents the schema changes.tests/conftest.py (1)
72-73: Test fixtures correctly updated for per-turn metrics.The test fixtures have been properly updated to place
turn_metricsandturn_metrics_metadataat the turn level instead of the conversation level, aligning with the new data model.Also applies to: 167-168, 184-185, 195-196
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (2)
56-73: Clean integration of MetricManager into the pipeline.The addition of
MetricManagerand its integration throughProcessorComponentsprovides a clean architecture for handling per-turn metric overrides and defaults.
171-172: Good resource cleanup for API client.Properly closing the API client when present ensures resources are cleaned up.
src/lightspeed_evaluation/core/metrics/manager.py (2)
24-49: Resolve logic looks correct and matches PR intent.

None -> defaults, [] -> skip, list -> use as-is. Clear and predictable.

115-136: Counting logic is sound.

Accurately reflects effective metrics (including defaults) across turns and conversation.
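For reference, counting over resolved metrics amounts to something like the following. This is only a sketch; the actual count_metrics_for_conversation at lines 115-136 of manager.py may differ in detail.

```python
from enum import Enum
from typing import Any


class MetricLevel(Enum):
    TURN = "turn"
    CONVERSATION = "conversation"


def count_metrics_for_conversation(metric_manager: Any, conv_data: Any) -> int:
    """Totals reflect resolved metrics: defaults are included, turns with [] contribute zero."""
    turn_total = sum(
        len(metric_manager.resolve_metrics(turn.turn_metrics, MetricLevel.TURN))
        for turn in conv_data.turns
    )
    conversation_total = len(
        metric_manager.resolve_metrics(conv_data.conversation_metrics, MetricLevel.CONVERSATION)
    )
    return turn_total + conversation_total
```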
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
146-149: Summary delegation is correct.Using MetricManager.count_metrics_for_conversation keeps single source of truth for counts.
Force-pushed: 955fef2 to 74f0ba7 (Compare)
Actionable comments posted: 0
🧹 Nitpick comments (9)
README.md (3)
118-130: Clarify defaults and threshold semantics for custom metrics in system config.
- Consider documenting default/threshold for custom:tool_eval; currently only ragas metrics show default flags.
- If tool_eval passes when score > 0 (no threshold), call that out explicitly to avoid confusion with the global 0.5 fallback.
Apply within the YAML example:
metrics_metadata: turn_level: "ragas:response_relevancy": threshold: 0.8 description: "How relevant the response is to the question" default: true # Used by default when turn_metrics is null "ragas:faithfulness": threshold: 0.8 description: "How faithful the response is to the provided context" default: false # Only used when explicitly specified + + "custom:tool_eval": + # If threshold is omitted, tool_eval passes when score > 0 (documented behavior) + # Set threshold to null to use that behavior; otherwise provide an explicit value. + threshold: null + description: "Tool call evaluation comparing expected vs actual tool calls (regex for arguments)" + default: false
181-192: Note deduplication behavior and valid identifier format for turn_metrics.

Mention that duplicate metrics are deduplicated (order-preserving) and identifiers must be framework:metric_name; this helps users debug their YAML.
Add after the snippet:
- turn_metrics entries must be in the form framework:metric_name (e.g., ragas:faithfulness).
- Duplicates are removed while preserving the first occurrence’s order.
240-247: Nice override behavior table; add a note on conversation-level defaults.

Add a line indicating the same None/[] rules apply to conversation_metrics.
src/lightspeed_evaluation/core/models/data.py (2)
13-31: Harden metric ID validation and normalize inputs.

Currently the validator only checks that the value is non-empty and contains ":"; whitespace and empty sides can slip through, and duplicates with whitespace won’t dedup.
Apply:
def _validate_and_deduplicate_metrics( metrics: list[str], metric_type: str = "metric" ) -> list[str]: """Validate format and deduplicate metrics while preserving order.""" - # Validate format first - for metric in metrics: - if not metric or ":" not in metric: - raise ValueError( - f'{metric_type} "{metric}" must be in format "framework:metric_name"' - ) - - # Deduplicate while preserving order - seen = set() - deduplicated = [] - for metric in metrics: - if metric not in seen: - deduplicated.append(metric) - seen.add(metric) - return deduplicated + # Normalize, validate format, and deduplicate while preserving order + seen: set[str] = set() + deduped: list[str] = [] + for raw in metrics: + metric = raw.strip() + if not metric or ":" not in metric: + raise ValueError( + f'{metric_type} "{raw}" must be in format "framework:metric_name"' + ) + framework, name = metric.split(":", 1) + if not framework.strip() or not name.strip(): + raise ValueError( + f'{metric_type} "{raw}" must include both framework and metric name' + ) + if metric not in seen: + seen.add(metric) + deduped.append(metric) + return deduped
83-127: Ensure arguments is a dict in expected_tool_calls.

If provided, arguments may currently be a non-dict and get passed through silently.
Apply:
for j, tool_call in enumerate(sequence): if not isinstance(tool_call, dict): raise ValueError( f"Tool call {j} in sequence {i} must be a dictionary" ) @@ - # Validate required keys + # Validate required keys if "tool_name" not in tool_call: raise ValueError( f"Tool call {j} in sequence {i} missing required 'tool_name' field" ) @@ - # Ensure arguments field exists (can be empty dict) + # Ensure arguments field exists and is a dict (can be empty) + if "arguments" in tool_call and not isinstance( + tool_call["arguments"], dict + ): + raise ValueError( + f"Tool call {j} in sequence {i} field 'arguments' must be a dictionary" + ) validated_tool_call = { "tool_name": tool_call["tool_name"], "arguments": tool_call.get("arguments", {}), }src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
34-41: Optional: accept resolved metrics to avoid duplicate resolution.

Processor already resolves metrics; you could optionally pass them in to reduce duplication and keep resolution logic in one place.
Example signature change:
    -def mark_all_metrics_as_error(self, conv_data: EvaluationData, error_reason: str) -> list[EvaluationResult]:
    +def mark_all_metrics_as_error(
    +    self,
    +    conv_data: EvaluationData,
    +    error_reason: str,
    +    resolved_turn_metrics: list[list[str]] | None = None,
    +    resolved_conversation_metrics: list[str] | None = None,
    +) -> list[EvaluationResult]:

Then use the provided lists if not None; fall back to self.metric_manager.resolve_metrics otherwise.

Also applies to: 34-41, 55-59
config/evaluation_data.yaml (3)
23-25: Faithfulness threshold 0.99 is likely too brittle; consider relaxing.A 0.99 cut-off often flags acceptable answers as failures. Recommend 0.85–0.95 depending on your baseline; start at 0.90.
Apply this diff if acceptable:
turn_metrics_metadata: "ragas:faithfulness": - threshold: 0.99 + threshold: 0.90 # Less brittle; tune based on baseline runs
27-29: Group header change acknowledged. Ensure docs/examples match.

Since README updates are pending, please mirror this per‑turn configuration in the docs’ sample config.

40-44: LGTM on conv_group_2 per‑turn metrics.

These metrics require contexts and a reference; both are provided here.
If other turns reuse this set, consider a YAML anchor to avoid duplication. Example:
- turn_metrics: - - "ragas:context_recall" - - "ragas:context_relevance" - - "ragas:context_precision_with_reference" + turn_metrics: &with_ref_metrics + - "ragas:context_recall" + - "ragas:context_relevance" + - "ragas:context_precision_with_reference" # Reuse elsewhere with: # turn_metrics: *with_ref_metrics
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- README.md (3 hunks)
- config/evaluation_data.yaml (3 hunks)
- src/lightspeed_evaluation/core/models/data.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/pipeline.pysrc/lightspeed_evaluation/pipeline/evaluation/processor.pysrc/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
README.md
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/models/data.py
🧬 Code graph analysis (3)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (10)
src/lightspeed_evaluation/core/metrics/manager.py (1)
MetricManager(17-136)src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(129-166)EvaluationResult(169-208)src/lightspeed_evaluation/core/output/data_persistence.py (1)
save_evaluation_data(15-58)src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(65-118)src/lightspeed_evaluation/core/system/validator.py (1)
DataValidator(62-226)src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
EvaluationErrorHandler(11-83)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
ConversationProcessor(26-144)ProcessorComponents(17-23)src/lightspeed_evaluation/core/api/client.py (1)
close(232-235)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (4)
MetricLevel(10-14)MetricManager(17-136)resolve_metrics(24-48)count_metrics_for_conversation(115-136)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
EvaluationErrorHandler(11-83)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
src/lightspeed_evaluation/core/metrics/manager.py (3)
MetricLevel(10-14)MetricManager(17-136)resolve_metrics(24-48)src/lightspeed_evaluation/core/models/data.py (1)
EvaluationResult(169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (22)
README.md (2)
193-195: Good: explicit skip semantics documented.
Clear that [] skips evaluation for this turn; matches MetricManager.resolve_metrics behavior.
233-235: Turn metric fields table aligns with model changes.
Accurately reflects new TurnData fields.src/lightspeed_evaluation/core/models/data.py (4)
65-74: Turn metric fields addition looks correct.
Matches README and processing flow.
75-82: Validation hook is good; relies on shared helper.
With the proposed helper tightening, this will handle whitespace and empty parts.
143-151: Conversation metrics fields align with the new model.
Good separation from per-turn config.
158-166: Validator reuse is appropriate.
Once helper is hardened, conversation metric IDs will be consistently normalized.src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
14-17: Constructor now accepts MetricManager — good change.
Enables resolving defaults on API failure.
35-41: Resolves defaults before marking ERROR — correct behavior.
Ensures turns with turn_metrics: null and conversations with conversation_metrics: null get ERROR rows for default metrics.Also applies to: 54-71
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)
56-66: Wiring MetricManager into evaluator and error handler looks good.
Matches the new per-turn defaults/override flow.
60-63: Fixed APIDataAmender construction using self.api_client.
Resolves previous undefined variable issue.
67-73: ProcessorComponents container improves cohesion.
Nice consolidation of dependencies.
171-172: Graceful API client cleanup.
Close guarded by presence check; OK.src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)
16-24: ProcessorComponents dataclass is clean and explicit.
Good dependency injection point.
29-40: Constructor stores component aliases — readable.
Keeps call sites terse.
50-58: Pre-resolving metrics per turn + conversation is correct.
Enables early skip and consistent behavior.
61-68: Early exit when no metrics — good guardrail.
Prevents unnecessary API calls.
86-95: Turn evaluation loops over resolved sets — correct with overrides and skips.
Matches desired semantics for None vs [] vs explicit list.
98-105: Conversation metrics evaluation correctly gated.
Aligned with resolved defaults.
142-144: Metrics summary delegates to MetricManager — nice single source of truth.config/evaluation_data.yaml (3)
18-21: LGTM on per‑turn override for conv_group_1.Metrics align with fields present (contexts and expected_response exist), so these should evaluate cleanly.
63-63: Confirmed: omitted turn_metrics (None) → system defaults; explicit [] → skipConversationProcessor calls MetricManager.resolve_metrics for each turn (src/lightspeed_evaluation/pipeline/evaluation/processor.py); turn_metrics is Optional[list] default None (src/lightspeed_evaluation/core/models/data.py); processor uses truthiness on resolved metrics so an explicit [] yields no metrics and is skipped; DataValidator only validates explicitly provided metrics (src/lightspeed_evaluation/core/system/validator.py).
57-58: Explicit turn-level skip supported (turn_metrics: [] → skip)resolve_metrics returns [] for an explicit empty list; ConversationProcessor only evaluates turns when resolved turn metrics are truthy; EvaluationErrorHandler also resolves metrics before marking errors, so turns with turn_metrics: [] are neither evaluated nor marked ERROR — conversation-level metrics are handled separately and will still run if present. See MetricManager.resolve_metrics, ConversationProcessor.process_conversation, EvaluationErrorHandler.mark_all_metrics_as_error.
Force-pushed: 74f0ba7 to ed14fba (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)
50-81: Threshold type safety — confirm inputs are numeric (repeat of prior concern).

get_effective_threshold returns values as-is. If a YAML author quotes numbers ("0.8"), the downstream score >= threshold comparison may raise TypeError. You previously opted against defensive parsing for controlled configs; keeping as-is is fine. Please just verify no quoted thresholds exist in repo configs/tests.

    #!/bin/bash
    # Search for quoted numeric thresholds in YAML/JSON to avoid TypeError at runtime.
    fd -t f -e yaml -e yml -e json | xargs rg -nP -C1 '\bthreshold:\s*"[0-9]+(\.[0-9]+)?"'
🧹 Nitpick comments (5)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
139-139: Consider documenting the float cast behavior.

The explicit float cast ensures consistent comparison behavior, but it would be helpful to add a comment explaining why this is necessary (e.g., to handle cases where threshold might be stored as an int in configs).

    - return "PASS" if score >= float(threshold) else "FAIL"
    + # Cast to float to ensure consistent numeric comparison (handles int thresholds)
    + return "PASS" if score >= float(threshold) else "FAIL"

README.md (2)
134-138: Conversation-level defaults not set; clarify intent.

As written, conversation_level has no default: true, so if conversation_metrics is omitted, no conversation-level metrics will run. If that’s intentional, add a brief note; otherwise, set a default in the example. Apply one of the following doc tweaks:
conversation_level: "deepeval:conversation_completeness": threshold: 0.8 description: "How completely the conversation addresses user intentions" + # Note: Leave `default` unset if you want conversation metrics to be opt-in. + # default: true # Uncomment to enable as a system default
240-247: Scope of “Metrics override behavior” — specify turn vs conversation.

The table reads as global but the examples are turn-focused. Add a one-liner stating that the same semantics apply to conversation_metrics at conversation level.

    #### Metrics override behavior
    @@
    | `["metric1", ...]` | Use specified metrics only |
    +
    +Note: The same override semantics apply to conversation-level metrics via `conversation_metrics`
    +and `conversation_metrics_metadata` on the conversation object.

src/lightspeed_evaluation/core/metrics/manager.py (2)
1-1: Tiny naming nit: “mapping” → “management.”

The module docstring can read “Metrics management for evaluation.” Not blocking.

    -"""Metrics mapping for evaluation."""
    +"""Metrics management for evaluation."""
105-114: Micro-simplification: list comprehension for defaults.

Functionally identical, a bit tighter.

    - default_metrics = []
    - for metric_name, metadata in metrics_metadata.items():
    -     if metadata.get("default", False):  # default=false if not specified
    -         default_metrics.append(metric_name)
    - return default_metrics
    + return [
    +     name for name, meta in metrics_metadata.items()
    +     if meta.get("default", False)
    + ]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- README.md (3 hunks)
- config/evaluation_data.yaml (3 hunks)
- src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (4 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
- tests/conftest.py (5 hunks)
- tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tests/conftest.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.762Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.762Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.pysrc/lightspeed_evaluation/pipeline/evaluation/evaluator.pysrc/lightspeed_evaluation/pipeline/evaluation/pipeline.pysrc/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.pyREADME.md
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.603Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.603Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-18T23:59:37.010Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.010Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/core/config/test_models.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/core/metrics/manager.py (2)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(129-166)TurnData(34-126)src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(195-219)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
src/lightspeed_evaluation/core/metrics/manager.py (2)
MetricLevel(10-14)get_effective_threshold(50-80)
tests/unit/core/config/test_models.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
TurnData(34-126)EvaluationData(129-166)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (4)
src/lightspeed_evaluation/core/metrics/manager.py (1)
MetricManager(17-136)src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(18-143)src/lightspeed_evaluation/pipeline/evaluation/processor.py (2)
ConversationProcessor(26-147)ProcessorComponents(17-23)src/lightspeed_evaluation/core/api/client.py (1)
close(232-235)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
EvaluationData(129-166)EvaluationResult(169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: tests (3.12)
🔇 Additional comments (20)
tests/unit/core/config/test_models.py (6)
75-80: LGTM! Per-turn metrics properly wired.The test correctly creates TurnData with per-turn metrics, aligning with the new data model where metrics are now specified at the turn level.
92-92: LGTM! Assertion correctly validates turn-level metrics.The test properly asserts that turn_metrics are stored at the individual turn level, consistent with the shift from conversation-level to per-turn metrics.
103-103: LGTM! Test correctly handles None case for turn metrics.The assertion properly validates that when turn_metrics is not provided, it remains None (indicating system defaults will be used).
118-123: LGTM! Test correctly validates turn-level metric validation.The test properly validates that metrics without the required "framework:metric_name" format are rejected at the turn level.
127-133: LGTM! Turn metadata fields properly tested.The test correctly validates both turn_metrics and turn_metrics_metadata at the turn level.
141-143: LGTM! Assertions properly validate per-turn metadata.The test correctly asserts that turn_metrics_metadata is stored at the turn level within each turn object.
src/lightspeed_evaluation/pipeline/evaluation/errors.py (3)
17-23: LGTM! Method signature properly updated for per-turn metrics.The updated signature correctly accepts pre-resolved metrics lists, addressing the issue where defaults weren't being marked as ERROR during API failures. The separation of turn and conversation metrics aligns with the new per-turn architecture.
43-57: LGTM! Turn-level error marking properly handles per-turn metrics.The implementation correctly iterates through resolved turn metrics and creates appropriate ERROR results with turn-specific data.
59-74: LGTM! Conversation-level error marking properly implemented.The loop correctly handles conversation-level metrics by setting turn_id to None and empty query, properly distinguishing them from turn-level metrics.
config/evaluation_data.yaml (4)
18-26: LGTM! Per-turn metrics configuration properly structured.The turn-specific metrics and metadata are correctly specified at the turn level, aligning with the new per-turn metrics architecture.
40-44: LGTM! Turn metrics properly specified for conv_group_2.Different turn metrics are correctly applied to demonstrate the per-turn override capability.
58-58: LGTM! Skip semantics properly demonstrated.The empty array
[]correctly demonstrates the skip evaluation feature for specific turns.
63-63: LGTM! Default behavior properly documented.The comment clearly explains that omitting turn_metrics (null) will use system defaults, which is valuable documentation.
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3)
60-61: LGTM! API client properly stored as instance attribute.The API client is correctly stored on
self.api_clientfor use in the close() method.
56-73: LGTM! Components properly wired with MetricManager.The MetricManager is correctly instantiated and passed to both the MetricsEvaluator and ProcessorComponents, ensuring consistent metric resolution across the pipeline.
171-172: LGTM! Resource cleanup properly implemented.The close() method correctly checks for and closes the API client if it exists.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
21-26: LGTM! MetricManager properly integrated.The MetricManager is correctly added as a constructor parameter and stored for threshold resolution.
87-95: LGTM! Threshold resolution properly uses MetricManager.The code correctly determines the metric level based on the request type and delegates threshold resolution to the MetricManager with appropriate context (conv_data and turn_data).
src/lightspeed_evaluation/core/metrics/manager.py (1)
115-136: Counting semantics LGTM. Turn-level resolution respects `None` → defaults and `[]` → skip; conversation-level counting is consistent with `resolve_metrics`. No issues.
README.md (1)
118-157: Mapping verified — README `metrics_metadata.turn_level`/`conversation_level` are correctly mapped to SystemConfig fields.
`loader.create_system_config` copies `config["metrics_metadata"].get("turn_level"/"conversation_level")` → `SystemConfig.default_turn_metrics_metadata` / `default_conversation_metrics_metadata` (src/lightspeed_evaluation/core/system/loader.py ~107–116); the loader also reconstructs `metrics_metadata` from `system_config.default*` before registering metrics (src/lightspeed_evaluation/core/system/loader.py ~30–43); the metrics manager reads those fields to extract defaults (src/lightspeed_evaluation/core/metrics/manager.py ~101–113).
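Based on the line references above, the copy step can be pictured roughly as follows; this is a paraphrase of the described mapping, with a plain dict standing in for the SystemConfig model:

```python
# Sketch of the described mapping (not a verbatim excerpt from loader.py).
def build_default_metadata(config: dict) -> dict:
    metrics_metadata = config.get("metrics_metadata", {})
    return {
        "default_turn_metrics_metadata": metrics_metadata.get("turn_level", {}),
        "default_conversation_metrics_metadata": metrics_metadata.get("conversation_level", {}),
    }
```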
@VladimirKadlec @tisnik PTAL
update readme for metric override
ed14fba to ad15be3
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
99-109: Fix None access for conversation-level metrics (AttributeError risk)
Accessing request.turn_data.response when turn_data is None will raise and incorrectly mark conversation metrics as ERROR.

```diff
- query=request.turn_data.query if request.turn_data else "",
- response=request.turn_data.response or "" if request.turn_data else "",
+ query=request.turn_data.query if request.turn_data else "",
+ response=(
+     request.turn_data.response
+     if (request.turn_data and request.turn_data.response)
+     else ""
+ ),
```

118-133: Apply the same None-safe fix in error path
Mirror the safe access to avoid spurious errors while constructing ERROR results.

```diff
- query=request.turn_data.query if request.turn_data else "",
- response=request.turn_data.response or "" if request.turn_data else "",
+ query=request.turn_data.query if request.turn_data else "",
+ response=(
+     request.turn_data.response
+     if (request.turn_data and request.turn_data.response)
+     else ""
+ ),
```
♻️ Duplicate comments (1)
src/lightspeed_evaluation/core/metrics/manager.py (1)
50-81: Threshold passthrough matches maintainer preference (no casting). Previous suggestion to normalize threshold types was discussed and intentionally not adopted for this controlled config. No action.
🧹 Nitpick comments (12)
tests/conftest.py (1)
72-74: Set turn_metrics during construction to trigger validators
Assigning after model creation bypasses Pydantic validators unless validate_assignment=True. Set in the constructor instead.

```diff
- # Add turn_metrics to the turn data
- sample_turn_data.turn_metrics = ["ragas:faithfulness", "ragas:response_relevancy"]
```

Outside this hunk, update the sample_turn_data fixture to include:

```python
# in sample_turn_data() return
return TurnData(
    turn_id="1",
    query="What is Python?",
    response="Python is a high-level programming language.",
    contexts=[...],
    expected_response="Python is a high-level programming language used for various applications.",
    turn_metrics=["ragas:faithfulness", "ragas:response_relevancy"],
)
```
tests/unit/core/config/test_models.py (1)
84-85: Unify metric identifier naming for consistency
Elsewhere you use deepeval:conversation_completeness. Consider aligning here to reduce confusion in future refactors.

```diff
- conversation_metrics=["deepeval:completeness"],
+ conversation_metrics=["deepeval:conversation_completeness"],
@@
- conversation_metrics=["deepeval:completeness"],
- conversation_metrics_metadata={"deepeval:completeness": {"threshold": 0.9}},
+ conversation_metrics=["deepeval:conversation_completeness"],
+ conversation_metrics_metadata={"deepeval:conversation_completeness": {"threshold": 0.9}},
```

Also applies to: 136-139
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (2)
61-65: Guard against malformed metric identifiers
Unpacking the result of split(":", 1) raises ValueError when the identifier lacks a colon, surfacing only a generic error. Prefer a clear error message.

```diff
- framework, metric_name = request.metric_identifier.split(":", 1)
+ if ":" not in request.metric_identifier:
+     execution_time = time.time() - start_time
+     return self._create_error_result(
+         request,
+         f"Invalid metric identifier '{request.metric_identifier}'. Expected 'framework:metric'.",
+         execution_time,
+     )
+ framework, metric_name = request.metric_identifier.split(":", 1)
```
135-140: Status calc is fine; consider documenting the 0.5 default
Defaulting the threshold to 0.5 is a sensible fallback. Add a brief note in the system docs to avoid surprises for binary metrics without thresholds.
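For context, the fallback described here amounts to a pass/fail decision like the sketch below (an illustration of the 0.5 default, not the evaluator's exact code):

```python
def determine_status(score: float, threshold: float | None) -> str:
    """Sketch: with no configured threshold, binary metrics pass at score >= 0.5."""
    effective = 0.5 if threshold is None else threshold
    return "PASS" if score >= effective else "FAIL"

print(determine_status(1.0, None))  # PASS
print(determine_status(0.0, None))  # FAIL
```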
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
43-45: Handle length mismatch between turns and resolved metrics
zip will silently drop extra turns or metrics. Add a guard to detect and log/raise on mismatch.

```diff
- for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
+ if len(resolved_turn_metrics) != len(conv_data.turns):
+     logger.error(
+         "Resolved turn metrics length (%d) does not match number of turns (%d) "
+         "for conversation %s",
+         len(resolved_turn_metrics),
+         len(conv_data.turns),
+         conv_data.conversation_group_id,
+     )
+ for turn_data, turn_metrics in zip(conv_data.turns, resolved_turn_metrics):
```
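On Python 3.10+, `zip(..., strict=True)` is a lighter-weight alternative to an explicit length guard: it raises ValueError on a length mismatch instead of silently truncating. A small sketch with made-up data:

```python
turns = ["turn_1", "turn_2"]
resolved = [["ragas:faithfulness"], ["ragas:response_relevancy"], ["unexpected_extra"]]

try:
    for turn, metrics in zip(turns, resolved, strict=True):
        print(turn, metrics)
except ValueError as exc:
    # Raised because the two sequences differ in length.
    print(f"Length mismatch: {exc}")
```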
src/lightspeed_evaluation/pipeline/evaluation/processor.py (4)
16-24: Good dependency containerization; consider immutability. ProcessorComponents groups dependencies cleanly. Optionally freeze to prevent accidental runtime mutation.
Apply:

```diff
-@dataclass
+@dataclass(frozen=True)
 class ProcessorComponents:
```
50-55: Micro: avoid recomputing defaults per turn. When many turns use defaults (turn_metrics is None), cache the default list once.

```diff
- resolved_turn_metrics = [
-     self.metric_manager.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     for turn_data in conv_data.turns
- ]
+ default_turn_metrics: list[str] | None = None
+ resolved_turn_metrics: list[list[str]] = []
+ for t in conv_data.turns:
+     if t.turn_metrics is None:
+         if default_turn_metrics is None:
+             default_turn_metrics = self.metric_manager.resolve_metrics(None, MetricLevel.TURN)
+         resolved_turn_metrics.append(default_turn_metrics)
+     else:
+         resolved_turn_metrics.append(
+             self.metric_manager.resolve_metrics(t.turn_metrics, MetricLevel.TURN)
+         )
```
71-77: Config guard is fine; nit on error message clarity. Consider including config path or loader state in the message to aid debugging.

```diff
-raise ValueError("SystemConfig must be loaded")
+raise ValueError("SystemConfig must be loaded before processing conversations (call ConfigLoader.load_system_config).")
```
90-97: Improve turn log context with turn_id. Turn index alone is less useful; include turn_id.

```diff
-logger.debug("Processing turn %d metrics: %s", turn_idx, turn_metrics)
+logger.debug(
+    "Processing turn %d (%s) metrics: %s",
+    turn_idx,
+    turn_data.turn_id,
+    turn_metrics,
+)
```
src/lightspeed_evaluation/core/metrics/manager.py (2)
24-49: resolve_metrics semantics are solid; small doc tweak. Docs say "use specified metrics from turn data" but this applies to both levels. Consider generic wording.

```diff
- - [metrics...]: use specified metrics from turn data
+ - [metrics...]: use the explicitly provided metrics for the given level
```
115-137: Optional: precompute defaults during counting to avoid repeated extraction. Minor micro-optimization when many turns rely on defaults.

```diff
- total_turn_metrics = 0
- for turn_data in conv_data.turns:
-     turn_metrics = self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN)
-     total_turn_metrics += len(turn_metrics)
+ total_turn_metrics = 0
+ default_turn: list[str] | None = None
+ for turn_data in conv_data.turns:
+     if turn_data.turn_metrics is None:
+         if default_turn is None:
+             default_turn = self.resolve_metrics(None, MetricLevel.TURN)
+         total_turn_metrics += len(default_turn)
+     else:
+         total_turn_metrics += len(self.resolve_metrics(turn_data.turn_metrics, MetricLevel.TURN))
```
src/lightspeed_evaluation/core/models/data.py (1)
13-31: Trim metric strings before validation/dedup. Helps with YAML lists containing incidental spaces without increasing complexity.

```diff
-def _validate_and_deduplicate_metrics(
-    metrics: list[str], metric_type: str = "metric"
-) -> list[str]:
+def _validate_and_deduplicate_metrics(
+    metrics: list[str], metric_type: str = "metric"
+) -> list[str]:
     """Validate format and deduplicate metrics while preserving order."""
-    # Validate format first
-    for metric in metrics:
+    # Normalize and validate format first
+    normalized = [m.strip() for m in metrics]
+    for metric in normalized:
         if not metric or ":" not in metric:
             raise ValueError(
                 f'{metric_type} "{metric}" must be in format "framework:metric_name"'
             )

     # Deduplicate while preserving order
     seen = set()
     deduplicated = []
-    for metric in metrics:
+    for metric in normalized:
         if metric not in seen:
             deduplicated.append(metric)
             seen.add(metric)

     return deduplicated
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- README.md (3 hunks)
- config/evaluation_data.yaml (3 hunks)
- config/system.yaml (3 hunks)
- src/lightspeed_evaluation/core/metrics/manager.py (1 hunks)
- src/lightspeed_evaluation/core/models/data.py (4 hunks)
- src/lightspeed_evaluation/core/system/validator.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (5 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (3 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (6 hunks)
- tests/conftest.py (5 hunks)
- tests/unit/core/config/test_models.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
- src/lightspeed_evaluation/core/system/validator.py
- config/system.yaml
- README.md
🧰 Additional context used
🧠 Learnings (7)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py
- src/lightspeed_evaluation/core/metrics/manager.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- tests/unit/core/config/test_models.py
- src/lightspeed_evaluation/core/models/data.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-10T06:57:46.326Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#47
File: src/lightspeed_evaluation/pipeline/evaluation/evaluator.py:85-89
Timestamp: 2025-09-10T06:57:46.326Z
Learning: For binary metrics like custom:tool_eval, using an explicit threshold of 0.5 is preferred over None threshold with special case handling. This provides consistent behavior where 0.0 scores fail and 1.0 scores pass.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-09-19T01:26:54.625Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#55
File: src/lightspeed_evaluation/core/metrics/manager.py:50-81
Timestamp: 2025-09-19T01:26:54.625Z
Learning: In the lightspeed-evaluation codebase, prefer clean code over defensive programming for controlled configuration files like system.yaml. The maintainer asamal4 prefers not to add error handling for user configuration mistakes that are unlikely to occur in controlled internal settings, to avoid code complexity.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
📚 Learning: 2025-08-22T09:16:29.070Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#26
File: lsc_agent_eval/src/lsc_agent_eval/core/utils/api_client.py:124-151
Timestamp: 2025-08-22T09:16:29.070Z
Learning: In lsc_agent_eval project, the maintainer (asamal4) prefers reactive error handling - adding support for additional error response fields only when they occur in practice, rather than preemptively handling all possible error formats.
Applied to files:
src/lightspeed_evaluation/core/metrics/manager.py
🧬 Code graph analysis (5)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
- src/lightspeed_evaluation/core/models/data.py (2)
  - EvaluationData (129-166)
  - EvaluationResult (169-208)
tests/unit/core/config/test_models.py (1)
- src/lightspeed_evaluation/core/models/data.py (2)
  - TurnData (34-126)
  - EvaluationData (129-166)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (3)
- src/lightspeed_evaluation/core/metrics/manager.py (3)
  - MetricLevel (10-14)
  - MetricManager (17-136)
  - get_effective_threshold (50-80)
- src/lightspeed_evaluation/core/models/data.py (3)
  - EvaluationRequest (225-278)
  - EvaluationResult (169-208)
  - EvaluationScope (211-222)
- src/lightspeed_evaluation/core/system/loader.py (1)
  - ConfigLoader (65-118)
src/lightspeed_evaluation/core/metrics/manager.py (2)
- src/lightspeed_evaluation/core/models/data.py (2)
  - EvaluationData (129-166)
  - TurnData (34-126)
- src/lightspeed_evaluation/core/models/system.py (1)
  - SystemConfig (195-219)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (6)
- src/lightspeed_evaluation/core/metrics/manager.py (4)
  - MetricLevel (10-14)
  - MetricManager (17-136)
  - resolve_metrics (24-48)
  - count_metrics_for_conversation (115-136)
- src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
  - MetricsEvaluator (18-143)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  - APIDataAmender (13-80)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
  - EvaluationErrorHandler (10-85)
- src/lightspeed_evaluation/core/system/loader.py (1)
  - ConfigLoader (65-118)
- src/lightspeed_evaluation/core/models/data.py (3)
  - EvaluationData (129-166)
  - TurnData (34-126)
  - EvaluationResult (169-208)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: ruff
- GitHub Check: black
🔇 Additional comments (24)
config/evaluation_data.yaml (4)
18-25: Per-turn metrics/metadata wired correctly; confirm 0.99 threshold intent
0.99 for ragas:faithfulness is extremely strict and will likely fail most turns. If this is intentional for a stress case, LGTM; otherwise consider 0.8–0.9.
27-44: Turn-level override migration looks good
Per-turn metrics on conv_group_2 are correctly scoped and will bypass system defaults for that turn. No issues spotted.
58-58: Explicit skip semantics are clear
turn_metrics: [] will skip evaluation for this turn as designed.
63-64: Turn-level default exists — no action needed
config/system.yaml defines a metric with `default: true` (line 37), so omitting turn_metrics will still evaluate that turn-level default.
tests/conftest.py (2)
167-169: Per-turn metrics in test data align with new model
Turn 1 explicitly overrides; Turn 2 uses None to test default resolution. LGTM assuming at least one default=true exists in system config.
Also applies to: 184-197
127-142: Incorrect — DataValidator intentionally skips turns without explicit turn_metrics
Validator._check_metric_requirements short-circuits when a turn has no turn_metrics (src/lightspeed_evaluation/core/system/validator.py: `if not turn_data.turn_metrics: continue`), and tests/conftest.py already includes a turn with `"turn_metrics": None` (around lines 195–196). Metrics metadata fallback to system defaults is handled in src/lightspeed_evaluation/core/metrics/manager.py, but DataValidator is by-design limited to validating only explicitly provided metrics, so changing the fixture's metrics_metadata defaults will not make the validator exercise a "defaulting" validation path.
Likely an incorrect or invalid review comment.
tests/unit/core/config/test_models.py (2)
75-80: TurnData per-turn metrics coverage is correct
Constructing with turn_metrics and asserting presence is aligned with the new API.
Also applies to: 92-93
118-123: Invalid metric format test is appropriate
This correctly targets the TurnData-level validator for "framework:metric_name" format.
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
59-74: Conversation-level ERROR marking is correct
Generates ERROR rows for each resolved conversation metric with turn_id=None. Looks good.
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
21-27: Constructor change — instantiations updated
MetricsEvaluator is instantiated with metric_manager in src/lightspeed_evaluation/pipeline/evaluation/pipeline.py:63–65 (MetricsEvaluator(llm_manager, self.config_loader, metric_manager)).
src/lightspeed_evaluation/pipeline/evaluation/processor.py (7)
50-59: Metric resolution moved upfront: LGTM. Pre-resolving per-turn and conversation metrics centralizes logic and simplifies downstream code.
64-68: Confirm skipping also means no API amendment. Behavior change: if no metrics (defaults and explicit), you return early and never amend via API. Ensure no downstream expects amended responses even when evaluation is skipped.
81-86: API error path aligns with design (use pre-resolved metrics). Passing resolved metrics into the error handler matches the upstream-resolution approach.
113-131: Turn evaluation loop is clean and correct. Iterating resolved per-turn metrics and delegating to MetricsEvaluator is straightforward.
100-109: Conversation-level evaluation flow is correct. Processes only when metrics present; consistent with per-turn logic.
145-148: Delegating summary to MetricManager is a good consolidation. Removes duplicate counting logic from the processor.
29-39: Resolved — ConversationProcessor constructor call sites updated
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py now constructs ProcessorComponents and passes it into ConversationProcessor (lines 68–78). No other instantiations using the old multi-positional-arg signature were found.
src/lightspeed_evaluation/core/metrics/manager.py (2)
10-15: Enum introduction is clear and appropriately scoped. TURN/CONVERSATION values are intuitive.
105-114: Default metric extraction is correct and order-preserving. Relies on YAML insertion order; good for predictable evaluation ordering.
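Order-preserving extraction of `default: true` metrics can be illustrated like this; the metric entries are made up for the example, and the snippet relies on Python dicts (and YAML mappings loaded into them) preserving insertion order:

```python
def extract_default_metrics(metrics_metadata: dict) -> list[str]:
    """Collect metric ids flagged default: true, in the order they appear."""
    return [
        metric_id
        for metric_id, meta in metrics_metadata.items()
        if meta.get("default") is True
    ]

metadata = {
    "ragas:faithfulness": {"default": True, "threshold": 0.8},
    "ragas:response_relevancy": {"default": False},
    "deepeval:conversation_completeness": {"default": True},
}
print(extract_default_metrics(metadata))
# ['ragas:faithfulness', 'deepeval:conversation_completeness']
```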
src/lightspeed_evaluation/core/models/data.py (5)
65-73: Per-turn metrics fields are well-scoped. Aligns with per-turn override design.
75-81: Validator is concise and reuses helper. Dedup + format check at the edge is good.
143-151: Conversation-level fields look correct. Matches the new two-level metric model.
158-166: Conversation metrics validator mirrors turn-level: good consistency. Keeps behavior uniform across levels.
225-279: Factory methods for EvaluationRequest are clean. Turn/conversation helpers reduce boilerplate and set flags correctly.
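The factory-method pattern being praised here generally looks like the sketch below; the class and method names are stand-ins for illustration, not the actual EvaluationRequest signatures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:  # stand-in for EvaluationRequest
    metric_identifier: str
    conv_data: object
    turn_data: Optional[object] = None
    is_conversation: bool = False

    @classmethod
    def for_turn(cls, metric_identifier: str, conv_data, turn_data) -> "Request":
        # Turn-level requests carry the turn payload and keep the conversation flag off.
        return cls(metric_identifier, conv_data, turn_data, is_conversation=False)

    @classmethod
    def for_conversation(cls, metric_identifier: str, conv_data) -> "Request":
        # Conversation-level requests have no turn payload.
        return cls(metric_identifier, conv_data, turn_data=None, is_conversation=True)
```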
tisnik
left a comment
LGTM
Summary by CodeRabbit
New Features
Bug Fixes
Refactor
Configuration
Documentation
Tests