add support for fail_on_invalid_data option #94
Conversation
Walkthrough
Adds config flag `fail_on_invalid_data`.
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config as Config
    participant Validator as DataValidator
    participant Data as Turn/Eval Data
    participant Processor as ConversationProcessor
    participant Evaluator as MetricsEvaluator
    Config->>Validator: __init__(api_enabled, fail_on_invalid_data)
    Validator->>Validator: validate(data)
    alt fail_on_invalid_data = true
        Validator-->>Config: return False on validation error
        Note over Validator: validation fails (hard)
    else fail_on_invalid_data = false
        Validator->>Data: add_invalid_metric(metric)
        Validator->>Validator: log notice, continue
    end
    Processor->>Data: is_metric_invalid(metric)?
    alt metric invalid
        Processor->>Processor: log error, emit ERROR result, skip metric
    else metric valid
        Processor->>Evaluator: evaluate(EvaluationRequest)
        Evaluator-->>Processor: result / None
    end
```
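To make the diagram concrete, here is a minimal Python sketch of the two halves of that flow. It is illustrative only: the class name, the `_metric_is_known` check, and the tuple-shaped results are hypothetical stand-ins rather than the project's actual DataValidator/ConversationProcessor code; only the `fail_on_invalid_data` flag and the `add_invalid_metric`/`is_metric_invalid` hooks come from this PR.

```python
import logging

logger = logging.getLogger(__name__)


class ValidatorSketch:
    """Hypothetical, simplified stand-in for the real DataValidator."""

    def __init__(self, api_enabled: bool, fail_on_invalid_data: bool = True):
        self.api_enabled = api_enabled
        self.fail_on_invalid_data = fail_on_invalid_data

    def validate(self, conversations) -> bool:
        for conv in conversations:
            for turn in conv.turns:
                for metric in turn.turn_metrics or []:
                    if self._metric_is_known(metric):
                        continue
                    if self.fail_on_invalid_data:
                        # Hard mode: the whole run is rejected.
                        logger.error("Invalid metric %r - validation fails", metric)
                        return False
                    # Soft mode: remember the bad metric on the turn and keep going.
                    turn.add_invalid_metric(metric)
                    logger.warning("Invalid metric %r - will surface as ERROR", metric)
        return True

    @staticmethod
    def _metric_is_known(metric: str) -> bool:
        # Placeholder check; the real validator applies richer rules.
        return ":" in metric


def evaluate_turn_metrics(turn, metrics, evaluator):
    """Processor-side sketch: emit ERROR for invalid metrics, evaluate the rest."""
    results = []
    for metric in metrics:
        if turn.is_metric_invalid(metric):
            logger.error("Invalid turn metric %r - emitting ERROR result", metric)
            results.append((metric, "ERROR"))
            continue
        results.append((metric, evaluator.evaluate_metric(metric, turn)))
    return results
```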
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Branch force-pushed from adef500 to 7eb05b7 (Compare).
Thanks @VladimirKadlec for working on this 👍 I agree with comments from Asutosh.
Branch force-pushed from 3a498c2 to 059c7c9 (Compare).
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/models/system.py (1)
265-268: Improve field description for clarity. The description has a minor grammatical issue (missing comma) and could be more explicit about the behavior in both states.
Consider this improved description:
```diff
 fail_on_invalid_data: bool = Field(
     default=True,
-    description="If False don't fail on invalid conversations",
+    description="When True, raise errors on validation failures; when False, log warnings and continue processing",
 )
```
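For reference, here is a self-contained sketch of how such a field behaves in a Pydantic model. The `ExampleValidationSettings` class name is hypothetical; the real field lives in src/lightspeed_evaluation/core/models/system.py.

```python
from pydantic import BaseModel, Field


class ExampleValidationSettings(BaseModel):
    """Hypothetical container used only to illustrate the flag's default and override."""

    fail_on_invalid_data: bool = Field(
        default=True,
        description=(
            "When True, raise errors on validation failures; "
            "when False, log warnings and continue processing"
        ),
    )


# Default: hard failure on invalid data.
print(ExampleValidationSettings().fail_on_invalid_data)  # True

# Overridden, e.g. from a parsed system.yaml section.
print(ExampleValidationSettings(fail_on_invalid_data=False).fail_on_invalid_data)  # False
```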
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- README.md (1 hunks)
- config/system.yaml (2 hunks)
- src/lightspeed_evaluation/core/models/data.py (2 hunks)
- src/lightspeed_evaluation/core/models/system.py (1 hunks)
- src/lightspeed_evaluation/core/system/validator.py (6 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (2 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (1 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- README.md
- config/system.yaml
- src/lightspeed_evaluation/runner/evaluation.py
- src/lightspeed_evaluation/core/system/validator.py
- src/lightspeed_evaluation/core/models/data.py
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
- src/lightspeed_evaluation/core/models/system.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
- tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-10T15:48:14.671Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:43-49
Timestamp: 2025-09-10T15:48:14.671Z
Learning: In the lightspeed-evaluation framework, system configuration uses Pydantic data models (SystemConfig, OutputConfig, LoggingConfig, etc.) rather than plain dictionaries. Components like OutputHandler receive properly structured Pydantic models, so direct attribute access (e.g., system_config.output.enabled_outputs) is the correct approach.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
- tests/unit/pipeline/evaluation/test_processor.py
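As a side note on the resolution behavior these learnings describe, the None-falls-back-to-defaults rule can be pictured as a one-line helper. This is a hypothetical sketch, not the actual MetricManager.resolve_metrics implementation:

```python
from typing import Optional


def resolve_metrics(turn_metrics: Optional[list[str]], system_defaults: list[str]) -> list[str]:
    # None means the user did not specify metrics for the turn,
    # so the system config defaults apply.
    return system_defaults if turn_metrics is None else turn_metrics


# Example: an unspecified turn picks up the defaults.
assert resolve_metrics(None, ["ragas:faithfulness"]) == ["ragas:faithfulness"]
```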
🧬 Code graph analysis (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
src/lightspeed_evaluation/core/system/validator.py (1)
DataValidator(79-331)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
src/lightspeed_evaluation/core/models/data.py (2)
is_metric_invalid(95-97), is_metric_invalid(355-357)
🔇 Additional comments (2)
src/lightspeed_evaluation/pipeline/evaluation/pipeline.py (1)
65-68: LGTM! The DataValidator initialization correctly passes the `fail_on_invalid_data` configuration parameter, aligning with the PR objective to add configurable validation behavior.
tests/unit/pipeline/evaluation/test_processor.py (1)
1-394: Excellent test coverage! The test suite comprehensively covers the new invalid metric handling functionality with well-structured test cases including:
- Valid metrics scenarios
- Invalid metric detection and skipping
- Mixed valid/invalid metrics
- Edge cases (empty metrics, evaluator returning None)
- Verification of correct request construction and metric ordering
The tests properly use fixtures, mocks, and log capture to validate behavior.
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (2 hunks)
- tests/unit/pipeline/evaluation/test_processor.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧰 Additional context used
📓 Path-based instructions (1)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest with pytest-mock (mocker fixture), not unittest.mock, for all mocking
Test files should use naming convention `test_*.py` for files, `test_*` for functions, and `Test*` for classes
Files:
tests/unit/pipeline/evaluation/test_processor.py
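To illustrate those guidelines, here is a minimal hypothetical test using the pytest-mock `mocker` fixture instead of unittest.mock (requires pytest and pytest-mock; the test body is not taken from this PR):

```python
# Naming follows the convention: test_*.py file, Test* class, test_* function.
class TestEvaluatorInvocation:
    def test_evaluate_metric_is_called_once(self, mocker):
        # Build a stand-in evaluator with pytest-mock's mocker fixture.
        evaluator = mocker.Mock()
        evaluator.evaluate_metric.return_value = "PASS"

        result = evaluator.evaluate_metric("ragas:faithfulness")

        assert result == "PASS"
        evaluator.evaluate_metric.assert_called_once_with("ragas:faithfulness")
```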
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
tests/unit/pipeline/evaluation/test_processor.py
🧬 Code graph analysis (1)
tests/unit/pipeline/evaluation/test_processor.py (8)
src/lightspeed_evaluation/core/models/data.py (8)
EvaluationData(309-367), EvaluationRequest(426-479), EvaluationResult(370-409), TurnData(35-306), add_invalid_metric(91-93), add_invalid_metric(351-353), is_metric_invalid(95-97), is_metric_invalid(355-357)
src/lightspeed_evaluation/core/models/system.py (1)
SystemConfig(271-301)
src/lightspeed_evaluation/core/system/loader.py (1)
ConfigLoader(70-125)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
ConversationProcessor(37-299), ProcessorComponents(27-34), _evaluate_turn(181-214)
src/lightspeed_evaluation/pipeline/evaluation/evaluator.py (1)
MetricsEvaluator(25-172)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
APIDataAmender(13-80)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
EvaluationErrorHandler(10-201)
src/lightspeed_evaluation/core/metrics/manager.py (1)
MetricManager(17-166)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
🔇 Additional comments (11)
tests/unit/pipeline/evaluation/test_processor.py (11)
1-19: LGTM! Imports are well-organized and appropriate for the test module. Correctly uses pytest as required by the coding guidelines.
21-121: LGTM! Fixtures are well-structured and follow pytest best practices. Correctly uses the `mocker` fixture from pytest-mock as required by coding guidelines. Good use of fixture composition for assembling the processor with all its dependencies.
126-150: LGTM! Comprehensive test of the happy path with valid metrics. Correctly verifies result count, types, evaluator invocation count, and metric identifier ordering.
152-184: LGTM! Well-structured test for invalid metric handling. Correctly verifies that invalid metrics produce ERROR results without invoking the evaluator, while valid metrics are still evaluated. Good use of `caplog` to verify error logging.
218-254: LGTM! Excellent test for mixed valid/invalid metrics. Correctly verifies that result order is preserved, invalid metrics produce ERROR results at the correct position, and only valid metrics trigger evaluator calls.
256-273: LGTM! Good edge case test for empty metrics list. Correctly verifies no results and no evaluator invocation.
275-299: LGTM! Thorough verification of EvaluationRequest construction. Correctly checks that all key fields (conv_data, metric_identifier, turn_id, turn_idx) are properly populated from the input data.
301-324: LGTM! Good test for handling evaluator returning None. Correctly reconfigures the mock and verifies graceful handling with empty results while confirming the evaluator is still invoked.
326-347: LGTM! Good test for turn index handling in multi-turn conversations. Correctly verifies that turn_idx and turn_id are properly set when evaluating a non-first turn.
349-374: LGTM! Excellent test for metric evaluation order preservation. Correctly verifies that metrics are evaluated in the order provided, which is important for consistent behavior.
376-401: LGTM! Comprehensive test for TurnData's invalid metric tracking. While this tests the data model directly rather than the processor, it's appropriately placed here since this functionality is central to the processor's behavior. Good coverage of initial state, additions, and idempotent behavior.
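For context on that last test, idempotent invalid-metric tracking can be implemented with a set. The following is a hedged sketch, not the repository's actual TurnData code; only the add_invalid_metric/is_metric_invalid names come from the PR.

```python
from dataclasses import dataclass, field


@dataclass
class TurnDataSketch:
    """Hypothetical simplification of TurnData's invalid-metric bookkeeping."""

    invalid_metrics: set[str] = field(default_factory=set)

    def add_invalid_metric(self, metric: str) -> None:
        # Adding the same metric twice is a no-op, which gives the
        # idempotent behavior the test above checks.
        self.invalid_metrics.add(metric)

    def is_metric_invalid(self, metric: str) -> bool:
        return metric in self.invalid_metrics


turn = TurnDataSketch()
assert not turn.is_metric_invalid("ragas:faithfulness")
turn.add_invalid_metric("ragas:faithfulness")
turn.add_invalid_metric("ragas:faithfulness")  # idempotent
assert turn.is_metric_invalid("ragas:faithfulness")
```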
```python
    def test_evaluate_turn_with_all_invalid_metrics(
        self, processor, mock_metrics_evaluator, caplog
    ):
        """Test _evaluate_turn with all metrics invalid - should return empty results."""
        turn_data = TurnData(
            turn_id="1",
            query="What is Python?",
            response="Python is a programming language.",
            contexts=["Context"],
        )
        conv_data = EvaluationData(conversation_group_id="test_conv", turns=[turn_data])

        # Mark all metrics as invalid
        turn_data.add_invalid_metric("ragas:faithfulness")
        turn_data.add_invalid_metric("custom:answer_correctness")

        turn_metrics = ["ragas:faithfulness", "custom:answer_correctness"]

        with caplog.at_level(logging.ERROR):
            results = processor._evaluate_turn(conv_data, 0, turn_data, turn_metrics)

        assert len(results) == 2
        assert results[0].result == "ERROR"
        assert results[1].result == "ERROR"

        # Verify evaluate_metric was never called
        assert mock_metrics_evaluator.evaluate_metric.call_count == 0

        # Verify errors were logged for both invalid metrics
        assert "Invalid turn metric 'ragas:faithfulness'" in caplog.text
        assert "Invalid turn metric 'custom:answer_correctness'" in caplog.text
```
Fix misleading docstring.
The docstring states "should return empty results" but the test correctly expects 2 ERROR results (one for each invalid metric). The test logic is correct—invalid metrics produce ERROR results rather than being omitted entirely.
Apply this diff to correct the docstring:
- """Test _evaluate_turn with all metrics invalid - should return empty results."""
+ """Test _evaluate_turn with all metrics invalid - should return ERROR results."""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    def test_evaluate_turn_with_all_invalid_metrics(
        self, processor, mock_metrics_evaluator, caplog
    ):
        """Test _evaluate_turn with all metrics invalid - should return ERROR results."""
        turn_data = TurnData(
            turn_id="1",
            query="What is Python?",
            response="Python is a programming language.",
            contexts=["Context"],
        )
        conv_data = EvaluationData(conversation_group_id="test_conv", turns=[turn_data])

        # Mark all metrics as invalid
        turn_data.add_invalid_metric("ragas:faithfulness")
        turn_data.add_invalid_metric("custom:answer_correctness")

        turn_metrics = ["ragas:faithfulness", "custom:answer_correctness"]

        with caplog.at_level(logging.ERROR):
            results = processor._evaluate_turn(conv_data, 0, turn_data, turn_metrics)

        assert len(results) == 2
        assert results[0].result == "ERROR"
        assert results[1].result == "ERROR"

        # Verify evaluate_metric was never called
        assert mock_metrics_evaluator.evaluate_metric.call_count == 0

        # Verify errors were logged for both invalid metrics
        assert "Invalid turn metric 'ragas:faithfulness'" in caplog.text
        assert "Invalid turn metric 'custom:answer_correctness'" in caplog.text
```
🤖 Prompt for AI Agents
tests/unit/pipeline/evaluation/test_processor.py lines 186-216: the test
docstring is misleading — it says "should return empty results" but the test
asserts two ERROR results for the two invalid metrics; update the docstring to
accurately state that invalid metrics produce ERROR results (e.g., "Test
_evaluate_turn with all invalid metrics - should return ERROR results for each
invalid metric.") and ensure wording matches the assertions.
Now the changes seem to be aligned. Thank you.
LGTM
There is a minor docstring issue though (already highlighted by CodeRabbit).
LCORE-647
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Documentation