PR 1/3 — Add GEval (DeepEval) core integration #96
Walkthrough

Adds GEval support by introducing GEvalHandler, extends DeepEvalMetrics to accept an optional registry_path and route evaluate() calls between built-in DeepEval metrics and GEval metrics, and reformats the module exports list without changing exported symbols.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant DeepEvalMetrics
    participant GEvalHandler
    participant GEval
    Caller->>DeepEvalMetrics: evaluate(metric_name, conv_data, turn_idx, turn_data)
    alt standard DeepEval metric
        DeepEvalMetrics->>DeepEvalMetrics: run built-in metric handler
        DeepEvalMetrics-->>Caller: score, reason
    else GEval metric
        DeepEvalMetrics->>GEvalHandler: evaluate(metric_name, conv_data, turn_idx, turn_data, is_conversation)
        GEvalHandler->>GEvalHandler: _get_geval_config(runtime metadata → registry)
        alt config found
            GEvalHandler->>GEvalHandler: _convert_evaluation_params
            alt conversation-level
                GEvalHandler->>GEval: evaluate(conversation test case)
            else turn-level
                GEvalHandler->>GEval: evaluate(turn test case)
            end
            GEval-->>GEvalHandler: score, reason
            GEvalHandler-->>DeepEvalMetrics: score, reason
        else config missing
            GEvalHandler-->>DeepEvalMetrics: None, error message
        end
        DeepEvalMetrics-->>Caller: score, reason
    end
```
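To make the routing above concrete, here is a minimal Python sketch of the dispatch described in the walkthrough and diagram. The class name, handler-table attribute, and construction details are illustrative assumptions, not the PR's actual implementation; only evaluate()'s parameters mirror the diagram.

```python
# Minimal sketch of the dispatch described above. Names other than evaluate()
# and GEvalHandler (e.g. _builtin_handlers) are illustrative assumptions.
from typing import Any, Callable, Optional


class MetricsRouterSketch:
    """Routes metric names to built-in DeepEval handlers or to a GEvalHandler."""

    def __init__(
        self,
        geval_handler: Any,  # assumed GEvalHandler instance
        builtin_handlers: dict[str, Callable[..., tuple[Optional[float], str]]],
    ) -> None:
        self._geval_handler = geval_handler
        self._builtin_handlers = builtin_handlers  # metric_name -> handler

    def evaluate(
        self,
        metric_name: str,
        conv_data: dict[str, Any],
        turn_idx: Optional[int],
        turn_data: Optional[dict[str, Any]],
    ) -> tuple[Optional[float], str]:
        handler = self._builtin_handlers.get(metric_name)
        if handler is not None:
            # Standard DeepEval metric: run the built-in handler directly.
            return handler(conv_data, turn_idx, turn_data)
        # Anything else is treated as a GEval metric and delegated.
        return self._geval_handler.evaluate(
            metric_name,
            conv_data,
            turn_idx,
            turn_data,
            is_conversation=turn_idx is None,
        )
```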
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (3 hunks)
- src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (5)
📓 Common learnings
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
- src/lightspeed_evaluation/core/metrics/__init__.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (2)
src/lightspeed_evaluation/core/metrics/deepeval.py (3)
- src/lightspeed_evaluation/core/metrics/geval.py (2): GEvalHandler (20-501), evaluate (122-195)
- src/lightspeed_evaluation/core/llm/manager.py (4): LLMManager (10-117), get_config (105-107), get_model_name (91-93), get_llm_params (95-103)
- src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)

src/lightspeed_evaluation/core/metrics/geval.py (2)
- src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (92-129)
Actionable comments posted: 0
♻️ Duplicate comments (2)
src/lightspeed_evaluation/core/metrics/geval.py (2)
387-393: Critical: Same custom evaluation_params issue in conversation-level evaluation.

This method has the identical bug as _evaluate_turn. When converted_params is None (custom parameters), line 388's condition forces enum defaults, preventing custom GEval metrics from using their intended parameters.

Apply the same fix pattern:

```diff
         converted_params = self._convert_evaluation_params(evaluation_params)
-        if not converted_params:
-            # If no valid params, use sensible defaults for conversation evaluation
-            converted_params = [
-                LLMTestCaseParams.INPUT,
-                LLMTestCaseParams.ACTUAL_OUTPUT,
-            ]

         # Configure the GEval metric for conversation-level evaluation
         metric_kwargs: dict[str, Any] = {
             "name": "GEval Conversation Metric",
             "criteria": criteria,
-            "evaluation_params": converted_params,
             "model": self.deepeval_llm_manager.get_llm(),
             "threshold": threshold,
             "top_logprobs": 5,  # Vertex/Gemini throws an error if over 20.
         }
+
+        if converted_params is None:
+            if not evaluation_params:
+                metric_kwargs["evaluation_params"] = [
+                    LLMTestCaseParams.INPUT,
+                    LLMTestCaseParams.ACTUAL_OUTPUT,
+                ]
+        else:
+            metric_kwargs["evaluation_params"] = converted_params
```
287-293: Critical: Custom GEval evaluation_params are still being replaced with defaults.

Despite the past review comment being marked as "Addressed in commit 4e37984", the code still has the same issue. When _convert_evaluation_params() returns None (indicating custom string parameters that GEval should auto-detect), line 288's condition if not converted_params: evaluates to True and forces the enum defaults, wiping out the caller's intent.

This means registry-defined or runtime-provided custom parameters can never be used; they always get replaced with [INPUT, ACTUAL_OUTPUT].

Apply the fix from the previous review (or similar logic):

```diff
         converted_params = self._convert_evaluation_params(evaluation_params)
-        if not converted_params:
-            # If no valid params, use sensible defaults for turn evaluation
-            converted_params = [
-                LLMTestCaseParams.INPUT,
-                LLMTestCaseParams.ACTUAL_OUTPUT,
-            ]

         # Create GEval metric with runtime configuration
         metric_kwargs: dict[str, Any] = {
             "name": "GEval Turn Metric",
             "criteria": criteria,
-            "evaluation_params": converted_params,
             "model": self.deepeval_llm_manager.get_llm(),
             "threshold": threshold,
             "top_logprobs": 5,
         }
+
+        # Only set evaluation_params if we have valid enum conversions
+        # or if no params were provided at all (then use defaults)
+        if converted_params is None:
+            if not evaluation_params:
+                metric_kwargs["evaluation_params"] = [
+                    LLMTestCaseParams.INPUT,
+                    LLMTestCaseParams.ACTUAL_OUTPUT,
+                ]
+            # else: leave unset so GEval can auto-detect from custom strings
+        else:
+            metric_kwargs["evaluation_params"] = converted_params
```
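For context on why this matters, a hypothetical metric configuration of the kind the fix is meant to preserve might look like the following; the key names and criteria text are illustrative assumptions, not the PR's actual schema.

```python
# Hypothetical metric configuration with custom string evaluation_params.
# With the fix, these strings reach GEval for auto-detection instead of being
# silently replaced by [INPUT, ACTUAL_OUTPUT]. Key names are assumptions.
geval_metric_config = {
    "criteria": "Judge whether the answer is grounded in the retrieved context.",
    "threshold": 0.7,
    "evaluation_params": [
        "retrieved knowledge base passages",  # custom string, no enum equivalent
        "assistant answer with citations",    # custom string, auto-detected by GEval
    ],
}
```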
🧹 Nitpick comments (2)
src/lightspeed_evaluation/core/metrics/geval.py (2)
76-101: Consider simplifying the path resolution logic.

The current logic initializes path and possible_paths defensively (lines 76-78) and then has overlapping conditionals that make the flow harder to follow. When an explicit registry_path is provided but doesn't exist, the code still iterates through possible_paths containing only that single invalid path.

Consider restructuring to separate explicit path handling from auto-discovery:

```diff
-        # Ensure variables are always bound for static analysis
-
-        path: Optional[Path] = None
-        possible_paths: list[Path] = []
-
-        # Normalize user-specified path vs. auto-discovery
         if registry_path is not None:
             try:
                 path = Path(registry_path)
+                if path.exists():
+                    # Use explicit path directly
+                    self._load_from_path(path)
+                    return
+                else:
+                    logger.warning(
+                        "Explicit registry path does not exist: %s. Falling back to auto-discovery.",
+                        path
+                    )
             except TypeError:
-                # Bad type passed in; treat as no path provided
-                path = None
-        if path is not None:
-            possible_paths = [path]
-        else:
-            package_root = Path(__file__).resolve().parents[3]
-            possible_paths = [
-                Path.cwd() / "config" / "registry" / "geval_metrics.yaml",
-                package_root / "config" / "registry" / "geval_metrics.yaml",
-            ]
-
-        # If no explicit file exists yet, search candidates
-        if path is None or not path.exists():
-            for candidate in possible_paths:
-                if candidate.exists():
-                    path = candidate
-                    break
+                logger.warning("Invalid registry_path type: %s", type(registry_path))
+
+        # Auto-discovery
+        package_root = Path(__file__).resolve().parents[3]
+        possible_paths = [
+            Path.cwd() / "config" / "registry" / "geval_metrics.yaml",
+            package_root / "config" / "registry" / "geval_metrics.yaml",
+        ]
+
+        path = None
+        for candidate in possible_paths:
+            if candidate.exists():
+                path = candidate
+                break
```

Then extract the loading logic into a helper method to avoid duplication.
450-524: LGTM for configuration retrieval logic.

The priority order (turn metadata → conversation metadata → registry) is well-designed and properly documented. The method correctly handles all three configuration sources with appropriate logging.

The multiple pylint: disable comments (lines 505-508, 510, 514) suggest the type checker struggles with the class-level _registry: dict[str, Any] | None. Consider extracting registry access into a helper property to centralize the None-check:

```python
@property
def _loaded_registry(self) -> dict[str, Any]:
    """Get loaded registry or empty dict."""
    return GEvalHandler._registry or {}
```

Then use self._loaded_registry throughout to eliminate the disables.
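As a rough illustration of the priority order this comment describes, a sketch of the lookup might look like the following; the function and parameter names are assumptions, not the actual _get_geval_config signature.

```python
# Sketch of a priority-ordered lookup: turn metadata, then conversation
# metadata, then the shared registry. Names here are illustrative assumptions.
from typing import Any, Optional


def lookup_geval_config(
    metric_name: str,
    turn_metadata: Optional[dict[str, Any]],
    conversation_metadata: Optional[dict[str, Any]],
    registry: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    for source in (turn_metadata, conversation_metadata, registry):
        if source and metric_name in source:
            return source[metric_name]
    return None  # caller reports the "config missing" error path
```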
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/lightspeed_evaluation/core/metrics/geval.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
src/lightspeed_evaluation/**
📄 CodeRabbit inference engine (AGENTS.md)
Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Files:
src/lightspeed_evaluation/core/metrics/geval.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Require type hints for all public functions and methods
Use Google-style docstrings for all public APIs
Use custom exceptions from core.system.exceptions for error handling
Use structured logging with appropriate levels
Files:
src/lightspeed_evaluation/core/metrics/geval.py
src/lightspeed_evaluation/core/metrics/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Register new metrics in MetricManager’s supported_metrics dictionary
Files:
src/lightspeed_evaluation/core/metrics/geval.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/** : Add all new evaluation features under src/lightspeed_evaluation/ (do not add new features elsewhere)
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to src/lightspeed_evaluation/core/metrics/**/*.py : Register new metrics in MetricManager’s supported_metrics dictionary
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-10-16T11:17:19.324Z
Learnt from: CR
Repo: lightspeed-core/lightspeed-evaluation PR: 0
File: AGENTS.md:0-0
Timestamp: 2025-10-16T11:17:19.324Z
Learning: Applies to config/evaluation_data.yaml : Keep evaluation data in config/evaluation_data.yaml (conversation groups, turns, overrides, scripts)
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
📚 Learning: 2025-09-08T11:11:54.516Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: config/system.yaml:78-82
Timestamp: 2025-09-08T11:11:54.516Z
Learning: For the custom:tool_eval metric, when threshold is not specified (None), the system defaults to checking if score > 0, providing less strict evaluation logic compared to exact matching. This allows for more flexible tool call evaluation where partial correctness is acceptable.
Applied to files:
src/lightspeed_evaluation/core/metrics/geval.py
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/metrics/geval.py (2)
- src/lightspeed_evaluation/core/llm/deepeval.py (1): DeepEvalLLMManager (8-42)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1): evaluate (92-129)
🔇 Additional comments (6)
src/lightspeed_evaluation/core/metrics/geval.py (6)
1-18: LGTM! Module setup follows best practices with clear documentation, appropriate imports, and proper logger initialization.

20-50: LGTM! Class structure is well-designed with appropriate use of class-level caching for the registry and clear documentation. Type hints and docstrings follow project guidelines.
133-204: LGTM! The main evaluation entry point is well-structured with clear delegation logic, comprehensive documentation, and appropriate error handling. The unused _turn_idx parameter is appropriately marked and documented as being kept for interface compatibility.
206-250: LGTM! The parameter conversion logic correctly handles both standard enum-based parameters and custom string parameters. Returning None for custom params allows GEval to auto-detect required fields, which is the intended behavior.
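One plausible shape of such a conversion, assuming the handler maps standard names onto deepeval's LLMTestCaseParams enum and returns None as soon as it encounters a custom string (a sketch, not the PR's exact logic):

```python
from typing import Optional

from deepeval.test_case import LLMTestCaseParams


def convert_evaluation_params(
    params: Optional[list[str]],
) -> Optional[list[LLMTestCaseParams]]:
    """Map standard names to LLMTestCaseParams; None signals custom strings."""
    if not params:
        return None
    converted: list[LLMTestCaseParams] = []
    for name in params:
        try:
            # Standard names such as "input" or "actual_output" map onto the enum.
            converted.append(LLMTestCaseParams[name.upper()])
        except KeyError:
            # Custom string parameter: let GEval auto-detect the required fields.
            return None
    return converted
```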
312-356: LGTM for test case construction and evaluation logic. The test case building, context normalization, and error handling are well-implemented. The code properly handles optional fields and provides detailed logging for debugging. The broad exception catch is acceptable here given the external library call and comprehensive error logging.
395-448: LGTM for aggregation and evaluation logic. The conversation turn aggregation strategy is sound, creating a multi-turn narrative for GEval to evaluate. Error handling follows the same robust pattern as turn-level evaluation.
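A minimal sketch of that aggregation idea, assuming each turn is a dict with "query" and "response" keys (the key names are assumptions, not the framework's data model):

```python
def build_conversation_narrative(turns: list[dict[str, str]]) -> str:
    """Fold a conversation's turns into one narrative string for a GEval test case."""
    lines: list[str] = []
    for idx, turn in enumerate(turns, start=1):
        lines.append(f"Turn {idx} - User: {turn.get('query', '')}")
        lines.append(f"Turn {idx} - Assistant: {turn.get('response', '')}")
    return "\n".join(lines)
```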
Consolidating into one comprehensive PR: #97
Introductory Note — Part 1 of 3 in the GEval Integration Series
This is the first pull request in a three-part series introducing GEval (DeepEval) LLM-as-a-judge support into the LSC Evaluation Framework.
The overall goal of the series is to extend the framework with DeepEval-based GEval metrics, configurable system settings and comprehensive unit tests.
PR Series Overview:
- PR 1 — Core Integration (this PR): GEvalHandler; DeepEvalMetrics integration
- PR 2 — Configuration Integration: adds config and system settings for GEval; custom GEval metric registry
- PR 3 — Unit Tests for GEvalHandler

Each PR builds on the previous to enable incremental review and minimize code churn.
Summary
Adds first-class GEval support via the DeepEval framework to the LSC Evaluation Framework. Introduces a GEval handler, core metric wiring, and pipeline extension points so downstream runs can evaluate with LLM-as-a-judge (GEval) alongside existing metrics.
Why? 🤔
Out of Scope 🚫
Key Changes
- src/lightspeed_evaluation/core/metrics/deepeval.py: DeepEvalMetrics class
- src/lightspeed_evaluation/core/metrics/geval.py: GEvalHandler class