# archive old eval and make lsc eval as primary #35
## Walkthrough

Project reorganized and rebranded from `lsc_eval` / `lightspeed-core-evaluation` into a new `lightspeed_evaluation` package: packaging, CLI entry points, public exports, import paths, logging namespaces, docs, config formats, and many module initializers were added, removed, or relocated; the class `EvaluationEngine` was renamed to `EvaluationDriver`.
## Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant CLI as lightspeed-eval (CLI)
    participant Loader as ConfigLoader
    participant Validator as DataValidator
    participant Driver as EvaluationDriver
    participant LLM as LLMManager
    participant Output as OutputHandler
    User->>CLI: provide system.yaml & evaluation_data.yaml
    CLI->>Loader: load configs
    CLI->>Validator: validate configs & metrics
    Validator-->>CLI: ok / errors
    CLI->>Driver: run_evaluation(configs)
    Driver->>LLM: initialize providers/adapters
    Driver->>LLM: evaluate turns & conversations
    LLM-->>Driver: metrics/results
    Driver->>Output: write CSV/JSON/TXT/PNG
    Output-->>User: reports & stats
```
## Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes
Branch force-pushed from b8a7fb4 to aa729db.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
src/lightspeed_evaluation/core/metrics/custom.py (1)
`123-129`: **Avoid mis-normalizing already-normalized fractional scores (e.g., “12/10” → 1.2).**

Current logic divides by 10/100 even when a fraction was parsed, producing under-scaled scores. Clamp fraction/"out of" cases to 1.0; otherwise scale plain 0–10 / 0–100 numbers.
Apply this diff:

```diff
-        # Normalize score to 0-1 range if needed
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        # Normalize score to 0-1 range if needed
+        if score is not None and score > 1.0:
+            # If text contains explicit scales, avoid re-normalizing fractions like "12/10"
+            if re.search(r'(\d+\.?\d*)/(\d+\.?\d*)|\bout of\b', response, re.IGNORECASE):
+                score = min(score, 1.0)
+            elif score <= 10.0:  # Assume 0-10 scale
+                score = score / 10.0
+            elif score <= 100.0:  # Assume 0-100 scale
+                score = score / 100.0
+            else:
+                score = 1.0
```

**src/lightspeed_evaluation/core/metrics/deepeval.py (1)**
`86-94`: **Guard against empty conversations in completeness metric.**

DeepEval may error on empty turn lists. Add an early check like relevancy does.

```diff
     def _evaluate_conversation_completeness(
         self,
         conv_data: Any,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
         """Evaluate conversation completeness."""
         if not is_conversation:
             return None, "Conversation completeness is a conversation-level metric"
+        if not getattr(conv_data, "turns", None):
+            return None, "No conversation turns available for completeness evaluation"
         test_case = self._build_conversational_test_case(conv_data)
         metric = ConversationCompletenessMetric(model=self.llm_manager.get_llm())
```

**src/lightspeed_evaluation/core/output/visualization.py (1)**
`201-238`: **Avoid injecting zeros into boxplots; they distort distributions.**

Build a list of arrays and let NaNs represent missing, or pass ragged arrays directly.

```diff
-        # Convert to DataFrame with equal-length arrays (pad with NaN)
-        max_len = max(len(scores) for scores in metric_groups.values())
-        score_data = {}
-        for metric_id, scores in metric_groups.items():
-            padded_scores = scores + [np.nan] * (max_len - len(scores))
-            score_data[metric_id] = padded_scores
-
-        results_df = pd.DataFrame(score_data)
-
-        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
-        ax.set_xlabel("Score", fontsize=12, fontweight="bold")
-        ax.set_xlim(0, 1)
+        # Prepare ragged arrays per metric (no zero padding)
+        metrics = list(metric_groups.keys())
+        values = [metric_groups[m] for m in metrics]
+
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
+        ax.set_xlabel("Score", fontsize=12, fontweight="bold")
+        ax.set_xlim(0, 1)
@@
-        bplot = ax.boxplot(
-            results_df.fillna(0),
+        bplot = ax.boxplot(
+            values,
             sym=".",
             widths=0.5,
             vert=False,
             patch_artist=True,
         )
-
-        labels = results_df.columns
+        labels = metrics
@@
-        ax.set_yticklabels(labels)
+        ax.set_yticklabels(labels)
```

**src/lightspeed_evaluation/core/metrics/ragas.py (4)**
`79-82`: **Guard against missing/renamed Ragas result columns**

Accessing `df[result_key]` will raise `KeyError` if Ragas changes column names (e.g., “nv_context_relevance” vs “context_relevance”). Convert this into a `ValueError` with available columns for better UX.

```diff
-        result = evaluate(dataset, metrics=[metric_instance])
-        df = result.to_pandas()
-        score = df[result_key].iloc[0]
-        return score, f"Ragas {metric_name}: {score:.2f}"
+        result = evaluate(dataset, metrics=[metric_instance])
+        df = result.to_pandas()
+        try:
+            score = df[result_key].iloc[0]
+        except KeyError:
+            available = ", ".join(df.columns.astype(str))
+            raise ValueError(
+                f"Expected result column '{result_key}' not found. Available: {available}"
+            )
+        return score, f"Ragas {metric_name}: {score:.2f}"
```
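The defensive lookup suggested above can be sketched independently of pandas. In this hedged example, `extract_row_score` and the dict-based `row` are illustrative stand-ins, not the package's API; a real DataFrame row supports the same `in`/indexing pattern:

```python
# Hypothetical helper mirroring the suggested defensive column lookup.
# `row` is a plain mapping standing in for the first row of the Ragas
# result frame.
def extract_row_score(row, result_key):
    if result_key not in row:
        available = ", ".join(str(k) for k in row)
        raise ValueError(
            f"Expected result column '{result_key}' not found. Available: {available}"
        )
    return row[result_key]
```

The caller sees the available column names in the error instead of a bare `KeyError`, which is what makes renames like `nv_context_relevance` easy to diagnose.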
`122-131`: **Require TurnData for response relevancy**

If `turn_data` is None, we silently evaluate empty strings. Fail fast with a clear message.

```diff
     def _evaluate_response_relevancy(
@@
-        if is_conversation:
+        if is_conversation:
             return None, "Response relevancy is a turn-level metric"
-
-        query, response, _ = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for response relevancy"
+        query, response, _ = self._extract_turn_data(turn_data)
```
`141-152`: **Require TurnData for faithfulness**

Same issue as above; avoid evaluating empty fields.

```diff
-        if is_conversation:
+        if is_conversation:
             return None, "Faithfulness is a turn-level metric"
-
-        query, response, contexts = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for faithfulness"
+        query, response, contexts = self._extract_turn_data(turn_data)
```
`164-176`: **Add None-check for context precision without reference**

This metric also needs `turn_data`; currently it proceeds with empty values.

```diff
-        if is_conversation:
+        if is_conversation:
             return None, "Context precision without reference is a turn-level metric"
-
-        query, response, contexts = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for context precision without reference"
+        query, response, contexts = self._extract_turn_data(turn_data)
```

**src/lightspeed_evaluation/core/output/generator.py (1)**
`49-64`: **Honor format toggles from system config (csv/json/txt)**

`ConfigLoader` provides `csv_format`, `json_format`, and `txt_format`, but they’re ignored here. Respect them to avoid generating unwanted artifacts.

```diff
-        # Generate CSV report
-        csv_file = self._generate_csv_report(results, base_filename)
-        print(f"   ✅ CSV: {csv_file}")
+        # Generate CSV report (if enabled)
+        if getattr(self.system_config, "csv_format", True):
+            csv_file = self._generate_csv_report(results, base_filename)
+            print(f"   ✅ CSV: {csv_file}")

-        # Generate JSON summary (pass pre-calculated stats)
-        json_file = self._generate_json_summary(
-            results, base_filename, basic_stats, detailed_stats
-        )
-        print(f"   ✅ JSON: {json_file}")
+        # Generate JSON summary (if enabled)
+        if getattr(self.system_config, "json_format", True):
+            json_file = self._generate_json_summary(
+                results, base_filename, basic_stats, detailed_stats
+            )
+            print(f"   ✅ JSON: {json_file}")

-        # Generate text summary (pass pre-calculated stats)
-        txt_file = self._generate_text_summary(
-            results, base_filename, basic_stats, detailed_stats
-        )
-        print(f"   ✅ TXT: {txt_file}")
+        # Generate text summary (if enabled)
+        if getattr(self.system_config, "txt_format", True):
+            txt_file = self._generate_text_summary(
+                results, base_filename, basic_stats, detailed_stats
+            )
+            print(f"   ✅ TXT: {txt_file}")
```
🧹 Nitpick comments (25)
.gitignore (2)
`183-183`: **Root-level `.deepeval/` ignore: confirm intent and preserve templates**

Moving the ignore to `/.deepeval/` makes sense if all telemetry/config should remain untracked. If you plan to keep example configs or docs under this dir, add negation rules to retain them.

Example (adjust filenames as needed):

```diff
 .deepeval/
+!.deepeval/README.md
+!.deepeval/*.example.*
+!.deepeval/*.sample.*
```
`186-186`: **Constrain `wip*/` to repo root to avoid broad ignores**

`wip*/` matches any directory named like `wip*` at any depth. If you only intend a root-level sandbox (per the comment), scope it to the root.

```diff
-wip*/
+/wip*/
```

If the intent is a single folder, consider just `/wip/`.

**lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)**
`97-109`: **Make content extraction robust to dict/object LiteLLM responses and completion/chat variants.**

Guard for dict-like structures and fall back to text for completion-style responses.

Apply:

```diff
-        choices = getattr(response, "choices", None)
+        # Support both object-like and dict-like LiteLLM responses
+        choices = getattr(response, "choices", None)
+        if choices is None and isinstance(response, dict):
+            choices = response.get("choices")
         if choices and len(choices) > 0:
-            message = getattr(
-                choices[0],  # pylint: disable=unsubscriptable-object
-                "message",
-                None,
-            )
-            if message:
-                content = getattr(message, "content", None)
+            first = choices[0]  # pylint: disable=unsubscriptable-object
+            # Chat-style
+            message = getattr(first, "message", None)
+            if message is None and isinstance(first, dict):
+                message = first.get("message")
+            if message:
+                content = getattr(message, "content", None)
+                if content is None and isinstance(message, dict):
+                    content = message.get("content")
+            # Completion-style fallback
+            if not content:
+                content = getattr(first, "text", None)
+                if content is None and isinstance(first, dict):
+                    content = first.get("text")
```

**archive/README.md (3)**
`88-88`: **Adjust heading level to maintain proper hierarchy.**

Line 88 has a heading jump from h1 to h4. It should be h2 to maintain proper document structure.

Apply this diff to fix the heading level:

```diff
-#### Arguments
+## Arguments
```

`35-37`: **Add language specification to fenced code blocks.**

Lines 35-37 contain a fenced code block without a language specification.

Apply this diff to add the language specification:

````diff
-```
+```bash
 pdm run evaluate
````

`83-85`: **Add language specification to fenced code blocks.**

Lines 83-85 contain a fenced code block without a language specification.

Apply this diff to add the language specification:

````diff
-```
+```bash
 python -m lightspeed_core_evaluation.evaluation.query_rag
````

**src/lightspeed_evaluation/core/metrics/custom.py (1)**

`229-236`: **Consider a deterministic system prompt for rubric clarity.**

Passing a brief system_prompt with rubric reduces variance and parsing failures.

**src/lightspeed_evaluation/core/output/visualization.py (3)**

`79-82`: **Don’t return early when by_metric is empty; still generate other graphs.**

Status pie and heatmap can still be useful with no metric stats.

```diff
-        if not summary_stats["by_metric"]:
-            self.logger.warning("No metric data available for graph generation")
-            return {}
+        if not summary_stats["by_metric"]:
+            self.logger.warning("No metric data available for pass-rate graph")
+            # continue; still generate other graphs
```
`143-176`: **Fix axis labels and honor configured figsize/dpi for pass rates graph.**

X/Y labels are swapped; also use `self.figsize`/`self.dpi` for consistency.

```diff
-        _, ax = plt.subplots(figsize=(12, 8))
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
@@
-        filename = self.graphs_dir / f"{base_filename}_pass_rates.png"
-        plt.savefig(filename, dpi=300, bbox_inches="tight")
+        filename = self.graphs_dir / f"{base_filename}_pass_rates.png"
+        plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")
```
`420-421`: **Heatmap axis label: x-axis is Metrics, not “Pass Rate (%)”.**

Colorbar already communicates the pass-rate unit.

```diff
-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
         ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
```

**src/lightspeed_evaluation/core/metrics/ragas.py (2)**
`255-257`: **Potential incorrect column key: “nv_context_relevance”**

This column name looks non-standard and may not exist in current Ragas outputs. Consider “context_relevance” (or “context_relevancy”) instead, or rely on the metric’s actual output column after evaluation.

If correct, keep; if not, change to the appropriate key or use the defensive handling added above to surface the available columns.
`22-49`: **Micro-optimization: avoid pandas for single-value extraction**

For single-row, single-metric results, converting to pandas is heavy. If Ragas exposes a direct API to get the score, prefer that to reduce overhead. If not feasible, the current approach is acceptable.
src/lightspeed_evaluation/core/output/generator.py (2)
`28-28`: **Ensure nested output directories are created**

Use `parents=True` to avoid failures when intermediate directories are missing.

```diff
-        self.output_dir.mkdir(exist_ok=True)
+        self.output_dir.mkdir(parents=True, exist_ok=True)
```
`174-175`: **Branding consistency: use “LightSpeed Evaluation Framework”**

Align the text report header with the new naming used elsewhere.

```diff
-            f.write("LSC Evaluation Framework - Summary Report\n")
+            f.write("LightSpeed Evaluation Framework - Summary Report\n")
```
`38-40`: **Double validation of evaluation data**

Data is validated in `DataValidator.load_evaluation_data` and again inside `EvaluationDriver.run_evaluation()`. Consider skipping the second validation when upstream has already validated, to save time on large datasets.

I can draft a small change to pass a flag into `EvaluationDriver.run_evaluation` to skip validation when appropriate.

Also applies to: 46-53
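A minimal sketch of such a flag, assuming the shape discussed above. `EvaluationDriverSketch` and its methods are hypothetical stand-ins, not the package's actual API:

```python
# Hypothetical sketch: allow callers that already ran DataValidator to
# skip the driver's second validation pass via a keyword-only flag.
class EvaluationDriverSketch:
    def __init__(self):
        self.validation_runs = 0  # counts how often validation executed

    def _validate(self, data):
        self.validation_runs += 1
        for item in data:
            if "eval_id" not in item:
                raise ValueError("missing eval_id")

    def run_evaluation(self, data, *, skip_validation=False):
        # Upstream DataValidator may have already checked the data
        if not skip_validation:
            self._validate(data)
        return [{"eval_id": d["eval_id"], "score": None} for d in data]
```

Defaulting `skip_validation` to `False` keeps the current behavior for existing callers while letting the runner opt out after its own validation pass.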
src/lightspeed_evaluation/core/llm/__init__.py (1)
`3-13`: **Avoid ambiguous LLMConfig export**

There are two `LLMConfig` classes in the codebase (`core/llm/manager.py` and `core/config/models.py`). To prevent confusion, alias or drop the re-export.

Option (non-breaking alias):

```diff
-from .manager import LLMManager, LLMConfig, LLMError
+from .manager import LLMManager, LLMConfig as LLMRuntimeConfig, LLMError

 __all__ = [
-    "LLMConfig",
+    "LLMRuntimeConfig",
     "LLMManager",
     "LLMError",
     "DeepEvalLLMManager",
     "RagasLLMManager",
 ]
```
`11-12`: **Import from the aggregated config package to avoid submodule coupling.**

You already re-export `EvaluationResult` and `TurnData` in `core.config`. Importing from `core.config.models` here creates a leaky dependency on the submodule path.

```diff
-from .config.models import EvaluationResult, TurnData
+from .config import EvaluationResult, TurnData
```

**src/lightspeed_evaluation/__init__.py (2)**
`4-14`: **Docstring mentions “Runner” but it isn’t exported.**

Either export a runner or update the docstring to avoid confusion.

```diff
-    - Runner: Simple runner for command-line usage
-    - Core modules organized by functionality (config, llm, metrics, output)
+    - Drivers: programmatic API via EvaluationDriver
+    - Core modules organized by functionality (config, llm, metrics, output)
```
`31-35`: **Heavy deps are declared in pyproject.toml**

Verified that `pyproject.toml` lists `ragas>=0.3.0`, `deepeval>=1.3.0`, `matplotlib>=3.5.0`, and `seaborn>=0.11.0`, so the top-level imports won’t fail due to missing packages. To reduce import-time overhead, you may defer loading of metrics and graph generators via lazy imports (e.g., PEP 562 `__getattr__`).
**README.md (3)**

`20-26`: **Clarify uv prerequisite in Quick Start.**

Readers may not have uv installed. Add a brief note or command to install uv before `uv sync`.

````diff
 ### Installation
 
 ```bash
 # From Git
 pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git
 
-# Local Development
-uv sync
+# Local Development
+# Requires uv (https://docs.astral.sh/uv/)
+pipx install uv  # or: pip install uv
+uv sync
````

`61-81`: **Align turn_metrics with provided metadata in the example YAML.**

`turn_metrics_metadata` includes `ragas:response_relevancy`, but `turn_metrics` omits it. Either add the metric to `turn_metrics` or drop its metadata to avoid confusion.

```diff
 turn_metrics:
   - "ragas:faithfulness"
   - "custom:answer_correctness"
+  - "ragas:response_relevancy"
```

Also applies to: 83-112
`31-36`: **Environment variables across providers.**

Since LiteLLM enables multiple providers, consider adding a one-liner noting additional env vars (e.g., `ANTHROPIC_API_KEY`, Azure creds) when those providers are selected.

**pyproject.toml (3)**
`22-22`: **Avoid pinning heavyweight Torch by default; make it optional.**

Torch is large and often unnecessary for evaluation-only flows. Also, a strict `==2.7.0` pin risks resolution failures across platforms. Move Torch to an optional extra and loosen the version with a sane upper bound.

```diff
 dependencies = [
     # Core evaluation framework dependencies
     "ragas>=0.3.0",
     "deepeval>=1.3.0",
     "litellm>=1.0.0",
     "pydantic>=2.0.0",
     "pyyaml>=6.0",
     "pandas>=2.1.4",
     "datasets>=2.0.0",
     "matplotlib>=3.5.0",
     "seaborn>=0.11.0",
     "numpy>=1.23.0",
-    "torch==2.7.0",
     # Agent evaluation dependencies (for future integration)
     "httpx>=0.27.2",
     "tqdm>=4.67.1",
     # Generate answers dependencies
     "click>=8.0.0",
     "diskcache>=5.6.3",
     "tenacity>=9.1.2",
 ]
+
+[project.optional-dependencies]
+# Install with: pip install ".[agent]"
+agent = [
+    "torch>=2.2,<3",
+]
```
`12-16`: **Pin upper bounds for volatile libs to reduce breakage.**

Ragas, DeepEval, and LiteLLM ship frequent breaking changes. Add an upper bound to improve reproducibility.

```diff
-    "ragas>=0.3.0",
-    "deepeval>=1.3.0",
-    "litellm>=1.0.0",
+    "ragas>=0.3,<0.4",
+    "deepeval>=1.3,<2.0",
+    "litellm>=1.0,<2.0",
```
`6-8`: **Python version window is reasonable.**

`>=3.11,<3.13` is a safe baseline given ecosystem support; consider widening to `<3.14` once CI confirms green on 3.13.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (9)
- `archive/assets/response_eval_flow.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-answer_relevancy.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-answer_similarity_llm.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-cos_score.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-rougeL_f1.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-rougeL_precision.png` is excluded by `!**/*.png`
- `archive/example_result/model_evaluation_result-rougeL_recall.png` is excluded by `!**/*.png`
- `lsc_eval/uv.lock` is excluded by `!**/*.lock`
- `uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (34)
- `.gitignore` (1 hunks)
- `README.md` (1 hunks)
- `archive/README.md` (1 hunks)
- `archive/example_result/README.md` (1 hunks)
- `archive/pyproject.toml` (1 hunks)
- `config/evaluation_data.yaml` (1 hunks)
- `config/system.yaml` (1 hunks)
- `lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py` (1 hunks)
- `lsc_eval/README.md` (0 hunks)
- `lsc_eval/pyproject.toml` (0 hunks)
- `lsc_eval/src/lsc_eval/__init__.py` (0 hunks)
- `lsc_eval/src/lsc_eval/core/__init__.py` (0 hunks)
- `lsc_eval/src/lsc_eval/llm_managers/__init__.py` (0 hunks)
- `lsc_eval/src/lsc_eval/metrics/__init__.py` (0 hunks)
- `pyproject.toml` (3 hunks)
- `src/lightspeed_evaluation/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/config/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/config/loader.py` (2 hunks)
- `src/lightspeed_evaluation/core/config/models.py` (1 hunks)
- `src/lightspeed_evaluation/core/config/validator.py` (1 hunks)
- `src/lightspeed_evaluation/core/llm/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/metrics/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/metrics/custom.py` (1 hunks)
- `src/lightspeed_evaluation/core/metrics/deepeval.py` (1 hunks)
- `src/lightspeed_evaluation/core/metrics/ragas.py` (1 hunks)
- `src/lightspeed_evaluation/core/output/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/core/output/generator.py` (1 hunks)
- `src/lightspeed_evaluation/core/output/statistics.py` (1 hunks)
- `src/lightspeed_evaluation/core/output/visualization.py` (2 hunks)
- `src/lightspeed_evaluation/drivers/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/drivers/evaluation.py` (4 hunks)
- `src/lightspeed_evaluation/runner/__init__.py` (1 hunks)
- `src/lightspeed_evaluation/runner/evaluation.py` (4 hunks)
💤 Files with no reviewable changes (6)
- lsc_eval/README.md
- lsc_eval/pyproject.toml
- lsc_eval/src/lsc_eval/llm_managers/__init__.py
- lsc_eval/src/lsc_eval/metrics/__init__.py
- lsc_eval/src/lsc_eval/core/__init__.py
- lsc_eval/src/lsc_eval/__init__.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
- config/system.yaml
- archive/README.md
- src/lightspeed_evaluation/runner/evaluation.py
- pyproject.toml
- README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.
Applied to files:
- config/evaluation_data.yaml
🧬 Code graph analysis (16)
**src/lightspeed_evaluation/core/output/statistics.py (1)**
- src/lightspeed_evaluation/core/config/models.py (2)
  - `EvaluationResult` (133-169)
  - `TurnData` (8-44)

**src/lightspeed_evaluation/core/metrics/custom.py (3)**
- src/lightspeed_evaluation/core/config/models.py (1)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)
- src/lightspeed_evaluation/core/output/statistics.py (1)
  - `EvaluationScope` (11-16)

**src/lightspeed_evaluation/core/metrics/deepeval.py (4)**
- src/lightspeed_evaluation/core/config/models.py (1)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - `DeepEvalLLMManager` (8-43)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)
- src/lightspeed_evaluation/core/output/statistics.py (1)
  - `EvaluationScope` (11-16)

**src/lightspeed_evaluation/core/config/__init__.py (3)**
- src/lightspeed_evaluation/core/config/loader.py (3)
  - `ConfigLoader` (193-275)
  - `setup_environment_variables` (30-47)
  - `SystemConfig` (151-190)
- src/lightspeed_evaluation/core/config/models.py (3)
  - `EvaluationData` (47-130)
  - `EvaluationResult` (133-169)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/config/validator.py (1)
  - `DataValidator` (11-82)

**src/lightspeed_evaluation/runner/__init__.py (1)**
- src/lightspeed_evaluation/runner/evaluation.py (2)
  - `main` (95-128)
  - `run_evaluation` (15-92)

**src/lightspeed_evaluation/__init__.py (10)**
- src/lightspeed_evaluation/drivers/evaluation.py (1)
  - `EvaluationDriver` (118-322)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)
- src/lightspeed_evaluation/core/config/loader.py (2)
  - `ConfigLoader` (193-275)
  - `SystemConfig` (151-190)
- src/lightspeed_evaluation/core/config/models.py (3)
  - `EvaluationData` (47-130)
  - `TurnData` (8-44)
  - `EvaluationResult` (133-169)
- src/lightspeed_evaluation/core/config/validator.py (1)
  - `DataValidator` (11-82)
- src/lightspeed_evaluation/core/metrics/ragas.py (1)
  - `RagasMetrics` (23-263)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - `DeepEvalMetrics` (19-138)
- src/lightspeed_evaluation/core/metrics/custom.py (1)
  - `CustomMetrics` (29-251)
- src/lightspeed_evaluation/core/output/generator.py (1)
  - `OutputHandler` (15-244)
- src/lightspeed_evaluation/core/output/visualization.py (1)
  - `GraphGenerator` (17-439)

**src/lightspeed_evaluation/core/output/__init__.py (1)**
- src/lightspeed_evaluation/core/output/generator.py (1)
  - `OutputHandler` (15-244)

**src/lightspeed_evaluation/drivers/__init__.py (1)**
- src/lightspeed_evaluation/drivers/evaluation.py (1)
  - `EvaluationDriver` (118-322)

**src/lightspeed_evaluation/core/__init__.py (4)**
- src/lightspeed_evaluation/core/config/loader.py (3)
  - `ConfigLoader` (193-275)
  - `SystemConfig` (151-190)
  - `setup_environment_variables` (30-47)
- src/lightspeed_evaluation/core/config/models.py (3)
  - `EvaluationData` (47-130)
  - `EvaluationResult` (133-169)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/config/validator.py (1)
  - `DataValidator` (11-82)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)

**src/lightspeed_evaluation/core/metrics/__init__.py (3)**
- src/lightspeed_evaluation/core/metrics/custom.py (1)
  - `CustomMetrics` (29-251)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - `DeepEvalMetrics` (19-138)
- src/lightspeed_evaluation/core/metrics/ragas.py (1)
  - `RagasMetrics` (23-263)

**src/lightspeed_evaluation/core/metrics/ragas.py (4)**
- src/lightspeed_evaluation/core/config/models.py (1)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)
- src/lightspeed_evaluation/core/llm/ragas.py (1)
  - `RagasLLMManager` (84-112)
- src/lightspeed_evaluation/core/output/statistics.py (1)
  - `EvaluationScope` (11-16)

**src/lightspeed_evaluation/runner/evaluation.py (5)**
- src/lightspeed_evaluation/core/config/loader.py (2)
  - `ConfigLoader` (193-275)
  - `setup_environment_variables` (30-47)
- src/lightspeed_evaluation/core/config/validator.py (1)
  - `DataValidator` (11-82)
- src/lightspeed_evaluation/core/output/generator.py (1)
  - `OutputHandler` (15-244)
- src/lightspeed_evaluation/core/output/statistics.py (1)
  - `calculate_basic_stats` (19-45)
- src/lightspeed_evaluation/drivers/evaluation.py (2)
  - `EvaluationDriver` (118-322)
  - `run_evaluation` (146-172)

**src/lightspeed_evaluation/drivers/evaluation.py (8)**
- src/lightspeed_evaluation/core/config/loader.py (1)
  - `ConfigLoader` (193-275)
- src/lightspeed_evaluation/core/config/validator.py (1)
  - `DataValidator` (11-82)
- src/lightspeed_evaluation/core/config/models.py (3)
  - `EvaluationData` (47-130)
  - `EvaluationResult` (133-169)
  - `TurnData` (8-44)
- src/lightspeed_evaluation/core/llm/manager.py (1)
  - `LLMManager` (36-173)
- src/lightspeed_evaluation/core/metrics/custom.py (1)
  - `CustomMetrics` (29-251)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  - `DeepEvalMetrics` (19-138)
- src/lightspeed_evaluation/core/metrics/ragas.py (1)
  - `RagasMetrics` (23-263)
- src/lightspeed_evaluation/core/output/statistics.py (1)
  - `EvaluationScope` (11-16)

**src/lightspeed_evaluation/core/llm/__init__.py (4)**
- src/lightspeed_evaluation/core/llm/manager.py (3)
  - `LLMManager` (36-173)
  - `LLMConfig` (13-33)
  - `LLMError` (8-9)
- src/lightspeed_evaluation/core/config/models.py (1)
  - `LLMConfig` (172-195)
- src/lightspeed_evaluation/core/llm/deepeval.py (1)
  - `DeepEvalLLMManager` (8-43)
- src/lightspeed_evaluation/core/llm/ragas.py (1)
  - `RagasLLMManager` (84-112)

**src/lightspeed_evaluation/core/output/generator.py (2)**
- src/lightspeed_evaluation/core/config/models.py (1)
  - `EvaluationResult` (133-169)
- src/lightspeed_evaluation/core/output/statistics.py (2)
  - `calculate_basic_stats` (19-45)
  - `calculate_detailed_stats` (48-68)

**src/lightspeed_evaluation/core/output/visualization.py (2)**
- src/lightspeed_evaluation/core/config/models.py (1)
  - `EvaluationResult` (133-169)
- src/lightspeed_evaluation/core/output/statistics.py (2)
  - `calculate_basic_stats` (19-45)
  - `calculate_detailed_stats` (48-68)
🪛 LanguageTool
archive/README.md
[grammar] ~1-~1: There might be a mistake here.
Context: # Lightspeed Core Evaluation Evaluation tooling for lightspeed-core p...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...peed Core Evaluation Evaluation tooling for lightspeed-core project. [Refer latest ...
(QB_NEW_EN)
[grammar] ~6-~6: There might be a mistake here.
Context: ...t maintained anymore.** ## Installation - Requires Python 3.11 - Install pdm -...
(QB_NEW_EN)
[grammar] ~10-~10: There might be a mistake here.
Context: ... a clean venv for Python 3.11 and pdm. - Run pdm install - Optional: For develo...
(QB_NEW_EN)
[grammar] ~11-~11: There might be a mistake here.
Context: ...n venv for Python 3.11 and pdm. - Run pdm install - Optional: For development, run `make ins...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...ion of similarity distances are used to calculate final score. Cut-off scores are used to...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...eviations. This also stores a .csv file with query, pre-defined answer, API response...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: .... model: Ability to compare responses against single ground-truth answer. Here we can...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: ...del at a time. This creates a json file as summary report with scores (f1-score) f...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...modified or removed, please create a PR. - OLS API should be ready/live with all th...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ... the required provider+model configured. - It is possible that we want to run both ...
(QB_NEW_EN)
[style] ~28-~28: For conciseness, try rephrasing this sentence.
Context: ...e required provider+model configured. - It is possible that we want to run both consistency and model evalu...
(MAY_MIGHT_BE)
[grammar] ~28-~28: There might be a mistake here.
Context: ...n together. To avoid multiple API calls for same query, model evaluation first ch...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ... generated by consistency evaluation. If response is not present in csv file, th...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...s not present in csv file, then only we call API to get the response. ### e2e test ...
(QB_NEW_EN)
[grammar] ~32-~32: Ensure spelling is correct
Context: .... Currently consistency evaluation is parimarily used to gate PRs. Final e2e suite will ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~46-~46: There might be a mistake here.
Context: ...add new data accordingly. ### Arguments eval_type: This will control which eva...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ...nAs provided in json file 2. model -> Compares set of models based on their response a...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...t:8080`. If deployed in a cluster, then pass cluster API url. **eval_api_token_file...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...l_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is depl...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...API token. Required, if OLS is deployed in cluster. eval_scenario: This is pr...
(QB_NEW_EN)
[grammar] ~56-~56: Ensure spelling is correct
Context: ...enario**: This is primarily required to indetify which pre-defined answers need to be co...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~58-~58: There might be a mistake here.
Context: ...ith rag. eval_query_ids: Option to give set of query ids for evaluation. By def...
(QB_NEW_EN)
[grammar] ~60-~60: There might be a mistake here.
Context: ...ed. eval_provider_model_id: We can provide set of provider/model combinations as i...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...Applicable only for model evaluation. Provide file path to the parquet file having ad...
(QB_NEW_EN)
[grammar] ~71-~71: There might be a mistake here.
Context: .../rcsconfig.yaml) eval_modes: Apart from OLS api, we may want to evaluate vanill...
(QB_NEW_EN)
[grammar] ~71-~71: There might be a mistake here.
Context: ...s**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramate...
(QB_NEW_EN)
[grammar] ~71-~71: Ensure spelling is correct
Context: ...evaluate vanilla model or with just OLS paramaters/prompt/RAG so that we can have baseline...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~71-~71: There might be a mistake here.
Context: ...LS paramaters/prompt/RAG so that we can have baseline score. This is a list of modes...
(QB_NEW_EN)
[grammar] ~73-~73: There might be a mistake here.
Context: ...ls_rag, & ols (actual api). ### Outputs Evaluation scripts creates below files. ...
(QB_NEW_EN)
[grammar] ~74-~74: There might be a mistake here.
Context: ... Evaluation scripts creates below files. - CSV file with response for given provide...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ...ate a .csv file having retrieved chunks for given set of queries with similarity sc...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ...with similarity score. This is not part of actual evaluation. But useful to do a s...
(QB_NEW_EN)
[grammar] ~88-~88: There might be a mistake here.
Context: ...viation in the response) #### Arguments db-path: Path to the RAG index *produc...
(QB_NEW_EN)
archive/example_result/README.md
[grammar] ~14-~14: There might be a mistake here.
Context: ... llama-3-1-8b-instruct - QnA evaluation dataset: [QnAs from OCP doc](../eval_data/ocp_do...
(QB_NEW_EN)
README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ... GenAI applications. ## 🎯 Key Features - Multi-Framework Support: Seamlessly us...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ... integration planned) ## 🚀 Quick Start ### Installation ```bash # From Git pip ins...
(QB_NEW_EN)
[grammar] ~114-~114: There might be a mistake here.
Context: ...t...." ``` ## 📈 Output & Visualization ### Generated Reports - CSV: Detailed re...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...t & Visualization ### Generated Reports - CSV: Detailed results with status, sco...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...led results with status, scores, reasons - JSON: Summary statistics with score di...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...mary statistics with score distributions - TXT: Human-readable summary - PNG:...
(QB_NEW_EN)
[grammar] ~119-~119: There might be a mistake here.
Context: ...utions - TXT: Human-readable summary - PNG: 4 visualization types (pass rates...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...us breakdown) ### Key Metrics in Output - PASS/FAIL/ERROR: Status based on thres...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...FAIL/ERROR**: Status based on thresholds - Actual Reasons: DeepEval provides LLM-...
(QB_NEW_EN)
[grammar] ~127-~127: There might be a mistake here.
Context: .../max for every metric ## 🧪 Development ### Development Tools ```bash uv sync --grou...
(QB_NEW_EN)
[grammar] ~140-~140: There might be a mistake here.
Context: ...tests --cov=src ``` ## Agent Evaluation For a detailed walkthrough of the new ag...
(QB_NEW_EN)
[grammar] ~141-~141: There might be a mistake here.
Context: ...he new agent-evaluation framework, refer [lsc_agent_eval/README.md](lsc_agent_eva...
(QB_NEW_EN)
[grammar] ~147-~147: There might be a mistake here.
Context: ...nswers.md) ## 📄 License & Contributing This project is licensed under the Apach...
(QB_NEW_EN)
[grammar] ~151-~151: There might be a mistake here.
Context: ...e for details. Contributions welcome - see development setup above for code qualit...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
archive/README.md
13-13: Unordered list style
Expected: dash; Actual: plus
(MD004, ul-style)
13-13: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
24-24: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
35-35: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
83-83: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
88-88: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4
(MD001, heading-increment)
🔇 Additional comments (32)
src/lightspeed_evaluation/core/config/models.py (1)
1-1: LGTM on branding docstring update. No functional changes; consistent with package rename.
config/evaluation_data.yaml (1)
1-1: Header rename looks good. Matches the new LightSpeed branding.
config/system.yaml (1)
1-1: Header rename acknowledged. No semantic changes; config remains intact.
lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)
102-105: LGTM: formatting-only change. No behavioral impact; safe.
src/lightspeed_evaluation/drivers/evaluation.py (6)
2-2: LGTM! The comment accurately reflects the class renamed from EvaluationEngine to EvaluationDriver.
10-21: LGTM! The import restructuring to the new ..core module structure is correct and aligns with the broader repository refactor to consolidate the core functionality.
118-118: LGTM! The class rename from EvaluationEngine to EvaluationDriver is consistent with the documentation updates.
120-120: LGTM! The docstring accurately reflects the renamed class.
130-130: LGTM! The docstring is updated to reference "evaluation driver" instead of "evaluation engine".
140-140: LGTM! The initialization message correctly refers to "Evaluation Driver" now.
archive/README.md (1)
1-5: Fix typo in the documentation title. Line 1 has "Evaluation Evaluation" which appears to be a duplication error.
Apply this diff to fix the typo:
-# Lightspeed Core Evaluation
-Evaluation tooling for lightspeed-core project. [Refer latest README.md](../README.md).
+# Lightspeed Core Evaluation
+Evaluation tooling for lightspeed-core project. [Refer latest README.md](../README.md).
⛔ Skipped due to learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
archive/pyproject.toml (1)
1-76: LGTM! The archived pyproject.toml correctly preserves the legacy configuration with:
- Original package name lightspeed-core-evaluation
- Legacy entry points pointing to the lightspeed_core_evaluation module
- Complete dependency list including PyTorch CPU variant configuration
This aligns with the PR objective to archive the old evaluation logic.
src/lightspeed_evaluation/core/output/statistics.py (1)
7-7: LGTM: Import path updated for module reorganization. The import correctly references the new location of models in ..config.models, aligning with the package restructuring.
src/lightspeed_evaluation/core/config/validator.py (1)
7-7: LGTM: Import path updated for config reorganization. The import now correctly references .loader instead of .config_loader, aligning with the module restructuring.
src/lightspeed_evaluation/core/config/loader.py (2)
1-1: LGTM: Module docstring updated for consistency. The docstring correctly reflects the new "Evaluation Framework" branding, removing the "LSC" acronym.
100-100: LGTM: Logger name updated for consistent branding. The logger name has been updated from "lsc_eval" to "lightspeed_evaluation", ensuring a consistent logging namespace throughout the application.
src/lightspeed_evaluation/core/metrics/custom.py (2)
9-11: Import path updates look consistent with the package re-org. These align with the new locations of TurnData, LLMManager, and EvaluationScope.
73-81: Remove retry parameter verification. The call correctly uses num_retries, which matches the official LiteLLM completion API; no change required.
src/lightspeed_evaluation/drivers/__init__.py (1)
3-5: Re-export looks good and stabilizes the public API.
src/lightspeed_evaluation/core/output/__init__.py (1)
3-3: Updated import path for OutputHandler is correct. Matches the generator relocation.
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
13-16: Import path updates are consistent with the new module layout.
src/lightspeed_evaluation/core/output/visualization.py (2)
13-15: Import path changes look correct and align with stats/model moves.
32-32: Logger namespace rename is appropriate.
src/lightspeed_evaluation/runner/__init__.py (1)
3-5: Re-exports look good. Public surface is clear: main and run_evaluation exposed via __all__. No issues.
src/lightspeed_evaluation/core/metrics/ragas.py (1)
16-19: Import path updates are correct. Imports align with the new module structure. No objections.
src/lightspeed_evaluation/core/metrics/__init__.py (1)
3-7: Package exports are clean. Centralized re-exports and __all__ look good for the new public surface.
src/lightspeed_evaluation/runner/evaluation.py (1)
71-87: Return summary already computed; looks good. Using calculate_basic_stats for the CLI exit code and printing is consistent and simple.
src/lightspeed_evaluation/core/config/__init__.py (1)
3-15: Re-exports look consistent and minimal. Clean aggregation; symbols map correctly to loader/models/validator. No issues.
README.md (2)
41-49: Verify metric identifiers match implementation. Confirm that the listed Ragas metric keys are the exact strings your resolver expects (e.g., context_precision_without_reference vs any library-expected canonical names). Otherwise, users will hit "unknown metric" errors at runtime.
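A check like this can be done upfront; the following is a minimal sketch, where the supported-metric set and the "framework:name" identifier convention are assumptions based on the README, not the project's actual resolver.

```python
# Hypothetical set of resolver-accepted Ragas metric names (assumed, not
# taken from the real implementation).
SUPPORTED_RAGAS_METRICS = {
    "faithfulness",
    "response_relevancy",
    "context_precision_without_reference",
}


def unknown_ragas_metrics(metric_ids: list[str]) -> list[str]:
    """Return ragas-prefixed identifiers that the resolver would reject."""
    rejected = []
    for metric_id in metric_ids:
        framework, _, name = metric_id.partition(":")
        if framework == "ragas" and name not in SUPPORTED_RAGAS_METRICS:
            rejected.append(metric_id)
    return rejected


print(unknown_ragas_metrics(["ragas:faithfulness", "ragas:precision"]))
# → ['ragas:precision']
```

Running such a check at config-load time would surface bad metric keys before any LLM calls are made.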
140-143: Check relative link path. Ensure lsc_agent_eval/README.md exists at the repo root after this reorg; otherwise, the link will 404.
pyproject.toml (2)
45-46: Ensure entry points exist post-reorg. Verify that lightspeed_evaluation.runner.evaluation:main and generate_answers.generate_answers:main exist and are importable in the final package layout.
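One way to verify such "module:attr" entry-point targets is a small import probe; this sketch uses a stdlib target for illustration, since the real targets only resolve once the package is installed.

```python
from importlib import import_module


def entry_point_resolves(spec: str) -> bool:
    """Check that a 'module:attr' entry-point spec resolves to a callable."""
    module_name, _, attr = spec.partition(":")
    try:
        module = import_module(module_name)
    except ImportError:
        return False
    return callable(getattr(module, attr, None))


# The real checks would be:
#   entry_point_resolves("lightspeed_evaluation.runner.evaluation:main")
#   entry_point_resolves("generate_answers.generate_answers:main")
print(entry_point_resolves("os.path:join"))  # → True
```

Wiring this into CI catches a broken console-script target before a release is cut.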
65-66: LGTM: package path aligns with new src layout. Wheel packaging target correctly points to src/lightspeed_evaluation.
aa729db to
eab0836
Compare
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
62-78: Harden exception handling to match Ragas’ robustness. LLM/network errors (timeouts, broken pipe) will bubble up and abort evaluation. Mirror Ragas’ handling to return (None, reason) instead of raising.
Apply:
     try:
         return self.supported_metrics[metric_name](
             conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
         )
-    except (ValueError, AttributeError, KeyError) as e:
+    except BrokenPipeError as e:
+        return None, f"DeepEval {metric_name} failed due to broken pipe: {str(e)}"
+    except OSError as e:
+        if getattr(e, "errno", None) == 32:
+            return None, f"DeepEval {metric_name} failed due to broken pipe (timeout): {str(e)}"
+        return None, f"DeepEval {metric_name} evaluation failed: {str(e)}"
+    except (RuntimeError, ValueError, AttributeError, KeyError, TypeError, ImportError) as e:
         return None, f"DeepEval {metric_name} evaluation failed: {str(e)}"
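The return-(None, reason) pattern the review suggests can be sketched generically; the registry and metric callables here are hypothetical, not the project's actual classes.

```python
from typing import Callable, Optional


def safe_evaluate(
    metric_name: str, registry: dict[str, Callable[[], float]]
) -> tuple[Optional[float], Optional[str]]:
    """Run one metric; on failure, return (None, reason) instead of raising,
    so a single bad metric cannot abort the whole evaluation loop."""
    try:
        return registry[metric_name](), None
    except (KeyError, ValueError, OSError) as exc:
        return None, f"DeepEval {metric_name} evaluation failed: {exc}"


registry = {"faithfulness": lambda: 0.9}
print(safe_evaluate("faithfulness", registry))  # → (0.9, None)
score, reason = safe_evaluate("missing_metric", registry)
print(score)  # → None
```

The caller can then record the reason in the results CSV and keep iterating over the remaining conversations.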
♻️ Duplicate comments (1)
src/lightspeed_evaluation/runner/evaluation.py (1)
97-99: Re: prior suggestion to make --eval-data required. You addressed the crash risk differently by providing a default and doing existence checks, which is fine. No change needed. If you still prefer enforcing explicit input, flip to required=True and drop the default.
Example:
-    parser.add_argument(
-        "--eval-data",
-        default="config/evaluation_data.yaml",
-        help="Path to evaluation data file (default: config/evaluation_data.yaml)",
-    )
+    parser.add_argument(
+        "--eval-data",
+        required=True,
+        help="Path to evaluation data file",
+    )
Also, because env vars are loaded from system config inside main(), confirm no modules read env at import time.
#!/bin/bash
# Grep for top-level env reads; review any matches outside defs/classes.
rg -nP '(?m)^\s*(os\.getenv\(|os\.environ\[)' -g 'src/**/*.py' -C2
🧹 Nitpick comments (14)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
90-96: Guard against empty conversations for completeness metric. Avoid calling DeepEval with zero turns.
 def _evaluate_conversation_completeness(
@@
-    test_case = self._build_conversational_test_case(conv_data)
+    if not getattr(conv_data, "turns", None):
+        return None, "No conversation turns available for completeness evaluation"
+    test_case = self._build_conversational_test_case(conv_data)
archive/README.md (5)
13-13: Fix nested list marker to satisfy markdownlint (MD004/MD007).
-    + if `pdm` is not installed this will install `pdm` by running `pip install pdm` in your current Python environment.
+  - If `pdm` is not installed this will install `pdm` by running `pip install pdm` in your current Python environment.
32-32: Correct spelling: “parimarily” → “primarily”.
-These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is parimarily used to gate PRs.
+These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is primarily used to gate PRs.
73-78: Tighten grammar and number agreement.
-### Outputs
-Evaluation scripts creates below files.
-- CSV file with response for given provider/model & modes.
-- response evaluation result with scores (for consistency check).
-- Final csv file with all results, json score summary & graph (for model evaluation)
+### Outputs
+Evaluation scripts create the following files:
+- CSV file with responses for the given provider/model & modes.
+- Response evaluation result with scores (for consistency check).
+- Final CSV file with all results, JSON score summary, and graph (for model evaluation).
83-86: Specify language for fenced code block (MD040).
-```
+```bash
 python -m lightspeed_core_evaluation.evaluation.query_rag

88-88: Fix heading increment (MD001).
-#### Arguments
+## Arguments
README.md (1)
90-101: Align example: include response_relevancy in turn_metrics or adjust metadata. The YAML shows metadata for "ragas:response_relevancy" but it isn’t listed under turn_metrics, which can confuse users. Add it to turn_metrics for consistency.
   # Turn-level metrics (empty list = skip turn evaluation)
   turn_metrics:
     - "ragas:faithfulness"
+    - "ragas:response_relevancy"
     - "custom:answer_correctness"

   # Turn-level metrics metadata (threshold + other properties)
   turn_metrics_metadata:
     "ragas:response_relevancy":
       threshold: 0.8
       weight: 1.0
     "custom:answer_correctness":
       threshold: 0.75
pyproject.toml (2)
8-8: Use standard license metadata. Prefer an SPDX identifier or include the license file for clarity in package metadata.
-license = {text = "Apache"}
+license = {text = "Apache-2.0"}
+# Alternatively:
+# license = {file = "LICENSE"}
22-22: Loosen torch version constraint and document install extras
Torch 2.7.0 is available on PyPI, but pinning to an exact patch release may force users to manually update for security/bug fixes and can conflict with platform-specific wheels. Change to a compatible range, for example:
- torch==2.7.0
+ torch>=2.7,<3.0
and update the README with instructions for installing the appropriate CPU/GPU variants.
src/lightspeed_evaluation/core/output/visualization.py (3)
144-176: Fix axes and apply configured figsize/dpi for consistency. Bars are vertical (metrics on x, pass rates on y), but the labels are swapped. Also, honor self.figsize/self.dpi.
-    _, ax = plt.subplots(figsize=(12, 8))
+    _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-    ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-    ax.set_xlabel("Pass Rate (%)", fontsize=12)
-    ax.set_ylabel("Metrics", fontsize=12)
+    ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+    ax.set_xlabel("Metrics", fontsize=12)
+    ax.set_ylabel("Pass Rate (%)", fontsize=12)
@@
-    plt.savefig(filename, dpi=300, bbox_inches="tight")
+    plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")
420-422: Correct heatmap axis label. The x-axis shows metrics, not “Pass Rate (%)”.
-    ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+    ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
228-241: Optional: avoid filling NaNs with zeros in boxplot. Filling with 0 biases distributions. Use per-metric arrays with NaNs dropped.
-    bplot = ax.boxplot(
-        results_df.fillna(0),
-        sym=".",
-        widths=0.5,
-        vert=False,
-        patch_artist=True,
-    )
-
-    labels = results_df.columns
+    labels = list(results_df.columns)
+    data = [results_df[col].dropna().values for col in labels]
+    bplot = ax.boxplot(
+        data,
+        sym=".",
+        widths=0.5,
+        vert=False,
+        patch_artist=True,
+        labels=labels,
+    )
src/lightspeed_evaluation/runner/evaluation.py (2)
29-29: Optional: route user-facing prints through logging. Since logging is configured from system.yaml, consider using a module logger (e.g., logging.getLogger(__name__)) instead of print for consistency and level control.
50-52: Avoid double validation of evaluation data (minor). Data is validated in DataValidator.load_evaluation_data here and again inside EvaluationDriver.run_evaluation. Consider de-duplicating to reduce overhead (e.g., let the driver handle validation exclusively or add a non-validating loader).
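The logger-instead-of-print suggestion can be sketched as follows; the logger name mirrors the module path and is an assumption about the final layout, and the in-memory stream stands in for whatever handlers system.yaml configures.

```python
import io
import logging

# Namespace assumed from the package layout; in the module itself this would
# simply be logging.getLogger(__name__).
logger = logging.getLogger("lightspeed_evaluation.runner.evaluation")

# Stand-in handler so the example is self-contained; real handlers/levels
# would come from the system.yaml logging configuration.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Evaluation complete")  # replaces a bare print(...)
print(stream.getvalue().strip())  # → Evaluation complete
```

With this in place, the configured log level controls whether user-facing status lines appear, instead of unconditional prints.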
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
⛔ Files ignored due to path filters (9)
- archive/assets/response_eval_flow.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-answer_relevancy.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-answer_similarity_llm.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-cos_score.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-rougeL_f1.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-rougeL_precision.png is excluded by !**/*.png
- archive/example_result/model_evaluation_result-rougeL_recall.png is excluded by !**/*.png
- lsc_eval/uv.lock is excluded by !**/*.lock
- uv.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
- .gitignore (1 hunks)
- README.md (1 hunks)
- archive/README.md (1 hunks)
- archive/example_result/README.md (1 hunks)
- archive/pyproject.toml (1 hunks)
- config/evaluation_data.yaml (1 hunks)
- config/system.yaml (1 hunks)
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1 hunks)
- lsc_eval/README.md (0 hunks)
- lsc_eval/pyproject.toml (0 hunks)
- lsc_eval/src/lsc_eval/__init__.py (0 hunks)
- lsc_eval/src/lsc_eval/core/__init__.py (0 hunks)
- lsc_eval/src/lsc_eval/llm_managers/__init__.py (0 hunks)
- lsc_eval/src/lsc_eval/metrics/__init__.py (0 hunks)
- pyproject.toml (3 hunks)
- src/lightspeed_evaluation/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/config/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/config/loader.py (2 hunks)
- src/lightspeed_evaluation/core/config/models.py (1 hunks)
- src/lightspeed_evaluation/core/config/validator.py (1 hunks)
- src/lightspeed_evaluation/core/llm/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
- src/lightspeed_evaluation/core/output/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
- src/lightspeed_evaluation/core/output/visualization.py (2 hunks)
- src/lightspeed_evaluation/drivers/__init__.py (1 hunks)
- src/lightspeed_evaluation/drivers/evaluation.py (4 hunks)
- src/lightspeed_evaluation/runner/__init__.py (1 hunks)
- src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
💤 Files with no reviewable changes (6)
- lsc_eval/pyproject.toml
- lsc_eval/src/lsc_eval/core/__init__.py
- lsc_eval/src/lsc_eval/metrics/__init__.py
- lsc_eval/README.md
- lsc_eval/src/lsc_eval/__init__.py
- lsc_eval/src/lsc_eval/llm_managers/__init__.py
✅ Files skipped from review due to trivial changes (2)
- config/system.yaml
- lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
🚧 Files skipped from review as they are similar to previous changes (15)
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/config/validator.py
- src/lightspeed_evaluation/core/metrics/ragas.py
- src/lightspeed_evaluation/drivers/__init__.py
- config/evaluation_data.yaml
- src/lightspeed_evaluation/core/config/loader.py
- src/lightspeed_evaluation/core/config/models.py
- archive/pyproject.toml
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/__init__.py
- src/lightspeed_evaluation/core/config/__init__.py
- src/lightspeed_evaluation/drivers/evaluation.py
- src/lightspeed_evaluation/__init__.py
- .gitignore
- src/lightspeed_evaluation/core/metrics/custom.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Applied to files:
- pyproject.toml
- archive/README.md
🧬 Code graph analysis (7)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/config/models.py (1)
TurnData (8-44)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/manager.py (1)
LLMManager (36-173)
src/lightspeed_evaluation/core/output/statistics.py (1)
EvaluationScope (11-16)
src/lightspeed_evaluation/core/metrics/__init__.py (3)
src/lightspeed_evaluation/core/metrics/custom.py (1)
CustomMetrics (29-251)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
DeepEvalMetrics (19-138)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
RagasMetrics (23-263)
src/lightspeed_evaluation/core/output/__init__.py (1)
src/lightspeed_evaluation/core/output/generator.py (1)
OutputHandler (15-244)
src/lightspeed_evaluation/core/output/visualization.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
EvaluationResult (133-169)
src/lightspeed_evaluation/core/output/statistics.py (2)
calculate_basic_stats (19-45)
calculate_detailed_stats (48-68)
src/lightspeed_evaluation/core/llm/__init__.py (4)
src/lightspeed_evaluation/core/llm/manager.py (3)
LLMManager (36-173)
LLMConfig (13-33)
LLMError (8-9)
src/lightspeed_evaluation/core/config/models.py (1)
LLMConfig (172-195)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/ragas.py (1)
RagasLLMManager (84-112)
src/lightspeed_evaluation/runner/evaluation.py (5)
src/lightspeed_evaluation/core/config/loader.py (2)
ConfigLoader (193-275)
setup_environment_variables (30-47)
src/lightspeed_evaluation/core/config/validator.py (1)
DataValidator (11-82)
src/lightspeed_evaluation/core/output/generator.py (1)
OutputHandler (15-244)
src/lightspeed_evaluation/core/output/statistics.py (1)
calculate_basic_stats (19-45)
src/lightspeed_evaluation/drivers/evaluation.py (2)
EvaluationDriver (118-322)
run_evaluation (146-172)
src/lightspeed_evaluation/runner/__init__.py (1)
src/lightspeed_evaluation/runner/evaluation.py (2)
main (95-129)
run_evaluation (15-92)
🪛 LanguageTool
README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ... GenAI applications. ## 🎯 Key Features - Multi-Framework Support: Seamlessly us...
(QB_NEW_EN)
[grammar] ~16-~16: There might be a mistake here.
Context: ... integration planned) ## 🚀 Quick Start ### Installation ```bash # From Git pip ins...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...t...." ``` ## 📈 Output & Visualization ### Generated Reports - CSV: Detailed re...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...t & Visualization ### Generated Reports - CSV: Detailed results with status, sco...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...led results with status, scores, reasons - JSON: Summary statistics with score di...
(QB_NEW_EN)
[grammar] ~119-~119: There might be a mistake here.
Context: ...mary statistics with score distributions - TXT: Human-readable summary - PNG:...
(QB_NEW_EN)
[grammar] ~120-~120: There might be a mistake here.
Context: ...utions - TXT: Human-readable summary - PNG: 4 visualization types (pass rates...
(QB_NEW_EN)
[grammar] ~123-~123: There might be a mistake here.
Context: ...us breakdown) ### Key Metrics in Output - PASS/FAIL/ERROR: Status based on thres...
(QB_NEW_EN)
[grammar] ~124-~124: There might be a mistake here.
Context: ...FAIL/ERROR**: Status based on thresholds - Actual Reasons: DeepEval provides LLM-...
(QB_NEW_EN)
[grammar] ~128-~128: There might be a mistake here.
Context: .../max for every metric ## 🧪 Development ### Development Tools ```bash uv sync --grou...
(QB_NEW_EN)
[grammar] ~141-~141: There might be a mistake here.
Context: ...tests --cov=src ``` ## Agent Evaluation For a detailed walkthrough of the new ag...
(QB_NEW_EN)
[grammar] ~142-~142: There might be a mistake here.
Context: ...he new agent-evaluation framework, refer [lsc_agent_eval/README.md](lsc_agent_eva...
(QB_NEW_EN)
[grammar] ~148-~148: There might be a mistake here.
Context: ...nswers.md) ## 📄 License & Contributing This project is licensed under the Apach...
(QB_NEW_EN)
[grammar] ~152-~152: There might be a mistake here.
Context: ...e for details. Contributions welcome - see development setup above for code qualit...
(QB_NEW_EN)
archive/README.md
[grammar] ~1-~1: There might be a mistake here.
Context: # Lightspeed Core Evaluation Evaluation tooling for lightspeed-core p...
(QB_NEW_EN)
[grammar] ~2-~2: There might be a mistake here.
Context: ...peed Core Evaluation Evaluation tooling for lightspeed-core project. [Refer latest ...
(QB_NEW_EN)
[grammar] ~6-~6: There might be a mistake here.
Context: ...t maintained anymore.** ## Installation - Requires Python 3.11 - Install pdm -...
(QB_NEW_EN)
[grammar] ~10-~10: There might be a mistake here.
Context: ... a clean venv for Python 3.11 and pdm. - Run pdm install - Optional: For develo...
(QB_NEW_EN)
[grammar] ~11-~11: There might be a mistake here.
Context: ...n venv for Python 3.11 and pdm. - Run pdm install - Optional: For development, run `make ins...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...ion of similarity distances are used to calculate final score. Cut-off scores are used to...
(QB_NEW_EN)
[grammar] ~18-~18: There might be a mistake here.
Context: ...eviations. This also stores a .csv file with query, pre-defined answer, API response...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: .... model: Ability to compare responses against single ground-truth answer. Here we can...
(QB_NEW_EN)
[grammar] ~20-~20: There might be a mistake here.
Context: ...del at a time. This creates a json file as summary report with scores (f1-score) f...
(QB_NEW_EN)
[grammar] ~26-~26: There might be a mistake here.
Context: ...modified or removed, please create a PR. - OLS API should be ready/live with all th...
(QB_NEW_EN)
[grammar] ~27-~27: There might be a mistake here.
Context: ... the required provider+model configured. - It is possible that we want to run both ...
(QB_NEW_EN)
[style] ~28-~28: For conciseness, try rephrasing this sentence.
Context: ...e required provider+model configured. - It is possible that we want to run both consistency and model evalu...
(MAY_MIGHT_BE)
[grammar] ~28-~28: There might be a mistake here.
Context: ...n together. To avoid multiple API calls for same query, model evaluation first ch...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ... generated by consistency evaluation. If response is not present in csv file, th...
(QB_NEW_EN)
[grammar] ~28-~28: There might be a mistake here.
Context: ...s not present in csv file, then only we call API to get the response. ### e2e test ...
(QB_NEW_EN)
[grammar] ~32-~32: Ensure spelling is correct
Context: .... Currently consistency evaluation is parimarily used to gate PRs. Final e2e suite will ...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~46-~46: There might be a mistake here.
Context: ... add new data accordingly. ## Arguments eval_type: This will control which eva...
(QB_NEW_EN)
[grammar] ~49-~49: There might be a mistake here.
Context: ...nAs provided in json file 2. model -> Compares set of models based on their response a...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...t:8080`. If deployed in a cluster, then pass cluster API url. **eval_api_token_file...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...l_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is depl...
(QB_NEW_EN)
[grammar] ~54-~54: There might be a mistake here.
Context: ...API token. Required, if OLS is deployed in cluster. eval_scenario: This is pr...
(QB_NEW_EN)
[grammar] ~58-~58: There might be a mistake here.
Context: ...ith rag. eval_query_ids: Option to give set of query ids for evaluation. By def...
(QB_NEW_EN)
[grammar] ~60-~60: There might be a mistake here.
Context: ...ed. eval_provider_model_id: We can provide set of provider/model combinations as i...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...Applicable only for model evaluation. Provide file path to the parquet file having ad...
(QB_NEW_EN)
[grammar] ~71-~71: There might be a mistake here.
Context: .../rcsconfig.yaml) eval_modes: Apart from OLS api, we may want to evaluate vanill...
(QB_NEW_EN)
[grammar] ~71-~71: There might be a mistake here.
Context: ...s**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramete...
(QB_NEW_EN)
[grammar] ~71-~71: There might be a mistake here.
Context: ...LS parameters/prompt/RAG so that we can have baseline score. This is a list of modes...
(QB_NEW_EN)
[grammar] ~73-~73: There might be a mistake here.
Context: ...ls_rag, & ols (actual api). ### Outputs Evaluation scripts creates below files. ...
(QB_NEW_EN)
[grammar] ~74-~74: There might be a mistake here.
Context: ... Evaluation scripts creates below files. - CSV file with response for given provide...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ...ate a .csv file having retrieved chunks for given set of queries with similarity sc...
(QB_NEW_EN)
[grammar] ~86-~86: There might be a mistake here.
Context: ...with similarity score. This is not part of actual evaluation. But useful to do a s...
(QB_NEW_EN)
[grammar] ~88-~88: There might be a mistake here.
Context: ...viation in the response) #### Arguments db-path: Path to the RAG index *produc...
(QB_NEW_EN)
archive/example_result/README.md
[grammar] ~14-~14: There might be a mistake here.
Context: ... llama-3-1-8b-instruct - QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
archive/README.md
13-13: Unordered list style
Expected: dash; Actual: plus
(MD004, ul-style)
13-13: Unordered list indentation
Expected: 2; Actual: 4
(MD007, ul-indent)
24-24: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
83-83: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
88-88: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4
(MD001, heading-increment)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (12)
archive/example_result/README.md (1)
14-14: Link verification passed: `eval_data/ocp_doc_qna-edited.parquet` exists at the repo root.

src/lightspeed_evaluation/core/metrics/deepeval.py (1)

13-16: Import path updates look correct.

src/lightspeed_evaluation/core/metrics/__init__.py (1)

1-7: Clean public re-exports. LGTM.

README.md (1)

141-146: Verified link target exists – `lsc_agent_eval/README.md` is present in the repo.

pyproject.toml (1)

29-29: Tenacity>=9.1.2 is valid. Version 9.1.2 is available on PyPI; no change needed.

src/lightspeed_evaluation/core/output/__init__.py (2)

3-6: LGTM: clean re-export path.

Exporting OutputHandler from `.generator` matches the new layout.

3-3: No unterminated f-string or parenthesis in `generator.py`

The “Fail” lines in `_generate_text_summary` already have matching quotes and closing parentheses (e.g., lines 202–204), so no syntax fix is required.

Likely an incorrect or invalid review comment.
src/lightspeed_evaluation/runner/__init__.py (1)
1-5: LGTM: simple, clear re-exports for CLI.

Keeps public API stable.
src/lightspeed_evaluation/runner/evaluation.py (4)
1-1: Branding/docstring update looks good.

Clear, accurate module purpose.

19-19: Docstring now reflects EvaluationDriver.

Accurate and helpful.

46-48: Driver initialization aligns with new architecture.

Instantiation with ConfigLoader is consistent with drivers.evaluation.
9-12: Re-exports confirmed; imports are valid. No action required.
lpiwowar
left a comment
This is a step in the right direction :), IMO 👍. I'm just curious:
Why don't we just remove the code that is not going to be used? 👀 I mean, the code in the current src folder, which is being moved to the archive folder. It's not like the code is going to be lost if we remove it. If one needs it, they can run git revert.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/config/models.py (1)
184-191: Make validation errors field-specific.
Current message “Value cannot be empty” is ambiguous. Include the field name for faster debugging.

Apply this diff:
```diff
-    @field_validator("provider", "model")
-    @classmethod
-    def _validate_non_empty(cls, v: str) -> str:
-        """Validate provider and model are non-empty strings."""
-        if not v or not isinstance(v, str) or not v.strip():
-            raise ValueError("Value cannot be empty")
-        return v.strip()
+    @field_validator("provider", "model")
+    @classmethod
+    def _validate_non_empty(cls, v: str, info):  # pydantic v2: FieldValidationInfo
+        """Validate provider and model are non-empty strings."""
+        if not isinstance(v, str) or not v.strip():
+            raise ValueError(f"{info.field_name} cannot be empty")
+        return v.strip()
```

Add this import (outside the changed hunk) to type-hint the validator info if you prefer:
```python
from pydantic import FieldValidationInfo  # optional typing
```
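The suggested pattern can be sketched end-to-end with pydantic v2 — using `ValidationInfo`, the current name for `FieldValidationInfo` (deprecated since pydantic 2.4). The two-field `LLMConfig` here is a minimal stand-in for the real model, not its actual definition:

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class LLMConfig(BaseModel):
    """Minimal stand-in for the real config model."""

    provider: str
    model: str

    @field_validator("provider", "model")
    @classmethod
    def _validate_non_empty(cls, v: str, info: ValidationInfo) -> str:
        """Reject empty/whitespace values, naming the offending field."""
        if not isinstance(v, str) or not v.strip():
            raise ValueError(f"{info.field_name} cannot be empty")
        return v.strip()


ok = LLMConfig(provider=" openai ", model="gpt-4")
print(ok.provider)  # whitespace is stripped

msg = ""
try:
    LLMConfig(provider="openai", model="   ")
except ValidationError as exc:
    msg = exc.errors()[0]["msg"]
print(msg)  # the error now names the field that failed
```

With `info.field_name` in the message, a failing config immediately points at `provider` or `model` instead of the generic "Value cannot be empty".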
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- src/lightspeed_evaluation/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/config/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/config/models.py (2 hunks)
- src/lightspeed_evaluation/core/llm/__init__.py (1 hunks)
- src/lightspeed_evaluation/core/llm/manager.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/custom.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
- src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
- src/lightspeed_evaluation/core/output/generator.py (1 hunks)
- src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
- src/lightspeed_evaluation/core/output/visualization.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
- src/lightspeed_evaluation/core/output/statistics.py
- src/lightspeed_evaluation/core/output/visualization.py
- src/lightspeed_evaluation/__init__.py
- src/lightspeed_evaluation/core/__init__.py
- src/lightspeed_evaluation/core/output/generator.py
- src/lightspeed_evaluation/core/metrics/custom.py
- src/lightspeed_evaluation/core/metrics/deepeval.py
- src/lightspeed_evaluation/core/metrics/ragas.py
- src/lightspeed_evaluation/core/config/__init__.py
- src/lightspeed_evaluation/core/llm/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/llm/manager.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
LLMConfig(172-202)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: mypy
🔇 Additional comments (3)
src/lightspeed_evaluation/core/config/models.py (2)
1-1: Docstring rename aligns with rebrand.
Good rename; no issues.
192-203: Remove backward-compat for `model_name` – not needed

Search for legacy `model_name` keys found hits only in the `archive/` folder and unrelated test suites; no production config uses `model_name`. The current `from_dict` defaults are explicit and safe.

Likely an incorrect or invalid review comment.
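For illustration, an explicit-defaults `from_dict` of the kind described might look like the sketch below. The field names and default values are hypothetical stand-ins, not the real `core/config/models.py`:

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class LLMConfig:
    """Hypothetical sketch; the real fields live in core/config/models.py."""

    provider: str = "openai"
    model: str = "gpt-4"

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "LLMConfig":
        # Explicit defaults: a legacy key such as `model_name` is simply
        # ignored rather than remapped, matching the "no backward-compat" call.
        return cls(
            provider=data.get("provider", cls.provider),
            model=data.get("model", cls.model),
        )


legacy = LLMConfig.from_dict({"model_name": "old-style-key"})
print(legacy.model)  # legacy key is ignored; the default is used
```

Because the legacy key is never consulted, old configs fail loudly toward defaults instead of silently mapping `model_name` onto `model`.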
src/lightspeed_evaluation/core/llm/manager.py (1)
6-7: LLMConfig re-export verified
`LLMConfig` is imported from `models` and included in `__init__.py`’s exports, so `from ..config import LLMConfig` is stable.
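The re-export pattern being verified can be sketched with stand-in modules. The `demo_config` names below are illustrative, registered dynamically in `sys.modules` so the snippet is self-contained:

```python
import sys
import types

# models.py defines the class...
models = types.ModuleType("demo_config.models")


class LLMConfig:
    """Stand-in for the real pydantic model."""


models.LLMConfig = LLMConfig

# ...and the package __init__.py re-exports it, keeping the import path stable.
pkg = types.ModuleType("demo_config")
pkg.models = models
pkg.LLMConfig = models.LLMConfig  # equivalent to `from .models import LLMConfig`
pkg.__all__ = ["LLMConfig"]

sys.modules["demo_config"] = pkg
sys.modules["demo_config.models"] = models

# What a relative import like `from ..config import LLMConfig` resolves to:
from demo_config import LLMConfig as Exported

print(Exported is LLMConfig)  # same object, so the public path is stable
```

Callers depend only on the package path; `models.py` can be reorganized later without breaking `from ..config import LLMConfig`.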
tisnik
left a comment
I like this approach.
You are right, but I just didn't want to delete it yet for convenience. There are some additional features/use-cases (not usable as-is, though). I will go through the code once I am done with all current changes (maybe next week) and will create the necessary stories for future enhancement. Then I will remove this.
VladimirKadlec
left a comment
As Lukas said, I'd remove the archive folder.
LGTM
This will fix lint issues
Summary by CodeRabbit
New Features
Refactor
Documentation
Chores