Conversation

@asamal4 (Collaborator) commented Sep 1, 2025

  • Archive old eval logic from OLS.
  • Make lsc_eval the primary evaluation framework and rename it.

This will fix the lint issues.

Summary by CodeRabbit

  • New Features

    • New CLI tools: lightspeed-eval and generate_answers.
    • YAML-based system/evaluation configs; turn- and conversation-level metrics.
    • Rich outputs & visualizations (CSV/JSON/TXT/PNG) with summary statistics.
  • Refactor

    • Rebranded to LightSpeed Evaluation Framework; consolidated package and renamed evaluation driver.
    • Public API and import paths reorganized (new package/entry-point layout).
  • Documentation

    • README overhauled with feature-first guidance; legacy docs archived/removed.
  • Chores

    • .gitignore and telemetry/config ignore location adjusted.

@coderabbitai bot (Contributor) commented Sep 1, 2025

Walkthrough

Project reorganized and rebranded from lsc_eval / lightspeed-core-evaluation into a new lightspeed_evaluation package: packaging, CLI entry points, public exports, import paths, logging namespaces, docs, config formats, and many module initializers were added, removed, or relocated; the EvaluationEngine class was renamed EvaluationDriver.

Changes

Cohort / File(s) Summary
Repo metadata & housekeeping
/.gitignore
Adjusted ignore rules: removed lsc_eval/eval_output*/ ignore, moved .deepeval/ ignore to repo root, preserved wip*/ with newline formatting tweak.
Branding & top-level docs
/README.md, /config/system.yaml, /config/evaluation_data.yaml
Rewrote README and rebranded to “LightSpeed Evaluation Framework”; replaced legacy usage with YAML-driven config examples; removed conversation_group_id and updated header comments.
Legacy archive
/archive/README.md, /archive/example_result/README.md, /archive/pyproject.toml
Added archived documentation and a legacy pyproject; updated example result path and preserved legacy console scripts metadata.
Remove old lsc_eval surface
/lsc_eval/README.md, /lsc_eval/pyproject.toml, /lsc_eval/src/lsc_eval/...
Deleted lsc_eval package README, pyproject, and package-root init re-exports (core, llm_managers, metrics), removing prior top-level re-exported symbols.
Top-level packaging & CLI
/pyproject.toml
Renamed project to lightspeed-evaluation, changed dependencies set, updated requires-python, added new entry points (lightspeed-eval, generate_answers), and targeted src/lightspeed_evaluation.
New public package surface
/src/lightspeed_evaluation/__init__.py, /src/lightspeed_evaluation/core/..., /src/lightspeed_evaluation/drivers/__init__.py, /src/lightspeed_evaluation/runner/__init__.py
Added new package initializers that re-export core components, establish __all__, set __version__ = "0.1.0", and expose drivers/runner entrypoints.
Config & logging tweaks
/src/lightspeed_evaluation/core/config/loader.py, /src/lightspeed_evaluation/core/config/models.py, /src/lightspeed_evaluation/core/config/validator.py
Docstring and logger namespace updated to lightspeed_evaluation; LLMConfig reworked (fields renamed/added/removed, new from_dict factory, validator changed); validator import source adjusted.
LLM package reorg
/src/lightspeed_evaluation/core/llm/__init__.py, /src/lightspeed_evaluation/core/llm/manager.py
Introduced llm package init that re-exports LLMManager/DeepEval/Ragas and LLMConfig; manager module now imports LLMConfig from core.config instead of defining it inline.
Metrics import refactor
/src/lightspeed_evaluation/core/metrics/{custom.py,deepeval.py,ragas.py}, /src/lightspeed_evaluation/core/metrics/__init__.py
Added metrics package init; updated imports to new module paths (..config, ..llm, ..output.statistics); no behavior changes.
Output pipeline refactor
/src/lightspeed_evaluation/core/output/{__init__.py,generator.py,statistics.py,visualization.py}
Reorganized imports and moved OutputHandler sourcing to .generator; moved stats helpers to .statistics; updated logger namespaces.
Driver rename & CLI runner
/src/lightspeed_evaluation/drivers/evaluation.py, /src/lightspeed_evaluation/runner/evaluation.py
Renamed EvaluationEngine → EvaluationDriver (class and all usages); rewired imports to the new package layout; updated the CLI runner to use the driver and adjusted branding/messages.
Minor formatting
/lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
Reformatted a getattr call and moved a pylint comment; semantics unchanged.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant CLI as lightspeed-eval (CLI)
  participant Loader as ConfigLoader
  participant Validator as DataValidator
  participant Driver as EvaluationDriver
  participant LLM as LLMManager
  participant Output as OutputHandler

  User->>CLI: provide system.yaml & evaluation_data.yaml
  CLI->>Loader: load configs
  CLI->>Validator: validate configs & metrics
  Validator-->>CLI: ok / errors
  CLI->>Driver: run_evaluation(configs)
  Driver->>LLM: initialize providers/adapters
  Driver->>LLM: evaluate turns & conversations
  LLM-->>Driver: metrics/results
  Driver->>Output: write CSV/JSON/TXT/PNG
  Output-->>User: reports & stats

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • VladimirKadlec
  • tisnik
  • Anxhela21

Poem

I hop through paths that shifted names,
From engine’s hum to driver’s lanes.
New maps and exports, tidy and bright,
YAML trails guide the metrics' flight.
I thump once — releases look just right. 🐇✨



@asamal4 (Collaborator, author) commented Sep 1, 2025

@coderabbitai review

@coderabbitai bot (Contributor) commented Sep 1, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@asamal4 asamal4 marked this pull request as ready for review September 1, 2025 06:05
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
src/lightspeed_evaluation/core/metrics/custom.py (1)

123-129: Avoid mis-normalizing already-normalized fractional scores (e.g., “12/10” → 1.2).

Current logic divides by 10/100 even when a fraction was parsed, producing under-scaled scores. Clamp fraction/out-of cases to 1.0, otherwise scale plain 0–10/0–100 numbers.

Apply this diff:

-        # Normalize score to 0-1 range if needed
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        # Normalize score to 0-1 range if needed
+        if score is not None and score > 1.0:
+            # If text contains explicit scales, avoid re-normalizing fractions like "12/10"
+            if re.search(r'(\d+\.?\d*)/(\d+\.?\d*)|\bout of\b', response, re.IGNORECASE):
+                score = min(score, 1.0)
+            elif score <= 10.0:  # Assume 0-10 scale
+                score = score / 10.0
+            elif score <= 100.0:  # Assume 0-100 scale
+                score = score / 100.0
+            else:
+                score = 1.0
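As a standalone illustration of the suggested fix, the normalization can be factored into a helper; this is a minimal sketch (the helper name and the upstream score-parsing step are assumed, not part of the diff):

```python
import re

def normalize_score(score: float, response: str) -> float:
    """Normalize a judge score to the 0-1 range.

    Fractions such as "12/10" (or "out of" phrasing) were already parsed
    as a ratio, so they are clamped rather than divided again; bare
    numbers are assumed to be on a 0-10 or 0-100 scale.
    """
    if score <= 1.0:
        return score
    if re.search(r"(\d+\.?\d*)/(\d+\.?\d*)|\bout of\b", response, re.IGNORECASE):
        return min(score, 1.0)  # already a ratio; never re-scale
    if score <= 10.0:
        return score / 10.0
    if score <= 100.0:
        return score / 100.0
    return 1.0
```

With this shape, "12/10" clamps to 1.0 instead of being under-scaled to 0.12.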
src/lightspeed_evaluation/core/metrics/deepeval.py (1)

86-94: Guard against empty conversations in completeness metric.

DeepEval may error on empty turn lists. Add an early check like relevancy does.

     def _evaluate_conversation_completeness(
         self,
         conv_data: Any,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
         """Evaluate conversation completeness."""
         if not is_conversation:
             return None, "Conversation completeness is a conversation-level metric"
+        if not getattr(conv_data, "turns", None):
+            return None, "No conversation turns available for completeness evaluation"
 
         test_case = self._build_conversational_test_case(conv_data)
         metric = ConversationCompletenessMetric(model=self.llm_manager.get_llm())
src/lightspeed_evaluation/core/output/visualization.py (1)

201-238: Avoid injecting zeros into boxplots; they distort distributions.

Build a list-of-arrays and let NaNs represent missing, or pass ragged arrays directly.

-        # Convert to DataFrame with equal-length arrays (pad with NaN)
-        max_len = max(len(scores) for scores in metric_groups.values())
-        score_data = {}
-        for metric_id, scores in metric_groups.items():
-            padded_scores = scores + [np.nan] * (max_len - len(scores))
-            score_data[metric_id] = padded_scores
-
-        results_df = pd.DataFrame(score_data)
-
-        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
-        ax.set_xlabel("Score", fontsize=12, fontweight="bold")
-        ax.set_xlim(0, 1)
+        # Prepare ragged arrays per metric (no zero padding)
+        metrics = list(metric_groups.keys())
+        values = [metric_groups[m] for m in metrics]
+
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
+        ax.set_xlabel("Score", fontsize=12, fontweight="bold")
+        ax.set_xlim(0, 1)
@@
-        bplot = ax.boxplot(
-            results_df.fillna(0),
+        bplot = ax.boxplot(
+            values,
             sym=".",
             widths=0.5,
             vert=False,
             patch_artist=True,
         )
-
-        labels = results_df.columns
+        labels = metrics
@@
-        ax.set_yticklabels(labels)
+        ax.set_yticklabels(labels)
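For what it's worth, Matplotlib's boxplot already accepts unequal-length sequences, so no padding is needed at all; a minimal headless sketch with invented sample data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Unequal-length score lists per metric -- no NaN/zero padding required
metric_groups = {
    "ragas:faithfulness": [0.9, 0.8, 0.95, 0.85],
    "custom:answer_correctness": [0.7, 0.6],
}
labels = list(metric_groups)
values = [metric_groups[m] for m in labels]

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(values, sym=".", widths=0.5, vert=False, patch_artist=True)
ax.set_yticklabels(labels)
ax.set_xlim(0, 1)
ax.set_xlabel("Score")
fig.savefig("boxplot_demo.png", bbox_inches="tight")
```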
src/lightspeed_evaluation/core/metrics/ragas.py (4)

79-82: Guard against missing/renamed Ragas result columns

Accessing df[result_key] will raise KeyError if Ragas changes column names (e.g., “nv_context_relevance” vs “context_relevance”). Convert this into a ValueError with available columns for better UX.

-        result = evaluate(dataset, metrics=[metric_instance])
-        df = result.to_pandas()
-        score = df[result_key].iloc[0]
-        return score, f"Ragas {metric_name}: {score:.2f}"
+        result = evaluate(dataset, metrics=[metric_instance])
+        df = result.to_pandas()
+        try:
+            score = df[result_key].iloc[0]
+        except KeyError as exc:
+            available = ", ".join(df.columns.astype(str))
+            raise ValueError(
+                f"Expected result column '{result_key}' not found. Available: {available}"
+            ) from exc
+        return score, f"Ragas {metric_name}: {score:.2f}"

122-131: Require TurnData for response relevancy

If turn_data is None, we silently evaluate empty strings. Fail fast with a clear message.

     def _evaluate_response_relevancy(
@@
-        if is_conversation:
+        if is_conversation:
             return None, "Response relevancy is a turn-level metric"
-
-        query, response, _ = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for response relevancy"
+        query, response, _ = self._extract_turn_data(turn_data)

141-152: Require TurnData for faithfulness

Same issue as above; avoid evaluating empty fields.

-        if is_conversation:
+        if is_conversation:
             return None, "Faithfulness is a turn-level metric"
-
-        query, response, contexts = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for faithfulness"
+        query, response, contexts = self._extract_turn_data(turn_data)

164-176: Add None-check for context precision without reference

This metric also needs turn_data; currently it proceeds with empty values.

-        if is_conversation:
+        if is_conversation:
             return None, "Context precision without reference is a turn-level metric"
-
-        query, response, contexts = self._extract_turn_data(turn_data)
+        if turn_data is None:
+            return None, "TurnData is required for context precision without reference"
+        query, response, contexts = self._extract_turn_data(turn_data)
src/lightspeed_evaluation/core/output/generator.py (1)

49-64: Honor format toggles from system config (csv/json/txt)

ConfigLoader provides csv_format, json_format, and txt_format, but they’re ignored here. Respect them to avoid generating unwanted artifacts.

-        # Generate CSV report
-        csv_file = self._generate_csv_report(results, base_filename)
-        print(f"  ✅ CSV: {csv_file}")
+        # Generate CSV report (if enabled)
+        if getattr(self.system_config, "csv_format", True):
+            csv_file = self._generate_csv_report(results, base_filename)
+            print(f"  ✅ CSV: {csv_file}")

-        # Generate JSON summary (pass pre-calculated stats)
-        json_file = self._generate_json_summary(
-            results, base_filename, basic_stats, detailed_stats
-        )
-        print(f"  ✅ JSON: {json_file}")
+        # Generate JSON summary (if enabled)
+        if getattr(self.system_config, "json_format", True):
+            json_file = self._generate_json_summary(
+                results, base_filename, basic_stats, detailed_stats
+            )
+            print(f"  ✅ JSON: {json_file}")

-        # Generate text summary (pass pre-calculated stats)
-        txt_file = self._generate_text_summary(
-            results, base_filename, basic_stats, detailed_stats
-        )
-        print(f"  ✅ TXT: {txt_file}")
+        # Generate text summary (if enabled)
+        if getattr(self.system_config, "txt_format", True):
+            txt_file = self._generate_text_summary(
+                results, base_filename, basic_stats, detailed_stats
+            )
+            print(f"  ✅ TXT: {txt_file}")
🧹 Nitpick comments (25)
.gitignore (2)

183-183: Root-level .deepeval/ ignore: confirm intent and preserve templates

Moving the ignore to /.deepeval/ makes sense if all telemetry/config should remain untracked. If you plan to keep example configs or docs under this dir, add negation rules to retain them.

Example (adjust filenames as needed):

 .deepeval/
+!.deepeval/README.md
+!.deepeval/*.example.*
+!.deepeval/*.sample.*

186-186: Constrain wip*/ to repo root to avoid broad ignores

wip*/ matches any directory named like wip* at any depth. If you only intend a root-level sandbox (per the comment), scope it to the root.

-wip*/
+/wip*/

If the intent is a single folder, consider just /wip/.

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)

97-109: Make content extraction robust to dict/object LiteLLM responses and completion/chat variants.

Guard for dict-like structures and fall back to text for completion-style responses.

Apply:

-                choices = getattr(response, "choices", None)
+                # Support both object-like and dict-like LiteLLM responses
+                choices = getattr(response, "choices", None)
+                if choices is None and isinstance(response, dict):
+                    choices = response.get("choices")
                 if choices and len(choices) > 0:
-                    message = getattr(
-                        choices[0],  # pylint: disable=unsubscriptable-object
-                        "message",
-                        None,
-                    )
-                    if message:
-                        content = getattr(message, "content", None)
+                    first = choices[0]  # pylint: disable=unsubscriptable-object
+                    # Chat-style
+                    message = getattr(first, "message", None)
+                    if message is None and isinstance(first, dict):
+                        message = first.get("message")
+                    if message:
+                        content = getattr(message, "content", None)
+                        if content is None and isinstance(message, dict):
+                            content = message.get("content")
+                    # Completion-style fallback
+                    if not content:
+                        content = getattr(first, "text", None)
+                        if content is None and isinstance(first, dict):
+                            content = first.get("text")
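To make the two response shapes easy to test, the extraction could live in a small helper; a sketch with an invented function name (the actual structure of judge.py may differ):

```python
from types import SimpleNamespace

def extract_content(response):
    """Pull message content from object-like or dict-like LiteLLM responses,
    falling back to completion-style `text`."""
    choices = getattr(response, "choices", None)
    if choices is None and isinstance(response, dict):
        choices = response.get("choices")
    if not choices:
        return None
    first = choices[0]
    # Chat-style: choices[0].message.content
    message = getattr(first, "message", None)
    if message is None and isinstance(first, dict):
        message = first.get("message")
    content = None
    if message is not None:
        content = getattr(message, "content", None)
        if content is None and isinstance(message, dict):
            content = message.get("content")
    # Completion-style fallback: choices[0].text
    if not content:
        content = getattr(first, "text", None)
        if content is None and isinstance(first, dict):
            content = first.get("text")
    return content
```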
archive/README.md (3)

88-88: Adjust heading level to maintain proper hierarchy.

Line 88 has a heading jump from h1 to h4. It should be h2 to maintain proper document structure.

Apply this diff to fix the heading level:

-#### Arguments
+## Arguments

35-37: Add language specification to fenced code blocks.

Lines 35-37 contain a fenced code block without a language specification.

Apply this diff to add the language specification:

-```
+```bash
 pdm run evaluate

83-85: Add language specification to fenced code blocks.

Lines 83-85 contain a fenced code block without a language specification.

Apply this diff to add the language specification:

-```
+```bash
 python -m lightspeed_core_evaluation.evaluation.query_rag

src/lightspeed_evaluation/core/metrics/custom.py (1)

229-236: Consider a deterministic system prompt for rubric clarity.

Passing a brief system_prompt with the rubric reduces variance and parsing failures.

src/lightspeed_evaluation/core/output/visualization.py (3)

79-82: Don't return early when by_metric is empty; still generate other graphs.

Status pie and heatmap can still be useful with no metric stats.

-            if not summary_stats["by_metric"]:
-                self.logger.warning("No metric data available for graph generation")
-                return {}
+            if not summary_stats["by_metric"]:
+                self.logger.warning("No metric data available for pass-rate graph")
+                # continue; still generate other graphs

143-176: Fix axis labels and honor configured figsize/dpi for pass rates graph.

X/Y labels are swapped; also use self.figsize/self.dpi for consistency.

-        _, ax = plt.subplots(figsize=(12, 8))
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
@@
-        filename = self.graphs_dir / f"{base_filename}_pass_rates.png"
-        plt.savefig(filename, dpi=300, bbox_inches="tight")
+        filename = self.graphs_dir / f"{base_filename}_pass_rates.png"
+        plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")

420-421: Heatmap axis label: x-axis is Metrics, not “Pass Rate (%)”.

Colorbar already communicates the pass-rate unit.

-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
         ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
src/lightspeed_evaluation/core/metrics/ragas.py (2)

255-257: Potential incorrect column key: “nv_context_relevance”

This column name looks non-standard and may not exist in current Ragas outputs. Consider “context_relevance” (or “context_relevancy”) instead, or rely on the metric’s actual output column after evaluation.

If correct, keep; if not, change to the appropriate key or use the defensive handling added above to surface the available columns.


22-49: Micro-optimization: avoid pandas for single-value extraction

For single-row, single-metric results, converting to pandas is heavy. If Ragas exposes a direct API to get the score, prefer that to reduce overhead. If not feasible, current approach is acceptable.

src/lightspeed_evaluation/core/output/generator.py (2)

28-28: Ensure nested output directories are created

Use parents=True to avoid failures when intermediate directories are missing.

-        self.output_dir.mkdir(exist_ok=True)
+        self.output_dir.mkdir(parents=True, exist_ok=True)

174-175: Branding consistency: use “LightSpeed Evaluation Framework”

Align the text report header with the new naming used elsewhere.

-            f.write("LSC Evaluation Framework - Summary Report\n")
+            f.write("LightSpeed Evaluation Framework - Summary Report\n")
src/lightspeed_evaluation/runner/evaluation.py (1)

38-40: Double validation of evaluation data

Data is validated in DataValidator.load_evaluation_data and again inside EvaluationDriver.run_evaluation(). Consider skipping the second validation when upstream has already validated to save time on large datasets.

I can draft a small change to pass a flag into EvaluationDriver.run_evaluation to skip validation when appropriate.

Also applies to: 46-53

src/lightspeed_evaluation/core/llm/__init__.py (1)

3-13: Avoid ambiguous LLMConfig export
There are two LLMConfig classes in the codebase (core/llm/manager.py and core/config/models.py). To prevent confusion, alias or drop the re-export:

Option (non-breaking alias):

-from .manager import LLMManager, LLMConfig, LLMError
+from .manager import LLMManager, LLMConfig as LLMRuntimeConfig, LLMError

 __all__ = [
-    "LLMConfig",
+    "LLMRuntimeConfig",
     "LLMManager",
     "LLMError",
     "DeepEvalLLMManager",
     "RagasLLMManager",
 ]
src/lightspeed_evaluation/core/__init__.py (1)

11-12: Import from the aggregated config package to avoid submodule coupling.

You already re-export EvaluationResult and TurnData in core.config. Importing from core.config.models here creates a leaky dependency on the submodule path.

-from .config.models import EvaluationResult, TurnData
+from .config import EvaluationResult, TurnData
src/lightspeed_evaluation/__init__.py (2)

4-14: Docstring mentions “Runner” but it isn’t exported.

Either export a runner or update the docstring to avoid confusion.

- - Runner: Simple runner for command-line usage
- - Core modules organized by functionality (config, llm, metrics, output)
+ - Drivers: programmatic API via EvaluationDriver
+ - Core modules organized by functionality (config, llm, metrics, output)

31-35: Heavy deps are declared in pyproject.toml
Verified that pyproject.toml lists ragas>=0.3.0, deepeval>=1.3.0, matplotlib>=3.5.0, and seaborn>=0.11.0, so the top-level imports won’t fail due to missing packages. To reduce import-time overhead, you may defer loading of metrics and graph generators via lazy imports (e.g. PEP 562 __getattr__).

README.md (3)

20-26: Clarify uv prerequisite in Quick Start.

Readers may not have uv installed. Add a brief note or command to install uv before uv sync.

 ### Installation
 ```bash
 # From Git
 pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git

-# Local Development  
-uv sync
+# Local Development
+# Requires uv (https://docs.astral.sh/uv/)
+pipx install uv  # or: pip install uv
+uv sync

61-81: Align turn_metrics with provided metadata in the example YAML.

turn_metrics_metadata includes ragas:response_relevancy, but turn_metrics omits it. Either add the metric to turn_metrics or drop its metadata to avoid confusion.

   turn_metrics:
     - "ragas:faithfulness"
     - "custom:answer_correctness"
+    - "ragas:response_relevancy"

Also applies to: 83-112


31-36: Environment variables across providers.

Since LiteLLM enables multiple providers, consider adding a one-liner noting additional env vars (e.g., ANTHROPIC_API_KEY, Azure creds) when those providers are selected.

pyproject.toml (3)

22-22: Avoid pinning heavyweight Torch by default; make it optional.

Torch is large and often unnecessary for evaluation-only flows. Also, a strict ==2.7.0 pin risks resolution failures across platforms. Move Torch to an optional extra and loosen the version with a sane upper bound.

 dependencies = [
   # Core evaluation framework dependencies
   "ragas>=0.3.0",
   "deepeval>=1.3.0",
   "litellm>=1.0.0",
   "pydantic>=2.0.0",
   "pyyaml>=6.0",
   "pandas>=2.1.4",
-  "datasets>=2.0.0",
+  "datasets>=2.0.0",
   "matplotlib>=3.5.0",
   "seaborn>=0.11.0",
   "numpy>=1.23.0",
-  "torch==2.7.0",
   # Agent evaluation dependencies (for future integration)
   "httpx>=0.27.2",
   "tqdm>=4.67.1",
   # Generate answers dependencies
   "click>=8.0.0",
   "diskcache>=5.6.3",
   "tenacity>=9.1.2",
 ]

+[project.optional-dependencies]
+# Install with: pip install ".[agent]"
+agent = [
+  "torch>=2.2,<3",
+]

12-16: Pin upper bounds for volatile libs to reduce breakage.

Ragas, DeepEval, and LiteLLM ship frequent breaking changes. Add an upper bound to improve reproducibility.

-    "ragas>=0.3.0",
-    "deepeval>=1.3.0",
-    "litellm>=1.0.0",
+    "ragas>=0.3,<0.4",
+    "deepeval>=1.3,<2.0",
+    "litellm>=1.0,<2.0",

6-8: Python version window is reasonable.

>=3.11,<3.13 is a safe baseline given ecosystem support; consider widening to <3.14 once CI confirms green on 3.13.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between ca9a863 and aa729db.

⛔ Files ignored due to path filters (9)
  • archive/assets/response_eval_flow.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-answer_relevancy.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-answer_similarity_llm.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-cos_score.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_f1.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_precision.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_recall.png is excluded by !**/*.png
  • lsc_eval/uv.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .gitignore (1 hunks)
  • README.md (1 hunks)
  • archive/README.md (1 hunks)
  • archive/example_result/README.md (1 hunks)
  • archive/pyproject.toml (1 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1 hunks)
  • lsc_eval/README.md (0 hunks)
  • lsc_eval/pyproject.toml (0 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/metrics/__init__.py (0 hunks)
  • pyproject.toml (3 hunks)
  • src/lightspeed_evaluation/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/loader.py (2 hunks)
  • src/lightspeed_evaluation/core/config/models.py (1 hunks)
  • src/lightspeed_evaluation/core/config/validator.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
  • src/lightspeed_evaluation/core/output/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
  • src/lightspeed_evaluation/core/output/visualization.py (2 hunks)
  • src/lightspeed_evaluation/drivers/__init__.py (1 hunks)
  • src/lightspeed_evaluation/drivers/evaluation.py (4 hunks)
  • src/lightspeed_evaluation/runner/__init__.py (1 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
💤 Files with no reviewable changes (6)
  • lsc_eval/README.md
  • lsc_eval/pyproject.toml
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py
  • lsc_eval/src/lsc_eval/metrics/__init__.py
  • lsc_eval/src/lsc_eval/core/__init__.py
  • lsc_eval/src/lsc_eval/__init__.py
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • config/system.yaml
  • archive/README.md
  • src/lightspeed_evaluation/runner/evaluation.py
  • pyproject.toml
  • README.md
📚 Learning: 2025-07-28T14:26:03.119Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#22
File: lsc_agent_eval/src/lsc_agent_eval/core/agent_goal_eval/eval_data.py:146-153
Timestamp: 2025-07-28T14:26:03.119Z
Learning: In the lsc_agent_eval framework, evaluations are identified by a composite key of (conversation_group, eval_id). This design allows the same eval_id to exist across different conversation groups (logged as warning) but prevents duplicates within the same conversation group (validation error). This supports logical separation and reusable eval_ids across different conversation contexts.

Applied to files:

  • config/evaluation_data.yaml
🧬 Code graph analysis (16)
src/lightspeed_evaluation/core/output/statistics.py (1)
src/lightspeed_evaluation/core/config/models.py (2)
  • EvaluationResult (133-169)
  • TurnData (8-44)
src/lightspeed_evaluation/core/metrics/custom.py (3)
src/lightspeed_evaluation/core/config/models.py (1)
  • TurnData (8-44)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • EvaluationScope (11-16)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/config/models.py (1)
  • TurnData (8-44)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • EvaluationScope (11-16)
src/lightspeed_evaluation/core/config/__init__.py (3)
src/lightspeed_evaluation/core/config/loader.py (3)
  • ConfigLoader (193-275)
  • setup_environment_variables (30-47)
  • SystemConfig (151-190)
src/lightspeed_evaluation/core/config/models.py (3)
  • EvaluationData (47-130)
  • EvaluationResult (133-169)
  • TurnData (8-44)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/runner/__init__.py (1)
src/lightspeed_evaluation/runner/evaluation.py (2)
  • main (95-128)
  • run_evaluation (15-92)
src/lightspeed_evaluation/__init__.py (10)
src/lightspeed_evaluation/drivers/evaluation.py (1)
  • EvaluationDriver (118-322)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/config/loader.py (2)
  • ConfigLoader (193-275)
  • SystemConfig (151-190)
src/lightspeed_evaluation/core/config/models.py (3)
  • EvaluationData (47-130)
  • TurnData (8-44)
  • EvaluationResult (133-169)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
  • RagasMetrics (23-263)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (19-138)
src/lightspeed_evaluation/core/metrics/custom.py (1)
  • CustomMetrics (29-251)
src/lightspeed_evaluation/core/output/generator.py (1)
  • OutputHandler (15-244)
src/lightspeed_evaluation/core/output/visualization.py (1)
  • GraphGenerator (17-439)
src/lightspeed_evaluation/core/output/__init__.py (1)
src/lightspeed_evaluation/core/output/generator.py (1)
  • OutputHandler (15-244)
src/lightspeed_evaluation/drivers/__init__.py (1)
src/lightspeed_evaluation/drivers/evaluation.py (1)
  • EvaluationDriver (118-322)
src/lightspeed_evaluation/core/__init__.py (4)
src/lightspeed_evaluation/core/config/loader.py (3)
  • ConfigLoader (193-275)
  • SystemConfig (151-190)
  • setup_environment_variables (30-47)
src/lightspeed_evaluation/core/config/models.py (3)
  • EvaluationData (47-130)
  • EvaluationResult (133-169)
  • TurnData (8-44)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/metrics/__init__.py (3)
src/lightspeed_evaluation/core/metrics/custom.py (1)
  • CustomMetrics (29-251)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (19-138)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
  • RagasMetrics (23-263)
src/lightspeed_evaluation/core/metrics/ragas.py (4)
src/lightspeed_evaluation/core/config/models.py (1)
  • TurnData (8-44)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/llm/ragas.py (1)
  • RagasLLMManager (84-112)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • EvaluationScope (11-16)
src/lightspeed_evaluation/runner/evaluation.py (5)
src/lightspeed_evaluation/core/config/loader.py (2)
  • ConfigLoader (193-275)
  • setup_environment_variables (30-47)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/core/output/generator.py (1)
  • OutputHandler (15-244)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_basic_stats (19-45)
src/lightspeed_evaluation/drivers/evaluation.py (2)
  • EvaluationDriver (118-322)
  • run_evaluation (146-172)
src/lightspeed_evaluation/drivers/evaluation.py (8)
src/lightspeed_evaluation/core/config/loader.py (1)
  • ConfigLoader (193-275)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/core/config/models.py (3)
  • EvaluationData (47-130)
  • EvaluationResult (133-169)
  • TurnData (8-44)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/metrics/custom.py (1)
  • CustomMetrics (29-251)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (19-138)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
  • RagasMetrics (23-263)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • EvaluationScope (11-16)
src/lightspeed_evaluation/core/llm/__init__.py (4)
src/lightspeed_evaluation/core/llm/manager.py (3)
  • LLMManager (36-173)
  • LLMConfig (13-33)
  • LLMError (8-9)
src/lightspeed_evaluation/core/config/models.py (1)
  • LLMConfig (172-195)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/ragas.py (1)
  • RagasLLMManager (84-112)
src/lightspeed_evaluation/core/output/generator.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (133-169)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
src/lightspeed_evaluation/core/output/visualization.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (133-169)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
🪛 LanguageTool
archive/README.md

[grammar] ~1-~1: There might be a mistake here.
Context: # Lightspeed Core Evaluation Evaluation tooling for lightspeed-core p...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...peed Core Evaluation Evaluation tooling for lightspeed-core project. [Refer latest ...

(QB_NEW_EN)


[grammar] ~6-~6: There might be a mistake here.
Context: ...t maintained anymore.** ## Installation - Requires Python 3.11 - Install pdm -...

(QB_NEW_EN)


[grammar] ~10-~10: There might be a mistake here.
Context: ... a clean venv for Python 3.11 and pdm. - Run pdm install - Optional: For develo...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...n venv for Python 3.11 and pdm. - Run pdm install - Optional: For development, run `make ins...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...ion of similarity distances are used to calculate final score. Cut-off scores are used to...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...eviations. This also stores a .csv file with query, pre-defined answer, API response...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: .... model: Ability to compare responses against single ground-truth answer. Here we can...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: ...del at a time. This creates a json file as summary report with scores (f1-score) f...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...modified or removed, please create a PR. - OLS API should be ready/live with all th...

(QB_NEW_EN)


[grammar] ~27-~27: There might be a mistake here.
Context: ... the required provider+model configured. - It is possible that we want to run both ...

(QB_NEW_EN)


[style] ~28-~28: For conciseness, try rephrasing this sentence.
Context: ...e required provider+model configured. - It is possible that we want to run both consistency and model evalu...

(MAY_MIGHT_BE)


[grammar] ~28-~28: There might be a mistake here.
Context: ...n together. To avoid multiple API calls for same query, model evaluation first ch...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ... generated by consistency evaluation. If response is not present in csv file, th...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...s not present in csv file, then only we call API to get the response. ### e2e test ...

(QB_NEW_EN)


[grammar] ~32-~32: Ensure spelling is correct
Context: .... Currently consistency evaluation is parimarily used to gate PRs. Final e2e suite will ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~46-~46: There might be a mistake here.
Context: ...add new data accordingly. ### Arguments eval_type: This will control which eva...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ...nAs provided in json file 2. model -> Compares set of models based on their response a...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...t:8080`. If deployed in a cluster, then pass cluster API url. **eval_api_token_file...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...l_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is depl...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...API token. Required, if OLS is deployed in cluster. eval_scenario: This is pr...

(QB_NEW_EN)


[grammar] ~56-~56: Ensure spelling is correct
Context: ...enario**: This is primarily required to indetify which pre-defined answers need to be co...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~58-~58: There might be a mistake here.
Context: ...ith rag. eval_query_ids: Option to give set of query ids for evaluation. By def...

(QB_NEW_EN)


[grammar] ~60-~60: There might be a mistake here.
Context: ...ed. eval_provider_model_id: We can provide set of provider/model combinations as i...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...Applicable only for model evaluation. Provide file path to the parquet file having ad...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: .../rcsconfig.yaml) eval_modes: Apart from OLS api, we may want to evaluate vanill...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...s**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramate...

(QB_NEW_EN)


[grammar] ~71-~71: Ensure spelling is correct
Context: ...evaluate vanilla model or with just OLS paramaters/prompt/RAG so that we can have baseline...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~71-~71: There might be a mistake here.
Context: ...LS paramaters/prompt/RAG so that we can have baseline score. This is a list of modes...

(QB_NEW_EN)


[grammar] ~73-~73: There might be a mistake here.
Context: ...ls_rag, & ols (actual api). ### Outputs Evaluation scripts creates below files. ...

(QB_NEW_EN)


[grammar] ~74-~74: There might be a mistake here.
Context: ... Evaluation scripts creates below files. - CSV file with response for given provide...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...ate a .csv file having retrieved chunks for given set of queries with similarity sc...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...with similarity score. This is not part of actual evaluation. But useful to do a s...

(QB_NEW_EN)


[grammar] ~88-~88: There might be a mistake here.
Context: ...viation in the response) #### Arguments db-path: Path to the RAG index *produc...

(QB_NEW_EN)

archive/example_result/README.md

[grammar] ~14-~14: There might be a mistake here.
Context: ... llama-3-1-8b-instruct - QnA evaluation dataset: [QnAs from OCP doc](../eval_data/ocp_do...

(QB_NEW_EN)

README.md

[grammar] ~5-~5: There might be a mistake here.
Context: ... GenAI applications. ## 🎯 Key Features - Multi-Framework Support: Seamlessly us...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ... integration planned) ## 🚀 Quick Start ### Installation ```bash # From Git pip ins...

(QB_NEW_EN)


[grammar] ~114-~114: There might be a mistake here.
Context: ...t...." ``` ## 📈 Output & Visualization ### Generated Reports - CSV: Detailed re...

(QB_NEW_EN)


[grammar] ~116-~116: There might be a mistake here.
Context: ...t & Visualization ### Generated Reports - CSV: Detailed results with status, sco...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...led results with status, scores, reasons - JSON: Summary statistics with score di...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...mary statistics with score distributions - TXT: Human-readable summary - PNG:...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...utions - TXT: Human-readable summary - PNG: 4 visualization types (pass rates...

(QB_NEW_EN)


[grammar] ~122-~122: There might be a mistake here.
Context: ...us breakdown) ### Key Metrics in Output - PASS/FAIL/ERROR: Status based on thres...

(QB_NEW_EN)


[grammar] ~123-~123: There might be a mistake here.
Context: ...FAIL/ERROR**: Status based on thresholds - Actual Reasons: DeepEval provides LLM-...

(QB_NEW_EN)


[grammar] ~127-~127: There might be a mistake here.
Context: .../max for every metric ## 🧪 Development ### Development Tools ```bash uv sync --grou...

(QB_NEW_EN)


[grammar] ~140-~140: There might be a mistake here.
Context: ...tests --cov=src ``` ## Agent Evaluation For a detailed walkthrough of the new ag...

(QB_NEW_EN)


[grammar] ~141-~141: There might be a mistake here.
Context: ...he new agent-evaluation framework, refer [lsc_agent_eval/README.md](lsc_agent_eva...

(QB_NEW_EN)


[grammar] ~147-~147: There might be a mistake here.
Context: ...nswers.md) ## 📄 License & Contributing This project is licensed under the Apach...

(QB_NEW_EN)


[grammar] ~151-~151: There might be a mistake here.
Context: ...e for details. Contributions welcome - see development setup above for code qualit...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
archive/README.md

13-13: Unordered list style
Expected: dash; Actual: plus

(MD004, ul-style)


13-13: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


24-24: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


35-35: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


88-88: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4

(MD001, heading-increment)

🔇 Additional comments (32)
src/lightspeed_evaluation/core/config/models.py (1)

1-1: LGTM on branding docstring update.

No functional changes; consistent with package rename.

config/evaluation_data.yaml (1)

1-1: Header rename looks good.

Matches the new LightSpeed branding.

config/system.yaml (1)

1-1: Header rename acknowledged.

No semantic changes; config remains intact.

lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1)

102-105: LGTM: formatting-only change.

No behavioral impact; safe.

src/lightspeed_evaluation/drivers/evaluation.py (6)

2-2: LGTM!

The comment accurately reflects the renamed class from EvaluationEngine to EvaluationDriver.


10-21: LGTM!

The import restructuring to the new ..core module structure is correct and aligns with the broader repository refactor to consolidate the core functionality.


118-118: LGTM!

The class rename from EvaluationEngine to EvaluationDriver is consistent with the documentation updates.


120-120: LGTM!

The docstring accurately reflects the renamed class.


130-130: LGTM!

The docstring is updated to reference "evaluation driver" instead of "evaluation engine".


140-140: LGTM!

The initialization message correctly refers to "Evaluation Driver" now.

archive/README.md (1)

1-5: Fix typo in the documentation title.

Line 1 has "Evaluation Evaluation" which appears to be a duplication error.

Apply this diff to fix the typo:

-# Lightspeed Core Evaluation
-Evaluation tooling for lightspeed-core project. [Refer latest README.md](../README.md).
+# Lightspeed Core Evaluation
+Evaluation tooling for lightspeed-core project. [Refer latest README.md](../README.md).
⛔ Skipped due to learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
archive/pyproject.toml (1)

1-76: LGTM!

The archived pyproject.toml correctly preserves the legacy configuration with:

  • Original package name lightspeed-core-evaluation
  • Legacy entry points pointing to lightspeed_core_evaluation module
  • Complete dependency list including PyTorch CPU variant configuration
    This aligns with the PR objective to archive the old evaluation logic.
src/lightspeed_evaluation/core/output/statistics.py (1)

7-7: LGTM: Import path updated for module reorganization.

The import correctly references the new location of models in ..config.models, aligning with the package restructuring.

src/lightspeed_evaluation/core/config/validator.py (1)

7-7: LGTM: Import path updated for config reorganization.

The import now correctly references .loader instead of .config_loader, aligning with the module restructuring.

src/lightspeed_evaluation/core/config/loader.py (2)

1-1: LGTM: Module docstring updated for consistency.

The docstring correctly reflects the new "Evaluation Framework" branding, removing the "LSC" acronym.


100-100: LGTM: Logger name updated for consistent branding.

The logger name has been updated from "lsc_eval" to "lightspeed_evaluation", ensuring consistent logging namespace throughout the application.

src/lightspeed_evaluation/core/metrics/custom.py (2)

9-11: Import path updates look consistent with the package re-org.

These align with TurnData/LLMManager/EvaluationScope’s new locations.


73-81: Remove retry parameter verification.

The call correctly uses num_retries, which matches the official LiteLLM completion API; no change required.
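For context, num_retries tells LiteLLM to retry transient failures internally before giving up. A generic stand-in for that behavior (illustrative only, not the LiteLLM implementation) looks like:

```python
import time


def call_with_retries(fn, num_retries=3, backoff=0.0):
    """Call fn(); on failure retry up to num_retries more times, re-raising the last error."""
    last_exc = None
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            if attempt < num_retries:
                time.sleep(backoff * (2**attempt))  # exponential backoff between tries
    raise last_exc


attempts = []


def flaky():
    """Fail twice, then succeed - stands in for a transient LLM/network call."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TimeoutError("transient error")
    return "ok"


result = call_with_retries(flaky, num_retries=3)
print(result, len(attempts))  # succeeds on the third attempt
```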

src/lightspeed_evaluation/drivers/__init__.py (1)

3-5: Re-export looks good and stabilizes the public API.

src/lightspeed_evaluation/core/output/__init__.py (1)

3-3: Updated import path for OutputHandler is correct.

Matches the generator relocation.

src/lightspeed_evaluation/core/metrics/deepeval.py (1)

13-16: Import path updates are consistent with the new module layout.

src/lightspeed_evaluation/core/output/visualization.py (2)

13-15: Import path changes look correct and align with stats/model moves.


32-32: Logger namespace rename is appropriate.

src/lightspeed_evaluation/runner/__init__.py (1)

3-5: Re-exports look good

Public surface is clear: main and run_evaluation exposed via __all__. No issues.

src/lightspeed_evaluation/core/metrics/ragas.py (1)

16-19: Import path updates are correct

Imports align with the new module structure. No objections.

src/lightspeed_evaluation/core/metrics/__init__.py (1)

3-7: Package exports are clean

Centralized re-exports and __all__ look good for the new public surface.

src/lightspeed_evaluation/runner/evaluation.py (1)

71-87: Return summary already computed—looks good

Using calculate_basic_stats for the CLI exit code and printing is consistent and simple.

src/lightspeed_evaluation/core/config/__init__.py (1)

3-15: Re-exports look consistent and minimal.

Clean aggregation; symbols map correctly to loader/models/validator. No issues.

README.md (2)

41-49: Verify metric identifiers match implementation.

Confirm that the listed Ragas metric keys are the exact strings your resolver expects (e.g., context_precision_without_reference vs any library-expected canonical names). Otherwise, users will hit “unknown metric” errors at runtime.


140-143: Check relative link path.

Ensure lsc_agent_eval/README.md exists at repo root after this reorg; otherwise, the link will 404.

pyproject.toml (2)

45-46: Ensure entry points exist post-reorg.

Verify that lightspeed_evaluation.runner.evaluation:main and generate_answers.generate_answers:main exist and are importable in the final package layout.


65-66: LGTM: package path aligns with new src layout.

Wheel packaging target correctly points to src/lightspeed_evaluation.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)

62-78: Harden exception handling to match Ragas’ robustness.

LLM/network errors (timeouts, broken pipe) will bubble up and abort evaluation. Mirror Ragas’ handling to return (None, reason) instead of raising.

Apply:

     try:
       return self.supported_metrics[metric_name](
           conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
       )
-    except (ValueError, AttributeError, KeyError) as e:
+    except BrokenPipeError as e:
+      return None, f"DeepEval {metric_name} failed due to broken pipe: {str(e)}"
+    except OSError as e:
+      if getattr(e, "errno", None) == 32:
+        return None, f"DeepEval {metric_name} failed due to broken pipe (timeout): {str(e)}"
+      return None, f"DeepEval {metric_name} evaluation failed: {str(e)}"
+    except (RuntimeError, ValueError, AttributeError, KeyError, TypeError, ImportError) as e:
       return None, f"DeepEval {metric_name} evaluation failed: {str(e)}"
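The intent of the suggested handler can be isolated as a small classification function (a sketch; the real code would return these tuples from inside the except clauses):

```python
def classify_failure(metric_name: str, exc: Exception) -> tuple[None, str]:
    """Map an exception to a (score, reason) pair instead of letting it propagate."""
    if isinstance(exc, BrokenPipeError):
        return None, f"DeepEval {metric_name} failed due to broken pipe: {exc}"
    if isinstance(exc, OSError) and getattr(exc, "errno", None) == 32:
        return None, f"DeepEval {metric_name} failed due to broken pipe (timeout): {exc}"
    return None, f"DeepEval {metric_name} evaluation failed: {exc}"


score, reason = classify_failure("conversation_completeness", BrokenPipeError("pipe closed"))
print(score, reason)
```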
♻️ Duplicate comments (1)
src/lightspeed_evaluation/runner/evaluation.py (1)

97-99: Re: prior suggestion to make --eval-data required.

You addressed the crash risk differently by providing a default and doing existence checks, which is fine. No change needed. If you still prefer enforcing explicit input, flip to required=True and drop the default.

Example:

-    parser.add_argument(
-        "--eval-data",
-        default="config/evaluation_data.yaml",
-        help="Path to evaluation data file (default: config/evaluation_data.yaml)",
-    )
+    parser.add_argument(
+        "--eval-data",
+        required=True,
+        help="Path to evaluation data file",
+    )
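For clarity on the trade-off, required=True turns a silent fallback into an explicit usage error; a minimal standalone sketch:

```python
import argparse

parser = argparse.ArgumentParser(prog="lightspeed-eval")
parser.add_argument("--eval-data", required=True, help="Path to evaluation data file")

# With the flag supplied, parsing succeeds as usual:
args = parser.parse_args(["--eval-data", "config/evaluation_data.yaml"])
print(args.eval_data)

# Without it, argparse prints a usage error and exits with status 2:
try:
    parser.parse_args([])
except SystemExit as exc:
    exit_code = exc.code
print(exit_code)
```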

Also, because env vars are loaded from system config inside main(), confirm no modules read env at import time.

#!/bin/bash
# Grep for top-level env reads; review any matches outside defs/classes.
rg -nP '(?m)^\s*(os\.getenv\(|os\.environ\[)' -g 'src/**/*.py' -C2
🧹 Nitpick comments (14)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)

90-96: Guard against empty conversations for completeness metric.

Avoid calling DeepEval with zero turns.

   def _evaluate_conversation_completeness(
@@
-    test_case = self._build_conversational_test_case(conv_data)
+    if not getattr(conv_data, "turns", None):
+      return None, "No conversation turns available for completeness evaluation"
+    test_case = self._build_conversational_test_case(conv_data)
archive/README.md (5)

13-13: Fix nested list marker to satisfy markdownlint (MD004/MD007).

-    + if `pdm` is not installed this will install `pdm` by running `pip install pdm` in your current Python environment.
+  - If `pdm` is not installed this will install `pdm` by running `pip install pdm` in your current Python environment.

32-32: Correct spelling: “parimarily” → “primarily”.

-These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is parimarily used to gate PRs.
+These evaluations are also part of **e2e test cases**. Currently *consistency* evaluation is primarily used to gate PRs.

73-78: Tighten grammar and number agreement.

-### Outputs
-Evaluation scripts creates below files.
-- CSV file with response for given provider/model & modes.
-- response evaluation result with scores (for consistency check).
-- Final csv file with all results, json score summary & graph (for model evaluation)
+### Outputs
+Evaluation scripts create the following files:
+- CSV file with responses for the given provider/model & modes.
+- Response evaluation result with scores (for consistency check).
+- Final CSV file with all results, JSON score summary, and graph (for model evaluation).

83-86: Specify language for fenced code block (MD040).

-```
+```bash
 python -m lightspeed_core_evaluation.evaluation.query_rag

---

88-88: Fix heading increment (MD001).

-#### Arguments
+## Arguments
README.md (1)

90-101: Align example: include response_relevancy in turn_metrics or adjust metadata.

The YAML shows metadata for "ragas:response_relevancy" but it isn’t listed under turn_metrics, which can confuse users. Add it to turn_metrics for consistency.

   # Turn-level metrics (empty list = skip turn evaluation)
   turn_metrics:
     - "ragas:faithfulness"
+    - "ragas:response_relevancy"
     - "custom:answer_correctness"
 
   # Turn-level metrics metadata (threshold + other properties)
   turn_metrics_metadata:
     "ragas:response_relevancy": 
       threshold: 0.8
       weight: 1.0
     "custom:answer_correctness": 
       threshold: 0.75
pyproject.toml (2)

8-8: Use standard license metadata.

Prefer SPDX identifier or include the license file for clarity in package metadata.

-license = {text = "Apache"}
+license = {text = "Apache-2.0"}
+# Alternatively:
+# license = {file = "LICENSE"}

22-22: Loosen torch version constraint and document install extras.

Torch 2.7.0 is available on PyPI, but pinning to an exact patch release may force users to manually update for security/bug fixes and can conflict with platform-specific wheels. Change to a compatible range, for example:

- torch==2.7.0
+ torch>=2.7,<3.0

and update the README with instructions for installing the appropriate CPU/GPU variants.

src/lightspeed_evaluation/core/output/visualization.py (3)

144-176: Fix axes and apply configured figsize/dpi for consistency.

Bars are vertical (metrics on x, pass rates on y), but labels are swapped. Also, honor self.figsize/self.dpi.

-        _, ax = plt.subplots(figsize=(12, 8))
+        _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
-        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
-        ax.set_xlabel("Pass Rate (%)", fontsize=12)
-        ax.set_ylabel("Metrics", fontsize=12)
+        ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12)
+        ax.set_ylabel("Pass Rate (%)", fontsize=12)
@@
-        plt.savefig(filename, dpi=300, bbox_inches="tight")
+        plt.savefig(filename, dpi=self.dpi, bbox_inches="tight")

420-422: Correct heatmap axis label.

X-axis shows metrics, not “Pass Rate (%)”.

-        ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
+        ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")

228-241: Optional: avoid filling NaNs with zeros in boxplot.

Filling with 0 biases distributions. Use per-metric arrays with NaNs dropped.

-        bplot = ax.boxplot(
-            results_df.fillna(0),
-            sym=".",
-            widths=0.5,
-            vert=False,
-            patch_artist=True,
-        )
-
-        labels = results_df.columns
+        labels = list(results_df.columns)
+        data = [results_df[col].dropna().values for col in labels]
+        bplot = ax.boxplot(
+            data,
+            sym=".",
+            widths=0.5,
+            vert=False,
+            patch_artist=True,
+            labels=labels,
+        )
src/lightspeed_evaluation/runner/evaluation.py (2)

29-29: Optional: route user-facing prints through logging.

Since logging is configured from system.yaml, consider using a module logger (e.g., logging.getLogger(name)) instead of print for consistency and level control.


50-52: Avoid double validation of evaluation data (minor).

Data is validated in DataValidator.load_evaluation_data here and again inside EvaluationDriver.run_evaluation. Consider de-duplicating to reduce overhead (e.g., let the driver handle validation exclusively or add a non-validating loader).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between aa729db and eab0836.

⛔ Files ignored due to path filters (9)
  • archive/assets/response_eval_flow.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-answer_relevancy.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-answer_similarity_llm.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-cos_score.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_f1.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_precision.png is excluded by !**/*.png
  • archive/example_result/model_evaluation_result-rougeL_recall.png is excluded by !**/*.png
  • lsc_eval/uv.lock is excluded by !**/*.lock
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .gitignore (1 hunks)
  • README.md (1 hunks)
  • archive/README.md (1 hunks)
  • archive/example_result/README.md (1 hunks)
  • archive/pyproject.toml (1 hunks)
  • config/evaluation_data.yaml (1 hunks)
  • config/system.yaml (1 hunks)
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py (1 hunks)
  • lsc_eval/README.md (0 hunks)
  • lsc_eval/pyproject.toml (0 hunks)
  • lsc_eval/src/lsc_eval/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/core/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py (0 hunks)
  • lsc_eval/src/lsc_eval/metrics/__init__.py (0 hunks)
  • pyproject.toml (3 hunks)
  • src/lightspeed_evaluation/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/loader.py (2 hunks)
  • src/lightspeed_evaluation/core/config/models.py (1 hunks)
  • src/lightspeed_evaluation/core/config/validator.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
  • src/lightspeed_evaluation/core/output/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
  • src/lightspeed_evaluation/core/output/visualization.py (2 hunks)
  • src/lightspeed_evaluation/drivers/__init__.py (1 hunks)
  • src/lightspeed_evaluation/drivers/evaluation.py (4 hunks)
  • src/lightspeed_evaluation/runner/__init__.py (1 hunks)
  • src/lightspeed_evaluation/runner/evaluation.py (4 hunks)
💤 Files with no reviewable changes (6)
  • lsc_eval/pyproject.toml
  • lsc_eval/src/lsc_eval/core/__init__.py
  • lsc_eval/src/lsc_eval/metrics/__init__.py
  • lsc_eval/README.md
  • lsc_eval/src/lsc_eval/__init__.py
  • lsc_eval/src/lsc_eval/llm_managers/__init__.py
✅ Files skipped from review due to trivial changes (2)
  • config/system.yaml
  • lsc_agent_eval/src/lsc_agent_eval/core/utils/judge.py
🚧 Files skipped from review as they are similar to previous changes (15)
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/config/validator.py
  • src/lightspeed_evaluation/core/metrics/ragas.py
  • src/lightspeed_evaluation/drivers/__init__.py
  • config/evaluation_data.yaml
  • src/lightspeed_evaluation/core/config/loader.py
  • src/lightspeed_evaluation/core/config/models.py
  • archive/pyproject.toml
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/__init__.py
  • src/lightspeed_evaluation/core/config/__init__.py
  • src/lightspeed_evaluation/drivers/evaluation.py
  • src/lightspeed_evaluation/__init__.py
  • .gitignore
  • src/lightspeed_evaluation/core/metrics/custom.py
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
📚 Learning: 2025-08-26T11:17:48.640Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.

Applied to files:

  • pyproject.toml
  • archive/README.md
🧬 Code graph analysis (7)
src/lightspeed_evaluation/core/metrics/deepeval.py (4)
src/lightspeed_evaluation/core/config/models.py (1)
  • TurnData (8-44)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/manager.py (1)
  • LLMManager (36-173)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • EvaluationScope (11-16)
src/lightspeed_evaluation/core/metrics/__init__.py (3)
src/lightspeed_evaluation/core/metrics/custom.py (1)
  • CustomMetrics (29-251)
src/lightspeed_evaluation/core/metrics/deepeval.py (1)
  • DeepEvalMetrics (19-138)
src/lightspeed_evaluation/core/metrics/ragas.py (1)
  • RagasMetrics (23-263)
src/lightspeed_evaluation/core/output/__init__.py (1)
src/lightspeed_evaluation/core/output/generator.py (1)
  • OutputHandler (15-244)
src/lightspeed_evaluation/core/output/visualization.py (2)
src/lightspeed_evaluation/core/config/models.py (1)
  • EvaluationResult (133-169)
src/lightspeed_evaluation/core/output/statistics.py (2)
  • calculate_basic_stats (19-45)
  • calculate_detailed_stats (48-68)
src/lightspeed_evaluation/core/llm/__init__.py (4)
src/lightspeed_evaluation/core/llm/manager.py (3)
  • LLMManager (36-173)
  • LLMConfig (13-33)
  • LLMError (8-9)
src/lightspeed_evaluation/core/config/models.py (1)
  • LLMConfig (172-195)
src/lightspeed_evaluation/core/llm/deepeval.py (1)
  • DeepEvalLLMManager (8-43)
src/lightspeed_evaluation/core/llm/ragas.py (1)
  • RagasLLMManager (84-112)
src/lightspeed_evaluation/runner/evaluation.py (5)
src/lightspeed_evaluation/core/config/loader.py (2)
  • ConfigLoader (193-275)
  • setup_environment_variables (30-47)
src/lightspeed_evaluation/core/config/validator.py (1)
  • DataValidator (11-82)
src/lightspeed_evaluation/core/output/generator.py (1)
  • OutputHandler (15-244)
src/lightspeed_evaluation/core/output/statistics.py (1)
  • calculate_basic_stats (19-45)
src/lightspeed_evaluation/drivers/evaluation.py (2)
  • EvaluationDriver (118-322)
  • run_evaluation (146-172)
src/lightspeed_evaluation/runner/__init__.py (1)
src/lightspeed_evaluation/runner/evaluation.py (2)
  • main (95-129)
  • run_evaluation (15-92)
🪛 LanguageTool
README.md

[grammar] ~5-~5: There might be a mistake here.
Context: ... GenAI applications. ## 🎯 Key Features - Multi-Framework Support: Seamlessly us...

(QB_NEW_EN)


[grammar] ~16-~16: There might be a mistake here.
Context: ... integration planned) ## 🚀 Quick Start ### Installation ```bash # From Git pip ins...

(QB_NEW_EN)


[grammar] ~115-~115: There might be a mistake here.
Context: ...t...." ``` ## 📈 Output & Visualization ### Generated Reports - CSV: Detailed re...

(QB_NEW_EN)


[grammar] ~117-~117: There might be a mistake here.
Context: ...t & Visualization ### Generated Reports - CSV: Detailed results with status, sco...

(QB_NEW_EN)


[grammar] ~118-~118: There might be a mistake here.
Context: ...led results with status, scores, reasons - JSON: Summary statistics with score di...

(QB_NEW_EN)


[grammar] ~119-~119: There might be a mistake here.
Context: ...mary statistics with score distributions - TXT: Human-readable summary - PNG:...

(QB_NEW_EN)


[grammar] ~120-~120: There might be a mistake here.
Context: ...utions - TXT: Human-readable summary - PNG: 4 visualization types (pass rates...

(QB_NEW_EN)


[grammar] ~123-~123: There might be a mistake here.
Context: ...us breakdown) ### Key Metrics in Output - PASS/FAIL/ERROR: Status based on thres...

(QB_NEW_EN)


[grammar] ~124-~124: There might be a mistake here.
Context: ...FAIL/ERROR**: Status based on thresholds - Actual Reasons: DeepEval provides LLM-...

(QB_NEW_EN)


[grammar] ~128-~128: There might be a mistake here.
Context: .../max for every metric ## 🧪 Development ### Development Tools ```bash uv sync --grou...

(QB_NEW_EN)


[grammar] ~141-~141: There might be a mistake here.
Context: ...tests --cov=src ``` ## Agent Evaluation For a detailed walkthrough of the new ag...

(QB_NEW_EN)


[grammar] ~142-~142: There might be a mistake here.
Context: ...he new agent-evaluation framework, refer [lsc_agent_eval/README.md](lsc_agent_eva...

(QB_NEW_EN)


[grammar] ~148-~148: There might be a mistake here.
Context: ...nswers.md) ## 📄 License & Contributing This project is licensed under the Apach...

(QB_NEW_EN)


[grammar] ~152-~152: There might be a mistake here.
Context: ...e for details. Contributions welcome - see development setup above for code qualit...

(QB_NEW_EN)

archive/README.md

[grammar] ~1-~1: There might be a mistake here.
Context: # Lightspeed Core Evaluation Evaluation tooling for lightspeed-core p...

(QB_NEW_EN)


[grammar] ~2-~2: There might be a mistake here.
Context: ...peed Core Evaluation Evaluation tooling for lightspeed-core project. [Refer latest ...

(QB_NEW_EN)


[grammar] ~6-~6: There might be a mistake here.
Context: ...t maintained anymore.** ## Installation - Requires Python 3.11 - Install pdm -...

(QB_NEW_EN)


[grammar] ~10-~10: There might be a mistake here.
Context: ... a clean venv for Python 3.11 and pdm. - Run pdm install - Optional: For develo...

(QB_NEW_EN)


[grammar] ~11-~11: There might be a mistake here.
Context: ...n venv for Python 3.11 and pdm. - Run pdm install - Optional: For development, run `make ins...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...ion of similarity distances are used to calculate final score. Cut-off scores are used to...

(QB_NEW_EN)


[grammar] ~18-~18: There might be a mistake here.
Context: ...eviations. This also stores a .csv file with query, pre-defined answer, API response...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: .... model: Ability to compare responses against single ground-truth answer. Here we can...

(QB_NEW_EN)


[grammar] ~20-~20: There might be a mistake here.
Context: ...del at a time. This creates a json file as summary report with scores (f1-score) f...

(QB_NEW_EN)


[grammar] ~26-~26: There might be a mistake here.
Context: ...modified or removed, please create a PR. - OLS API should be ready/live with all th...

(QB_NEW_EN)


[grammar] ~27-~27: There might be a mistake here.
Context: ... the required provider+model configured. - It is possible that we want to run both ...

(QB_NEW_EN)


[style] ~28-~28: For conciseness, try rephrasing this sentence.
Context: ...e required provider+model configured. - It is possible that we want to run both consistency and model evalu...

(MAY_MIGHT_BE)


[grammar] ~28-~28: There might be a mistake here.
Context: ...n together. To avoid multiple API calls for same query, model evaluation first ch...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ... generated by consistency evaluation. If response is not present in csv file, th...

(QB_NEW_EN)


[grammar] ~28-~28: There might be a mistake here.
Context: ...s not present in csv file, then only we call API to get the response. ### e2e test ...

(QB_NEW_EN)


[grammar] ~32-~32: Ensure spelling is correct
Context: .... Currently consistency evaluation is parimarily used to gate PRs. Final e2e suite will ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~46-~46: There might be a mistake here.
Context: ... add new data accordingly. ## Arguments eval_type: This will control which eva...

(QB_NEW_EN)


[grammar] ~49-~49: There might be a mistake here.
Context: ...nAs provided in json file 2. model -> Compares set of models based on their response a...

(QB_NEW_EN)


[grammar] ~52-~52: There might be a mistake here.
Context: ...t:8080`. If deployed in a cluster, then pass cluster API url. **eval_api_token_file...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...l_api_token_file**: Path to a text file containing OLS API token. Required, if OLS is depl...

(QB_NEW_EN)


[grammar] ~54-~54: There might be a mistake here.
Context: ...API token. Required, if OLS is deployed in cluster. eval_scenario: This is pr...

(QB_NEW_EN)


[grammar] ~58-~58: There might be a mistake here.
Context: ...ith rag. eval_query_ids: Option to give set of query ids for evaluation. By def...

(QB_NEW_EN)


[grammar] ~60-~60: There might be a mistake here.
Context: ...ed. eval_provider_model_id: We can provide set of provider/model combinations as i...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...Applicable only for model evaluation. Provide file path to the parquet file having ad...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: .../rcsconfig.yaml) eval_modes: Apart from OLS api, we may want to evaluate vanill...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...s**: Apart from OLS api, we may want to evaluate vanilla model or with just OLS paramete...

(QB_NEW_EN)


[grammar] ~71-~71: There might be a mistake here.
Context: ...LS parameters/prompt/RAG so that we can have baseline score. This is a list of modes...

(QB_NEW_EN)


[grammar] ~73-~73: There might be a mistake here.
Context: ...ls_rag, & ols (actual api). ### Outputs Evaluation scripts creates below files. ...

(QB_NEW_EN)


[grammar] ~74-~74: There might be a mistake here.
Context: ... Evaluation scripts creates below files. - CSV file with response for given provide...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...ate a .csv file having retrieved chunks for given set of queries with similarity sc...

(QB_NEW_EN)


[grammar] ~86-~86: There might be a mistake here.
Context: ...with similarity score. This is not part of actual evaluation. But useful to do a s...

(QB_NEW_EN)


[grammar] ~88-~88: There might be a mistake here.
Context: ...viation in the response) #### Arguments db-path: Path to the RAG index *produc...

(QB_NEW_EN)

archive/example_result/README.md

[grammar] ~14-~14: There might be a mistake here.
Context: ... llama-3-1-8b-instruct - QnA evaluation dataset: [QnAs from OCP doc](../../eval_data/ocp...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
archive/README.md

13-13: Unordered list style
Expected: dash; Actual: plus

(MD004, ul-style)


13-13: Unordered list indentation
Expected: 2; Actual: 4

(MD007, ul-indent)


24-24: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


83-83: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


88-88: Heading levels should only increment by one level at a time
Expected: h2; Actual: h4

(MD001, heading-increment)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (12)
archive/example_result/README.md (1)

14-14: Link verification passed: eval_data/ocp_doc_qna-edited.parquet exists at the repo root.

src/lightspeed_evaluation/core/metrics/deepeval.py (1)

13-16: Import path updates look correct.

src/lightspeed_evaluation/core/metrics/__init__.py (1)

1-7: Clean public re-exports. LGTM.

README.md (1)

141-146: Verified link target exists: lsc_agent_eval/README.md is present in the repo.

pyproject.toml (1)

29-29: Tenacity>=9.1.2 is valid. Version 9.1.2 is available on PyPI; no change needed.

src/lightspeed_evaluation/core/output/__init__.py (2)

3-6: LGTM: clean re-export path.

Exporting OutputHandler from .generator matches the new layout.


3-3: No unterminated f-string or parenthesis in generator.py
The “Fail” lines in _generate_text_summary already have matching quotes and closing parentheses (e.g., lines 202–204), so no syntax fix is required.

Likely an incorrect or invalid review comment.

src/lightspeed_evaluation/runner/__init__.py (1)

1-5: LGTM: simple, clear re-exports for CLI.

Keeps public API stable.

src/lightspeed_evaluation/runner/evaluation.py (4)

1-1: Branding/docstring update looks good.

Clear, accurate module purpose.


19-19: Docstring now reflects EvaluationDriver.

Accurate and helpful.


46-48: Driver initialization aligns with new architecture.

Instantiation with ConfigLoader is consistent with drivers.evaluation.


9-12: Re-exports confirmed; imports are valid. No action required.

Copy link
Contributor

@lpiwowar lpiwowar left a comment


This is a step in the right direction :), IMO 👍. I'm just curious:

Why don't we just remove the code that is not going to be used? 👀 I mean, the code in the current src folder, which is being moved to the archive folder. It's not like the code is going to be lost if we remove it. If one needs it, they can run git revert.
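The recoverability point above can be demonstrated in a throwaway repo. The sketch below restores a deleted tree with `git checkout <commit> -- <path>`; `git revert`, as mentioned in the comment, is the commit-level variant that adds a new commit undoing the removal. Paths and file contents here are illustrative, not the project's actual layout.

```shell
# Self-contained demo in a temp repo: deleted code stays recoverable from history.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
mkdir lsc_eval
echo "legacy eval code" > lsc_eval/runner.py
git add -A
git commit -qm "add legacy eval tool"
git rm -q -r lsc_eval
git commit -qm "remove legacy eval tool"
# Restore the removed tree from the commit before the removal
# (git revert HEAD would instead create a new commit undoing the deletion).
git checkout HEAD^ -- lsc_eval/
cat lsc_eval/runner.py
```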

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/lightspeed_evaluation/core/config/models.py (1)

184-191: Make validation errors field-specific.
Current message “Value cannot be empty” is ambiguous. Include the field name for faster debugging.

Apply this diff:

-@field_validator("provider", "model")
-@classmethod
-def _validate_non_empty(cls, v: str) -> str:
-    """Validate provider and model are non-empty strings."""
-    if not v or not isinstance(v, str) or not v.strip():
-        raise ValueError("Value cannot be empty")
-    return v.strip()
+@field_validator("provider", "model")
+@classmethod
+def _validate_non_empty(cls, v: str, info):  # pydantic v2: ValidationInfo
+    """Validate provider and model are non-empty strings."""
+    if not isinstance(v, str) or not v.strip():
+        raise ValueError(f"{info.field_name} cannot be empty")
+    return v.strip()

Add this import (outside the changed hunk) to type-hint the validator info if you prefer:

from pydantic import ValidationInfo  # optional typing
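As a runnable sketch of the suggested pattern (the model below is a stand-in with the same `provider`/`model` fields, not the project's actual `LLMConfig`):

```python
# Minimal illustration of a field-name-aware validator in Pydantic v2.
# The model is illustrative, not the project's actual LLMConfig.
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class LLMConfig(BaseModel):
    provider: str
    model: str

    @field_validator("provider", "model")
    @classmethod
    def _validate_non_empty(cls, v: str, info: ValidationInfo) -> str:
        """Reject empty/whitespace values, naming the offending field."""
        if not isinstance(v, str) or not v.strip():
            raise ValueError(f"{info.field_name} cannot be empty")
        return v.strip()


cfg = LLMConfig(provider=" openai ", model="gpt-4")
print(cfg.provider)  # whitespace stripped: "openai"

try:
    LLMConfig(provider="openai", model="   ")
except ValidationError as exc:
    print(exc.errors()[0]["msg"])  # "Value error, model cannot be empty"
```

With `info.field_name` in the message, the error now pinpoints whether `provider` or `model` was empty.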
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between eab0836 and ea4cda4.

📒 Files selected for processing (12)
  • src/lightspeed_evaluation/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/config/models.py (2 hunks)
  • src/lightspeed_evaluation/core/llm/__init__.py (1 hunks)
  • src/lightspeed_evaluation/core/llm/manager.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/custom.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/deepeval.py (1 hunks)
  • src/lightspeed_evaluation/core/metrics/ragas.py (1 hunks)
  • src/lightspeed_evaluation/core/output/generator.py (1 hunks)
  • src/lightspeed_evaluation/core/output/statistics.py (1 hunks)
  • src/lightspeed_evaluation/core/output/visualization.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (10)
  • src/lightspeed_evaluation/core/output/statistics.py
  • src/lightspeed_evaluation/core/output/visualization.py
  • src/lightspeed_evaluation/__init__.py
  • src/lightspeed_evaluation/core/__init__.py
  • src/lightspeed_evaluation/core/output/generator.py
  • src/lightspeed_evaluation/core/metrics/custom.py
  • src/lightspeed_evaluation/core/metrics/deepeval.py
  • src/lightspeed_evaluation/core/metrics/ragas.py
  • src/lightspeed_evaluation/core/config/__init__.py
  • src/lightspeed_evaluation/core/llm/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.640Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
🧬 Code graph analysis (1)
src/lightspeed_evaluation/core/llm/manager.py (1)
src/lightspeed_evaluation/core/config/models.py (1)
  • LLMConfig (172-202)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: mypy
🔇 Additional comments (3)
src/lightspeed_evaluation/core/config/models.py (2)

1-1: Docstring rename aligns with rebrand.
Good rename; no issues.


192-203: Remove backward-compat for model_name – not needed
A search for legacy model_name keys found matches only in the archive/ folder and in unrelated test suites; no production config uses model_name. The current from_dict defaults are explicit and safe.

Likely an incorrect or invalid review comment.

src/lightspeed_evaluation/core/llm/manager.py (1)

6-7: LLMConfig re-export verified
LLMConfig is imported from models and included in __init__.py’s exports, so from ..config import LLMConfig is stable.

Copy link
Contributor

@tisnik tisnik left a comment


I like this approach.

@asamal4
Copy link
Collaborator Author

asamal4 commented Sep 1, 2025

@lpiwowar

It's not like the code is going to be lost if we remove it. If one needs it, they can run git revert

You are right. But I just didn't want to delete it yet, for convenience. There are some additional features/use-cases (not usable as-is, though). Once I am done with all the current changes (maybe next week), I will go through the code and create the necessary stories for future enhancements. Then I will remove this.

@asamal4
Copy link
Collaborator Author

asamal4 commented Sep 1, 2025

@VladimirKadlec @Anxhela21 PTAL

Copy link
Contributor

@VladimirKadlec VladimirKadlec left a comment


As Lukas said, I'd remove the archive folder.

LGTM

@tisnik tisnik merged commit 2bbafdd into lightspeed-core:main Sep 1, 2025
15 checks passed
This was referenced Oct 6, 2025
