Generic eval tool #28
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 45
🧹 Nitpick comments (42)
.gitignore (2)
181-182: Remove redundant ignore: `lsc_eval/eval_output*/` is unnecessary

The existing pattern `eval_output*/` (line 180) already matches directories named `eval_output*` at any depth in the repo. The added `lsc_eval/eval_output*/` is redundant and can be dropped to avoid drift and duplication. Apply this minimal diff:

```diff
- lsc_eval/eval_output*/
```
183-184: Broaden the `.deepeval` ignore to any location, not just under `lsc_eval/`

DeepEval may create `.deepeval/` at the repo root (or elsewhere). Scoping the ignore to `lsc_eval/.deepeval/` could miss those artifacts. Prefer ignoring `.deepeval/` globally. Proposed change:

```diff
-# DeepEval telemetry and configuration
-lsc_eval/.deepeval/
+# DeepEval telemetry and configuration
+.deepeval/
```

If you explicitly want to scope it to any subdir (not only root), you could use `**/.deepeval/` instead.

lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
29-29: Use logging instead of print for library code

Printing in a library is noisy for CLI consumers; prefer `logging`.

```diff
- print(f"✅ DeepEval LLM Manager: {self.model_name}")
+ logger.info("DeepEval LLM Manager initialized for model %s", self.model_name)
```

31-33: Stringize or postpone annotations to avoid import-time typing issues

If DeepEval isn't installed, exported annotations referencing `LiteLLMModel` can still cause tooling churn. With the guard above, you're mostly fine; optionally stringize the return type to be extra-safe.

```diff
-    def get_llm(self) -> LiteLLMModel:
+    def get_llm(self) -> "LiteLLMModel":
         """Get the configured DeepEval LiteLLM model."""
         return self.llm_model
```

lsc_eval/README.md (5)
3-3: Tighten intro grammar

```diff
-A comprehensive framework/tool to evaluate GenAI application.
+A comprehensive tool to evaluate GenAI applications.
```

19-25: Unify install instructions with the repo org and mention optional deps

Ensure readers install the correct repo and, when using Ragas/DeepEval metrics, pull optional dependencies.

```diff
-# From Git
-pdm add git+https://github.com/your-org/lightspeed-evaluation.git#subdirectory=lsc_eval
-# or pip install git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# From Git (PDM)
+pdm add git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval
+# Or via pip
+# pip install "git+https://github.com/lightspeed-core/lightspeed-evaluation.git#subdirectory=lsc_eval"
+
+# Optional metric backends:
+# Ragas + LiteLLM
+# pdm add ragas litellm
+# DeepEval + LiteLLM
+# pdm add deepeval litellm
```

32-34: Prefer the CLI entrypoint if provided; keep runner.py as an alternative

The PR summary mentions a console script. Consider showcasing it first.

```diff
-# Run evaluation (Create your own data)
-python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Run evaluation (create your own data)
+# If installed as a package with console script:
+# lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+# Or directly via module:
+python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
```

121-125: Clarify "Actual Reasons" wording

Minor readability tweak.

```diff
-**Actual Reasons**: DeepEval provides LLM-generated explanations, Custom metrics provide detailed reasoning
+**Reasons/Explanations**: DeepEval provides LLM-generated explanations; custom metrics include detailed reasoning
```

126-133: Pin the dev tools order and add pyright/pylint notes

Not required, but it helps new contributors reproduce CI locally.

```diff
 pdm run black .
 pdm run ruff check .
 pdm run mypy .
 pdm run pyright .
 pdm run pylint .
+# Tip: if you don't use DeepEval/Ragas, you can still run linters;
+# optional imports in code are guarded to avoid hard failures.
```

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (3)
57-59: Use logging for errors instead of print

Keep libraries quiet unless explicitly configured.

```diff
-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.error("Ragas LLM failed: %s", e)
+            raise RuntimeError(f"Ragas LLM evaluation failed: {e}") from e
```

18-18: Use logging instead of print on initialization

```diff
- print(f"✅ Ragas Custom LLM: {self.model_name}")
+ logger.info("Ragas Custom LLM initialized for model %s", self.model_name)
```

91-95: Avoid global metric side effects or make them opt-in

Assigning to `answer_relevancy.llm` and `faithfulness.llm` mutates module-level singletons and can surprise callers if multiple managers exist. Prefer per-metric instances or make this opt-in.

```diff
-        # Configure Ragas metrics to use our custom LLM
-        answer_relevancy.llm = self.custom_llm
-        faithfulness.llm = self.custom_llm
-        print("✅ Ragas LLM Manager configured")
+        # Optional: configure default Ragas metrics to use our custom LLM
+        if self.litellm_params.get("inject_global_llm", True):
+            answer_relevancy.llm = self.custom_llm
+            faithfulness.llm = self.custom_llm
+            logger.info("Ragas LLM Manager configured (inject_global_llm=%s)", self.litellm_params.get("inject_global_llm", True))
```

If you keep global assignment, document it in README to set expectations.
lsc_eval/src/lsc_eval/metrics/__init__.py (1)
3-7: Consider lazy imports to avoid heavyweight import costs

If startup time matters for CLI runs that only use a subset of frameworks, switch to local (on-demand) imports inside the evaluation dispatch, or use a lightweight registry, to avoid importing `ragas`/`deepeval` unless needed.

lsc_eval/src/lsc_eval/llm_managers/__init__.py (1)
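The registry approach suggested above could be sketched roughly as follows. `LazyMetricRegistry` and the stdlib stand-in entries are illustrative assumptions, not part of the actual lsc_eval package:

```python
import importlib


class LazyMetricRegistry:
    """Resolve metric classes on first use instead of importing all frameworks up front."""

    def __init__(self, exports):
        # exports maps a public name to "module_path:attribute"
        self._exports = exports
        self._cache = {}

    def get(self, name):
        """Import the backing module only on first lookup; cache afterwards."""
        if name not in self._cache:
            module_path, _, attr = self._exports[name].partition(":")
            module = importlib.import_module(module_path)
            self._cache[name] = getattr(module, attr)
        return self._cache[name]


# Stdlib stand-ins for ragas/deepeval metric classes, purely for illustration.
registry = LazyMetricRegistry({
    "OrderedDict": "collections:OrderedDict",
    "Counter": "collections:Counter",
})
```

With real entries like `"RagasMetrics": "lsc_eval.metrics.ragas_metrics:RagasMetrics"`, the heavy `ragas` import would only run when that metric is actually requested.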
3-7: Re-export LLMConfig to avoid deep imports

Consumers often need the config type; re-exporting it keeps imports consistent with this package's public surface. Apply:

```diff
-from .llm_manager import LLMManager
+from .llm_manager import LLMManager, LLMConfig
@@
-__all__ = ["LLMManager", "RagasLLMManager", "DeepEvalLLMManager"]
+__all__ = ["LLMManager", "LLMConfig", "RagasLLMManager", "DeepEvalLLMManager"]
```

lsc_eval/pyproject.toml (1)
8-8: Use a valid SPDX license identifier

"Apache-2.0" is the canonical SPDX identifier and improves license detection.

```diff
-license = {text = "Apache"}
+license = {text = "Apache-2.0"}
```

lsc_eval/config/evaluation_data.yaml (1)
46-47: Add missing group description for conv_group_3 (consistency with other groups)

Earlier groups include `description`. If your model validation expects it, the absence here will fail validation. Add a description for consistency.

```diff
 - conversation_group_id: "conv_group_3"
-
+  description: "conversation group description"
```

lsc_eval/src/lsc_eval/core/__init__.py (3)
3-5: Public API is coherent; one naming collision to consider

Re-exporting core types is handy. However, exposing `LLMConfig` from `.models` can be confused with the different `LLMConfig` defined in `llm_managers/llm_manager.py`. Consider dropping `LLMConfig` from this module's public surface, or renaming one of the two, to avoid accidental imports.

7-17: Black reformat failure: run formatter

CI shows Black would reformat this file. Please run the project's formatter to unblock CI.

1-1: Black reformat failure: run formatter

CI flagged Black on this file too. After edits, run Black to satisfy CI.
lsc_eval/src/lsc_eval/output/visualization.py (2)
1-1: Black reformat failure: run formatter

CI flagged Black on this file. After applying the above changes, please run Black to normalize formatting.

204-213: Duplicate-code warning is acceptable; consider a helper later

Pylint reports duplicate lines vs. another module. Given this is a new, decoupled tool, we can defer, but extracting shared plotting helpers under `output/utils.py` would reduce drift long-term.

lsc_eval/runner.py (1)

1-1: Black reformat failure: run formatter

After the above edits, run Black to satisfy CI.
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (4)
51-61: Type hints for _evaluate_metric. Apply:

```diff
-    def _evaluate_metric(self, metric, test_case) -> Tuple[float, str]:
+    def _evaluate_metric(self, metric: Any, test_case: Any) -> Tuple[float, str]:
```

79-94: Type hints for _evaluate_conversation_completeness. Apply:

```diff
     def _evaluate_conversation_completeness(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
```

95-114: Type hints for _evaluate_conversation_relevancy. Apply:

```diff
     def _evaluate_conversation_relevancy(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
```

115-133: Type hints for _evaluate_knowledge_retention. Apply:

```diff
     def _evaluate_knowledge_retention(
         self,
-        conv_data,
+        conv_data: EvaluationData,
         _turn_idx: Optional[int],
         _turn_data: Optional[TurnData],
         is_conversation: bool,
     ) -> Tuple[Optional[float], str]:
```

lsc_eval/src/lsc_eval/core/data_validator.py (1)
35-36: Prefer structured logging over prints

Switch to the project's logging setup for consistent, filterable output (especially useful in CI). If logging is not wired here yet, keep prints for now and follow up later.
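A minimal sketch of what a module-level logger could look like here; `validate_turns` and its dict-based record shape are hypothetical stand-ins, not the actual DataValidator API:

```python
import logging

# One logger per module; handlers/levels are configured by the application, not here.
logger = logging.getLogger(__name__)


def validate_turns(turns):
    """Validate turn records, reporting problems via logging rather than print."""
    bad = [i for i, turn in enumerate(turns) if not turn.get("query")]
    if bad:
        # Lazy %-formatting avoids building the message when the level is filtered out.
        logger.warning("Validation failed for %d turn(s): %s", len(bad), bad)
    return not bad
```

Because the module only requests a logger, CI or library consumers can silence or redirect the output without touching this code.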
lsc_eval/src/lsc_eval/output/output_handler.py (1)
80-82: Remove stale comment about DataFrame

The CSV writer doesn't use a DataFrame. Removing this avoids confusion.

```diff
-        # Move to dataframe for better aggregation
         csv_file = self.output_dir / f"{base_filename}_detailed.csv"
```
66-72: Use logger.warning for the generic provider path; include a remediation hint

Using a generic provider silently can lead to confusing runtime errors when LiteLLM rejects the prefix. Emit a warning via logging and suggest valid providers.

```diff
-        print(f"⚠️ Using generic provider format for {provider}")
+        logger.warning("Using generic provider format for '%s'. Verify LiteLLM supports this prefix.", provider)
```

106-112: Demote noisy Ollama notice to debug and route through logger

Running locally without OLLAMA_HOST is normal; keep logs clean by using debug level.

```diff
-        if not os.environ.get("OLLAMA_HOST"):
-            print("ℹ️ OLLAMA_HOST not set, using default localhost:11434")
+        if not os.environ.get("OLLAMA_HOST"):
+            logger.debug("OLLAMA_HOST not set; defaulting to localhost:11434")
```
12-33: Naming collision with another LLMConfig; consolidate or rename

There's a second `LLMConfig` (Pydantic) in `core/models.py`. Two different `LLMConfig` types across modules will cause confusion and type mismatches in IDEs and reviews.
Options:
- Prefer a single canonical config model (e.g., the Pydantic one) and derive the LiteLLM params from it.
- Or rename this dataclass to LLMManagerConfig to prevent ambiguity.
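One possible shape for the first option, deriving the LiteLLM call parameters from a single config type. `UnifiedLLMConfig` is a plain dataclass standing in for the real Pydantic model; its field names and defaults are assumptions:

```python
from dataclasses import dataclass


@dataclass
class UnifiedLLMConfig:
    """Single source of truth for LLM settings; replaces the duplicated LLMConfig types."""

    provider: str
    model: str
    temperature: float = 0.0
    timeout: int = 300
    num_retries: int = 3

    def to_litellm_params(self) -> dict:
        """Derive the kwargs the manager passes to the completion call."""
        return {
            "model": f"{self.provider}/{self.model}",
            "temperature": self.temperature,
            "timeout": self.timeout,
            "num_retries": self.num_retries,
        }
```

The manager then accepts this one type and calls `to_litellm_params()`, so the dataclass in `llm_manager.py` and the Pydantic model in `core/models.py` cannot drift apart.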
148-156: Double-check LiteLLM parameter names: timeout/request_timeout, num_retries

LiteLLM's completion kwargs use specific names. Some versions use `request_timeout` instead of `timeout`. Validate and adjust to avoid silent no-ops. If needed:

```diff
-            "timeout": self.config.timeout,
+            "request_timeout": self.config.timeout,
```

Please verify against the LiteLLM version in your pyproject.
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
14-23: Type the contexts field precisely

Use `Optional[list[dict[str, str]]]` to align with `TurnData.contexts` and improve static checks.

```diff
-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[list[dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )
```

If Python <3.9 support is needed, use `typing.List`/`Dict` instead.

77-81: Harden choice/content extraction across providers

Some providers return dicts instead of objects. Add a safe fallback; raise if missing.

```diff
-        content = response.choices[0].message.content  # type: ignore
+        choice0 = response.choices[0]  # type: ignore[index]
+        msg = getattr(choice0, "message", getattr(choice0, "delta", None))
+        if isinstance(msg, dict):
+            content = msg.get("content")
+        else:
+            content = None if msg is None else getattr(msg, "content", None)
```

And keep the existing empty-response check.
1-41: Run Black; minor nits only

Black check is failing. After the above diffs, please run Black to settle whitespace/line-wrapping.

lsc_eval/src/lsc_eval/core/config_loader.py (3)
lsc_eval/src/lsc_eval/core/config_loader.py (3)
179-184: Add return type for init and prefer explicit None assignmentsMinor typing/readability polish.
- def __init__(self): + def __init__(self) -> None: """Initialize Config Loader.""" self.system_config = None self.evaluation_data = None self.logger = None
140-176: Duplicate “system config” models across modulesThis file defines SystemConfig while core/models.py defines EvaluationSystemConfig. Two overlapping system config models increase drift risk.
Consider keeping only one canonical system config model (e.g., in core/models.py) and have ConfigLoader populate it. If you keep both, add conversion helpers and tests to ensure parity.
1-1: Run Black to fix formattingCI shows Black would reformat this file.
lsc_eval/src/lsc_eval/core/models.py (4)
89-124: Validate metric requirements only for turn metrics; consider conversation-level tooToday only turn_metrics are checked for context/expected_response. If any conversation-level metrics require contexts/answers, this will miss them.
Add analogous checks for conversation_metrics if needed by your metric set.
165-183: Name clash: LLMConfig here vs llm_managers.LLMConfigTwo different LLMConfig classes exist across modules. This is confusing for maintainers and tooling.
- Prefer a single shared LLMConfig (likely this Pydantic model) and have LLMManager accept it.
- If you need a lightweight runtime class for LLMManager, rename it to LLMManagerConfig.
- Add tests to ensure config serialization/deserialization remains consistent.
185-212: Duplication with SystemConfig in config_loaderEvaluationSystemConfig overlaps with SystemConfig. Consolidate to one to avoid divergence in future changes.
1-1: Run Black to fix formattingBlack check is failing on this file as well.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
- .gitignore (1 hunks)
- Makefile (3 hunks)
- lsc_eval/README.md (1 hunks)
- lsc_eval/config/evaluation_data.yaml (1 hunks)
- lsc_eval/config/system.yaml (1 hunks)
- lsc_eval/pyproject.toml (1 hunks)
- lsc_eval/runner.py (1 hunks)
- lsc_eval/src/lsc_eval/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/core/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
- lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
- lsc_eval/src/lsc_eval/core/models.py (1 hunks)
- lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
- lsc_eval/src/lsc_eval/llm_managers/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1 hunks)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (1 hunks)
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/output/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
- lsc_eval/src/lsc_eval/output/utils.py (1 hunks)
- lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
lsc_eval/runner.py
🧬 Code graph analysis (19)
lsc_eval/src/lsc_eval/output/__init__.py (2)
- lsc_eval/src/lsc_eval/output/output_handler.py: OutputHandler (15-224)
- lsc_eval/src/lsc_eval/output/visualization.py: GraphGenerator (17-412)

lsc_eval/src/lsc_eval/output/utils.py (1)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationResult (126-162), TurnData (8-44)

lsc_eval/src/lsc_eval/metrics/__init__.py (3)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py: CustomMetrics (25-242)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py: DeepEvalMetrics (19-138)
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py: RagasMetrics (23-247)

lsc_eval/src/lsc_eval/core/__init__.py (4)
- lsc_eval/src/lsc_eval/core/config_loader.py: ConfigLoader (176-256), SystemConfig (140-173), setup_environment_variables (27-44), validate_metrics (123-137)
- lsc_eval/src/lsc_eval/core/models.py: validate_metrics (82-87), EvaluationData (47-123), EvaluationResult (126-162), LLMConfig (165-182), TurnData (8-44)
- lsc_eval/src/lsc_eval/core/data_validator.py: DataValidator (11-82)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMConfig (13-33)

lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
- lsc_eval/src/lsc_eval/core/models.py: LLMConfig (165-182)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py: from_system_config (239-242)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py: from_system_config (135-138)
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py: from_system_config (244-247)

lsc_eval/src/lsc_eval/output/output_handler.py (3)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationResult (126-162)
- lsc_eval/src/lsc_eval/output/utils.py: calculate_basic_stats (19-45), calculate_detailed_stats (48-68)
- lsc_eval/src/lsc_eval/output/visualization.py: GraphGenerator (17-412), generate_all_graphs (53-103)

lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (6)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py: evaluate (42-57), from_system_config (239-242)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py: evaluate (62-77), from_system_config (135-138)
- lsc_eval/src/lsc_eval/core/models.py: TurnData (8-44)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167), get_model_name (144-146), get_litellm_params (148-156), from_dict (24-33), from_system_config (163-167)
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py: RagasLLMManager (78-106), get_llm (97-99)
- lsc_eval/src/lsc_eval/output/utils.py: EvaluationScope (11-16)

lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py: get_llm (97-99), get_model_info (101-106)

lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (3)
- lsc_eval/src/lsc_eval/core/models.py: TurnData (8-44)
- lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py: DeepEvalLLMManager (8-43), get_llm (31-33)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167), get_model_name (144-146), get_litellm_params (148-156), from_system_config (163-167)

lsc_eval/src/lsc_eval/output/visualization.py (2)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationResult (126-162)
- lsc_eval/src/lsc_eval/output/utils.py: calculate_basic_stats (19-45), calculate_detailed_stats (48-68)

lsc_eval/src/lsc_eval/llm_managers/__init__.py (3)
- lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py: DeepEvalLLMManager (8-43)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167)
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py: RagasLLMManager (78-106)

lsc_eval/src/lsc_eval/core/config_loader.py (1)
- lsc_eval/src/lsc_eval/core/models.py: validate_metrics (82-87)

lsc_eval/src/lsc_eval/metrics/custom_metrics.py (3)
- lsc_eval/src/lsc_eval/core/models.py: TurnData (8-44)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167), get_model_name (144-146), get_litellm_params (148-156), from_system_config (163-167)
- lsc_eval/src/lsc_eval/output/utils.py: EvaluationScope (11-16)

lsc_eval/src/lsc_eval/__init__.py (6)
- lsc_eval/src/lsc_eval/core/config_loader.py: ConfigLoader (176-256), SystemConfig (140-173)
- lsc_eval/src/lsc_eval/core/data_validator.py: DataValidator (11-82)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationData (47-123), EvaluationResult (126-162), TurnData (8-44)
- lsc_eval/src/lsc_eval/evaluation_engine.py: EvaluationEngine (104-294)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167)
- lsc_eval/src/lsc_eval/output/output_handler.py: OutputHandler (15-224)

lsc_eval/runner.py (5)
- lsc_eval/src/lsc_eval/core/config_loader.py: ConfigLoader (176-256), setup_environment_variables (27-44), load_system_config (185-240)
- lsc_eval/src/lsc_eval/core/data_validator.py: DataValidator (11-82), load_evaluation_data (19-37)
- lsc_eval/src/lsc_eval/evaluation_engine.py: EvaluationEngine (104-294)
- lsc_eval/src/lsc_eval/output/output_handler.py: OutputHandler (15-224), generate_reports (32-76)
- lsc_eval/src/lsc_eval/output/utils.py: calculate_basic_stats (19-45)

lsc_eval/src/lsc_eval/core/data_validator.py (1)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationData (47-123), validate_metric_requirements (89-123)

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
- lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py: get_llm (31-33), get_model_info (35-43)

lsc_eval/src/lsc_eval/evaluation_engine.py (8)
- lsc_eval/src/lsc_eval/core/config_loader.py: ConfigLoader (176-256), get_llm_config_dict (242-256)
- lsc_eval/src/lsc_eval/core/data_validator.py: DataValidator (11-82), validate_evaluation_data (39-54)
- lsc_eval/src/lsc_eval/core/models.py: EvaluationData (47-123), EvaluationResult (126-162), TurnData (8-44)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMManager (36-167), from_system_config (163-167)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py: CustomMetrics (25-242), evaluate (42-57), from_system_config (239-242)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py: DeepEvalMetrics (19-138), evaluate (62-77), from_system_config (135-138)
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py: RagasMetrics (23-247), evaluate (78-107), from_system_config (244-247)
- lsc_eval/src/lsc_eval/output/utils.py: EvaluationScope (11-16)

lsc_eval/src/lsc_eval/core/models.py (2)
- lsc_eval/src/lsc_eval/core/config_loader.py: validate_metrics (123-137)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py: LLMConfig (13-33)
🪛 YAMLlint (1.37.1)
lsc_eval/config/evaluation_data.yaml
[error] trailing-spaces on lines 10, 11, 12, 16, 31, 33, 37, 47, 50, 52, 56, 58, 66

lsc_eval/config/system.yaml
[error] trailing-spaces on lines 16, 23, 26, 30, 51, 57, 64, 70, 76, 82, 89, 98, 101, 104, 120, 124, 138
🪛 LanguageTool
lsc_eval/README.md
[grammar] ~5-~5: There might be a mistake here.
Context: ...e GenAI application. ## 🎯 Key Features - Multi-Framework Support: Seamlessly use...
(QB_NEW_EN)
[grammar] ~15-~15: There might be a mistake here.
Context: ...distribution analysis ## 🚀 Quick Start ### Installation ```bash # From Git pdm add g...
(QB_NEW_EN)
[grammar] ~36-~36: There might be a mistake here.
Context: ...n_data.yaml ``` ## 📊 Supported Metrics ### Turn-Level (Single Query) - Ragas -...
(QB_NEW_EN)
[grammar] ~38-~38: There might be a mistake here.
Context: ...d Metrics ### Turn-Level (Single Query) - Ragas - Response Evaluation - `f...
(QB_NEW_EN)
[grammar] ~39-~39: There might be a mistake here.
Context: ... ### Turn-Level (Single Query) - Ragas - Response Evaluation - faithfulness...
(QB_NEW_EN)
[grammar] ~48-~48: There might be a mistake here.
Context: ...ext_precision_with_reference- **Custom** - Response Evaluation -answer_correc...
(QB_NEW_EN)
[grammar] ~52-~52: There might be a mistake here.
Context: ...ss ### Conversation-Level (Multi-turn) - **DeepEval** -conversation_completenes...
(QB_NEW_EN)
[grammar] ~113-~113: There might be a mistake here.
Context: ...t...." ``` ## 📈 Output & Visualization ### Generated Reports - CSV: Detailed res...
(QB_NEW_EN)
[grammar] ~115-~115: There might be a mistake here.
Context: ...t & Visualization ### Generated Reports - CSV: Detailed results with status, sco...
(QB_NEW_EN)
[grammar] ~116-~116: There might be a mistake here.
Context: ...led results with status, scores, reasons - JSON: Summary statistics with score di...
(QB_NEW_EN)
[grammar] ~117-~117: There might be a mistake here.
Context: ...mary statistics with score distributions - TXT: Human-readable summary - PNG:...
(QB_NEW_EN)
[grammar] ~118-~118: There might be a mistake here.
Context: ...utions - TXT: Human-readable summary - PNG: 4 visualization types (pass rates...
(QB_NEW_EN)
[grammar] ~121-~121: There might be a mistake here.
Context: ...us breakdown) ### Key Metrics in Output - PASS/FAIL/ERROR: Status based on thres...
(QB_NEW_EN)
[grammar] ~122-~122: There might be a mistake here.
Context: ...FAIL/ERROR**: Status based on thresholds - Actual Reasons: DeepEval provides LLM-...
(QB_NEW_EN)
[grammar] ~135-~135: There might be a mistake here.
Context: ...lint . ``` ## 📄 License & Contributing This project is licensed under the Apache...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...e for details. Contributions welcome - see development setup above for code qualit...
(QB_NEW_EN)
[grammar] ~139-~139: There might be a mistake here.
Context: ...ment setup above for code quality tools.
(QB_NEW_EN)
🪛 GitHub Actions: Black
[error] Command 'pdm run black --check .' failed with exit code 1. Black would reformat the following files:

- lsc_eval/src/lsc_eval/output/utils.py
- lsc_eval/src/lsc_eval/core/__init__.py
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py
- lsc_eval/src/lsc_eval/output/output_handler.py
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
- lsc_eval/src/lsc_eval/output/visualization.py
- lsc_eval/src/lsc_eval/core/config_loader.py
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py
- lsc_eval/runner.py
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
- lsc_eval/src/lsc_eval/evaluation_engine.py
- lsc_eval/src/lsc_eval/core/models.py
🪛 GitHub Actions: Type checks
lsc_eval/src/lsc_eval/output/output_handler.py
[error] 18-18: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
[error] 64-64: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 78-78: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 109-109: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 128-128: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 145-145: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 167-167: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 197-197: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 224-224: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
[error] 40-40: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 51-51: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 62-62: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 79-79: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 95-95: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 115-115: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
lsc_eval/src/lsc_eval/core/config_loader.py
[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 47-47: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 105-105: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 179-179: Mypy: Function is missing a return type annotation. [no-untyped-def]
lsc_eval/src/lsc_eval/metrics/custom_metrics.py
[error] 42-42: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 193-193: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
lsc_eval/runner.py
[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'ConfigLoader'. [attr-defined]
[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'DataValidator'. [attr-defined]
[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'EvaluationEngine'. [attr-defined]
[error] 20-20: Mypy: Module 'lsc_eval' has no attribute 'OutputHandler'. [attr-defined]
[error] 99-99: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 135-135: Mypy: Call to untyped function 'main' in typed context. [no-untyped-call]
lsc_eval/src/lsc_eval/core/data_validator.py
[error] 14-14: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 56-56: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 74-74: Mypy: Function is missing a return type annotation. [no-untyped-def]
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
[error] 20-20: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
[error] 61-61: Mypy: Function is missing a type annotation for one or more arguments. [no-untyped-def]
lsc_eval/src/lsc_eval/evaluation_engine.py
[error] 83-83: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 118-118: Mypy: Call to untyped function 'DataValidator' in typed context. [no-untyped-call]
[error] 158-158: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 178-178: Mypy: Function is missing a return type annotation. [no-untyped-def]
[error] 186-186: Mypy: Function is missing a return type annotation. [no-untyped-def]
🪛 GitHub Actions: Pyright
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py
[error] 5-5: Import "deepeval.models" could not be resolved (reportMissingImports)
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
[error] 5-5: Import "deepeval.metrics" could not be resolved (reportMissingImports)
[error] 10-10: Import "deepeval.test_case" could not be resolved (reportMissingImports)
[error] 11-11: Import "deepeval.test_case" could not be resolved (reportMissingImports)
lsc_eval/src/lsc_eval/output/visualization.py
[warning] 10-10: Import "seaborn" could not be resolved from source (reportMissingModuleSource)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py
[error] 6-6: Import "litellm" could not be resolved (reportMissingImports)
lsc_eval/runner.py
[error] 20-20: "ConfigLoader" is unknown import symbol (reportAttributeAccessIssue)
[error] 20-20: "DataValidator" is unknown import symbol (reportAttributeAccessIssue)
[error] 20-20: "EvaluationEngine" is unknown import symbol (reportAttributeAccessIssue)
[error] 20-20: "OutputHandler" is unknown import symbol (reportAttributeAccessIssue)
[error] 21-21: Import "lsc_eval.core" could not be resolved (reportMissingImports)
[error] 22-22: Import "lsc_eval.output.utils" could not be resolved (reportMissingImports)
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
[error] 5-5: Import "litellm" could not be resolved (reportMissingImports)
🪛 GitHub Actions: Python linter
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py
[error] 5-5: Pylint: E0401: Unable to import 'deepeval.models' (import-error).
lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
[error] 5-5: Pylint: E0401: Unable to import 'deepeval.metrics' (import-error).
[error] 10-10: Pylint: E0401: Unable to import 'deepeval.test_case' (import-error).
lsc_eval/src/lsc_eval/output/visualization.py
[error] 10-10: Pylint: E0401: Unable to import 'seaborn' (import-error).
[warning] 204-213: Pylint: duplicate-code detected between lightspeed_core_evaluation.utils.plot [26:35] and lsc_eval/src/lsc_eval/output/visualization.py [204:213].
lsc_eval/src/lsc_eval/metrics/custom_metrics.py
[error] 6-6: Pylint: E0401: Unable to import 'litellm' (import-error).
lsc_eval/runner.py
[warning] 1-1: Pylint: R0801: Similar lines in 2 files.
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
[error] 5-5: Pylint: E0401: Unable to import 'litellm' (import-error).
🔇 Additional comments (32)
lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (1)
35-43: Model info helper looks good. The info map mirrors constructor params and includes sensible defaults.
Makefile (2)
42-44: Type-check scope expansion looks good. Including lsc_eval/src/ and lsc_eval/runner.py in mypy is appropriate for the new module.
75-80: pyright target expansion is correct; keep optional-import guards aligned. With the guards added in code, this target should pass even without installing optional deps.
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
1-7: Format and Lint: Run Black and Re-run Static Analysis

File: lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
Lines: 1–7

Please ensure the file is properly formatted and free of style or type-checking errors:

- Run Black to normalize formatting:
  black lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
- Re-run static analysis checks and address any new issues:
  pylint lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
  mypy lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py
  pyright lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py

Once formatting and lint/type errors are cleared, this can be merged.
lsc_eval/src/lsc_eval/output/__init__.py (1)
1-6: Public surface looks clean and minimal. Re-exporting OutputHandler and GraphGenerator with __all__ is tidy and conventional.
lsc_eval/src/lsc_eval/metrics/__init__.py (1)
1-7: Clean public API for metrics — LGTM. Concise docstring, clear re-exports, and stable surface for from lsc_eval.metrics import .... Looks good.

lsc_eval/llm_managers/__init__.py (1)
1-7: LLM managers aggregator — LGTM. Good, straightforward re-exports that match usage patterns elsewhere.
lsc_eval/pyproject.toml (2)
35-37: Console script entry point validated
lsc_eval/runner.py defines main() on line 99, so the lsc-eval = "lsc_eval.runner:main" entry point is valid and will install/run successfully.
23-33: Dev dependency versions are valid
All specified dev-tool versions exist on PyPI and none exceed the current latest releases:
- black >=25.1.0 (latest 25.1.0)
- mypy >=1.15.0 (latest 1.17.1)
- ruff >=0.8.0 (latest 0.12.10)
- pyright >=1.1.401 (latest 1.1.404)
- pydocstyle >=6.3.0 (latest 6.3.0)
- pylint >=3.3.2 (latest 3.3.8)
- pytest >=8.3.2 (latest 8.4.1)
- pytest-cov >=5.0.0 (latest 6.2.1)
No adjustments needed.
lsc_eval/config/evaluation_data.yaml (1)
1-73: All trailing whitespace removed in lsc_eval/config/evaluation_data.yaml. The sed cleanup has stripped all trailing spaces and a ripgrep check confirms there are none remaining. No further action is needed.
lsc_eval/config/system.yaml (2)
1-145: Trailing spaces removed – no further action needed. All trailing whitespace has been cleaned from lsc_eval/config/system.yaml and a subsequent search found no remaining instances. CI should now pass without yamllint errors.
41-110: It looks like the combined install + inspect script didn’t emit any output, so we can’t confirm whether those classes actually exist. Could you please run just the inspection portion and share what it prints? For example:

python3 - << 'EOF'
import importlib

# Ragas
mod = importlib.import_module('ragas.evaluation.metrics')
print("Ragas exports:", [n for n in ('LLMContextPrecisionWithReference', 'LLMContextPrecisionWithoutReference') if hasattr(mod, n)])

# DeepEval
mod2 = importlib.import_module('deepeval.metrics')
print("DeepEval exports:", [n for n in ('ConversationCompletenessMetric', 'TurnRelevancyMetric') if hasattr(mod2, n)])
EOF

That will confirm definitively whether you can safely reference these metric identifiers against ragas>=0.3.0 and deepeval>=1.3.0.
lsc_eval/runner.py (1)
14-19: Ignore the src/-layout shim suggestion

This repository uses a flat lsc_eval/ package, not a src/ layout, so there is no …/lsc_eval/src directory to insert on sys.path. Rather than hacking sys.path, you should:

- Install the package in editable/dev mode (e.g. pip install -e .) so that python -m lsc_eval.runner and python lsc_eval/runner.py both work out of the box.
- Or add a proper entrypoint/__main__.py inside lsc_eval/ to support python -m lsc_eval without manual path manipulation.

Remove the proposed sys.path shim and adopt one of the above approaches to ensure the runner is discoverable in development.

Likely an incorrect or invalid review comment.
lsc_eval/src/lsc_eval/core/data_validator.py (2)
39-55: LGTM on the validation flow. Clear separation: availability check, then per-metric field requirements. The boolean return plus aggregated error printout is practical.
1-1: ✅ Type-check passed for data_validator.py

- Installed mypy and its pydantic plugin dependency.
- Ran mypy lsc_eval/src/lsc_eval/core/data_validator.py and confirmed no type errors.

All checks are green—no further action needed.
lsc_eval/src/lsc_eval/output/utils.py (3)
19-46: Basic stats calculation looks correct and robust to empty input. Guarding empty results and returning 0.0 rates is sensible; integer counts + float rates are fine.
71-116: Detailed stats aggregation is solid; nice touch on stddev guard. Using stdev only when count > 1 avoids StatisticsError. Good structure for consumers.
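The guard pattern being praised can be sketched in isolation (the helper name and return shape here are illustrative, not the tool's actual API):

```python
import statistics


def summarize(scores: list[float]) -> dict[str, float]:
    """Aggregate metric scores without tripping StatisticsError."""
    if not scores:
        # Empty input: return zeroed stats instead of raising.
        return {"count": 0.0, "mean": 0.0, "stdev": 0.0}
    return {
        "count": float(len(scores)),
        "mean": statistics.mean(scores),
        # statistics.stdev requires at least two data points.
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

A single-sample input yields a stdev of 0.0 rather than an exception, which keeps downstream report code simple.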
1-1: Black formatting failure — reformat to pass CI

I attempted to verify formatting but the pdm command wasn’t available in this environment. Please re-format lsc_eval/src/lsc_eval/output/utils.py with Black in your development setup and confirm it passes CI. For example:

- Install dev dependencies (if not already): pdm install --dev
- Reformat the file: pdm run black lsc_eval/src/lsc_eval/output/utils.py
- Verify no changes are needed: pdm run black --check lsc_eval/src/lsc_eval/output/utils.py

lsc_eval/src/lsc_eval/output/output_handler.py (2)
61-77: Graph generation error handling is appropriate. Gracefully catching and logging graph errors keeps the reporting path resilient.
1-1: Please verify formatting and type checks

Since the sandbox couldn’t run pdm, black, or mypy, please re-run your formatter and type checker locally to ensure no regressions were introduced by the recent signature change.

• From your project root, run:
  black --check lsc_eval/src/lsc_eval/output/output_handler.py
  mypy lsc_eval/src/lsc_eval/output/output_handler.py
• Address any formatting diffs or type errors before merging.
lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (2)
1-1: Action Required: Manually verify Black formatting and mypy checks on ragas_metrics.py

It looks like pdm isn’t available in this environment, so the Black and mypy commands couldn’t run. To ensure there are no lingering formatting or typing errors, please perform the following steps locally:

- Install Black and mypy (and PDM if you rely on it):
  pip install black mypy   # Install formatting and type-checking tools
  # pip install pdm black mypy   # If you use PDM for dependency management
- Apply Black in “check” mode:
  black --check lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
- Re-run mypy for this file to catch any remaining no-untyped-def errors:
  mypy lsc_eval/src/lsc_eval/metrics/ragas_metrics.py

Once you’ve confirmed the file is properly formatted and free of mypy errors, please re-run the CI pipeline to ensure everything passes.
1-248: Verify Ragas Metrics DataFrame Column Names
To prevent run-time key errors, confirm that the columns produced by your locked ragas version match the keys used in these evaluation calls. In most recent versions, the mappings are:
- ResponseRelevancy → answer_relevancy
- Faithfulness → faithfulness
- ContextRelevance → nv_context_relevance
- LLMContextRecall → context_recall
- LLMContextPrecisionWithReference → llm_context_precision_with_reference
- LLMContextPrecisionWithoutReference → llm_context_precision_without_reference

If your pinned version uses different column names, update the result_key arguments (and any downstream logic) accordingly.
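One defensive pattern against renamed columns is to fail fast with the available keys; this is a sketch (extract_score and the plain-dict row are illustrative, not the tool's actual code):

```python
def extract_score(row: dict[str, float], result_key: str) -> float:
    """Look up a metric column, failing with a helpful message if renamed."""
    if result_key not in row:
        # Surface the columns that actually exist, so a version bump that
        # renames a metric produces an actionable error instead of a bare KeyError.
        raise KeyError(
            f"column {result_key!r} not found; available: {sorted(row)}"
        )
    return row[result_key]
```

The error message then points directly at the mismatch when a pinned framework version changes its output schema.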
211-226: Good user-facing progress logging. Clear, concise status lines with emoji are helpful during long runs.
1-1: Ensure Black and mypy checks are installed and passing on the evaluation engine

I attempted to run the formatting and type-checking pipeline, but neither Black nor mypy was available in this environment. Please install the project’s dev dependencies and verify that this file passes both checks:

- Install development tools (via PDM or your preferred method): pdm install --dev
- Run Black in check mode: pdm run black --check lsc_eval/src/lsc_eval/evaluation_engine.py
- Run mypy type checks: pdm run mypy lsc_eval/src/lsc_eval/evaluation_engine.py

Once installed, confirm there are no remaining formatting or type-annotation errors in this file.
lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (2)
52-65: LGTM: provider handler dispatch is clean and extensible. The handler map + small provider-specific validators make this maintainable.
118-123: Please verify LiteLLM Azure model prefix and required environment variables
- Confirm that the model string format "azure/<deployment_name>" is correct for your target Azure OpenAI API version.
- Ensure _validate_azure_env() enforces presence of:
  - AZURE_OPENAI_API_KEY
  - AZURE_OPENAI_ENDPOINT
  - AZURE_OPENAI_API_VERSION
- If LiteLLM requires passing additional parameters (e.g. api_base, api_version) in the client setup, update this handler to include them.
238-243: Factory method usage is clean and composable. Nice decoupling via LLMManager.from_system_config; aligns with other metrics modules.
lsc_eval/src/lsc_eval/core/config_loader.py (2)
185-241: load_system_config typing is good; logging setup first is a plus. The flow—logging first, then metric registry population, then model construction—is clear.
242-256: get_llm_config_dict: add return type and keep dict shape aligned with LLMManager.from_system_config. Mypy flagged missing return type; adding it keeps types consistent. The dict shape matches LLMManager expectations.

-def get_llm_config_dict(self) -> Dict[str, Any]:
+def get_llm_config_dict(self) -> Dict[str, Any]:
     """Get system config as dictionary for LLMManager."""

Note: Signature already matches; ensure mypy config picks it up after earlier fixes.
lsc_eval/src/lsc_eval/core/models.py (3)
8-16: LGTM: TurnData structure and validators are solid. Good constraints on non-empty query/response and positive turn_id; contexts schema check is helpful.

80-88: Metric identifier format validation is clear. Enforcing "framework:metric_name" early prevents downstream confusion.

148-155: Score bounds are enforced here; ensure upstream parsers normalize. CustomMetrics now clamps scores; this validator will still catch anomalies if any slip through. Good layering.
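The layering described above (clamp in the upstream parser, then validate again at the model boundary) can be sketched with plain functions; names are hypothetical, and the real code uses pydantic validators:

```python
def clamp_score(value: float) -> float:
    """Upstream normalization: force a raw score into [0.0, 1.0]."""
    return max(0.0, min(1.0, value))


def validate_score(value: float) -> float:
    """Model-boundary check: reject anything that still escapes the range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"score must be in [0.0, 1.0], got {value}")
    return value
```

The clamp makes noisy LLM-parsed scores well-formed, while the validator remains as a backstop that surfaces any value the parser failed to normalize.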
Usage:
    python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
Or programmatically:
    from runner import run_evaluation
    results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
"""
🛠️ Refactor suggestion
Fix usage guidance for src/ layout and console entry point.
python runner.py will not find the package in a src/ layout. Recommend using python -m lsc_eval.runner or the console script.
Apply:
-Usage:
- python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+Usage:
+ # After installing (or in editable mode):
+ lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml
+
+ # Or directly from the repo (src/ layout):
+ python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml
@@
-Or programmatically:
- from runner import run_evaluation
- results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
+Or programmatically:
+ from lsc_eval.runner import run_evaluation
+ results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Usage: | |
| python runner.py --system-config config/system.yaml --eval-data config/evaluation_data.yaml | |
| Or programmatically: | |
| from runner import run_evaluation | |
| results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml") | |
| """ | |
| """ | |
| Usage: | |
| # After installing (or in editable mode): | |
| lsc-eval --system-config config/system.yaml --eval-data config/evaluation_data.yaml | |
| # Or directly from the repo (src/ layout): | |
| python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml | |
| Or programmatically: | |
| from lsc_eval.runner import run_evaluation | |
| results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml") | |
| """ |
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 6 to 12, the usage comment incorrectly
suggests running "python runner.py" which fails in a src/ layout; update the
guidance to recommend using "python -m lsc_eval.runner" (or the provided console
script name if one exists) and adjust the programmatic example to "from
lsc_eval.runner import run_evaluation" to reflect the package import path so
users can run the module or import it correctly.
except (FileNotFoundError, ValueError, RuntimeError) as e:
    print(f"\n❌ Evaluation failed: {e}")
    traceback.print_exc()
    return None
🛠️ Refactor suggestion
Catch ImportError to surface missing optional deps cleanly.
Broaden the exception tuple so missing libs (e.g., seaborn/deepeval) don’t crash with a traceback but return a non-zero exit code.
Apply:
- except (FileNotFoundError, ValueError, RuntimeError) as e:
+ except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except (FileNotFoundError, ValueError, RuntimeError) as e: | |
| print(f"\n❌ Evaluation failed: {e}") | |
| traceback.print_exc() | |
| return None | |
| except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e: | |
| print(f"\n❌ Evaluation failed: {e}") | |
| traceback.print_exc() | |
| return None |
🤖 Prompt for AI Agents
In lsc_eval/runner.py around lines 93-97, the except block currently catches
FileNotFoundError, ValueError and RuntimeError but not ImportError; update the
exception tuple to include ImportError so missing optional dependencies are
handled cleanly, and change the error path to return a non-zero exit signal (for
example return 1 or call sys.exit(1)) instead of returning None — if you use
sys.exit, import sys at the top.
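A sketch of the exit-code pattern the prompt describes; run_evaluation here is a stand-in that simulates a missing optional dependency, not the tool's real runner:

```python
def run_evaluation() -> None:
    """Stand-in for the real runner; simulates a missing optional dependency."""
    raise ImportError("seaborn is not installed")


def main() -> int:
    """Map failures to a non-zero exit code instead of returning None."""
    try:
        run_evaluation()
    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as exc:
        print(f"Evaluation failed: {exc}")
        return 1
    return 0


# In the script entry point: raise SystemExit(main())
```

Returning an int and passing it to SystemExit lets shells and CI distinguish success from failure, which a bare `return None` does not.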
Most of the issues will get resolved when we make this the primary eval tool and remove the existing one.
@asamal4, understood! That makes sense - consolidating to a single evaluation tool will indeed clean up many of these edge cases naturally. Thanks for the context about the broader architectural plan to make this the primary tool.
✏️ Learnings added
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#28
File: lsc_eval/runner.py:99-103
Timestamp: 2025-08-26T11:17:48.603Z
Learning: The lsc_eval generic evaluation tool is intended to become the primary evaluation framework, replacing an existing evaluation tool in the lightspeed-evaluation repository.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
_, ax = plt.subplots(figsize=(12, 8))

# Create bar chart
bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7)

# Add percentage labels with status breakdown
for chart_bar, breakdown in zip(bars, status_breakdowns):
    height = chart_bar.get_height()
    ax.text(
        chart_bar.get_x() + chart_bar.get_width() / 2.0,
        height + 1,
        f"{height:.1f}%\n({breakdown})",
        ha="center",
        va="bottom",
        fontsize=9,
    )

# Customize chart
ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
ax.set_xlabel("Pass Rate (%)", fontsize=12)
ax.set_ylabel("Metrics", fontsize=12)
ax.set_ylim(0, 105)  # Give space for labels

# Rotate x-axis labels if needed
if len(metrics) > 5:
    plt.xticks(rotation=45, ha="right")

plt.tight_layout()
Pass rates chart: wrong axis labels and inconsistent figure sizing.
- Using hardcoded figsize=(12, 8) ignores configured size/DPI.
- X/Y labels are flipped: x-axis is metrics, y-axis is pass rate.
Apply:
- _, ax = plt.subplots(figsize=(12, 8))
+ _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)
@@
- ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
- ax.set_xlabel("Pass Rate (%)", fontsize=12)
- ax.set_ylabel("Metrics", fontsize=12)
+ ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold")
+ ax.set_xlabel("Metrics", fontsize=12)
+ ax.set_ylabel("Pass Rate (%)", fontsize=12)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| _, ax = plt.subplots(figsize=(12, 8)) | |
| # Create bar chart | |
| bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7) | |
| # Add percentage labels with status breakdown | |
| for chart_bar, breakdown in zip(bars, status_breakdowns): | |
| height = chart_bar.get_height() | |
| ax.text( | |
| chart_bar.get_x() + chart_bar.get_width() / 2.0, | |
| height + 1, | |
| f"{height:.1f}%\n({breakdown})", | |
| ha="center", | |
| va="bottom", | |
| fontsize=9, | |
| ) | |
| # Customize chart | |
| ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold") | |
| ax.set_xlabel("Pass Rate (%)", fontsize=12) | |
| ax.set_ylabel("Metrics", fontsize=12) | |
| ax.set_ylim(0, 105) # Give space for labels | |
| # Rotate x-axis labels if needed | |
| if len(metrics) > 5: | |
| plt.xticks(rotation=45, ha="right") | |
| plt.tight_layout() | |
| _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi) | |
| # Create bar chart | |
| bars = ax.bar(metrics, pass_rates, color="skyblue", alpha=0.7) | |
| # Add percentage labels with status breakdown | |
| for chart_bar, breakdown in zip(bars, status_breakdowns): | |
| height = chart_bar.get_height() | |
| ax.text( | |
| chart_bar.get_x() + chart_bar.get_width() / 2.0, | |
| height + 1, | |
| f"{height:.1f}%\n({breakdown})", | |
| ha="center", | |
| va="bottom", | |
| fontsize=9, | |
| ) | |
| # Customize chart | |
| ax.set_title("Pass Rates by Metric", fontsize=16, fontweight="bold") | |
| ax.set_xlabel("Metrics", fontsize=12) | |
| ax.set_ylabel("Pass Rate (%)", fontsize=12) | |
| ax.set_ylim(0, 105) # Give space for labels | |
| # Rotate x-axis labels if needed | |
| if len(metrics) > 5: | |
| plt.xticks(rotation=45, ha="right") | |
| plt.tight_layout() |
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 124-151, the
figure creation and axis labels are incorrect and ignore configured sizing:
replace the hardcoded plt.subplots(figsize=(12, 8)) with
plt.subplots(figsize=fig_size, dpi=dpi) (or the appropriate passed/configured
fig_size and dpi variables), and swap the axis labels so the x-axis reads
"Metrics" and the y-axis reads "Pass Rate (%)"; keep the bar chart as
ax.bar(metrics, pass_rates), maintain ax.set_ylim(0, 105) for label space, and
preserve the xtick rotation logic for long metric lists.
# Create heatmap
_, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi)

sns.heatmap(
    df,
    annot=True,
    fmt=".1f",
    cmap="RdYlGn",
    vmin=0,
    vmax=100,
    cbar_kws={"label": "Pass Rate (%)"},
    ax=ax,
    mask=df.isna(),  # Mask NaN values (missing data)
    linewidths=0.5,
    linecolor="lightgray",
)
🛠️ Refactor suggestion
Skip heatmap gracefully when seaborn is unavailable.
Avoid runtime errors when seaborn isn’t installed.
Apply:
- sns.heatmap(
+ if not HAS_SEABORN:
+ self.logger.warning("seaborn not available; skipping heatmap")
+ return None
+ sns.heatmap(
df,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Create heatmap | |
| _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi) | |
| sns.heatmap( | |
| df, | |
| annot=True, | |
| fmt=".1f", | |
| cmap="RdYlGn", | |
| vmin=0, | |
| vmax=100, | |
| cbar_kws={"label": "Pass Rate (%)"}, | |
| ax=ax, | |
| mask=df.isna(), # Mask NaN values (missing data) | |
| linewidths=0.5, | |
| linecolor="lightgray", | |
| ) | |
| # Create heatmap | |
| _, ax = plt.subplots(figsize=tuple(self.figsize), dpi=self.dpi) | |
| if not HAS_SEABORN: | |
| self.logger.warning("seaborn not available; skipping heatmap") | |
| return None | |
| sns.heatmap( | |
| df, | |
| annot=True, | |
| fmt=".1f", | |
| cmap="RdYlGn", | |
| vmin=0, | |
| vmax=100, | |
| cbar_kws={"label": "Pass Rate (%)"}, | |
| ax=ax, | |
| mask=df.isna(), # Mask NaN values (missing data) | |
| linewidths=0.5, | |
| linecolor="lightgray", | |
| ) |
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 371 to 387, the
heatmap block currently assumes seaborn is available; update it to handle
seaborn absence by first attempting "import seaborn as sns" (or checking an
existing import) in a try/except ImportError, and if seaborn is missing log or
warn that seaborn is unavailable and either (a) fall back to a matplotlib-based
heatmap using imshow + text annotations to replicate annot=True, cmap, vmin/vmax
and mask handling, or (b) gracefully skip drawing the heatmap and
return/continue without raising; ensure the fallback respects figsize/dpi and
the cbar label and does not raise runtime errors when seaborn is not installed.
ax.set_title(
    "Conversation Performance Heatmap\n(Only showing metrics with data)",
    fontsize=14,
    fontweight="bold",
    pad=20,
)
ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
🛠️ Refactor suggestion
Heatmap axis labels are misleading.
X-axis is metrics; colorbar already conveys pass rate.
Apply:
- ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold")
- ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")
+ ax.set_xlabel("Metrics", fontsize=12, fontweight="bold")
+ ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ax.set_title( | |
| "Conversation Performance Heatmap\n(Only showing metrics with data)", | |
| fontsize=14, | |
| fontweight="bold", | |
| pad=20, | |
| ) | |
| ax.set_xlabel("Pass Rate (%)", fontsize=12, fontweight="bold") | |
| ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold") | |
| ax.set_title( | |
| "Conversation Performance Heatmap\n(Only showing metrics with data)", | |
| fontsize=14, | |
| fontweight="bold", | |
| pad=20, | |
| ) | |
| ax.set_xlabel("Metrics", fontsize=12, fontweight="bold") | |
| ax.set_ylabel("Conversation Groups", fontsize=12, fontweight="bold") |
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/output/visualization.py around lines 389 to 397, the
x-axis label is incorrect (it currently reads "Pass Rate (%)" while the x-axis
actually lists metric names and the colorbar conveys pass rate); change the
ax.set_xlabel call to something like "Metrics" (or remove the x-axis label) and
ensure the colorbar is explicitly labeled "Pass Rate (%)" so axis and legend
accurately reflect what is being plotted.
verify: install-deps-test ## Verify the code using various linters
	pdm run black . --check
	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
-	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests
+	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py
💡 Verification agent
🧩 Analysis chain
Pylint fails due to optional metric deps; consider guard or install in CI.
Current CI fails on deepeval import. You can either:
- keep the code guards (recommended; see code diffs) and
- also install optional deps in CI when running full linters.
Optionally update CI or Makefile to install extras for metrics:
🏁 Script executed:
#!/bin/bash
# Quick check if optional deps are present; if not, install for CI lint pass.
python - <<'PY'
import importlib, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 182
🏁 Script executed:
#!/bin/bash
# Re-run the missing dependencies check with Python 3
python3 - << 'PY'
import importlib.util, sys
missing=[m for m in ['deepeval','litellm'] if importlib.util.find_spec(m) is None]
print("Missing:", missing)
PY

Length of output: 44
🏁 Script executed:
#!/bin/bash
set -e
# Search for deepeval imports and usage with context
rg -n 'import deepeval' -C3 .
rg -n 'deepeval' -C3 .
# Search for litellm imports and usage with context
rg -n 'import litellm' -C3 .
rg -n 'litellm' -C3 .

Length of output: 40
🏁 Script executed:
#!/bin/bash
set -e
# Inspect optional dependencies/extras in pyproject.toml
rg -nP '^\[tool\.pdm\.extras\]' -C5 pyproject.toml
rg -nP '^metrics\s*=' -C5 pyproject.toml

Length of output: 61
🏁 Script executed:
```shell
#!/bin/bash
# Search for deepeval and litellm references without exiting on first miss
rg -n 'import deepeval' -C3 . || true
rg -n 'deepeval' -C3 . || true
rg -n 'import litellm' -C3 . || true
rg -n 'litellm' -C3 . || true
```

Length of output: 35588
🏁 Script executed:
```shell
#!/bin/bash
# Show the verify target and context in Makefile
rg -n 'verify:' -C3 Makefile
echo '--- First 50 lines of Makefile ---'
head -n 50 Makefile
```

Length of output: 3073
Review comment
File: Makefile
Lines: 49–53
Snippet showing the final state of code at these lines
```make
verify: install-deps-test ## Verify the code using various linters
	pdm run black . --check
	pdm run ruff check . --per-file-ignores=tests/*:S101 --per-file-ignores=scripts/*:S101 --per-file-ignores=lsc_agent_eval/tests/*:S101
	pdm run pylint src tests lsc_agent_eval/src lsc_agent_eval/tests lsc_eval/src/ lsc_eval/runner.py
```
Comment
Optional metric dependencies (deepeval, litellm) are not installed in CI, causing Pylint import errors.
These packages are imported unconditionally in your metrics and LLM-manager modules, so the CI lint pass (which runs pdm sync --dev) fails when Pylint encounters missing imports. You can address this in one of two ways:
- Wrap the imports in `try/except ImportError` guards (as shown in your code diffs).
- Install the optional dependencies in CI before running the linters.
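The first option — guarding the import — can be sketched in isolation. The module name (litellm) comes from the review; the `require_litellm` helper is illustrative, not code from this PR:

```python
# Sketch of an optional-import guard: the module degrades gracefully when
# the optional dependency is absent, and fails fast with a clear message
# only when the dependent code path is actually exercised.
try:
    import litellm  # optional dependency used only by the metric path
    LITELLM_AVAILABLE = True
except ImportError:
    litellm = None  # type: ignore[assignment]
    LITELLM_AVAILABLE = False


def require_litellm() -> None:
    """Raise a clear error before any call into the optional dependency."""
    if not LITELLM_AVAILABLE:
        raise ImportError(
            "litellm is not installed; run `pip install litellm` to enable these metrics."
        )
```

With this guard in place, Pylint and mypy can import the module regardless of whether the extra is installed.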
To install them via the Makefile, update the install-deps-test (or the verify) target:
```diff
 install-deps-test: install-tools pdm-lock-check ## Install all required dev dependencies needed to test the service
 	@for a in 1 2 3 4 5; do pdm sync --dev && break || sleep 15; done
+	# Ensure metric packages are available for full linting
+	pip install deepeval litellm
```

Alternatively, if you prefer using PDM extras, first define a metrics extra in your pyproject.toml:
```toml
[tool.pdm.extras]
metrics = ["deepeval>=1.3.0", "litellm>=1.0.0"]
```

and then install it in the Makefile:
```diff
 install-deps-test: install-tools pdm-lock-check
 	pdm sync --dev
+	pdm install --group metrics
```

Please apply one of these changes so that Pylint can successfully import the metric modules in CI.
🤖 Prompt for AI Agents
Makefile lines 49–53: CI Pylint fails because optional metric deps (deepeval,
litellm) aren’t installed; update the dependency install step so linters can
import metric modules by adding installation of the metrics extra (or the
specific packages) before running linters — e.g. in the install-deps-test target
run pdm sync --dev then pdm install --group metrics (or pdm install deepeval
litellm) so the verify target’s pdm run pylint step no longer errors on missing
imports.
Review continued from previous batch...
```python
]


def setup_environment_variables(config_path: str):
```
🛠️ Refactor suggestion
Add return type annotation: setup_environment_variables -> None
Mypy flagged missing return type. Also prefer warnings over prints and avoid setting env vars to the literal string "None".
-def setup_environment_variables(config_path: str):
+from warnings import warn
+
+def setup_environment_variables(config_path: str) -> None:
@@
- for env_var, value in env_config.items():
- os.environ[env_var] = str(value)
+ for env_var, value in env_config.items():
+ if value is None:
+ continue
+ os.environ[env_var] = str(value)
@@
- print(f"Warning: Could not load environment config from {config_path}: {e}")
- print("Using fallback environment settings...")
+ warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
from warnings import warn

def setup_environment_variables(config_path: str) -> None:
    try:
        # … load env_config from config_path …
        for env_var, value in env_config.items():
            if value is None:
                continue
            os.environ[env_var] = str(value)
    except Exception as e:
        warn(f"Could not load environment config from {config_path}: {e}. Using fallback env settings…")
```
🧰 Tools
🪛 GitHub Actions: Type checks
[error] 27-27: Mypy: Function is missing a return type annotation. [no-untyped-def]
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/core/config_loader.py around line 27, add an explicit
return type annotation to the function signature (def
setup_environment_variables(config_path: str) -> None:), replace any print(...)
calls with warnings.warn(...) imports from the warnings module, and when setting
environment variables avoid assigning the literal string "None" by checking for
None/empty values and either skipping os.environ assignment or deleting the key
instead of setting it to "None"; ensure the function does not return a value.
```python
try:
    response = litellm.completion(
        model=self.model_name,
        messages=messages,
        temperature=self.litellm_params.get("temperature", 0.0),
        max_tokens=self.litellm_params.get("max_tokens"),
        timeout=self.litellm_params.get("timeout"),
        num_retries=self.litellm_params.get("num_retries", 3),
    )
```
🛠️ Refactor suggestion
Don’t pass None-valued args to litellm; build kwargs dynamically
Passing None for max_tokens/timeout can cause API errors on some providers. Build kwargs conditionally.
- response = litellm.completion(
- model=self.model_name,
- messages=messages,
- temperature=self.litellm_params.get("temperature", 0.0),
- max_tokens=self.litellm_params.get("max_tokens"),
- timeout=self.litellm_params.get("timeout"),
- num_retries=self.litellm_params.get("num_retries", 3),
- )
+ kwargs: Dict[str, Any] = {
+ "model": self.model_name,
+ "messages": messages,
+ "temperature": self.litellm_params.get("temperature", 0.0),
+ "num_retries": self.litellm_params.get("num_retries", 3),
+ }
+ if (mt := self.litellm_params.get("max_tokens")) is not None:
+ kwargs["max_tokens"] = mt
+ # Use request_timeout if your LiteLLM version expects that name
+ if (to := self.litellm_params.get("timeout")) is not None:
+ kwargs["timeout"] = to
+ response = litellm.completion(**kwargs)

📝 Committable suggestion
Before:

```python
try:
    response = litellm.completion(
        model=self.model_name,
        messages=messages,
        temperature=self.litellm_params.get("temperature", 0.0),
        max_tokens=self.litellm_params.get("max_tokens"),
        timeout=self.litellm_params.get("timeout"),
        num_retries=self.litellm_params.get("num_retries", 3),
    )
```

After:

```python
try:
    kwargs: Dict[str, Any] = {
        "model": self.model_name,
        "messages": messages,
        "temperature": self.litellm_params.get("temperature", 0.0),
        "num_retries": self.litellm_params.get("num_retries", 3),
    }
    if (mt := self.litellm_params.get("max_tokens")) is not None:
        kwargs["max_tokens"] = mt
    # Use request_timeout if your LiteLLM version expects that name
    if (to := self.litellm_params.get("timeout")) is not None:
        kwargs["timeout"] = to
    response = litellm.completion(**kwargs)
```
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 67 to 75, the
call to litellm.completion passes max_tokens and timeout (and potentially other
params) even when their values are None which can cause API errors; change to
build a kwargs dict (e.g., kwargs = {"model": self.model_name, "messages":
messages, "temperature": self.litellm_params.get("temperature", 0.0),
"num_retries": self.litellm_params.get("num_retries", 3)}) and then
conditionally add max_tokens and timeout only if their values are not None (and
omit any other None-valued entries), then call response =
litellm.completion(**kwargs).
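The build-kwargs-conditionally pattern can be isolated into a small helper. This is a generic sketch of the technique, not the PR's actual code; the parameter names mirror the review suggestion:

```python
from typing import Any, Dict, List


def build_completion_kwargs(
    model: str, messages: List[dict], params: Dict[str, Any]
) -> Dict[str, Any]:
    """Assemble provider kwargs, omitting optional keys whose value is None."""
    kwargs: Dict[str, Any] = {
        "model": model,
        "messages": messages,
        "temperature": params.get("temperature", 0.0),
        "num_retries": params.get("num_retries", 3),
    }
    # Only forward optional parameters that were actually configured
    for key in ("max_tokens", "timeout"):
        if (value := params.get(key)) is not None:
            kwargs[key] = value
    return kwargs
```

The helper can then be unit-tested independently of any LLM provider, which is part of the appeal of the pattern.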
```python
if score is not None and score > 1.0:
    if score <= 10.0:  # Assume 0-10 scale
        score = score / 10.0
    elif score <= 100.0:  # Assume 0-100 scale
        score = score / 100.0

return score, reason
```
🛠️ Refactor suggestion
Normalize and clamp scores to [0,1]; support percentages
Values like “85%” or spurious integers can exceed 1. Clamp and parse percent format to avoid downstream validation errors.
- if score is not None and score > 1.0:
- if score <= 10.0: # Assume 0-10 scale
- score = score / 10.0
- elif score <= 100.0: # Assume 0-100 scale
- score = score / 100.0
+ if score is None:
+ # Additional: handle percentages like "85%"
+ pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
+ if pct:
+ score = float(pct.group(1)) / 100.0
+ # Final normalization and clamp
+ if score is not None:
+ if score > 1.0 and score <= 10.0:
+ score = score / 10.0
+ elif score > 1.0 and score <= 100.0:
+ score = score / 100.0
+ score = max(0.0, min(1.0, score))

📝 Committable suggestion
Before:

```python
if score is not None and score > 1.0:
    if score <= 10.0:  # Assume 0-10 scale
        score = score / 10.0
    elif score <= 100.0:  # Assume 0-100 scale
        score = score / 100.0

return score, reason
```

After:

```python
if score is None:
    # Additional: handle percentages like "85%"
    pct = re.search(r'(\d+(\.\d+)?)\s*%', response)
    if pct:
        score = float(pct.group(1)) / 100.0
# Final normalization and clamp
if score is not None:
    if score > 1.0 and score <= 10.0:
        score = score / 10.0
    elif score > 1.0 and score <= 100.0:
        score = score / 100.0
    score = max(0.0, min(1.0, score))

return score, reason
```
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 118 to 124, the
normalization only handles numeric scales >1 but doesn't parse percentage
strings or clamp values into [0,1]; update the logic to: detect and strip a
trailing '%' then parse as a float and divide by 100, otherwise if numeric and
>1 apply the existing 0-10/0-100 heuristics, then clamp the final score into the
[0.0, 1.0] range (also handle negative values by clamping to 0.0); keep the
existing reason value and return the normalized, clamped score and reason.
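The parse-normalize-clamp flow can be condensed into one standalone function. The scale heuristics below follow the review suggestion and are assumptions, not the merged implementation:

```python
import re
from typing import Optional


def normalize_score(raw: str) -> Optional[float]:
    """Parse a judge score out of free text and clamp it into [0, 1].

    Handles "85%", "Score: 8.5", and bare numbers on assumed
    0-10 or 0-100 scales.
    """
    pct = re.search(r"(\d+(?:\.\d+)?)\s*%", raw)
    if pct:
        score = float(pct.group(1)) / 100.0
    else:
        m = re.search(r"-?\d+(?:\.\d+)?", raw)
        if not m:
            return None
        score = float(m.group(0))
        if 1.0 < score <= 10.0:      # assume 0-10 scale
            score /= 10.0
        elif 10.0 < score <= 100.0:  # assume 0-100 scale
            score /= 100.0
    # Clamp negatives and out-of-range values into [0, 1]
    return max(0.0, min(1.0, score))
```

Keeping this logic in one pure function makes the edge cases (percent strings, negative numbers, >100 outliers) trivial to unit-test.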
Actionable comments posted: 2
♻️ Duplicate comments (6)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
6-6: Fix CI/type-check failures and avoid None-args: lazy-import LiteLLM and build kwargs dynamically

Module-level import of litellm breaks CI/type checks when the dep isn’t installed, and passing None for provider params can cause API errors. Lazy-import inside the call path and construct kwargs conditionally.

Apply:

```diff
@@
-import litellm
@@
     def _call_llm(self, prompt: str, system_prompt: Optional[str] = None) -> str:
         """Make a LiteLLM call with the configured parameters."""
         # Prepare messages
         messages = []
         if system_prompt:
             messages.append({"role": "system", "content": system_prompt})
         messages.append({"role": "user", "content": prompt})
+        # Lazy-import to avoid import errors in environments without litellm
+        try:
+            import litellm  # type: ignore
+        except Exception as exc:
+            raise RuntimeError(
+                "litellm is required for CustomMetrics. Install with `pip install litellm`."
+            ) from exc
+
         try:
-            response = litellm.completion(
-                model=self.model_name,
-                messages=messages,
-                temperature=self.litellm_params.get("temperature", 0.0),
-                max_tokens=self.litellm_params.get("max_tokens"),
-                timeout=self.litellm_params.get("timeout"),
-                num_retries=self.litellm_params.get("num_retries", 3),
-            )
+            kwargs: Dict[str, Any] = {
+                "model": self.model_name,
+                "messages": messages,
+                "temperature": self.litellm_params.get("temperature", 0.0),
+                "num_retries": self.litellm_params.get("num_retries", 3),
+            }
+            if (mt := self.litellm_params.get("max_tokens")) is not None:
+                kwargs["max_tokens"] = mt
+            if (to := self.litellm_params.get("timeout")) is not None:
+                kwargs["timeout"] = to
+            response = litellm.completion(**kwargs)
```

Also applies to: 59-76
85-124: Harden score parsing: support “85%”, “Score: 8.5/10”, and clamp to [0,1]

Current logic fails on percent formats, may misinterpret scales, and doesn’t clamp. Normalize robustly and clamp the final score.

```diff
@@
-        # Try to find explicit score/reason format
+        # Try to find explicit score/reason format
         for line in lines:
             line = line.strip()
             if line.lower().startswith("score:"):
                 try:
-                    score_text = line.split(":", 1)[1].strip()
-                    score = float(score_text)
+                    score_text = line.split(":", 1)[1].strip()
+                    # Percent e.g., "85%"
+                    pct = re.match(r"(\d+(?:\.\d+)?)\s*%", score_text)
+                    if pct:
+                        score = float(pct.group(1)) / 100.0
+                    # Fraction or "out of" inside Score line, e.g., "8.5/10", "4 out of 5"
+                    elif "/" in score_text or re.search(r"\bout of\b", score_text, re.IGNORECASE):
+                        score = self._extract_score_from_text(score_text)
+                    else:
+                        score = float(score_text)
                 except (ValueError, IndexError):
                     pass
             elif line.lower().startswith("reason:"):
                 try:
                     reason = line.split(":", 1)[1].strip()
                 except IndexError:
                     pass
-        # If no explicit score found, try to extract from text
+        # If no explicit score found, try to extract from text (percent first, then numeric)
         if score is None:
-            score = self._extract_score_from_text(response)
+            pct = re.search(r"(\d+(?:\.\d+)?)\s*%", response)
+            if pct:
+                score = float(pct.group(1)) / 100.0
+            else:
+                score = self._extract_score_from_text(response)
-        # Normalize score to 0-1 range if needed
-        if score is not None and score > 1.0:
-            if score <= 10.0:  # Assume 0-10 scale
-                score = score / 10.0
-            elif score <= 100.0:  # Assume 0-100 scale
-                score = score / 100.0
+        # Normalize and clamp into [0,1]
+        if score is not None:
+            if score > 1.0:
+                if score <= 10.0:  # Assume 0-10 scale
+                    score = score / 10.0
+                elif score <= 100.0:  # Assume 0-100 scale
+                    score = score / 100.0
+            score = max(0.0, min(1.0, score))
```
6-12: Fix usage examples for src/ layout and import path."python -m runner" and "from runner import ..." won’t work in a src/ layout. Use the fully qualified package path.
-Usage: - python -m runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml +Usage: + # Run as a module (src/ layout friendly): + python -m lsc_eval.runner --system-config config/system.yaml --eval-data config/evaluation_data.yaml @@ -Or programmatically: - from runner import run_evaluation - results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml") +Or programmatically: + from lsc_eval.runner import run_evaluation + results = run_evaluation("config/system.yaml", "config/evaluation_data.yaml")
93-97: Handle missing optional dependencies cleanly (catch ImportError).

Broaden the except tuple so missing libs (e.g., matplotlib/seaborn) don’t crash with an unhandled traceback.

```diff
-    except (FileNotFoundError, ValueError, RuntimeError) as e:
+    except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e:
         print(f"\n❌ Evaluation failed: {e}")
         traceback.print_exc()
         return None
```
- except (FileNotFoundError, ValueError, RuntimeError) as e: + except (FileNotFoundError, ValueError, RuntimeError, ImportError) as e: print(f"\n❌ Evaluation failed: {e}") traceback.print_exc() return Nonelsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (2)
5-5: Guard optional LiteLLM import and fail fast if unavailable.

Unconditional `import litellm` breaks CI when the package isn’t installed, and runtime should surface a clear error before calling into LiteLLM. This was previously flagged; please implement the guard and runtime check.

Apply this diff:

```diff
-from typing import Any, Dict, List, Optional
-
-import litellm
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+import logging
+import asyncio
+try:
+    import litellm  # pyright: ignore[reportMissingImports]
+    _LITELLM_AVAILABLE = True
+except Exception:  # pragma: no cover
+    litellm = None  # type: ignore[assignment]
+    _LITELLM_AVAILABLE = False
+logger = logging.getLogger(__name__)
```

And fail fast before the LiteLLM call:

```diff
-    try:
+    if not _LITELLM_AVAILABLE:
+        raise ImportError(
+            "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+        )
+    try:
         response = litellm.completion(
```

Also applies to: 34-35, 61-61
25-26: Honor the stop parameter in LiteLLM calls.

The function accepts `stop` but doesn’t forward it to LiteLLM. Please pass it through. This was noted earlier and is still unresolved.

Apply this diff:

```diff
         response = litellm.completion(
             model=self.model_name,
             messages=[{"role": "user", "content": prompt_text}],
             n=n,
             temperature=temp,
             max_tokens=self.litellm_params.get("max_tokens"),
             timeout=self.litellm_params.get("timeout"),
-            num_retries=self.litellm_params.get("num_retries"),
+            num_retries=self.litellm_params.get("num_retries"),
+            stop=stop,
         )
```

Also applies to: 34-43
🧹 Nitpick comments (12)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (5)
3-4: Use logging instead of print for library initialization status

Printing from libraries is noisy; prefer the standard logging pipeline.

```diff
@@
-import re
+import re
+import logging
@@
-
+logger = logging.getLogger(__name__)
@@
-    print(f"✅ Custom Metrics initialized: {self.model_name}")
+    logger.info("Custom Metrics initialized: %s", self.model_name)
```

Also applies to: 13-13, 40-40
4-4: Tighten contexts typing to match TurnData and help mypy/pydantic tooling

TurnData.contexts is List[Dict[str, str]]. Align the param model to reduce ambiguity.

```diff
-from typing import Any, Dict, Optional, Tuple
+from typing import Any, Dict, Optional, Tuple, List
@@
-    contexts: Optional[list] = Field(None, description="Context information if available")
+    contexts: Optional[List[Dict[str, str]]] = Field(
+        None, description="Context information if available"
+    )
```

Also applies to: 21-21
180-189: Constrain output format to improve parse reliability

Tighten instructions so the model emits exactly two lines, reducing parser ambiguity.

```diff
         prompt_parts.extend(
             [
                 "",
-                f"Rate the {params.metric_name} and provide your reasoning.",
+                f"Rate the {params.metric_name} and provide your reasoning.",
                 "",
-                "Format your response as:",
-                f"Score: [your score on {params.scale}]",
-                "Reason: [your detailed explanation]",
+                "Output exactly two lines in this exact format (no extra text before or after):",
+                f"Score: <number on {params.scale}>",
+                "Reason: <brief explanation without additional numeric scores>",
             ]
         )
```
229-231: Nudge the model with a minimal system prompt to enforce structure

A short system prompt often cuts variance and helps parsing.

```diff
-        llm_response = self._call_llm(prompt)
+        llm_response = self._call_llm(
+            prompt,
+            system_prompt=(
+                "You are a strict evaluator. Respond in exactly two lines: "
+                "'Score: <number>' on the first line and 'Reason: <text>' on the second."
+            ),
+        )
```
158-191: Optional: consider adding guardrails for very large context lists

If contexts can be long, consider truncating or summarizing to keep prompts within provider limits/cost targets.
Would you like a helper that truncates contexts by token budget (e.g., tiktoken/litellm token counting) and appends “+N more items omitted”?
lsc_eval/runner.py (3)
48-50: Avoid double validation of evaluation data.

DataValidator.load_evaluation_data() validates, and engine.run_evaluation() validates again. Consider trusting the already-validated input or adding a skip_validation flag in the engine to avoid redundant work on large datasets.
Also applies to: 58-61
114-131: Resolve CLI paths to absolute early for predictable behavior.

Aligns with the prior learning to prefer absolute paths internally; improves logs and downstream file handling consistency.

```diff
     args = parser.parse_args()
-    # CRITICAL: Setup environment variables from system config FIRST
-    setup_environment_variables(args.system_config)
+    # Normalize to absolute paths for consistency
+    system_config_path = str(Path(args.system_config).resolve())
+    eval_data_path = str(Path(args.eval_data).resolve())
+
+    # CRITICAL: Setup environment variables from system config FIRST
+    setup_environment_variables(system_config_path)
     # Validate input files exist
-    if not Path(args.system_config).exists():
-        print(f"❌ System config file not found: {args.system_config}")
+    if not Path(system_config_path).exists():
+        print(f"❌ System config file not found: {system_config_path}")
         return 1
-    if not Path(args.eval_data).exists():
-        print(f"❌ Evaluation data file not found: {args.eval_data}")
+    if not Path(eval_data_path).exists():
+        print(f"❌ Evaluation data file not found: {eval_data_path}")
         return 1
     # Run evaluation
-    summary = run_evaluation(args.system_config, args.eval_data, args.output_dir)
+    summary = run_evaluation(system_config_path, eval_data_path, args.output_dir)
```
134-136: Add minimal tests for CLI and programmatic runner.

Even if broader tests land later, add a smoke test for:
- non-existent files → exit code 1 and user-friendly message
- happy path with stubbed engine/validator → exit code 0 and dict summary shape
I can scaffold pytest tests that monkeypatch ConfigLoader/DataValidator/EvaluationEngine to avoid I/O and assert exit codes/output. Want me to draft them?
lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (4)
18-18: Replace print statements with logging.

Library code shouldn’t print; use the module logger to respect caller logging config and avoid noisy stdout.

Apply this diff:

```diff
-        print(f"✅ Ragas Custom LLM: {self.model_name}")
+        logger.info("Ragas Custom LLM initialized: %s", self.model_name)
@@
-        except Exception as e:
-            print(f"❌ Ragas LLM failed: {e}")
-            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
+        except Exception as e:
+            logger.exception("Ragas LLM failed")
+            raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
@@
-        print("✅ Ragas LLM Manager configured")
+        logger.info("Ragas LLM Manager configured")
```

Also applies to: 57-59, 95-95
101-106: Expose full model info for consistency with other managers.

Align with `deepeval_llm.get_model_info` by surfacing `max_tokens`, `timeout`, and `num_retries`.

Apply this diff:

```diff
     def get_model_info(self) -> Dict[str, Any]:
         """Get information about the configured model."""
         return {
             "model_name": self.model_name,
-            "temperature": self.litellm_params.get("temperature", 0.0),
+            "temperature": self.litellm_params.get("temperature", 0.0),
+            "max_tokens": self.litellm_params.get("max_tokens"),
+            "timeout": self.litellm_params.get("timeout"),
+            "num_retries": self.litellm_params.get("num_retries", 3),
         }
```
91-94: Global side effects on metric singletons.

Rebinding `answer_relevancy.llm` and `faithfulness.llm` mutates module-level state and may surprise other tests/components if multiple managers are instantiated. If feasible, pass the LLM per-evaluation or document that this manager globally configures Ragas metrics.
20-27: Unify temperature handling across sync/async (optional)

I ran the ripgrep check and confirmed that the only occurrences of the `1e-08` sentinel live in lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py; no other modules or call sites depend on that exact value. If you’d like to eliminate this “magic number” and centralize the default-temperature logic, you can:

In `generate_text(...)` (lines 20–27):

- Change the signature from
  `def generate_text(…, temperature: float = 1e-08, …) -> LLMResult:`
  to
  `def generate_text(…, temperature: Optional[float] = None, …) -> LLMResult:`
- Update the temp-fallback from
  `temp = temperature if temperature != 1e-08 else self.litellm_params.get("temperature", 0.0)`
  to
  `temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)`

In `generate_text_async(...)` (lines 68–72):

- Remove the explicit fallback to `1e-08`:

```diff
- temp = temperature if temperature is not None else 1e-08
- return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+ return self.generate_text(prompt, n=n, temperature=temperature, stop=stop, callbacks=callbacks)
```

This optional refactor eliminates the numeric sentinel and lets the synchronous path alone handle defaulting to `litellm_params["temperature"]`.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- lsc_eval/pyproject.toml (1 hunks)
- lsc_eval/runner.py (1 hunks)
- lsc_eval/src/lsc_eval/__init__.py (1 hunks)
- lsc_eval/src/lsc_eval/core/config_loader.py (1 hunks)
- lsc_eval/src/lsc_eval/core/data_validator.py (1 hunks)
- lsc_eval/src/lsc_eval/evaluation_engine.py (1 hunks)
- lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/custom_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py (1 hunks)
- lsc_eval/src/lsc_eval/output/output_handler.py (1 hunks)
- lsc_eval/src/lsc_eval/output/visualization.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py
- lsc_eval/src/lsc_eval/core/config_loader.py
- lsc_eval/src/lsc_eval/__init__.py
- lsc_eval/src/lsc_eval/core/data_validator.py
- lsc_eval/pyproject.toml
- lsc_eval/src/lsc_eval/evaluation_engine.py
- lsc_eval/src/lsc_eval/output/visualization.py
- lsc_eval/src/lsc_eval/metrics/ragas_metrics.py
- lsc_eval/src/lsc_eval/output/output_handler.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-07-16T12:07:29.169Z
Learnt from: asamal4
PR: lightspeed-core/lightspeed-evaluation#19
File: lsc_agent_eval/tests/core/agent_goal_eval/test_script_runner.py:0-0
Timestamp: 2025-07-16T12:07:29.169Z
Learning: In the lsc_agent_eval package, the ScriptRunner class was modified to use absolute paths internally rather than documenting path normalization behavior, providing more predictable and consistent path handling.
Applied to files:
lsc_eval/runner.py
🧬 Code graph analysis (3)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (4)
- lsc_eval/src/lsc_eval/core/models.py (1)
  - TurnData (8-44)
- lsc_eval/src/lsc_eval/llm_managers/llm_manager.py (4)
  - LLMManager (36-167)
  - get_model_name (144-146)
  - get_litellm_params (148-156)
  - from_system_config (163-167)
- lsc_eval/src/lsc_eval/output/utils.py (1)
  - EvaluationScope (11-16)
- lsc_eval/src/lsc_eval/metrics/deepeval_metrics.py (2)
  - evaluate (62-77)
  - from_system_config (135-138)

lsc_eval/runner.py (5)
- lsc_eval/src/lsc_eval/core/config_loader.py (3)
  - ConfigLoader (179-259)
  - setup_environment_variables (30-47)
  - load_system_config (188-243)
- lsc_eval/src/lsc_eval/core/data_validator.py (2)
  - DataValidator (11-82)
  - load_evaluation_data (19-37)
- lsc_eval/src/lsc_eval/evaluation_engine.py (2)
  - EvaluationEngine (106-296)
  - run_evaluation (134-158)
- lsc_eval/src/lsc_eval/output/output_handler.py (2)
  - OutputHandler (15-224)
  - generate_reports (32-76)
- lsc_eval/src/lsc_eval/output/utils.py (1)
  - calculate_basic_stats (19-45)

lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py (1)
- lsc_eval/src/lsc_eval/llm_managers/deepeval_llm.py (2)
  - get_llm (31-33)
  - get_model_info (35-43)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (9)
- GitHub Check: black
- GitHub Check: ruff
- GitHub Check: Pyright
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
- GitHub Check: Pylinter
- GitHub Check: pydocstyle
🔇 Additional comments (4)
lsc_eval/src/lsc_eval/metrics/custom_metrics.py (2)
42-48: LGTM: public evaluate signature is typed and consistent

Good call making conv_data typed as Any and returning Tuple[Optional[float], str]. This aligns with deepeval_metrics.evaluate and unblocks mypy.

193-200: Minor: parameters prefixed with “_” are correctly typed and unused

The unused _conv_data and optional _turn_idx are properly annotated; no action needed.
lsc_eval/runner.py (2)
99-101: Good: main() has an explicit return type.

This satisfies type checkers and matches the sys.exit(main()) pattern.

20-21: No changes needed: package-level imports are valid.

The root and `core` packages both re-export the referenced symbols, so the existing imports will resolve correctly at runtime:

- `lsc_eval/src/lsc_eval/__init__.py` exports `ConfigLoader`, `DataValidator`, `EvaluationEngine`, and `OutputHandler` in its `__all__`.
- `lsc_eval/src/lsc_eval/core/__init__.py` exports `setup_environment_variables` alongside `ConfigLoader` and `DataValidator`.

You can safely keep the current imports:

```python
from lsc_eval import ConfigLoader, DataValidator, EvaluationEngine, OutputHandler
from lsc_eval.core import setup_environment_variables
```

Likely an incorrect or invalid review comment.
```python
async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
    self,
    prompt: Any,
    n: int = 1,
    temperature: Optional[float] = None,
    stop: Optional[List[str]] = None,
    callbacks: Optional[Any] = None,
) -> LLMResult:
    """Async generate."""
    temp = temperature if temperature is not None else 1e-08
    return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
```
🛠️ Refactor suggestion
Async method currently blocks the event loop; use litellm.acompletion or offload to a thread.
agenerate_text delegates to the sync method, defeating the purpose of async and potentially stalling callers.
Apply this diff:
```diff
 async def agenerate_text(  # pylint: disable=too-many-arguments,too-many-positional-arguments
     self,
     prompt: Any,
     n: int = 1,
     temperature: Optional[float] = None,
     stop: Optional[List[str]] = None,
     callbacks: Optional[Any] = None,
 ) -> LLMResult:
-    """Async generate."""
-    temp = temperature if temperature is not None else 1e-08
-    return self.generate_text(prompt, n=n, temperature=temp, stop=stop, callbacks=callbacks)
+    """Async generate using LiteLLM's acompletion when available; falls back to a thread."""
+    temp = temperature if temperature is not None else self.litellm_params.get("temperature", 0.0)
+    if not _LITELLM_AVAILABLE:
+        raise ImportError(
+            "litellm is not installed. Install with `pip install litellm` or enable the appropriate extras."
+        )
+    try:
+        if hasattr(litellm, "acompletion"):
+            response = await litellm.acompletion(
+                model=self.model_name,
+                messages=[{"role": "user", "content": str(prompt)}],
+                n=n,
+                temperature=temp,
+                max_tokens=self.litellm_params.get("max_tokens"),
+                timeout=self.litellm_params.get("timeout"),
+                num_retries=self.litellm_params.get("num_retries"),
+                stop=stop,
+            )
+            generations: List[Generation] = []
+            for choice in response.choices:  # type: ignore[attr-defined]
+                content = getattr(choice.message, "content", "") or ""
+                generations.append(Generation(text=content.strip()))
+            return LLMResult(generations=[generations])
+        # Fallback: offload sync call to a thread to avoid blocking
+        return await asyncio.to_thread(
+            self.generate_text,
+            prompt,
+            n,
+            temp,
+            stop,
+            callbacks,
+        )
+    except Exception as e:
+        logger.exception("Async Ragas LLM failed")
+        raise RuntimeError(f"Ragas LLM evaluation failed: {str(e)}") from e
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/llm_managers/ragas_llm.py around lines 61–71, the async
method agenerate_text currently calls the synchronous generate_text and blocks
the event loop; replace that blocking call with a true async implementation by
either awaiting litellm.acompletion with the same parameters (ensuring you map
its result into an LLMResult and preserve the default temperature logic) or, if
litellm.acompletion is not available, offload the sync call to a thread using
asyncio.to_thread(self.generate_text, prompt, n=n, temperature=temp, stop=stop,
callbacks=callbacks) and return its LLMResult; ensure exceptions propagate
correctly and the signature/return type remain unchanged.
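Independent of the suggested diff, the core pattern the review is asking for — offloading a blocking call with `asyncio.to_thread` so the event loop stays responsive — can be sketched in isolation. This is a minimal illustration; the function names here are hypothetical stand-ins, not code from the PR:

```python
import asyncio
import time


def slow_sync_call(prompt: str) -> str:
    """Hypothetical stand-in for a blocking LLM client call."""
    time.sleep(0.2)  # simulates network latency
    return f"echo: {prompt}"


async def offloaded(prompt: str) -> str:
    # asyncio.to_thread (Python 3.9+) runs the sync call in a worker
    # thread, so the event loop keeps serving other coroutines.
    return await asyncio.to_thread(slow_sync_call, prompt)


async def main() -> list[str]:
    start = time.perf_counter()
    results = await asyncio.gather(offloaded("a"), offloaded("b"))
    elapsed = time.perf_counter() - start
    # The two 0.2s calls overlap: elapsed is ~0.2s, not ~0.4s.
    assert elapsed < 0.35
    return list(results)


results = asyncio.run(main())
print(results)
```

Had `offloaded` called `slow_sync_call` directly, the two calls would serialize and stall every other coroutine on the loop for the duration, which is exactly the problem flagged in `agenerate_text`.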
```python
try:
    return self.supported_metrics[metric_name](
        conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
    )
except (ValueError, AttributeError, KeyError) as e:
    return None, f"Custom {metric_name} evaluation failed: {str(e)}"
```
🛠️ Refactor suggestion
Prevent hard crashes by catching _call_llm RuntimeError in evaluate()
An unhandled RuntimeError from `_call_llm` will bubble out and crash the evaluation. Include it in the caught exceptions so the API consistently returns a `(None, reason)` tuple.
```diff
-except (ValueError, AttributeError, KeyError) as e:
+except (ValueError, AttributeError, KeyError, RuntimeError) as e:
     return None, f"Custom {metric_name} evaluation failed: {str(e)}"
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 try:
     return self.supported_metrics[metric_name](
         conv_data, scope.turn_idx, scope.turn_data, scope.is_conversation
     )
-except (ValueError, AttributeError, KeyError) as e:
+except (ValueError, AttributeError, KeyError, RuntimeError) as e:
     return None, f"Custom {metric_name} evaluation failed: {str(e)}"
```
🤖 Prompt for AI Agents
In lsc_eval/src/lsc_eval/metrics/custom_metrics.py around lines 52 to 57, the
except block currently catches ValueError, AttributeError, and KeyError but
misses RuntimeError from _call_llm which can bubble up; update the except clause
to also catch RuntimeError so the function returns (None, f"Custom {metric_name}
evaluation failed: {str(e)}") consistently when _call_llm raises a RuntimeError.
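The contract this review protects — failures surface to callers as a `(None, reason)` tuple rather than an exception escaping `evaluate()` — can be illustrated with a minimal sketch. The `_call_llm_stub` below is a hypothetical stand-in for the metric's real LLM call:

```python
from typing import Optional, Tuple


def _call_llm_stub() -> float:
    """Hypothetical stand-in for the metric's LLM call; fails at runtime."""
    raise RuntimeError("LLM backend unavailable")


def evaluate(metric_name: str) -> Tuple[Optional[float], str]:
    try:
        return _call_llm_stub(), ""
    except (ValueError, AttributeError, KeyError, RuntimeError) as e:
        # With RuntimeError included, callers always receive a
        # (None, reason) tuple instead of an unhandled exception
        # aborting the whole evaluation run.
        return None, f"Custom {metric_name} evaluation failed: {e}"


score, reason = evaluate("answer_relevancy")
print(score, reason)
```

Without `RuntimeError` in the tuple, the same call would raise straight through `evaluate()` and crash the batch, which is the hard-crash scenario described above.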
fix linting issues
5b854a6 to d94afec
This is still a WIP. Most of the issues will be resolved once we make this the primary eval tool and remove the existing one.

@tisnik PTAL. If we decide not to use this, we can simply delete the folder in the future, so there is no impact at all.

Also, the Bandit issue is not related to this code change; we will have to fix it separately.
tisnik
left a comment
LGTM, very good job!
Anxhela21
left a comment
/lgtm
/approve
Generic & configurable eval tool/framework (no dependency on lightspeed-stack).
Test cases are pending; I will add them in follow-up PRs.
This is still a WIP, added to the repo for others to test.
Summary by CodeRabbit
New Features
Documentation
Chores