[nit] Clean up evaluation_data.yaml #52
Conversation
This update makes sure the evaluation_data.yaml adheres to the models [1][2].
- Standardize turn_id format to strings
- Simplify contexts structure from objects to arrays
- Add missing description field to conversation group 3

Co-Authored-By: Claude <[email protected]>

[1] https://github.com/lightspeed-core/lightspeed-evaluation/blob/b629ddc56912cfa6ce42eee06569c9bc7b27cfd9/src/lightspeed_evaluation/core/models/data.py#L31
[2] https://github.com/lightspeed-core/lightspeed-evaluation/blob/b629ddc56912cfa6ce42eee06569c9bc7b27cfd9/src/lightspeed_evaluation/core/models/data.py#L146
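For orientation, a minimal sketch of the shape these bullet points imply for a turn entry; only turn_id, contexts, and the group-level description are taken from the change description, every other key and value is an illustrative assumption:

```yaml
# Hypothetical post-cleanup shape; field names other than turn_id, contexts,
# and description are assumptions, and all values are made up.
- conversation_group_id: conv_group_3
  description: "What this conversation group exercises"   # previously missing
  turns:
    - turn_id: "1"                      # string, not integer
      query: "User query"
      response: "Model response"
      contexts:                         # plain strings instead of objects
        - "First retrieved context passage"
        - "Second retrieved context passage"
```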
Walkthrough
Updates two YAML configs: evaluation data schema changes (IDs stringified, context entry structure adjusted, group fields added/removed, and per-turn metric threshold restructured) and system config enhancements (adds turn- and conversation-level metrics, extends CSV output columns, and enables additional visualization graphs).
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Runner as Evaluation Runner
participant Data as evaluation_data.yaml
participant System as system.yaml
participant Metrics as Metric Engines
participant CSV as CSV Writer
participant Viz as Visualizer
Runner->>Data: Load conversation groups, turns, contexts
Note over Data: turn_id: string<br/>contexts: string list<br/>per-turn thresholds (faithfulness)
Runner->>System: Load metrics metadata and output schema
Note over System: +Turn metrics (ragas:*)<br/>+Conversation metrics (deepeval:*)<br/>+CSV columns & graphs
loop For each conversation/turn
Runner->>Metrics: Evaluate turn-level metrics
Metrics-->>Runner: Scores (e.g., response_relevancy, context_precision...)
alt Conversation-level phase
Runner->>Metrics: Evaluate conversation-level metrics
Metrics-->>Runner: Scores (conversation_relevancy, knowledge_retention)
end
Runner->>CSV: Write row with new columns (result, reason, query, response, execution_time)
end
Runner->>Viz: Generate graphs (score_distribution, heatmap, status_breakdown)
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 passed
tisnik
left a comment
LGTM
Actionable comments posted: 0
🧹 Nitpick comments (7)
config/system.yaml (5)
34-37: Clarify default vs per-group override for faithfulness threshold.
System default is 0.8 here, while config/evaluation_data.yaml sets 0.99 for conv_group_1. Please confirm precedence and consider clarifying in-line.
Proposed tweak to the description to reduce ambiguity:
"ragas:faithfulness": threshold: 0.8 - description: "How faithful the response is to the provided context" + description: "How faithful the response is to the provided context (default; can be overridden per conversation group)"
64-66: Missing threshold handling for custom:tool_eval.
If pass/fail is computed, define a threshold; otherwise ensure the pipeline tolerates metrics without thresholds and doesn’t mark them as failures.
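Two ways this could be resolved, shown as a hedged sketch; the threshold value and description text are invented:

```yaml
# Option A: give the metric an explicit threshold so pass/fail is well defined.
"custom:tool_eval":
  threshold: 1.0                      # assumed pass criterion
  description: "Tool-call evaluation" # placeholder text
# Option B: omit threshold entirely and have the pipeline report the score
# as informational without marking the metric as failed.
```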
91-103: CSV schema expanded; ensure writer emits new columns.
Confirm the CSV writer populates result, reason, query, response, and execution_time to avoid empty columns or KeyErrors.
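A rough sketch of what the expanded column list could look like; the enclosing output/csv_columns key names are assumptions, only the five new column names come from this comment:

```yaml
output:
  csv_columns:
    - result          # pass/fail outcome per metric
    - reason          # explanation string returned by the metric
    - query           # raw user query for the turn
    - response        # model response that was scored
    - execution_time  # time spent evaluating the metric
```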
109-114: New visualization graph types; verify implementation exists.
Check that score_distribution, conversation_heatmap, and status_breakdown are implemented and feature-flagged here, not just documented.
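The enabled graph set might be expressed along these lines; the visualization/graphs key names are assumptions, the three graph types are the ones named above:

```yaml
visualization:
  graphs:
    - score_distribution    # histogram of scores per metric
    - conversation_heatmap  # conversation-by-metric score grid
    - status_breakdown      # pass/fail counts
```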
136-144: Logger names may be case-sensitive; align with library logger IDs.
Python logger names are exact matches. Consider lowercase for litellm/deepeval if those are the actual logger names.
Proposed tweak (adjust if your code confirms different names):
  package_overrides:
-   httpx: ERROR
-   urllib3: ERROR
-   requests: ERROR
-   matplotlib: ERROR
-   LiteLLM: WARNING
-   DeepEval: WARNING
-   ragas: WARNING
+   httpx: ERROR
+   urllib3: ERROR
+   requests: ERROR
+   matplotlib: ERROR
+   litellm: WARNING
+   deepeval: WARNING
+   ragas: WARNING

config/evaluation_data.yaml (2)
1-2: Consider adding an explicit schema version to lock format changes.
Helps downstream tooling validate shape changes (string IDs, contexts as strings) and ease future migrations.
Proposed header:
  # LightSpeed Evaluation Framework - Sample/Mock Data
+
+schema_version: 2
17-25: Minor consistency nit: normalize sample text casing.
"User query" vs "User Query" can be standardized to reduce brittle tests that compare raw text.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
config/evaluation_data.yaml (1 hunks)
config/system.yaml (3 hunks)
🔇 Additional comments (8)
config/system.yaml (3)
38-58: New RAGAS turn-level metrics look consistent; verify evaluator support and inputs.
Identifiers and thresholds align with evaluation_data.yaml. Please verify the scoring code recognizes these metric keys and that required inputs (reference/contexts) are present for each metric.
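For reference, one plausible layout for the turn-level block; the turn_metrics key name and all thresholds except the 0.8 faithfulness default are assumptions, and the metric identifiers are taken from the surrounding review:

```yaml
turn_metrics:
  "ragas:faithfulness":
    threshold: 0.8   # system default discussed above
    description: "How faithful the response is to the provided context"
  "ragas:response_relevancy":
    threshold: 0.7   # assumed value
  "ragas:context_precision":
    threshold: 0.7   # assumed value
```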
74-80: Conversation-level DeepEval metrics added; confirm availability.
Looks good. Please confirm deepeval is installed and these identifiers match the implementation names.
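A matching sketch for the conversation-level block; the key name and thresholds are assumptions, while the two metric identifiers appear in the evaluation_data.yaml review below:

```yaml
conversation_metrics:
  "deepeval:conversation_completeness":
    threshold: 0.7   # assumed value
  "deepeval:conversation_relevancy":
    threshold: 0.7   # assumed value
```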
117-122: Env flags for DeepEval/LiteLLM logging/telemetry: LGTM.

config/evaluation_data.yaml (5)
11-16: Per-group threshold override for ragas:faithfulness.
Setting 0.99 here is fine if it overrides system defaults. Please confirm the loader applies per-group turn_metrics_metadata over system.yaml thresholds.
47-48: Added description for conv_group_3: LGTM.
55-57: Conversation metrics listed match system.yaml additions.
Looks consistent with deepeval:conversation_completeness and deepeval:conversation_relevancy.
61-74: Additional turns with string IDs: LGTM.
IDs are unique within the group and fields align with the schema.
18-24: Verified: turn_id is a string and contexts-as-strings are supported (no change required).
TurnData.turn_id is annotated as str (src/lightspeed_evaluation/core/models/data.py); contexts are normalized to accept either dicts with "content" or plain strings (src/lightspeed_evaluation/core/metrics/ragas.py and existing validators/tests); CSV writer emits attribute values as-is (src/lightspeed_evaluation/core/output/generator.py).
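To illustrate the two context shapes this verification says are accepted (values are made up):

```yaml
# Plain-string form, as used after this cleanup:
contexts:
  - "Retrieved passage one"
  - "Retrieved passage two"
---
# Object form with a content key, still accepted by the normalizer:
contexts:
  - content: "Retrieved passage one"
  - content: "Retrieved passage two"
```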
asamal4
left a comment
Thanks.. I forgot to change this..
LGTM
@asamal4 no problem!