fix: run turn evaluation immediately after api call #105
Conversation
Assisted-by: Cursor
Walkthrough
Refactors the amendment flow from conversation-level to per-turn processing.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Proc as Processor
participant Am as Amender
participant API as API Client
participant EH as ErrorHandler
Proc->>Proc: For each turn in conversation
Proc->>Am: amend_single_turn(turn_data, conversation_id)
Am->>API: Call API with turn payload
alt API Success
API-->>Am: APIResponse
Am->>Am: Update turn_data (response, conversation_id, contexts, tool_calls)
Am-->>Proc: (None, new_conversation_id)
Proc->>Proc: Evaluate turn metrics (Step 2b)
Proc->>Proc: continue next turn (conversation_id threaded)
else API Error
API-->>Am: APIError
Am-->>Proc: (error_message, current_conversation_id)
rect rgb(255,230,230)
Note over Proc,EH: Cascade failure handling
Proc->>EH: mark_turn_metrics_as_error(conv, turn_idx, turn_data, turn_metrics, reason)
Proc->>EH: mark_cascade_failure(conv, failed_turn_idx, resolved_turn_metrics, resolved_conversation_metrics, reason)
end
Proc->>Proc: Abort further turn processing
end
Proc->>Proc: After loop: evaluate conversation-level metrics (if not cascaded)
Proc->>Proc: Cleanup & return results
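The control flow in the diagram above can be sketched roughly as follows. This is a minimal illustration only: `amend_single_turn` and the cascade-marking step mirror the names in the PR, but the data shapes, the fake API, and the surrounding scaffolding are simplified stand-ins, not the project's real classes.

```python
def amend_single_turn(turn, conversation_id, api):
    """Call the API for one turn; return (error_message, conversation_id)."""
    try:
        response = api(turn["query"], conversation_id)
    except RuntimeError as exc:
        # On failure, preserve the current conversation_id for the caller.
        return str(exc), conversation_id
    turn["response"] = response["text"]
    return None, response["conversation_id"]


def process_conversation(turns, api):
    """Per-turn loop: amend via API, then evaluate, turn by turn."""
    conversation_id = None
    results = []
    for idx, turn in enumerate(turns):
        error, conversation_id = amend_single_turn(turn, conversation_id, api)
        if error:
            # Cascade: mark this turn and every remaining turn as ERROR,
            # then abort further processing.
            results.append(("ERROR", idx, error))
            reason = f"Cascade failure from turn {idx + 1} API error: {error}"
            results.extend(
                ("ERROR", j, reason) for j in range(idx + 1, len(turns))
            )
            return results, True  # cascaded
        # Step 2b: evaluate turn metrics immediately after the API call.
        results.append(("PASS", idx, None))
    return results, False


def fake_api(query, conversation_id):
    """Stand-in API client used only for this sketch."""
    if query == "boom":
        raise RuntimeError("API unavailable")
    return {
        "text": f"answer to {query}",
        "conversation_id": conversation_id or "conv-1",
    }
```

Running this with a failing second turn yields one PASS result followed by ERROR results for the failed turn and every turn after it, with no further API calls made.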
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)
46-177: Per-turn processing and error cascade look solid; fix the line-too-long lint

The refactor of process_conversation to:
- run setup once,
- then, for each turn, do “API amend → evaluate turn metrics,”
- and only afterwards evaluate conversation-level metrics,
matches the stated goal of evaluating turns immediately after their API call. The error path that:
- marks the current turn's metrics as ERROR, then
- marks remaining turns and all conversation metrics as ERROR, and
- exits early while still running cleanup in finally,
is consistent with the new EvaluationErrorHandler helpers and avoids wasted work on failed conversations.

The only concrete issue is the pylint
line-too-long at Line 141 (remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(...)). You can resolve this by aliasing the method before calling it, e.g.:

```diff
-        cascade_error_reason = (
-            f"Cascade failure from turn {turn_idx + 1} API error: "
-            f"{api_error_message}"
-        )
-        error_handler = self.components.error_handler
-        remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(
-            conv_data,
-            turn_idx,
-            resolved_turn_metrics,
-            resolved_conversation_metrics,
-            cascade_error_reason,
-        )
+        cascade_error_reason = (
+            f"Cascade failure from turn {turn_idx + 1} API error: "
+            f"{api_error_message}"
+        )
+        error_handler = self.components.error_handler
+        mark_remaining = (
+            error_handler.mark_remaining_turns_and_conversation_as_error
+        )
+        remaining_errors = mark_remaining(
+            conv_data,
+            turn_idx,
+            resolved_turn_metrics,
+            resolved_conversation_metrics,
+            cascade_error_reason,
+        )
```

This keeps the behaviour unchanged while satisfying the linter.
🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_errors.py (1)
182-294: New per-turn error handler tests look correct and aligned with implementation

These tests exercise both mark_turn_metrics_as_error and mark_remaining_turns_and_conversation_as_error in realistic scenarios and validate:
- Field population on EvaluationResult (IDs, status, reason, query/response, execution_time).
- Correct ordering and counts of turn-level vs conversation-level errors.
- Aggregation behaviour in get_error_summary.

This gives good coverage for the new error-handling paths in the processor. You might optionally add an assertion on query/response for the remaining-turns case, but it's not strictly necessary.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- src/lightspeed_evaluation/pipeline/evaluation/amender.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
- src/lightspeed_evaluation/pipeline/evaluation/processor.py (4 hunks)
- tests/unit/pipeline/evaluation/test_amender.py (3 hunks)
- tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.
Applied to files:
- tests/unit/pipeline/evaluation/test_errors.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Applied to files:
src/lightspeed_evaluation/pipeline/evaluation/amender.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/amender.py
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.
Applied to files:
- src/lightspeed_evaluation/pipeline/evaluation/errors.py
- src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
- EvaluationErrorHandler (10-201)
- mark_turn_metrics_as_error (79-125)
- get_error_summary (195-201)
- mark_remaining_turns_and_conversation_as_error (127-193)
src/lightspeed_evaluation/core/models/data.py (2)
- TurnData (35-261)
- EvaluationData (264-311)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
- EvaluationData (264-311)
- TurnData (35-261)
src/lightspeed_evaluation/core/api/client.py (1)
- query (71-105)
src/lightspeed_evaluation/core/system/exceptions.py (1)
- APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
- EvaluationData (264-311)
- EvaluationResult (314-353)
- TurnData (35-261)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
- EvaluationData (264-311)
- EvaluationResult (314-353)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
- amend_single_turn (20-69)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
- mark_turn_metrics_as_error (79-125)
- mark_remaining_turns_and_conversation_as_error (127-193)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
- APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (1)
- TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
- APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (2)
- APIDataAmender (13-80)
- amend_single_turn (20-69)
src/lightspeed_evaluation/core/api/client.py (1)
- query (71-105)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/processor.py
[error] 141-141: pylint: C0301 Line too long (104/100) (line-too-long) in processor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: mypy
- GitHub Check: tests (3.13)
- GitHub Check: tests (3.11)
- GitHub Check: tests (3.12)
🔇 Additional comments (3)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
7-69: Per-turn amendment logic is consistent and matches the new contract
amend_single_turn cleanly encapsulates the per-turn API call and in-place mutation of TurnData, returning a simple (error_message, updated_conversation_id) tuple. It:
- Always performs a fresh API call when a client is present, independent of existing response/tool_calls, in line with the framework's expectations.
- Correctly threads conversation_id through calls and preserves it on API failure.
- Avoids setting empty contexts/tool_calls, which keeps TurnData compatible with its validators and the tests.

No functional issues here from my side.
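The "avoid setting empty fields" behaviour noted in this comment can be illustrated with a small sketch. The field names follow the review; the `TurnData` stand-in and the `apply_api_response` helper below are hypothetical, not the project's real model or method.

```python
class TurnData:
    """Simplified stand-in for the project's TurnData model."""

    def __init__(self):
        self.response = None
        self.contexts = None    # validator-sensitive: must stay None, not []
        self.tool_calls = None


def apply_api_response(turn, response, contexts, tool_calls):
    """Only overwrite optional fields when the API actually returned data,
    so a validator that rejects empty lists is never tripped."""
    turn.response = response
    if contexts:        # empty list -> field stays None
        turn.contexts = contexts
    if tool_calls:
        turn.tool_calls = tool_calls
    return turn
```

With an empty `contexts` list from the API, `turn.contexts` remains `None` rather than becoming `[]`.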
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
5-193: New error-handling helpers are well-factored and consistent with existing patterns

The additions of:
- mark_turn_metrics_as_error(...) for a single turn, and
- mark_remaining_turns_and_conversation_as_error(...) for cascading failures
fit cleanly with mark_all_metrics_as_error and get_error_summary:
- They operate on pre-resolved metrics passed from ConversationProcessor, respecting the existing division of responsibilities.
- They produce EvaluationResult objects with the same ERROR shape (empty response, 0.0 execution_time, reason populated) used elsewhere.
- self.results is updated so summaries aggregate errors across multiple calls and conversations, as covered by the tests.

No changes needed here.
tests/unit/pipeline/evaluation/test_amender.py (1)
3-214: Tests comprehensively cover the new amend_single_turn behaviour

The updated tests validate all key aspects of the per-turn amender:
- Correct tuple return semantics for success, no-client, and API error cases.
- Proper propagation and use of conversation_id, including follow-up turns.
- In-place mutation of TurnData (response, conversation_id, contexts, tool_calls), including the intended behaviour when contexts/tool_calls are empty (fields remain None).
- Attachment handling via attachments being forwarded into the API client.

This suite gives strong confidence in the new per-turn API amendment flow.
a1c716c to a52690e
@VladimirKadlec @tisnik PTAL
VladimirKadlec
left a comment
LGTM.
tisnik
left a comment
LGTM
fix: run turn evaluation immediately after api call
Currently we make the API calls for all queries in the conversation and only then start the evaluation. This PR modifies that logic to run turn evaluation immediately after each API call.
The code can be further modularized, but that will be done later, as it would involve unrelated logic changes.
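The behavioural change described here — moving from "amend all turns, then evaluate" to "evaluate each turn right after its API call" — can be sketched as follows. The function names are illustrative only, not the project's real API.

```python
def run_before(turns, call_api, evaluate):
    """Old flow: all API calls first, then all evaluations."""
    responses = [call_api(turn) for turn in turns]
    return [evaluate(turn, resp) for turn, resp in zip(turns, responses)]


def run_after(turns, call_api, evaluate):
    """New flow: evaluate each turn immediately after its API call."""
    results = []
    for turn in turns:
        results.append(evaluate(turn, call_api(turn)))
    return results
```

Both flows produce the same results when every call succeeds; the difference is interleaving, which lets the new flow stop evaluating (and calling the API) as soon as one turn fails.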