
Conversation


@asamal4 asamal4 commented Nov 19, 2025

Currently we make an API call for every query in the conversation and only then start the evaluation. This PR modifies that logic to run turn evaluation immediately after each API call.

  • Avoids unnecessary evaluation of follow-up turns/queries when the API call has failed.
  • Script evaluation needs to run immediately after the agent action for the turn.

The code can be modularized further, but that is deferred as it would involve unrelated logic changes.
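The change described above can be sketched as a per-turn loop (a minimal illustration only, not the actual implementation; `Turn`, `process_conversation`, and the callbacks are hypothetical stand-ins for the framework's real types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Turn:
    query: str
    response: Optional[str] = None

def process_conversation(turns, call_api, evaluate):
    """Amend and evaluate each turn immediately; stop on API failure."""
    results = []
    for idx, turn in enumerate(turns):
        error = call_api(turn)  # per-turn API call
        if error is not None:
            # Skip evaluation for this and all follow-up turns instead of
            # evaluating them against missing responses.
            results.extend(("ERROR", t.query) for t in turns[idx:])
            break
        results.append(evaluate(turn))  # evaluate right after the call
    return results
```

The key difference from the old flow: evaluation runs inside the loop, so a failed API call short-circuits everything downstream instead of wasting evaluation effort on turns with no response.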

Summary by CodeRabbit

  • Refactor
    • Evaluation now processes and amends model responses per turn (threads conversation context across turns) for finer-grained handling and clearer per-turn outcomes.
  • Bug Fixes
    • On API errors, remaining turns and conversation-level metrics are automatically marked as failed to avoid misleading results.
  • Tests
    • Added unit tests covering turn-level error marking and cascade-failure behavior.


coderabbitai bot commented Nov 19, 2025

Walkthrough

Refactors the amendment flow from conversation-level to per-turn processing. Replaces amend_conversation_data with amend_single_turn(turn_data, conversation_id=None) that returns (error_message, conversation_id). Adds per-turn cascade error marking methods and updates the conversation processor to call the amender per turn and handle cascade failures.

Changes

Cohort / File(s) Summary
Amender (per-turn API)
src/lightspeed_evaluation/pipeline/evaluation/amender.py
Replaces amend_conversation_data(conv_data) with amend_single_turn(turn_data, conversation_id=None). Accepts a single TurnData, performs the API call, updates turn_data in-place (response, conversation_id, contexts, tool_calls/attachments), and returns (error_message, conversation_id). Logs per-turn actions and returns API errors as (error_message, conversation_id).
Error handling additions
src/lightspeed_evaluation/pipeline/evaluation/errors.py
Adds mark_turn_metrics_as_error(conv_data, turn_idx, turn_data, turn_metrics, error_reason) and mark_remaining_turns_and_conversation_as_error(conv_data, failed_turn_idx, resolved_turn_metrics, resolved_conversation_metrics, error_reason) on EvaluationErrorHandler. These create ERROR EvaluationResult entries for specific turns and for remaining turns + conversation-level metrics, and log warnings. Also exposes TurnData in imports.
Processor: per-turn flow & cascade semantics
src/lightspeed_evaluation/pipeline/evaluation/processor.py
Changes process_conversation() to call amend_single_turn() per turn, thread conversation_id across calls, run per-turn metric evaluation after amendment, and on per-turn API error call error-handler methods to mark current and remaining metrics as ERROR and abort further processing. Reorders finalization/cleanup steps accordingly.
Tests — amender
tests/unit/pipeline/evaluation/test_amender.py
Updates tests to call amend_single_turn() with TurnData and optional conversation_id; asserts tuple return (error_msg, conversation_id). Adds/adjusts cases for missing API client, existing conversation_id, tool_calls, attachments, empty contexts, and API error formatting.
Tests — errors
tests/unit/pipeline/evaluation/test_errors.py
Adds tests test_mark_turn_metrics_as_error() and test_mark_cascade_failure() to validate per-turn ERROR results and cascade marking of remaining turns and conversation-level metrics.
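Based on the summary above, the new amender contract might look roughly like the following (a hedged sketch; the dict-based `turn_data` and the `call_api` callback are simplifications of the real `TurnData` model and API client, and field names are assumptions):

```python
from typing import Callable, Dict, Optional, Tuple

class APIError(Exception):
    """Stand-in for the framework's APIError."""

def amend_single_turn(
    turn_data: Dict, call_api: Callable, conversation_id: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Call the API for one turn, update turn_data in place, and
    return (error_message, conversation_id)."""
    try:
        resp = call_api(turn_data["query"], conversation_id)
    except APIError as exc:
        # Preserve the current conversation_id on failure.
        return (f"API error: {exc}", conversation_id)
    turn_data["response"] = resp["response"]
    # Only set optional fields when non-empty, so empty lists do not
    # overwrite validator-friendly defaults.
    if resp.get("contexts"):
        turn_data["contexts"] = resp["contexts"]
    if resp.get("tool_calls"):
        turn_data["tool_calls"] = resp["tool_calls"]
    return (None, resp.get("conversation_id", conversation_id))
```

The tuple return lets the processor thread `conversation_id` into the next turn's call while treating a non-None first element as the signal to start cascade failure handling.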

Sequence Diagram(s)

sequenceDiagram
    participant Proc as Processor
    participant Am as Amender
    participant API as API Client
    participant EH as ErrorHandler

    Proc->>Proc: For each turn in conversation
    Proc->>Am: amend_single_turn(turn_data, conversation_id)
    Am->>API: Call API with turn payload

    alt API Success
        API-->>Am: APIResponse
        Am->>Am: Update turn_data (response, conversation_id, contexts, tool_calls)
        Am-->>Proc: (None, new_conversation_id)
        Proc->>Proc: Evaluate turn metrics (Step 2b)
        Proc->>Proc: continue next turn (conversation_id threaded)
    else API Error
        API-->>Am: APIError
        Am-->>Proc: (error_message, current_conversation_id)
        rect rgb(255,230,230)
        Note over Proc,EH: Cascade failure handling
        Proc->>EH: mark_turn_metrics_as_error(conv, turn_idx, turn_data, turn_metrics, reason)
        Proc->>EH: mark_remaining_turns_and_conversation_as_error(conv, failed_turn_idx, resolved_turn_metrics, resolved_conversation_metrics, reason)
        end
        Proc->>Proc: Abort further turn processing
    end

    Proc->>Proc: After loop: evaluate conversation-level metrics (if not cascaded)
    Proc->>Proc: Cleanup & return results
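The cascade branch in the diagram could be approximated as follows (illustrative only; the real helpers build `EvaluationResult` objects and append to the handler's state rather than returning plain dicts):

```python
def mark_turn_metrics_as_error(turn_idx, turn_metrics, reason):
    """ERROR results for the turn whose API call failed."""
    return [{"turn": turn_idx, "metric": m, "status": "ERROR", "reason": reason}
            for m in turn_metrics]

def mark_remaining_turns_and_conversation_as_error(
    num_turns, failed_idx, turn_metrics, conversation_metrics, reason
):
    """ERROR results for every later turn plus all conversation-level metrics."""
    results = [{"turn": idx, "metric": m, "status": "ERROR", "reason": reason}
               for idx in range(failed_idx + 1, num_turns)
               for m in turn_metrics]
    results += [{"turn": None, "metric": m, "status": "ERROR", "reason": reason}
                for m in conversation_metrics]
    return results
```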

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • src/lightspeed_evaluation/pipeline/evaluation/processor.py: correctness of conversation_id threading, early-exit/cascade semantics, and that cleanup still runs in failure paths.
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py: ensure EvaluationResult fields for ERROR entries match expected schema and that counts/summaries are accurate.
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py and related tests: verify in-place TurnData updates and exact tuple return semantics are consistent across success/error branches.

Suggested reviewers

  • VladimirKadlec
  • tisnik

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately captures the main change: shifting from batch API processing to immediate per-turn evaluation after each API call.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (1)

46-177: Per-turn processing and error cascade look solid; fix the line-too-long lint

The refactor of process_conversation to:

  • run setup once,
  • then, for each turn, do "API amend → evaluate turn metrics,"
  • and only afterwards evaluate conversation-level metrics

matches the stated goal of evaluating turns immediately after their API call. The error path that:

  • marks the current turn's metrics as ERROR, then
  • marks remaining turns and all conversation metrics as ERROR, and
  • exits early while still running cleanup in finally

is consistent with the new EvaluationErrorHandler helpers and avoids wasted work on failed conversations.

The only concrete issue is the pylint line-too-long at Line 141 (remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(...)). You can resolve this by aliasing the method before calling it, e.g.:

-                        cascade_error_reason = (
-                            f"Cascade failure from turn {turn_idx + 1} API error: "
-                            f"{api_error_message}"
-                        )
-                        error_handler = self.components.error_handler
-                        remaining_errors = error_handler.mark_remaining_turns_and_conversation_as_error(
-                            conv_data,
-                            turn_idx,
-                            resolved_turn_metrics,
-                            resolved_conversation_metrics,
-                            cascade_error_reason,
-                        )
+                        cascade_error_reason = (
+                            f"Cascade failure from turn {turn_idx + 1} API error: "
+                            f"{api_error_message}"
+                        )
+                        error_handler = self.components.error_handler
+                        mark_remaining = (
+                            error_handler.mark_remaining_turns_and_conversation_as_error
+                        )
+                        remaining_errors = mark_remaining(
+                            conv_data,
+                            turn_idx,
+                            resolved_turn_metrics,
+                            resolved_conversation_metrics,
+                            cascade_error_reason,
+                        )

This keeps the behaviour unchanged while satisfying the linter.

🧹 Nitpick comments (1)
tests/unit/pipeline/evaluation/test_errors.py (1)

182-294: New per-turn error handler tests look correct and aligned with implementation

These tests exercise both mark_turn_metrics_as_error and mark_remaining_turns_and_conversation_as_error in realistic scenarios and validate:

  • Field population on EvaluationResult (IDs, status, reason, query/response, execution_time).
  • Correct ordering and counts of turn-level vs conversation-level errors.
  • Aggregation behaviour in get_error_summary.

This gives good coverage for the new error-handling paths in the processor. You might optionally add an assertion on query/response for the remaining-turns case, but it’s not strictly necessary.
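A test along these lines might assert the ERROR-result shape like so (hypothetical field and function names; the real `EvaluationResult` schema and test helpers may differ):

```python
def make_error_result(conversation_id, turn_id, metric, reason):
    """Build the ERROR-shaped record the review describes: empty response,
    zero execution time, reason populated."""
    return {
        "conversation_id": conversation_id,
        "turn_id": turn_id,
        "metric": metric,
        "status": "ERROR",
        "reason": reason,
        "response": "",
        "execution_time": 0.0,
    }

def test_error_result_shape():
    result = make_error_result("conv_1", 2, "answer_correctness", "API error: timeout")
    assert result["status"] == "ERROR"
    assert result["response"] == ""
    assert result["execution_time"] == 0.0
    assert result["reason"].startswith("API error")

test_error_result_shape()
```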

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4518a87 and a1c716c.

📒 Files selected for processing (5)
  • src/lightspeed_evaluation/pipeline/evaluation/amender.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py (2 hunks)
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py (4 hunks)
  • tests/unit/pipeline/evaluation/test_amender.py (3 hunks)
  • tests/unit/pipeline/evaluation/test_errors.py (1 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/core/output/generator.py:140-145
Timestamp: 2025-09-11T12:47:06.747Z
Learning: User asamal4 prefers that non-critical comments are sent when actual code changes are pushed, not on unrelated commits.
📚 Learning: 2025-09-19T12:32:06.403Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:18-31
Timestamp: 2025-09-19T12:32:06.403Z
Learning: When analyzing method calls, always examine the complete call site including all parameters before suggesting fixes. In the lightspeed-evaluation codebase, mark_all_metrics_as_error in processor.py correctly passes both resolved_turn_metrics and resolved_conversation_metrics parameters.

Applied to files:

  • tests/unit/pipeline/evaluation/test_errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
📚 Learning: 2025-09-09T14:58:10.630Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 47
File: src/lightspeed_evaluation/pipeline/evaluation/amender.py:32-41
Timestamp: 2025-09-09T14:58:10.630Z
Learning: In the lightspeed-evaluation framework, when API is enabled, every turn should make a fresh API call regardless of whether the turn already has response or tool_calls data. This ensures consistency and fresh responses for each evaluation run.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
📚 Learning: 2025-09-18T23:59:37.026Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/core/system/validator.py:146-155
Timestamp: 2025-09-18T23:59:37.026Z
Learning: In the lightspeed-evaluation project, the DataValidator in `src/lightspeed_evaluation/core/system/validator.py` is intentionally designed to validate only explicitly provided user evaluation data, not resolved metrics that include system defaults. When turn_metrics is None, the system falls back to system config defaults, and this validation separation is by design.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/amender.py
  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
📚 Learning: 2025-09-19T00:37:23.798Z
Learnt from: asamal4
Repo: lightspeed-core/lightspeed-evaluation PR: 55
File: src/lightspeed_evaluation/pipeline/evaluation/errors.py:33-36
Timestamp: 2025-09-19T00:37:23.798Z
Learning: In the lightspeed-evaluation codebase, metric resolution (including applying defaults when turn_metrics is None) happens upstream in ConversationProcessor.process_conversation() using MetricManager.resolve_metrics(), not in the EvaluationErrorHandler. The error handler only marks explicitly defined metrics as ERROR.

Applied to files:

  • src/lightspeed_evaluation/pipeline/evaluation/errors.py
  • src/lightspeed_evaluation/pipeline/evaluation/processor.py
🧬 Code graph analysis (5)
tests/unit/pipeline/evaluation/test_errors.py (2)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (4)
  • EvaluationErrorHandler (10-201)
  • mark_turn_metrics_as_error (79-125)
  • get_error_summary (195-201)
  • mark_remaining_turns_and_conversation_as_error (127-193)
src/lightspeed_evaluation/core/models/data.py (2)
  • TurnData (35-261)
  • EvaluationData (264-311)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • TurnData (35-261)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)
src/lightspeed_evaluation/core/models/data.py (3)
  • EvaluationData (264-311)
  • EvaluationResult (314-353)
  • TurnData (35-261)
src/lightspeed_evaluation/pipeline/evaluation/processor.py (3)
src/lightspeed_evaluation/core/models/data.py (2)
  • EvaluationData (264-311)
  • EvaluationResult (314-353)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)
  • amend_single_turn (20-69)
src/lightspeed_evaluation/pipeline/evaluation/errors.py (2)
  • mark_turn_metrics_as_error (79-125)
  • mark_remaining_turns_and_conversation_as_error (127-193)
tests/unit/pipeline/evaluation/test_amender.py (5)
src/lightspeed_evaluation/core/models/api.py (1)
  • APIResponse (80-116)
src/lightspeed_evaluation/core/models/data.py (1)
  • TurnData (35-261)
src/lightspeed_evaluation/core/system/exceptions.py (1)
  • APIError (8-9)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (2)
  • APIDataAmender (13-80)
  • amend_single_turn (20-69)
src/lightspeed_evaluation/core/api/client.py (1)
  • query (71-105)
🪛 GitHub Actions: Python linter
src/lightspeed_evaluation/pipeline/evaluation/processor.py

[error] 141-141: pylint: C0301 Line too long (104/100) (line-too-long) in processor.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: mypy
  • GitHub Check: tests (3.13)
  • GitHub Check: tests (3.11)
  • GitHub Check: tests (3.12)
🔇 Additional comments (3)
src/lightspeed_evaluation/pipeline/evaluation/amender.py (1)

7-69: Per-turn amendment logic is consistent and matches the new contract

amend_single_turn cleanly encapsulates the per-turn API call and in-place mutation of TurnData, returning a simple (error_message, updated_conversation_id) tuple. It:

  • Always performs a fresh API call when a client is present, independent of existing response/tool_calls, in line with the framework’s expectations.
  • Correctly threads conversation_id through calls and preserves it on API failure.
  • Avoids setting empty contexts/tool_calls, which keeps TurnData compatible with its validators and the tests.

No functional issues here from my side.

src/lightspeed_evaluation/pipeline/evaluation/errors.py (1)

5-193: New error-handling helpers are well-factored and consistent with existing patterns

The additions of:

  • mark_turn_metrics_as_error(...) for a single turn, and
  • mark_remaining_turns_and_conversation_as_error(...) for cascading failures

fit cleanly with mark_all_metrics_as_error and get_error_summary:

  • They operate on pre-resolved metrics passed from ConversationProcessor, respecting the existing division of responsibilities.
  • They produce EvaluationResult objects with the same ERROR-shape (empty response, 0.0 execution_time, reason populated) used elsewhere.
  • self.results is updated so summaries aggregate errors across multiple calls and conversations, as covered by the tests.

No changes needed here.

tests/unit/pipeline/evaluation/test_amender.py (1)

3-214: Tests comprehensively cover the new amend_single_turn behaviour

The updated tests validate all key aspects of the per-turn amender:

  • Correct tuple return semantics for success, no-client, and API error cases.
  • Proper propagation and use of conversation_id, including follow-up turns.
  • In-place mutation of TurnData (response, conversation_id, contexts, tool_calls), including the intended behaviour when contexts/tool_calls are empty (fields remain None).
  • Attachment handling via attachments being forwarded into the API client.

This suite gives strong confidence in the new per-turn API amendment flow.

@asamal4 asamal4 changed the title run turn evaluation immediately after api call fix: run turn evaluation immediately after api call Nov 19, 2025

asamal4 commented Nov 21, 2025

@VladimirKadlec @tisnik PTAL


@VladimirKadlec VladimirKadlec left a comment


LGTM.


@tisnik tisnik left a comment


LGTM

@tisnik tisnik merged commit 686ae36 into lightspeed-core:main Nov 21, 2025
15 checks passed
bsatapat-jpg pushed a commit to bsatapat-jpg/lightspeed-evaluation that referenced this pull request Nov 24, 2025
fix: run turn evaluation immediately after api call