Skip to content

Python: Skip get_final_response in OTel _finalize_stream when stream errored#5232

Merged
moonbox3 merged 7 commits intomicrosoft:mainfrom
droideronline:dinesh/fix-otel-finalize-stream-result-hooks-on-error
Apr 14, 2026
Merged

Python: Skip get_final_response in OTel _finalize_stream when stream errored#5232
moonbox3 merged 7 commits intomicrosoft:mainfrom
droideronline:dinesh/fix-otel-finalize-stream-result-hooks-on-error

Conversation

@droideronline
Copy link
Copy Markdown
Contributor

@droideronline droideronline commented Apr 13, 2026

Closes #5231

Problem

AgentTelemetryLayer registers _finalize_stream as a cleanup hook on the ResponseStream. Cleanup hooks run on both normal completion and streaming errors. Inside _finalize_stream, get_final_response() was called unconditionally — and get_final_response() always runs all registered result hooks, including _post_hook in _agents.py which calls _run_after_providers.

This caused after_run context providers to fire on error paths, which is incorrect.

The call chain on streaming error:

stream raises Exception
  → ResponseStream.__anext__ runs _run_cleanup_hooks()
    → _finalize_stream() called
      → get_final_response() called unconditionally   ← bug
        → result hooks fire (e.g. _post_hook)
          → _run_after_providers called               ← wrong on error

Fix

_types.pyResponseStream

  • Store the exception in _stream_error when __anext__ catches one, then clear it after _run_cleanup_hooks() completes to avoid retaining the traceback and any large object graphs it references.
  • Expose two public properties: consumed: bool (True only after normal StopAsyncIteration) and stream_error: Exception | None.

observability.py — both _finalize_stream closures (chat and agent layers)

  • Key the early-return off result_stream.stream_error is not None — this is precise to actual errors rather than any non-normal completion state, so edge cases where consumed is False but no error is set still proceed to get_final_response().
  • Call capture_exception(span, result_stream.stream_error, ...) before returning so the OTel span records the failure rather than closing silently.

Notes

Affects both _trace_chat_client_invocation (chat layer) and _trace_agent_invocation (agent layer) — both had the same pattern and are fixed here.

Copilot AI review requested due to automatic review settings April 13, 2026 17:37
…errored

When a streaming error occurs, _finalize_stream (a cleanup hook registered by
AgentTelemetryLayer) was unconditionally calling get_final_response(), which
triggers all registered result hooks including after_run context providers.
This caused providers to fire incorrectly on error paths.

Guard against this by checking result_stream._consumed: True only after
StopAsyncIteration (normal completion), False when an exception was raised.
The fix applies to both the chat client and agent telemetry layers.

Closes microsoft#5231
@droideronline droideronline force-pushed the dinesh/fix-otel-finalize-stream-result-hooks-on-error branch from ec7ad25 to b345761 Compare April 13, 2026 17:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes OTel streaming instrumentation so result hooks (e.g., after_run context providers) don’t fire when a stream errors, by skipping get_final_response() unless the stream completed normally.

Changes:

  • Add a _consumed guard in the chat stream _finalize_stream to avoid calling get_final_response() on error paths.
  • Add the same guard in the agent stream _finalize_stream.

Comment thread python/packages/core/agent_framework/observability.py Outdated
Comment thread python/packages/core/agent_framework/observability.py Outdated
…ror in OTel span

Address Copilot review feedback on microsoft#5232:

- Add `_stream_error: Exception | None` to ResponseStream, set in __anext__'s
  except branch so cleanup hooks can inspect the failure.
- Expose public `consumed` and `stream_error` properties to avoid coupling
  observability.py to private stream internals.
- Update both _finalize_stream closures (chat and agent layers) to use the
  public properties and call capture_exception() with the stream error before
  returning early, ensuring the OTel span records the failure rather than
  closing silently.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread python/packages/core/agent_framework/observability.py Outdated
Comment thread python/packages/core/agent_framework/_types.py Outdated
- Use stream_error is not None as the guard in _finalize_stream instead of
  not consumed, so the early-return path is keyed precisely to actual errors
  rather than any non-normal completion state.
- Clear _stream_error after _run_cleanup_hooks() completes to avoid retaining
  the exception traceback (and any large object graphs it references) on the
  stream instance beyond the cleanup phase.
@droideronline
Copy link
Copy Markdown
Contributor Author

@moonbox3 @eavanvalkenburg @TaoChenOSU - Would appreciate a look when you get a chance.

Copy link
Copy Markdown
Member

@eavanvalkenburg eavanvalkenburg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, otherwise looks good and thanks

Comment thread python/packages/core/agent_framework/_types.py Outdated
droideronline and others added 2 commits April 14, 2026 13:21
…rectly

Per review feedback: since observability.py and _types.py are in the same
package, accessing _stream_error directly is fine and the public properties
are unnecessary.
@moonbox3
Copy link
Copy Markdown
Contributor

moonbox3 commented Apr 14, 2026

Python Test Coverage

Python Test Coverage Report •
FileStmtsMissCoverMissing
packages/core/agent_framework
   _types.py10869191%58, 67–68, 122, 127, 146, 148, 152, 156, 158, 160, 162, 180, 184, 210, 232, 237, 242, 246, 276, 687–688, 847–848, 1233, 1305, 1340, 1360, 1370, 1422, 1554–1556, 1738, 1841–1846, 1871, 1959, 1967–1969, 1974, 2065, 2077, 2100, 2355, 2379, 2474, 2723, 2930, 3003, 3014, 3016–3020, 3022, 3025–3033, 3043, 3113, 3247–3249, 3252–3254, 3258, 3263, 3267, 3351–3353, 3382, 3459–3463
   observability.py7558488%377, 379–380, 383, 386, 389–390, 395–396, 402–403, 409–410, 417, 419–421, 424–426, 431–432, 438–439, 445–446, 453, 610–611, 739, 743–745, 747, 751–752, 756, 794, 796, 807–809, 811–813, 817, 825, 949–950, 1112, 1355–1356, 1454–1459, 1466–1469, 1473–1481, 1488, 1553–1557, 1616–1617, 1751, 1843, 2040, 2258, 2260
TOTAL27264320388% 

Python Unit Test Overview

Tests Skipped Failures Errors Time
5486 20 💤 0 ❌ 0 🔥 1m 24s ⏱️

Keep _stream_error private (consistent with rest of ResponseStream), and
suppress reportPrivateUsage at the call sites in observability.py with
inline pyright: ignore comments — access is intentional within the package.
@moonbox3 moonbox3 enabled auto-merge April 14, 2026 08:11
@moonbox3 moonbox3 added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
@droideronline
Copy link
Copy Markdown
Contributor Author

@moonbox3 - seems like a rate limit error in ms learn MCP.

@moonbox3 moonbox3 added this pull request to the merge queue Apr 14, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 14, 2026
@moonbox3 moonbox3 added this pull request to the merge queue Apr 14, 2026
Merged via the queue into microsoft:main with commit 64c68ca Apr 14, 2026
31 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Python: OTel _finalize_stream triggers result hooks on streaming error, causing after_run providers to fire incorrectly

4 participants