fix(openai-agents): span attribute handling for tool calls and results #3422

doronkopit5 merged 4 commits into main from
Conversation
- Refactored the OpenTelemetryTracingProcessor to support unified handling of message formats, including both standard OpenAI chat and OpenAI Agents SDK formats. - Added logic to extract and set attributes for tool calls and results, ensuring proper recording in span attributes. - Introduced a new test to validate the correct propagation of tool call and result attributes in spans. - Added a YAML cassette for testing tool call interactions with the OpenAI API.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Normalize OpenAI Agent messages into a unified dict format and emit hierarchical span attributes (`LLM_PROMPTS.{i}.*`) for role, content, tool_call_id, and tool_calls; add tests and a cassette verifying tool_call and tool result attributes in response spans.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent as OpenAI Agent
    participant Span as Instrumentation Span
    participant Normalizer as Message Normalizer
    Agent->>Span: Response messages arrive
    Span->>Normalizer: on_span_end receives messages
    Normalizer->>Normalizer: Normalize each message to dict
    alt standard chat message
        Normalizer->>Normalizer: extract role, content
    else function_call (Agents SDK)
        Normalizer->>Normalizer: map to assistant role, build tool_calls (id, name, arguments)
    else function_call_output (tool result)
        Normalizer->>Normalizer: map to tool role, set tool_call_id + content
    end
    Normalizer->>Span: set attributes LLM_PROMPTS.{i}.role/content/tool_call_id/tool_calls
    note over Span: Tool calls normalized (id, name, args stringified) and emitted
    Span->>Span: emit final span attributes
```
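The flow in the diagram can be sketched roughly as follows. This is an illustrative approximation, not the instrumentation's actual code: the `LLM_PROMPTS` constant value, the helper names, and the use of `vars()` for object conversion are all assumptions for the sake of a self-contained example.

```python
import json

# Assumed attribute prefix; the real value comes from SpanAttributes.LLM_PROMPTS.
LLM_PROMPTS = "gen_ai.prompt"


def normalize_message(message):
    """Map chat-format and Agents-SDK-format messages to one dict shape."""
    msg = message if isinstance(message, dict) else vars(message)
    if "role" in msg:  # standard chat message
        return {"role": msg["role"], "content": msg.get("content")}
    if msg.get("type") == "function_call":  # Agents SDK tool call
        return {
            "role": "assistant",
            "tool_calls": [{
                "id": msg.get("call_id", ""),
                "name": msg.get("name", ""),
                "arguments": msg.get("arguments", ""),
            }],
        }
    if msg.get("type") == "function_call_output":  # tool result
        return {
            "role": "tool",
            "tool_call_id": msg.get("call_id", ""),
            "content": msg.get("output"),
        }
    return {}  # unrecognized format


def emit_attributes(messages):
    """Build the flat LLM_PROMPTS.{i}.* attribute dict for a span."""
    attrs = {}
    for i, message in enumerate(messages):
        msg = normalize_message(message)
        prefix = f"{LLM_PROMPTS}.{i}"
        if "role" in msg:
            attrs[f"{prefix}.role"] = msg["role"]
        if msg.get("content") is not None:
            content = msg["content"]
            attrs[f"{prefix}.content"] = (
                content if isinstance(content, str) else json.dumps(content)
            )
        if "tool_call_id" in msg:
            attrs[f"{prefix}.tool_call_id"] = msg["tool_call_id"]
        for j, tc in enumerate(msg.get("tool_calls", [])):
            attrs[f"{prefix}.tool_calls.{j}.id"] = tc["id"]
            attrs[f"{prefix}.tool_calls.{j}.name"] = tc["name"]
            args = tc["arguments"]
            attrs[f"{prefix}.tool_calls.{j}.arguments"] = (
                args if isinstance(args, str) else json.dumps(args)
            )
    return attrs
```

A tool call and its result thus land under the same `tool_call_id`/`tool_calls.{j}.id` value, which is what lets the test below correlate them.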
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 checks passed.
Caution
Changes requested ❌
Reviewed everything up to 2ba1445 in 1 minute and 36 seconds.
- Reviewed 523 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 1 draft comment. View it below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py:267
   - Draft comment: The branching for handling tool call messages based on the 'type' field is clear, but consider adding inline comments to emphasize how each branch maps to the respective OpenAI formats. This minor improvement could aid future maintainers.
   - Reason this comment was not posted: Comment looked like it was already resolved.
Workflow ID: wflow_54Qm2El5iGf4Xcvn
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
```python
content = message.content
prefix = f"{SpanAttributes.LLM_PROMPTS}.{i}"

# Convert message to dict for unified handling
```
Consider extracting the object-to-dict conversion (lines 244–253) into a helper function to improve readability and reduce duplication. Using something like vars(message) or a dedicated converter might simplify the logic.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
452-471: Apply normalization logic to the legacy fallback path for consistency across all span types.

The legacy fallback path (lines 452-500+) handles AgentSpanData, HandoffSpanData, and FunctionSpanData but only extracts `role` and `content` from messages, missing the `tool_call_id` and `tool_calls` that the main normalization path (lines 244-335) handles. This creates inconsistent span attributes across different span data types. Apply the same message normalization and tool extraction logic from lines 244-335 to the legacy fallback block.
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (2)
242-253: Consider adding error handling for attribute extraction.

The attribute extraction loop uses `hasattr()` and `getattr()` to convert message objects to dicts. If attribute access raises an exception (e.g., a property with side effects), the code will fail. Consider wrapping it in a try-except to handle unexpected attribute access errors gracefully.
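A minimal sketch of the suggested guard (the helper name and attribute list are illustrative, not from the PR). Note that in Python 3, `hasattr()` only swallows `AttributeError`, so a property raising anything else still needs the outer try-except:

```python
def safe_extract(message, attr_names):
    """Copy the named attributes into a dict, skipping any attribute
    whose property accessor raises instead of aborting the whole span."""
    msg = {}
    for attr in attr_names:
        try:
            if hasattr(message, attr):
                msg[attr] = getattr(message, attr)
        except Exception:
            # The property raised something other than AttributeError; skip it.
            continue
    return msg
```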
260-282: Silent fallthrough when message format is unrecognized.

If a message has neither 'role' nor 'type' keys, the code silently skips it without setting role, content, tool_call_id, or tool_calls. This could make debugging difficult if unexpected message formats appear.

Consider logging a debug message when an unrecognized message format is encountered:

```diff
 elif 'type' in msg:  # OpenAI Agents SDK format
     msg_type = msg['type']
     if msg_type == 'function_call':
         # ... existing code ...
     elif msg_type == 'function_call_output':
         # ... existing code ...
+    else:
+        # Unrecognized format - log for debugging
+        import logging
+        logging.debug(f"Unrecognized message format: {msg.keys()}")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1 hunks)
- packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml (1 hunks)
- packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/cassettes/**/*.{yaml,yml,json}
📄 CodeRabbit inference engine (CLAUDE.md)
Never commit secrets or PII in VCR cassettes; scrub sensitive data
Files:
packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py
- packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py
🧬 Code graph analysis (2)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
SpanAttributes (64-261)
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (3)
- packages/opentelemetry-instrumentation-openai-agents/tests/conftest.py (1): exporter (55-65)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1): get_finished_spans (40-43)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (4)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
299-335: Robust tool_calls normalization and attribute population.

The code handles multiple tool_call formats comprehensively:
- Converts objects to dicts by extracting attributes
- Handles nested function details (standard OpenAI format)
- Flattens to a consistent structure
- Properly JSON-stringifies non-string arguments
The implementation correctly accommodates both OpenAI SDK format (nested function) and direct format.
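The flattening described above can be sketched as follows; the helper name is hypothetical and the `SimpleNamespace` objects in the test merely stand in for real SDK objects:

```python
import json
from types import SimpleNamespace


def flatten_tool_call(tool_call):
    """Accept a dict, an OpenAI-style object with a nested .function,
    or a flat Agents-SDK-style object, and return one flat dict."""
    if isinstance(tool_call, dict):
        tc = dict(tool_call)
    else:
        tc = {"id": getattr(tool_call, "id", "")}
        func = getattr(tool_call, "function", None)
        if func is not None:  # standard OpenAI nested format
            tc["name"] = getattr(func, "name", "")
            tc["arguments"] = getattr(func, "arguments", "")
        else:  # flat object with direct name/arguments
            tc["name"] = getattr(tool_call, "name", "")
            tc["arguments"] = getattr(tool_call, "arguments", "")
    args = tc.get("arguments", "")
    if not isinstance(args, str):  # JSON-stringify non-string arguments
        tc["arguments"] = json.dumps(args)
    return tc
```

Both input shapes collapse to the same `{id, name, arguments}` structure, so the attribute-emission code downstream only has to handle one format.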
packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml (1)
202-203: Verify cookie values are scrubbed or expired.

The cassette contains cookie values (`__cf_bm` and `_cfuvid`) that could be sensitive session tokens. Ensure these are either scrubbed placeholders or naturally expired tokens before committing. As per coding guidelines.
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (2)
6-6: Import addition looks good.

The `function_tool` import is correctly added and used in the new test case below.
459-543: Comprehensive test for tool_call and tool_result attributes.

The test validates:
- Tool call attributes (name, arguments) in assistant messages
- Tool result attributes (content, tool_call_id) in tool messages
- Both attributes are found in the second response span's conversation history
The test logic is thorough and correctly structured. The iteration limit of 20 messages should be sufficient for typical agent conversations.
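The scan the test performs over conversation history can be sketched roughly like this; the attribute prefix and exact keys are assumptions for illustration, not copied from the test file:

```python
def find_tool_call_and_result(attrs, max_entries=20, prefix="gen_ai.prompt"):
    """Scan up to max_entries prompt slots for a tool call and a tool result."""
    tool_call_found = False
    tool_result_found = False
    for i in range(max_entries):
        # An assistant message carrying a tool call exposes tool_calls.0.name.
        if f"{prefix}.{i}.tool_calls.0.name" in attrs:
            tool_call_found = True
        # A tool result is a role=tool message with a tool_call_id back-reference.
        if (attrs.get(f"{prefix}.{i}.role") == "tool"
                and f"{prefix}.{i}.tool_call_id" in attrs):
            tool_result_found = True
    return tool_call_found, tool_result_found
```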
Important
Looks good to me! 👍
Reviewed b0d7b2b in 45 seconds.
- Reviewed 52 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py:250
   - Draft comment: Improved readability: converting the inline list of attribute names to a multi-line format makes the code clearer. Ensure that all necessary attributes are still captured and no key is accidentally omitted.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
2. packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py:467
   - Draft comment: Refactored get_city_info: Using multi-line string literals via implicit concatenation improves readability. Verify that the resulting string maintains the intended format without unintended whitespace changes.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
Workflow ID: wflow_IJUIrfUag1j34B6x
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
279-286: Critical: Missing 'id' field in tool_calls created from function_call format.

When converting `function_call` messages to the `tool_calls` structure, the code creates a dict with only `name` and `arguments` but omits the `id` field. However, line 337 expects `tool_call.get('id')` to exist. This means tool call IDs won't be recorded in span attributes for messages in the `function_call` format, resulting in incomplete observability data.

This issue was previously flagged and remains unresolved.
Apply this diff to include the call_id:
```diff
 elif msg_type == 'function_call':
     # Tool calls are assistant messages
     role = 'assistant'
     # Create tool_calls structure matching OpenAI SDK format
     tool_calls = [{
+        'id': msg.get('call_id', ''),
         'name': msg.get('name', ''),
         'arguments': msg.get('arguments', '')
     }]
```
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (1)
502-502: Replace magic number with a documented constant.

The value `20` appears arbitrary. Consider extracting it as a named constant (e.g., `MAX_CONVERSATION_HISTORY_ENTRIES = 20`) with a comment explaining why this limit is chosen, improving code maintainability.

Apply this diff:

```diff
+# Maximum conversation history entries to check (sufficient for typical tool call scenarios)
+MAX_CONVERSATION_HISTORY_ENTRIES = 20
+
 # The tool call and result appear in the SECOND response span as part of conversation history
 # Find the assistant message with tool call
 tool_call_found = False
 tool_result_found = False
-for i in range(20):  # Check conversation history
+for i in range(MAX_CONVERSATION_HISTORY_ENTRIES):
```

packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
244-262: Consider extracting message-to-dict conversion to a helper function.

The object-to-dict conversion logic is repeated inline and could benefit from extraction to a dedicated helper function (e.g., `_message_to_dict(message)`) for improved readability and reusability. This aligns with feedback from previous reviews.

Example helper function:

```python
def _message_to_dict(message) -> dict:
    """Convert a message object to a dict for unified processing."""
    if isinstance(message, dict):
        return message
    msg = {}
    for attr in [
        "role",
        "content",
        "tool_call_id",
        "tool_calls",
        "type",
        "name",
        "arguments",
        "call_id",
        "output",
    ]:
        if hasattr(message, attr):
            msg[attr] = getattr(message, attr)
    return msg
```

Then replace lines 244-262 with:

```diff
-# Convert message to dict for unified handling
-if isinstance(message, dict):
-    msg = message
-else:
-    # Convert object to dict
-    msg = {}
-    for attr in [
-        "role",
-        "content",
-        "tool_call_id",
-        "tool_calls",
-        "type",
-        "name",
-        "arguments",
-        "call_id",
-        "output",
-    ]:
-        if hasattr(message, attr):
-            msg[attr] = getattr(message, attr)
+msg = self._message_to_dict(message)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1 hunks)
- packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.pypackages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py
🧬 Code graph analysis (2)
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (3)
- packages/opentelemetry-instrumentation-openai-agents/tests/conftest.py (1): exporter (55-65)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1): get_finished_spans (40-43)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-261)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
- packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1): SpanAttributes (64-261)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (4)
packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (2)
6-6: LGTM: Import addition supports new test.

The `function_tool` import is appropriately added to support the new test case.
459-550: Well-structured test with comprehensive assertions.

The test effectively validates that tool calls and tool results are properly recorded in span attributes. The logic correctly identifies the second response span (which contains conversation history) and verifies both tool call attributes (name, arguments) and tool result attributes (content, tool_call_id).
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (2)
293-305: LGTM: Proper attribute handling with correct null checks.

The role, content, and tool_call_id attribute handling is well-implemented. The check for `content is not None` (line 298) correctly allows empty strings while preventing null values, and JSON encoding for non-string content ensures proper serialization.
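The distinction the check makes can be shown in a few lines; the function and attribute names here are illustrative, not the instrumentation's actual code:

```python
import json


def set_content_attr(attrs, prefix, content):
    """`content is not None` keeps empty strings but drops nulls;
    non-string content is JSON-encoded for serialization."""
    if content is not None:
        attrs[f"{prefix}.content"] = (
            content if isinstance(content, str) else json.dumps(content)
        )


attrs = {}
set_content_attr(attrs, "gen_ai.prompt.0", "")        # empty string kept
set_content_attr(attrs, "gen_ai.prompt.1", None)      # dropped
set_content_attr(attrs, "gen_ai.prompt.2", {"a": 1})  # JSON-encoded
```

A truthiness check (`if content:`) would wrongly drop the empty string in the first call, which is why the explicit `is not None` comparison matters.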
307-345: Well-structured tool_calls processing with comprehensive format support.

The tool_calls handling logic correctly converts various object formats to dicts, extracts nested function details, and sets appropriate span attributes. The JSON encoding for non-string arguments (lines 342-344) is appropriate.

Note: This logic depends on the `id` field being present in tool_calls. Ensure the critical issue at lines 283-286 is addressed so that tool call IDs are properly recorded for all message formats.
Important
Looks good to me! 👍
Reviewed adbc139 in 1 minute and 35 seconds.
- Reviewed 269 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py:284
   - Draft comment: Added the 'id' attribute for tool calls. Consider adding an inline comment explaining that this field is used to correlate tool call requests with their results. Also, if a tool call message might sometimes lack an 'id', consider generating a fallback (instead of defaulting to an empty string) to satisfy test assertions expecting a non-empty value.
   - Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The comment has two parts. The first part, about adding documentation, is purely informative and not actionable. The second part, about generating a fallback ID, is speculative: it assumes there might be test assertions expecting non-empty values, but there's no evidence of this. The code is clearly trying to match the OpenAI SDK format, and an empty string is a valid default there. Without seeing actual test failures or requirements around non-empty IDs, making this change would be premature optimization. The comment should be deleted as it combines non-actionable documentation requests with speculative suggestions about potential test issues that aren't evidenced.
2. packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py:519
   - Draft comment: The test now verifies that the tool call ID attribute exists and is non-empty. This helps ensure consistency between the instrumentation and expected behavior. No changes needed, but double-check that all tool call messages consistently provide an 'id'.
   - Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%
3. packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml:294
   - Draft comment: The test cassette now includes tool call IDs in the recorded interactions. Ensure that these cassette values remain stable across runs, or consider marking them as dynamic if they can change.
   - Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%
Workflow ID: wflow_CT9NorqIp4rWSaQK
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1)
244-263: Consider extracting object-to-dict conversion to a helper function.

The manual attribute extraction is verbose and reduces readability. Extracting this logic into a helper function would improve maintainability and make the code more testable.
This aligns with the previous review suggestion from ellipsis-dev[bot].
🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (2)
310-326: Consider simplifying tool_call object-to-dict conversion.

The nested conditionals for converting tool_call objects to dictionaries are complex and could be more maintainable. Consider extracting this logic into a small helper function.

Example helper function:

```python
def _tool_call_to_dict(tool_call):
    """Convert a tool_call object to dict format."""
    if isinstance(tool_call, dict):
        return tool_call
    tc_dict = {}
    if hasattr(tool_call, 'id'):
        tc_dict['id'] = tool_call.id
    if hasattr(tool_call, 'function'):
        func = tool_call.function
        if hasattr(func, 'name'):
            tc_dict['name'] = func.name
        if hasattr(func, 'arguments'):
            tc_dict['arguments'] = func.arguments
    elif hasattr(tool_call, 'name'):
        tc_dict['name'] = tool_call.name
        if hasattr(tool_call, 'arguments'):
            tc_dict['arguments'] = tool_call.arguments
    return tc_dict
```

This would simplify the loop to:

```python
for j, tool_call in enumerate(tool_calls):
    tool_call = _tool_call_to_dict(tool_call)
    # ... rest of processing
```
463-481: Legacy path doesn't use unified message normalization.

The legacy fallback path (lines 463-481) uses simpler role/content extraction and doesn't leverage the new unified normalization logic introduced in lines 244-293. This creates inconsistent message handling between response spans and other span types.
If legacy spans also need to support tool calls and the OpenAI Agents SDK format, consider reusing the normalization logic. If they don't, adding a comment explaining why the simpler handling is appropriate would improve clarity.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py (1 hunks)
- packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml (1 hunks)
- packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- packages/opentelemetry-instrumentation-openai-agents/tests/cassettes/test_openai_agents/test_tool_call_and_result_attributes.yaml
- packages/opentelemetry-instrumentation-openai-agents/tests/test_openai_agents.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/opentelemetry-instrumentation-openai-agents/opentelemetry/instrumentation/openai_agents/_hooks.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
feat(instrumentation): ... or fix(instrumentation): ....
Unified agent message normalization and added tests for tool call and result span attributes in OpenAI agents.
- `_hooks.py` updated to consistently capture roles, message content, tool call IDs, and tool call details across OpenAI chat and Agents formats.
- Added `test_tool_call_and_result_attributes` in `test_openai_agents.py` to validate that agent spans record tool calls and tool results, including tool_call IDs and content.
- Cassette `test_tool_call_and_result_attributes.yaml` added to support the new test.

This description was created by
for adbc139. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
New Features
Tests