fix(langchain): ensure llm span is created #2805
obs-gh-abhishekrao wants to merge 24 commits into traceloop:main
Conversation
❌ Changes requested. Reviewed everything up to 08116d9 in 2 minutes and 8 seconds
More details
- Looked at 98 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 4 drafted comments based on config settings
1. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/assistant_wrappers.py:215
   - Draft comment: Avoid using 'SpanAttributes.LLM_RESPONSE_MODEL or "unknown"' as the attribute name. Instead, pass SpanAttributes.LLM_RESPONSE_MODEL as the key and handle the fallback in the value (e.g., assistants[assistant_id]["model"] or "unknown").
   - Reason this comment was not posted: Marked as duplicate.
2. packages/opentelemetry-instrumentation-langchain/opentelemetry/instrumentation/langchain/callback_handler.py:362
   - Draft comment: Changed token attachment from suppressing instrumentation to setting the span context with set_span_in_context(span). Ensure this change correctly retains the parent context without re-enabling downstream instrumentation.
   - Reason this comment was not posted: Comment did not seem useful. Confidence it is useful = 0% <= threshold 50%. The comment asks the author to confirm intended behavior, which violates the rules.
3. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/assistant_wrappers.py:130
   - Draft comment: The expression 'SpanAttributes.LLM_RESPONSE_MODEL or "unknown"' is used as the attribute key. Likely you meant to assign a default value to the attribute value (i.e. assistant["model"] or "unknown"). Use _set_span_attribute(span, SpanAttributes.LLM_RESPONSE_MODEL, assistant["model"] or "unknown") instead.
   - Reason this comment was not posted: Marked as duplicate.
4. packages/opentelemetry-instrumentation-openai/opentelemetry/instrumentation/openai/v1/assistant_wrappers.py:215
   - Draft comment: Similar issue: 'SpanAttributes.LLM_RESPONSE_MODEL or "unknown"' is incorrectly used as the key. It should be applied to the model value instead. Consider using _set_span_attribute(span, SpanAttributes.LLM_RESPONSE_MODEL, assistants[assistant_id]["model"] or "unknown") for proper defaulting.
   - Reason this comment was not posted: Marked as duplicate.
Workflow ID: wflow_x4Sz3CnV2AcX7OLr
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
```python
_set_span_attribute(
    span,
    SpanAttributes.LLM_RESPONSE_MODEL,
    SpanAttributes.LLM_RESPONSE_MODEL or "unknown",
)
```
Avoid using 'SpanAttributes.LLM_RESPONSE_MODEL or "unknown"' as the attribute name. Instead, pass SpanAttributes.LLM_RESPONSE_MODEL as the key and handle fallback in the value (e.g., assistant["model"] or "unknown").
```diff
- SpanAttributes.LLM_RESPONSE_MODEL or "unknown",
+ SpanAttributes.LLM_RESPONSE_MODEL,
```
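(For clarity, the call shape the reviewer is suggesting, as spelled out in the draft comments above; `assistant["model"]` comes from that suggestion:)

```python
# Suggested form: the semantic-convention constant stays the attribute key,
# and the "unknown" fallback applies to the model value instead.
_set_span_attribute(
    span,
    SpanAttributes.LLM_RESPONSE_MODEL,
    assistant["model"] or "unknown",
)
```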
nirga left a comment:
Thanks for submitting @obs-gh-abhishekrao - can you take a look into the failing tests and lint, and sign the CLA?
Hi @nirga, sorry it's taking time, I'm still working on getting the CLA signed. Took a look at the test failures. The langchain test failure happening on SequentialChain raises some suspicions for me. It looks like context isn't being updated / propagated. So far I'm wondering if this is because the langchain instrumentation uses BaseCallbackHandler and is (probably) missing an AsyncCallbackHandler.
I tried adding opentelemetry-instrumentation-threading to the test to see if it solves the context propagation, but it didn't help. Suggestions would be really helpful here, so I'm curious to hear your thoughts.
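(For reference, a minimal sketch of what that attempt presumably looked like; `ThreadingInstrumentor` is the entry point of the opentelemetry-instrumentation-threading package:)

```python
# Hedged sketch: ThreadingInstrumentor copies the active OTel context into
# threads started via threading.Thread and ThreadPoolExecutor workers.
from opentelemetry.instrumentation.threading import ThreadingInstrumentor

ThreadingInstrumentor().instrument()
```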
obs-gh-abhishekrao left a comment:
This should also fix #2661. I just ran into that bug as well.
Force-pushed from 25fe3fd to 6fcbdd9 (Compare)
@nirga Please review the PR at your convenience. CLA is signed.
Force-pushed from 6fcbdd9 to bb1e024 (Compare)
Force-pushed from f9d466d to 3d4f99f (Compare)
nirga left a comment:
Thanks @obs-gh-abhishekrao - I think it's almost there. Can you add a test for this specific use case?
```python
_set_span_attribute(
    span,
    SpanAttributes.LLM_RESPONSE_MODEL,
    SpanAttributes.LLM_RESPONSE_MODEL or "unknown",
)
```
Hmm, that was incorrectly coalesced. The reason I added this was that I'd encountered silent failures in metric generation, which I resolved in 4263495. I was hoping similar logic was needed here, although I don't see any metrics generation as part of the assistant wrapper.
My latest commit a2daa86 fixes the coalescing mistake, but I'm happy to remove it altogether too.
Force-pushed from 3d4f99f to 4263495 (Compare)
Added a test based on the sample code, but noticing some unexpected behaviour that I'm still debugging.
My analysis so far:
Force-pushed from 6301498 to 73648f1 (Compare)
Found the reason: this happens because all the (langchain) tests run under a shared trace context (and hence all spans nest under the same trace).
Force-pushed from 7ed511a to 5b491f8 (Compare)
ronensc left a comment:
@obs-gh-abhishekrao Thanks for working on fixing this issue! I’ve come across a subtle problem with the proposed fix. Please see the details below.
```diff
- token = context_api.attach(
-     context_api.set_value(SUPPRESS_LANGUAGE_MODEL_INSTRUMENTATION_KEY, True)
- )
+ token = context_api.attach(set_span_in_context(span))
```
I believe this should have a corresponding detach() in _end_span(). Without it, the context of the caller could be altered, potentially breaking the parent-child relationship between spans and causing unrelated spans to be linked together.
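(A minimal self-contained sketch of the pairing being asked for, assuming the handler keeps the token between span start and span end:)

```python
from opentelemetry import context as context_api, trace
from opentelemetry.trace import set_span_in_context

tracer = trace.get_tracer(__name__)

# Every attach() needs a matching detach(); otherwise the caller's context
# keeps pointing at this span after it has ended.
span = tracer.start_span("demo")
token = context_api.attach(set_span_in_context(span))
try:
    pass  # work that should see `span` as the current span
finally:
    span.end()
    context_api.detach(token)  # restores the caller's previous context
```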
Consider the following unit test, which currently fails due to this issue:
```python
@pytest.mark.vcr
def test_langgraph_double_invoke(exporter):
    class DummyGraphState(TypedDict):
        result: str

    def mynode_func(state: DummyGraphState) -> DummyGraphState:
        return state

    def build_graph():
        workflow = StateGraph(DummyGraphState)
        workflow.add_node("mynode", mynode_func)
        workflow.set_entry_point("mynode")
        langgraph = workflow.compile()
        return langgraph

    graph = build_graph()

    from opentelemetry import trace

    assert "test_langgraph_double_invoke" == trace.get_current_span().name

    graph.invoke({"result": "init"})
    assert "test_langgraph_double_invoke" == trace.get_current_span().name  # incorrectly set to 'mynode.task'

    graph.invoke({"result": "init"})
    assert "test_langgraph_double_invoke" == trace.get_current_span().name  # still incorrectly set to 'mynode.task'
```

As you can see, the span name unexpectedly switches to 'mynode.task', indicating that the span context is not properly detached after each invocation.
Please take a look at my proposed fix here:
Fix Suggestion
Feel free to review and adjust as needed.
Thanks @ronensc! Great catch, will look into adding this test and fix.
Force-pushed from 023a52c to be86077 (Compare)
```python
@pytest.mark.vcr
def test_langgraph_double_invoke(exporter):
```
The test_langgraph_double_invoke() and test_langgraph_double_ainvoke() tests don't catch the regression I mentioned.
If I comment out the detach() invocation, the tests pass even though they should have failed.
https://github.com/traceloop/openllmetry/pull/2805/files#diff-caff857eafc9121a585e40cdac1c0815d97f474cd3d0bf1a23755572e3b38105R702-R703
Apologies, I agree the test case is incorrect. However, the fix you suggested earlier does not work for me (at least not locally). Not only does the (fixed) test fail, but I'm also noticing there's a problem with detaching context in _end_span. There appears to be a flurry of these errors when it's detached there, possibly due to async behaviour:

```
ValueError: <Token var=<ContextVar name='current_context' default={} at 0xXXXXX> at 0xYYYYY> was created in a different Context
```
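(The error is reproducible with plain contextvars, independent of OpenTelemetry; a minimal repro sketch:)

```python
# A Token can only be reset in the Context that created it; resetting it
# from a different context raises the ValueError quoted above.
import contextvars

var = contextvars.ContextVar("current_context", default={})

def set_in_copy():
    return var.set({"k": "v"})  # token is bound to the copied context

token = contextvars.copy_context().run(set_in_copy)
try:
    var.reset(token)  # back in the original context, so this raises
except ValueError as e:
    print(e)  # "... was created in a different Context"
```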
I wasn’t able to reproduce the error - could you share more details on how to reproduce it? For example, the Python version you're using, how you created the virtual environment, and how you ran the tests.
This only happens when you test for the async case, i.e. ainvoke, which I don't think was included in your suggested fix.
- Python 3.11.8
- I set up the virtual environment using `nx run opentelemetry-instrumentation-langchain:lock` and then `nx run opentelemetry-instrumentation-langchain:install`
- Ran the tests within the langchain instrumentation directory: `poetry run pytest -svv tests/` (I could have run it through nx, but I preferred the verbosity of `-svv` as I was logging some stuff out.)
Thanks, I'm able to reproduce the issue.
It looks like the root cause is that the on_XXX_end() callbacks run in a different task (and thus a different context) from the on_XXX_start() callbacks. I added the following lines to each callback to verify this:

```python
task = asyncio.current_task()
print(f"Task ID: {id(task)} Task name: {task.get_name()}")
```

I also came across this LangChain discussion, though it hasn't received any input from the maintainers yet.
From what I can tell, the on_XXX_end() callbacks are dispatched on a different task due to the use of the @shielded decorator:
https://github.com/langchain-ai/langchain/blob/1ebcbf1d11578cb55db26013c745dc1e5722966e/libs/core/langchain_core/callbacks/manager.py#L962-L963
Removing the decorator might resolve the issue, but I couldn't find any explanation in the PR that introduced it for why it was added in the first place.
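(A hedged illustration of the mechanism: asyncio.shield wraps its argument in a new Task, which is why a shielded callback observes a different task, and therefore a different copy of the context:)

```python
import asyncio

async def whoami(label: str) -> None:
    task = asyncio.current_task()
    print(f"{label}: task id={id(task)} name={task.get_name()}")

async def main() -> None:
    await whoami("plain await")               # runs in the current task
    await asyncio.shield(whoami("shielded"))  # shield wraps it in a new Task

asyncio.run(main())
```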
Thanks for taking a closer look!
For what it's worth, there's a similar discussion here - langchain-ai/langsmith-sdk#1725. Could it also be that opentelemetry is not propagating context correctly? I wonder if injecting the OTel asyncio instrumentor and tagging the appropriate coroutine might take care of that.
Another issue I noticed was that the contextvar error doesn't happen (or rather is silently ignored) if the test case "passes". But when I added some logging, I found out that a bunch of tests were passing despite the context.detach happening outside the token's original context. That doesn't sound right to me.
I also wonder if Openllmetry could benefit from forking the callback handler into Sync vs Async variants like the langchain tracer does, i.e. BaseTracer vs AsyncBaseTracer.
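(A hedged sketch of the split being suggested, mirroring LangChain's BaseTracer / AsyncBaseTracer; the handler class names below are made up for illustration:)

```python
from langchain_core.callbacks import AsyncCallbackHandler, BaseCallbackHandler

class SyncTraceloopCallbackHandler(BaseCallbackHandler):  # hypothetical name
    def on_llm_start(self, serialized, prompts, **kwargs):
        # sync path: attach/detach happen on the caller's thread
        ...

class AsyncTraceloopCallbackHandler(AsyncCallbackHandler):  # hypothetical name
    async def on_llm_start(self, serialized, prompts, **kwargs):
        # async path: may run on another task, so context handling has to
        # account for task-local contextvars
        ...
```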
I've created an issue in langchain's repo. You're right about removing the decorator solving the issue; I was able to verify that locally.
One other approach I tried here is to safeguard the detach by storing copy_context() in SpanHolder:

```python
ctx = contextvars.copy_context()
self.spans[run_id] = SpanHolder(
    span, token, None, ctx, [], workflow_name, entity_name, entity_path
)

# later, when detaching, detach it in a ctx.run(...)
if token:
    ctx = self.spans[run_id].ctx
    ctx.run(context_api.detach, token)
```

But this does not fix the issue, and somehow I still get the error for incorrect context used during detach, and the test still fails with an incorrect span name :-(
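(That outcome is consistent with how contextvars works; a minimal sketch of why the stored copy can't detach the token:)

```python
# copy_context() returns a *different* Context object, but a Token is bound
# to the exact Context that created it, so resetting inside the copy fails.
import contextvars

var = contextvars.ContextVar("current_span", default=None)

token = var.set("span")           # token bound to the current context
ctx = contextvars.copy_context()  # a snapshot copy, not the same object

try:
    ctx.run(var.reset, token)     # reset runs inside the copy, so it raises
except ValueError as e:
    print(e)

var.reset(token)                  # resetting in the creating context works
```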
I'm of the opinion that we should skip fixing span relationships for now and retain the behavior from main. The main issue this PR addresses is missing downstream spans, which should be fixed. Aside from reporting upstream in langchain, I've also asked the OTel folks if there are any gotchas with shielded functions. I have, however, retained the double invoke/ainvoke tests for posterity, for when we get to addressing the span relationships.
AFAIU, context_api.detach() should be invoked only from the context that created the token, and the contextvars package doesn't have an API to get the current context (only a copy of it).
ronensc left a comment:
I've reviewed the rest of the PR and shared a few thoughts and questions.
Force-pushed from e76c1cf to e304cd4 (Compare)
nirga left a comment:
Thanks @obs-gh-abhishekrao - but can you help me understand what this changes in terms of functionality? I'm not sure it resolves the context propagation issue, right?
```python
return None
if isinstance(value, (bool, str, bytes, int, float)):
    return value
if isinstance(value, (list, tuple)):
```
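(For context, a hedged sketch of the kind of attribute-value sanitizer this snippet appears to belong to; the helper name is hypothetical, not the PR's exact code:)

```python
from typing import Any, Optional

def _sanitize_attribute_value(value: Any) -> Optional[Any]:  # hypothetical name
    """OTel span attributes accept only primitives and sequences of them."""
    if value is None:
        return None
    if isinstance(value, (bool, str, bytes, int, float)):
        return value
    if isinstance(value, (list, tuple)):
        return [_sanitize_attribute_value(v) for v in value]
    return str(value)  # stringify anything else rather than dropping it
```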
There's some edge case that's causing span attributes and metrics to not get added / computed. I (locally) added tests but could not find a way to replicate this.
However, the sample code, when run through sample-app/, consistently reproduces the missing spans and metrics.
I took a closer look and have narrowed it down to this line of code. What's really strange is that between the above change and main, there is no difference in the response being passed to model_as_dict and subsequently model.model_dump().
All that said, I'm happy to revert this and file a separate bug. I can also add the sample code to sample-app for posterity.
Found the root cause and fixed it (and reverted the change from langchain). This was actually happening within the SDK here.
Thanks for reviewing again @nirga. Apologies for the flurry of commits. To sum it up, the PR will address:
Context will not be propagated from langchain to downstreams, unfortunately, and this means the new LLM spans get disconnected from the trace.
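(To illustrate what "disconnected from the trace" means in OTel terms, a hedged sketch; the span names are illustrative:)

```python
from opentelemetry import trace
from opentelemetry.trace import set_span_in_context

tracer = trace.get_tracer(__name__)

parent = tracer.start_span("langchain.workflow")

# If the parent is neither in the active context nor passed explicitly,
# a downstream span starts its own trace instead of nesting under it:
orphan = tracer.start_span("openai.chat")

# Passing the parent context explicitly restores the parent-child link:
child = tracer.start_span("openai.chat", context=set_span_in_context(parent))

orphan.end()
child.end()
parent.end()
```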
…nt fix on langchain and add tests
Force-pushed from 45ff713 to d4f5403 (Compare)
Sorry for the hassle. I don't see any such option. Apparently this is a limitation when a fork was made from an org :( (https://github.com/orgs/community/discussions/5634). I'm checking to see what my options are. Worst case, I'll have to re-create this PR off a personal fork.
Since #3201 has been merged, I believe we can close this PR.
feat(instrumentation): ... or fix(instrumentation): ...
Fixes #2271
Reproducible code from bug report
Courtesy #2271. Thanks @jemo21k!
Expected behaviour: LLM span should be created.
Before
No openai.chat span. [screenshot]
After
openai.chat span is present.

sample_app/langgraph_example
Expected behaviour: No change. An additional LLM span should not be created since the ChatOpenAI callback already creates one.
Before
After
Important
Fixes LLM span creation and context retention in Langchain and OpenAI instrumentation by setting default values for missing attributes and ensuring spans are correctly attached to context.
- openai.chat span creation in callback_handler.py fixed by ensuring LLM_RESPONSE_MODEL is set to model or "unknown".
- _create_llm_span() in callback_handler.py.
- LLM_RESPONSE_MODEL set to model or "unknown" in _build_from_streaming_response() and _abuild_from_streaming_response() in chat_wrappers.py.
- messages_list_wrapper() and runs_create_and_stream_wrapper() in assistant_wrappers.py updated to set LLM_RESPONSE_MODEL to model or "unknown".
- callback_handler.py.

This description was created by Ellipsis for 08116d9. It will automatically update as commits are pushed.