
setTimeout callback wrong context on node 10 #1494

Closed
blumamir opened this issue Sep 3, 2020 · 8 comments · Fixed by #1530
Labels
bug Something isn't working

Comments

@blumamir
Member

blumamir commented Sep 3, 2020

What version of OpenTelemetry are you using?

v0.11.0

What version of Node are you using?

v10.22.0
The issue is reproducible in Node.js v10.x.x and does not exist in versions 11 and later.

What did you do?

import { NodeTracerProvider } from "@opentelemetry/node";
import {
  ConsoleSpanExporter,
  SimpleSpanProcessor,
} from "@opentelemetry/tracing";

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();
const tracer = provider.getTracer("setTimout-context-error");

const parentSpan = tracer.startSpan("main");
tracer.withSpan(parentSpan, () => {
  setTimeout(() => {
    tracer.startSpan("timer 1").end();
  }, 1000);
});
parentSpan.end();

setTimeout(() => {
  tracer.startSpan("timer 2").end();
}, 1000);

setTimeout(() => {
  tracer.startSpan("timer 3").end();
}, 1001);

What did you expect to see?

Expected timer 2 span to have its own trace id and be the root span. Or at least to behave the same as timer 3, which created a new trace as expected.

What did you see instead?

timer 2 has the same context as timer 1 (its span's parent is main, and it has the same traceId).
Output of ConsoleSpanExporter:

{ traceId: 'd14cd7a4649ab705aa418e33ea6f7971',
  parentId: undefined,
  name: 'main',
  id: '03a02971e60b0562',
  kind: 0,
  timestamp: 1599125013177453,
  duration: 1991,
  attributes: {},
  status: { code: 0 },
  events: [] }
{ traceId: 'd14cd7a4649ab705aa418e33ea6f7971',
  parentId: '03a02971e60b0562',
  name: 'timer 1',
  id: '064ec0b6b515b53b',
  kind: 0,
  timestamp: 1599125014185837,
  duration: 100,
  attributes: {},
  status: { code: 0 },
  events: [] }
{ traceId: 'd14cd7a4649ab705aa418e33ea6f7971',  <============
  parentId: '03a02971e60b0562',    <========================
  name: 'timer 2',
  id: 'bb2fb8257570a7d8',
  kind: 0,
  timestamp: 1599125014187252,
  duration: 15,
  attributes: {},
  status: { code: 0 },
  events: [] }
{ traceId: '6ee34e2236f691951bd56676e89a7dfa',
  parentId: undefined,
  name: 'timer 3',
  id: '0fe63ea07167f298',
  kind: 0,
  timestamp: 1599125014188849,
  duration: 15,
  attributes: {},
  status: { code: 0 },
  events: [] }

Additional context

The above code snippet is just minimal code to reproduce the issue. It was encountered in a complex setup with multiple plugins, where the http module set a timer with a 0 ms timeout, and another part of the code that scheduled a timer with the same timeout received the cascading spans as children of the HTTP trace instead of being created under a new trace.
From my short investigation, it looks like Node 10 groups timers with the same duration under the same 'TIMERWRAP', which receives its async context from the first setTimeout and then invokes the following timers from the same wrapper with the wrong context.

In the example above, timer 3 has the correct context since it is created with a duration of 1001 ms, which gets its own fresh wrapper internally in Node.

@Flarna
Member

Flarna commented Sep 12, 2020

I think the problem is that AsyncHooksContextManager starts with an empty _stack instead of [Context.ROOT_CONTEXT].

  1. The script starts with an empty context stack
  2. tracer.withSpan creates and activates ContextA, referring to parentSpan
  3. timer1 creates a Timeout and a TIMERWRAP resource, and ContextA is propagated as expected into both new async ids
  4. when timer2 is started the context stack is empty, so nothing is propagated to the Timeout's async id
  5. when timer3 is started the context stack is also empty, so nothing is propagated to the async ids of its TIMERWRAP and Timeout
  6. when the first timer expires, the TIMERWRAP activates ContextA, then the first Timeout activates it again. The timer1 span is created, and ContextA is deactivated once for the Timeout but stays active because of the TIMERWRAP
  7. the Timeout of timer2 has no context to activate, so ContextA is still active and the span is linked incorrectly

If we start the AsyncHooksContextManager with Context.ROOT_CONTEXT on the stack, the Timeout for timer 2 will activate it instead of ContextA and a root span will be created as expected.

Node 12 no longer queues timers via TIMERWRAP, so the problem doesn't exist there.
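The sequence above can be modelled with a toy stack. This is an illustrative sketch, not the real AsyncHooksContextManager; ROOT_CONTEXT, enter, and exit are stand-ins for its internals:

```javascript
const ROOT_CONTEXT = { name: "root" };
const ContextA = { name: "A" };

const stack = []; // starts empty, as described in the steps above
const active = () => (stack.length ? stack[stack.length - 1] : ROOT_CONTEXT);
// "before" hooks push the context stored for the resource; resources that
// had nothing propagated to them push nothing at all.
const enter = (ctx) => { if (ctx !== undefined) stack.push(ctx); };
const exit = (ctx) => { if (ctx !== undefined) stack.pop(); };

// Node 10 sequence when the two same-duration timers fire:
enter(ContextA);            // TIMERWRAP "before": it stored ContextA
enter(ContextA);            // timer1's Timeout "before": it stored ContextA
const timer1Ctx = active(); // -> ContextA, correct
exit(ContextA);             // timer1's Timeout "after"
enter(undefined);           // timer2's Timeout stored nothing -> no push
const timer2Ctx = active(); // -> still ContextA, wrong!
exit(undefined);
exit(ContextA);             // TIMERWRAP "after"

console.log(timer1Ctx.name, timer2Ctx.name); // A A
```

Had timer2's Timeout stored ROOT_CONTEXT at creation, its "before" hook would have pushed that instead, and active() would correctly return the root context.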

@Flarna
Member

Flarna commented Sep 12, 2020

Thinking once more about this, the proposed change above may result in quite some side effects and even context loss where context is propagated correctly now.
A simpler, isolated solution would be to just ignore TIMERWRAP in AsyncHooksContextManager. It's an internal object, and context propagation works anyway because of the Timeout object used by each timer.
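A sketch of that idea (an assumed shape for illustration, not the actual fix in #1530): skip TIMERWRAP when the init hook propagates context, and each timer still inherits context through its own Timeout resource.

```javascript
// Minimal model of context propagation in an init hook, with TIMERWRAP
// ignored. asyncId/triggerAsyncId are plain numbers in this simulation.
const contexts = new Map();

function init(asyncId, type, triggerAsyncId) {
  if (type === "TIMERWRAP") return; // internal Node 10 wrapper: don't track it
  const ctx = contexts.get(triggerAsyncId);
  if (ctx !== undefined) contexts.set(asyncId, ctx);
}

// Simulated Node 10 creation order:
contexts.set(1, "ContextA"); // execution (async id 1) runs inside ContextA
init(2, "TIMERWRAP", 1);     // shared wrapper: skipped, stores nothing
init(3, "Timeout", 1);       // timer1's Timeout: inherits ContextA directly
init(4, "Timeout", 5);       // timer2, created with no active context

console.log(contexts.get(3)); // ContextA
console.log(contexts.get(4)); // undefined -> falls back to the root context
```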

@vmarchaud
Member

@Flarna Thanks for digging into this issue. Do you plan to make a PR to fix this? Otherwise I'll implement what you suggested.

@Flarna
Member

Flarna commented Sep 13, 2020

@vmarchaud I will create a PR after the weekend

@blumamir
Member Author

Thanks @Flarna for investing time and effort to debug and create a PR.
I really like the original idea, because currently async resources run with the context in which they were created if a context exists, and it makes sense to me that if no context existed, they should run with the root context for consistency and uniformity.

> Thinking once more about this the proposed change above may result in quite some side effects and even context loss where context is propagated correct now.

I too think it's a dangerous change, but I'm curious if you have some examples of what it could break.

@Flarna
Member

Flarna commented Sep 15, 2020

I have no example, but it would be easy to construct one. Such artificial samples are not really of help, because in JavaScript it's always easy to break half of the world... I have no real-world example I could refer to.

The main reason the current PR only addresses TIMERWRAP is to get this problem fixed without extra risk.

The other part, passing ROOT_CONTEXT if nothing else is set, can be done as a follow-up.
I think the AsyncLocalStorageContextManager actually acts like this, as it stores the context on the resource. There is no fallback like looking into an "enclosing resource" if the current one holds nothing.

@vmarchaud
Member

@Flarna We already fall back to ROOT_CONTEXT if there is none in the stack (here). If I remember correctly, I had other issues when setting the root as the first element on the stack.

@vmarchaud vmarchaud removed their assignment Sep 15, 2020
@Flarna
Member

Flarna commented Sep 15, 2020

@vmarchaud Yes, that's true. But if there is already a context active, then that one is used.
So in the TIMERWRAP => Timeout case, the TIMERWRAP before callback pushed its stored context onto the stack. The Timeout before callback didn't change the stack, so the top of the stack is the context from TIMERWRAP.

The same use case seen from the resource point of view would be different: the active resource is the Timeout, which has no context stored, so it falls back to ROOT_CONTEXT.

pichlermarc pushed a commit to dynatrace-oss-contrib/opentelemetry-js that referenced this issue Dec 15, 2023
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 13, 2024
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 16, 2024