Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: a way to get traceId into an SLF4j MDC #150

Closed
ryangardner opened this issue Mar 29, 2016 · 32 comments · Fixed by #389
Closed

Feature request: a way to get traceId into an SLF4j MDC #150

ryangardner opened this issue Mar 29, 2016 · 32 comments · Fixed by #389

Comments

@ryangardner
Copy link
Contributor

We've got a lot of log aggregation software already, and it'd be very powerful if we could correlate traceIds in our logs to specific requests in Zipkin (see a slow request in Zipkin? go search for the traceId in our log aggregation and see any messages that tie into it)

It'd be nice to be able to put the traceId into the MDC every time it is set and clear it out whenever the request is over (or the current one is set to nothing) then clear it out.

I think in order for this to work some kind of hook needs to be exposed.

I can do it myself with AOP and intercepting calls to the ServerClientAndLocalSpanState - but I think that other people would like this functionality as well (It's one of the first things people ask me about in my company when I show brave / zipkin to them)

If you can think of the approach you would prefer for this, I can take a first stab at implementing it and submit a pull request with the changes

@codefromthecrypt
Copy link
Member

relates to #110 if you don't mind reading up on that.

@ryangardner
Copy link
Contributor Author

Switching out the spanId generator would make another place to work with - but wouldn't necessarily let you know when the spans are created / ended, but a couple other ways that come to mind to handle this are:

  • Make it possible to decorate behavior of the the setStateCurrentTrace setStateCurrentTrace(..) and clearCurrentSpan() of the ServerTracer methods to have calls to those methods set / unset things in an MDC (this one seems like it would be the least disruptive option)
  • Some kind of event system where you can register listeners with the ServerTracer to be notified of trace start / trace end events (this seems like it'd be both overkill and add an unnecessary level of complexity for just this one use case)

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 4, 2016 via email

@ryangardner
Copy link
Contributor Author

I came up with this thing that seems to work for now - I made a decorator for the ServerClientAndLocalSpanState:

import com.github.kristofa.brave.IdConversion;
import com.github.kristofa.brave.ServerClientAndLocalSpanState;
import com.github.kristofa.brave.ServerSpan;
import com.twitter.zipkin.gen.Endpoint;
import com.twitter.zipkin.gen.Span;
import org.slf4j.MDC;

/**
 * Expose the trace / spanIds to the MDC and delegates all operations to the provided ServerClientAndLocalSpanState
 */
public class MdcDecoratingServerClientAndLocalSpanState implements ServerClientAndLocalSpanState {
    public final String ZIPKIN_TRACEID_MDC = "zipkinTraceId";
    public final String ZIPKIN_SPANID_MDC = "zipkinSpanId";
    public final String ZIPKIN_PARENTID_MDC = "zipkinParentId";

    private ServerClientAndLocalSpanState serverClientAndLocalSpanState;

    public MdcDecoratingServerClientAndLocalSpanState(ServerClientAndLocalSpanState serverClientAndLocalSpanState) {
        this.serverClientAndLocalSpanState = serverClientAndLocalSpanState;
    }

    @Override
    public ServerSpan getCurrentServerSpan() {
        return serverClientAndLocalSpanState.getCurrentServerSpan();
    }

    @Override
    public Endpoint getServerEndpoint() {
        return serverClientAndLocalSpanState.getServerEndpoint();
    }

    // clear the MDC when the CurrentServerSpan is cleared, store info in the MDC when the span state is valid
    @Override
    public void setCurrentServerSpan(ServerSpan span) {
        if (span == null || span.getSpan() == null || span.getSample() == false) {
            MDC.remove(ZIPKIN_TRACEID_MDC);
            MDC.remove(ZIPKIN_SPANID_MDC);
            MDC.remove(ZIPKIN_PARENTID_MDC);
        } else {
            MDC.put(ZIPKIN_TRACEID_MDC, IdConversion.convertToString(span.getSpan().getTrace_id()));
            MDC.put(ZIPKIN_SPANID_MDC, IdConversion.convertToString(span.getSpan().getId()));
            if (span.getSpan().getParent_id() != null) {
                MDC.put(ZIPKIN_PARENTID_MDC, IdConversion.convertToString(span.getSpan().getParent_id()));
            } else {
                MDC.remove(ZIPKIN_PARENTID_MDC);
            }
        }
        serverClientAndLocalSpanState.setCurrentServerSpan(span);
    }

    @Override
    public Endpoint getClientEndpoint() {
        return serverClientAndLocalSpanState.getClientEndpoint();
    }

    @Override
    public Span getCurrentClientSpan() {
        return serverClientAndLocalSpanState.getCurrentClientSpan();
    }

    @Override
    public void setCurrentClientSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentClientSpan(span);
    }

    @Override
    public void setCurrentClientServiceName(String serviceName) {
        serverClientAndLocalSpanState.setCurrentClientServiceName(serviceName);
    }

    @Override
    public Boolean sample() {
        return serverClientAndLocalSpanState.sample();
    }

    @Override
    public Span getCurrentLocalSpan() {
        return serverClientAndLocalSpanState.getCurrentLocalSpan();
    }

    @Override
    public void setCurrentLocalSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentLocalSpan(span);
    }
}

I'll take a look at the spring cloud sleuth stuff too

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 11, 2016 via email

@kristofa
Copy link
Member

Yes, clever. I didn't know about MDC so I learned something new :)

One of the interesting side effects of enabling distributed tracing and trace context propagation is that it should be straight forward to enable log correlation for requests across services. Unfortunately the current brave api doesn't make it easy to do but it looks like what you have should work.

@eirslett
Copy link
Contributor

If Brave standardises on something like SLF4J (instead of jul), we could do it as part of Brave core?

@kristofa
Copy link
Member

The switch to JUL for the main code was done to get rid of dependencies for brave-core.

Unfortunately this code will only work for traces that are actually being sampled which can be a small percentage. For log correlation it would be nicer if we can generate span ids even if we decide we don't sample the trace.

@eirslett
Copy link
Contributor

I was going to bring up exactly that - unlike Finagle, Brave simply stops tracing behaviour altogether if a trace is decided to not be sampled. It would be better if that was changed, so that tracing occurs anyways, but the span consumer ignores unsampled spans?

@eirslett
Copy link
Contributor

Regarding jul vs slf4j, maybe it's possible to make some kind of hook in Brave, where we can register a MDC "interceptor"? Then Brave wouldn't need a hard dependency on slf4j.

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 13, 2016 via email

@ryangardner
Copy link
Contributor Author

Indeed this does only work for requests that are already sampled. The scenario people wanted to use it for was "If I find a really long request in zipkin, and now I want to go to our log aggregation framework and look it up - if I could just paste in that traceId and then find any messages that were logged along the way that would help me gather additional information"

It seems that spring-cloud-sleuth seems to have implemented a layer around brave to have it always generate traceIds and sampleIds even if a request isn't sampled - this seems like it shouldn't be too hard to do in brave since the "sampled" header is being passed along with the rest of the headers... there might be a few places that need to be tweaked but I don't think

SLF4j is an extremely widespread logging API, and doesn't dictate a specific logging framework (someone could use slf4j over java.util.logging if they wanted to by including sjf4j-jul) - removing certain dependencies from brave-core seems prudent but removing all dependencies at some point reaches a point of diminishing returns. The logging framework one is one of those rare cases that a lot of places bring in slf4j simply because JUL is considered to be one of the least desirable logging frameworks around.

As far as request overhead is concerned, you could still have the traceIds get passed along but have the short-circuiting if a request isn't sampled based on the "sampled" header being set to false - then you could still pass the traceId and spanIds along but only do additional things like adding annotations if the request is actually being sampled. (A lot of the short circuits are based around null checks now it seems, switching it to be around a false check seems like it wouldn't be too hard)

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Apr 13, 2016 via email

@eirslett
Copy link
Contributor

I think the bulk of the overhead in Zipkin instrumentation is about sending span data via the transport to the collectors and storing them in the storage engine. The extra CPU cycles aren't that expensive, and the HTTP headers aren't that big. (And, it works for Twitter, apparently)

@codefromthecrypt
Copy link
Member

I think the bulk of the overhead in Zipkin instrumentation is about
sending span data via the transport to the collectors and storing them in
the storage engine. The extra CPU cycles aren't that expensive, and the
HTTP headers aren't that big. (And, it works for Twitter, apparently)

To be clear, I think we are on the same page. Use twitter's approach which
is still before-the-fact sampling and early exit logic.

Ex. we collect and pass around the trace id (which is cheap and could use
for MDC etc), but still guard before locally collecting annotations via "if
sampled or debug, then do more expensive things like trace a function or
concat strings" That way, we can permit SLF4J or whatever, but also honor
early-exit logic. Make sense?

TraceId(_, _, _, Some(false), flags) if !flags.isDebug

https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Trace.scala#L254

@eirslett
Copy link
Contributor

Ex. we collect and pass around the trace id

Yes, that's what we should do.

Brave's current implementation (

public SpanId startNewSpan(String requestName) {
) is to ditch the trace id altogether, and reset, as if no tracing had ever happened. I think we should change the way Brave works, so that it acts more like Finagle.

@codefromthecrypt
Copy link
Member

ok unless someone else says so, I'll raise a PR to rejig the ids as a separate PR. Afterwards, we can use this to implement MDC logging even when we aren't "actively tracing"

@kristofa
Copy link
Member

@adriancole Great that you want to implement this!

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).

@testark
Copy link

testark commented Jun 17, 2016

Has any progress been made on this lately?

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jun 21, 2016 via email

@eirslett
Copy link
Contributor

One thing that worked well in zipkin-js was to make the context api pluggable. In Brave, it's hard-coded to be a ThreadLocal. I'm sure we should be able to make something similar in Brave. Then, people could add their own context, or we could have an interface with onChange handlers that people can hook into to build SLF4J support in a loosely coupled fashion.

@kuroneko25
Copy link

kuroneko25 commented Sep 18, 2016

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).

This is exactly what we are trying to do in our application. Has any progress been made on this? If not what is the closest work around that we could implement today?

@codefromthecrypt
Copy link
Member

No progress, but to Eirik's point, we should make sure any new Context
api has a means to do this.

Closest would be to make a wrapping ServerClientAndLocalSpanState, and
associate/disassociate the trace id with the MDC there.

@OverRide
public void setCurrentServerSpan(ServerSpan span) {
if (span == null) {
MDC.remove("your-trace-id-key");
} else {
MDC.put("your-trace-id-key",
IdConversion.convertToString(span.getSpan().getTrace_id())));
}
delegate. setCurrentServerSpan(span);
}

// repeat for similar things...

On Mon, Sep 19, 2016 at 5:52 AM, kuroneko25 [email protected] wrote:

With Finagle we create a logger which decorates a slf4j logger and which
prepends the trace + span + parent span id to every log message (by doing
Trace.idOption.getOrElse("unknown")).

This is exactly what we are trying to do as well in our application. Has any
progress been made on this? If not what is the closest work around that we
could implement today?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@tmack8001
Copy link

This is very similar behavior as to what I want to put into an application of mine that is using brave. Any further progress, how can I help make this a reality for us all?

@codefromthecrypt
Copy link
Member

FWIW @eirslett's suggestion of a Context (basically a trace id holder) which is decoupled from thread locals is the way to go. This is used by zipkin-js, but also in finagle I believe. Even log4j2 recently started supporting affecting their context in a way you can control whether thread locals are used.

TL;DR; we should decouple brave in some generic fashion from thread locals, but also keep in mind the use-case of synchronizing with logging contexts. If someone wants to spike an impl please do! In absence, I'm pretty close to being interested enough to take it on :)

@kuroneko25
Copy link

kuroneko25 commented Jan 16, 2017

Anything new for this this in Brave 4?

@codefromthecrypt
Copy link
Member

codefromthecrypt commented Jan 17, 2017 via email

@codefromthecrypt
Copy link
Member

Here's a recap of where this is (and practical knowledge about things).

Brave 4 was designed to unhook brave from a specific thread model. This was direly needed in order to properly address topics of in-process propagation and retrieval of the "current span". For example, we definitely don't need ServerSpan anymore (since all spans are now the same and they propagate sampled state). This simplifies the problem space such that things like MDC integration are easier to deal with.

Brave 4 is similar to Finagle, now. For example, there's an internal division between a trace context (which is the only thing used by MDC), and the span which does heavy lifting (ex adding tags). Moreover, you now always have a trace context, even when not sampled. This means MDC can rely on identifiers being present even if they aren't reported to Zipkin... a big deal.

So, how do we get to MDC integration (slf4j, log4j2 etc), then?!

Well, firstly we need a "current span" api, which will have a default implementation with possibly pluggable storage. For example, zipkin-js has this, and we might want to re-use grpc's propagation model, when on their platform. Taking lessons learned from Finagle, this almost certainly should be a reference to a TraceContext (which could be converted to a span via Tracer.newSpan as needed).

Internally, we need hooks to manage the attachment and re-attachment of span context to that api (not that the mutation might actually be internal or an SPI.. iotw, users may never see how a span is attached). The lifecycle will typically end up routing through brave's recorder (on joinSpan, newTrace, newSpan and corresponding finish/flush statements).

In the process of implementing, we can decide how much of this is practical to expose, as complexity is easy here. For example, should SLF4J be special cased (as it is in spring-cloud-sleuth)? or should we allow an event model which can synchronize multiple state stores on activation or deactivation of a span (ex SLF4J and also Brave 3 or gRPC context?)

On a practical note, I have no problem with and in fact am actively interested in doing this work. However, I'd love someone to offer to help. I round-robin through zipkin projects and am currently working on server improvements. This means probably a week or two before I can start, maybe longer. Regardless, I hope this stuff helps in case someone wants to know more or help.

@codefromthecrypt
Copy link
Member

@kmaci3k fyi here's the issue about whatever api we end up with that supports SLF4J integration.

@codefromthecrypt
Copy link
Member

summarized a design I have in mind for the current span problem here #166 (comment)

codefromthecrypt pushed a commit that referenced this issue May 2, 2017
This adds trace and span IDs to the SLF4J Mapped Diagnostic Context (MDC)
so that you can search or aggregate logs accordingly.

To enable this, configure `brave.Tracing` with `MDCCurrentTraceContext`
like so:

```java
tracing = Tracing.newBuilder()
    .currentTraceContext(MDCCurrentTraceContext.create())
    ...
    .build();
```

Then, in your log configuration, you can use `traceId` and or `spanId`.

Here's an example logback configuration:

```xml
<pattern>%d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n</pattern>
```

When a trace is in progress, it would log statements like this:
```
2017-05-02 23:36:04,789 [fcd015bf6f8b05ba/fcd015bf6f8b05ba] [main] INFO  c.a.FooController - I got here!
```

Users could then copy/paste the trace ID into the zipkin UI, or use log
correlation to further debug a problem.

Fixes #150
See #369
@codefromthecrypt
Copy link
Member

#389 here's the impl

@kuroneko25
Copy link

How do we get this in log4j2?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants