Feature request: a way to get traceId into an SLF4j MDC #150

ryangardner · 2016-03-29T22:04:00Z

We've got a lot of log aggregation software already, and it'd be very powerful if we could correlate traceIds in our logs to specific requests in Zipkin (see a slow request in Zipkin? go search for the traceId in our log aggregation and see any messages that tie into it)

It'd be nice to be able to put the traceId into the MDC every time it is set and clear it out whenever the request is over (or the current one is set to nothing) then clear it out.

I think in order for this to work some kind of hook needs to be exposed.

I can do it myself with AOP and intercepting calls to the ServerClientAndLocalSpanState - but I think that other people would like this functionality as well (It's one of the first things people ask me about in my company when I show brave / zipkin to them)

If you can think of the approach you would prefer for this, I can take a first stab at implementing it and submit a pull request with the changes

The text was updated successfully, but these errors were encountered:

codefromthecrypt · 2016-03-30T10:07:22Z

relates to #110 if you don't mind reading up on that.

ryangardner · 2016-04-04T14:41:45Z

Switching out the spanId generator would make another place to work with - but wouldn't necessarily let you know when the spans are created / ended, but a couple other ways that come to mind to handle this are:

Make it possible to decorate behavior of the the setStateCurrentTrace setStateCurrentTrace(..) and clearCurrentSpan() of the ServerTracer methods to have calls to those methods set / unset things in an MDC (this one seems like it would be the least disruptive option)
Some kind of event system where you can register listeners with the ServerTracer to be notified of trace start / trace end events (this seems like it'd be both overkill and add an unnecessary level of complexity for just this one use case)

codefromthecrypt · 2016-04-04T15:12:35Z

ahh I remember this. yeah it is a bit tricky as lifecycle of a span aren't abstracted 100% (ex we have 3 tracers at the moment). One thing to add to the toolbox is SpanCollector.collect() is definitely the end. spring-cloud-sleuth has an implementation like what you're mentioning.. I think its scope is a bit broader as it requires ids even when the trace isn't sampled. anyway prior art is often helpful. https://github.com/spring-cloud/spring-cloud-sleuth/blob/master/spring-cloud-sleuth-core/src/main/java/org/springframework/cloud/sleuth/log/Slf4jSpanLogger.java Other interest is opentracing's span is like a tracer, and seems to have end callbacks. how this goes across threads etc is as yet undefined. in case you'd like a look opentracing/opentracing-java#25

ryangardner · 2016-04-11T17:33:05Z

I came up with this thing that seems to work for now - I made a decorator for the ServerClientAndLocalSpanState:

import com.github.kristofa.brave.IdConversion;
import com.github.kristofa.brave.ServerClientAndLocalSpanState;
import com.github.kristofa.brave.ServerSpan;
import com.twitter.zipkin.gen.Endpoint;
import com.twitter.zipkin.gen.Span;
import org.slf4j.MDC;

/**
 * Expose the trace / spanIds to the MDC and delegates all operations to the provided ServerClientAndLocalSpanState
 */
public class MdcDecoratingServerClientAndLocalSpanState implements ServerClientAndLocalSpanState {
    public final String ZIPKIN_TRACEID_MDC = "zipkinTraceId";
    public final String ZIPKIN_SPANID_MDC = "zipkinSpanId";
    public final String ZIPKIN_PARENTID_MDC = "zipkinParentId";

    private ServerClientAndLocalSpanState serverClientAndLocalSpanState;

    public MdcDecoratingServerClientAndLocalSpanState(ServerClientAndLocalSpanState serverClientAndLocalSpanState) {
        this.serverClientAndLocalSpanState = serverClientAndLocalSpanState;
    }

    @Override
    public ServerSpan getCurrentServerSpan() {
        return serverClientAndLocalSpanState.getCurrentServerSpan();
    }

    @Override
    public Endpoint getServerEndpoint() {
        return serverClientAndLocalSpanState.getServerEndpoint();
    }

    // clear the MDC when the CurrentServerSpan is cleared, store info in the MDC when the span state is valid
    @Override
    public void setCurrentServerSpan(ServerSpan span) {
        if (span == null || span.getSpan() == null || span.getSample() == false) {
            MDC.remove(ZIPKIN_TRACEID_MDC);
            MDC.remove(ZIPKIN_SPANID_MDC);
            MDC.remove(ZIPKIN_PARENTID_MDC);
        } else {
            MDC.put(ZIPKIN_TRACEID_MDC, IdConversion.convertToString(span.getSpan().getTrace_id()));
            MDC.put(ZIPKIN_SPANID_MDC, IdConversion.convertToString(span.getSpan().getId()));
            if (span.getSpan().getParent_id() != null) {
                MDC.put(ZIPKIN_PARENTID_MDC, IdConversion.convertToString(span.getSpan().getParent_id()));
            } else {
                MDC.remove(ZIPKIN_PARENTID_MDC);
            }
        }
        serverClientAndLocalSpanState.setCurrentServerSpan(span);
    }

    @Override
    public Endpoint getClientEndpoint() {
        return serverClientAndLocalSpanState.getClientEndpoint();
    }

    @Override
    public Span getCurrentClientSpan() {
        return serverClientAndLocalSpanState.getCurrentClientSpan();
    }

    @Override
    public void setCurrentClientSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentClientSpan(span);
    }

    @Override
    public void setCurrentClientServiceName(String serviceName) {
        serverClientAndLocalSpanState.setCurrentClientServiceName(serviceName);
    }

    @Override
    public Boolean sample() {
        return serverClientAndLocalSpanState.sample();
    }

    @Override
    public Span getCurrentLocalSpan() {
        return serverClientAndLocalSpanState.getCurrentLocalSpan();
    }

    @Override
    public void setCurrentLocalSpan(Span span) {
        serverClientAndLocalSpanState.setCurrentLocalSpan(span);
    }
}

I'll take a look at the spring cloud sleuth stuff too

codefromthecrypt · 2016-04-11T23:30:33Z

Clever! Thanks for sharing

kristofa · 2016-04-12T18:12:40Z

Yes, clever. I didn't know about MDC so I learned something new :)

One of the interesting side effects of enabling distributed tracing and trace context propagation is that it should be straight forward to enable log correlation for requests across services. Unfortunately the current brave api doesn't make it easy to do but it looks like what you have should work.

eirslett · 2016-04-12T22:05:29Z

If Brave standardises on something like SLF4J (instead of jul), we could do it as part of Brave core?

kristofa · 2016-04-13T18:55:28Z

The switch to JUL for the main code was done to get rid of dependencies for brave-core.

Unfortunately this code will only work for traces that are actually being sampled which can be a small percentage. For log correlation it would be nicer if we can generate span ids even if we decide we don't sample the trace.

eirslett · 2016-04-13T19:15:19Z

I was going to bring up exactly that - unlike Finagle, Brave simply stops tracing behaviour altogether if a trace is decided to not be sampled. It would be better if that was changed, so that tracing occurs anyways, but the span consumer ignores unsampled spans?

eirslett · 2016-04-13T19:25:25Z

Regarding jul vs slf4j, maybe it's possible to make some kind of hook in Brave, where we can register a MDC "interceptor"? Then Brave wouldn't need a hard dependency on slf4j.

codefromthecrypt · 2016-04-13T19:25:58Z

A problem could be that you don't reduce application overhead as a side-effect of sampling, but that may be ok for some folks. Also, if we ever do after-the-fact sampling, we'll run into this anyway. There are ways to avoid some overhead here, for example, google have a string cat api, where annotation.values aren't joined unless/until they are sent out of process. You'll notice there's a lot of places we early-exit code if something wasn't sampled.

ryangardner · 2016-04-13T19:30:13Z

Indeed this does only work for requests that are already sampled. The scenario people wanted to use it for was "If I find a really long request in zipkin, and now I want to go to our log aggregation framework and look it up - if I could just paste in that traceId and then find any messages that were logged along the way that would help me gather additional information"

It seems that spring-cloud-sleuth seems to have implemented a layer around brave to have it always generate traceIds and sampleIds even if a request isn't sampled - this seems like it shouldn't be too hard to do in brave since the "sampled" header is being passed along with the rest of the headers... there might be a few places that need to be tweaked but I don't think

SLF4j is an extremely widespread logging API, and doesn't dictate a specific logging framework (someone could use slf4j over java.util.logging if they wanted to by including sjf4j-jul) - removing certain dependencies from brave-core seems prudent but removing all dependencies at some point reaches a point of diminishing returns. The logging framework one is one of those rare cases that a lot of places bring in slf4j simply because JUL is considered to be one of the least desirable logging frameworks around.

As far as request overhead is concerned, you could still have the traceIds get passed along but have the short-circuiting if a request isn't sampled based on the "sampled" header being set to false - then you could still pass the traceId and spanIds along but only do additional things like adding annotations if the request is actually being sampled. (A lot of the short circuits are based around null checks now it seems, switching it to be around a false check seems like it wouldn't be too hard)

codefromthecrypt · 2016-04-13T19:36:31Z

There are other opinions on the matter of logging, including log4j2 etc. I don't really want us to escalate this feature into that kind of bikeshed, considering it can be non-invasively accomplished without changing the hard deps. I'm a no on adding slf4j as a hard-dependency, but we should be able to set it as an optional one to do the job. One way is instead of passing the trace id, to pass the propagated context around, similar to finagle's TraceId (which includes propagated context and also sampled flag iirc). So, it would be api breaking, but a way to implement what Ryan said.

eirslett · 2016-04-13T19:46:51Z

I think the bulk of the overhead in Zipkin instrumentation is about sending span data via the transport to the collectors and storing them in the storage engine. The extra CPU cycles aren't that expensive, and the HTTP headers aren't that big. (And, it works for Twitter, apparently)

codefromthecrypt · 2016-04-13T19:54:57Z

I think the bulk of the overhead in Zipkin instrumentation is about
sending span data via the transport to the collectors and storing them in
the storage engine. The extra CPU cycles aren't that expensive, and the
HTTP headers aren't that big. (And, it works for Twitter, apparently)

To be clear, I think we are on the same page. Use twitter's approach which
is still before-the-fact sampling and early exit logic.

Ex. we collect and pass around the trace id (which is cheap and could use
for MDC etc), but still guard before locally collecting annotations via "if
sampled or debug, then do more expensive things like trace a function or
concat strings" That way, we can permit SLF4J or whatever, but also honor
early-exit logic. Make sense?

TraceId(_, _, _, Some(false), flags) if !flags.isDebug

https://github.com/twitter/finagle/blob/develop/finagle-core/src/main/scala/com/twitter/finagle/tracing/Trace.scala#L254

eirslett · 2016-04-13T20:36:29Z

Ex. we collect and pass around the trace id

Yes, that's what we should do.

Brave's current implementation (

brave/brave-core/src/main/java/com/github/kristofa/brave/ClientTracer.java

Line 100 in 2d19a27

public SpanId startNewSpan(String requestName) {

) is to ditch the trace id altogether, and reset, as if no tracing had ever happened. I think we should change the way Brave works, so that it acts more like Finagle.

codefromthecrypt · 2016-04-13T20:40:31Z

ok unless someone else says so, I'll raise a PR to rejig the ids as a separate PR. Afterwards, we can use this to implement MDC logging even when we aren't "actively tracing"

kristofa · 2016-04-14T19:09:45Z

@adriancole Great that you want to implement this!

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).

testark · 2016-06-17T16:56:56Z

Has any progress been made on this lately?

codefromthecrypt · 2016-06-21T02:47:50Z

no progress as displaced on server work then holiday. regrouping now

eirslett · 2016-06-21T08:14:32Z

One thing that worked well in zipkin-js was to make the context api pluggable. In Brave, it's hard-coded to be a ThreadLocal. I'm sure we should be able to make something similar in Brave. Then, people could add their own context, or we could have an interface with onChange handlers that people can hook into to build SLF4J support in a loosely coupled fashion.

kuroneko25 · 2016-09-18T21:52:06Z

With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing Trace.idOption.getOrElse("unknown")).

This is exactly what we are trying to do in our application. Has any progress been made on this? If not what is the closest work around that we could implement today?

codefromthecrypt · 2016-09-19T02:01:39Z

No progress, but to Eirik's point, we should make sure any new Context
api has a means to do this.

Closest would be to make a wrapping ServerClientAndLocalSpanState, and
associate/disassociate the trace id with the MDC there.

@OverRide
public void setCurrentServerSpan(ServerSpan span) {
if (span == null) {
MDC.remove("your-trace-id-key");
} else {
MDC.put("your-trace-id-key",
IdConversion.convertToString(span.getSpan().getTrace_id())));
}
delegate. setCurrentServerSpan(span);
}

// repeat for similar things...

On Mon, Sep 19, 2016 at 5:52 AM, kuroneko25 [email protected] wrote:

With Finagle we create a logger which decorates a slf4j logger and which
prepends the trace + span + parent span id to every log message (by doing
Trace.idOption.getOrElse("unknown")).

This is exactly what we are trying to do as well in our application. Has any
progress been made on this? If not what is the closest work around that we
could implement today?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

tmack8001 · 2016-10-17T20:47:27Z

This is very similar behavior as to what I want to put into an application of mine that is using brave. Any further progress, how can I help make this a reality for us all?

codefromthecrypt · 2016-10-26T09:50:02Z

FWIW @eirslett's suggestion of a Context (basically a trace id holder) which is decoupled from thread locals is the way to go. This is used by zipkin-js, but also in finagle I believe. Even log4j2 recently started supporting affecting their context in a way you can control whether thread locals are used.

TL;DR; we should decouple brave in some generic fashion from thread locals, but also keep in mind the use-case of synchronizing with logging contexts. If someone wants to spike an impl please do! In absence, I'm pretty close to being interested enough to take it on :)

kuroneko25 · 2017-01-16T23:19:17Z

Anything new for this this in Brave 4?

codefromthecrypt · 2017-01-17T00:26:12Z

not yet, as brave 4 doesn't have any code yet for in-process propagation of any sort. pull requests welcome!

codefromthecrypt · 2017-01-23T03:02:11Z

Here's a recap of where this is (and practical knowledge about things).

Brave 4 was designed to unhook brave from a specific thread model. This was direly needed in order to properly address topics of in-process propagation and retrieval of the "current span". For example, we definitely don't need ServerSpan anymore (since all spans are now the same and they propagate sampled state). This simplifies the problem space such that things like MDC integration are easier to deal with.

Brave 4 is similar to Finagle, now. For example, there's an internal division between a trace context (which is the only thing used by MDC), and the span which does heavy lifting (ex adding tags). Moreover, you now always have a trace context, even when not sampled. This means MDC can rely on identifiers being present even if they aren't reported to Zipkin... a big deal.

So, how do we get to MDC integration (slf4j, log4j2 etc), then?!

Well, firstly we need a "current span" api, which will have a default implementation with possibly pluggable storage. For example, zipkin-js has this, and we might want to re-use grpc's propagation model, when on their platform. Taking lessons learned from Finagle, this almost certainly should be a reference to a TraceContext (which could be converted to a span via Tracer.newSpan as needed).

Internally, we need hooks to manage the attachment and re-attachment of span context to that api (not that the mutation might actually be internal or an SPI.. iotw, users may never see how a span is attached). The lifecycle will typically end up routing through brave's recorder (on joinSpan, newTrace, newSpan and corresponding finish/flush statements).

In the process of implementing, we can decide how much of this is practical to expose, as complexity is easy here. For example, should SLF4J be special cased (as it is in spring-cloud-sleuth)? or should we allow an event model which can synchronize multiple state stores on activation or deactivation of a span (ex SLF4J and also Brave 3 or gRPC context?)

On a practical note, I have no problem with and in fact am actively interested in doing this work. However, I'd love someone to offer to help. I round-robin through zipkin projects and am currently working on server improvements. This means probably a week or two before I can start, maybe longer. Regardless, I hope this stuff helps in case someone wants to know more or help.

codefromthecrypt · 2017-01-31T02:58:58Z

@kmaci3k fyi here's the issue about whatever api we end up with that supports SLF4J integration.

codefromthecrypt · 2017-02-06T09:55:07Z

summarized a design I have in mind for the current span problem here #166 (comment)

This adds trace and span IDs to the SLF4J Mapped Diagnostic Context (MDC) so that you can search or aggregate logs accordingly. To enable this, configure `brave.Tracing` with `MDCCurrentTraceContext` like so: ```java tracing = Tracing.newBuilder() .currentTraceContext(MDCCurrentTraceContext.create()) ... .build(); ``` Then, in your log configuration, you can use `traceId` and or `spanId`. Here's an example logback configuration: ```xml <pattern>%d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n</pattern> ``` When a trace is in progress, it would log statements like this: ``` 2017-05-02 23:36:04,789 [fcd015bf6f8b05ba/fcd015bf6f8b05ba] [main] INFO c.a.FooController - I got here! ``` Users could then copy/paste the trace ID into the zipkin UI, or use log correlation to further debug a problem. Fixes #150 See #369

codefromthecrypt · 2017-05-02T15:46:52Z

#389 here's the impl

kuroneko25 · 2017-05-04T23:12:46Z

How do we get this in log4j2?

devinsba mentioned this issue Aug 23, 2016

Adds slf4j logging span collectors #226

Closed

codefromthecrypt mentioned this issue Sep 19, 2016

Support multiple child spans from the same thread, aka variable trace shapes #166

Closed

codefromthecrypt mentioned this issue Jan 23, 2017

WIP: Introduce changes related to separating span and span context. openzipkin-contrib/brave-opentracing#9

Closed

codefromthecrypt mentioned this issue Jan 31, 2017

Option to log traces to file only on demand. #346

Closed

codefromthecrypt mentioned this issue May 2, 2017

Adds MDCCurrentTraceContext for log correlation #389

Merged

codefromthecrypt closed this as completed in #389 May 3, 2017

codefromthecrypt mentioned this issue Nov 25, 2017

How to log the parentSpanId? #537

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: a way to get traceId into an SLF4j MDC #150

Feature request: a way to get traceId into an SLF4j MDC #150

ryangardner commented Mar 29, 2016

codefromthecrypt commented Mar 30, 2016

ryangardner commented Apr 4, 2016

codefromthecrypt commented Apr 4, 2016 via email

ryangardner commented Apr 11, 2016

codefromthecrypt commented Apr 11, 2016 via email

kristofa commented Apr 12, 2016

eirslett commented Apr 12, 2016

kristofa commented Apr 13, 2016

eirslett commented Apr 13, 2016

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016 via email

ryangardner commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016 via email

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016

kristofa commented Apr 14, 2016

testark commented Jun 17, 2016

codefromthecrypt commented Jun 21, 2016 via email

eirslett commented Jun 21, 2016

kuroneko25 commented Sep 18, 2016 •

edited

Loading

codefromthecrypt commented Sep 19, 2016

tmack8001 commented Oct 17, 2016

codefromthecrypt commented Oct 26, 2016

kuroneko25 commented Jan 16, 2017 •

edited

Loading

codefromthecrypt commented Jan 17, 2017 via email

codefromthecrypt commented Jan 23, 2017

codefromthecrypt commented Jan 31, 2017

codefromthecrypt commented Feb 6, 2017

codefromthecrypt commented May 2, 2017

kuroneko25 commented May 4, 2017

Feature request: a way to get traceId into an SLF4j MDC #150

Feature request: a way to get traceId into an SLF4j MDC #150

Comments

ryangardner commented Mar 29, 2016

codefromthecrypt commented Mar 30, 2016

ryangardner commented Apr 4, 2016

codefromthecrypt commented Apr 4, 2016 via email

ryangardner commented Apr 11, 2016

codefromthecrypt commented Apr 11, 2016 via email

kristofa commented Apr 12, 2016

eirslett commented Apr 12, 2016

kristofa commented Apr 13, 2016

eirslett commented Apr 13, 2016

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016 via email

ryangardner commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016 via email

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016

eirslett commented Apr 13, 2016

codefromthecrypt commented Apr 13, 2016

kristofa commented Apr 14, 2016

testark commented Jun 17, 2016

codefromthecrypt commented Jun 21, 2016 via email

eirslett commented Jun 21, 2016

kuroneko25 commented Sep 18, 2016 • edited Loading

codefromthecrypt commented Sep 19, 2016

tmack8001 commented Oct 17, 2016

codefromthecrypt commented Oct 26, 2016

kuroneko25 commented Jan 16, 2017 • edited Loading

codefromthecrypt commented Jan 17, 2017 via email

codefromthecrypt commented Jan 23, 2017

codefromthecrypt commented Jan 31, 2017

codefromthecrypt commented Feb 6, 2017

codefromthecrypt commented May 2, 2017

kuroneko25 commented May 4, 2017

kuroneko25 commented Sep 18, 2016 •

edited

Loading

kuroneko25 commented Jan 16, 2017 •

edited

Loading