-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: a way to get traceId into an SLF4j MDC #150
Comments
relates to #110 if you don't mind reading up on that. |
Switching out the spanId generator would make another place to work with - but wouldn't necessarily let you know when the spans are created / ended, but a couple other ways that come to mind to handle this are:
|
ahh I remember this. yeah it is a bit tricky as lifecycle of a span
aren't abstracted 100% (ex we have 3 tracers at the moment). One thing
to add to the toolbox is SpanCollector.collect() is definitely the
end.
spring-cloud-sleuth has an implementation like what you're
mentioning.. I think its scope is a bit broader as it requires ids
even when the trace isn't sampled. anyway prior art is often helpful.
https://github.com/spring-cloud/spring-cloud-sleuth/blob/master/spring-cloud-sleuth-core/src/main/java/org/springframework/cloud/sleuth/log/Slf4jSpanLogger.java
Other interest is opentracing's span is like a tracer, and seems to
have end callbacks. how this goes across threads etc is as yet
undefined. in case you'd like a look
opentracing/opentracing-java#25
|
I came up with this thing that seems to work for now - I made a decorator for the ServerClientAndLocalSpanState: import com.github.kristofa.brave.IdConversion;
import com.github.kristofa.brave.ServerClientAndLocalSpanState;
import com.github.kristofa.brave.ServerSpan;
import com.twitter.zipkin.gen.Endpoint;
import com.twitter.zipkin.gen.Span;
import org.slf4j.MDC;
/**
* Expose the trace / spanIds to the MDC and delegates all operations to the provided ServerClientAndLocalSpanState
*/
public class MdcDecoratingServerClientAndLocalSpanState implements ServerClientAndLocalSpanState {
public final String ZIPKIN_TRACEID_MDC = "zipkinTraceId";
public final String ZIPKIN_SPANID_MDC = "zipkinSpanId";
public final String ZIPKIN_PARENTID_MDC = "zipkinParentId";
private ServerClientAndLocalSpanState serverClientAndLocalSpanState;
public MdcDecoratingServerClientAndLocalSpanState(ServerClientAndLocalSpanState serverClientAndLocalSpanState) {
this.serverClientAndLocalSpanState = serverClientAndLocalSpanState;
}
@Override
public ServerSpan getCurrentServerSpan() {
return serverClientAndLocalSpanState.getCurrentServerSpan();
}
@Override
public Endpoint getServerEndpoint() {
return serverClientAndLocalSpanState.getServerEndpoint();
}
// clear the MDC when the CurrentServerSpan is cleared, store info in the MDC when the span state is valid
@Override
public void setCurrentServerSpan(ServerSpan span) {
if (span == null || span.getSpan() == null || span.getSample() == false) {
MDC.remove(ZIPKIN_TRACEID_MDC);
MDC.remove(ZIPKIN_SPANID_MDC);
MDC.remove(ZIPKIN_PARENTID_MDC);
} else {
MDC.put(ZIPKIN_TRACEID_MDC, IdConversion.convertToString(span.getSpan().getTrace_id()));
MDC.put(ZIPKIN_SPANID_MDC, IdConversion.convertToString(span.getSpan().getId()));
if (span.getSpan().getParent_id() != null) {
MDC.put(ZIPKIN_PARENTID_MDC, IdConversion.convertToString(span.getSpan().getParent_id()));
} else {
MDC.remove(ZIPKIN_PARENTID_MDC);
}
}
serverClientAndLocalSpanState.setCurrentServerSpan(span);
}
@Override
public Endpoint getClientEndpoint() {
return serverClientAndLocalSpanState.getClientEndpoint();
}
@Override
public Span getCurrentClientSpan() {
return serverClientAndLocalSpanState.getCurrentClientSpan();
}
@Override
public void setCurrentClientSpan(Span span) {
serverClientAndLocalSpanState.setCurrentClientSpan(span);
}
@Override
public void setCurrentClientServiceName(String serviceName) {
serverClientAndLocalSpanState.setCurrentClientServiceName(serviceName);
}
@Override
public Boolean sample() {
return serverClientAndLocalSpanState.sample();
}
@Override
public Span getCurrentLocalSpan() {
return serverClientAndLocalSpanState.getCurrentLocalSpan();
}
@Override
public void setCurrentLocalSpan(Span span) {
serverClientAndLocalSpanState.setCurrentLocalSpan(span);
}
} I'll take a look at the spring cloud sleuth stuff too |
Clever! Thanks for sharing
|
Yes, clever. I didn't know about MDC so I learned something new :) One of the interesting side effects of enabling distributed tracing and trace context propagation is that it should be straight forward to enable log correlation for requests across services. Unfortunately the current brave api doesn't make it easy to do but it looks like what you have should work. |
If Brave standardises on something like SLF4J (instead of jul), we could do it as part of Brave core? |
The switch to JUL for the main code was done to get rid of dependencies for brave-core. Unfortunately this code will only work for traces that are actually being sampled which can be a small percentage. For log correlation it would be nicer if we can generate span ids even if we decide we don't sample the trace. |
I was going to bring up exactly that - unlike Finagle, Brave simply stops tracing behaviour altogether if a trace is decided to not be sampled. It would be better if that was changed, so that tracing occurs anyways, but the span consumer ignores unsampled spans? |
Regarding jul vs slf4j, maybe it's possible to make some kind of hook in Brave, where we can register a MDC "interceptor"? Then Brave wouldn't need a hard dependency on slf4j. |
A problem could be that you don't reduce application overhead as a
side-effect of sampling, but that may be ok for some folks. Also, if we
ever do after-the-fact sampling, we'll run into this anyway. There are ways
to avoid some overhead here, for example, google have a string cat api,
where annotation.values aren't joined unless/until they are sent out of
process.
You'll notice there's a lot of places we early-exit code if something
wasn't sampled.
|
Indeed this does only work for requests that are already sampled. The scenario people wanted to use it for was "If I find a really long request in zipkin, and now I want to go to our log aggregation framework and look it up - if I could just paste in that traceId and then find any messages that were logged along the way that would help me gather additional information" It seems that spring-cloud-sleuth seems to have implemented a layer around brave to have it always generate traceIds and sampleIds even if a request isn't sampled - this seems like it shouldn't be too hard to do in brave since the "sampled" header is being passed along with the rest of the headers... there might be a few places that need to be tweaked but I don't think SLF4j is an extremely widespread logging API, and doesn't dictate a specific logging framework (someone could use slf4j over java.util.logging if they wanted to by including sjf4j-jul) - removing certain dependencies from brave-core seems prudent but removing all dependencies at some point reaches a point of diminishing returns. The logging framework one is one of those rare cases that a lot of places bring in slf4j simply because JUL is considered to be one of the least desirable logging frameworks around. As far as request overhead is concerned, you could still have the traceIds get passed along but have the short-circuiting if a request isn't sampled based on the "sampled" header being set to false - then you could still pass the traceId and spanIds along but only do additional things like adding annotations if the request is actually being sampled. (A lot of the short circuits are based around null checks now it seems, switching it to be around a false check seems like it wouldn't be too hard) |
There are other opinions on the matter of logging, including log4j2 etc. I
don't really want us to escalate this feature into that kind of bikeshed,
considering it can be non-invasively accomplished without changing the hard
deps. I'm a no on adding slf4j as a hard-dependency, but we should be able
to set it as an optional one to do the job.
One way is instead of passing the trace id, to pass the propagated context
around, similar to finagle's TraceId (which includes propagated context and
also sampled flag iirc). So, it would be api breaking, but a way to
implement what Ryan said.
|
I think the bulk of the overhead in Zipkin instrumentation is about sending span data via the transport to the collectors and storing them in the storage engine. The extra CPU cycles aren't that expensive, and the HTTP headers aren't that big. (And, it works for Twitter, apparently) |
Ex. we collect and pass around the trace id (which is cheap and could use TraceId(_, _, _, Some(false), flags) if !flags.isDebug |
Yes, that's what we should do. Brave's current implementation (
|
ok unless someone else says so, I'll raise a PR to rejig the ids as a separate PR. Afterwards, we can use this to implement MDC logging even when we aren't "actively tracing" |
@adriancole Great that you want to implement this! With Finagle we create a logger which decorates a slf4j logger and which prepends the trace + span + parent span id to every log message (by doing |
Has any progress been made on this lately? |
no progress as displaced on server work then holiday. regrouping now
|
One thing that worked well in zipkin-js was to make the context api pluggable. In Brave, it's hard-coded to be a ThreadLocal. I'm sure we should be able to make something similar in Brave. Then, people could add their own context, or we could have an interface with onChange handlers that people can hook into to build SLF4J support in a loosely coupled fashion. |
This is exactly what we are trying to do in our application. Has any progress been made on this? If not what is the closest work around that we could implement today? |
No progress, but to Eirik's point, we should make sure any new Context Closest would be to make a wrapping ServerClientAndLocalSpanState, and @OverRide // repeat for similar things... On Mon, Sep 19, 2016 at 5:52 AM, kuroneko25 [email protected] wrote:
|
This is very similar behavior as to what I want to put into an application of mine that is using brave. Any further progress, how can I help make this a reality for us all? |
FWIW @eirslett's suggestion of a Context (basically a trace id holder) which is decoupled from thread locals is the way to go. This is used by zipkin-js, but also in finagle I believe. Even log4j2 recently started supporting affecting their context in a way you can control whether thread locals are used. TL;DR; we should decouple brave in some generic fashion from thread locals, but also keep in mind the use-case of synchronizing with logging contexts. If someone wants to spike an impl please do! In absence, I'm pretty close to being interested enough to take it on :) |
Anything new for this this in Brave 4? |
not yet, as brave 4 doesn't have any code yet for in-process propagation of
any sort. pull requests welcome!
|
Here's a recap of where this is (and practical knowledge about things). Brave 4 was designed to unhook brave from a specific thread model. This was direly needed in order to properly address topics of in-process propagation and retrieval of the "current span". For example, we definitely don't need ServerSpan anymore (since all spans are now the same and they propagate sampled state). This simplifies the problem space such that things like MDC integration are easier to deal with. Brave 4 is similar to Finagle, now. For example, there's an internal division between a trace context (which is the only thing used by MDC), and the span which does heavy lifting (ex adding tags). Moreover, you now always have a trace context, even when not sampled. This means MDC can rely on identifiers being present even if they aren't reported to Zipkin... a big deal. So, how do we get to MDC integration (slf4j, log4j2 etc), then?! Well, firstly we need a "current span" api, which will have a default implementation with possibly pluggable storage. For example, zipkin-js has this, and we might want to re-use grpc's propagation model, when on their platform. Taking lessons learned from Finagle, this almost certainly should be a reference to a TraceContext (which could be converted to a span via Tracer.newSpan as needed). Internally, we need hooks to manage the attachment and re-attachment of span context to that api (not that the mutation might actually be internal or an SPI.. iotw, users may never see how a span is attached). The lifecycle will typically end up routing through brave's recorder (on joinSpan, newTrace, newSpan and corresponding finish/flush statements). In the process of implementing, we can decide how much of this is practical to expose, as complexity is easy here. For example, should SLF4J be special cased (as it is in spring-cloud-sleuth)? or should we allow an event model which can synchronize multiple state stores on activation or deactivation of a span (ex SLF4J and also Brave 3 or gRPC context?) On a practical note, I have no problem with and in fact am actively interested in doing this work. However, I'd love someone to offer to help. I round-robin through zipkin projects and am currently working on server improvements. This means probably a week or two before I can start, maybe longer. Regardless, I hope this stuff helps in case someone wants to know more or help. |
@kmaci3k fyi here's the issue about whatever api we end up with that supports SLF4J integration. |
summarized a design I have in mind for the current span problem here #166 (comment) |
This adds trace and span IDs to the SLF4J Mapped Diagnostic Context (MDC) so that you can search or aggregate logs accordingly. To enable this, configure `brave.Tracing` with `MDCCurrentTraceContext` like so: ```java tracing = Tracing.newBuilder() .currentTraceContext(MDCCurrentTraceContext.create()) ... .build(); ``` Then, in your log configuration, you can use `traceId` and or `spanId`. Here's an example logback configuration: ```xml <pattern>%d [%X{traceId}/%X{spanId}] [%thread] %-5level %logger{36} - %msg%n</pattern> ``` When a trace is in progress, it would log statements like this: ``` 2017-05-02 23:36:04,789 [fcd015bf6f8b05ba/fcd015bf6f8b05ba] [main] INFO c.a.FooController - I got here! ``` Users could then copy/paste the trace ID into the zipkin UI, or use log correlation to further debug a problem. Fixes #150 See #369
#389 here's the impl |
How do we get this in log4j2? |
We've got a lot of log aggregation software already, and it'd be very powerful if we could correlate traceIds in our logs to specific requests in Zipkin (see a slow request in Zipkin? go search for the traceId in our log aggregation and see any messages that tie into it)
It'd be nice to be able to put the traceId into the MDC every time it is set and clear it out whenever the request is over (or the current one is set to nothing) then clear it out.
I think in order for this to work some kind of hook needs to be exposed.
I can do it myself with AOP and intercepting calls to the
ServerClientAndLocalSpanState
- but I think that other people would like this functionality as well (It's one of the first things people ask me about in my company when I show brave / zipkin to them)If you can think of the approach you would prefer for this, I can take a first stab at implementing it and submit a pull request with the changes
The text was updated successfully, but these errors were encountered: