Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add customizable global error handler #1419

Conversation

MitchellDumovic
Copy link
Contributor

This is a POC solution to open-telemetry/opentelemetry-specification#696

Much of the inspiration for this PR came from the respective implementation in Go here

Adds an error handler delegate to OpenTelemetry, which can handle errors through a call to OpenTelemetry.handleError.

The delegate accepts errors in the form of exceptions of type OpenTelemetryException, which most often are constructed at the time of an error by wrapping an exception with some additional information about where it occurred. By default, the delegate logs a warning using the native Java logger. A custom delegate can be registered by providing a ErrorHandlerFactory class, in a similar way that a custom TraceProvider can be registered.

This PR also updates several locations (primarily in various exporters) to handle an error using the global handler. Some locations I have added net-new reporting, and others I have just replaced the log.warn message that previously existed.

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 15, 2020

CLA Check
The committers are authorized under a signed CLA.

@jkwatson
Copy link
Contributor

I don't think this belongs in the API. The API is reserved for instrumentation concerns. To me, this is very clearly an operator concern, and should only be a part of the SDK.

@carlosalberto
Copy link
Contributor

I see the point on why @jkwatson doesn't want it here - in the end, we probably don't event want to have errors reported when the API alone is used. The fact we are consuming such error handler from SDK-level components seem to add weight to that argument.

Personally I'd be fine with adding this very same concept to either OpenTelemetrySdk or a similar global component at the SDK level. Would that work for you @jkwatson?

Also, I'm wondering whether we need the factory for the error. I imagine users setting such handler through code directly (when configuring the SDK itself, such as the java-instrumentation code would do), but I'm not totally sure we need a factory here.

@jkwatson
Copy link
Contributor

I'm very happy to have an error handling mechanism baked into the SDK, definitely.

@MitchellDumovic MitchellDumovic force-pushed the mitchell/implement-error-handler branch from f9a9a4a to 07ff0d6 Compare July 17, 2020 23:07
@codecov
Copy link

codecov bot commented Jul 17, 2020

Codecov Report

Merging #1419 into master will decrease coverage by 0.05%.
The diff coverage is 90.47%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1419      +/-   ##
============================================
- Coverage     92.28%   92.22%   -0.06%     
- Complexity      898      904       +6     
============================================
  Files           114      116       +2     
  Lines          3241     3256      +15     
  Branches        264      264              
============================================
+ Hits           2991     3003      +12     
- Misses          170      171       +1     
- Partials         80       82       +2     
Impacted Files Coverage Δ Complexity Δ
.../src/main/java/io/opentelemetry/OpenTelemetry.java 98.21% <ø> (ø) 22.00 <0.00> (ø)
...telemetry/sdk/trace/export/BatchSpanProcessor.java 94.59% <0.00%> (-1.36%) 8.00 <0.00> (ø)
...elemetry/sdk/trace/export/SimpleSpanProcessor.java 90.62% <0.00%> (-0.29%) 9.00 <0.00> (-1.00)
...in/java/io/opentelemetry/sdk/OpenTelemetrySdk.java 100.00% <100.00%> (ø) 7.00 <4.00> (+4.00)
...elemetry/sdk/errorhandler/DefaultErrorHandler.java 100.00% <100.00%> (ø) 3.00 <3.00> (?)
...metry/sdk/errorhandler/OpenTelemetryException.java 100.00% <100.00%> (ø) 2.00 <2.00> (?)
...ntelemetry/sdk/trace/export/MultiSpanExporter.java 100.00% <100.00%> (ø) 11.00 <0.00> (-1.00)
...k/extensions/trace/jaeger/sampler/RateLimiter.java 94.11% <0.00%> (-5.89%) 4.00% <0.00%> (-1.00%)
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ec91e1...3cd57e5. Read the comment docs.

@MitchellDumovic
Copy link
Contributor Author

I agree that it makes sense for this to be in the SDK - I've updated this PR to do so, and also took @carlosalberto's suggestion to use a code setter instead of a factory.

@@ -0,0 +1,51 @@
/*
* Copyright 2019, OpenTelemetry Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: 2020

@@ -0,0 +1,27 @@
/*
* Copyright 2019, OpenTelemetry Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here the year

@@ -0,0 +1,35 @@
/*
* Copyright 2019, OpenTelemetry Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and here

* Default implementation of {@link ErrorHandler}. By default, logs the full string of the error
* with a level of "Warning".
*
* @since 0.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @since 0.1.0
* @since 0.7.0

*
* @see OpenTelemetry
*/
@ThreadSafe
public final class OpenTelemetrySdk {
private static final Object mutex = new Object();

@Nullable private static volatile OpenTelemetrySdk instance;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than introducing a whole instance and double-checked locking in the getInstance, why not just have the errorHandler be static, since it's global anyway?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the handler being static (we don't need anything fancy here, at least for now).

}

@Override
public void handle(OpenTelemetryException e) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious do we have a story about controlling the default behavior based on the type of exception? A level flag to `OpenTelemetryException?

The context is I am looking at null / arg checks which we should remove to abide by the error handling specs, that the SDK should not throw errors due to incorrect usage by a user. I think for these logging by default would be problematic since if a simple method like .setAttribute was logging it could easily be enough to bring down the app not so unlike an actual exception. But naturally for less common / sporadic errors like export error it would generally be ok to log.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is basically an API for error logging, I'm starting to wonder if adding this additional error handler is overkill. Couldn't users just specify a custom logger if they want special behavior, outside the normal java.util.logging behavior?

That is to say, if we stipulate the all SDK logging will be in the io.opentelemetry.sdk namespace, can't we just use normal logging facilities to handle the global error handler use-case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context is I am looking at null / arg checks which we should remove to abide by the error handling specs, that the SDK should not throw errors due to incorrect usage by a user.

I personally think we shouldn't log wrong parameters, if that's what you are asking. Then again, other implementations log such errors (the Python one does, for example). But we should probably discuss that in the Specification call tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkwatson What if the user wants to do something with the error, other than logging?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the logging Handler implementation can do whatever it wants, right? It's not required to do any actual logging, is it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if the user wants to do something other than simple logging? Probably we should look for cases where simple logging is not the principal use case (to go one way or another).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by 'user', do you mean the SDK author generating the "errors" or, the user of the SDK who is doing something with them? A java.util.logging.Handler, I think, can do anything it wants...it doesn't have to be "simple logging".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I have mixed feelings about using the Java's logging facility. In theory having a handler is a straightforward, super simple thing to do, albeit limited, while the logging path includes dealing with an entire LogRecord object (and what to expect to handle there? Throwable, message? parameters?), where the user has to override three abstract methods for Handler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not convinced it was the right way to go, but rather than re-invent an internal logging API, I'd rather use something off the shelf. I can also see arguments for not wanting to generate an entire stack trace just to report some sort of SDK behavior...in which case we're getting dangerously close to implementing a logging API. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jumping in with my two-cents here…

Taking a step back, I think the the high-level goal here is that we want it to be easy for users of the SDK to get visibility into any errors that occur during setup or at runtime that prevent the expected tracing or metrics information from being exported.

I think that three things that are very important for achieving this high-level goal are:

(1) Making it easy for the SDK authors to report errors in a standardized way.
(2) Making it easy for the SDK user (not the SDK author) to have control over where errors get logged and which types of errors get logged
(3) Having a sane default so that important errors do not get squashed when someone is trying to set up OpenTelemetry.

I think that both the solution outlined in this PR and using Java’s built-in logging facility technically achieve all of these goals, but I do agree with @carlosalberto that Java’s built-in logging facility potentially adds friction that makes the whole process clunkier to work with. A user has to implement three abstract methods, it’s harder to transmit information about where an error came from and switch on that error, and it’s potentially less obvious to the SDK author that calling logger.warning will be correctly handled and propagated to the right place.

Having said that, I do agree with the notion that we shouldn’t be trying to implement our own logging API from scratch. If we feel like we need a lot of the bells and whistles that the Java logging library provides for our use cases, then I think that we should just go ahead and use that. If we only need a small subset of them, then I think that the overhead of building and maintaining them is worth the increased clarity and ease-of-use that this solution brings.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My two cents here...

There is something wrong with OpenTelemetry if an "observability" library goes back to error handling and intensive logging to debug it, why do we have this library then? Why do we ask users to use this library if we then tell them to "handle" our errors.

I know that there is a known problem of how to monitor your monitoring tools, but I feel that we should try to do a smarter thing than this, using metrics, using tracing, and only when something is fatal and we cannot do anything to recover inform the user.

@anuraaga
Copy link
Contributor

I agree that we should have great defaults so users shouldn't have to do anything in general. Some users may want control but if the defaults are good enough, maybe not.

Putting on the user hat, if I had the ability to, I would want

Unexpected, recoverable / transient errors logged as warning
Unrecoverable exceptions like validation issues logged at debug, or logged once to warn
Validation issues thrown as exceptions in unit tests.

If otel provided a flag that I could enable in tests for the extra behavior there, then I'd just use it. If there was a error handler, I'd implement my own flag using it. If it was a jul handler, it would take me some more time to understand how to use jul, but I assume it would work too. None of these options feel bad to me. And if there's no way to have a flag for unit tests, I don't think I'd feel significantly bothered though there probably will be points where I wonder why a tag is missing and the debug to find a null value takes longer than it could have.

@carlosalberto
Copy link
Contributor

Interesting points @bogdandrutu

To me, it feels like an (absolutely beautiful) utopia to try to monitor OpenTelemetry with itself (If I got it right from your comment) - what will you do if you are stuck exporting data, for example? Where can you report such errors as traces/metrics? Also, there's the case of local errors that could easily be diagnosed, without need to do fancy work.

As the description of this PR mentions, Go already such error handler. Would you be against such approach? Asking also as the idea is to have something like this in the Specification itself ;)

Finally, as we already do logging for many errors, what do you think about going for @jkwatson's idea about 'formalizing' the usage of Java logging facility to make this possible? Would you be fine with that?

@carlosalberto
Copy link
Contributor

Ping @bogdandrutu ;)

@MitchellDumovic
Copy link
Contributor Author

After discussion at the spec SIG this week and in the Otel spec issue, it looks like the best approach here is to just leverage the java.util.logging library directly to get the customizable handler here that we want.

From what I understand, we actually don't need to make any code changes for this (besides actually adding additional log messages where they are missing now). However, I'd like to add documentation with examples of how the logger can be customized for those unfamiliar with java.util.logging, which I'd like to do in a separate PR as I think this has diverged sufficiently from this PR's original implementation.

With this, OpenTelemetry authors can continue to use logger.log messages to log errors, warnings, or info, and it is in the user's hands to accept or filter out any errors output.

Does this all sound like a reasonable way to move forward? @carlosalberto @bogdandrutu @anuraaga @jkwatson

@MitchellDumovic
Copy link
Contributor Author

I took a stab at adding some documentation for the java logger here - comments are appreciated!

@jkwatson
Copy link
Contributor

jkwatson commented Aug 6, 2020

With #1501 merged, can we close this for now? If we run into a case that logging really can't handle, we can re-open this idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants