Add customizable global error handler #1419

MitchellDumovic · 2020-07-15T17:53:44Z

This is a POC solution to open-telemetry/opentelemetry-specification#696

Much of the inspiration for this PR came from the respective implementation in Go here

Adds an error handler delegate to OpenTelemetry, which can handle errors through a call to OpenTelemetry.handleError.

The delegate accepts errors in the form of exceptions of type OpenTelemetryException, which most often are constructed at the time of an error by wrapping an exception with some additional information about where it occurred. By default, the delegate logs a warning using the native Java logger. A custom delegate can be registered by providing a ErrorHandlerFactory class, in a similar way that a custom TraceProvider can be registered.

This PR also updates several locations (primarily in various exporters) to handle an error using the global handler. Some locations I have added net-new reporting, and others I have just replaced the log.warn message that previously existed.

linux-foundation-easycla · 2020-07-15T17:53:47Z

The committers are authorized under a signed CLA.

✅ MitchellDumovic (2e7243d)

jkwatson · 2020-07-15T18:25:11Z

I don't think this belongs in the API. The API is reserved for instrumentation concerns. To me, this is very clearly an operator concern, and should only be a part of the SDK.

carlosalberto · 2020-07-16T23:27:58Z

I see the point on why @jkwatson doesn't want it here - in the end, we probably don't event want to have errors reported when the API alone is used. The fact we are consuming such error handler from SDK-level components seem to add weight to that argument.

Personally I'd be fine with adding this very same concept to either OpenTelemetrySdk or a similar global component at the SDK level. Would that work for you @jkwatson?

Also, I'm wondering whether we need the factory for the error. I imagine users setting such handler through code directly (when configuring the SDK itself, such as the java-instrumentation code would do), but I'm not totally sure we need a factory here.

jkwatson · 2020-07-17T19:40:20Z

I'm very happy to have an error handling mechanism baked into the SDK, definitely.

codecov · 2020-07-17T23:16:33Z

Codecov Report

Merging #1419 into master will decrease coverage by 0.05%.
The diff coverage is 90.47%.

@@             Coverage Diff              @@
##             master    #1419      +/-   ##
============================================
- Coverage     92.28%   92.22%   -0.06%     
- Complexity      898      904       +6     
============================================
  Files           114      116       +2     
  Lines          3241     3256      +15     
  Branches        264      264              
============================================
+ Hits           2991     3003      +12     
- Misses          170      171       +1     
- Partials         80       82       +2

Impacted Files	Coverage Δ	Complexity Δ
.../src/main/java/io/opentelemetry/OpenTelemetry.java	`98.21% <ø> (ø)`	`22.00 <0.00> (ø)`
...telemetry/sdk/trace/export/BatchSpanProcessor.java	`94.59% <0.00%> (-1.36%)`	`8.00 <0.00> (ø)`
...elemetry/sdk/trace/export/SimpleSpanProcessor.java	`90.62% <0.00%> (-0.29%)`	`9.00 <0.00> (-1.00)`
...in/java/io/opentelemetry/sdk/OpenTelemetrySdk.java	`100.00% <100.00%> (ø)`	`7.00 <4.00> (+4.00)`
...elemetry/sdk/errorhandler/DefaultErrorHandler.java	`100.00% <100.00%> (ø)`	`3.00 <3.00> (?)`
...metry/sdk/errorhandler/OpenTelemetryException.java	`100.00% <100.00%> (ø)`	`2.00 <2.00> (?)`
...ntelemetry/sdk/trace/export/MultiSpanExporter.java	`100.00% <100.00%> (ø)`	`11.00 <0.00> (-1.00)`
...k/extensions/trace/jaeger/sampler/RateLimiter.java	`94.11% <0.00%> (-5.89%)`	`4.00% <0.00%> (-1.00%)`
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2ec91e1...3cd57e5. Read the comment docs.

MitchellDumovic · 2020-07-17T23:26:02Z

I agree that it makes sense for this to be in the SDK - I've updated this PR to do so, and also took @carlosalberto's suggestion to use a code setter instead of a factory.

thisthat · 2020-07-20T12:03:06Z

sdk/src/main/java/io/opentelemetry/sdk/errorhandler/DefaultErrorHandler.java

@@ -0,0 +1,51 @@
+/*
+ * Copyright 2019, OpenTelemetry Authors


thisthat · 2020-07-20T12:03:17Z

sdk/src/main/java/io/opentelemetry/sdk/errorhandler/ErrorHandler.java

@@ -0,0 +1,27 @@
+/*
+ * Copyright 2019, OpenTelemetry Authors


also here the year

thisthat · 2020-07-20T12:03:33Z

sdk/src/main/java/io/opentelemetry/sdk/errorhandler/OpenTelemetryException.java

@@ -0,0 +1,35 @@
+/*
+ * Copyright 2019, OpenTelemetry Authors


thisthat · 2020-07-20T12:04:25Z

sdk/src/main/java/io/opentelemetry/sdk/errorhandler/DefaultErrorHandler.java

+ * Default implementation of {@link ErrorHandler}. By default, logs the full string of the error
+ * with a level of "Warning".
+ *
+ * @since 0.1.0


Suggested change

* @since 0.1.0

* @since 0.7.0

jkwatson · 2020-07-20T15:44:38Z

sdk/src/main/java/io/opentelemetry/sdk/OpenTelemetrySdk.java

 *
 * @see OpenTelemetry
 */
 @ThreadSafe
 public final class OpenTelemetrySdk {
+  private static final Object mutex = new Object();
+
+  @Nullable private static volatile OpenTelemetrySdk instance;


rather than introducing a whole instance and double-checked locking in the getInstance, why not just have the errorHandler be static, since it's global anyway?

+1 for the handler being static (we don't need anything fancy here, at least for now).

anuraaga · 2020-07-20T15:47:17Z

sdk/src/main/java/io/opentelemetry/sdk/errorhandler/DefaultErrorHandler.java

+  }
+
+  @Override
+  public void handle(OpenTelemetryException e) {


Just curious do we have a story about controlling the default behavior based on the type of exception? A level flag to `OpenTelemetryException?

The context is I am looking at null / arg checks which we should remove to abide by the error handling specs, that the SDK should not throw errors due to incorrect usage by a user. I think for these logging by default would be problematic since if a simple method like .setAttribute was logging it could easily be enough to bring down the app not so unlike an actual exception. But naturally for less common / sporadic errors like export error it would generally be ok to log.

Since this is basically an API for error logging, I'm starting to wonder if adding this additional error handler is overkill. Couldn't users just specify a custom logger if they want special behavior, outside the normal java.util.logging behavior?

That is to say, if we stipulate the all SDK logging will be in the io.opentelemetry.sdk namespace, can't we just use normal logging facilities to handle the global error handler use-case?

The context is I am looking at null / arg checks which we should remove to abide by the error handling specs, that the SDK should not throw errors due to incorrect usage by a user.

I personally think we shouldn't log wrong parameters, if that's what you are asking. Then again, other implementations log such errors (the Python one does, for example). But we should probably discuss that in the Specification call tomorrow.

@jkwatson What if the user wants to do something with the error, other than logging?

Well, the logging Handler implementation can do whatever it wants, right? It's not required to do any actual logging, is it?

And what if the user wants to do something other than simple logging? Probably we should look for cases where simple logging is not the principal use case (to go one way or another).

by 'user', do you mean the SDK author generating the "errors" or, the user of the SDK who is doing something with them? A java.util.logging.Handler, I think, can do anything it wants...it doesn't have to be "simple logging".

So I have mixed feelings about using the Java's logging facility. In theory having a handler is a straightforward, super simple thing to do, albeit limited, while the logging path includes dealing with an entire LogRecord object (and what to expect to handle there? Throwable, message? parameters?), where the user has to override three abstract methods for Handler.

Yeah, I'm not convinced it was the right way to go, but rather than re-invent an internal logging API, I'd rather use something off the shelf. I can also see arguments for not wanting to generate an entire stack trace just to report some sort of SDK behavior...in which case we're getting dangerously close to implementing a logging API. :)

Jumping in with my two-cents here…

Taking a step back, I think the the high-level goal here is that we want it to be easy for users of the SDK to get visibility into any errors that occur during setup or at runtime that prevent the expected tracing or metrics information from being exported.

I think that three things that are very important for achieving this high-level goal are:

(1) Making it easy for the SDK authors to report errors in a standardized way.
(2) Making it easy for the SDK user (not the SDK author) to have control over where errors get logged and which types of errors get logged
(3) Having a sane default so that important errors do not get squashed when someone is trying to set up OpenTelemetry.

I think that both the solution outlined in this PR and using Java’s built-in logging facility technically achieve all of these goals, but I do agree with @carlosalberto that Java’s built-in logging facility potentially adds friction that makes the whole process clunkier to work with. A user has to implement three abstract methods, it’s harder to transmit information about where an error came from and switch on that error, and it’s potentially less obvious to the SDK author that calling logger.warning will be correctly handled and propagated to the right place.

Having said that, I do agree with the notion that we shouldn’t be trying to implement our own logging API from scratch. If we feel like we need a lot of the bells and whistles that the Java logging library provides for our use cases, then I think that we should just go ahead and use that. If we only need a small subset of them, then I think that the overhead of building and maintaining them is worth the increased clarity and ease-of-use that this solution brings.

bogdandrutu

My two cents here...

There is something wrong with OpenTelemetry if an "observability" library goes back to error handling and intensive logging to debug it, why do we have this library then? Why do we ask users to use this library if we then tell them to "handle" our errors.

I know that there is a known problem of how to monitor your monitoring tools, but I feel that we should try to do a smarter thing than this, using metrics, using tracing, and only when something is fatal and we cannot do anything to recover inform the user.

anuraaga · 2020-07-22T03:42:09Z

I agree that we should have great defaults so users shouldn't have to do anything in general. Some users may want control but if the defaults are good enough, maybe not.

Putting on the user hat, if I had the ability to, I would want

Unexpected, recoverable / transient errors logged as warning
Unrecoverable exceptions like validation issues logged at debug, or logged once to warn
Validation issues thrown as exceptions in unit tests.

If otel provided a flag that I could enable in tests for the extra behavior there, then I'd just use it. If there was a error handler, I'd implement my own flag using it. If it was a jul handler, it would take me some more time to understand how to use jul, but I assume it would work too. None of these options feel bad to me. And if there's no way to have a flag for unit tests, I don't think I'd feel significantly bothered though there probably will be points where I wonder why a tag is missing and the debug to find a null value takes longer than it could have.

carlosalberto · 2020-07-22T18:15:49Z

Interesting points @bogdandrutu

To me, it feels like an (absolutely beautiful) utopia to try to monitor OpenTelemetry with itself (If I got it right from your comment) - what will you do if you are stuck exporting data, for example? Where can you report such errors as traces/metrics? Also, there's the case of local errors that could easily be diagnosed, without need to do fancy work.

As the description of this PR mentions, Go already such error handler. Would you be against such approach? Asking also as the idea is to have something like this in the Specification itself ;)

Finally, as we already do logging for many errors, what do you think about going for @jkwatson's idea about 'formalizing' the usage of Java logging facility to make this possible? Would you be fine with that?

carlosalberto · 2020-07-27T11:49:16Z

Ping @bogdandrutu ;)

MitchellDumovic · 2020-07-31T16:32:15Z

After discussion at the spec SIG this week and in the Otel spec issue, it looks like the best approach here is to just leverage the java.util.logging library directly to get the customizable handler here that we want.

From what I understand, we actually don't need to make any code changes for this (besides actually adding additional log messages where they are missing now). However, I'd like to add documentation with examples of how the logger can be customized for those unfamiliar with java.util.logging, which I'd like to do in a separate PR as I think this has diverged sufficiently from this PR's original implementation.

With this, OpenTelemetry authors can continue to use logger.log messages to log errors, warnings, or info, and it is in the user's hands to accept or filter out any errors output.

Does this all sound like a reasonable way to move forward? @carlosalberto @bogdandrutu @anuraaga @jkwatson

MitchellDumovic · 2020-08-04T00:52:00Z

I took a stab at adding some documentation for the java logger here - comments are appreciated!

jkwatson · 2020-08-06T20:16:47Z

With #1501 merged, can we close this for now? If we run into a case that logging really can't handle, we can re-open this idea.

Implement global error handler for OpenTelemetry java

2e7243d

MitchellDumovic requested review from arminru, bogdandrutu, carlosalberto, jkwatson, pavolloffay, thisthat and tylerbenson as code owners July 15, 2020 17:53

fix headers

bf6118c

moved to sdk, removed factory

07ff0d6

MitchellDumovic force-pushed the mitchell/implement-error-handler branch from f9a9a4a to 07ff0d6 Compare July 17, 2020 23:07

jkwatson mentioned this pull request Jul 18, 2020

Makes Export methods async #1428

Merged

thisthat mentioned this pull request Jul 20, 2020

Span not exported to Jaeger Collector #1435

Closed

thisthat approved these changes Jul 20, 2020

View reviewed changes

thisthat reviewed Jul 20, 2020

View reviewed changes

thisthat mentioned this pull request Jul 20, 2020

Remove null checks in Span API. #1433

Merged

jkwatson reviewed Jul 20, 2020

View reviewed changes

anuraaga reviewed Jul 20, 2020

View reviewed changes

address review comments

3cd57e5

bogdandrutu reviewed Jul 22, 2020

View reviewed changes

MitchellDumovic mentioned this pull request Aug 4, 2020

Add documentation about Java Logger for Error Handling #1501

Merged

carlosalberto closed this Aug 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add customizable global error handler #1419

Add customizable global error handler #1419

MitchellDumovic commented Jul 15, 2020

linux-foundation-easycla bot commented Jul 15, 2020 •

edited

Loading

jkwatson commented Jul 15, 2020

carlosalberto commented Jul 16, 2020

jkwatson commented Jul 17, 2020

codecov bot commented Jul 17, 2020 •

edited

Loading

MitchellDumovic commented Jul 17, 2020

thisthat Jul 20, 2020

thisthat Jul 20, 2020

thisthat Jul 20, 2020

thisthat Jul 20, 2020

jkwatson Jul 20, 2020

carlosalberto Jul 20, 2020

anuraaga Jul 20, 2020

jkwatson Jul 20, 2020

carlosalberto Jul 20, 2020

carlosalberto Jul 20, 2020

jkwatson Jul 20, 2020

carlosalberto Jul 21, 2020

jkwatson Jul 21, 2020

carlosalberto Jul 21, 2020

jkwatson Jul 21, 2020

MitchellDumovic Jul 21, 2020

bogdandrutu left a comment

anuraaga commented Jul 22, 2020

carlosalberto commented Jul 22, 2020

carlosalberto commented Jul 27, 2020

MitchellDumovic commented Jul 31, 2020

MitchellDumovic commented Aug 4, 2020

jkwatson commented Aug 6, 2020

Add customizable global error handler #1419

Add customizable global error handler #1419

Conversation

MitchellDumovic commented Jul 15, 2020

linux-foundation-easycla bot commented Jul 15, 2020 • edited Loading

jkwatson commented Jul 15, 2020

carlosalberto commented Jul 16, 2020

jkwatson commented Jul 17, 2020

codecov bot commented Jul 17, 2020 • edited Loading

Codecov Report

MitchellDumovic commented Jul 17, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bogdandrutu left a comment

Choose a reason for hiding this comment

anuraaga commented Jul 22, 2020

carlosalberto commented Jul 22, 2020

carlosalberto commented Jul 27, 2020

MitchellDumovic commented Jul 31, 2020

MitchellDumovic commented Aug 4, 2020

jkwatson commented Aug 6, 2020

linux-foundation-easycla bot commented Jul 15, 2020 •

edited

Loading

codecov bot commented Jul 17, 2020 •

edited

Loading