-
Notifications
You must be signed in to change notification settings - Fork 219
Add application err category #2485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 7 commits
4fd8dd9
71c11a4
1858f95
4e8a98e
433c066
6875557
29ff5dc
44de349
a64f646
d22a2ad
823848e
ab0a99a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,62 @@ | ||||||
| /* | ||||||
| * Copyright (C) 2024 Temporal Technologies, Inc. All Rights Reserved. | ||||||
| * | ||||||
| * Copyright (C) 2012-2016 Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||||||
| * | ||||||
| * Modifications copyright (C) 2017 Uber Technologies, Inc. | ||||||
| * | ||||||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||||||
| * you may not use this material except in compliance with the License. | ||||||
| * You may obtain a copy of the License at | ||||||
| * | ||||||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| * | ||||||
| * Unless required by applicable law or agreed to in writing, software | ||||||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| * See the License for the specific language governing permissions and | ||||||
| * limitations under the License. | ||||||
| */ | ||||||
|
|
||||||
| package io.temporal.failure; | ||||||
|
|
||||||
| /** | ||||||
| * Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
I don't think we need to mention the proto def. here the docs are for the end user and that isn't really relevant for them
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. done |
||||||
| * failures. | ||||||
| * | ||||||
| * @see io.temporal.api.enums.v1.ApplicationErrorCategory | ||||||
| */ | ||||||
| public enum ApplicationErrorCategory { | ||||||
| UNSPECIFIED, | ||||||
| /** Expected application error with little/no severity. */ | ||||||
| BENIGN, | ||||||
| ; | ||||||
|
|
||||||
| public static ApplicationErrorCategory fromProto( | ||||||
| io.temporal.api.enums.v1.ApplicationErrorCategory protoCategory) { | ||||||
| if (protoCategory == null) { | ||||||
| return UNSPECIFIED; | ||||||
| } | ||||||
| switch (protoCategory) { | ||||||
| case APPLICATION_ERROR_CATEGORY_BENIGN: | ||||||
| return BENIGN; | ||||||
| case APPLICATION_ERROR_CATEGORY_UNSPECIFIED: | ||||||
| case UNRECOGNIZED: | ||||||
| default: | ||||||
| // Fallback unrecognized or unspecified proto values as UNSPECIFIED | ||||||
| return UNSPECIFIED; | ||||||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the concept of unrecognized is ideal and different from unspecified (then again though, arguably we should be using raw API proto enums and not a new enumerate here).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. using raw API proto enums now |
||||||
| } | ||||||
| } | ||||||
|
|
||||||
| public io.temporal.api.enums.v1.ApplicationErrorCategory toProto() { | ||||||
| switch (this) { | ||||||
| case BENIGN: | ||||||
| return io.temporal.api.enums.v1.ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_BENIGN; | ||||||
| case UNSPECIFIED: | ||||||
| default: | ||||||
| // Fallback to UNSPECIFIED for unknown values | ||||||
| return io.temporal.api.enums.v1.ApplicationErrorCategory | ||||||
| .APPLICATION_ERROR_CATEGORY_UNSPECIFIED; | ||||||
| } | ||||||
| } | ||||||
| } | ||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,13 +51,15 @@ | |
| * <li>nonRetryable is set to false | ||
| * <li>details are set to null | ||
| * <li>stack trace is copied from the original exception | ||
| * <li>stack category is set to ApplicationErrorCategory.UNSPECIFIED | ||
| * </ul> | ||
| */ | ||
| public final class ApplicationFailure extends TemporalFailure { | ||
| private final String type; | ||
| private final Values details; | ||
| private boolean nonRetryable; | ||
| private Duration nextRetryDelay; | ||
| private final ApplicationErrorCategory category; | ||
|
|
||
| /** | ||
| * New ApplicationFailure with {@link #isNonRetryable()} flag set to false. | ||
|
|
@@ -92,7 +94,14 @@ public static ApplicationFailure newFailure(String message, String type, Object. | |
| */ | ||
| public static ApplicationFailure newFailureWithCause( | ||
| String message, String type, @Nullable Throwable cause, Object... details) { | ||
| return new ApplicationFailure(message, type, false, new EncodedValues(details), cause, null); | ||
| return new ApplicationFailure( | ||
| message, | ||
| type, | ||
| false, | ||
| new EncodedValues(details), | ||
| cause, | ||
| null, | ||
| ApplicationErrorCategory.UNSPECIFIED); | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -118,7 +127,13 @@ public static ApplicationFailure newFailureWithCauseAndDelay( | |
| Duration nextRetryDelay, | ||
| Object... details) { | ||
| return new ApplicationFailure( | ||
| message, type, false, new EncodedValues(details), cause, nextRetryDelay); | ||
| message, | ||
| type, | ||
| false, | ||
| new EncodedValues(details), | ||
| cause, | ||
| nextRetryDelay, | ||
| ApplicationErrorCategory.UNSPECIFIED); | ||
| } | ||
|
|
||
| /** | ||
|
|
@@ -153,7 +168,40 @@ public static ApplicationFailure newNonRetryableFailure( | |
| */ | ||
| public static ApplicationFailure newNonRetryableFailureWithCause( | ||
| String message, String type, @Nullable Throwable cause, Object... details) { | ||
| return new ApplicationFailure(message, type, true, new EncodedValues(details), cause, null); | ||
| return new ApplicationFailure( | ||
| message, | ||
| type, | ||
| true, | ||
| new EncodedValues(details), | ||
| cause, | ||
| null, | ||
| ApplicationErrorCategory.UNSPECIFIED); | ||
| } | ||
|
|
||
| /** | ||
| * New ApplicationFailure with a specified category and {@link #isNonRetryable()} flag set to | ||
| * false. | ||
| * | ||
| * <p>Note that this exception still may not be retried by the service if its type is included in | ||
| * the doNotRetry property of the correspondent retry policy. | ||
| * | ||
| * @param message optional error message | ||
| * @param type error type | ||
| * @param category the category of the application failure. | ||
| * @param cause failure cause. Each element of the cause chain will be converted to | ||
| * ApplicationFailure for network transmission across network if it doesn't extend {@link | ||
| * TemporalFailure} | ||
| * @param details optional details about the failure. They are serialized using the same approach | ||
| * as arguments and results. | ||
| */ | ||
| public static ApplicationFailure newFailureWithCategory( | ||
| String message, | ||
| String type, | ||
| ApplicationErrorCategory category, | ||
| @Nullable Throwable cause, | ||
| Object... details) { | ||
| return new ApplicationFailure( | ||
| message, type, false, new EncodedValues(details), cause, null, category); | ||
| } | ||
|
|
||
| static ApplicationFailure newFromValues( | ||
|
|
@@ -162,8 +210,10 @@ static ApplicationFailure newFromValues( | |
| boolean nonRetryable, | ||
| Values details, | ||
| Throwable cause, | ||
| Duration nextRetryDelay) { | ||
| return new ApplicationFailure(message, type, nonRetryable, details, cause, nextRetryDelay); | ||
| Duration nextRetryDelay, | ||
| ApplicationErrorCategory category) { | ||
| return new ApplicationFailure( | ||
| message, type, nonRetryable, details, cause, nextRetryDelay, category); | ||
| } | ||
|
|
||
| ApplicationFailure( | ||
|
|
@@ -172,12 +222,14 @@ static ApplicationFailure newFromValues( | |
| boolean nonRetryable, | ||
| Values details, | ||
| Throwable cause, | ||
| Duration nextRetryDelay) { | ||
| Duration nextRetryDelay, | ||
| ApplicationErrorCategory category) { | ||
| super(getMessage(message, Objects.requireNonNull(type), nonRetryable), message, cause); | ||
| this.type = type; | ||
| this.details = details; | ||
| this.nonRetryable = nonRetryable; | ||
| this.nextRetryDelay = nextRetryDelay; | ||
| this.category = category; | ||
| } | ||
|
|
||
| public String getType() { | ||
|
|
@@ -210,6 +262,10 @@ public void setNextRetryDelay(Duration nextRetryDelay) { | |
| this.nextRetryDelay = nextRetryDelay; | ||
| } | ||
|
|
||
| public ApplicationErrorCategory getApplicationErrorCategory() { | ||
| return category; | ||
| } | ||
|
|
||
| private static String getMessage(String message, String type, boolean nonRetryable) { | ||
| return (Strings.isNullOrEmpty(message) ? "" : "message='" + message + "', ") | ||
| + "type='" | ||
|
|
@@ -218,4 +274,43 @@ private static String getMessage(String message, String type, boolean nonRetryab | |
| + ", nonRetryable=" | ||
| + nonRetryable; | ||
| } | ||
|
|
||
| public static boolean isBenignApplicationFailure(@Nullable Throwable t) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Easier to justify than the Go equivalent I suppose, but I still think this helper may not be needed and may be more confusing than it's worth
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Moved to an internal failure utils file, it's convenient to have imo |
||
| if (t == null) { | ||
| return false; | ||
| } | ||
|
|
||
| if (t instanceof ApplicationFailure | ||
| && ((ApplicationFailure) t).getApplicationErrorCategory() | ||
| == ApplicationErrorCategory.BENIGN) { | ||
| } | ||
|
|
||
| // Handle WorkflowExecutionException, which wraps a protobuf Failure | ||
| if (t instanceof io.temporal.internal.worker.WorkflowExecutionException) { | ||
| io.temporal.api.failure.v1.Failure failure = | ||
| ((io.temporal.internal.worker.WorkflowExecutionException) t).getFailure(); | ||
| if (failure.hasApplicationFailureInfo() | ||
| && failure.getApplicationFailureInfo().getCategory() | ||
| == io.temporal.api.enums.v1.ApplicationErrorCategory | ||
| .APPLICATION_ERROR_CATEGORY_BENIGN) { | ||
| return true; | ||
| } | ||
| } | ||
|
|
||
| // Handle ActivityFailure, which wraps the actual ApplicationFailure | ||
| if (t instanceof io.temporal.failure.ActivityFailure) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't think we want to expose this kind of logic to users. A user checking whether an in-workflow-thrown application failure is benign may not want to have it match when an activity raises it. I think if a user wants to check a category for whatever reason, they can.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agree, moved to an internal util file |
||
| Throwable cause = t.getCause(); | ||
| boolean result = cause != null && isBenignApplicationFailure(cause); | ||
| return result; | ||
| } | ||
|
|
||
| // Check the immediate cause. | ||
| Throwable cause = t.getCause(); | ||
| boolean result = | ||
| cause != null | ||
| && cause instanceof ApplicationFailure | ||
| && ((ApplicationFailure) cause).getApplicationErrorCategory() | ||
| == ApplicationErrorCategory.BENIGN; | ||
| return result; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -33,6 +33,7 @@ | |
| import io.temporal.api.query.v1.WorkflowQuery; | ||
| import io.temporal.api.update.v1.Input; | ||
| import io.temporal.api.update.v1.Request; | ||
| import io.temporal.failure.ApplicationFailure; | ||
| import io.temporal.failure.CanceledFailure; | ||
| import io.temporal.internal.common.ProtobufTimeUtils; | ||
| import io.temporal.internal.common.UpdateMessage; | ||
|
|
@@ -153,7 +154,9 @@ private void completeWorkflow(@Nullable WorkflowExecutionException failure) { | |
| metricsScope.counter(MetricsType.WORKFLOW_CANCELED_COUNTER).inc(1); | ||
| } else if (failure != null) { | ||
| workflowStateMachines.failWorkflow(failure.getFailure()); | ||
| metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1); | ||
| if (!ApplicationFailure.isBenignApplicationFailure(failure)) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do not think you want to swallow activities that fail because of benign exceptions. They may reach a max attempts or something and throw this out. Just because an activity has a benign exception it doesn't want to be in telemetry doesn't mean a workflow doesn't want it in telemetry.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is this not the same pattern in: What about if a user throws a benign exception in workflow code, not in an activity?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry, I was unclear here. This is basically me restating https://github.com/temporalio/sdk-java/pull/2485/files#r2054385164.
That is fine to skip metrics, but we don't want to skip metrics if the exception in the workflow code is an activity failure with a benign cause. There is a difference between not recording telemetry on benign exceptions and not recording telemetry on an exception that has a benign cause. Here are some scenarios:
In the last 3 bullets on certain languages, they will be treated as benign and shouldn't be IMO. Basically, change
Not exactly because Go stops at the first application error, not every application error no matter the depth. Also, this isn't recursively checking causes to arbitrary depths like Go. Still, looking back on Go, we should only apply it to the current error and not check cause or If you think about how users use benign exceptions (as control flow), they aren't supposed to wrap them. Open to discussion here.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Won't they inherently be wrapped as other failure types though? Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I though a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient. In Java SDK, how would one receive an ApplicationFailure that is not wrapped?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
No. They are only benign where they are thrown (impl side) and they are usually only wrapped where they are caught (caller side).
Not on the activity side, only on the workflow caller side. But if an activity fails because of a benign error, even though it is benign on the activity side, if that activity failure bubbles out of a workflow, that is not benign.
Even 1 depth treats a benign activity failure as benign on the workflow side which is incorrect IMO (not to mention Go side is arbitrary depth). I just checked Core, Core does it right in that it doesn't recurse.
What do you mean by "one"? If you mean how does our SDK internals receive an unwrapped thrown error, it is where you have changes in An error is only benign where it is thrown impl side, not where it is caught caller side.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the clarification, that all sounds reasonable (and saved me some digging :)) |
||
| metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1); | ||
| } | ||
| } else { | ||
| ContinueAsNewWorkflowExecutionCommandAttributes attributes = | ||
| context.getContinueAsNewOnCompletion(); | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems in other parts of the SDK, e.g. parent close policy, we just expose the raw enum from proto. Should we do the same here? It has "unrecognized", automatically gets new values, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't notice that, have changed to use raw proto