Merged

Contributor

It seems in other parts of the SDK, e.g. parent close policy, we just expose the raw enum from proto. Should we do the same here? It has "unrecognized", automatically gets new values, etc.

Contributor Author

Didn't notice that, have changed to use raw proto

@@ -0,0 +1,62 @@
/*
* Copyright (C) 2024 Temporal Technologies, Inc. All Rights Reserved.
*
* Copyright (C) 2012-2016 Amazon.com, Inc. or its affiliates. All Rights Reserved.
*
* Modifications copyright (C) 2017 Uber Technologies, Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this material except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package io.temporal.failure;

/**
* Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application

Contributor

Suggested change
- * Mirrors the proto definition for ApplicationErrorCategory. Used to categorize application
+ * Used to categorize application

I don't think we need to mention the proto def here. The docs are for the end user, and that isn't really relevant for them.

Contributor Author

done

* failures.
*
* @see io.temporal.api.enums.v1.ApplicationErrorCategory
*/
public enum ApplicationErrorCategory {
UNSPECIFIED,
/** Expected application error with little/no severity. */
BENIGN,
;

public static ApplicationErrorCategory fromProto(
io.temporal.api.enums.v1.ApplicationErrorCategory protoCategory) {
if (protoCategory == null) {
return UNSPECIFIED;
}
switch (protoCategory) {
case APPLICATION_ERROR_CATEGORY_BENIGN:
return BENIGN;
case APPLICATION_ERROR_CATEGORY_UNSPECIFIED:
case UNRECOGNIZED:
default:
// Fallback unrecognized or unspecified proto values as UNSPECIFIED
return UNSPECIFIED;

Contributor

I think the concept of unrecognized is ideal and different from unspecified (then again, arguably we should be using raw API proto enums and not a new enum here).

Contributor Author

using raw API proto enums now
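
A minimal sketch of the concern raised in this thread, using hypothetical stand-in enums: `WireCategory` plays the role of the generated proto enum and `SdkCategory` the hand-written mirror. Neither name is an SDK type. Under those assumptions, it shows that once a mirror enum folds UNRECOGNIZED into UNSPECIFIED, a round trip can no longer tell the two apart:

```java
// Hypothetical stand-ins for illustration only; not SDK or proto types.
enum WireCategory { UNSPECIFIED, BENIGN, UNRECOGNIZED }

enum SdkCategory { UNSPECIFIED, BENIGN }

final class CategoryRoundTrip {
  private CategoryRoundTrip() {}

  // Mirrors the fallback in the diff: anything that is not BENIGN becomes UNSPECIFIED.
  static SdkCategory fromWire(WireCategory category) {
    return category == WireCategory.BENIGN ? SdkCategory.BENIGN : SdkCategory.UNSPECIFIED;
  }

  static WireCategory toWire(SdkCategory category) {
    return category == SdkCategory.BENIGN ? WireCategory.BENIGN : WireCategory.UNSPECIFIED;
  }

  public static void main(String[] args) {
    WireCategory in = WireCategory.UNRECOGNIZED;
    WireCategory out = toWire(fromWire(in));
    // The original value is not recoverable after the round trip.
    System.out.println(in + " -> " + out);
  }
}
```

Exposing the generated enum directly avoids this mapping step altogether, which appears to be what the thread settled on.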

}
}

public io.temporal.api.enums.v1.ApplicationErrorCategory toProto() {
switch (this) {
case BENIGN:
return io.temporal.api.enums.v1.ApplicationErrorCategory.APPLICATION_ERROR_CATEGORY_BENIGN;
case UNSPECIFIED:
default:
// Fallback to UNSPECIFIED for unknown values
return io.temporal.api.enums.v1.ApplicationErrorCategory
.APPLICATION_ERROR_CATEGORY_UNSPECIFIED;
}
}
}
@@ -51,13 +51,15 @@
  * <li>nonRetryable is set to false
  * <li>details are set to null
  * <li>stack trace is copied from the original exception
+ * <li>category is set to ApplicationErrorCategory.UNSPECIFIED
  * </ul>
  */
  public final class ApplicationFailure extends TemporalFailure {
  private final String type;
  private final Values details;
  private boolean nonRetryable;
  private Duration nextRetryDelay;
+ private final ApplicationErrorCategory category;

  /**
  * New ApplicationFailure with {@link #isNonRetryable()} flag set to false.
@@ -92,7 +94,14 @@ public static ApplicationFailure newFailure(String message, String type, Object.
  */
  public static ApplicationFailure newFailureWithCause(
  String message, String type, @Nullable Throwable cause, Object... details) {
- return new ApplicationFailure(message, type, false, new EncodedValues(details), cause, null);
+ return new ApplicationFailure(
+ message,
+ type,
+ false,
+ new EncodedValues(details),
+ cause,
+ null,
+ ApplicationErrorCategory.UNSPECIFIED);
  }

  /**
@@ -118,7 +127,13 @@ public static ApplicationFailure newFailureWithCauseAndDelay(
  Duration nextRetryDelay,
  Object... details) {
  return new ApplicationFailure(
- message, type, false, new EncodedValues(details), cause, nextRetryDelay);
+ message,
+ type,
+ false,
+ new EncodedValues(details),
+ cause,
+ nextRetryDelay,
+ ApplicationErrorCategory.UNSPECIFIED);
  }

  /**
@@ -153,7 +168,40 @@ public static ApplicationFailure newNonRetryableFailure(
  */
  public static ApplicationFailure newNonRetryableFailureWithCause(
  String message, String type, @Nullable Throwable cause, Object... details) {
- return new ApplicationFailure(message, type, true, new EncodedValues(details), cause, null);
+ return new ApplicationFailure(
+ message,
+ type,
+ true,
+ new EncodedValues(details),
+ cause,
+ null,
+ ApplicationErrorCategory.UNSPECIFIED);
  }
+
+ /**
+ * New ApplicationFailure with a specified category and {@link #isNonRetryable()} flag set to
+ * false.
+ *
+ * <p>Note that this exception still may not be retried by the service if its type is included in
+ * the doNotRetry property of the corresponding retry policy.
+ *
+ * @param message optional error message
+ * @param type error type
+ * @param category the category of the application failure.
+ * @param cause failure cause. Each element of the cause chain will be converted to
+ * ApplicationFailure for transmission across the network if it doesn't extend {@link
+ * TemporalFailure}
+ * @param details optional details about the failure. They are serialized using the same approach
+ * as arguments and results.
+ */
+ public static ApplicationFailure newFailureWithCategory(
+ String message,
+ String type,
+ ApplicationErrorCategory category,
+ @Nullable Throwable cause,
+ Object... details) {
+ return new ApplicationFailure(
+ message, type, false, new EncodedValues(details), cause, null, category);
+ }

  static ApplicationFailure newFromValues(
@@ -162,8 +210,10 @@ static ApplicationFailure newFromValues(
  boolean nonRetryable,
  Values details,
  Throwable cause,
- Duration nextRetryDelay) {
- return new ApplicationFailure(message, type, nonRetryable, details, cause, nextRetryDelay);
+ Duration nextRetryDelay,
+ ApplicationErrorCategory category) {
+ return new ApplicationFailure(
+ message, type, nonRetryable, details, cause, nextRetryDelay, category);
  }

  ApplicationFailure(
@@ -172,12 +222,14 @@ static ApplicationFailure newFromValues(
  boolean nonRetryable,
  Values details,
  Throwable cause,
- Duration nextRetryDelay) {
+ Duration nextRetryDelay,
+ ApplicationErrorCategory category) {
  super(getMessage(message, Objects.requireNonNull(type), nonRetryable), message, cause);
  this.type = type;
  this.details = details;
  this.nonRetryable = nonRetryable;
  this.nextRetryDelay = nextRetryDelay;
+ this.category = category;
  }

  public String getType() {
@@ -210,6 +262,10 @@ public void setNextRetryDelay(Duration nextRetryDelay) {
  this.nextRetryDelay = nextRetryDelay;
  }

+ public ApplicationErrorCategory getApplicationErrorCategory() {
+ return category;
+ }
+
  private static String getMessage(String message, String type, boolean nonRetryable) {
  return (Strings.isNullOrEmpty(message) ? "" : "message='" + message + "', ")
  + "type='"
@@ -218,4 +274,43 @@ private static String getMessage(String message, String type, boolean nonRetryab
  + ", nonRetryable="
  + nonRetryable;
  }
+
+ public static boolean isBenignApplicationFailure(@Nullable Throwable t) {

Contributor

Easier to justify than the Go equivalent I suppose, but I still think this helper may not be needed and may be more confusing than it's worth

Contributor Author

Moved to an internal failure utils file, it's convenient to have imo

+ if (t == null) {
+ return false;
+ }
+
+ if (t instanceof ApplicationFailure
+ && ((ApplicationFailure) t).getApplicationErrorCategory()
+ == ApplicationErrorCategory.BENIGN) {
+ return true;
+ }
+
+ // Handle WorkflowExecutionException, which wraps a protobuf Failure
+ if (t instanceof io.temporal.internal.worker.WorkflowExecutionException) {
+ io.temporal.api.failure.v1.Failure failure =
+ ((io.temporal.internal.worker.WorkflowExecutionException) t).getFailure();
+ if (failure.hasApplicationFailureInfo()
+ && failure.getApplicationFailureInfo().getCategory()
+ == io.temporal.api.enums.v1.ApplicationErrorCategory
+ .APPLICATION_ERROR_CATEGORY_BENIGN) {
+ return true;
+ }
+ }
+
+ // Handle ActivityFailure, which wraps the actual ApplicationFailure
+ if (t instanceof io.temporal.failure.ActivityFailure) {

Contributor

I don't think we want to expose this kind of logic to users. A user checking whether an in-workflow-thrown application failure is benign may not want to have it match when an activity raises it. I think if a user wants to check a category for whatever reason, they can.

Contributor Author

Agree, moved to an internal util file

+ Throwable cause = t.getCause();
+ boolean result = cause != null && isBenignApplicationFailure(cause);
+ return result;
+ }
+
+ // Check the immediate cause.
+ Throwable cause = t.getCause();
+ boolean result =
+ cause != null
+ && cause instanceof ApplicationFailure
+ && ((ApplicationFailure) cause).getApplicationErrorCategory()
+ == ApplicationErrorCategory.BENIGN;
+ return result;
+ }
  }
@@ -106,7 +106,8 @@ private RuntimeException failureToExceptionImpl(Failure failure, DataConverter d
  cause,
  info.hasNextRetryDelay()
  ? ProtobufTimeUtils.toJavaDuration(info.getNextRetryDelay())
- : null);
+ : null,
+ ApplicationErrorCategory.fromProto(info.getCategory()));
  }
  case TIMEOUT_FAILURE_INFO:
  {
@@ -146,13 +147,14 @@ private RuntimeException failureToExceptionImpl(Failure failure, DataConverter d
  info.hasLastHeartbeatDetails()
  ? Optional.of(info.getLastHeartbeatDetails())
  : Optional.empty();
- return new ApplicationFailure(
+ return ApplicationFailure.newFromValues(
  failure.getMessage(),
  "ResetWorkflow",
  false,
  new EncodedValues(details, dataConverter),
  cause,
- null);
+ null,
+ ApplicationErrorCategory.UNSPECIFIED);
  }
  case ACTIVITY_FAILURE_INFO:
  {
@@ -214,7 +216,8 @@ private RuntimeException failureToExceptionImpl(Failure failure, DataConverter d
  false,
  new EncodedValues(Optional.empty(), dataConverter),
  cause,
- null);
+ null,
+ ApplicationErrorCategory.UNSPECIFIED);
  }
  }

@@ -260,7 +263,8 @@ private Failure exceptionToFailure(Throwable throwable) {
  ApplicationFailureInfo.Builder info =
  ApplicationFailureInfo.newBuilder()
  .setType(ae.getType())
- .setNonRetryable(ae.isNonRetryable());
+ .setNonRetryable(ae.isNonRetryable())
+ .setCategory(ae.getApplicationErrorCategory().toProto());
  Optional<Payloads> details = ((EncodedValues) ae.getDetails()).toPayloads();
  if (details.isPresent()) {
  info.setDetails(details.get());
@@ -352,7 +356,10 @@ private Failure exceptionToFailure(Throwable throwable) {
  ApplicationFailureInfo.Builder info =
  ApplicationFailureInfo.newBuilder()
  .setType(throwable.getClass().getName())
- .setNonRetryable(false);
+ .setNonRetryable(false)
+ .setCategory(
+ io.temporal.api.enums.v1.ApplicationErrorCategory
+ .APPLICATION_ERROR_CATEGORY_UNSPECIFIED);
  failure.setApplicationFailureInfo(info);
  }
  return failure.build();
@@ -36,6 +36,7 @@
  import io.temporal.common.interceptors.ActivityInboundCallsInterceptor.ActivityOutput;
  import io.temporal.common.interceptors.Header;
  import io.temporal.common.interceptors.WorkerInterceptor;
+ import io.temporal.failure.ApplicationFailure;
  import io.temporal.internal.worker.ActivityTaskHandler;
  import io.temporal.payload.context.ActivitySerializationContext;
  import io.temporal.serviceclient.CheckedExceptionWrapper;
@@ -122,6 +123,14 @@ public ActivityTaskHandler.Result execute(ActivityInfoInternal info, Scope metri
  info.getActivityId(),
  info.getActivityType(),
  info.getAttempt());
+ } else if (ApplicationFailure.isBenignApplicationFailure(ex)) {
+ log.debug(
+ "{} failure. ActivityId={}, activityType={}, attempt={}",
+ local ? "Local activity" : "Activity",
+ info.getActivityId(),
+ info.getActivityType(),
+ info.getAttempt(),
+ ex);
  } else {
  log.warn(
  "{} failure. ActivityId={}, activityType={}, attempt={}",
@@ -35,6 +35,7 @@
  import io.temporal.common.interceptors.WorkerInterceptor;
  import io.temporal.common.metadata.POJOActivityImplMetadata;
  import io.temporal.common.metadata.POJOActivityMethodMetadata;
+ import io.temporal.failure.ApplicationFailure;
  import io.temporal.internal.activity.ActivityTaskExecutors.ActivityTaskExecutor;
  import io.temporal.internal.common.env.ReflectionUtils;
  import io.temporal.internal.worker.ActivityTask;
@@ -209,11 +210,13 @@ static ActivityTaskHandler.Result mapToActivityFailure(
  Scope ms =
  metricsScope.tagged(
  ImmutableMap.of(MetricsTag.EXCEPTION, exception.getClass().getSimpleName()));
- if (isLocalActivity) {
- ms.counter(MetricsType.LOCAL_ACTIVITY_EXEC_FAILED_COUNTER).inc(1);
- ms.counter(MetricsType.LOCAL_ACTIVITY_FAILED_COUNTER).inc(1);
- } else {
- ms.counter(MetricsType.ACTIVITY_EXEC_FAILED_COUNTER).inc(1);
+ if (!ApplicationFailure.isBenignApplicationFailure(exception)) {
+ if (isLocalActivity) {
+ ms.counter(MetricsType.LOCAL_ACTIVITY_EXEC_FAILED_COUNTER).inc(1);
+ ms.counter(MetricsType.LOCAL_ACTIVITY_FAILED_COUNTER).inc(1);
+ } else {
+ ms.counter(MetricsType.ACTIVITY_EXEC_FAILED_COUNTER).inc(1);
+ }
  }
  Failure failure = dataConverter.exceptionToFailure(exception);
  RespondActivityTaskFailedRequest.Builder result =
@@ -33,6 +33,7 @@
  import io.temporal.api.query.v1.WorkflowQuery;
  import io.temporal.api.update.v1.Input;
  import io.temporal.api.update.v1.Request;
+ import io.temporal.failure.ApplicationFailure;
  import io.temporal.failure.CanceledFailure;
  import io.temporal.internal.common.ProtobufTimeUtils;
  import io.temporal.internal.common.UpdateMessage;
@@ -153,7 +154,9 @@ private void completeWorkflow(@Nullable WorkflowExecutionException failure) {
  metricsScope.counter(MetricsType.WORKFLOW_CANCELED_COUNTER).inc(1);
  } else if (failure != null) {
  workflowStateMachines.failWorkflow(failure.getFailure());
- metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
+ if (!ApplicationFailure.isBenignApplicationFailure(failure)) {

Contributor

I do not think you want to swallow activities that fail because of benign exceptions. They may reach a max attempts or something and throw this out. Just because an activity has a benign exception it doesn't want to be in telemetry doesn't mean a workflow doesn't want it in telemetry.

@THardy98 (Contributor Author), Apr 24, 2025

Is this not the same pattern in:
temporalio/sdk-go#1925
temporalio/sdk-rust#905
Unless I'm mistaken.

What about if a user throws a benign exception in workflow code, not in an activity?

@cretz (Contributor), Apr 24, 2025

Sorry, I was unclear here. This is basically me restating https://github.com/temporalio/sdk-java/pull/2485/files#r2054385164.

> What about if a user throws a benign exception in workflow code, not in an activity?

That is fine to skip metrics, but we don't want to skip metrics if the exception in the workflow code is an activity failure with a benign cause. There is a difference between not recording telemetry on benign exceptions and not recording telemetry on an exception that has a benign cause.

Here are some scenarios:

  • Activity throws benign exception - do not record telemetry
  • Workflow throws benign exception - do not record telemetry
  • Activity or child fails with benign exception from workflow POV (thrown from activity and it is out of max attempts or something) - still record telemetry
  • Activity uses a Temporal client that runs a workflow and that workflow throws benign exception - still record telemetry
  • In Python, an activity has a general catch that caught the benign exception, tried to do cleanup, and raised its own exception (which automatically sets the cause as the current catch exception) - still record telemetry

In certain languages, the last three scenarios are currently treated as benign, and IMO they shouldn't be. Basically, change isBenignApplicationFailure to stop checking causes. Throwing a benign exception is benign, but receiving a benign exception wrapped in another exception is not.

> Is this not the same pattern in:
> temporalio/sdk-go#1925
> temporalio/sdk-rust#905
> Unless I'm mistaken.

Not exactly because Go stops at the first application error, not every application error no matter the depth. Also, this isn't recursively checking causes to arbitrary depths like Go. Still, looking back on Go, we should only apply it to the current error and not check cause or errors.As IMO. I added a comment at temporalio/sdk-go#1925 (comment). Didn't check Core, but I think the same applies there too.

If you think about how users use benign exceptions (as control flow), they aren't supposed to wrap them. Open to discussion here.
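
A minimal sketch of the rule proposed above (benign where thrown, not where caught), assuming hypothetical stand-in types: `SketchCategory`, `SketchAppFailure`, and `SketchActivityWrapper` are illustrative names, not SDK classes. The check looks only at the throwable itself and never consults `getCause()`:

```java
// Hypothetical stand-ins for illustration only; not SDK types.
enum SketchCategory { UNSPECIFIED, BENIGN }

class SketchAppFailure extends RuntimeException {
  final SketchCategory category;

  SketchAppFailure(String message, SketchCategory category) {
    super(message);
    this.category = category;
  }
}

// Stand-in for the wrapper a workflow caller sees when an activity fails.
class SketchActivityWrapper extends RuntimeException {
  SketchActivityWrapper(Throwable cause) {
    super("activity failed", cause);
  }
}

final class ImmediateBenignCheck {
  private ImmediateBenignCheck() {}

  // Inspects only the throwable itself; instanceof also makes this null-safe.
  static boolean isBenign(Throwable t) {
    return t instanceof SketchAppFailure
        && ((SketchAppFailure) t).category == SketchCategory.BENIGN;
  }

  public static void main(String[] args) {
    SketchAppFailure thrown = new SketchAppFailure("not found", SketchCategory.BENIGN);
    // Implementation side: the failure as it was thrown.
    System.out.println(isBenign(thrown));
    // Caller side: the same failure wrapped by another exception.
    System.out.println(isBenign(new SketchActivityWrapper(thrown)));
  }
}
```

With this shape, the first two scenarios in the list above match the check and the last three do not, because in each of those the benign failure only appears as a cause.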

Contributor Author

Won't they inherently be wrapped as other failure types though? Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

@THardy98 (Contributor Author), Apr 24, 2025

I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I thought a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

@cretz (Contributor), Apr 25, 2025

> Won't they inherently be wrapped as other failure types though?

No. They are only benign where they are thrown (impl side) and they are usually only wrapped where they are caught (caller side).

> Could be wrong here, but for example, an ApplicationFailure thrown from an activity, would it not always (or at least in the general case) be wrapped as an ActivityFailure?

Not on the activity side, only on the workflow caller side. But if an activity fails because of a benign error, even though it is benign on the activity side, if that activity failure bubbles out of a workflow, that is not benign.

> I agree in that it doesn't seem relevant/useful to check for benign causes at an arbitrary depth, but I thought a depth of 0 (immediate) or 1 (thinly wrapped) would be sufficient.

Even 1 depth treats a benign activity failure as benign on the workflow side which is incorrect IMO (not to mention Go side is arbitrary depth). I just checked Core, Core does it right in that it doesn't recurse.

> In Java SDK, how would one receive an ApplicationFailure that is not wrapped?

What do you mean by "one"? If you mean how does our SDK internals receive an unwrapped thrown error, it is where you have changes in ActivityTaskExecutors.java. If you mean how does a workflow caller receive it unwrapped, they don't, that's the point.

An error is only benign where it is thrown impl side, not where it is caught caller side.

Contributor Author

Thanks for the clarification, that all sounds reasonable (and saved me some digging :))
I've removed the nested failure checks, only check immediate failures.
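
As a rough illustration of where the thread landed, the sketch below gates a failure counter on an immediate-only benign check. All names are hypothetical stand-ins (`GateCategory`, `GateFailure`, `FailureCounterGate`), and an `AtomicLong` stands in for the metrics counter; none of this is the SDK's internal code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins for illustration only; not SDK types.
enum GateCategory { UNSPECIFIED, BENIGN }

class GateFailure extends RuntimeException {
  final GateCategory category;

  GateFailure(String message, GateCategory category) {
    super(message);
    this.category = category;
  }
}

final class FailureCounterGate {
  // Stand-in for a "failed" counter on a metrics scope.
  final AtomicLong failed = new AtomicLong();

  // Counts every failure except one that is itself categorized as benign.
  void record(Throwable failure) {
    boolean benign =
        failure instanceof GateFailure
            && ((GateFailure) failure).category == GateCategory.BENIGN;
    if (!benign) {
      failed.incrementAndGet();
    }
  }

  public static void main(String[] args) {
    FailureCounterGate gate = new FailureCounterGate();
    GateFailure benign = new GateFailure("expected", GateCategory.BENIGN);
    gate.record(benign); // skipped: benign where it was thrown
    gate.record(new RuntimeException("activity failed", benign)); // counted: benign only as a cause
    System.out.println(gate.failed.get());
  }
}
```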

+ metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
+ }
  } else {
  ContinueAsNewWorkflowExecutionCommandAttributes attributes =
  context.getContinueAsNewOnCompletion();
@@ -41,6 +41,7 @@
  import io.temporal.api.query.v1.WorkflowQueryResult;
  import io.temporal.api.workflowservice.v1.GetSystemInfoResponse;
  import io.temporal.api.workflowservice.v1.PollWorkflowTaskQueueResponseOrBuilder;
+ import io.temporal.failure.ApplicationFailure;
  import io.temporal.internal.Config;
  import io.temporal.internal.common.SdkFlag;
  import io.temporal.internal.common.UpdateMessage;
@@ -266,12 +267,15 @@ private void applyServerHistory(long lastEventId, WorkflowHistoryIterator histor
  implementationOptions.getFailWorkflowExceptionTypes();
  for (Class<? extends Throwable> failType : failTypes) {
  if (failType.isAssignableFrom(e.getClass())) {
- metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
+ if (!ApplicationFailure.isBenignApplicationFailure(e)) {
+ metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
+ }
  throw new WorkflowExecutionException(
  workflow.getWorkflowContext().mapWorkflowExceptionToFailure(e));
  }
  }
- if (e instanceof WorkflowExecutionException) {
+ if (e instanceof WorkflowExecutionException
+ && !ApplicationFailure.isBenignApplicationFailure(e)) {
  metricsScope.counter(MetricsType.WORKFLOW_FAILED_COUNTER).inc(1);
  }
  throw wrap(e);