Distributed Background Jobs: Catch exceptions in job loop to improve application resilience by nikolajlauridsen · Pull Request #21099 · umbraco/Umbraco-CMS

nikolajlauridsen · 2025-12-09T12:50:33Z

Summary

- Catches exceptions thrown during distributed background job execution instead of letting them crash the application
- Logs the exception and continues processing the next job iteration
- Preserves OperationCanceledException behavior to allow graceful shutdown

Problem

When a BackgroundService throws an exception (e.g., database lock timeout), the ASP.NET Core host catches the exception, logs it, and exits gracefully. This is problematic for applications where continuous uptime is critical, as transient errors like network timeouts or temporary database issues would bring down the entire application.

Solution

Instead of letting exceptions propagate up and crash the application, the job loop now catches exceptions and logs them, allowing the background service to continue running. This improves SLAs by keeping the application running even when individual job iterations fail.

Key changes:

- Added try-catch inside the job execution loop
- OperationCanceledException is re-thrown to allow normal graceful shutdown via cancellation token
- All other exceptions are logged and the loop continues to the next iteration
- Removed the previous approach of setting Environment.ExitCode = 1 before re-throwing
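
The changes listed above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the actual code from the PR: `ResilientLoopSketch`, `RunLoopAsync`, `runJob` and `logError` are hypothetical names, the use of `PeriodicTimer` is an assumption, and a plain `Action<Exception>` stands in for the real logger.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ResilientLoopSketch
{
    // Hypothetical stand-in for the hosted service's job loop.
    // runJob and logError are illustrative parameters, not Umbraco APIs.
    public static async Task RunLoopAsync(
        Func<Task> runJob,
        Action<Exception> logError,
        TimeSpan period,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(period);

        // WaitForNextTickAsync throws OperationCanceledException once the
        // token is cancelled, which ends the loop during shutdown.
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            try
            {
                await runJob();
            }
            catch (OperationCanceledException)
            {
                // Re-throw so cancellation still results in a graceful shutdown.
                throw;
            }
            catch (Exception ex)
            {
                // Log and carry on; the next timer tick tries again.
                logError(ex);
            }
        }
    }
}
```

With this shape, a transient failure in one iteration only costs that iteration, while cancellation still ends the loop.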

Test Plan

Temporarily modify DistributedBackgroundJobHostedService.RunRunnableJob() to throw an exception:

```csharp
private async Task RunRunnableJob()
{
    // Temporary: force a failure to verify that the loop logs it and keeps running.
    throw new Exception("Test exception for resilience verification");
    // ... rest of the method
}
```

Run the application:
```
dotnet run --project src/Umbraco.Web.UI
```
Verify:
- The exception is logged with LogError
- The application continues running (doesn't crash)
- The background job loop continues processing on the next timer tick

Related Issues

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR addresses a critical container orchestration issue where unhandled exceptions in the DistributedBackgroundJobHostedService cause the application to exit with a success code (0), preventing orchestrators like Kubernetes and Docker Swarm from detecting failures and triggering restart policies. The solution explicitly sets Environment.ExitCode = 1 before re-throwing exceptions in the background service's ExecuteAsync method.

Key changes:

- Added exception handler to set non-zero exit code before propagating unhandled exceptions
- Removed unused System.Diagnostics import
- Removed empty XML documentation comments for constructor parameters
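
For context, the exit-code approach described in this overview (later replaced in this PR) might have looked something like the sketch below. It is an illustration under assumptions, not the original commit: `ExitCodeSketch`, `RunWithExitCodeAsync` and `execute` are hypothetical names, and whether the original handler treated cancellation differently is not stated here.

```csharp
using System;
using System.Threading.Tasks;

public static class ExitCodeSketch
{
    // Hypothetical wrapper around the body of ExecuteAsync.
    public static async Task RunWithExitCodeAsync(Func<Task> execute)
    {
        try
        {
            await execute();
        }
        catch (Exception)
        {
            // Mark the process as failed so an orchestrator sees a non-zero
            // exit code, then let the exception propagate to the host.
            Environment.ExitCode = 1;
            throw;
        }
    }
}
```

The trade-off is that the application still stops; it only reports the failure more accurately.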

AndyButland

Can confirm the test steps work as described and the solution looks sensible to me. @lauraneto is also mid-way through looking at this so will leave to her to approve or raise any questions.

lauraneto

@nikolajlauridsen I see that this happens when there are errors in the distributed job service, like failing to acquire a lock.
While this ensures that a non successful status code is set in case of error, I feel like this is addressing a symptom instead of the root cause.

Ideally the application shouldn't crash or reach this state at all during a failure like this.
I think we should instead be looking into how we can make the service more resilient, and what we want the behavior to be in case an error still happens... Is the application crashing expected behavior?

nikolajlauridsen · 2025-12-10T12:02:10Z

I've made a separate PR to address some issues in the distributed background service to harden it: #21100.

If a specific implementation of a distributed background job fails, it doesn't crash the app—that's caught in RunRunnableJob. The app will only crash in case of a total failure of the DistributedBackgroundJobHostedService. This is the expected behaviour; from dotnet, at least, if a BackgroundService stops, it stops the application.

I agree that the application shouldn't reach this state, but I'm a bit hesitant to continue and ignore the error if it comes up. That way, distributed background jobs might be entirely broken, and you wouldn't know, because we'd just keep restarting the RunRunnableJob loop. So I think it's better to fail fast and fix any locking issues, but I'm open to other approaches.

nikolajlauridsen · 2025-12-10T13:40:32Z

After some discussion, we've decided instead to swallow the error, to stop the app from crashing and avoid downtime.

This is a better approach, thanks @lauraneto 🙌

AndyButland

Can confirm this "logs and continues" now:

```
[15:21:04 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
   at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
[15:21:09 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
   at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
...
```

lauraneto

There is also this call in line 51:

await _distributedJobService.EnsureJobsAsync();

This call can also make the hosted service fail if the app restarts for some reason, which can happen without human interaction. We should look into handling this one as well.

nikolajlauridsen · 2025-12-12T11:44:00Z

Good point, I've updated it to just swallow the error now 😄
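
A minimal sketch of what swallowing the error around the startup call could look like, assuming the call is simply wrapped in a try-catch that logs. `EnsureJobsSketch`, `EnsureJobsSafelyAsync`, `ensureJobs` and `logError` are hypothetical names used for illustration; the actual change in the PR may differ.

```csharp
using System;
using System.Threading.Tasks;

public static class EnsureJobsSketch
{
    // ensureJobs stands in for _distributedJobService.EnsureJobsAsync();
    // logError stands in for the real logger.
    public static async Task EnsureJobsSafelyAsync(
        Func<Task> ensureJobs,
        Action<Exception> logError)
    {
        try
        {
            await ensureJobs();
        }
        catch (Exception ex)
        {
            // Swallow the failure so the hosted service can still start its loop.
            logError(ex);
        }
    }
}
```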

AndyButland · 2025-12-16T05:55:59Z

I'll merge this one in, as I can see the last issue raised in review has been handled.

Set exit code when exception is thrown

ae1591d

Copilot AI review requested due to automatic review settings December 9, 2025 12:50

Copilot started reviewing on behalf of nikolajlauridsen December 9, 2025 12:51

nikolajlauridsen changed the title ~~Set exit code when background job throws exception~~ Distribyted background jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Dec 9, 2025

nikolajlauridsen added area/backend release/17.1.0 type/bug and removed type/bug labels Dec 9, 2025

Copilot AI reviewed Dec 9, 2025

AndyButland reviewed Dec 10, 2025

lauraneto reviewed Dec 10, 2025

nikolajlauridsen added 2 commits December 10, 2025 14:34

Merge branch 'main' into v17/fix/stop-background-job-correctly

b649ec7

Catch any error instead so we don't stop the application

f58e211

nikolajlauridsen changed the title ~~Distributed Background Jobs: Set exit code when background job throws exception to ensure proper container orchestration handling~~ Distributed Background Jobs: Catch exceptions in job loop to improve application resilience Dec 10, 2025

AndyButland approved these changes Dec 10, 2025

lauraneto reviewed Dec 10, 2025

Handle exception when ensuring jobs

d0857ac

AndyButland merged commit 7982641 into main Dec 16, 2025
26 checks passed

AndyButland deleted the v17/fix/stop-background-job-correctly branch December 16, 2025 05:56

nikolajlauridsen mentioned this pull request Dec 16, 2025

V17 - Background exception gracefully stops app instead of a fail state #21083

Closed

nikolajlauridsen added release/17.1.0 and removed release/17.1.0 labels Dec 16, 2025

AndyButland merged 4 commits into main from v17/fix/stop-background-job-correctly
