Distributed Background Jobs: Catch exceptions in job loop to improve application resilience #21099

Merged
AndyButland merged 4 commits into main from v17/fix/stop-background-job-correctly
Dec 16, 2025

Conversation

@nikolajlauridsen nikolajlauridsen commented Dec 9, 2025

Summary

  • Catches exceptions thrown during distributed background job execution instead of letting them crash the application
  • Logs the exception and continues processing the next job iteration
  • Preserves OperationCanceledException behavior to allow graceful shutdown

Problem

When a BackgroundService throws an exception (e.g., database lock timeout), the ASP.NET Core host catches the exception, logs it, and exits gracefully. This is problematic for applications where continuous uptime is critical, as transient errors like network timeouts or temporary database issues would bring down the entire application.

Solution

Instead of letting exceptions propagate up and crash the application, the job loop now catches exceptions and logs them, allowing the background service to continue running. This helps meet uptime SLAs by keeping the application running even when individual job iterations fail.

Key changes:

  • Added try-catch inside the job execution loop
  • OperationCanceledException is re-thrown to allow normal graceful shutdown via cancellation token
  • All other exceptions are logged and the loop continues to the next iteration
  • Removed the previous approach of setting Environment.ExitCode = 1 before re-throwing
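
The loop shape described by the key changes can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual code in DistributedBackgroundJobHostedService: the `ResilientJobLoop` class, its `RunAsync` method, and the `runJob` / `logError` delegates are hypothetical names standing in for the real job runner and logger, and the use of `PeriodicTimer` is an assumption about how the ticks are driven.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not the Umbraco implementation.
public static class ResilientJobLoop
{
    public static async Task RunAsync(
        Func<Task> runJob,            // stand-in for the real job runner
        Action<Exception> logError,   // stand-in for an ILogger.LogError call
        TimeSpan period,
        CancellationToken stoppingToken)
    {
        using var timer = new PeriodicTimer(period);

        // WaitForNextTickAsync throws OperationCanceledException once the
        // token is cancelled, which ends the loop during shutdown.
        while (await timer.WaitForNextTickAsync(stoppingToken))
        {
            try
            {
                await runJob();
            }
            catch (OperationCanceledException)
            {
                // Cancellation means the host is shutting down: re-throw.
                throw;
            }
            catch (Exception ex)
            {
                // Any other failure is logged; the loop continues on the next tick.
                logError(ex);
            }
        }
    }
}
```

The ordering of the catch blocks matters here: the more specific OperationCanceledException handler has to come first, otherwise the general handler would swallow cancellation and delay shutdown.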

Test Plan

  1. Temporarily modify DistributedBackgroundJobHostedService.RunRunnableJob() to throw an exception:

    private async Task RunRunnableJob()
    {
        throw new Exception("Test exception for resilience verification");
        // ... rest of the method
    }
  2. Run the application:

    dotnet run --project src/Umbraco.Web.UI
  3. Verify:

    • The exception is logged with LogError
    • The application continues running (doesn't crash)
    • The background job loop continues processing on the next timer tick

Related Issues

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings December 9, 2025 12:50
@nikolajlauridsen nikolajlauridsen changed the title Set exit code when background job throws exception Distribyted background jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Dec 9, 2025
@nikolajlauridsen nikolajlauridsen changed the title Distribyted background jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Distributed background jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Dec 9, 2025
@nikolajlauridsen nikolajlauridsen changed the title Distributed background jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Distributed Background Jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Dec 9, 2025

Copilot AI left a comment

Pull request overview

This PR addresses a critical container orchestration issue where unhandled exceptions in the DistributedBackgroundJobHostedService cause the application to exit with a success code (0), preventing orchestrators like Kubernetes and Docker Swarm from detecting failures and triggering restart policies. The solution explicitly sets Environment.ExitCode = 1 before re-throwing exceptions in the background service's ExecuteAsync method.

Key changes:

  • Added exception handler to set non-zero exit code before propagating unhandled exceptions
  • Removed unused System.Diagnostics import
  • Removed empty XML documentation comments for constructor parameters

@AndyButland AndyButland left a comment

Can confirm the test steps work as described and the solution looks sensible to me. @lauraneto is also mid-way through looking at this so will leave to her to approve or raise any questions.

@lauraneto lauraneto left a comment

@nikolajlauridsen I see that this happens when there are errors in the distributed job service, like failing to acquire a lock.
While this ensures that a non-successful exit code is set in case of error, I feel this addresses a symptom rather than the root cause.

Ideally the application shouldn't crash or reach this state at all during a failure like this.
I think we should instead be looking into how we can make the service more resilient, and what we want the behavior to be in case an error still happens... Is the application crashing expected behavior?

nikolajlauridsen commented Dec 10, 2025

I've made a separate PR to address some issues in the distributed background service to harden it: #21100.

If a specific implementation of a distributed background job fails, it doesn't crash the app—that's caught in RunRunnableJob. The app will only crash in case of a total failure of the DistributedBackgroundJobHostedService. This is the expected behaviour; from dotnet, at least, if a BackgroundService stops, it stops the application.

I agree that the application shouldn't reach this state, but I'm a bit hesitant to continue and ignore the error if it comes up. That way, distributed background jobs might be entirely broken, and you wouldn't know, because we'd just keep restarting the RunRunnableJob loop. So I think it's better to fail fast and fix any locking issues, but I'm open to other approaches.

@nikolajlauridsen

After some discussion, we've decided to swallow the error instead, so the app doesn't crash and we avoid downtime.

This is a better approach, thanks @lauraneto 🙌

@nikolajlauridsen nikolajlauridsen changed the title Distributed Background Jobs: Set exit code when background job throws exception to ensure proper container orchestration handling Distributed Background Jobs: Catch exceptions in job loop to improve application resilience Dec 10, 2025

@AndyButland AndyButland left a comment

Can confirm this "logs and continues" now:

[15:21:04 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
   at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
[15:21:09 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
   at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
...

@lauraneto lauraneto left a comment

There is also this call in line 51:

await _distributedJobService.EnsureJobsAsync();

This call can also make the hosted service fail if the app restarts for some reason, which can happen without human interaction. We should look into handling this one as well.

@nikolajlauridsen

Good point, I've updated it to just swallow the error now 😄

@AndyButland

I'll merge this one in, as I can see the last issue raised in review has been handled.

@AndyButland AndyButland merged commit 7982641 into main Dec 16, 2025
26 checks passed
@AndyButland AndyButland deleted the v17/fix/stop-background-job-correctly branch December 16, 2025 05:56

Development

Successfully merging this pull request may close these issues.

V17 - Background exception gracefully stops app instead of a fail state

3 participants