Distributed Background Jobs: Catch exceptions in job loop to improve application resilience #21099
Pull request overview
This PR addresses a critical container orchestration issue where unhandled exceptions in the DistributedBackgroundJobHostedService cause the application to exit with a success code (0), preventing orchestrators like Kubernetes and Docker Swarm from detecting failures and triggering restart policies. The solution explicitly sets Environment.ExitCode = 1 before re-throwing exceptions in the background service's ExecuteAsync method.
Key changes:
- Added exception handler to set non-zero exit code before propagating unhandled exceptions
- Removed unused System.Diagnostics import
- Removed empty XML documentation comments for constructor parameters
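For readers following along, the exit-code approach described in the overview can be sketched roughly as below. This is a minimal illustration, not the PR's actual code: the class name ExampleJobHostedService, the RunLoopAsync placeholder, the 5-second delay, and the cancellation filter on the catch are all assumptions added for the example.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;

// Hypothetical service; only the "set ExitCode, then re-throw" pattern
// comes from the overview above.
public sealed class ExampleJobHostedService : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        try
        {
            await RunLoopAsync(stoppingToken);
        }
        // Assumption: skip the handler during normal shutdown so a
        // cancellation does not mark the process as failed.
        catch (Exception) when (!stoppingToken.IsCancellationRequested)
        {
            // Make the process report failure so an orchestrator can apply
            // its restart policy, then let the exception reach the host.
            Environment.ExitCode = 1;
            throw;
        }
    }

    // Placeholder for the real job loop (assumption).
    private static async Task RunLoopAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
        }
    }
}
```

Note that this variant still stops the application on failure; it only changes the exit code the process reports.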
AndyButland left a comment
Can confirm the test steps work as described and the solution looks sensible to me. @lauraneto is also mid-way through looking at this so will leave to her to approve or raise any questions.
lauraneto left a comment
@nikolajlauridsen I see that this happens when there are errors in the distributed job service, like failing to acquire a lock.
While this ensures that a non-successful status code is set in case of error, I feel like this is addressing a symptom instead of the root cause.
Ideally the application shouldn't crash or reach this state at all during a failure like this.
I think we should instead be looking into how we can make the service more resilient, and what we want the behavior to be in case an error still happens... Is the application crashing expected behavior?
I've made a separate PR to address some issues in the distributed background service to harden it: #21100. If a specific implementation of a distributed background job fails, it doesn't crash the app; that's caught in RunRunnableJob. The app will only crash in case of a total failure of the DistributedBackgroundJobHostedService. This is the expected behaviour; from dotnet, at least, if a BackgroundService stops, it stops the application. I agree that the application shouldn't reach this state, but I'm a bit hesitant to continue and ignore the error if it comes up. That way, distributed background jobs might be entirely broken, and you wouldn't know, because we'd just keep restarting the RunRunnableJob loop. So I think it's better to fail fast and fix any locking issues, but I'm open to other approaches.
After some discussion, we've decided instead to swallow the error, to stop the app from crashing and avoid downtime. This is a better approach, thanks @lauraneto 🙌
Can confirm this "logs and continues" now:
[15:21:04 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
[15:21:09 ERR] An exception occurred while attempting to run a distributed background job.
System.ApplicationException: Test exception logging from DistributedBackgroundJobHostedService.
at Umbraco.Cms.Infrastructure.BackgroundJobs.DistributedBackgroundJobHostedService.ExecuteAsync(CancellationToken stoppingToken) in C:\Repos\Umbraco\Umbraco.Cms-V16\src\Umbraco.Infrastructure\BackgroundJobs\DistributedBackgroundJobHostedService.cs:line 61
...
lauraneto left a comment
There is also this call in line 51:
await _distributedJobService.EnsureJobsAsync();
This can also make the hosted service fail if the app restarts for some reason, which can happen without human interaction. We should also look into handling this one.
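One possible way to guard a startup call like this is sketched below. It is an illustration under assumptions, not the change that was merged: the IExampleJobService interface, the StartupGuard helper, and the log message are hypothetical, and the real EnsureJobsAsync signature may differ.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Extensions.Logging;

// Hypothetical stand-in for the real distributed job service (assumption).
public interface IExampleJobService
{
    Task EnsureJobsAsync();
}

public static class StartupGuard
{
    // Runs the startup call and logs any failure instead of letting it
    // escape and stop the hosted service. Returns false when the call failed.
    public static async Task<bool> TryEnsureJobsAsync(
        IExampleJobService jobService,
        ILogger logger)
    {
        try
        {
            await jobService.EnsureJobsAsync();
            return true;
        }
        catch (Exception exception)
        {
            // Hypothetical message; the wording in the actual PR is not shown here.
            logger.LogError(exception, "An exception occurred while ensuring distributed background jobs.");
            return false;
        }
    }
}
```

Returning a flag rather than swallowing silently leaves the caller free to retry or skip the job loop; whether that matters here depends on what EnsureJobsAsync actually sets up.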
Good point, I've updated it to just swallow the error now 😄
I'll merge this one in, as I can see the last issue raised in review has been handled.
Summary
- OperationCanceledException behavior to allow graceful shutdown
Problem
When a BackgroundService throws an exception (e.g., database lock timeout), the ASP.NET Core host catches the exception, logs it, and exits gracefully. This is problematic for applications where continuous uptime is critical, as transient errors like network timeouts or temporary database issues would bring down the entire application.
Solution
Instead of letting exceptions propagate up and crash the application, the job loop now catches exceptions and logs them, allowing the background service to continue running. This improves SLAs by keeping the application running even when individual job iterations fail.
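A rough sketch of the catch-and-log loop described above follows. It is an assumption-laden illustration rather than the merged implementation: the class name ResilientJobHostedService, the RunJobAsync placeholder, and the 5-second period are hypothetical; the log message is taken from the log output quoted in the conversation.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;

// Hypothetical service illustrating "log and continue" with graceful shutdown.
public sealed class ResilientJobHostedService : BackgroundService
{
    private static readonly TimeSpan Period = TimeSpan.FromSeconds(5); // assumed interval
    private readonly ILogger<ResilientJobHostedService> _logger;

    public ResilientJobHostedService(ILogger<ResilientJobHostedService> logger)
        => _logger = logger;

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await RunJobAsync(stoppingToken);
            }
            catch (OperationCanceledException) when (stoppingToken.IsCancellationRequested)
            {
                // Shutdown was requested: re-throw so the host can stop normally.
                throw;
            }
            catch (Exception exception)
            {
                // Any other failure is logged and the loop keeps running.
                _logger.LogError(exception, "An exception occurred while attempting to run a distributed background job.");
            }

            // Wait outside the try block so a failing iteration cannot spin in a tight loop.
            await Task.Delay(Period, stoppingToken);
        }
    }

    // Placeholder for one iteration of real work (assumption).
    private static Task RunJobAsync(CancellationToken stoppingToken) => Task.CompletedTask;
}
```

The trade-off raised in review still applies: with this shape a persistent failure shows up only in the logs, so error-level log monitoring becomes the way to notice it.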
Key changes:
- OperationCanceledException is re-thrown to allow normal graceful shutdown via cancellation token
- Environment.ExitCode = 1 before re-throwing
Test Plan
Temporarily modify DistributedBackgroundJobHostedService.RunRunnableJob() to throw an exception:
Run the application:
Verify:
- LogError
Related Issues
🤖 Generated with Claude Code