[BUG] EventHubListener causes message loss during shutdown #41784

@yfujiwara-sansan

Library name and version

Microsoft.Azure.WebJobs.Extensions.EventHubs 6.0.2 and 5.5.0

Describe the bug

EventHubListener.PartitionProcessor advances the checkpoint even when the application is shutting down (for example, because of a configuration change or a scaling operation). This causes message loss.
Note that this issue occurs only occasionally.

My colleague and I believe that this issue was fixed by #36432 but reintroduced by #38067, based on the investigation results below.

Our investigation results

In this situation, _functionExecutionToken.IsCancellationRequested had already become true, but linkedCts.IsCancellationRequested had not yet become true. As a result, the checkpoint was advanced even though the application execution had been cancelled.

From the LinkedCancellationTokenSource source code, we found the following facts:

  • LinkedCancellationTokenSource (linkedCts) is cancelled via a callback registered on the "linked" cancellation token (that is, on the source of _functionExecutionToken).
  • The callback chain is invoked in LIFO order.
  • The same callback mechanism is used in many other places, such as the Task.Delay(CancellationToken) implementation.
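
The LIFO ordering can be checked in isolation. The following is a minimal sketch, not code from the extension: the class name CallbackOrderDemo and the labels "first", "second", and "third" are made up for illustration. It registers three callbacks on one token and records the order in which Cancel() runs them.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical demo class; not part of Microsoft.Azure.WebJobs.Extensions.EventHubs.
public static class CallbackOrderDemo
{
    // Registers three callbacks on one token and returns the order
    // in which Cancel() invoked them.
    public static List<string> Run()
    {
        var order = new List<string>();
        using var cts = new CancellationTokenSource();

        cts.Token.Register(() => order.Add("first"));
        cts.Token.Register(() => order.Add("second"));
        cts.Token.Register(() => order.Add("third"));

        // Cancel() runs the registered callbacks synchronously on this thread.
        cts.Cancel();
        return order;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", Run()));
    }
}
```

If the LIFO behavior described above holds, the callback registered last runs first, so the printed order is third, second, first. The ordering is an implementation detail of CancellationTokenSource rather than a documented guarantee, as far as we know.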

By watching the call stack during checkpointing, we observed the following sequence:

  1. WebHost calls the listener's StopAsync().
  2. The listener calls CancellationTokenSource.Cancel() on the source of _functionExecutionToken.
  3. Cancellation reaches the application code as an async / await continuation: the awaits in the function runtime complete as part of executing the registered callbacks. Because callbacks are invoked in LIFO order as described above, this callback runs before linkedCts.IsCancellationRequested is set to true. So, the checkpoint is advanced because linkedCts.IsCancellationRequested is not yet true.
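
The window in step 3 can be reproduced without Event Hubs. This is a rough sketch under two assumptions: the linked source is created before the function's own callback is registered, and an explicit Register call stands in for the callback that an awaited operation such as Task.Delay would register. The class name LinkedTokenOrderingDemo is hypothetical.

```csharp
using System;
using System.Threading;

// Hypothetical demo class; it only mimics the ordering, not the listener itself.
public static class LinkedTokenOrderingDemo
{
    public static (bool SourceInCallback, bool LinkedInCallback, bool LinkedAfterCancel) Observe()
    {
        using var source = new CancellationTokenSource();

        // Created first, so its internal "cancel me too" callback is registered first.
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(source.Token);

        bool sourceInCallback = false;
        bool linkedInCallback = false;

        // Registered later: stands in for the function's awaited operation.
        source.Token.Register(() =>
        {
            sourceInCallback = source.IsCancellationRequested;
            linkedInCallback = linkedCts.IsCancellationRequested;
        });

        source.Cancel();
        return (sourceInCallback, linkedInCallback, linkedCts.IsCancellationRequested);
    }

    public static void Main()
    {
        var result = Observe();
        Console.WriteLine($"source in callback: {result.SourceInCallback}");
        Console.WriteLine($"linked in callback: {result.LinkedInCallback}");
        Console.WriteLine($"linked after Cancel: {result.LinkedAfterCancel}");
    }
}
```

With LIFO invocation, the later callback sees the source as cancelled while the linked source is not yet cancelled. The linked source becomes cancelled only once Cancel() reaches its older callback.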

Expected behavior

The checkpoint is never advanced while the application process is shutting down (that is, once _functionExecutionToken is cancelled).
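
One way to express this expectation is to consult the function execution token directly before checkpointing, instead of relying only on the linked token. The sketch below is hypothetical and is not the extension's actual code or a proposed patch: CheckpointGuardSketch and ShouldCheckpoint are names invented for illustration, and an explicit Register call stands in for the function's continuation.

```csharp
using System;
using System.Threading;

// Hypothetical sketch; names do not exist in the extension.
public static class CheckpointGuardSketch
{
    // Allows a checkpoint only if neither token has been cancelled.
    public static bool ShouldCheckpoint(
        CancellationToken functionExecutionToken,
        CancellationToken linkedToken)
        => !functionExecutionToken.IsCancellationRequested
           && !linkedToken.IsCancellationRequested;

    // Compares a linked-token-only check with the guard, evaluated inside
    // a callback that runs while Cancel() is still in progress.
    public static (bool LinkedOnlyAllows, bool GuardAllows) Observe()
    {
        using var source = new CancellationTokenSource();
        using var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(source.Token);

        bool linkedOnlyAllows = false;
        bool guardAllows = false;

        source.Token.Register(() =>
        {
            linkedOnlyAllows = !linkedCts.Token.IsCancellationRequested;
            guardAllows = ShouldCheckpoint(source.Token, linkedCts.Token);
        });

        source.Cancel();
        return (linkedOnlyAllows, guardAllows);
    }

    public static void Main()
    {
        var result = Observe();
        Console.WriteLine($"linked-only check allows checkpoint: {result.LinkedOnlyAllows}");
        Console.WriteLine($"guard allows checkpoint: {result.GuardAllows}");
    }
}
```

In this sketch, the linked-only check still allows the checkpoint during the window described in our investigation, while the guard rejects it because the source token already reports cancellation.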

Actual behavior

The checkpoint is occasionally advanced.

Reproduction Steps

  1. Run an Event Hubs trigger locally with the function shown after these steps.
  2. After the Receive method has started, press Ctrl + C to shut down the process.
  3. Observe that the checkpoint is advanced. You can inspect the states of linkedCts and _functionExecutionToken by setting a breakpoint in the checkpointing code.
[FunctionName("Receive")]
[ExponentialBackoffRetry(5, "00:00:10", "00:10:00")]
public async Task Receive(
    [EventHubTrigger("%EventHubName%", Connection = "EventHubConnectionString")] EventData[] events,
    CancellationToken cancellationToken)
{
    await Task.Delay(TimeSpan.FromMinutes(3), cancellationToken);
}

Environment

  • Platform: Windows (Functions runtime v3 and v4, Azure App Service)
    • We reproduced the issue in a local environment, but message loss also occurred multiple times in our production Azure environment.

Labels

  • Client: This issue is related to a non-management package
  • Event Hubs
  • customer-reported: Issues that are reported by GitHub users external to the Azure organization.
  • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
  • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that