
CancellationToken registration leak, regression #1810

Closed
tlecomte opened this issue Oct 20, 2022 · 34 comments · Fixed by #1818
Labels
💥 Regression: Issues that are regressions introduced from earlier PRs.

Comments

@tlecomte

tlecomte commented Oct 20, 2022

Describe the bug

After updating to SqlClient 5.0.1, we observe a memory leak in our application. Memory dumps show a huge number of CancellationTokenSource+CallbackNode entries (one dump contained 2 GB of them), referencing SqlCommand objects.

We believe this might be due to the changes in:

With the changes in these PRs, it seems the CancellationToken registration is not always disposed: several if blocks can return before the registration object is saved in the context, and exceptions can also be thrown before that happens.
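
To illustrate the suspected pattern (a minimal, hypothetical sketch; the names below are made up and are not SqlClient internals): if a method registers a cancellation callback and then returns early, or throws, before the registration has been stored somewhere that will dispose it, the callback node stays attached to the token for the token's lifetime and keeps the command reachable.

using System;
using System.Threading;
using System.Threading.Tasks;

static class RegistrationLeakSketch
{
    // Hypothetical illustration of the leak pattern, not actual SqlClient code.
    public static Task<bool> ReadSomethingAsync(object command, CancellationToken cancellationToken)
    {
        // The registration keeps the callback and its state (here: the command) reachable from the token.
        CancellationTokenRegistration registration =
            cancellationToken.Register(static state => { /* cancel the command */ }, command);

        if (DataAlreadyBuffered())
        {
            // Early return: the registration is never handed to the context that would
            // dispose it, so it lives (and roots the command) as long as the token does.
            return Task.FromResult(true);
        }

        // Slow path: ownership of the registration passes to the async continuation,
        // which disposes it when the read completes.
        return ReadSlowAsync(registration);
    }

    private static bool DataAlreadyBuffered() => true; // placeholder

    private static async Task<bool> ReadSlowAsync(CancellationTokenRegistration registration)
    {
        using (registration)
        {
            await Task.Yield();
            return false;
        }
    }
}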

To reproduce

using Microsoft.Data.SqlClient; // 5.0.1
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Program
{
    public static async Task Main()
    {
        const string query = "SELECT 1";

        // this could be the application lifetime cancellation
        // or another form of global shared cancellation
        using var longRunningCts = new CancellationTokenSource();

        var connectionString = new SqlConnectionStringBuilder
        {
            DataSource = "(local)",
            IntegratedSecurity = true,
            TrustServerCertificate = true,
        }.ToString();

        try
        {
            using (var linkedCts = CancellationTokenSource.CreateLinkedTokenSource(longRunningCts.Token))
            using (var connection = new SqlConnection(connectionString))
            {
                await connection.OpenAsync(linkedCts.Token);

                using (var command = new SqlCommand(query, connection))
                using (var reader = await command.ExecuteReaderAsync(linkedCts.Token))
                {
                    linkedCts.Cancel();

                    while (await reader.ReadAsync(linkedCts.Token))
                    {

                    }
                }
            }
        }
        catch (Exception)
        {
            // expected: cancelling the token above makes the read throw
        }

        // => attach a debugger here
        // linkedCts as well as all the SQL objects have been disposed above.
        // However, a memory snapshot shows that the CancellationTokenSource+Registrations
        // and CancellationTokenSource+CallbackNode entries are still there, including the
        // objects they reference, so they cannot be garbage collected.
    }
}

Expected behavior

No memory leak; CancellationTokenSource+CallbackNode entries should not be present in a dump after the SQL objects are disposed.

Further technical details

Microsoft.Data.SqlClient version: 5.0.1
.NET target: .NET 6.0.3
SQL Server version: SQL Server 2019
Operating system: Windows 2019

Additional context

@lcheunglci
Contributor

Hi @tlecomte, thanks for bringing this to our attention. We'll look into it soon.

lcheunglci added the 🆕 Triage Needed label on Oct 20, 2022
@tlecomte
Author

Thanks @lcheunglci.
FYI I edited the issue to add a code sample to reproduce the problem.

@Wraith2
Contributor

Wraith2 commented Oct 25, 2022

There's a fairly simple way to solve it: use a mutable ref struct to hold the registration, and let the context Take() it if needed. If the struct still has a value when it is disposed, you dispose the value; if not, the registration has been taken and is now owned by the context, which will clean it up.
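
A rough sketch of that idea (hypothetical names, not the code from the eventual PR):

using System.Threading;

// Holds a CancellationTokenRegistration until either the context takes ownership
// of it via Take(), or Dispose() runs and cleans it up on every other path.
internal ref struct RegistrationHolder
{
    private CancellationTokenRegistration _registration;
    private bool _hasValue;

    public RegistrationHolder(CancellationTokenRegistration registration)
    {
        _registration = registration;
        _hasValue = true;
    }

    // Called by the context when it wants to own (and later dispose) the registration.
    public CancellationTokenRegistration Take()
    {
        _hasValue = false;
        return _registration;
    }

    // If nothing took the registration (early return, exception), dispose it here.
    public void Dispose()
    {
        if (_hasValue)
        {
            _hasValue = false;
            _registration.Dispose();
        }
    }
}

Call sites would wrap the holder in a using block; the one path that hands the registration to the context calls Take(), and every other path (early returns, exceptions) lets Dispose release it.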

@Shane32

Shane32 commented Nov 1, 2022

I'm having this issue also. After moving to 5.0.1, my Linux server would leak memory rapidly until crashing. Analyzing a memory dump revealed millions of instances of SqlCommand rooted only by the cancellation token registration (CallbackNode). Even after just a couple of hours, I can confirm that reverting to 5.0.0 has solved the problem. For an in-depth analysis, including dump screenshots, system configuration, and other information, please see this issue:

[Image: Azure App Service - Linux - Memory Working Set - last 30 days]

[Image: Azure App Service - Linux - Memory Working Set - last 3 days]

I recommend marking 5.0.1 as unstable on NuGet for three reasons:

  1. This bug potentially affects anyone that has SQL code continuously or periodically executing within a BackgroundService, which is probably common enough.
  2. The bug is not necessarily obvious (if the memory leak is small), and is very difficult to track down and diagnose, and even harder to write a test for. (I spent probably 40 hours on this.)
  3. The severity of the bug is such that it can often result in the application crashing. Not only that, it can crash other applications within the same server/VM. (For me, it affected all web apps within the same Azure App Service Plan.)

@dmytro-gokun

Same here

@Wraith2
Contributor

Wraith2 commented Nov 1, 2022

It's not really a leak. A leak would be if unreferenced memory were piling up and could never be reclaimed. This is unwanted retention. It's also not how it used to work.

I do find it odd that you're seeing so big a problem. From my understanding of the change that caused it, the retention should only happen if you pass a CTS into a query execution that then returns before any query is required. Doesn't that mean that large quantities of your queries are pointless because they're not going to succeed?

@Shane32

Shane32 commented Nov 1, 2022

I have not been able to understand how the bug affects my code yet. The code snapshot was taken after about an hour of running. There were 400 retained instances of SqlCommand, yet there were about 4.7 million CallbackNode instances at the time of the snapshot (see here for more details). The cancellation token for these references has an application lifetime and is never signaled. The code runs periodically and executes a few queries, perhaps several. Any exceptions in any of this code would have been logged; as far as I know all the code works just fine. So I don't really see how 4 million of anything could possibly have been executed in that timeframe. Unless it was individual packets of data, I suppose, and even that would be a stretch. However, 400 SQL calls over an hour for this code sounds about right.

The rest of the codebase stems from ASP.NET Core and seems to be unaffected; perhaps ASP.NET Core releases the reference to the CancellationTokenSource after each request is finished. It's also possible that all the code is affected and I am simply interpreting the memory dump incorrectly.

Note that all my data access is through the linq2db data access layer. I do not use EF in this project, nor do I access SqlClient directly.

I'd be happy to help diagnose this further, if I can. So far I have not been able to create a small code sample that reproduces the bug.

@Wraith2
Contributor

Wraith2 commented Nov 1, 2022

You could see if you can reproduce the issue with the artifacts from the PR I opened to fix this, #1818. If you can, then I'll need to rework it; otherwise we just wait for it to be merged and released.

@JRahnama
Contributor

JRahnama commented Nov 1, 2022

You could see if you can reproduce the issue with the artifacts from the PR I opened to fix this, #1818. If you can, then I'll need to rework it; otherwise we just wait for it to be merged and released.

We are planning to release the next preview on Tuesday the 8th. If you can provide feedback before that date, we will be able to include this PR; otherwise it will go to next year, after the GA release of 5.1.

@ErikEJ
Contributor

ErikEJ commented Nov 1, 2022

@JRahnama any chance to get the TimeOnly DateOnly PR in before next preview then?

@JRahnama
Contributor

JRahnama commented Nov 1, 2022

@JRahnama any chance to get the TimeOnly DateOnly PR in before next preview then?

It is included in the preview2 milestone and will hopefully get merged for the next preview release, unless reviewers find some odd behavior/regression in the changes.

@Shane32

Shane32 commented Nov 2, 2022

You could see if you can reproduce the issue with the artifacts from the PR I opened to fix this, #1818. If you can, then I'll need to rework it; otherwise we just wait for it to be merged and released.

We are planning to release the next preview on Tuesday the 8th. If you can provide feedback before that date, we will be able to include this PR; otherwise it will go to next year, after the GA release of 5.1.

I installed the NuGet package and will report back tomorrow with a memory usage graph. I can verify that it is loaded in memory on my production server, as this code now returns "5.00-pull.55891" in the live environment:

    private string GetSqlClientVersion()
    {
        // pull a linq2db data context from DI
        using var scope = _serviceScopeFactory.CreateScope();
        var db = scope.ServiceProvider.GetRequiredService<HiveDataContext>();
        // create a DbConnection instance
        using var dbConnection = db.DataProvider.CreateConnection(db.ConnectionString!);
        // get the "ProductVersion" from the assembly
        var assembly = dbConnection.GetType().Assembly;
        var fileVersionInfo = FileVersionInfo.GetVersionInfo(assembly.Location);
        return fileVersionInfo.ProductVersion;
    }

@Shane32

Shane32 commented Nov 2, 2022

I believe I have tracked down the problem. Any call to ReadAsync when there is enough data already read in the buffer to return a result without awaiting will cause an "unwanted retention". This is due to the following code which short-circuits the async code to immediately return true if the next row can be read from the buffer:

https://github.com/dotnet/SqlClient/blob/main/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDataReader.cs#L4800

This explains how a small number of SQL calls can have such drastic consequences -- for example, this code might add a hundred or more "unwanted retentions", due to the high number of integer values that can be contained within a single data packet from the server:

await db.Products.Select(x => x.Id).ToListAsync(token);

As before, I would strongly urge you to consider deprecating 5.0.1 immediately for the reasons stated previously. Below is a repro sample of the code necessary to demonstrate the "unwanted retentions":

Sample:

Console.WriteLine($"SqlClient version: {FileVersionInfo.GetVersionInfo(typeof(SqlCommand).Assembly.Location).ProductVersion}");
var cts = new CancellationTokenSource();
var token = cts.Token;
while (true)
{
    using (var sqlConnection = new SqlConnection("Server=.\\SQLEXPRESS;Initial Catalog=master;Integrated Security=true;")) {
        await sqlConnection.OpenAsync(token);
        using var cmd = sqlConnection.CreateCommand();
        cmd.CommandText = "SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)";
        using var reader = await cmd.ExecuteReaderAsync(token);
        while (await reader.ReadAsync(token)) {
            // ReadAsync executes 10 times for the 10 data values and one final time to notify the caller there is no more data
            // Only the first call to ReadAsync actually awaits; the buffer contains data for the other 10 calls
        }
    }
    GC.Collect(); // for demonstration only; removing this line does not change output
    Console.WriteLine($"Lost registrations: {CountLostSqlCommands(token)}");
}
        
static int CountLostSqlCommands(CancellationToken token)
{
    var cts = (CancellationTokenSource)typeof(CancellationToken).GetField("_source", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!.GetValue(token)!;
    var registrationsField = typeof(CancellationTokenSource).GetField("_registrations", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
    var registrations = registrationsField.GetValue(cts);
    if (registrations == null)
        return 0;
    var callbacksProperty = registrationsField.FieldType.GetField("Callbacks", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    var callbackNode = callbacksProperty.GetValue(registrations);
    var callbackNodeType = callbacksProperty.FieldType;
    var nextProperty = callbackNodeType.GetField("Next", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    var stateProperty = callbackNodeType.GetField("CallbackState", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    int lostSqlCommands = 0;
    while (callbackNode != null) {
        var callbackState = stateProperty.GetValue(callbackNode);
        if (typeof(SqlCommand) == callbackState?.GetType()) {
            lostSqlCommands += 1;
        }
        callbackNode = nextProperty.GetValue(callbackNode);
    }
    return lostSqlCommands;
}

Which produces the following:

SqlClient version: 5.0.1
Lost registrations: 10
Lost registrations: 20
Lost registrations: 30
Lost registrations: 40
Lost registrations: 50
Lost registrations: 60
Lost registrations: 70
Lost registrations: 80
Lost registrations: 90
Lost registrations: 100
Lost registrations: 110
Lost registrations: 120
Lost registrations: 130
Lost registrations: 140
Lost registrations: 150
Lost registrations: 160
Lost registrations: 170
Lost registrations: 180
Lost registrations: 190
Lost registrations: 200
Lost registrations: 210
...

With version 5.0.0:

SqlClient version: 5.0.0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
...

With version 5.0.0-pull.55891:

SqlClient version: 5.00-pull.55891
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
Lost registrations: 0
...

As promised, here is the memory usage graph for 5.0.0-pull.55891:

[Image: memory usage graph for 5.0.0-pull.55891]

PR #1818 seems to solve the problem, although I question the logic of repeatedly registering a cancellation callback only to immediately remove it (the logic introduced in #1781). It would seem that the cancellation callback should only be registered once the short-circuit logic has been evaluated. However, I have not done a full review of #1781 and #1065.

@lcheunglci
Contributor

Thanks @Shane32 for doing the memory analysis. Hopefully, we'll get PR #1818 merged very soon.

@Wraith2
Contributor

Wraith2 commented Nov 2, 2022

So the fix is confirmed good. I'll investigate the position of the registration to see if it can be moved without regressing the fix that moved it up. The order of operations is clearly important for catching some timing-related behaviours, so while I agree it would be nice to put it as low as possible, that may not be practical.

@Shane32

Shane32 commented Nov 2, 2022

Thanks @Wraith2! Perhaps it would be wise to merge #1818 (and a similar PR for 5.0.x) as-is, so as not to hold up getting a fix out for this critical issue. Then the position of the registration could be moved as a performance enhancement in a later PR, if that is practical. I'll leave that to you. My only suggestion would be to add tests similar to my sample above, to confirm that there are no unwanted retentions after calling ReadAsync in various scenarios -- something like the sketch below. Please let me know if I can be of further assistance.
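
For example, something along these lines (a rough xUnit-style sketch only; the connection string and the reflection over .NET 6 internals are assumptions carried over from my earlier comment, and the repo's real test utilities may differ):

using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Xunit;

public class CancellationRegistrationTests
{
    // Assumed connection string; the real test configuration would supply this.
    private const string ConnectionString =
        "Server=(local);Integrated Security=true;TrustServerCertificate=true;";

    [Fact]
    public async Task ReadAsync_WithBufferedRows_DoesNotLeaveRegistrationsBehind()
    {
        using var cts = new CancellationTokenSource();

        using (var connection = new SqlConnection(ConnectionString))
        {
            await connection.OpenAsync(cts.Token);
            using var command = connection.CreateCommand();
            command.CommandText = "SELECT n FROM (VALUES (0),(1),(2),(3),(4)) v(n)";
            using var reader = await command.ExecuteReaderAsync(cts.Token);
            while (await reader.ReadAsync(cts.Token)) { }
        }

        // No SqlCommand should still be rooted by the token's callback list.
        Assert.Equal(0, CountLostSqlCommands(cts.Token));
    }

    // Same technique as the CountLostSqlCommands helper posted above:
    // walks the token's CallbackNode list via reflection (test-only code,
    // tied to .NET 6 internals).
    private static int CountLostSqlCommands(CancellationToken token)
    {
        var flags = System.Reflection.BindingFlags.Instance |
                    System.Reflection.BindingFlags.Public |
                    System.Reflection.BindingFlags.NonPublic;
        var cts = (CancellationTokenSource)typeof(CancellationToken)
            .GetField("_source", flags)!.GetValue(token)!;
        var registrationsField = typeof(CancellationTokenSource).GetField("_registrations", flags)!;
        var registrations = registrationsField.GetValue(cts);
        if (registrations == null)
            return 0;
        var callbacksField = registrationsField.FieldType.GetField("Callbacks", flags)!;
        var nextField = callbacksField.FieldType.GetField("Next", flags)!;
        var stateField = callbacksField.FieldType.GetField("CallbackState", flags)!;
        int lost = 0;
        for (var node = callbacksField.GetValue(registrations); node != null; node = nextField.GetValue(node))
        {
            if (stateField.GetValue(node) is SqlCommand)
                lost++;
        }
        return lost;
    }
}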

@Shane32

Shane32 commented Nov 3, 2022

Is this going to be released in 5.0.2? I do not see a matching PR for the 5.0 servicing branch. I can post a PR if desired.

@JRahnama
Contributor

JRahnama commented Nov 3, 2022

It will be in 5.1.0-preview2, which will be released sometime next week; the GA of 5.1.0 will be released near the end of the year.

@Shane32

Shane32 commented Nov 3, 2022

So the plan is to leave 5.0.1 as-is with a memory leak (unwanted retention) that occurs for almost every call to ReadAsync, affecting probably every single user of this version of the library, and some critically so?

@Tornhoof

Tornhoof commented Nov 8, 2022

May I suggest releasing a 5.0.2 anyway?
The new EF 7 SQL Server package depends on 5.0.1 now.
see Microsoft.EntityFrameworkCore.SqlServer

You might want to reconsider ;)

AFAIK NuGet automatically resolves the lowest version of a package that fits the constraints, so until a new EF Core SqlServer package is released that references a new version of the client package (be it 5.0.2 or 5.1), everyone needs to manually reference the fixed package.

@Wraith2
Contributor

Wraith2 commented Nov 9, 2022

Sounds like a good idea. We'll have to make sure the EF Core team are aware that they might get some problems with the current version, and I'm sure they'll pick up the later version as soon as it's available.

/cc @ajcvickers @roji

@roji
Member

roji commented Nov 9, 2022

Yep, we're very aware. I think it makes sense for an EF Core patch release to take a dependency on SqlClient 5.0.2, when that's released - we'll be discussing that.

@paalandreassen

We are seeing AKS Kubernetes pods crashing with OutOfMemory exceptions after upgrading to 5.0.1. Memory usage seems to increase by 10-20 MB for each processing run that involves EF Core. You really can't leave this bug unfixed until 5.1.0 is out. You know the reason and how to fix it, so please coordinate with the EF Core team and release both 5.0.2 and an updated EF 7 package that requires 5.0.2 at the same time. Thank you.

@ErikEJ
Contributor

ErikEJ commented Nov 22, 2022

@paalandreassen You can use 5.1 preview 2 today and add an explicit reference to it in your EF Core project.

@paalandreassen

@paalandreassen You can use 5.1 preview 2 today and add an explicit reference to it in your EF Core project.

Is 5.1 preview 2 production ready? I really try to avoid preview releases in production.

@ErikEJ
Contributor

ErikEJ commented Nov 22, 2022

Yes, it looks like what is in now will be in 5.1: https://github.com/dotnet/SqlClient/projects/11

@stevendarby

Any decision about a 5.0.2?

@ErikEJ
Contributor

ErikEJ commented Dec 18, 2022

@stevendarby 5.1 preview is out today, and 5.1 will appear at the beginning of next month.

@stevendarby

I know, but as per the above discussion, 5.0.2 would seem like a really good idea, and I wanted to know whether a decision had been made about whether that was going to happen.

@CodeAngry

Just putting this here as it's a test that seems to work.

  • in 5.0.1, referenced == references.Count, so none of the connections are ever collected.
  • in 5.0.0, referenced is usually 1 and sometimes 0.

using Microsoft.Data.SqlClient;
using System.Diagnostics;

const string connectionString = "Server=lpc:localhost;" +
	"Integrated Security=True;Encrypt=False;" +
	"MultipleActiveResultSets=False;Pooling=False;";

using CancellationTokenSource cts = new();
cts.CancelAfter(TimeSpan.FromMinutes(1));
CancellationToken ct = cts.Token;

List<WeakReference<SqlConnection>> references = new(100);
for (int it = 0; it < 100; ++it)
{
	using SqlConnection connection = new(connectionString);
	references.Add(new(connection));
	await connection.OpenAsync(ct);

	using SqlCommand command = connection.CreateCommand();
	command.CommandText = "select cast(1 as bit);";

	using SqlDataReader reader = await command.ExecuteReaderAsync(ct);
	if (await reader.ReadAsync(ct))
	{
		_ = reader.GetBoolean(0);
	}
}

SqlConnection.ClearAllPools();
GC.Collect();

int referenced = references.Where(i => i.TryGetTarget(out _)).Count();
Console.WriteLine("References: {0:N0} of {1:N0}", referenced, references.Count);
Debug.Assert(referenced != references.Count);

@Shane32

Shane32 commented Dec 27, 2022

FYI @CodeAngry, the bug occurs when SqlClient already has the data buffered for the call to ReadAsync. So for a proper test, you'd want to call ReadAsync at least twice; the bug triggers on the second call. In your example, just reading a second time (which should return false) should be enough -- see the sketch below.

You can also use reflection to determine the exact number of lost registrations without relying on GC behavior. See code snippet at #1810 (comment)

Perhaps your code is triggering the bug in a way I'm not aware of, however.
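
Concretely, the change to your sample would look something like this (a sketch of just the modified portion of the loop above):

	using SqlDataReader reader = await command.ExecuteReaderAsync(ct);
	if (await reader.ReadAsync(ct))   // first call: actually awaits the network read
	{
		_ = reader.GetBoolean(0);
	}
	_ = await reader.ReadAsync(ct);   // second call: returns false from the buffered data;
	                                  // on 5.0.1 this is the path that leaves a registration behind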

@CodeAngry

@Shane32 My StackOverflow post (mentioned right above) has an infinite loop that triggers it. One ReadAsync() per DataReader is enough. Nice code, thanks!

@aarondandy

I think many out there (like me) who took way too long to discover this as the root cause are going to really appreciate upgrading to 5.1.0 when it is released.
