-
Thanks for the writeup, it's great to see this work happening. Here are a couple of thoughts. First, in the Open flow I'm seeing "async acquire semaphore" before anything else happens, including even getting an idle connection. This seems like it would be a point of contention, and I'm wondering why this step is necessary, especially before idle connection acquisition; could it at least be moved to cover only the physical connection open? Second (and probably related): in offline discussions I think I understood that a SqlClient goal is to throttle connections to the server (SQL Server or Azure SQL), as a way of avoiding overloading the server with too many concurrent physical connection attempts. If this is right, I have some comments on it, though I'm guessing it's a hard requirement from the server side that you have little control over.
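To illustrate that first point, here is a minimal sketch of checking the idle pool before ever touching the semaphore, so that only physical opens contend on it. Every type and member name here (PoolSketch, PhysicalConnection, the fields) is an illustrative stand-in, not an actual SqlClient type:

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Illustrative stand-ins only; none of these names come from SqlClient.
internal sealed class PhysicalConnection
{
    public static Task<PhysicalConnection> OpenAsync(CancellationToken ct) =>
        Task.FromResult(new PhysicalConnection()); // placeholder for the real physical open
}

internal sealed class PoolSketch
{
    private readonly ConcurrentQueue<PhysicalConnection> _idleConnections = new();
    private readonly SemaphoreSlim _createSemaphore = new(initialCount: 10, maxCount: 10);

    public async ValueTask<PhysicalConnection> GetConnectionAsync(CancellationToken ct)
    {
        // Fast path: handing out an idle connection involves no semaphore at all.
        if (_idleConnections.TryDequeue(out var idle))
        {
            return idle;
        }

        // Slow path: only new physical opens contend on the throttling semaphore.
        await _createSemaphore.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            return await PhysicalConnection.OpenAsync(ct).ConfigureAwait(false);
        }
        finally
        {
            _createSemaphore.Release();
        }
    }
}
```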
One last thought... If client-side throttling is a must-have, then I'd at least consider implementing it via a pluggable policy mechanism such as Polly. Polly is a major resilience/fault-tolerance framework that also provides circuit breaker and throttling functionality. Allowing the connection pool to be configured with a Polly policy would let users specify rich behaviors for both retrying and throttling at the same time, and would externalize the problem (so you wouldn't have to implement it). Also, while the default could still be a throttling policy, someone could disable that by passing in a non-throttling policy. The downside, of course, would be the added dependency on Polly, though that could be split out into a separate package for the advanced users who need it. Anyway, happy to discuss all this further if needed!
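As a rough sketch of what that could look like: the snippet below composes a Polly v7 retry policy with a bulkhead (throttling) policy. Only the Polly calls themselves are real; the idea that the pool would accept such a policy, and the OpenPhysicalConnectionAsync hook in the comment, are hypothetical:

```csharp
using System;
using Polly;
using Polly.Bulkhead;
using Polly.Retry;
using Polly.Wrap;

// Retry transient failures with a small backoff; in practice this would filter
// to transient SqlException numbers rather than all exceptions.
AsyncRetryPolicy retry = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(3, attempt => TimeSpan.FromMilliseconds(100 * attempt));

// Bulkhead acts as the client-side throttle: at most 20 concurrent opens, 200 queued.
AsyncBulkheadPolicy throttle = Policy.BulkheadAsync(20, 200);

// Retry wraps throttling; passing a non-throttling policy here instead would
// effectively disable client-side throttling.
AsyncPolicyWrap openPolicy = Policy.WrapAsync(retry, throttle);

// Hypothetical usage inside the pool's open path:
// await openPolicy.ExecuteAsync(ct => OpenPhysicalConnectionAsync(ct), cancellationToken);
```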
-
Thanks for this @mdaigle. I've not got anything to add to the high-level design - just extra context and a few comments around the semaphore and cleanup logic.
-
Lots of great discussion in here. I want to summarize what we have so far and split up topics into designated issues to keep things digestible. I will capture the outcomes of our discussions so far on these issues so that we can come to decisions on them. They will also be the best place going forward to discuss their respective topics. New topics not covered by these issues can of course be added to this discussion.
Please let me know if I missed anything!
-
Has #343 been discussed here?
-
Problem Statement
Modern .NET applications interact with Azure SQL instances that offer robust load balancing, scaling, and reliability. However, the current connection pool design does not take advantage of these properties and artificially limits connection creation throughput. Additionally, the async open path suffers from extremely poor performance. Customers want higher connection creation throughput so that their applications can scale and warm up faster. It's advantageous from both a cost-saving and a user-experience perspective to let customers utilize the highest throughput their system can support.
Connection opening and pooling is the front door to the SQL experience and forms a customer's first impression of developing with SQL Server. We need to redesign the .NET SqlClient connection pool to support modern Azure SQL use cases and C# language features.
Low connection creation throughput impacts customers in key scenarios, for example:
Root Causes
There are two main issues that cause the low connection creation throughput. These issues are inherent to the design of the connection pool and necessitate a rewrite.
1. Async-unfriendly locks
All locks in the current connection pool derive from WaitHandle and block the calling thread while waiting. Naturally, when acquisition blocks, parallel throughput is limited by the size of the thread pool. More modern locking mechanisms support asynchronous waits. Meanwhile, the severe performance issues in the async open flow keep customers stuck on the sync APIs despite their drawbacks.
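Roughly, the contrast between the two locking styles looks like this (an illustrative sketch, not taken from the pool code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative contrast only; names and counts are arbitrary.
internal static class LockStyles
{
    private static readonly Semaphore _legacyGate = new(initialCount: 1, maximumCount: 1);
    private static readonly SemaphoreSlim _asyncGate = new(initialCount: 1, maxCount: 1);

    public static void BlockingAcquire()
    {
        // WaitHandle-derived wait: the calling thread is parked until a slot frees,
        // so concurrency is capped by the number of available thread pool threads.
        _legacyGate.WaitOne();
        try { /* rent or open a connection */ }
        finally { _legacyGate.Release(); }
    }

    public static async Task AsyncAcquireAsync(CancellationToken ct)
    {
        // SemaphoreSlim supports an async wait: the thread returns to the pool while
        // waiting, so many opens can be in flight without pinning a thread each.
        await _asyncGate.WaitAsync(ct).ConfigureAwait(false);
        try { /* rent or open a connection */ }
        finally { _asyncGate.Release(); }
    }
}
```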
2. Async open requests are handled serially
The connection pool delegates async open requests to a background thread, where they are queued up, handled synchronously, and passed back to the thread that initiated the open via a task completion source. This wastes thread resources and forces async opens through a funnel that serializes them. Developers expect async APIs to be performant and thread-efficient; counterintuitively, the connection pool's async APIs are significantly slower and less efficient than their sync counterparts. At a deeper level, all create operations are also funneled through a single semaphore that guards creation.
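In simplified form, the pattern described above looks roughly like the sketch below (illustrative, not the actual implementation): every async open is queued to a single worker thread and completed through a task completion source, so throughput is bounded by that one worker.

```csharp
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Simplified sketch of the serialized async-open pattern; not SqlClient code.
internal sealed class SerializedOpener
{
    private readonly BlockingCollection<TaskCompletionSource<object>> _pending = new();

    public SerializedOpener()
    {
        // A single background thread drains the queue and opens connections synchronously.
        new Thread(() =>
        {
            foreach (var request in _pending.GetConsumingEnumerable())
            {
                object connection = OpenSynchronously(); // blocking physical open
                request.SetResult(connection);
            }
        })
        { IsBackground = true }.Start();
    }

    public Task<object> OpenAsync()
    {
        // The caller awaits a task that the single worker will complete later;
        // independent opens end up serialized behind one another.
        var tcs = new TaskCompletionSource<object>(TaskCreationOptions.RunContinuationsAsynchronously);
        _pending.Add(tcs);
        return tcs.Task;
    }

    private static object OpenSynchronously() => new object(); // placeholder for the real open
}
```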
Design Waypoints
Flow Diagrams
Figure 1: Proposed async open flow
Figure 2: Proposed return flow
Figure 3: Proposed cleanup/warmup flow
Appendix
Figure 4: Current internal data structures
Figure 5: Current create synchronization flow
Figure 6: Current async open flow