Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

database: Avoid race condition in connection creation #26147

Merged
merged 5 commits into from
Mar 26, 2024

Conversation

swenson
Copy link
Contributor

@swenson swenson commented Mar 25, 2024

When creating database connections, there is a race condition when multiple goroutines try to create the connection at the same time. This happens, for example, on leadership changes in a cluster.

Normally, the extra database connections are cleaned up when this is detected. However, some database
implementations, notably Postgres, do not seem to clean up in a timely manner, and can leak in these scenarios.

To fix this, we create a global lock when creating database connections to prevent multiple connections from being created at the same time.

We also clean up the logic at the end so that if (somehow) we ended up creating an additional connection, we use the existing one rather than the new one. This by itself would solve our problem long-term, however, would still involve many transient database connections being created and immediately killed on leadership changes.

It's not ideal to have a single global lock for database connection creation. Some potential alternatives:

  • a map of locks from the connection name to the lock. The biggest downside is the we probably will want to garbage collect this map so that we don't have an unbounded number of locks.
  • a small pool of locks, where we hash the connection names to pick the lock. Using such a pool generally is a good way to introduce deadlock, but since we will only use it in a specific case, and the purpose is to improve performance for concurrent connection creation, this is probably acceptable.

When creating database connections, there is a race
condition when multiple goroutines try to create the
connection at the same time. This happens, for
example, on leadership changes in a cluster.

Normally, the extra database connections are cleaned
up when this is detected. However, some database
implementations, notably Postgres, do not seem to
clean up in a timely manner, and can leak in these
scenarios.

To fix this, we create a global lock when creating
database connections to prevent multiple connections
from being created at the same time.

We also clean up the logic at the end so that
if (somehow) we ended up creating an additional
connection, we use the existing one rather than
the new one. This by itself would solve our
problem long-term, however, would still involve
many transient database connections being created
and immediately killed on leaderhip changes.

It's not ideal to have a single global lock for
database connection creation. Some potential
alternatives:

* a map of locks from the connection name to the lock.
  The biggest downside is the we probably will want to
  garbage collect this map so that we don't have an
  unbounded number of locks.
* a small pool of locks, where we hash the connection
  names to pick the lock. Using such a pool generally
  is a good way to introduce deadlock, but since we
  will only use it in a specific case, and the purpose
  is to improve performance for concurrent conenction
  creation, this is probably acceptable.
@swenson swenson requested a review from a team as a code owner March 25, 2024 23:09
@github-actions github-actions bot added the hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed label Mar 25, 2024
dbi := b.connections.Get(name)
if dbi != nil {
return dbi, nil
}

// slow path, create a new connection
// if we don't lock the rest of the operation, there is a race condition for multiple callers of this function
b.createConnectionLock.Lock()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we delete this block, the fix will still work; we still see a spike in open file descriptors on leadership transfers under load, but the file descriptors are closed correctly (because we switch to PutIfEmpty below).

@swenson swenson added this to the 1.16.1 milestone Mar 25, 2024
Copy link

github-actions bot commented Mar 25, 2024

CI Results:
All Go tests succeeded! ✅

Copy link

github-actions bot commented Mar 25, 2024

Build Results:
All builds succeeded! ✅

Co-authored-by: Jason O'Donnell <[email protected]>
@jasonodonnell jasonodonnell self-requested a review March 26, 2024 15:36
Copy link
Contributor

@benashz benashz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few suggestions/nits for your consideration. Otherwise LGTM!

for i := 0; i < goroutines; i++ {
go func(i int) {
dbis[i], errs[i] = b.GetConnection(ctx, s, "mockv5")
wg.Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably not necessary, but deferring this call just before calling b.GetConnection() feels a bit better.

errs := make([]error, goroutines)
for i := 0; i < goroutines; i++ {
go func(i int) {
dbis[i], errs[i] = b.GetConnection(ctx, s, "mockv5")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might want to factor mockv5 out to a constant to avoid duplication.

conn, ok := b.connections.PutIfEmpty(name, dbi)
if !ok {
// this is a bug
b.Logger().Error("Error: there was a race condition adding to the database connection map")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention that we hit a bug here? A user could then report it.

Should this be Warn() like line 355 below?


// TestGetConnectionRaceCondition checks that GetConnection always returns the same instance, even when asked
// by multiple goroutines in parallel.
func TestGetConnectionRaceCondition(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@jasonodonnell jasonodonnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-03-26 at 11 35 51 AM

I've tested this using test scenarios that resulted in hundreds of orphaned connections and this seemed to fix the issue! LGTM

@swenson
Copy link
Contributor Author

swenson commented Mar 26, 2024

Thanks!

@swenson swenson enabled auto-merge (squash) March 26, 2024 16:38
@swenson swenson merged commit a65d913 into main Mar 26, 2024
78 checks passed
@swenson swenson deleted the vault-25323/postgres-leak branch March 26, 2024 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants