Replace conditional variable with semaphore #24

Closed

@jan-dubsky jan-dubsky commented Oct 4, 2022

Motivation

We use pgx and pgxpool in our production app. We were recently running some load-tests with scenarios where the app is overloaded by massive CPU computations running in other goroutines (which is a realistic scenario for the app). Unfortunately, what we saw is that calls to pgxpool.Pool.Acquire() took about 5 seconds during our load-test. The 5 second limit was artificial, as that was the timeout of a single request to the load-tested app.

Problem analysis

Our investigation showed that the problem is the following piece of code in the puddle.Pool.Acquire() function:

	// Convert p.cond.Wait into a channel
	waitChan := make(chan struct{}, 1)
	go func() {
		p.cond.Wait()
		waitChan <- struct{}{}
	}()

	select {
	case <-ctx.Done():
		// Allow goroutine waiting for signal to exit. Re-signal since we couldn't
		// do anything with it. Another goroutine might be waiting.
		go func() {
			<-waitChan
			p.cond.L.Unlock()
			p.cond.Signal()
		}()

		p.cond.L.Lock()
		p.canceledAcquireCount += 1
		p.cond.L.Unlock()
		return nil, ctx.Err()
	case <-waitChan:
	}

Namely, the problem is the two background goroutines that are supposed to act as adapters between sync.Cond and the channel.

Scenario without cancellation

Let's start with the simpler scenario in which the context is not cancelled (i.e. the case <-ctx.Done() branch is never taken):

  1. A request goroutine enters Acquire() function and reaches this block of code.
  2. It spawns the adapter#1 goroutine:
    go func() {
    	p.cond.Wait()
    	waitChan <- struct{}{}
    }()
    
  3. The request processing goroutine is blocked on the select statement.
  4. Some resource (a pgxpool connection in our case) is released and adapter#1 is signalled.
  5. Adapter#1 wakes up (i.e. locks the lock), writes to the channel and terminates. The lock remains locked.
  6. Because the app is overloaded, multiple other goroutines are scheduled in the meantime in (almost) random order. At the same time, any attempt to access the pool blocks because the pool mutex is locked.
  7. The request goroutine is finally scheduled, reads from the channel, acquires a resource (connection) and unlocks the pool.

This scenario is problematic mostly because the pool lock remains locked for a significant amount of time. Any goroutine attempting to access the pool between points (5) and (7) is blocked. This state also prevents the Go scheduler from prioritizing the goroutine that holds the lock, because the goroutine which locked it is already dead. Consequently, the Go scheduler would have to perform quite complex analysis to find out which goroutine should get scheduling priority in order to unlock the lock.

Scenario with cancellation

Unfortunately, the scenario described above is not the worst possible one. The problematic scenario that caused our Acquire() calls to last up to 5 seconds is the one where the context is cancelled (i.e. when the case <-ctx.Done(): select branch is taken):

(Steps 1 to 3 are the same as in the previous case.)

  1. A request goroutine enters Acquire() function and reaches this block of code.
  2. It spawns the adapter#1 goroutine:
    go func() {
    	p.cond.Wait()
    	waitChan <- struct{}{}
    }()
    
  3. The request processing goroutine is blocked on the select statement.
  4. The request context is cancelled, the case <-ctx.Done(): branch is taken and the adapter#2 goroutine is started:
    go func() {
    	<-waitChan
    	p.cond.L.Unlock()
    	p.cond.Signal()
    }()
    
    The request processing goroutine then just returns.
  5. Some resource (connection) is released and the adapter#1 goroutine is signalled.
  6. The adapter#1 goroutine writes to the channel and exits.
  7. (Same as in the previous scenario.) Because the app is overloaded, multiple other goroutines are scheduled in the meantime in (almost) random order. At the same time, any attempt to access the pool blocks because the pool is locked.
  8. The adapter#2 goroutine is finally scheduled; it unlocks the lock and signals another adapter#1 goroutine.
  9. The cycle in points (6) to (8) repeats for all cancelled Acquire calls - no actual work (i.e. connection acquisition) is done.

The problem with this scenario is that due to cancellations, there can be an arbitrarily long chain of adapter#1 and adapter#2 goroutines, all belonging to Acquire calls that were already cancelled. Consequently, one signal might hop K times between adapter#1 and adapter#2 goroutines without doing any useful work (i.e. acquiring a resource).

What is even worse is that this has a positive feedback loop: the more requests time out, the more adapters there are. The more adapters there are, the more goroutines have to be scheduled to complete a single successful Acquire call. Consequently, Acquire calls take longer, which results in a higher chance of an Acquire timing out.

Analysis summary

The problem is the goroutines that are spawned to convert the sync.Cond into a channel (to allow selecting on both the condition and the context). Naturally, the other part of the problem is the adapters that convert the channel back into a sync.Cond signal.

Proposed solution

I propose modifying the code not to use sync.Cond at all. In this PR, I replace the condition variable with a semaphore. The logic change can be described as follows:

All goroutines are blocked at the beginning of the `Acquire` call by the semaphore (which supports cancellation). Once a goroutine acquires a semaphore token, it is guaranteed that there is either an idle resource in the pool or enough free capacity to create a new resource. No adapters are needed.
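
To illustrate the idea, here is a stripped-down, hypothetical sketch using golang.org/x/sync/semaphore (all names are assumed; this is not the actual puddle code). The semaphore counts free "slots" - idle resources plus unused capacity - so Acquire can block on a single cancellable call:

    package sketch

    import (
        "context"
        "sync"

        "golang.org/x/sync/semaphore"
    )

    type resource struct{}

    // pool illustrates the design: the semaphore counts free slots (idle
    // resources plus remaining capacity), so a goroutine holding a token is
    // guaranteed to either pop an idle resource or be allowed to create one.
    type pool struct {
        sem  *semaphore.Weighted
        mux  sync.Mutex
        idle []*resource
    }

    func newPool(maxSize int64) *pool {
        return &pool{sem: semaphore.NewWeighted(maxSize)}
    }

    func (p *pool) Acquire(ctx context.Context) (*resource, error) {
        // Blocks until a slot is free or ctx is cancelled - no adapter goroutines.
        if err := p.sem.Acquire(ctx, 1); err != nil {
            return nil, err
        }

        p.mux.Lock()
        if n := len(p.idle); n > 0 {
            res := p.idle[n-1]
            p.idle = p.idle[:n-1]
            p.mux.Unlock()
            return res, nil
        }
        p.mux.Unlock()

        // Holding a token also guarantees there is capacity for a new resource.
        return &resource{}, nil // constructor call omitted in this sketch
    }

    func (p *pool) Release(res *resource) {
        p.mux.Lock()
        p.idle = append(p.idle, res)
        p.mux.Unlock()
        p.sem.Release(1) // hand the slot to the next waiter
    }

The real implementation has more bookkeeping (constructors, destructors, TryAcquire, AcquireAllIdle), but the cancellation path above is the essential difference from the sync.Cond version.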

Benchmarking the solution

This PR adds multiple benchmarks, but two of them are relevant for observing the difference: BenchmarkAcquire_MultipleCancelled and BenchmarkAcquire_MultipleCancelledWithCPULoad. Both simulate a situation where some requests are cancelled.

Results before the change:

$ go test -benchmem -run=^$ -bench ^BenchmarkAcquire_MultipleCancelled.*$ ./...
goos: linux
goarch: amd64
pkg: github.com/jackc/puddle/v2
cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
BenchmarkAcquire_MultipleCancelled-4                        3721            334546 ns/op           10800 B/op        257 allocs/op
BenchmarkAcquire_MultipleCancelledWithCPULoad-4              842           1480372 ns/op             512 B/op         13 allocs/op

Results after the change:

$ go test -benchmem -run=^$ -bench ^BenchmarkAcquire_MultipleCancelled.*$ ./...
goos: linux
goarch: amd64
pkg: github.com/jackc/puddle/v2
cpu: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
BenchmarkAcquire_MultipleCancelled-4                        9073            148643 ns/op           10399 B/op        194 allocs/op
BenchmarkAcquire_MultipleCancelledWithCPULoad-4            10000            116802 ns/op             639 B/op         11 allocs/op

NOTE: Don't be surprised that the second benchmark is faster per operation with the improved code - it does only 3 cancelled requests per loop compared to 64 cancellations in the first benchmark. The *WithCPULoad benchmark would take too long (before the patch) if there were 64 cancellations.

The other benchmarks added by this PR were written during development, and I saw no reason not to keep them.

Implementation notes

This section describes all other changes this diff introduces and provides comments on them.

Another bugfix: Avoid goroutine leak

When asynchronous resource creation in the Acquire function failed, the error was unconditionally written to an unbuffered channel. But if no one was listening on that channel (because the Acquire call had been cancelled by the context), this write blocked forever. This caused a goroutine leak which could potentially result in the application being OOM killed.
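
A minimal sketch of this failure mode and the kind of fix applied (hypothetical names, not the exact PR code): making the result channel buffered means the constructor goroutine can always deliver its result and terminate, even when the waiter has already given up.

    package sketch

    import "context"

    // constructValue runs the constructor in a goroutine so that the caller can
    // abandon the wait when ctx is cancelled. The result channel is buffered, so
    // the send never blocks and the goroutine always terminates - no leak.
    func constructValue(ctx context.Context, constructor func(context.Context) (any, error)) (any, error) {
        type result struct {
            value any
            err   error
        }

        resChan := make(chan result, 1) // buffered: the send below cannot block

        go func() {
            value, err := constructor(ctx)
            resChan <- result{value: value, err: err} // safe even if nobody reads it
        }()

        select {
        case <-ctx.Done():
            return nil, ctx.Err() // the constructor goroutine still finishes on its own
        case res := <-resChan:
            return res.value, res.err
        }
    }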

This bug was discovered during the rewrite, which is the reason why it was not posted as a separate PR. Another reason is that this part of the code is subject to a heavy refactor in this PR, and posting it separately would cause a merge conflict.

AcquireAllIdle implementation changed

In the original implementation, AcquireAllIdle was guaranteed to atomically acquire all idle connections. In the new design, the implementation doesn't acquire all connections atomically. In this section, I will argue that this is not an issue and that the implementation still provides the same guarantees as the previous one.

The semaphore-based design requires AcquireAllIdle to acquire tokens from the semaphore before it can lock the pool mutex. At the same time, the implementation has to hold the pool mutex to find out how many idle connections there are. These two steps cannot be swapped because of a possible deadlock with the Acquire function. The solution is opportunistic: acquire as many tokens as possible first, then check how many connections are idle and release all tokens that exceed the number of idle resources.
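
Extending the hypothetical pool sketch from the Proposed solution section (again with assumed names, not the PR's exact code), the opportunistic approach looks roughly like this:

    // AcquireAllIdle grabs whatever semaphore tokens are immediately available,
    // then reconciles them with the real number of idle resources under the
    // mutex and hands surplus tokens back.
    func (p *pool) AcquireAllIdle() []*resource {
        tokens := 0
        for p.sem.TryAcquire(1) {
            tokens++
        }

        p.mux.Lock()
        defer p.mux.Unlock()

        if idle := len(p.idle); tokens > idle {
            // Tokens that cannot be backed by an idle resource are returned.
            p.sem.Release(int64(tokens - idle))
            tokens = idle
        }

        res := make([]*resource, tokens)
        copy(res, p.idle[len(p.idle)-tokens:])
        p.idle = p.idle[:len(p.idle)-tokens]
        return res
    }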

This approach is far from atomic, but its guarantees are strong enough to satisfy the requirements on AcquireAllIdle behaviour. AcquireAllIdle now guarantees that it returns every resource that (1) was idle before the start of AcquireAllIdle AND (2) was not reserved by an Acquire that started while AcquireAllIdle was executing. Neither exception is an issue, because in a parallel environment the order of concurrent actions is not defined. Let's discuss the two cases independently:
1. Resources that are concurrently being released do not have to be acquired by the AcquireAllIdle call. This situation is equivalent to the resource being released just after the AcquireAllIdle call.
2. While AcquireAllIdle is executing, there can be another concurrent Acquire call that gets a connection. This corresponds to the situation where Acquire is called just before AcquireAllIdle.

As you can see, neither of these two conditions is problematic in a parallel environment, and the behaviour of AcquireAllIdle still appears to be atomic.

In the original implementation, the AcquireAllIdle function was able to atomically acquire all idle resources. This was true because AcquireAllIdle was synchronized with all Acquire/TryAcquire calls (via the mutex). The new implementation comprises two synchronization primitives: a mutex and a semaphore. For this reason, the way Acquire/TryAcquire and AcquireAllIdle are synchronized is significantly more complex.

Both Acquire and TryAcquire are required to acquire a semaphore token before entering the critical section (locking the mutex). The semaphore acquire has approximately the following meaning: "Once the semaphore is acquired, the goroutine has a reservation to acquire a resource". This implies that AcquireAllIdle is not allowed to take idle resources from the pool if such a "reservation" has already been made by some Acquire/TryAcquire call running in parallel.

The new implementation of AcquireAllIdle guarantees that it returns all resources that are idle and haven't been "reserved" by some Acquire/TryAcquire call. This is not an issue, because once an idle resource is reserved, it's guaranteed that it will be acquired (see Generational stack below). This at least holds for idle resources in the pool.

This redefinition of the AcquireAllIdle guarantees is not problematic in a parallel environment, because we can say that the AcquireAllIdle call behaves as if all Acquire/TryAcquire calls that managed to "reserve" a resource executed before AcquireAllIdle. In the end, the only thing we need to guarantee is that all resources that are idle at the time AcquireAllIdle is called will get acquired eventually.

List of idleResources is now circular queue

This change was necessary because of the change in AcquireAllIdle. If we kept the stack representation of idleResources (implemented by array append and pop-back), we wouldn't be able to guarantee the properties of AcquireAllIdle. The undesired case would arise from condition (1) in the previous section:

1. Call AcquireAllIdle.
2. Lock all K semaphore tokens (in AcquireAllIdle)
3. Concurrent goroutine releases a resource.
4. AcquireAllIdle locks the mutex.
5. AcquireAllIdle acquires the newly released connection (it's the one on top of the stack) and K-1 idle resources.

The problem with this scenario is that it's no longer equivalent to a situation where Release is called after AcquireAllIdle finishes: the newly released resource is acquired instead of the resource at idleResources[0]. This issue was addressed by using a circular queue for idleResources.

List of idleResources is now a generational stack (replaces "List of idleResources is now circular queue")

To support the current implementation of AcquireAllIdle, the list of idle resources had to be changed to a generational stack. Within a single generation, a generational stack behaves as a standard stack (it allows push and pop). In addition to that, a generational stack can start a new generation. The behaviour is as follows: all elements pushed in a previous generation are popped before any element pushed in a later generation.
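
A rough sketch of such a data structure (names and details assumed; the PR's actual implementation may differ):

    package sketch

    // genStack is a stack whose elements are grouped into generations: elements
    // pushed in an older generation are always popped before elements pushed in
    // a newer one, and popping is LIFO within a generation.
    type genStack[T any] struct {
        gens [][]T // gens[0] is the oldest generation and is drained first
    }

    // Push adds v to the current (newest) generation.
    func (s *genStack[T]) Push(v T) {
        if len(s.gens) == 0 {
            s.gens = append(s.gens, nil)
        }
        last := len(s.gens) - 1
        s.gens[last] = append(s.gens[last], v)
    }

    // Pop removes an element from the oldest non-empty generation.
    func (s *genStack[T]) Pop() (T, bool) {
        for len(s.gens) > 0 {
            if n := len(s.gens[0]); n > 0 {
                v := s.gens[0][n-1]
                s.gens[0] = s.gens[0][:n-1]
                return v, true
            }
            s.gens = s.gens[1:] // oldest generation exhausted, drop it
        }
        var zero T
        return zero, false
    }

    // NewGeneration starts a fresh generation; everything already in the stack
    // will be popped before anything pushed from now on.
    func (s *genStack[T]) NewGeneration() {
        s.gens = append(s.gens, nil)
    }

As described below, AcquireAllIdle starts a new generation, so resources released afterwards land in a newer generation than any still-reserved old idle resource.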

If we used a conventional stack, there would exist the following race between Acquire/TryAcquire and AcquireAllIdle:

  1. Acquire (without loss of generality) acquires a semaphore token (reserves a resource) and is preempted before it locks the mutex.
  2. AcquireAllIdle locks the mutex and consumes all remaining tokens from the semaphore.
  3. AcquireAllIdle cannot take all idle connections because one of them is already "reserved" by the concurrent Acquire. For this reason, it takes all but one of the idle connections and returns them, expecting that the last idle connection will be taken by the Acquire.
  4. Some other goroutine calls Release on a resource it holds. This Release locks the mutex before Acquire does and pushes a new idle resource onto the top of the idleResources stack.
  5. Acquire finally locks the mutex and pops the resource from the top of the stack (the one released in (4)).

As a result of this race, there is one "old" idle resource remaining at the bottom of the stack, which would be an issue if AcquireAllIdle were used for keep-alive as its doc comment suggests.

To address this issue, we use a generational stack instead of a normal stack, and AcquireAllIdle always starts a new generation of the stack. Because the new generation is started in (3) (at the end of AcquireAllIdle), the Release call in (4) will push the released resource into the new stack generation. On the other hand, the idleResources pop in (5) will consume a resource from the old generation first, so no old resource will remain in the pool.

@jackc
Owner

jackc commented Oct 8, 2022

The semaphore strategy is much cleaner. I wish that semaphore package had existed when I wrote the original pool in pgx that was extracted to puddle.

Here are the things I noticed.

This requires Go 1.19. pgx (the primary consumer of puddle) supports at least the previous 2 Go releases. So this needs to be compatible with Go 1.18. That means we don't get the nicer atomic types. We'd need to use the older atomics and ensure that they work on 32-bit platforms. (It surprised me that people are still using 32-bit, but I've gotten several issues filed due to atomics on 32-bit.)

The test coverage is now only 98.4%. I'd like to get back to 100% test coverage.

I'd guess both of those are relatively easy to fix.

The change to a circular queue from a stack changes behavior that will affect users of the pool that implement inactivity checks. For example, consider the case of a pgxpool.Pool with MaxConnIdleTime set to 5 minutes. The pool has 10 connections. One request per second is received. With a stack, 9 of the connections will be closed in 5 minutes. With the circular queue they will never be closed.

But my primary concern is how big a change this is. This is a rewrite of almost the entire core and there's a lot of tricky logic involved. Aside from the concerns noted above it looks good to me, but I'd like to have input from more reviewers. I'm going to create an issue on pgx to request additional input.

@jan-dubsky
Author

Replies to code review notes

This requires Go 1.19. pgx (the primary consumer of puddle) supports at least the previous 2 Go releases. So this needs to be compatible with Go 1.18.

Just food for thought: Because puddle is primarily used by pgx, it might make sense to merge it into a single package, as you did with the pgconn, pgtype and pgxpool packages. In the current state of the repo, the version-compatibility requirement was not obvious to me while I was writing this PR, because it's not documented anywhere. Another point is that in the end you ask for more reviewers in the pgx repo, and pgx issues are linked in the puddle source code. Those two projects feel so tightly coupled that unification might make sense in pgx/v5.1?

That means we don't get the nicer atomic types. We'd need to use the older atomics and ensure that they work on 32-bit platforms. (It surprised me that people are still using 32-bit, but I've gotten several issues filed due to atomics on 32-bit.)

No problem. I used uber-go/atomic instead. If you don't want the dependency, I can rework this to use the standard atomic functions. On the other hand, given that this dependency would be there for only about 4 months (until the Go 1.20 release), it feels quite acceptable to me.

One nice thing about those atomic structs is that they address the alignment issue you had with 64-bit types on 32-bit platforms.

https://pkg.go.dev/sync/atomic#pkg-note-BUG

On ARM, 386, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically via the primitive atomic functions (types Int64 and Uint64 are automatically aligned). The first word in an allocated struct, array, or slice; in a global variable; or in a local variable (because the subject of all atomic operations will escape to the heap) can be relied upon to be 64-bit aligned.

In other words, the first element of a struct is always 64-bit aligned.
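
For illustration (a sketch only; apart from the canceledAcquireCount name seen in the quoted code above, the fields are made up): with the Go 1.19 struct-based atomics the alignment is handled for you, whereas the older function-based API requires the programmer to keep the int64 64-bit aligned on 32-bit platforms.

    package sketch

    import "sync/atomic"

    // Older API: the int64 itself must be 64-bit aligned on ARM, 386 and 32-bit
    // MIPS, which in practice means keeping it as the first field of the struct.
    type oldCounters struct {
        canceledAcquireCount int64 // must stay first for 32-bit alignment
    }

    func (c *oldCounters) inc() { atomic.AddInt64(&c.canceledAcquireCount, 1) }

    // Go 1.19+ API: atomic.Int64 is automatically 64-bit aligned, so the field
    // can be placed anywhere in the struct.
    type newCounters struct {
        closed               bool
        canceledAcquireCount atomic.Int64
    }

    func (c *newCounters) inc() { c.canceledAcquireCount.Add(1) }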

The test coverage is now only 98.4%. I'd like to get back to 100% test coverage.

If you take a look at which lines are not covered, they are two unreachable lines containing panics. I added comments to make it clear that this code is unreachable. Those two lines check that the state of the semaphore is in sync with the state of the pool. To be honest, those checks simplified debugging of many bugs, which is why I decided to keep them in my PR. On the other hand, if you don't like them, I can drop them from the code.

The change to a circular queue from a stack changes behavior that will affect users of the pool that implement inactivity checks. For example, consider the case of a pgxpool.Pool with MaxConnIdleTime set to 5 minutes. The pool has 10 connections. One request per second is received. With a stack, 9 of the connections will be closed in 5 minutes. With the circular queue they will never be closed.

OK, good point. I have reworked the code to use a generational stack instead of a circular queue (see the modified implementation notes below). A plain stack implementation was not race-free.

It would be useful to document this behaviour in the doc comments of the Acquire and TryAcquire functions. Your reasoning written here makes perfect sense, but it's not described in the package's API promises. This API promise and the reasoning behind it were not obvious to me while I was making this PR.

But my primary concern is how big a change this is. This is a rewrite of almost the entire core and there's a lot of tricky logic involved. Aside from the concerns noted above it looks good to me, but I'd like to have input from more reviewers. I'm going to create an issue on pgx to request additional input.

I agree. To be honest, I have been quite worried about how you would receive such a big change. On the other hand, I was not able to find any way to split this change into smaller pull requests. If you need someone for code review, I can ask somebody from our company to take a look. Yet I have to admit that I'd prefer a review by someone who has deep knowledge of the pgx internals rather than by my colleague.

Implementation notes (changes)

These changes have also been propagated to the main PR description so as not to confuse future readers.

AcquireAllIdle guarantees changed (replaces original content)

In the original implementation, the AcquireAllIdle function was able to atomically acquire all idle resources. This was true because AcquireAllIdle was synchronized with all Acquire/TryAcquire calls (via the mutex). The new implementation comprises two synchronization primitives: a mutex and a semaphore. For this reason, the way Acquire/TryAcquire and AcquireAllIdle are synchronized is significantly more complex.

Both Acquire and TryAcquire are required to acquire a semaphore token before entering the critical section (locking the mutex). The semaphore acquire has approximately the following meaning: "Once the semaphore is acquired, the goroutine has a reservation to acquire a resource". This implies that AcquireAllIdle is not allowed to take idle resources from the pool if such a "reservation" has already been made by some Acquire/TryAcquire call running in parallel.

The new implementation of AcquireAllIdle guarantees that it returns all resources that are idle and haven't been "reserved" by some Acquire/TryAcquire call. This is not an issue, because once an idle resource is reserved, it's guaranteed that it will be acquired (see Generational stack below). This at least holds for idle resources in the pool.

This redefinition of the AcquireAllIdle guarantees is not problematic in a parallel environment, because we can say that the AcquireAllIdle call behaves as if all Acquire/TryAcquire calls that managed to "reserve" a resource executed before AcquireAllIdle. In the end, the only thing we need to guarantee is that all resources that are idle at the time AcquireAllIdle is called will get acquired eventually.

List of idleResources is now a generational stack (replaces "List of idleResources is now circular queue")

To support the current implementation of AcquireAllIdle, the list of idle resources had to be changed to a generational stack. Within a single generation, a generational stack behaves as a standard stack (it allows push and pop). In addition to that, a generational stack can start a new generation. The behaviour is as follows: all elements pushed in a previous generation are popped before any element pushed in a later generation.

If we used a conventional stack, there would exist the following race between Acquire/TryAcquire and AcquireAllIdle:

  1. Acquire (without loss of generality) acquires a semaphore token (reserves a resource) and is preempted before it locks the mutex.
  2. AcquireAllIdle locks the mutex and consumes all remaining tokens from the semaphore.
  3. AcquireAllIdle cannot take all idle connections because one of them is already "reserved" by the concurrent Acquire. For this reason, it takes all but one of the idle connections and returns them, expecting that the last idle connection will be taken by the Acquire.
  4. Some other goroutine calls Release on a resource it holds. This Release locks the mutex before Acquire does and pushes a new idle resource onto the top of the idleResources stack.
  5. Acquire finally locks the mutex and pops the resource from the top of the stack (the one released in (4)).

As a result of this race, there is one "old" idle resource remaining at the bottom of the stack, which would be an issue if AcquireAllIdle were used for keep-alive as its doc comment suggests.

To address this issue, we use a generational stack instead of a normal stack, and AcquireAllIdle always starts a new generation of the stack. Because the new generation is started in (3) (at the end of AcquireAllIdle), the Release call in (4) will push the released resource into the new stack generation. On the other hand, the idleResources pop in (5) will consume a resource from the old generation first, so no old resource will remain in the pool.

@jackc
Owner

jackc commented Oct 14, 2022

It looks good to me. That generational stack idea is pretty clever.


Just food for thought: Because puddle is primarily used by pgx it might make sense to merge it into a single package

It's worth considering. puddle's primary reason for being is pgx. But a secondary reason was to encapsulate all the tricky logic involved in a resource pool. As time and experience have proved, it is very difficult to get right. I like the idea of there being a generic resource pool that solves all the low-level problems.

At the very least its overarching requirement of serving the needs of pgx should be documented.

The change to a circular queue from a stack changes behavior that will affect users of the pool that implement inactivity checks. For example, consider the case of a pgxpool.Pool with MaxConnIdleTime set to 5 minutes. The pool has 10 connections. One request per second is received. With a stack, 9 of the connections will be closed in 5 minutes. With the circular queue they will never be closed.

OK, good point. I have reworked the code to use a generational stack instead of a circular queue (see the modified implementation notes below). A plain stack implementation was not race-free.

It would be useful to document this behaviour in the doc comments of the Acquire and TryAcquire functions. Your reasoning written here makes perfect sense, but it's not described in the package's API promises. This API promise and the reasoning behind it were not obvious to me while I was making this PR.

I agree that should be documented.

The test coverage is now only 98.4%. I'd like to get back to 100% test coverage.

If you take a look at which lines are not covered, they are two unreachable lines containing panics. I added comments to make it clear that this code is unreachable. Those two lines check that the state of the semaphore is in sync with the state of the pool. To be honest, those checks simplified debugging of many bugs, which is why I decided to keep them in my PR. On the other hand, if you don't like them, I can drop them from the code.

That's fair. 100% test coverage is more of an emotional / peace of mind issue. I suppose we could instead say that we have 100% coverage of reachable code.


With regards to the Go 1.19 requirement vs. an external dependency and with regards to wanting more testing, there might be a solution that solves both of these issues.

If this is merged and tagged as v2.1 or v2.1-beta.1, but pgx does not update its dependency then puddle would be free to rely on Go 1.19. In addition, it would be easy for anyone who wanted to use / test the new version of puddle to do so. Go's minimum version selection policy would ensure that pgx defaulted to the older version. The pgx dependency could be updated when Go 1.20 is released. That would allow a few months of testing. By that point we could have a very high degree of confidence in the new architecture.

@jan-dubsky
Author

If this is merged and tagged as v2.1 or v2.1-beta.1, but pgx does not update its dependency then puddle would be free to rely on Go 1.19. In addition, it would be easy for anyone who wanted to use / test the new version of puddle to do so. Go's minimum version selection policy would ensure that pgx defaulted to the older version. The pgx dependency could be updated when Go 1.20 is released. That would allow a few months of testing. By that point we could have a very high degree of confidence in the new architecture.

I fully understand that you want to battle-test this solution and why you want to be careful about updating the pgx dependency. Yet I hope that this PR will be propagated to pgx sooner than in 4 months. As I stated at the very beginning of this PR, this "problem" with chained adapters is something we have observed in production, and it degrades the performance of our application in some cases (overload situations). For this reason, I'd love to get this change propagated to pgx as soon as possible.

Another point worth mentioning is whether you believe that anyone other than pgx would use a beta version of puddle. Are you aware of any other project that uses puddle and would be able to test it? It would be unfortunate to wait for 4 months and then find out that no one actually tested the code because pgx is the only user of it. But this is pure speculation on my side; you probably know how widely puddle is used.


@josharian josharian left a comment


Just a few small drive-by comments, since I happened to see a request for reviews over in pgx. I fear I am not ready to dedicate the time to a deeper review...and this is definitely complicated, subtle code that warrants a serious review.

I noticed a stress test and a bit of fuzzing. Those are heartening. The more automated tests that let you have a computer explore the state space while you sleep, the better. (Assuming you actually do that, of course!)

// createNewResource creates a new resource and inserts it into list of pool
// resources.
//
// WARNING: Caller of this method must hold the pool mutex!


If you want, you can enforce this with sync.Mutex.TryLock: https://pkg.go.dev/sync#Mutex.TryLock. You call TryLock; if it succeeds, the mutex was not held, so you panic. If it fails, all's well. It should be a single atomic load, so not too expensive...but worth double-checking.
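
A minimal sketch of the suggested assertion (illustrative only, not code from this PR; the helper name is made up):

    package sketch

    import "sync"

    // assertLocked panics if mu is demonstrably not held. It can have false
    // negatives (another goroutine may hold the lock), but no false positives,
    // which is what matters for a sanity check like this.
    func assertLocked(mu *sync.Mutex) {
        if mu.TryLock() {
            mu.Unlock()
            panic("caller must hold the pool mutex")
        }
    }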

Author

@jan-dubsky jan-dubsky Oct 18, 2022


Not really. The mutex could be held by a concurrently running goroutine, so TryLock would fail even though you are not the holder of the mutex.


It could have false negatives, but it would have no false positives, which is the important thing for such a check.

Author


It could have false negatives, but it would have no false positives, which is the important thing for such a check.
True.

To be honest, I don't see much value in this check. I agree that technically it's possible and it could find some bug. But first, we can all trivially verify that both call sites of this function are bug-free, so no check that would slow down real production code is necessary. Keep in mind that an atomic operation is still more costly than the ordinary condition guarding the panic mentioned above. The second reason not to do this check is that in some cases you have recursion or multiple sub-calls in parallel code, and the accumulated cost of all those checks would simply be too high. That argument doesn't apply in this situation, though; it's just a practice I try to follow.

"time"
)

// valueCancelCtx combines two contexts into one. One context is used for values and the other is used for cancellation.


Why? Contexts compose, so you shouldn't have to create your own type. (But maybe I'm missing something?)

Author


This is just code relocation.

But to answer your question: there is no way to combine the values of one context with the cancellation of another context - at least not without third-party libraries.
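
For illustration, such a combining type can be as small as the following sketch (assumed shape; the PR merely relocates the existing helper, so the real code may differ):

    package sketch

    import (
        "context"
        "time"
    )

    // valueCancelCtx takes cancellation (Deadline, Done, Err) from cancelCtx and
    // values from valueCtx.
    type valueCancelCtx struct {
        valueCtx  context.Context
        cancelCtx context.Context
    }

    func (c valueCancelCtx) Deadline() (time.Time, bool) { return c.cancelCtx.Deadline() }
    func (c valueCancelCtx) Done() <-chan struct{}       { return c.cancelCtx.Done() }
    func (c valueCancelCtx) Err() error                  { return c.cancelCtx.Err() }
    func (c valueCancelCtx) Value(key any) any           { return c.valueCtx.Value(key) }

    func newValueCancelCtx(valueCtx, cancelCtx context.Context) context.Context {
        return valueCancelCtx{valueCtx: valueCtx, cancelCtx: cancelCtx}
    }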

@jorgerasillo

@jan-dubsky first of all, thanks for putting this together, because as it turns out we're running into a similar problem :) My team and I are currently testing the latest version of your changes in a production-like environment and will give you feedback as soon as we have it.

@JordanP JordanP mentioned this pull request Oct 19, 2022
@jackc
Owner

jackc commented Oct 22, 2022

I fully understand that you want to battle-test this solution and why you want to be careful about updating the pgx dependency. Yet I hope that this PR will be propagated to pgx sooner than in 4 months. As I stated at the very beginning of this PR, this "problem" with chained adapters is something we have observed in production, and it degrades the performance of our application in some cases (overload situations). For this reason, I'd love to get this change propagated to pgx as soon as possible.

Right, but I guess my thought was that anyone who needs / wants it could get it with a single go get command, but no one else pays the risk of the new code.

Another point worth mentioning is whether you believe that anyone other than pgx would use a beta version of puddle. Are you aware of any other project that uses puddle and would be able to test it? It would be unfortunate to wait for 4 months and then find out that no one actually tested the code because pgx is the only user of it. But this is pure speculation on my side; you probably know how widely puddle is used.

According to https://pkg.go.dev/github.com/jackc/puddle/v2?tab=importedby there are only a couple of other projects directly using puddle v2. But pgx as a whole doesn't need to move to v2.1 for individuals to do so. Even a handful of people running it in production for a while would give me a lot more assurance. At the moment, based on the comments in this PR, at most 5 people have looked at the code -- to say nothing of production use or similar feedback.

At the very least, tagging a puddle release some amount of time before the dependency was updated in a pgx release would make it easier for people to test.

@jan-dubsky
Author

...
At the very least, tagging a puddle release some amount of time before the dependency was updated in a pgx release would make it easier for people to test.

You are right. Let's go with the v2.1 tagging. Thanks for the explanation :)

How about the atomic types? Do you want to require Go 1.19 for v2.1, or shall we use the uber atomic package as an intermediate solution and drop it in February?

@jackc
Owner

jackc commented Oct 24, 2022

Yeah, let's use the Go 1.19 atomics and avoid the dependency.

@jorgerasillo

Wanted to follow up with an update on our use of this branch in a production-like environment. We have recently had to downgrade our pgx version to v4 because we were seeing issues like the ones noted here and here, which essentially render the application unresponsive.

@jackc
Owner

jackc commented Oct 28, 2022

I rebased this onto master and resolved the merge conflicts locally, then merged and pushed it back up. GitHub doesn't seem to recognize the merge, but it is on master now.

I tagged v2.1.0 so it can immediately be used.

This seems to also resolve jackc/pgx#1354. If that proves to be accurate and we get some good user experience reports, then it may be worth going back to the uber atomic package for Go 1.18 compatibility and updating pgx to use puddle v2.1.0 earlier than originally planned.

Thanks @jan-dubsky!
