-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: make underlying implementation of channels more efficient (Reference -> https://github.com/alphadose/ZenQ) #52652
Comments
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
I took a very quick look. Channels aren't just thread-safe queues; they also have to handle the CC @golang/runtime |
Yep, I wanted that only Syntactic Sugar of channels + awesome perf Since channels are used a lot, this might help people out |
PS:- this is just a POC, as you said it needs to be fully compatible with the linked methods in https://go.dev/src/runtime/chan.go |
Similarly, the implementation busy-spins when it waits, which is not suitable for the default implementation. Also, there's a separate issue for having lock-free channels #8899 |
The spinning is applied on a slot in the queue whose size can range from 2 to 2^64 - 1 Due to the lack of a global lock, the spinning is used for contesting slots with slot specific states due to which there are low number of collisions among reader and writer indexes due to the large number of slots and also a lot less time is spent in the Compare and Swap for loops because each lock/state has its own individual memory address. But you are right, in case there are a lot of concurrent writer goroutines (1 million or more) its better to use exponential backoff or just make the goroutine sleep/park to reduce slot contention and optimize even more As for #8899, my implementation has lower allocs/op in comparison to that lock-free buffer as well as the native channel and is even a lot faster |
You need to implement select, select fairness, parking and closing etc. before you can draw any conclusions on the performance. Of course, you can make a faster implementation when you discard some of the requirements. Similarly, also run the benchmarks that are in https://github.com/golang/go/blob/master/src/runtime/chan_test.go. Also you need to show benchmarks such as 1M blocked writers. If you are not demonstrating where your code is worse, then it's not a fair evaluation. |
Got it, I will draw another benchmark to replicate the 1M blocked writers scenario for both channels and zenQ and note down the resource utilization |
After that, we can have a look at how to implement select {} like mechanism between multiple zenQs |
Note for benchmarking please see https://pkg.go.dev/golang.org/x/tools/cmd/benchcmp and http://www.inanzzz.com/index.php/post/yz8n/using-golang-bench-benchstat-and-benchcmp-to-measure-performance. You need statistical significance when measuring. I also recommend reading through the lock-free channel issue fully, the design documents and previous pull-requests, before continuing. That way it'll avoid some topics that have come up before. |
@egonelbre I was able to implement parking thereby saving a ton of goroutines, and now ZenQ is more efficient than channels even in the case of 1 million blocking goroutines Here are the latest benchmarks run over 30 cases
Further improvements:-More improvements can be done by using runtime internal calls like But this requires assembly stubs to load the *g pointer, here is my current implementation please have a look at it https://github.com/alphadose/ZenQ/blob/main/lib_runtime_linkage.go and suggest if there are any better methods |
@egonelbre I was able to implement Here is an example code demonstrating its usage package main
import (
"fmt"
"github.com/alphadose/zenq"
)
type custom1 struct {
alpha int
beta string
}
type custom2 struct {
gamma int
}
var (
zq1 = zenq.New[int]()
zq2 = zenq.New[string]()
zq3 = zenq.New[custom1]()
zq4 = zenq.New[*custom2]()
)
func main() {
go looper(intProducer)
go looper(stringProducer)
go looper(custom1Producer)
go looper(custom2Producer)
for i := 0; i < 40; i++ {
// Selection occurs here
switch data := zenq.Select(zq1, zq2, zq3, zq4).(type) {
case int:
fmt.Printf("Received int %d\n", data)
case string:
fmt.Printf("Received string %s\n", data)
case custom1:
fmt.Printf("Received custom data type number 1 %#v\n", data)
case *custom2:
fmt.Printf("Received pointer %#v\n", data)
}
}
}
func intProducer(ctr int) { zq1.Write(ctr) }
func stringProducer(ctr int) { zq2.Write(fmt.Sprint(ctr * 10)) }
func custom1Producer(ctr int) { zq3.Write(custom1{alpha: ctr, beta: fmt.Sprint(ctr)}) }
func custom2Producer(ctr int) { zq4.Write(&custom2{gamma: 1 << ctr}) }
func looper(producer func(ctr int)) {
for i := 0; i < 10; i++ {
producer(i)
}
} |
@alphadose it's a polling implementation, which it should not be. The select contains a lockup with regards multiple selects on the same channel, even when there is a possibility for usual select to proceed. Similarly, the fairness guarantee doesn't seem to hold:
I'm all for implementing custom primitives for niche cases, however, I would recommend trying to improve the performance of the current implementation, rather than implementing all the minutia from scratch. |
Agreed regarding the fairness Also for the performance, I am currently testing by writing assembly to load the *g pointer from thread local storage and then try benchmarking with parking/unparking implementation, hopefully we can see some good improvement |
@egonelbre I was able to get a good improvement in performance by using
Here is the current comparison vs channels also run over 30 cases and queue_size = 4096
Also note that for queue_size = 4096, current ZenQ performs slower than channels in the case of million blocking writers but increasing the queue_size to 2^14 for both, ZenQ now also performs better than channels in the million goroutines case. queue_size = 2^14
|
Update:- Here are the latest benchmarks for queue_size 4096
|
It does not make too much sense to talk about performance until there's feature parity. As I mentioned, it's easy to make a faster queue, if you don't need all the features. Similarly, there are still bugs in the Select, which make the benchmark results not useful. This is the bug I mentioned in the previous comment (this program fails to complete sometimes):
Also there's a mutex per element, which causes it to behave like a fifo. This is also introduces a significant memory overhead to things like EDIT: Removed note about fairness of two concurrent receives, as I'm not sure whether it's required. |
Also, your implementation has an hardcoded size + requirement of power of 2 elements. |
Will work on the fairness of select by going through the channel tests Also regarding the power of 2 elements, this is just a POC, a const is required for declaring arrays instead of slices The tradeoff for not using powers of 2 is that you have to use modulo instead of index masking to get read/write pointers which will cost a few nanoseconds of operation time |
Also there isn't 1 mutex per element, its just num(mutexes) = queue_size there is only 1 mutex per slot, hence there isn't much memory overhead, this is evident from the million goroutines case, zenq took 400 MB lesser than channels |
@egonelbre just benchmarked now with
Same with Putting benchmarks aside, anything else from selection fairness and closing channels to be implemented to achieve feature parity ? |
I don't think restricting to pow2 is an acceptable tradeoff.
The slice should be implementable similarly how it's implemented currently. I suspect it still would need to be a slice in the runtime, otherwise it would increase binary bloat.
Size benchmark:
|
I don't think this is the full list and I'm not the person who can approve the change, but from the top of my head:
Or in other words, the problem isn't coming up with a slightly better queue, but integrating it into Go codebase and ensuring it doesn't break anything and preserves the necessary properties. |
Regarding the race detector, I cannot currently use race.Acquire since this is not runtime internal package |
Ah, also select needs to support both send and recv. Similarly the case of recv/send from nil channels. |
Apparently I made a critical error in one of my last comments:
It should have said "I'm not the person who can approve..."... I now need to find the largest facepalm emote. |
Updates:-
Now I will run this implementation on the test suite you provided and try to complete all tests including modifications to the current implementation if required |
@egonelbre Updates:- By adding direct_send and an optimistic first pass read approach to the selector, I was able to improve its performance Its now comparable to channels although still not as fast for large batch sizes
|
@alphadose note, there's still a significant memory usage difference between zenq and channels. 5x for the usual case. |
In terms of memory usage, it would be better to split it into 2 categories, initialization memory, and operational memory usage for fair comparison In terms of initialization memory, channels are better and have a lower footprint But in terms of operational memory usage ZenQ is better
|
If you pay the initialization cost, you can gain a substantial reduction in operational memory usage and the initialization cost is really cheap considering hardware nowadays its the operational cost which needs to be reduced and optimised |
There are many ephemeral channels for context cancellations (and similar) that are impacted by the memory usage. Of course, it's difficult to understand the real-world impact without testing. Based on my experience, small channels are more common than large channels that require high-throughput. |
true, if throughput is less then it is better to use channels to save resources but there also have been cases where this was insufficient in which scenarios users implemented their own queues imo golang should also provide a throughput optimised communication mechanism |
If 95% (just guessing) of the channel usage is non-throughput related then it needs to be carefully balanced with throughput. Similarly, the performance and memory usage of such use-cases is probably more important for the stdlib than high-workload and high-throughput. Similarly, As you've demonstrated it's possible to have external library queues that trade-off memory for performance. So there's no pressing need to have it in the stdlib, however, I definitely agree it would be nice. |
Actually, there is one more imp reason for making it a part of the std library Currently, it is not possible to access the goroutine struct and other such things from external lib In which case there are more optimisations that can be made making the current ZenQ even more efficient which wont be possible from external lib Extracting the last amount of juice can only be done from golang internal lib and after this is done, no external lib will be able to match the performance of native (unless its a better algorithm) |
@egonelbre performance improved a lot in the case of million goroutines
|
@egonelbre Updates:- operational memory usage reduced drastically (especially evident in the case of 1000 goroutines)
|
Here are a few more benchmarks for different environments/systems
|
I would be interested to see these benchmark results on other architectures, and those with less resources to work with. As a Ryzen 5800H is quite beefy compared to a lot of the systems these could be running on. Your implementation scales much better on the larger end of input/writers, but I'd be interested how the initialization overhead scales down on more constrained systems. |
cool, I was collecting benchmarks for all kinds of systems and will definitely benchmark this in low end systems like raspberry pi. I am gathering all benchmarks here |
@NyaaaWhatsUpDoc apologies for the delay, here are the benchmarks for a low end device (32 bit raspberry pi) Device Info
Benchstat of 20 cases
We can see that ZenQ scales well in both low end as well as high end devices |
ZenQ's
For selection ZenQ uses golang interfaces which internally work via dynamic dispatch, this cost adds up and attributes to the slower performance as compared to channels. |
Current implementation of channels could use some improvements in terms of
ns/op
,B/op
andallocs/op
I have made a POC thread-safe queue which has better performance in all the above 3 metrics https://github.com/alphadose/ZenQ
Here are the benchmarks vs native channels https://github.com/alphadose/ZenQ#benchmarks for a wide range of use cases
My proposal is to make the underlying implementation of native channels more efficient by changing the algorithm and data structure to the lock-free one used in ZenQ or some other suitable implementation (debatable)
The text was updated successfully, but these errors were encountered: