[Enhancement]: Migrate away from handling each packet in a dedicated goroutine #71
Labels
awaiting-approval
enhancement
What enhancement would you like to see?
Currently we handle each request in its own goroutine. This was fine originally when there were fewer requests, but now that we're handling thousands upon thousands of requests it can create an ungodly number of goroutines, which under heavy load means a lot of system resource usage. Goroutines are not "free": even though they're small, they still take up memory and can create slowdowns from context switching. Under the right circumstances this can leave clients with connection issues or cause large amounts of memory to be used.
To account for this, it may be worth moving to a worker pool-based approach with a channel queue. This would mean creating a pool of premade goroutines, say 100 of them, to handle incoming packets. A channel queue would feed each packet to whichever worker becomes free next, keeping resource usage at a more consistent, predefined level.
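As a rough sketch of what that could look like (the `Packet` type, handler, and sizes here are placeholders, not the actual server code):

```go
package server

// Sketch of a worker pool fed by a channel queue. The Packet type, handler,
// and sizes are placeholders for illustration only.

import "sync"

type Packet struct {
	Data []byte
}

type Pool struct {
	queue chan Packet
	wg    sync.WaitGroup
}

// NewPool starts `workers` goroutines up front. Each one pulls packets off the
// shared queue as soon as it finishes its previous packet.
func NewPool(workers, queueSize int, handle func(Packet)) *Pool {
	p := &Pool{queue: make(chan Packet, queueSize)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for pkt := range p.queue {
				handle(pkt)
			}
		}()
	}
	return p
}

// Submit blocks until a buffer slot or worker is free.
func (p *Pool) Submit(pkt Packet) {
	p.queue <- pkt
}

// Close stops accepting new work and waits for workers to drain the queue.
func (p *Pool) Close() {
	close(p.queue)
	p.wg.Wait()
}
```

The read loop would then call `pool.Submit(pkt)` instead of spawning `go handlePacket(pkt)`, so the number of in-flight handlers stays capped at the worker count.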
Any other details to share? (OPTIONAL)
Considerations
Some more accurate testing needs to be done to determine how much of an issue the current design really is. We've fixed a number of memory-related issues; however, we are still abusing goroutines pretty heavily, which goes against recommendations, and I do sometimes see the friends server hit high-ish memory usage.
There are a number of edge cases and caveats to consider when using this sort of design as well. For instance, whether or not to buffer the channel.
Buffering the channel gives it a predefined capacity. A send on a full buffered channel blocks rather than discarding data, so dropping new packets once the queue is full ("load shedding") would be done with a non-blocking send. This has the benefit of controlling memory usage better, but means we will begin to drop packets if the channel becomes overloaded. We already handle cases where packets may be dropped, since that can happen no matter what, so maybe that's not a huge deal, but it does mean that if we are consistently under load then the servers will consistently drop packets.
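Continuing the sketch above, the load-shedding submit path would look roughly like this (again, names are placeholders):

```go
// TrySubmit attempts to enqueue the packet without blocking. If the buffered
// queue is full, the packet is dropped and the caller is told so (load shedding).
func (p *Pool) TrySubmit(pkt Packet) bool {
	select {
	case p.queue <- pkt:
		return true
	default:
		return false // queue full: drop the packet
	}
}
```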
The alternative is an effectively unbounded queue. (An unbuffered Go channel isn't that — a send on it blocks until a worker receives — so an unbounded queue would have to be built on top of a channel, e.g. with a slice-backed buffer.) Under heavy load this means more memory usage as data backs up, and it also means that clients may actually have a higher risk of timing out. In a system where the channel is buffered, if a packet is dropped then the next time the client sends it, it may have a chance to get handled sooner, as there may be fewer packets in the queue at that time. With an unbounded queue, however, the amount of time before a packet is processed is variable. The queue may contain many instances of the same packet if it's resent multiple times, and if the queue is very long it may take a while before any of them get processed.
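If we did want an unbounded queue, one way to build it on top of channels is a shuttle goroutine that buffers overflow in a slice, continuing the same sketch (placeholder names again):

```go
// unboundedQueue shuttles packets from in to out, buffering any overflow in a
// slice so the sender never blocks. Memory use grows with the backlog.
func unboundedQueue(in <-chan Packet, out chan<- Packet) {
	var backlog []Packet
	for in != nil || len(backlog) > 0 {
		// Only offer to out when something is actually queued; a nil channel
		// makes that select case inert.
		var send chan<- Packet
		var next Packet
		if len(backlog) > 0 {
			send = out
			next = backlog[0]
		}
		select {
		case pkt, ok := <-in:
			if !ok {
				in = nil // input closed; drain the backlog
				continue
			}
			backlog = append(backlog, pkt)
		case send <- next:
			backlog = backlog[1:]
		}
	}
	close(out)
}
```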
Another thing to consider is the number of workers. Unlike in most languages, we can safely have more workers than CPU cores, since the Go runtime is very good at multiplexing them and context switching. However, since some operations may take a while to complete, we could run into a situation where all workers are busy. Say we have 100 workers and we get 101 requests, all of which are operations that take a second or more (this is a real possibility, as we had some friends server requests taking multiple seconds in the past). Suddenly that 101st request has to wait an extra second or more to be processed because no worker is available.
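Because of that, the worker count probably shouldn't be a hard-coded 100. One option is to make it configurable with a CPU-based default and tune it under real load; a small sketch, continuing the same illustrative code (assumes `runtime` is imported, and the multiplier is purely a guess):

```go
// workerCount picks the pool size: an explicit config value if one is set,
// otherwise a multiple of the CPU count, on the assumption that handlers spend
// most of their time blocked on I/O rather than burning CPU. The multiplier
// would need tuning against real traffic.
func workerCount(configured int) int {
	if configured > 0 {
		return configured
	}
	return runtime.NumCPU() * 16
}
```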
Alternatives
If we wanted to get really low level, we could do things like what Tailscale does. Tailscale uses some lower-level Linux facilities to improve performance and reliability, such as recvmmsg and netpoll, to handle incoming packets and find free goroutines to process them. This can be incredibly powerful and would likely help us a lot here, but it is also fairly complex and would likely break on platforms other than Linux.
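If we only wanted the batched-read part, golang.org/x/net/ipv4 exposes ReadBatch, which uses recvmmsg on Linux (and, as I understand it, falls back to single reads on other platforms), so we wouldn't have to write raw syscalls ourselves. A rough sketch, assuming an IPv4 UDP listener and placeholder addresses/sizes:

```go
package main

import (
	"log"
	"net"

	"golang.org/x/net/ipv4"
)

func main() {
	// Placeholder listen address; the real server's listener would go here.
	conn, err := net.ListenPacket("udp4", ":60000")
	if err != nil {
		log.Fatal(err)
	}
	pc := ipv4.NewPacketConn(conn)

	const batchSize = 32
	msgs := make([]ipv4.Message, batchSize)
	for i := range msgs {
		msgs[i].Buffers = [][]byte{make([]byte, 1500)}
	}

	for {
		// On Linux this pulls up to batchSize packets in one recvmmsg syscall.
		n, err := pc.ReadBatch(msgs, 0)
		if err != nil {
			log.Println("read batch:", err)
			continue
		}
		for _, m := range msgs[:n] {
			pkt := m.Buffers[0][:m.N]
			_ = pkt // hand each packet off to the worker pool here
		}
	}
}
```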