Add graceful termination to coturn chart #2456
@@ -0,0 +1,5 @@
The coturn chart has new functionality to enable graceful pod termination, by waiting for all active allocations on a coturn instance to drain first. When combined with a suitable external service discovery mechanism which can steer client traffic away from terminating coturn pods, this can be used to implement graceful rolling restarts of clusters of coturn instances.
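For illustration, enabling this behaviour comes down to overriding the two values introduced in the values.yaml hunk further down in this diff; the override file name and the shorter grace period below are just examples, not defaults from the chart:

```yaml
# coturn-values-override.yaml (hypothetical override file)
# Wait for active allocations to drain before a pod is terminated.
coturnGracefulTermination: true
# Example upper bound on how long a terminating pod may wait before it is
# forcibly killed (the chart's default is one day, as shown below).
coturnGracePeriodSeconds: 3600
```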
@@ -24,3 +24,15 @@ securityContext:

coturnTurnListenPort: 3478
coturnMetricsListenPort: 9641

# This chart optionally supports waiting for traffic to drain from coturn
# before pods are terminated. Warning: coturn does not have any way to steer
# incoming client traffic away from itself on its own, so this functionality
# relies on external traffic management (e.g. service discovery for active coturn
# instances) to prevent clients from sending new requests to pods which are in a
# terminating state.
coturnGracefulTermination: false
Contributor

I'd probably have this feature enabled by default, and make a […]. Things like nginx also gracefully terminate and don't just blindly kill all connections when receiving SIGTERM - the fact we're using a […]
Contributor (Author)
I'm not sure which of [enabled by default]/[disabled by default] is the least surprising option here. I would prefer to keep the default behaviour of the coturn chart close to that of the restund chart (which does not have any graceful termination functionality), given that the former should eventually replace the latter. Enabling this functionality increases the runtime complexity/number of moving parts, and I'm not sure whether that's a good default.
Contributor

… and this is mostly due to wireapp/coturn#5 now ungracefully terminating connections on SIGTERM, which is what creates the need for the additional moving parts ([…]).

Before wireapp/coturn#5, terminating endpoints in the service resource were marked with […].

wireapp/coturn#5 only makes SIGTERM ungraceful - coturn "immediately" (during the next event loop run) exits itself, entirely ignoring that there are still open connections - so we're now adding the […].

This means, with wireapp/coturn#5 merged, but users not explicitly setting […].

That's a regression IMHO. We should either put some more graceful termination logic into coturn itself (and probably get this upstreamed), so we don't need to maintain complex […]
Contributor (Author)

I think we're starting to talk at cross-purposes here.
No. wireapp/coturn#6 introduces the SIGTERM handler. wireapp/coturn#5 is the change which added the preStop hook script. The preStop hook was introduced before the signal handling change was made, with the goal of postponing signals being sent to coturn until connections had drained, under incorrect assumptions about coturn's signal handling behaviour when running as PID 1. The SIGTERM handler was subsequently introduced to ensure that coturn would exit correctly when signaled by the control plane (under the assumption that either the preStop hook has exited, or the grace period timeout has expired). I consider the addition of the SIGTERM handler to be a bugfix, as it means that coturn's signal handling behaviour is consistent irrespective of whether it runs as PID 1 or not.
This is not correct if you are not using the modifications to the Helm chart in this PR. Before this PR, there is no means to configure the termination grace period or enable the preStop hook. Hence, the default grace period of 30 seconds applies (see here and scroll down a bit).
In both of these cases, coturn dies relatively promptly after the control plane requests pod termination, and in both cases this happens regardless of the client load on a given coturn pod. Now, considering the changes in this PR -- this expects that the coturn image contains the preStop hook script, so when the config flag in the Helm chart is enabled it will actually work correctly (ref: e212919). Therefore we eliminate the case where the version of the Helm chart in this PR is using an old version of coturn which has neither the hook script nor the SIGTERM fix. If the operator has […]
I disagree. The changes introduced by this PR, in their default configuration, do not significantly alter the termination behaviour of the services defined in this chart in comparison to the version of the chart preceding this PR. I would also offer a gentle reminder that the coturn chart is currently marked in the tree as alpha-quality, and on this basis I think that significant changes to the chart's behaviour at this stage are within scope if we want to make them.
Contributor

Yes, I might have confused the two PR numbers. It was a bit confusing that the preStop script came before the SIGTERM handler.

My stance is that handling SIGTERM in coturn shouldn't be ungraceful in the first place - it's confusingly unintuitive compared to how other applications handle it. They usually switch into a mode where they don't accept new connections, and properly exit once the still-open connections are done (with whatever is supervising the process, be it Kubernetes, systemd, or an impatient operator, sending SIGKILL after a timeout). So with that implemented, wireapp/coturn#5 and the preStop hook in this PR here wouldn't be necessary at all.

It's probably best to carry the discussion on how SIGTERM should be handled upstream. We don't want to keep maintaining a coturn fork downstream anyways.
Contributor (Author)

Alright, I think I understand your point better now. Should I interpret your commentary here as an outright rejection of this PR? Or do we merge this, fix coturn, and then revise the Helm chart here afterwards? I do feel that we're getting into "perfect is the enemy of the good" territory here.
Contributor

I don't think we should merge this, and should undo wireapp/coturn#5 and wireapp/coturn#6 until SIGTERM is handled gracefully.

If SIGTERM is handled gracefully, we don't need any […]. This overall feels way less brittle, and a graceful SIGTERM handler is probably much more likely to get upstreamed as well.
Member

I don't know how complicated it is to get changes into coturn, but I suspect it is not a cakewalk. I think it is fine to write hacks to get things working and then push for changes upstream.
Contributor (Author)

Tracking ticket for this functionality is here: https://wearezeta.atlassian.net/browse/SQPIT-1013. If not now, we have a reminder to do this later.
# Grace period for terminating coturn pods, after which they will be forcibly
# terminated. This setting is only effective when coturnGracefulTermination is
# set to true.
coturnGracePeriodSeconds: 86400 # one day
|
Comment on lines
+35
to
+38
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this is a statefulset, by default it would upgrade pods one by one. So, this number will get multiplied by number of instances. Maybe it is better to either make this a
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We can't have more than one instance running at the same time, as they'd bind on the same ports in the host namespace. We can probably set |
During draining, shouldn't the endpoint slice of the service backing the statefulset have the draining endpoints with Ready being set to false? So if that's used for service discovery, it should already work?
This should indeed just work in environments which have Kube-aware service discovery. @julialongtin tells me, however, that this generally is not the case in on-premise environments, and that the preference is to set things like the TURN endpoints statically. I'm hesitant to assume too much Kube prior knowledge in something we're shipping to customers, and I'd rather have it explicitly stated that while this graceful termination functionality exists, it needs extra moving parts in order to work correctly (which might not be feasible to implement in all environments).
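For reference, this is roughly what Kube-aware service discovery sees for a draining pod. Names and addresses below are made up for illustration; the condition fields come from the standard discovery.k8s.io/v1 EndpointSlice API, and the port is the chart's coturnTurnListenPort:

```yaml
# Illustrative EndpointSlice entry for a terminating coturn pod.
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: coturn-abc12            # placeholder
  labels:
    kubernetes.io/service-name: coturn
addressType: IPv4
ports:
  - name: turn
    protocol: UDP
    port: 3478
endpoints:
  - addresses: ["10.0.0.12"]    # placeholder pod IP
    conditions:
      ready: false              # taken out of rotation for new traffic
      serving: true             # still serving existing allocations
      terminating: true         # pod has been asked to shut down
    targetRef:
      kind: Pod
      name: coturn-0            # placeholder
```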
Yes, on-prem environments might just have the IP addresses/hostnames hardcoded, and don't actually take things out of the rotation.

So it'd probably be even more important for coturn to do some more graceful termination when receiving SIGTERM, by just not accepting new connections after a SIGTERM.
Does restund do any sort of graceful termination on e.g. SIGTERM? My current understanding is that the failover strategy we have for restund is similar to what I'm trying to achieve here: we rely on fiddling with service discovery elsewhere to rotate instances out of the advertised instance group, and then watch the metrics until traffic has drained from the instance we're removing before we kill it off.

In the first instance I'm trying to implement automation for coturn which is at least approximately as bad as what we have for restund already, which can then be the basis for incremental future development. As I mentioned in the other review thread, this chart is currently marked as alpha-quality, so I'm not really interested in bikeshedding the perfect coturn-on-k8s management solution here. That being said, if you want to write some patches against coturn to introduce application-level graceful termination functionality then that would certainly not be unwelcome, and would address the points you've raised in the other review thread.
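As a sketch of the "watch the metrics until traffic has drained" approach, a preStop hook could poll coturn's metrics endpoint (port 9641 in this chart's values) until allocations reach zero. Everything below is an assumption for illustration -- in particular the metric name and the availability of a shell and curl in the image -- and is not the hook script actually shipped with coturn:

```yaml
lifecycle:
  preStop:
    exec:
      command:
        - /bin/sh
        - -c
        - |
          # Poll the (assumed) allocation gauge exposed on the metrics port
          # and return once it reports zero; the kubelet then sends SIGTERM,
          # or SIGKILL after terminationGracePeriodSeconds.
          while true; do
            allocs=$(curl -sf http://127.0.0.1:9641/metrics \
              | awk '/^turn_total_allocations/ {s+=$NF} END {print s+0}')
            # Note: a scrape failure also yields 0 here; a production script
            # would want to distinguish that case.
            [ "${allocs:-0}" -eq 0 ] && break
            sleep 10
          done
```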