Merged
2 changes: 2 additions & 0 deletions changelog.d/0-release-notes/cannon-drain
@@ -0,0 +1,2 @@
The `.cannon.drainTimeout` setting on the wire-server helm chart has been
removed and replaced with `.cannon.config.drainOpts`.
4 changes: 4 additions & 0 deletions changelog.d/2-features/cannon-drain
@@ -0,0 +1,4 @@
Drain websockets in a controlled fashion when cannon receives a SIGTERM or
SIGINT. Instead of waiting for connections to close on their own, the websockets
are now severed at a controlled pace. This allows for quicker rollouts of new
versions.
6 changes: 6 additions & 0 deletions charts/cannon/templates/configmap.yaml
@@ -13,6 +13,12 @@ data:
gundeck:
host: gundeck
port: 8080

drainOpts:
gracePeriodSeconds: {{ .Values.config.drainOpts.gracePeriodSeconds }}
millisecondsBetweenBatches: {{ .Values.config.drainOpts.millisecondsBetweenBatches }}
minBatchSize: {{ .Values.config.drainOpts.minBatchSize }}

kind: ConfigMap
metadata:
name: cannon
11 changes: 2 additions & 9 deletions charts/cannon/templates/statefulset.yaml
@@ -30,17 +30,10 @@ spec:
annotations:
checksum/configmap: {{ include (print .Template.BasePath "/configmap.yaml") . | sha256sum }}
spec:
-terminationGracePeriodSeconds: {{ .Values.drainTimeout }} # should be higher than the sleep duration of preStop
+terminationGracePeriodSeconds: {{ add .Values.config.drainOpts.gracePeriodSeconds 5 }}
containers:
- name: cannon
image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
-lifecycle:
-preStop:
-# kubernetes by default immediately sends a SIGTERM to the container,
-# which would cause cannon to exit, breaking existing websocket connections.
-# Instead we sleep for a day. (SIGTERM is still sent, but after the preStop completes)
-exec:
-command: ["sleep", {{ .Values.drainTimeout | quote }} ]
volumeMounts:
- name: empty
mountPath: /etc/wire/cannon/externalHost
@@ -65,7 +58,7 @@ spec:
{{ toYaml .Values.resources | indent 12 }}
initContainers:
- name: cannon-configurator
-image: alpine:3.13.1
+image: alpine:3.15.4
command:
- /bin/sh
args:
11 changes: 10 additions & 1 deletion charts/cannon/values.yaml
@@ -5,6 +5,16 @@ image:
pullPolicy: IfNotPresent
config:
logLevel: Info

# See also the section 'Controlling the speed of websocket draining during
# cannon pod replacement' in docs/how-to/install/configuration-options.rst
drainOpts:
# The following drains a minimum of 400 connections/second
# for a total of 10000 over 25 seconds
# (if cannon holds more connections, draining will happen at a faster pace)
gracePeriodSeconds: 25
millisecondsBetweenBatches: 50
minBatchSize: 20
resources:
requests:
memory: "256Mi"
@@ -16,4 +26,3 @@ service:
name: cannon
internalPort: 8080
externalPort: 8080
-drainTimeout: 0
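
The figures in the `drainOpts` comment (400 connections/second, 10000 over 25 seconds) follow from the three defaults. A minimal Haskell sketch of that arithmetic, assuming batches are spaced exactly `millisecondsBetweenBatches` apart and never shrink below `minBatchSize`. The function names are hypothetical and not part of cannon:

```haskell
-- Hypothetical helpers (not from the cannon code base) reproducing the
-- arithmetic in the values.yaml comment. Both assume a non-zero
-- millisecondsBetweenBatches, which cannon's `run` enforces at startup.

-- | Connections severed per second if every batch has the minimum size.
minDrainRatePerSecond :: Integer -> Integer -> Integer
minDrainRatePerSecond millisBetweenBatches minBatch =
  (1000 * minBatch) `div` millisBetweenBatches

-- | Connections severed within the grace period at that minimum rate.
minDrainedWithinGrace :: Integer -> Integer -> Integer -> Integer
minDrainedWithinGrace graceSeconds millisBetweenBatches minBatch =
  batches * minBatch
  where
    -- Number of batches that fit into the grace period.
    batches = (graceSeconds * 1000) `div` millisBetweenBatches
```

With the chart defaults, `minDrainRatePerSecond 50 20` evaluates to 400 and `minDrainedWithinGrace 25 50 20` to 10000, matching the comment.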
5 changes: 5 additions & 0 deletions deploy/services-demo/conf/cannon.demo-docker.yaml
@@ -7,5 +7,10 @@ gundeck:
host: gundeck
port: 8086

drainOpts:
gracePeriodSeconds: 1
millisecondsBetweenBatches: 5
minBatchSize: 100

logLevel: Info
logNetStrings: false
5 changes: 5 additions & 0 deletions deploy/services-demo/conf/cannon.demo.yaml
@@ -7,5 +7,10 @@ gundeck:
host: 127.0.0.1
port: 8086

drainOpts:
gracePeriodSeconds: 1
millisecondsBetweenBatches: 5
minBatchSize: 100

logLevel: Info
logNetStrings: false
44 changes: 44 additions & 0 deletions docs/src/how-to/install/configuration-options.rst
@@ -119,6 +119,50 @@ Keys below ``gundeck.secrets`` belong into ``values/wire-server/secrets.yaml``:

After making this change and applying it to gundeck (ensure gundeck pods have restarted to make use of the updated configuration - that should happen automatically), make sure to reset the push token on any mobile devices that you may have in use.

Controlling the speed of websocket draining during cannon pod replacement
-------------------------------------------------------------------------

The 'cannon' component is responsible for persistent websocket connections.
With the default options, active websocket connections are drained slowly and
gracefully over a maximum of ``(number of cannon replicas * 30 seconds)`` during
the deployment of a new wire-server version. This leads to a very brief
interruption for Wire clients, which have to re-connect on the websocket.

You should not normally need to change these settings.

``drainOpts``: Drain websockets in a controlled fashion when cannon receives a
SIGTERM or SIGINT (this happens when a pod is terminated e.g. during rollout
of a new version). Instead of waiting for connections to close on their own,
the websockets are now severed at a controlled pace. This allows for quicker
rollouts of new versions.

There is no way to disable this behaviour entirely. Two extreme examples:

* The quickest way to kill cannon is to set ``gracePeriodSeconds: 1`` and
  ``minBatchSize: 100000``, which would sever all connections immediately. This
  is not recommended, as you could DDoS yourself by forcing all active clients
  to reconnect at the same time. With these settings, cannon pod replacement
  takes only 1 second per pod.
* The slowest way to roll out a new version of cannon, keeping websocket
  connections open for a long time, is to set ``minBatchSize: 1``,
  ``millisecondsBetweenBatches: 86400000`` and ``gracePeriodSeconds: 86400``,
  which would lead to one single websocket connection being closed immediately,
  and all others only after 1 day. With these settings, cannon pod replacement
  takes a full day per pod.

.. code:: yaml

   # overrides for wire-server/values.yaml
   cannon:
     config:
       drainOpts:
         # The following defaults drain a minimum of 400 connections/second
         # for a total of 10000 over 25 seconds
         # (if cannon holds more connections, draining will happen at a faster pace)
         gracePeriodSeconds: 25
         millisecondsBetweenBatches: 50
         minBatchSize: 20

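The 30 seconds per replica quoted above correspond to the default ``gracePeriodSeconds`` of 25 plus the 5 seconds that the statefulset template adds for ``terminationGracePeriodSeconds``. A rough Haskell sketch of that bound, assuming pods are replaced strictly one after another and ignoring pod startup time. The names are illustrative only and do not exist in wire-server:

```haskell
-- Illustrative only: these functions are not part of wire-server.

-- | Seconds Kubernetes waits before force-killing a cannon pod, mirroring
-- the chart expression @add gracePeriodSeconds 5@.
terminationGracePeriod :: Integer -> Integer
terminationGracePeriod graceSeconds = graceSeconds + 5

-- | Upper bound on the time spent draining during a rollout, assuming pods
-- are replaced one at a time.
maxRolloutDrainSeconds :: Integer -> Integer -> Integer
maxRolloutDrainSeconds replicas graceSeconds =
  replicas * terminationGracePeriod graceSeconds
```

For three replicas with the defaults, `maxRolloutDrainSeconds 3 25` gives 90 seconds.
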

Blocking creation of personal users, new teams
--------------------------------------------------------------------------
3 changes: 3 additions & 0 deletions services/cannon/cannon.cabal
@@ -89,6 +89,7 @@ library
, data-timeout >=0.3
, exceptions >=0.6
, extended
, extra
, gundeck-types
, hashable >=1.2
, http-types >=0.8
@@ -107,6 +108,8 @@ library
, text >=1.1
, tinylog >=0.10
, types-common >=0.16
, unix
, unliftio
, uuid >=1.3
, vector >=0.10
, wai >=3.0
5 changes: 5 additions & 0 deletions services/cannon/cannon.integration.yaml
@@ -16,5 +16,10 @@ gundeck:
host: 127.0.0.1
port: 8086

drainOpts:
gracePeriodSeconds: 1
millisecondsBetweenBatches: 500
minBatchSize: 5

logLevel: Info
logNetStrings: false
5 changes: 5 additions & 0 deletions services/cannon/cannon2.integration.yaml
@@ -16,5 +16,10 @@ gundeck:
host: 127.0.0.1
port: 8086

drainOpts:
gracePeriodSeconds: 1
millisecondsBetweenBatches: 5
minBatchSize: 100

logLevel: Info
logNetStrings: false
3 changes: 3 additions & 0 deletions services/cannon/package.yaml
@@ -26,6 +26,7 @@ library:
- data-default >=0.5
- data-timeout >=0.3
- exceptions >=0.6
- extra
- gundeck-types
- hashable >=1.2
- http-types >=0.8
@@ -43,6 +44,8 @@ library:
- text >=1.1
- tinylog >=0.10
- types-common >=0.16
- unix
- unliftio
- uuid >=1.3
- vector >=0.10
- wai >=3.0
12 changes: 10 additions & 2 deletions services/cannon/src/Cannon/Dict.hs
@@ -24,6 +24,7 @@ module Cannon.Dict
removeIf,
lookup,
size,
toList,
)
where

@@ -32,10 +33,11 @@ import Data.SizedHashMap (SizedHashMap)
import qualified Data.SizedHashMap as SHM
import Data.Vector (Vector, (!))
import qualified Data.Vector as V
-import Imports hiding (lookup)
+import Imports hiding (lookup, toList)

newtype Dict a b = Dict
-{_map :: Vector (IORef (SizedHashMap a b))}
+{ _map :: Vector (IORef (SizedHashMap a b))
+}

size :: MonadIO m => Dict a b -> m Int
size d = liftIO $ sum <$> mapM (\r -> SHM.size <$> readIORef r) (_map d)
@@ -68,6 +70,12 @@ removeIf f k d = liftIO . atomicModifyIORef' (getSlice k d) $ \m ->
lookup :: (Eq a, Hashable a, MonadIO m) => a -> Dict a b -> m (Maybe b)
lookup k = liftIO . fmap (SHM.lookup k) . readIORef . getSlice k

toList :: (MonadIO m, Hashable a) => Dict a b -> m [(a, b)]
toList =
fmap (mconcat . V.toList)
. V.mapM (fmap SHM.toList . readIORef)
. _map
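
`toList` reads each slice of the dictionary in turn and concatenates the results, so the returned list is a per-slice snapshot rather than one atomic view of the whole `Dict`. A self-contained sketch of the same idea, using `Data.Map` from containers in place of `SizedHashMap` and a plain list in place of `Vector`. `Shards`, `newShards`, `insertAt` and `toListAll` are made-up names for this example:

```haskell
import Control.Monad (replicateM)
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map

-- | A dictionary split into independently mutable slices (hypothetical
-- stand-in for Cannon.Dict).
newtype Shards k v = Shards [IORef (Map.Map k v)]

-- | Create at least one empty slice.
newShards :: Int -> IO (Shards k v)
newShards n = Shards <$> replicateM (max 1 n) (newIORef Map.empty)

-- | Insert into the slice selected by a caller-supplied index. The real
-- Dict picks the slice from the key's hash; that part is omitted here.
insertAt :: Ord k => Int -> k -> v -> Shards k v -> IO ()
insertAt i k v (Shards refs) =
  modifyIORef' (refs !! (i `mod` length refs)) (Map.insert k v)

-- | Read every slice in order and concatenate the entries. Slices are read
-- one after another, so concurrent writers may be only partly reflected.
toListAll :: Shards k v -> IO [(k, v)]
toListAll (Shards refs) = concat <$> mapM (fmap Map.toList . readIORef) refs
```

For draining, a snapshot of this kind is sufficient as long as the consumer tolerates entries that were removed after their slice was read.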

-----------------------------------------------------------------------------
-- Internal

25 changes: 24 additions & 1 deletion services/cannon/src/Cannon/Options.hs
@@ -29,7 +29,12 @@ module Cannon.Options
logLevel,
logNetStrings,
logFormat,
drainOpts,
Opts,
gracePeriodSeconds,
millisecondsBetweenBatches,
minBatchSize,
DrainOpts,
)
where

@@ -60,12 +65,30 @@ makeFields ''Gundeck

deriveApiFieldJSON ''Gundeck

data DrainOpts = DrainOpts
{ -- | Maximum amount of time draining should take. Must not be set to 0.
_drainOptsGracePeriodSeconds :: Word64,
-- | Maximum amount of time between batches. This speeds up draining when
-- few users are connected. Must not be set to 0.
_drainOptsMillisecondsBetweenBatches :: Word64,
-- | Batch size is calculated from the actual number of websockets and the
-- grace period. If the calculated number is too small,
-- '_drainOptsMinBatchSize' is used instead.
_drainOptsMinBatchSize :: Word64
}
deriving (Eq, Show, Generic)

makeFields ''DrainOpts

deriveApiFieldJSON ''DrainOpts

data Opts = Opts
{ _optsCannon :: !Cannon,
_optsGundeck :: !Gundeck,
_optsLogLevel :: !Level,
_optsLogNetStrings :: !(Maybe (Last Bool)),
-_optsLogFormat :: !(Maybe (Last LogFormat))
+_optsLogFormat :: !(Maybe (Last LogFormat)),
_optsDrainOpts :: DrainOpts
}
deriving (Eq, Show, Generic)
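
The haddocks on `DrainOpts` require two of the fields to be non-zero. One way to express that requirement as a pure check is sketched below. `DrainSettings` and `validateDrainSettings` are stand-ins invented for this example (the real `DrainOpts` uses lens-generated accessors), and the error strings are copied from the checks that `run` performs in Cannon/Run.hs:

```haskell
import Data.Word (Word64)

-- | Simplified stand-in for DrainOpts with plain record fields.
data DrainSettings = DrainSettings
  { gracePeriodSeconds :: Word64,
    millisecondsBetweenBatches :: Word64,
    minBatchSize :: Word64
  }
  deriving (Eq, Show)

-- | Reject settings that the haddocks declare invalid; return them
-- unchanged otherwise.
validateDrainSettings :: DrainSettings -> Either String DrainSettings
validateDrainSettings s
  | millisecondsBetweenBatches s == 0 =
      Left "drainOpts.millisecondsBetweenBatches must not be set to 0."
  | gracePeriodSeconds s == 0 =
      Left "drainOpts.gracePeriodSeconds must not be set to 0."
  | otherwise = Right s
```

Returning `Either` instead of calling `error` keeps the check testable; whether that fits cannon's startup code is a design choice this sketch does not settle.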

31 changes: 24 additions & 7 deletions services/cannon/src/Cannon/Run.hs
@@ -27,7 +27,7 @@ import Cannon.API.Public
import Cannon.App (maxPingInterval)
import qualified Cannon.Dict as D
import Cannon.Options
-import Cannon.Types (Cannon, applog, clients, mkEnv, monitor, runCannon', runCannonToServant)
+import Cannon.Types (Cannon, applog, clients, env, mkEnv, monitor, runCannon', runCannonToServant)
import Cannon.WS hiding (env)
import qualified Control.Concurrent.Async as Async
import Control.Exception.Safe (catchAny)
@@ -48,7 +48,10 @@ import Servant
import qualified System.IO.Strict as Strict
import qualified System.Logger.Class as LC
import qualified System.Logger.Extended as L
import System.Posix.Signals
import qualified System.Posix.Signals as Signals
import System.Random.MWC (createSystemRandom)
import UnliftIO.Concurrent (myThreadId, throwTo)
import qualified Wire.API.Routes.Internal.Cannon as Internal
import Wire.API.Routes.Public.Cannon
import Wire.API.Routes.Version.Wai
@@ -57,15 +60,16 @@ type CombinedAPI = PublicAPI :<|> Internal.API

run :: Opts -> IO ()
run o = do
when (o ^. drainOpts . millisecondsBetweenBatches == 0) $
error "drainOpts.millisecondsBetweenBatches must not be set to 0."
when (o ^. drainOpts . gracePeriodSeconds == 0) $
error "drainOpts.gracePeriodSeconds must not be set to 0."
ext <- loadExternal
m <- Middleware.metrics
g <- L.mkLogger (o ^. logLevel) (o ^. logNetStrings) (o ^. logFormat)
e <-
-mkEnv <$> pure m
-<*> pure ext
-<*> pure o
-<*> pure g
-<*> D.empty 128
+mkEnv m ext o g
+<$> D.empty 128
<*> newManager defaultManagerSettings {managerConnCount = 128}
<*> createSystemRandom
<*> mkClock
@@ -83,6 +87,9 @@ run o = do
server =
hoistServer (Proxy @PublicAPI) (runCannonToServant e) publicAPIServer
:<|> hoistServer (Proxy @Internal.API) (runCannonToServant e) internalServer
tid <- myThreadId
void $ installHandler sigTERM (signalHandler (env e) tid) Nothing
void $ installHandler sigINT (signalHandler (env e) tid) Nothing
runSettings s app `finally` do
Async.cancel refreshMetricsThread
L.close (applog e)
@@ -93,10 +100,20 @@ run o = do
loadExternal :: IO ByteString
loadExternal = do
let extFile = fromMaybe (error "One of externalHost or externalHostFile must be defined") (o ^. cannon . externalHostFile)
-fromMaybe (readExternal extFile) (return . encodeUtf8 <$> o ^. cannon . externalHost)
+maybe (readExternal extFile) (return . encodeUtf8) (o ^. cannon . externalHost)
readExternal :: FilePath -> IO ByteString
readExternal f = encodeUtf8 . strip . pack <$> Strict.readFile f

signalHandler :: Env -> ThreadId -> Signals.Handler
signalHandler e mainThread = CatchOnce $ do
runWS e drain
throwTo mainThread SignalledToExit

data SignalledToExit = SignalledToExit
deriving (Show)

instance Exception SignalledToExit
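
The handler above runs the drain and then throws `SignalledToExit` to the main thread, which lets the `finally` block in `run` do its cleanup. The pattern can be tried in isolation with the sketch below, which assumes a POSIX system and the unix package. `cleanupThenExit` and `demo` are hypothetical names, the cleanup action is a placeholder for `runWS e drain`, and `threadDelay` stands in for the blocking server loop:

```haskell
import Control.Concurrent (ThreadId, myThreadId, threadDelay, throwTo)
import Control.Exception (Exception, try)
import Data.IORef (newIORef, readIORef, writeIORef)
import System.Posix.Signals (Handler (CatchOnce), installHandler, raiseSignal, sigTERM)

-- | Exception used to interrupt the main thread once cleanup is done.
data SignalledToExit = SignalledToExit
  deriving (Show)

instance Exception SignalledToExit

-- | One-shot handler: run the cleanup action, then interrupt the main thread.
cleanupThenExit :: IO () -> ThreadId -> Handler
cleanupThenExit cleanup mainThread = CatchOnce $ do
  cleanup
  throwTo mainThread SignalledToExit

-- | Install the handler for SIGTERM, send SIGTERM to this process, and
-- report whether the cleanup ran before the main thread was interrupted.
demo :: IO Bool
demo = do
  cleaned <- newIORef False
  tid <- myThreadId
  _ <- installHandler sigTERM (cleanupThenExit (writeIORef cleaned True) tid) Nothing
  -- Stand-in for the blocking server loop: signal ourselves, then wait.
  result <- try (raiseSignal sigTERM >> threadDelay 5000000)
  didClean <- readIORef cleaned
  pure $ case (result :: Either SignalledToExit ()) of
    Left SignalledToExit -> didClean
    Right () -> False
```

Because the cleanup finishes before `throwTo` is called, the main thread is only interrupted after draining is complete. `CatchOnce` installs the handler for a single delivery, so a repeated signal is not handled by the same code again.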

refreshMetrics :: Cannon ()
refreshMetrics = do
m <- monitor
2 changes: 1 addition & 1 deletion services/cannon/src/Cannon/Types.hs
@@ -109,7 +109,7 @@ mkEnv ::
Env
mkEnv m external o l d p g t =
Env m o l d def $
-WS.env external (o ^. cannon . port) (encodeUtf8 $ o ^. gundeck . host) (o ^. gundeck . port) l p d g t
+WS.env external (o ^. cannon . port) (encodeUtf8 $ o ^. gundeck . host) (o ^. gundeck . port) l p d g t (o ^. drainOpts)

runCannon :: Env -> Cannon a -> Request -> IO a
runCannon e c r =