
Improve upgrade mechanisms to keep service as healthy as possible #8

Open
s4ke opened this issue Jan 9, 2023 · 4 comments

Comments

@s4ke
Member

s4ke commented Jan 9, 2023

Currently we only wait until the node is drained. We should investigate whether it is feasible to also wait for all stacks to finish being moved over. Should we wait for all services to stop scheduling new tasks during a cluster upgrade?

Maybe we need to take a snapshot of all services and their replica counts before the upgrade, and then wait until the same replica counts are reached again?

@s4ke
Member Author

s4ke commented Mar 4, 2023

Something along these lines should help:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright 2023 NeuroForge GmbH & Co. KG <https://neuroforge.de>
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import sys
from dataclasses import dataclass
from datetime import datetime
from typing import List

import docker
from docker.models.services import Service

def print_timed(msg):
    to_print = '{} [{}]: {}'.format(
        datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'docker_events',
        msg)
    print(to_print)


@dataclass
class StateInfo:
    service: Service
    target_replicas: int
    actual_replicas: int


def has_long_restart_policy(service: Service):
    """
    detects services with a long restart policy such as
    cron style services with a restart condition
    """
    try:
        restart_policy = service.attrs["Spec"]["TaskTemplate"]["RestartPolicy"]
        delay_ns = restart_policy["Delay"]

        # 10 minutes in nanoseconds
        cutoff_ns = 10 * 60 * 1_000_000_000

        return delay_ns > cutoff_ns
    except (KeyError, TypeError):
        # no restart policy or no delay configured
        return False
    

def is_oneshot(service: Service):
    """
    detects services that are intended as one shot
    """
    try:
        restart_policy = service.attrs["Spec"]["TaskTemplate"]["RestartPolicy"]
        return restart_policy["Condition"] == "none"
    except (KeyError, TypeError):
        # no restart policy configured
        return False


def get_state_infos(client: docker.DockerClient) -> List[StateInfo]:
    state_info: List[StateInfo] = []
    services = client.services.list()
    service: Service
    for service in services:
        mode = service.attrs["Spec"]["Mode"]

        if is_oneshot(service):
            # TODO: if its a one shot, check if the task is still
            #       running
            continue
        if has_long_restart_policy(service):
            continue

        if "Replicated" in mode:
            target_replicas = mode["Replicated"]["Replicas"]
        elif "Global" in mode:
            target_replicas = len(client.nodes.list())
        else: 
            continue

        desired_running_tasks = service.tasks(filters={"desired-state": "running"})
        actually_running_tasks = [
            elem for elem in desired_running_tasks
            if elem["Status"]["State"] == "running"
        ]

        actually_running_tasks_count = len(actually_running_tasks)

        state_info.append(StateInfo(
            service=service,
            target_replicas=target_replicas,
            actual_replicas=actually_running_tasks_count
        ))

    return state_info


def is_settled() -> bool:
    client = docker.DockerClient()
    
    state_info = get_state_infos(client)

    settled_services = [elem for elem in state_info 
                        if elem.actual_replicas == elem.target_replicas]
    unsettled_services = [elem for elem in state_info 
                          if elem.actual_replicas != elem.target_replicas]
    
    unsettled_count = len(unsettled_services)

    for elem in settled_services:
        print_timed(f"OK: service {elem.service.name} ({elem.service.id}) has settled")
    for elem in unsettled_services:
        print_timed(f"NOK: service {elem.service.name} ({elem.service.id}) has not settled yet")
    
    return unsettled_count == 0


if __name__ == '__main__':
    if is_settled():
        print_timed("swarm has settled")
        sys.exit(0)
    else:
        print_timed("swarm has not settled yet")
        sys.exit(1)
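
In an upgrade pipeline this check would presumably be wrapped in a polling loop. A minimal sketch (the function name, timeout and interval below are made up, not part of the script above):

import time

def wait_until_settled(timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    # retry the one-shot check until the swarm converges or the timeout hits
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_settled():
            return True
        time.sleep(poll_s)
    return False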

@s4ke
Member Author

s4ke commented Jun 19, 2023

see moby/moby#34139 (comment)

@s4ke
Member Author

s4ke commented Jun 19, 2023

or moreover moby/moby#34139 (comment)

@s4ke
Member Author

s4ke commented Jun 30, 2023

leaving this here as well

As an alternative approach to moving services off nodes that are about to be drained, it would be worth trying to update services with "--constraint-add 'node.hostname!=$(hostname)'" (or any other constraint) on a per-need basis, instead of deploying them with the constraint from the get-go. I haven't tried this on a multi-node swarm yet, but trying it on a local one-node swarm suggests it is worth exploring further.

This could work:

  1. docker node update $(hostname) --label-add draining=yes
  2. For each service, run: docker service update --constraint-add "node.labels.draining!=yes" <service_name>
  3. actually drain the node
  4. docker node update $(hostname) --label-rm draining
  5. For each service, run: docker service update --constraint-rm "node.labels.draining!=yes" <service_name>

To avoid forcing tasks off all nodes at once, run these steps for one node at a time, in this order. A sketch of the procedure follows below.
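
For illustration, a minimal sketch of steps 1-3 with the Docker SDK for Python (untested on a real multi-node swarm; the helper name drain_node_gracefully is made up, and the draining label is the one from the steps above):

import docker

CONSTRAINT = "node.labels.draining!=yes"

def drain_node_gracefully(client: docker.DockerClient, hostname: str) -> None:
    # the nodes endpoint resolves both IDs and hostnames
    node = client.nodes.get(hostname)

    # step 1: label the node as draining
    spec = node.attrs["Spec"]
    spec.setdefault("Labels", {})["draining"] = "yes"
    node.update(spec)

    # step 2: push tasks off labelled nodes by adding the placement
    # constraint to every service that does not have it yet
    for service in client.services.list():
        placement = service.attrs["Spec"]["TaskTemplate"].get("Placement", {})
        constraints = placement.get("Constraints", [])
        if CONSTRAINT not in constraints:
            service.update(constraints=constraints + [CONSTRAINT])

    # step 3: actually drain the node
    node.reload()
    spec = node.attrs["Spec"]
    spec["Availability"] = "drain"
    node.update(spec)

    # steps 4 and 5 (removing the label and the constraint again)
    # would be the symmetric inverse once the node is back up.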

@s4ke s4ke changed the title Wait for all services to stop scheduling new things during cluster upgrade Improve upgrade mechanisms Dec 29, 2023
@s4ke s4ke changed the title Improve upgrade mechanisms Improve upgrade mechanisms to keep service as healthy as possible Dec 29, 2023