Adding/removing a master node to the cluster is kicking off a whole cluster restart (canary deployment) #344
Comments
We have created an issue in Pivotal Tracker to manage this: https://www.pivotaltracker.com/story/show/160542355. The labels on this GitHub issue will be updated when the story is started.
Hi @lgunta2018, this is intended behaviour right now with the current way CFCR is deployed. Out of curiosity, what was the reason to scale masters down to 2 VMs?
Hi @alex-slynko
Hey @alex-slynko,
It depends. If the bosh resurrector has noticed that one of your master VMs has gone missing, it will recreate it without touching the workers. Also, say your manifest states master instances: 3 but one of the masters is failing; in that case bosh hasn't seen a change to the instances count, so it won't re-template the jobs and touch the workers.
Basically what Oleksandr said above.
Externalizing etcd would be a larger architectural change and may not necessarily solve the problem. We might be able to solve this in a simpler way by not iterating through the list of etcd links and configuring …
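To make the mechanics described above concrete, here is a rough sketch of the relevant pieces of a CFCR-style manifest; the names and values are illustrative, not the exact contents of cfcr.yml. As the comment explains, a change to the master instances count causes job templates to be re-rendered across the deployment, and the shared update block then rolls every affected instance group through its canary cycle, which is why the workers show up in the runlog below.

# Illustrative CFCR-style manifest fragment (values are examples, not the exact cfcr.yml)
instance_groups:
- name: master
  instances: 3            # changing this count is what triggers the full redeploy
  jobs: []                # kube-apiserver, etcd, etc. elided here
- name: worker
  instances: 3
  jobs: []                # kubelet, docker, etc. elided here

update:
  canaries: 1             # one canary per instance group, hence the "(canary)" lines in the runlog
  max_in_flight: 1
  canary_watch_time: 10000-200000
  update_watch_time: 10000-200000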
Hi @lgunta2018
There are two workarounds for this that I can think of, but we haven't tried them.
Feel free to ask more questions in the Slack channel or here.
Thanks, @youreddy && @alex-slynko for looking into this issue. I will try to use your workarounds to solve this problem for now. But it's a very useful feature in my case. I hope we will see some traction in this regard. I will let you know if the workarounds do not work for me.
What happened:
Adding/removing a master node to the cluster is kicking off the whole cluster restart (canary deployment)
What you expected to happen:
Adding/removing a master node should not kick off the whole cluster restart
How to reproduce it (as minimally and precisely as possible):
Steps:
runlog:
Removing master node:
bosh deploy -d cfcr ${KD}/manifests/cfcr.yml -o ${KD}/manifests/ops-files/iaas/aws/cloud-provider.yml -o cfcr-ops.yml -l <(bbl outputs)
Using environment 'https://10.0.0.6:25555' as client 'admin'
Using deployment 'cfcr'
Release 'cfcr-etcd/1.5.0' already exists.
Release 'bpm/0.12.3' already exists.
Release 'bosh-dns/1.8.0' already exists.
Release 'docker/32.0.0' already exists.

instance_groups:
  [... bosh deploy manifest diff (instance_groups, addons, and certificate variables) trimmed ...]
Continue? [yN]: y
Task 120
Task 120 | 23:47:07 | Preparing deployment: Preparing deployment (00:00:07)
Task 120 | 23:47:39 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 120 | 23:47:39 | Deleting unneeded instances master: master/a02c36fb-a6d0-4ca9-8178-8a929087d32e (2) (00:00:43)
Task 120 | 23:48:22 | Updating instance master: master/ba1fbbda-5080-4094-b8c8-671d9abb34a6 (0) (canary) (00:01:05)
Task 120 | 23:49:27 | Updating instance master: master/c5dd8b42-ee9e-4607-ad44-152144a7eebf (1) (00:01:22)
Task 120 | 23:50:49 | Updating instance worker: worker/43ac3278-5a5b-4a9b-a782-e3b52254f98d (0) (canary) (00:00:33)
Task 120 | 23:51:22 | Updating instance worker: worker/41823365-c567-4acc-a2c3-d3df897ee8b3 (1) (00:00:35)
Task 120 | 23:51:57 | Updating instance worker: worker/598eec97-2305-4eac-a784-2fee04d6121b (2) (00:00:41)
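The scale-down in the run above comes from an instances change in the manifest or an ops-file; the issue does not include the cfcr-ops.yml contents, but a minimal, hypothetical go-patch entry producing that change could look like this:

# Hypothetical cfcr-ops.yml snippet: scale the master instance group from 3 to 2
- type: replace
  path: /instance_groups/name=master/instances
  value: 2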
Same behavior is observed for adding a new master node:
→ bosh deploy -d cfcr ${KD}/manifests/cfcr.yml -o ${KD}/manifests/ops-files/iaas/aws/cloud-provider.yml -o cfcr-ops.yml -l <(bbl outputs)
Using environment 'https://10.0.0.6:25555' as client 'admin'
Using deployment 'cfcr'
Release 'bpm/0.12.3' already exists.
Release 'docker/32.0.0' already exists.
Release 'cfcr-etcd/1.5.0' already exists.
Release 'bosh-dns/1.8.0' already exists.
instance_groups:
  [... bosh deploy manifest diff trimmed ...]
Continue? [yN]: y
Task 234
Task 234 | 00:10:30 | Preparing deployment: Preparing deployment (00:00:06)
Task 234 | 00:11:07 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 234 | 00:11:07 | Creating missing vms: master/09e9e7e6-7e1f-4d51-b591-1c0f1e8ca12d (2) (00:01:27)
Task 234 | 00:12:34 | Updating instance master: master/ba1fbbda-5080-4094-b8c8-671d9abb34a6 (0) (canary) (00:01:07)
Task 234 | 00:13:41 | Updating instance master: master/c5dd8b42-ee9e-4607-ad44-152144a7eebf (1) (00:01:05)
Task 234 | 00:14:46 | Updating instance master: master/09e9e7e6-7e1f-4d51-b591-1c0f1e8ca12d (2) (00:01:27)
Task 234 | 00:16:13 | Updating instance worker: worker/43ac3278-5a5b-4a9b-a782-e3b52254f98d (0) (canary) (00:00:33)
Task 234 | 00:16:46 | Updating instance worker: worker/41823365-c567-4acc-a2c3-d3df897ee8b3 (1) (00:00:42)
Task 234 | 00:17:28 | Updating instance worker: worker/598eec97-2305-4eac-a784-2fee04d6121b (2) (00:00:35)
Task 234 Started Sat Sep 15 00:10:30 UTC 2018
Task 234 Finished Sat Sep 15 00:18:03 UTC 2018
Task 234 Duration 00:07:33
Task 234 done
Anything else we need to know?:
Adding worker nodes works fine; it does not restart the whole cluster.
kubo-deployment: v0.21.0
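Scaling workers is the same kind of change; a hypothetical ops-file entry (again, the actual file used here is not shown in the issue) that only touches the worker count might be:

# Hypothetical ops-file snippet: scale workers to 5 without changing masters
- type: replace
  path: /instance_groups/name=worker/instances
  value: 5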
Environment:

Deployment Info (bosh -d <deployment> deployment):
bosh deploy -d cfcr ${KD}/manifests/cfcr.yml -o ${KD}/manifests/ops-files/iaas/aws/cloud-provider.yml -o cfcr-ops.yml -l <(bbl outputs)

Name  Release(s)       Stemcell(s)                                    Config(s)        Team(s)
cfcr  bosh-dns/1.8.0   bosh-aws-xen-hvm-ubuntu-xenial-go_agent/97.16  1 cloud/default  -
      bpm/0.12.3                                                      2 runtime/dns
      cfcr-etcd/1.5.0
      docker/32.0.0
      kubo/0.21.0

Environment Info (bosh -e <environment> environment):

Name      bosh-cfcr-lgunta
UUID      58cd4562-b14b-4310-aaa3-bd1582250f34
Version   267.5.0 (00000000)
CPI       aws_cpi
Features  compiled_package_cache: disabled
          config_server: enabled
          dns: disabled
          snapshots: disabled
User      admin

Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.3", GitCommit:"a4529464e4629c21224b3d52edfe0ea91b072862", GitTreeState:"clean", BuildDate:"2018-09-10T11:44:36Z", GoVersion:"go1.11", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.2", GitCommit:"bb9ffb1654d4a729bb4cec18ff088eacc153c239", GitTreeState:"clean", BuildDate:"2018-08-07T23:08:19Z", GoVersion:"go1.10.3", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider (aws, gcp, vsphere): AWS