This repository has been archived by the owner on Jan 20, 2022. It is now read-only.
We are using kopeio/etcd-manager:3.0.20190801 in our k8s cluster for both events and main, and the two instances corrupted the /etc/hosts file after some hours.
On the consistent master it looks like this:
# Your system has configured 'manage_etc_hosts' as True.
# As a result, if you wish for changes to this file to persist
# then you will need to either
# a.) make changes to the master file in /etc/cloud/templates/hosts.tmpl
# b.) change or remove the value of 'manage_etc_hosts' in
# /etc/cloud/cloud.cfg or cloud-config from user-data
#
127.0.1.1 ip-1-2-3-4.ourdomain.pri ip-1-2-3-4
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
while on one of the other masters, where it is damaged, it looks like this:
r-data
#
127.0.1.1 ip-1-2-3-6.ourdomain.pri ip-1-2-3-6
127.0.0.1 localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
# Begin host entries managed by etcd-manager[etcd] - do not edit
1.2.3.4 etcd-a.internal.example.com
1.2.3.5 etcd-b.internal.example.com
1.2.3.6 etcd-c.internal.example.com
# End host entries managed by etcd-manager[etcd]
# Begin host entries managed by etcd-manager[etcd-events] - do not edit
1.2.3.4 etcd-events-a.internal.example.com
1.2.3.5 etcd-events-b.internal.example.com
1.2.3.6 etcd-events-c.internal.example.com
# End host entries managed by etcd-manager[etcd-events]
As you can see, after some concurrent writes the events and main etcd-manager instances damaged the beginning of the file (removing part of the cloud.cfg comment). After some time they remove the host entries as well, and we end up with a file that contains no entries for localhost or for the hostname ip-x-x-x-x, which causes all the calico nodes in the cluster to become unready.
Attaching the 2 hosts files and part of the Kibana logs we see:
In hosts.go, line 94 and line 210 may run at nearly the same time in events and main, and so may two WriteFile calls (line 210). According to the WriteFile documentation, it truncates the file before writing. This can lead to reading an empty or partially written file at line 94, or to two concurrent writes to the same file, each truncating what the other pod has already written.
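To avoid race conditions during read/write, I would use os.OpenFile and syscall.Flock instead of ReadFile/WriteFile, as the latter do nothing to protect against concurrent access from another process. Holding the lock across the whole read-modify-write would serialize the updates, provided both etcd-manager instances take the lock on the same file.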
I'd also like to note that this would only really get noticed in an environment where resolv.conf contains domain/search or where DNS is incorrectly configured.
If DNS resolves localhost as 127.0.0.1 then nobody would notice.
If DNS resolves localhost as localhost.localsubdomain (because resolv.conf contains domain localsubdomain or search localsubdomain) and gets a non-127.0.0.1 result, then it becomes noticeable (i.e. calico failing).
Attached files:
consistent-etc-hosts.txt
damaged-etc-hosts.txt
filtered-kibana-log.txt