networking: better visibility of DNS/advertised-host issues. #12107

Closed
petermattis opened this issue Dec 6, 2016 · 13 comments

@petermattis
Collaborator

Last night a handful of delta nodes OOM'ed:

[screenshot: 2016-12-06 2:04 PM]

The periodic memory profiles showed a huge spike in memory from the Raft log entry slices:

         .          .     93:	// stopping once we have enough.
   10.77GB    10.77GB     94:	ents := make([]raftpb.Entry, 0, hi-lo)
         .          .     95:	size := uint64(0)

delta was restarted yesterday (12/5) at 21:43 and the memory spike occurred on 12/6 at 03:45.

The ranges graph showed a significant number of under-replicated ranges:

[screenshot: 2016-12-06 1:59 PM]

The replica leaseholders graph shows one node has no leases, which is curious:

[screenshot: 2016-12-06 2:00 PM]

Despite the under-replicated ranges, the replicate queue wasn't doing anything significant:

[screenshot: 2016-12-06 2:00 PM]

Looking at the logs from the node with 0 leases showed communication issues:

I161205 21:48:07.700832 182 vendor/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup cockroach-delta-01 on [::1]:53: read udp [::1]:49702->[::1]:53: read: connection refused"; Reconnecting to {"cockroach-delta-01:26257" <nil>}

Port 53 is the DNS port. So the client was trying to connect to cockroach-delta-01 and failing to resolve that hostname to an IP address. The node with the communication difficulties, cockroach-delta-04, experienced a root-disk-full situation. We have previously seen that a full root disk can lead to an empty /etc/resolv.conf. I hypothesize that the empty (or non-existent) /etc/resolv.conf led to Go falling back to localhost for DNS.
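As an illustration of the failure mode (not CockroachDB code), here is a minimal Go sketch of the kind of pre-flight check that would surface a broken resolver as one clear startup error instead of repeated gRPC reconnect noise; the hostname and timeout are just examples:

```go
// Illustrative pre-flight DNS check; not CockroachDB code.
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// checkResolvable verifies that a peer hostname resolves within the given
// timeout, so a broken /etc/resolv.conf shows up as a single clear error.
func checkResolvable(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return fmt.Errorf("cannot resolve %q: %v", host, err)
	}
	if len(addrs) == 0 {
		return fmt.Errorf("no addresses found for %q", host)
	}
	return nil
}

func main() {
	// Hostname taken from the log line above; purely an example.
	if err := checkResolvable("cockroach-delta-01", 2*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, "DNS pre-flight failed:", err)
		os.Exit(1)
	}
	fmt.Println("DNS looks healthy")
}
```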

Interestingly, delta-04 was able to talk via gossip to other nodes because gossip is configured (via the --join flag) to use IP addresses. This explains why the replicate queue was not fixing the under-replicated ranges: the replicate queue recognizes dead / unavailable nodes by a gossip-based signal. But gossip was fine for delta-04, so the replicate queue thought nothing had to be done. The under-replicated range metric, on the other hand, is powered by node liveness and the Raft progress of the replicas.

At 03:45, delta-04 reported the last of the communication problems:

I161206 03:45:05.268431 751 vendor/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup cockroach-delta-07 on [::1]:53: read udp [::1]:49088->[::1]:53: read: connection refused"; Reconnecting to {"cockroach-delta-07:26257" <nil>}

Shortly after that, the OOM occurred. The logs show a number of snapshots generated at that time. The Raft data shows that the associated ranges had never had their Raft logs truncated: the truncated state index is 10, which is the initial index position. The un-truncated Raft logs had thousands of entries. While storage.entries limits the number of entries it will return based on the maxBytes parameter, the code was allocating a slice sized for the full range of Raft log entries. (See #12100, which fixes the storage.entries behavior.)
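For reference, here is a minimal sketch of the idea behind that fix (not the actual #12100 change): grow the result lazily and stop once maxBytes is exceeded, instead of pre-allocating a slice for the full [lo, hi) span of an un-truncated Raft log. The fetch callback and byte-slice entries are stand-ins for the real storage types:

```go
package main

import "fmt"

// entriesSketch is an illustrative stand-in for the storage.entries logic
// discussed above, not the actual CockroachDB code. The result is grown
// lazily and bounded by maxBytes rather than allocated up front with
// make([]raftpb.Entry, 0, hi-lo).
func entriesSketch(fetch func(i uint64) ([]byte, bool), lo, hi, maxBytes uint64) [][]byte {
	var ents [][]byte
	var size uint64
	for i := lo; i < hi; i++ {
		data, ok := fetch(i)
		if !ok {
			break // gap in the log; stop here
		}
		size += uint64(len(data))
		// Always return at least one entry, then respect the byte budget.
		if len(ents) > 0 && size > maxBytes {
			break
		}
		ents = append(ents, data)
	}
	return ents
}

func main() {
	// Fake log: 10,000 one-KB entries; only a handful should be returned.
	fake := func(i uint64) ([]byte, bool) { return make([]byte, 1024), true }
	got := entriesSketch(fake, 11, 10011, 4096)
	fmt.Println("entries returned:", len(got)) // 4, not 10,000
}
```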

In person we discussed avoiding using DNS for intra-node communication, but that would likely run into problems due to our use of certs. The other action item is #12101, which is to make the replicate queue use the same signal (node liveness) as the under-replicated range metric.

@tamird
Contributor

tamird commented Dec 6, 2016

@petermattis disabled block_writer to allow delta to recover, and the event is largely over. Here's the relevant time slice, starting from the deploy which included #12100: https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-replicas?from=1481055277214&to=1481062260000&var-cluster=delta&var-node=All&var-rate_interval=1m.

[screenshot: 2016-12-06 17:13]

[screenshot: 2016-12-06 17:13]

Note the oscillation in the unavailable replicas graph; the end of that oscillation marks the shutdown of block_writer. I don't know what to make of that oscillation.

Runtime metrics were uninteresting during this time.

@bdarnell
Contributor

bdarnell commented Dec 7, 2016

In person we discussed avoiding using DNS for intra-node communication, but that would likely run into problems due to our use of certs.

Gossip uses certs in the same way; how does it get away with using IP addresses? I think we should be able to resolve any issues that may arise from using IP addresses instead of hostnames.

It would be good in most cases to advertise an IP instead of a hostname, but the problem is that when a node has multiple IPs it's not always easy to decide which one to advertise, while the machine will always have a distinguished default hostname.
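To illustrate that difficulty (a hedged sketch, not CockroachDB code): enumerating the machine's interface addresses in Go often yields several plausible candidates, with no obvious rule for which one peers can actually reach:

```go
package main

import (
	"fmt"
	"net"
)

// candidateAdvertiseIPs lists the non-loopback, non-link-local addresses on
// the machine. On a multi-homed host this can return several candidates,
// which is the "which IP do we advertise?" problem described above.
func candidateAdvertiseIPs() ([]net.IP, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	var ips []net.IP
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok {
			continue
		}
		ip := ipnet.IP
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	ips, err := candidateAdvertiseIPs()
	if err != nil {
		panic(err)
	}
	fmt.Println("candidate advertise addresses:", ips)
}
```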

@mberhault
Contributor

We put all the addresses/names we know of in the certs.
For example, delta's first node has the following:

            X509v3 Subject Alternative Name:
                DNS:cockroach-delta-01, DNS:cockroach-delta-01.c.cockroach-shared.internal, DNS:localhost, DNS:delta.gce.cockroachdb.com, IP Address:127.0.0.1, IP Address:104.196.191.196
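For completeness, a small Go sketch (not part of our tooling; the cert path is hypothetical) that dumps those SANs programmatically, to confirm both the hostnames and any IP we might advertise are covered:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// Print the Subject Alternative Names of a node certificate. The path
// "certs/node.crt" is only an example.
func main() {
	pemBytes, err := os.ReadFile("certs/node.crt")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs: ", cert.IPAddresses)
}
```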

@petermattis
Collaborator Author

Oh yeah, good point about the IP addresses being in the certs. Can we resolve the hostname and advertise the IP address? At least on GCE this seems like it would give the external IP address via /etc/hosts.

@tamird
Contributor

tamird commented Dec 7, 2016 via email

@a-robinson
Contributor

Are you aware of any automated processes that write to /etc/hosts? I don't have the largest sample size in the world, but I haven't ever seen it have disk-related problems.

@tamird
Contributor

tamird commented Dec 7, 2016 via email

@petermattis
Collaborator Author

If we can't resolve the hostname of the local node, we shouldn't let the process start.

@bdarnell
Contributor

bdarnell commented Dec 8, 2016

Some systems (notably Debian) add the default hostname to /etc/hosts with a loopback IP, so resolving our own hostname and advertising the result would not work in this case. This would be easy to special-case (if our hostname resolves to anything in 127.0.0.0/8, advertise the hostname instead), although I'm slightly worried about accumulating hacks and heuristics here.
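A minimal sketch of that heuristic, combining the "refuse to start if the hostname doesn't resolve" idea above with the loopback special case (illustrative only, not the actual CockroachDB behavior):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// advertiseAddr resolves our own hostname; it errors out if the name does
// not resolve at all, and falls back to advertising the hostname when every
// resolved address is loopback (e.g. the Debian /etc/hosts case).
func advertiseAddr() (string, error) {
	host, err := os.Hostname()
	if err != nil {
		return "", err
	}
	ips, err := net.LookupIP(host)
	if err != nil || len(ips) == 0 {
		return "", fmt.Errorf("hostname %q does not resolve; refusing to start: %v", host, err)
	}
	for _, ip := range ips {
		if !ip.IsLoopback() {
			// A routable address: advertise the IP directly.
			return ip.String(), nil
		}
	}
	// Only loopback addresses (127.0.0.0/8 or ::1): advertise the hostname
	// and let peers resolve it themselves.
	return host, nil
}

func main() {
	addr, err := advertiseAddr()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("advertising:", addr)
}
```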

@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Contributor

assigning @mberhault. Feel free to re-assign.

@mberhault
Contributor

I'm not sure what there is to do here.

Which host to advertise depends on the networking setup (VPC vs. open network) and the DNS setup (e.g., local VM hostnames are resolvable within the VPC in GCE/AWS/Azure, but not on Digital Ocean).

@bdarnell
Contributor

I don't think there's anything to do here for 1.0. Later, we might want to add some heuristics to better diagnose issues like this (e.g. resolving our own hostname to see if DNS is wired up correctly).

@bdarnell bdarnell modified the milestones: Later, 1.0 Apr 18, 2017
@mberhault
Contributor

Ok, renaming for now.

@mberhault mberhault changed the title from "stability: delta OOM postmortem (asymmetric partition event)" to "networking: better visibility of DNS/advertised-host issues." Apr 18, 2017
@petermattis petermattis modified the milestones: 1.1, Later Jun 1, 2017
@petermattis petermattis assigned tamird and unassigned mberhault Jun 1, 2017
@tamird tamird closed this as completed in 92c483e Jun 19, 2017