networking: better visibility of DNS/advertised-host issues. #12107

Closed
petermattis opened this issue Dec 6, 2016 · 13 comments

@petermattis
Collaborator

Last night a handful of delta nodes OOM'ed:

[screenshot: 2016-12-06 2:04 PM]

The periodic memory profiles showed a huge spike in memory from the Raft log entry slices:

         .          .     93:	// stopping once we have enough.
   10.77GB    10.77GB     94:	ents := make([]raftpb.Entry, 0, hi-lo)
         .          .     95:	size := uint64(0)

delta was restarted yesterday (12/5) at 21:43 and the memory spike occurred on 12/6 at 03:45.

The ranges graph showed a significant number of under-replicated ranges:

[screenshot: 2016-12-06 1:59 PM]

The replica leaseholders graph shows one node has no leases, which is curious:

[screenshot: 2016-12-06 2:00 PM]

Despite the under-replicated ranges, the replicate queue wasn't doing anything significant:

[screenshot: 2016-12-06 2:00 PM]

Looking at the logs from the node with 0 leases showed communication issues:

I161205 21:48:07.700832 182 vendor/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup cockroach-delta-01 on [::1]:53: read udp [::1]:49702->[::1]:53: read: connection refused"; Reconnecting to {"cockroach-delta-01:26257" <nil>}

Port 53 is the DNS port. So the client was trying to connect to cockroach-delta-01 and failing to resolve that hostname to an IP address. The node with the communication difficulties, cockroach-delta-04, experienced a root-disk-full situation. We have previously seen that a full root disk can lead to an empty /etc/resolv.conf. I hypothesize that the empty (or non-existent) /etc/resolv.conf led to Go falling back to localhost for DNS.
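As an illustration of the failure mode (not CockroachDB code), here is a minimal Go sketch of the kind of pre-flight check that would surface a broken resolver as one clear startup error instead of repeated gRPC reconnect noise; the hostname and timeout are just examples:

```go
// Illustrative pre-flight DNS check; not CockroachDB code.
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// checkResolvable verifies that a peer hostname resolves within the given
// timeout, so a broken /etc/resolv.conf shows up as a single clear error.
func checkResolvable(host string, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	addrs, err := net.DefaultResolver.LookupHost(ctx, host)
	if err != nil {
		return fmt.Errorf("cannot resolve %q: %v", host, err)
	}
	if len(addrs) == 0 {
		return fmt.Errorf("no addresses found for %q", host)
	}
	return nil
}

func main() {
	// Hostname taken from the log line above; purely an example.
	if err := checkResolvable("cockroach-delta-01", 2*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, "DNS pre-flight failed:", err)
		os.Exit(1)
	}
	fmt.Println("DNS looks healthy")
}
```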

Interestingly, delta-04 was able to talk via gossip to other nodes because gossip is configured (via the --join flag) to use IP addresses. This explains why the replicate queue was not fixing the under-replicated ranges: the replicate queue recognizes dead / unavailable nodes by a gossip-based signal. But gossip was fine for delta-04, so the replicate queue thought nothing had to be done. The under-replicated range metric, on the other hand, is powered by node liveness and the Raft progress of the replicas.

At 03:45, delta-04 reported the last of the communication problems:

I161206 03:45:05.268431 751 vendor/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp: lookup cockroach-delta-07 on [::1]:53: read udp [::1]:49088->[::1]:53: read: connection refused"; Reconnecting to {"cockroach-delta-07:26257" <nil>}

Shortly after that, the OOM occurred. The logs show a number of snapshots generated at that time. The Raft data shows that the associated ranges had never had their Raft logs truncated: the truncated state index is 10, which is the initial index position. The un-truncated Raft logs had thousands of entries. While storage.entries limits the number of entries it will return based on the maxBytes parameter, the code was allocating a slice sized for the full range of Raft log entries. (See #12100, which fixes the storage.entries behavior.)
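For reference, here is a minimal sketch of the idea behind that fix (not the actual #12100 change): grow the result lazily and stop once maxBytes is exceeded, instead of pre-allocating a slice for the full [lo, hi) span of an un-truncated Raft log. The fetch callback and byte-slice entries are stand-ins for the real storage types:

```go
package main

import "fmt"

// entriesSketch is an illustrative stand-in for the storage.entries logic
// discussed above, not the actual CockroachDB code. The result is grown
// lazily and bounded by maxBytes rather than allocated up front with
// make([]raftpb.Entry, 0, hi-lo).
func entriesSketch(fetch func(i uint64) ([]byte, bool), lo, hi, maxBytes uint64) [][]byte {
	var ents [][]byte
	var size uint64
	for i := lo; i < hi; i++ {
		data, ok := fetch(i)
		if !ok {
			break // gap in the log; stop here
		}
		size += uint64(len(data))
		// Always return at least one entry, then respect the byte budget.
		if len(ents) > 0 && size > maxBytes {
			break
		}
		ents = append(ents, data)
	}
	return ents
}

func main() {
	// Fake log: 10,000 one-KB entries; only a handful should be returned.
	fake := func(i uint64) ([]byte, bool) { return make([]byte, 1024), true }
	got := entriesSketch(fake, 11, 10011, 4096)
	fmt.Println("entries returned:", len(got)) // 4, not 10,000
}
```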

In person we discussed avoiding using DNS for intra-node communication, but that would likely run into problems due to our use of certs. The other action item is #12101, which is to make the replicate queue use the same signal (node liveness) as the under-replicated range metric.

@tamird
Contributor

tamird commented Dec 6, 2016

@petermattis disabled block_writer to allow delta to recover, and the event is largely over. Here's the relevant time slice, starting from the deploy which included #12100: https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-replicas?from=1481055277214&to=1481062260000&var-cluster=delta&var-node=All&var-rate_interval=1m.

[screenshot: 2016-12-06 17:13]

[screenshot: 2016-12-06 17:13]

Note the oscillation in the unavailable replicas graph; the end of that oscillation marks the shutdown of block_writer. I don't know what to make of that oscillation.

Runtime metrics were uninteresting during this time.

@bdarnell
Contributor

bdarnell commented Dec 7, 2016

In person we discussed avoiding using DNS for intra-node communication, but that would likely run into problems due to our use of certs.

Gossip uses certs in the same way; how does it get away with using IP addresses? I think we should be able to resolve any issues that may arise from using IP addresses instead of hostnames.

It would be good in most cases to advertise an IP instead of a hostname, but the problem is that when a node has multiple IPs it's not always easy to decide which one to advertise, while the machine will always have a distinguished default hostname.
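To illustrate that difficulty (a hedged sketch, not CockroachDB code): enumerating the machine's interface addresses in Go often yields several plausible candidates, with no obvious rule for which one peers can actually reach:

```go
package main

import (
	"fmt"
	"net"
)

// candidateAdvertiseIPs lists the non-loopback, non-link-local addresses on
// the machine. On a multi-homed host this can return several candidates,
// which is the "which IP do we advertise?" problem described above.
func candidateAdvertiseIPs() ([]net.IP, error) {
	addrs, err := net.InterfaceAddrs()
	if err != nil {
		return nil, err
	}
	var ips []net.IP
	for _, a := range addrs {
		ipnet, ok := a.(*net.IPNet)
		if !ok {
			continue
		}
		ip := ipnet.IP
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue
		}
		ips = append(ips, ip)
	}
	return ips, nil
}

func main() {
	ips, err := candidateAdvertiseIPs()
	if err != nil {
		panic(err)
	}
	fmt.Println("candidate advertise addresses:", ips)
}
```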

@mberhault
Contributor

We put all the addresses/names we know of in the certs.
For example, delta's first node has the following:

            X509v3 Subject Alternative Name:
                DNS:cockroach-delta-01, DNS:cockroach-delta-01.c.cockroach-shared.internal, DNS:localhost, DNS:delta.gce.cockroachdb.com, IP Address:127.0.0.1, IP Address:104.196.191.196
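For completeness, a small Go sketch (not part of our tooling; the cert path is hypothetical) that dumps those SANs programmatically, to confirm both the hostnames and any IP we might advertise are covered:

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

// Print the Subject Alternative Names of a node certificate. The path
// "certs/node.crt" is only an example.
func main() {
	pemBytes, err := os.ReadFile("certs/node.crt")
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM data found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	fmt.Println("DNS SANs:", cert.DNSNames)
	fmt.Println("IP SANs: ", cert.IPAddresses)
}
```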

@petermattis
Collaborator Author

Oh yeah, good point about the IP addresses being in the certs. Can we resolve the hostname and advertise the IP address? At least on GCE this seems like it would give the external IP address via /etc/hosts.

@tamird
Contributor

tamird commented Dec 7, 2016 via email

@a-robinson
Contributor

Are you aware of any automated processes that write to /etc/hosts? I don't have the largest sample size in the world, but I haven't ever seen it have disk-related problems.

@tamird
Contributor

tamird commented Dec 7, 2016 via email

@petermattis
Collaborator Author

If we can't resolve the hostname of the local node, we shouldn't let the process start.

@bdarnell
Contributor

bdarnell commented Dec 8, 2016

Some systems (notably Debian) add the default hostname to /etc/hosts with a loopback IP, so resolving our own hostname and advertising the result would not work in this case. This would be easy to special-case (if our hostname resolves to anything in 127.0.0.0/8, advertise the hostname instead), although I'm slightly worried about accumulating hacks and heuristics here.
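A minimal sketch of that heuristic, combining the "refuse to start if the hostname doesn't resolve" idea above with the loopback special case (illustrative only, not the actual CockroachDB behavior):

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// advertiseAddr resolves our own hostname; it errors out if the name does
// not resolve at all, and falls back to advertising the hostname when every
// resolved address is loopback (e.g. the Debian /etc/hosts case).
func advertiseAddr() (string, error) {
	host, err := os.Hostname()
	if err != nil {
		return "", err
	}
	ips, err := net.LookupIP(host)
	if err != nil || len(ips) == 0 {
		return "", fmt.Errorf("hostname %q does not resolve; refusing to start: %v", host, err)
	}
	for _, ip := range ips {
		if !ip.IsLoopback() {
			// A routable address: advertise the IP directly.
			return ip.String(), nil
		}
	}
	// Only loopback addresses (127.0.0.0/8 or ::1): advertise the hostname
	// and let peers resolve it themselves.
	return host, nil
}

func main() {
	addr, err := advertiseAddr()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("advertising:", addr)
}
```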

@petermattis petermattis added this to the 1.0 milestone Feb 23, 2017
@dianasaur323
Contributor

assigning @mberhault. Feel free to re-assign.

@mberhault
Contributor

I'm not sure what there is to do here.

Which host to advertise depends on the networking setup (VPC vs. open network) and the DNS setup (e.g., local VM hostnames are resolvable within the VPC in GCE/AWS/Azure, but not on Digital Ocean).

@bdarnell
Contributor

I don't think there's anything to do here for 1.0. Later, we might want to add some heuristics to better diagnose issues like this (e.g. resolving our own hostname to see if DNS is wired up correctly).

@bdarnell bdarnell modified the milestones: Later, 1.0 Apr 18, 2017
@mberhault
Contributor

Ok, renaming for now.

@mberhault mberhault changed the title from "stability: delta OOM postmortem (asymmetric partition event)" to "networking: better visibility of DNS/advertised-host issues." Apr 18, 2017
@petermattis petermattis modified the milestones: 1.1, Later Jun 1, 2017
@petermattis petermattis assigned tamird and unassigned mberhault Jun 1, 2017
@tamird tamird closed this as completed in 92c483e Jun 19, 2017