unusable hostnames broadcast for inter-node communication #8832

sploiselle · 2016-08-25T19:51:29Z

When cockroach is started without the --host flag, it uses the machine's hostname and broadcasts that as the address other nodes in the cluster should use to communicate with it. However, this name isn't necessarily reachable from other machines, and the only way to change it is the --host flag, which also limits the interfaces Cockroach listens on.

Example

If I name my machine do-node-2 and start cockroach without the --host flag, the node broadcasts itself as do-node-2:26257. Often, that name doesn't have the requisite DNS entries and is resolvable only from the machine itself.

Again, resolving this behavior with --host is not viable in all scenarios because the node then goes from listening on all interfaces to listening on only a single one.

Consequence

Having nodes join clusters is difficult because it requires setting the --host flag.
Nodes cannot replicate their data to other nodes that advertise addresses like this, even though all of them joined the cluster successfully. Attached is an example of this happening on Digital Ocean with 3 droplets. The first node started the cluster using the following command:

cockroach start --insecure --background --host=10.132.80.187

The second and third nodes joined the cluster using this command:

cockroach start --insecure --background --join=10.132.80.187:26257

Ranges on the first node cannot be replicated to the second and third nodes because the addresses they're broadcasting are unreachable by the first node (i.e., they don't have DNS entries to point to do-node-2 and 3).
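
A quick way to confirm this kind of failure is to check, from the node that cannot replicate (here the first node), whether the advertised addresses resolve at all. The following is a minimal sketch in Go using only the standard library; the function name `resolves` is hypothetical and is not part of CockroachDB. It has to be run from a peer, not from the node being checked, since a machine can usually resolve its own hostname.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// resolves reports whether the host part of a "host:port" address can be
// resolved from the machine running this program. IP literals always resolve.
// (Hypothetical helper for illustration; not a CockroachDB API.)
func resolves(addr string) (bool, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false, err // malformed address, e.g. missing port
	}
	ips, err := net.LookupHost(host)
	if err != nil {
		return false, nil // lookup failed: no usable DNS or hosts-file entry
	}
	return len(ips) > 0, nil
}

func main() {
	// Pass the addresses other nodes advertise, e.g. do-node-2:26257.
	for _, addr := range os.Args[1:] {
		ok, err := resolves(addr)
		switch {
		case err != nil:
			fmt.Printf("%s: invalid address: %v\n", addr, err)
		case ok:
			fmt.Printf("%s: resolves\n", addr)
		default:
			fmt.Printf("%s: does not resolve from this machine\n", addr)
		}
	}
}
```

A name that resolves is not necessarily reachable, so this only rules out the missing-DNS case described above.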

tamird · 2016-08-25T19:55:17Z

Dupe of #1008?

sploiselle · 2016-08-25T19:57:59Z

@tamird: Marc asked for a second ticket because of the potential one-way replication issue

tamird · 2016-08-25T19:59:50Z

OK. cc @a-robinson to avoid duplication of effort.

mberhault · 2016-09-01T12:54:32Z

It's a bit more than that. In this case, nodes 2 and 3 joined the cluster properly because they were able to reach node 1. However, node 1 could not initiate a connection to the other two, meaning we were stuck with a single-node cluster.
The admin UI showed no sign of this. So while #1008 would solve the addressing issue, we still need to figure out a way to properly surface this scenario.

This would be needed for other uni-directional cases, not just a bad advertised address. E.g., a badly configured firewall rule will trivially cause this.

petermattis · 2017-02-22T20:49:25Z

@tamird, @mberhault We now have the --advertise-host flag. Is there anything left to do here?

tamird · 2017-02-22T20:52:37Z

@petermattis @mberhault the --advertise-host flag was added in #9503, which also closed #1008, but this issue is (according to @mberhault) not a duplicate of #1008.

@mberhault @sploiselle can you clarify?

a-robinson · 2017-02-22T20:52:52Z

It seems like there's still the matter of helping folks who accidentally run into this. There's no real visibility into what's wrong in cases like this.

sploiselle · 2017-02-22T21:02:29Z

This is a UX issue, as @a-robinson said. Nodes join the cluster but will never successfully receive data. Identifying this could be non-obvious to users because everything appears to be working from the CLI.

petermattis · 2017-02-22T21:07:07Z

Perhaps the new cluster visualization could highlight this situation. Cc @maxlang, @mrtracy, @kuanluo.

spencerkimball · 2017-04-02T20:26:10Z

Does anyone have a concrete suggestion for how to address this? I'm removing the 1.0 milestone because I wasn't able to come up with a decent idea. Note that there are complaints in the logs about this situation:

I170402 16:25:37.422167 236 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 198.105.244.228:26259: i/o timeout"; Reconnecting to {f34tegaf.com:26259 <nil>}

petermattis · 2017-06-01T15:38:43Z

Might be fixed by #16177.

tbg · 2018-06-06T09:18:46Z

This particular one is indeed fixed by #16177, except in the case where the local host can resolve the hostname. I think what's left here is some general warning mechanism that fires when other nodes can't reach a node at its advertised address even though connections work in the other direction. That is tracked in #18850.

sploiselle assigned mberhault Aug 25, 2016

petermattis added this to the 1.0 milestone Feb 22, 2017

spencerkimball added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Apr 2, 2017

spencerkimball modified the milestones: Later, 1.0 Apr 2, 2017

kuanluo closed this as completed Apr 3, 2017

kuanluo reopened this Apr 3, 2017

a-robinson mentioned this issue Apr 4, 2017

Node freezes up after process restart #9658

Closed

tbg mentioned this issue May 26, 2017

cli: unusable ui url printed when $(hostname) doesn't resolve #16173

Closed

petermattis assigned tamird and unassigned mberhault Jun 1, 2017

knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018

tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018

tbg closed this as completed Jun 6, 2018
