unusable hostnames broadcast for inter-node communication #8832

Closed
sploiselle opened this issue Aug 25, 2016 · 12 comments
Labels
A-kv-client Relating to the KV client and the KV interface. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption.
Comments

@sploiselle
Contributor

When cockroach is started without the --host flag, it chooses the machine's hostname and broadcasts it as the address other nodes in the cluster should use to communicate with it. However, this name isn't necessarily reachable from other machines, and the only way to change it is the --host flag, which also restricts the interfaces Cockroach listens on.

Example

If I name my machine do-node-2 and start cockroach without the --host flag, the node broadcasts itself as do-node-2:26257. Often, that name has no DNS entry and resolves only on the machine itself.

Again, resolving this with --host is not viable in all scenarios because it changes the behavior from listening on all interfaces to listening on only one.

Consequence

  • Having nodes join clusters is difficult because it requires setting the --host flag
  • Nodes cannot replicate their data to others in the cluster if they use addresses like this, even though they all joined the cluster successfully. Attached is an example of this happening on Digital Ocean with 3 droplets. The first node started the cluster using the following command:
cockroach start --insecure --background --host=10.132.80.187

The second and third nodes joined the cluster using this command:

cockroach start --insecure --background --join=10.132.80.187:26257

[Screenshot attached: 2016-08-25, 2:34:36 PM]

Ranges on the first node cannot be replicated to the second and third nodes because the addresses they broadcast are unreachable from the first node (i.e., there are no DNS entries for do-node-2 and do-node-3).
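
To confirm this kind of failure by hand, one option is to run a small check on the node that cannot replicate, pointed at the address a peer advertises. The following is a minimal sketch in Python using only the standard library; the function name `check_advertised_address` and the three result strings are illustrative assumptions, not part of CockroachDB.

```python
import socket


def check_advertised_address(host, port, timeout=2.0):
    """Classify an advertised address as seen from the machine running this check.

    Returns "unresolvable", "unreachable", or "ok" (hypothetical labels).
    """
    try:
        # Resolve the advertised name the same way a dialing peer would.
        candidates = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "unresolvable"

    # Try every resolved address; one successful TCP connect is enough.
    for family, socktype, proto, _canonname, sockaddr in candidates:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return "ok"
        except OSError:
            continue
    return "unreachable"


# Example (run on node 1 against the name node 2 advertises):
#   check_advertised_address("do-node-2", 26257)
```

A result of "unresolvable" would correspond to the missing DNS entry described above, while "unreachable" would point to a routing or firewall problem instead. A successful TCP connect only shows that something is listening on that port, not that it is the expected node.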

@tamird
Contributor

tamird commented Aug 25, 2016

Dupe of #1008?

@sploiselle
Contributor Author

@tamird: Marc asked for a second ticket because of the potential one-way replication issue

@tamird
Contributor

tamird commented Aug 25, 2016

OK. cc @a-robinson to avoid duplication of effort.

@mberhault
Contributor

It's a bit more than that. In this case, nodes 2 and 3 joined the cluster properly because they were able to reach node 1. However, node 1 could not initiate a connection to the other two, meaning we were stuck with a single-node cluster.
The admin UI showed no sign of this. So while #1008 would solve the addressing issue, we still need to figure out a way to properly surface this scenario.

This would be needed for other uni-directional cases too, not just a bad advertised address; e.g., a badly configured firewall rule will trivially cause this.
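
As a rough illustration of what "surfacing" a uni-directional case could mean, the sketch below takes a list of observed successful connections and reports the pairs that only work in one direction. It is an assumption-laden toy, not how CockroachDB tracks connectivity: the function name `one_way_links`, the input shape, and the sample data are all made up for this example.

```python
def one_way_links(reachable):
    """Return (src, dst) pairs where src can dial dst but dst cannot dial src.

    `reachable` is an iterable of (src, dst) pairs meaning "src successfully
    opened a connection to dst".
    """
    links = set(reachable)
    return sorted((src, dst) for src, dst in links if (dst, src) not in links)


# Illustrative data only, loosely modeled on the report: nodes 2 and 3 can
# reach node 1, but node 1 cannot dial either of them back.
observed = [
    ("node2", "node1"),
    ("node3", "node1"),
    ("node2", "node3"),
    ("node3", "node2"),
]
print(one_way_links(observed))
```

The same check covers the firewall case, because it looks only at which directions succeeded and not at why the other direction failed.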

@petermattis petermattis added this to the 1.0 milestone Feb 22, 2017
@petermattis
Collaborator

@tamird, @mberhault We now have the --advertise-host flag. Is there anything left to do here?

@tamird
Contributor

tamird commented Feb 22, 2017

@petermattis @mberhault the --advertise-host flag was added in #9503, which also closed #1008, but this issue is (according to @mberhault) not a duplicate of #1008.

@mberhault @sploiselle can you clarify?

@a-robinson
Contributor

It seems like there's still the matter of helping folks who accidentally run into this. There's no real visibility into what's wrong in cases like this.

@sploiselle
Contributor Author

This is a UX issue, as @a-robinson said. Nodes join the cluster but will never successfully receive data. Identifying this could be non-obvious to users because everything appears to be working from the CLI.

@petermattis
Collaborator

Perhaps the new cluster visualization could highlight this situation. Cc @maxlang, @mrtracy, @kuanluo.

@spencerkimball spencerkimball added the S-3-ux-surprise Issue leaves users wondering whether CRDB is behaving properly. Likely to hurt reputation/adoption. label Apr 2, 2017
@spencerkimball
Member

Does anyone have a concrete suggestion for how to address this? I'm removing the 1.0 milestone because I wasn't able to come up with a decent idea. Note that there are complaints in the logs about this situation:

I170402 16:25:37.422167 236 vendor/google.golang.org/grpc/clientconn.go:806  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 198.105.244.228:26259: i/o timeout"; Reconnecting to {f34tegaf.com:26259 <nil>}
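
For anyone scanning logs for this symptom, the dialed IP and the reconnect target can be pulled out of a line like the one above. This is a sketch only: the regular expression is written against that single quoted line, handles IPv4 addresses only, and may not match other gRPC versions; `DIAL_FAILURE` and `parse_dial_failure` are hypothetical names.

```python
import re

# Pattern derived from the one log line quoted in this thread (assumption:
# other gRPC releases may word or order this message differently).
DIAL_FAILURE = re.compile(
    r'dial tcp (?P<ip>[^:\s]+):(?P<port>\d+): (?P<reason>[^"]+)"; '
    r'Reconnecting to \{(?P<target>[^\s}]+)'
)


def parse_dial_failure(line):
    """Return the dialed IP, port, reason, and reconnect target, or None."""
    match = DIAL_FAILURE.search(line)
    return match.groupdict() if match else None


SAMPLE = (
    'I170402 16:25:37.422167 236 vendor/google.golang.org/grpc/clientconn.go:806  '
    'grpc: addrConn.resetTransport failed to create client transport: '
    'connection error: desc = "transport: dial tcp 198.105.244.228:26259: '
    'i/o timeout"; Reconnecting to {f34tegaf.com:26259 <nil>}'
)
print(parse_dial_failure(SAMPLE))
```

Comparing the `target` field (the advertised name) with the `ip` field (what that name resolved to locally) is one way to spot a hostname that resolves on the dialing node but does not lead to the intended peer.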

@petermattis
Collaborator

Might be fixed by #16177.

@knz knz added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Apr 27, 2018
@tbg tbg added the A-kv-client Relating to the KV client and the KV interface. label May 15, 2018
@tbg
Member

tbg commented Jun 6, 2018

This particular one is indeed fixed by #16177, except in the case where the local host can resolve the hostname. I think what's left here is some general warning mechanism that fires when other nodes can't reach a node at its advertised address, even though connections work in the other direction. That is tracked in #18850.

@tbg tbg closed this as completed Jun 6, 2018