Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplified SSH Forwarding Configuration #563

Merged
merged 4 commits into from
Jun 29, 2022

Conversation

willdonnelly
Copy link
Member

@willdonnelly willdonnelly commented Jun 24, 2022

Description:

This is an attempt to implement the mechanism I described in #513, which allows a top-level endpoint config property named address to serve as the forwardHost/forwardPort values for SSH tunneling (and replaces the address property with 127.0.0.1:$RANDOM_LOCAL_PORT before invoking the connector).

It includes some additional logic which does a similar job when asked for the endpoint config spec, and omits the forwardHost, forwardPort, and localPort properties from the SSH config portion of the resulting schema IFF the main endpoint config schema contains a top-level address property.

It should be noted that the new logic is backwards-compatible in two ways:

  • First, any existing captures/materializations which explicitly specify forwardHost / forwardPort / localPort should continue to operate as normal. This means that we can choose to migrate existing uses of source-postgres and other connectors whenever it's convenient.
  • Second, any connectors which do not have a top-level address property will continue to provide the "full" version of the SSH config spec, which includes forwardHost and friends. This means that we can choose to modify additional connectors (such as materialize-postgres) to use the specific address configuration whenever it's convenient.

This whole PR still feels really hacky because of the reliance on a specific endpoint property address, but I believe it should work well in practice and I couldn't justify using a more complicated mechanism at this time. The main issues are:

  1. To opt in to this behavior, connectors will need to be modified. For instance materialize-postgres will need to replace its current host and port properties with address, which is itself a potentially-breaking change that needs some care (either keeping things backwards-compatible or pinning-then-upgrading the live materializations) to not break existing uses.
  2. There could theoretically be some connectors out there with a top-level address property which does not have the expected semantics, and we would do the wrong thing with those. However I have checked and I do not believe any of our current officially supported connectors have this issue, and it's a bit unlikely anyway.

Workflow steps:

Configuration for some connectors (currently source-postgres and source-mysql) will no longer require forwardHost and related properties to be specified. Instead the primary address field should always describe the target database, and the fact that it's not directly reachable but goes through an SSH bastion host is only represented by the presence of sshEndpoint and privateKey configuration.

Documentation links affected:

Probably https://docs.estuary.dev/guides/connect-network/, and it looks like https://docs.estuary.dev/reference/Connectors/capture-connectors/PostgreSQL/#amazon-rds directly mentions forwardHost so that should be changed too.


This change is Reviewable

Copy link
Member

@psFried psFried left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.

One question I have is whether it would also make sense to apply this behavior in the presence of separate host and port properties. There certainly doesn't seem to be any reason why we couldn't, though it would add complexity to a feature that we may not be entirely certain about. WDYT?

crates/connector_proxy/src/libs/network_tunnel.rs Outdated Show resolved Hide resolved
@willdonnelly
Copy link
Member Author

whether it would also make sense to apply this behavior in the presence of separate host and port properties

Yeah, that's absolutely an option. There were basically two reasons I was hesitant to go that route:

  1. I think it's good for user experience to have basic details like host:port configuration look as similar as possible between connectors. I never really wrote up my argument for why I think that a "host:port"-valued /address property is the optimal one we should standardize on, but I may as well sketch that out now so it's written down somewhere. Basically:
    • Separate /host and /port properties don't work out very well, in the web UI that exists today. Observe the "Create Materialization" form for materialize-postgres, where the rendered sequence of form fields is "Host", "Password", "Port".
    • Even if the field ordering issue was fixed, using separate properties for host and port makes the configuration process look marginally more complex (5 fields vs 4, in that case) for no real benefit. Like, my ideal configuration flow looks like "Address, Username, Password, done" so that extra field really matters. Concatenating host:port as a single string is intuitive to pretty much everybody who's worked much with computers and networking.
    • Naming the combined field /address causes it to occur before just about any other property, in the web UI as it exists today.
  2. It's been said that the only principled numbers in computer science are 0, 1, and infinity. Once we've added support for two different mechanisms it becomes a lot harder to argue against adding a third, whereas if there's just one it feels a lot more natural to say "okay, guess we'd better go modify the connector" to make the SSH forwarding UX work nicely.

Also to get more concrete, I believe that materialize-postgres is the only connector on our approved list which still takes host + port properties at present (there are a few more which take some form of "endpoint URI", but those are a more tricky subject to handle here), and given that's the case then I think we should absolutely make a breaking change to bring that connector into symmetry with source-{mysql,postgres} one way or another.

In high-level terms, this commit allows an endpoint's `address`
property to always describe the target database regardless of
whether SSH forwarding is in use.

In order to make that happen, this commit introduces a new
method `extend_endpoint_schema` of the `NetworkTunnel` trait,
which is called prior to network tunnel startup. This gives a
specific network tunnel implementation (such as SSH Forwarding)
an opportunity to inspect and modify endpoint configuration.

In the case of SSH forwarding, this method extracts values for
`forward_host` and `forward_port` from the endpoint's `address`
property, randomly selects a `local_port` between 10000-20000,
and modifies the endpoint's `address` to `127.0.0.1:<localPort>`.
This commit modifies the config schema generation logic so that
when the underlying endpoint config has an 'address' property
the SSH forwarding config *schema* will no longer mention the
trio of unnecessary `forwardHost/forwardPort/localPort` values.

This conditional behavior turned out to be surprisingly tricky,
and the best way I could find to implement this was to leave
the relevant fields in place in the SSH Forwarding config, make
them optional, and then write a small `Visitor` which could be
used to postprocess the network tunnel spec and remove those
three properties. By only running the "delete that stuff" visitor
when the `address` property exists the generated config spec is
able to vary as intended.

The upshot of this is that almost nothing should change for the
majority of connectors at the moment, and we can fix them up one
at a time to take advantage of this UX improvement while remaining
backwards-compatible with existing specs the whole time (since the
actual struct fields still exist, we're just not mentioning them
and have a reasonable default behavior when they're all unset).
As I understand it the `maximum` keyword in a JSON schema means
an inclusive maximum (there's a separate `exclusiveMaximum`), so
the maximum value of the port number fields should be 65535.

This is an incredibly dumb nitpick given that I'm actively working
on getting rid of these fields, but it was bothering me.
The new test runs NetworkTunnel::extend_endpoint_schema on a
couple of hypothetical endpoint spec schemas and verifies the
results against JSON snapshot files.

There are two schema variants in the test at present: one with
an 'address' property and one without, so we can observe the
difference in what SSH Forwarding properties are described.
@psFried
Copy link
Member

psFried commented Jun 28, 2022

Thanks for the explanation @willdonnelly , I'm convinced 👍

@@ -80,6 +87,13 @@ pub struct SshForwarding {
process: Option<Child>,
}

fn split_host_port(hostport: String) -> Option<(String, u16)> {
let mut splits = hostport.as_str().splitn(2, ':');
Copy link
Member

@mdibaiee mdibaiee Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be rsplitn, since we might have username and password in the host? e.g.:

mahdi:[email protected]:8080

Copy link
Member Author

@willdonnelly willdonnelly Jun 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently none of the connectors for which this logic will trigger support username/password in the address field. I'd be open to adding that, but I think it would require some additional code to chop off anything up to an @ character and then reattach that to the rewritten localhost address at the end.

Copy link
Member

@mdibaiee mdibaiee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you Will! I just have a small note regarding how we split the address into host and port

@willdonnelly willdonnelly merged commit 5934611 into master Jun 29, 2022
@willdonnelly willdonnelly deleted the wgd/ssh-forwarding-config branch June 29, 2022 17:16
@oliviamiannone oliviamiannone added the docs pending Improvements or additions to documentation noted or in progress label Jul 6, 2022
@oliviamiannone oliviamiannone added docs complete / NA No (more) doc work related to this PR and removed docs pending Improvements or additions to documentation noted or in progress labels Jul 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs complete / NA No (more) doc work related to this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants