
Support for ODBC driver connection type #116

Merged: kwigley merged 10 commits into master from odbc-driver-support on Nov 6, 2020

Conversation
Conversation

@kwigley kwigley commented Oct 27, 2020

Resolves #104

Description

Add support for ODBC driver connections to Databricks, via either a cluster-specific path or a SQL endpoint path.

  • add pyodbc for connecting over the ODBC driver
  • add driver and endpoint to the profile config (cluster and endpoint are mutually exclusive; they determine how to connect to Databricks)
  • add integration and unit tests
    • because this connection method uses a driver, I created a new image with the driver installed for the integration tests to use. TBD where this Dockerfile lives!
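For illustration, a target using the new method might look like this in profiles.yml. The exact keys and driver path are assumptions based on this PR's discussion, not a verbatim copy of the final schema; check the README for the authoritative version:

```yaml
my_profile:
  target: databricks
  outputs:
    databricks:
      type: spark
      method: odbc
      # path to the installed Simba Spark ODBC driver (illustrative)
      driver: /opt/simba/spark/lib/64/libsparkodbc_sb64.so
      host: my-workspace.cloud.databricks.com
      port: 443
      token: <databricks-token>
      organization: "1234567890"
      cluster: my-cluster-id   # mutually exclusive with the key below
      # endpoint: my-endpoint-id
      schema: analytics
```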

Note

At the time of writing, the new SQL Endpoint does not support `create temporary view`. Also, `extended` is prohibited by the ODBC driver for some operations: dbt-labs/dbt-adapter-tests#8


@jtcohen6 jtcohen6 left a comment


This is working for me locally, which is very exciting. Tiny comment for now around the nomenclature of the new connection endpoint.

Comment on lines 38 to 40
class SparkClusterType(StrEnum):
ALL_PURPOSE = "all-purpose"
VIRTUAL = "virtual"

@jtcohen6 jtcohen6 Oct 28, 2020


Databricks isn't using the name "virtual clusters" anymore; I believe they're just calling them "endpoints."

Rather than a combination of cluster and cluster_type, I think we should make the distinction between cluster (old style, all-purpose/interactive) and endpoint (new style). Users should specify either a cluster or an endpoint when connecting via odbc.
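The cluster-versus-endpoint distinction can be sketched in Python. The function name and path formats below are assumptions for illustration, not this PR's actual code; what matters is that the two keys are mutually exclusive and each implies a different ODBC HTTP path:

```python
def build_http_path(organization: str, cluster: str = None,
                    endpoint: str = None) -> str:
    """Derive an ODBC HTTPPath from a profile that names either a
    cluster (old style, all-purpose/interactive) or an endpoint (new style).

    Hypothetical helper; path formats assumed from Databricks ODBC conventions.
    """
    if (cluster is None) == (endpoint is None):
        # both set, or neither set: the profile is ambiguous
        raise ValueError("specify exactly one of `cluster` or `endpoint`")
    if cluster is not None:
        # all-purpose cluster path
        return f"sql/protocolv1/o/{organization}/{cluster}"
    # SQL endpoint path
    return f"/sql/1.0/endpoints/{endpoint}"
```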

@kwigley kwigley force-pushed the odbc-driver-support branch 3 times, most recently from 1e84db3 to cfd804b Compare October 30, 2020 19:24
f"{self.method} connection method requires "
"additional dependencies. \n"
"Install the additional required dependencies with "
"`pip install dbt-spark[ODBC]`"

@kwigley kwigley Nov 2, 2020


I landed on `pip install dbt-spark[ODBC]` because pyodbc is the only Python dep that I imagine will require OS dependencies. Also, I think this should line up with connection methods (thrift, http, odbc) rather than connection locations (Databricks, etc.)?
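The error message quoted above implies a deferred-import guard around the optional dependency. A minimal sketch of that pattern (the function names here are hypothetical, not the PR's code):

```python
try:
    import pyodbc  # optional: only needed for the odbc connection method
except ImportError:
    pyodbc = None


def missing_dependency_error(method: str) -> str:
    # mirrors the message quoted from the diff above
    return (
        f"{method} connection method requires additional dependencies. \n"
        "Install the additional required dependencies with "
        "`pip install dbt-spark[ODBC]`"
    )


def ensure_odbc_available(method: str = "odbc") -> None:
    # called when opening a connection, before touching pyodbc
    if pyodbc is None:
        raise RuntimeError(missing_dependency_error(method))
```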

Contributor

I buy it!


@jtcohen6 jtcohen6 Nov 5, 2020


Eventually, I think we might want to try moving PyHive[hive] to an extra instead of a principal requirement, since it's only necessary for the http connection method.

Justification: we see some installation errors (e.g. #114) resulting from less-maintained dependencies exclusive to PyHive

Not something we need to do right now!
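The suggested packaging split could look something like this in setup.py. The exact version pins are illustrative assumptions, not what dbt-spark ships:

```python
# hypothetical setup.py excerpt: PyHive[hive] moved from install_requires
# to an extra, alongside the ODBC extra added in this PR
extras_require = {
    "ODBC": ["pyodbc>=4.0.30"],
    "PyHive": ["PyHive[hive]>=0.6.0,<0.7.0"],
}

install_requires = [
    # PyHive[hive] would no longer live here; users opt in per
    # connection method, e.g. `pip install dbt-spark[PyHive]`
]
```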

@kwigley kwigley marked this pull request as ready for review November 2, 2020 17:28
@kwigley kwigley self-assigned this Nov 2, 2020
@kwigley kwigley added the enhancement New feature or request label Nov 2, 2020

@jtcohen6 jtcohen6 left a comment


This looks great! Thanks for the excellent, wide-reaching work to get this running with automated tests.



@jtcohen6 jtcohen6 left a comment


A few tiny notes after rechecking readme

@drewbanin drewbanin removed their request for review November 5, 2020 20:25
@drewbanin

FYI removing myself from review, but this looks chefs-kiss.jpeg

@kwigley kwigley merged commit 1bbe718 into master Nov 6, 2020
@kwigley kwigley deleted the odbc-driver-support branch November 6, 2020 14:32
Merging this pull request may close the issue: Support Databricks connections via latest JDBC or ODBC driver