Document dbt-spark ODBC connection #454

Merged (1 commit) on Nov 11, 2020
43 changes: 39 additions & 4 deletions website/docs/reference/warehouse-profiles/spark-profile.md
@@ -3,10 +3,10 @@ title: "Spark Profile"
---

## Connection Methods
- There are two supported connection methods for Spark targets: `http` and `thrift`.
+ There are three supported connection methods for Spark targets: `thrift`, `http`, and `odbc`.

### thrift
- Use the `thrift` connection method if you are connecting to a Thrift server sitting in front of a Spark cluster.
+ Use the `thrift` connection method if you are connecting to a Thrift server sitting in front of a Spark cluster, e.g. a cluster running locally or on Amazon EMR.

<File name='~/.dbt/profiles.yml'>

@@ -26,7 +26,7 @@ your_profile_name:
</File>
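
For quick reference, a minimal `thrift` profile sketch (field names follow the example above; the hostname, port, and user are placeholders, and `user` is optional):

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      schema: [database/schema name]
      host: [hostname]   # e.g. localhost, or the EMR master node
      port: [port]       # typically 10001
      user: [user]       # optional
```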

### http
- Use the `http` method if your Spark provider supports connections over HTTP (e.g. Databricks).
+ Use the `http` method if your Spark provider supports connections over HTTP (e.g. a Databricks interactive cluster).

<File name='~/.dbt/profiles.yml'>

@@ -47,10 +47,39 @@ your_profile_name:
      connect_retries: 5 # optional, default 0
```

</File>

Databricks interactive clusters can take several minutes to start up. If you include the optional profile configs `connect_timeout` and `connect_retries`, dbt will periodically retry the connection while the cluster starts.
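
Putting it together, a sketch of an `http` profile for a Databricks interactive cluster with retries enabled (the token and cluster id are placeholders, and `organization` applies to Azure Databricks only):

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: http
      schema: analytics
      host: yourorg.sparkhost.com
      organization: [org id]   # Azure Databricks only
      port: 443
      token: [abc123]
      cluster: [cluster id]
      connect_timeout: 10      # optional: seconds to wait between attempts
      connect_retries: 5       # optional, default 0
```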

### ODBC

<Changelog>New in v0.18.1</Changelog>

Use the `odbc` connection method if you are connecting to a Databricks SQL endpoint or interactive cluster via an ODBC driver. (Download the latest version of the official Databricks ODBC driver [here](https://databricks.com/spark/odbc-driver-download).)

<File name='~/.dbt/profiles.yml'>

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: [path/to/driver]
      schema: [database/schema name]
      host: [yourorg.sparkhost.com]
      organization: [org id]    # required if Azure Databricks, exclude if AWS Databricks
      port: [port]
      token: [abc123]

      # one of:
      endpoint: [endpoint id]
      cluster: [cluster id]
```

</File>
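
As a worked example, a filled-in profile for an AWS Databricks SQL endpoint might look like the following sketch. Every value here is hypothetical (the Simba driver path, workspace host, token, and endpoint id), and `organization` is omitted because this targets AWS rather than Azure:

```yaml
your_profile_name:
  target: dev
  outputs:
    dev:
      type: spark
      method: odbc
      driver: /opt/simba/spark/lib/64/libsparkodbc_sb64.so  # hypothetical driver path
      schema: analytics
      host: dbc-a1b2c3d4-e5f6.cloud.databricks.com          # hypothetical workspace host
      port: 443
      token: dapi0123456789abcdef                           # hypothetical access token
      endpoint: 0123456789abcdef                            # hypothetical SQL endpoint id
```

Since `endpoint` and `cluster` are mutually exclusive, set whichever matches the resource you are connecting to and leave the other out.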

## Installation and Distribution
@@ -60,10 +89,16 @@ dbt's Spark adapter is managed in its own repository, [dbt-spark](https://github
### Using pip
The following command will install the latest version of `dbt-spark` as well as the requisite version of `dbt-core`:

- ```
+ ```bash
pip install dbt-spark
```

If you are using the `odbc` connection method, you will also need to install the `ODBC` extra (which includes `pyodbc`):

```bash
pip install "dbt-spark[ODBC]"
```

## Caveats

### Usage with EMR