Update databricks-configs.md for 1.8.0 (#5463)
## What are you changing in this pull request and why?
<!---
Describe your changes and why you're making them. If related to an open 
issue or a pull request on dbt Core, then link to them here! 

To learn more about the writing conventions used in the dbt Labs docs,
see the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md).
-->

Adding documentation for dbt-databricks v1.8.0, specifically MV/ST
config and Tags.

## Checklist
<!--
Uncomment when publishing docs for a prerelease version of dbt:
- [ ] Add versioning components, as described in [Versioning
Docs](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-entire-pages)
- [ ] Add a note to the prerelease version [Migration
Guide](https://github.com/dbt-labs/docs.getdbt.com/tree/current/website/docs/docs/dbt-versions/core-upgrade)
-->
- [x] Review the [Content style
guide](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/content-style-guide.md)
so my content adheres to these guidelines.
- [x] For [docs
versioning](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#about-versioning),
review how to [version a whole
page](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#adding-a-new-version)
and [version a block of
content](https://github.com/dbt-labs/docs.getdbt.com/blob/current/contributing/single-sourcing-content.md#versioning-blocks-of-content).
- [ ] Add a checklist item for anything that needs to happen before this
PR is merged, such as "needs technical review" or "change base branch."
dataders authored May 15, 2024
2 parents 88cb312 + c078881 commit dc0f8c5
Showing 1 changed file with 133 additions and 4 deletions.
137 changes: 133 additions & 4 deletions website/docs/reference/resource-configs/databricks-configs.md
@@ -35,7 +35,7 @@ When materializing a model as `table`, you may include several optional configs

</VersionBlock>

<VersionBlock firstVersion="1.7">
<VersionBlock firstVersion="1.7" lastVersion="1.7">


| Option | Description | Required? | Model Support | Example |
@@ -50,6 +50,29 @@ When materializing a model as `table`, you may include several optional configs

\* Beginning in 1.7.12, we have added tblproperties to Python models via an alter statement that runs after table creation.
We do not yet have a PySpark API to set tblproperties at table creation, so this feature is primarily to allow users to annotate their Python-derived tables with tblproperties.

</VersionBlock>

<VersionBlock firstVersion="1.8">

1.8 introduces support for [Tags](https://docs.databricks.com/en/data-governance/unity-catalog/tags.html) at the table level, in addition to all table configuration supported in 1.7.

| Option | Description | Required? | Model Support | Example |
|---------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------|---------------|--------------------------|
| file_format | The file format to use when creating tables (`parquet`, `delta`, `hudi`, `csv`, `json`, `text`, `jdbc`, `orc`, `hive` or `libsvm`). | Optional | SQL, Python | `delta` |
| location_root | The created table uses the specified directory to store its data. The table alias is appended to it. | Optional | SQL, Python | `/mnt/root` |
| partition_by | Partition the created table by the specified columns. A directory is created for each partition. | Optional | SQL, Python | `date_day` |
| liquid_clustered_by | Cluster the created table by the specified columns. Clustering method is based on [Delta's Liquid Clustering feature](https://docs.databricks.com/en/delta/clustering.html). Available since dbt-databricks 1.6.2. | Optional | SQL | `date_day` |
| clustered_by | Each partition in the created table will be split into a fixed number of buckets by the specified columns. | Optional | SQL, Python | `country_code` |
| buckets | The number of buckets to create while clustering | Required if `clustered_by` is specified | SQL, Python | `8` |
| tblproperties | [Tblproperties](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-tblproperties.html) to be set on the created table | Optional | SQL, Python* | `{'this.is.my.key': 12}` |
| databricks_tags     | [Tags](https://docs.databricks.com/en/data-governance/unity-catalog/tags.html) to be set on the created table                                                                                                         | Optional                                   | SQL+, Python+ | `{'my_tag': 'my_value'}` |

\* Beginning in 1.7.12, we have added tblproperties to Python models via an alter statement that runs after table creation.
We do not yet have a PySpark API to set tblproperties at table creation, so this feature is primarily to allow users to annotate their Python-derived tables with tblproperties.

\+ `databricks_tags` are currently only supported at the table level, and applied via `ALTER` statements.
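
A config block that combines several of these options might look like the following sketch. The file name, column names, and tag values here are illustrative only and not part of the documented examples:

<File name='models/hypothetical_table_model.sql'>

```sql
{{ config(
    materialized='table',
    file_format='delta',
    location_root='/mnt/root',
    partition_by='date_day',
    tblproperties={
        'this.is.my.key': 12
    },
    databricks_tags={
        'team': 'analytics',
        'cost_center': 'sales'
    }
) }}

select * from {{ ref('my_seed') }}
```

</File>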

</VersionBlock>

## Incremental models
@@ -598,10 +621,9 @@ snapshots:
</File>
<VersionBlock firstVersion="1.6">
<VersionBlock firstVersion="1.6" lastVersion="1.7">
## Materialized views and streaming tables
Starting with version 1.6.0, the dbt-databricks adapter supports [materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html), as alternatives to incremental tables that are powered by [Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/index.html).
See [What are Delta Live Tables?](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) for more information and use cases.
These features are still in preview, and the support in the dbt-databricks adapter should, for now, be considered _experimental_.
@@ -641,7 +663,107 @@ At this time the following configuration options are not available:

Additionally, if you change the model definition of your materialized view or streaming table, you will need to drop the materialization in your warehouse directly before running dbt again; otherwise, you will get a refresh error.

We plan to address these limitations during the 1.7.x timeframe.
Please see the latest documentation for updates on these limitations.

</VersionBlock>

<VersionBlock firstVersion="1.8">

## Materialized views and streaming tables

[Materialized views](https://docs.databricks.com/en/sql/user/materialized-views.html) and [streaming tables](https://docs.databricks.com/en/sql/load-data-streaming-table.html) are alternatives to incremental tables that are powered by [Delta Live Tables](https://docs.databricks.com/en/delta-live-tables/index.html).
See [What are Delta Live Tables?](https://docs.databricks.com/en/delta-live-tables/index.html#what-are-delta-live-tables-datasets) for more information and use cases.

To adopt these materialization strategies, you need a workspace that is enabled for Unity Catalog and serverless SQL warehouses.

<File name='materialized_view.sql'>

```sql
{{ config(
    materialized = 'materialized_view'
) }}
```

</File>

or

<File name='streaming_table.sql'>

```sql
{{ config(
    materialized = 'streaming_table'
) }}
```

</File>

We support [on_configuration_change](https://docs.getdbt.com/reference/resource-configs/on_configuration_change) for most available properties of these materializations.
The following table summarizes our configuration support:

| Databricks Concept | Config Name | MV/ST support |
| ------------------ | ------------| ------------- |
| [PARTITIONED BY](https://docs.databricks.com/en/sql/language-manual/sql-ref-partition.html#partitioned-by) | `partition_by` | MV/ST |
| COMMENT | `description` | MV/ST |
| [TBLPROPERTIES](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-tblproperties.html#tblproperties) | `tblproperties` | MV/ST |
| [SCHEDULE CRON](https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-materialized-view.html#parameters) | `schedule: { 'cron': '\<cron schedule\>', 'time_zone_value': '\<time zone value\>' }` | MV/ST |
| query | defined by your model SQL | `on_configuration_change` for MV only |

<File name='mv_example.sql'>

```sql

{{ config(
    materialized='materialized_view',
    description='this is a materialized view',
    partition_by='id',
    schedule = {
        'cron': '0 0 * * * ? *',
        'time_zone_value': 'Etc/UTC'
    },
    tblproperties={
        'key': 'value'
    },
) }}
select * from {{ ref('my_seed') }}

```

</File>

### Configuration Details

#### partition_by
`partition_by` works the same as for views and tables; it can be a single column or an array of columns to partition by.
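
A sketch of the array form (the column names are placeholders):

```sql
{{ config(
    materialized='materialized_view',
    partition_by=['country_code', 'date_day']
) }}
```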

#### description
As with views and tables, adding a `description` to your configuration adds a table-level comment to your materialization.

#### tblproperties
`tblproperties` works the same as for views and tables, with one important exception: the adapter maintains a list of keys that Databricks sets when creating a materialized view or streaming table, and these keys are ignored when determining configuration changes.

#### schedule
Use this to set the refresh schedule for the model. If you use the `schedule` key, a `cron` key is required in the associated dictionary, but `time_zone_value` is optional (see the example above). The `cron` value should be formatted as documented by Databricks.
If a schedule is set on the materialization in Databricks and your dbt project does not specify a schedule for it (when `on_configuration_change` is set to `apply`), the refresh schedule will be set to manual when you next run the project.
Even when schedules are set, dbt will request that the materialization be refreshed manually when run.
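
As a sketch, a streaming table refreshed nightly on a Quartz-style cron schedule might be configured as follows. The file name, source, and time zone are hypothetical, and streaming tables generally need to read from a streaming source:

<File name='nightly_streaming_table.sql'>

```sql
{{ config(
    materialized='streaming_table',
    schedule = {
        'cron': '0 0 0 * * ? *',          -- nightly at midnight (Quartz format)
        'time_zone_value': 'Etc/UTC'      -- optional; omit to use the default
    }
) }}

select * from stream({{ ref('raw_events') }})
```

</File>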

#### query
For materialized views, if the compiled query for the model differs from the query in the database, we will take the configured `on_configuration_change` action.
Changes to the query are not currently detectable for streaming tables; see the next section for details.

### on_configuration_change
`on_configuration_change` is supported for materialized views and streaming tables, though the two materializations handle it in different ways.
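
The behavior is set per model; valid values are `apply`, `continue`, and `fail`. A minimal sketch (the model and source names are placeholders):

```sql
{{ config(
    materialized='materialized_view',
    on_configuration_change='apply'  -- or 'continue' / 'fail'
) }}

select * from {{ ref('my_seed') }}
```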

#### Materialized Views
Currently, the only change that can be applied without recreating the materialized view in Databricks is to update the schedule.
This is due to limitations in the Databricks SQL API.

#### Streaming Tables
For streaming tables, currently only changes to partitioning require dropping and recreating the table.
For any other supported configuration change, we use `CREATE OR REFRESH` (plus an `ALTER` statement for changes to the schedule) to apply the changes.
There is currently no mechanism for the adapter to detect whether the streaming table query has changed, so in this case, regardless of the behavior requested by `on_configuration_change`, we will use a `CREATE OR REFRESH` statement (assuming `partition_by` hasn't changed); this will cause the updated query to be applied to future rows without rerunning on any previously processed rows.
If your source data is still available, running with `--full-refresh` will reprocess the available data with the updated query.

</VersionBlock>

## Setting table properties
@@ -667,3 +789,10 @@ These properties are sent directly to Databricks without validation in dbt, so b
:::

One application of this feature is making `delta` tables compatible with `iceberg` readers using the [Universal Format](https://docs.databricks.com/en/delta/uniform.html).
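
A sketch of what that might look like, assuming the Delta UniForm properties current at the time of writing (check the linked Databricks docs for the exact keys required by your runtime):

```sql
{{ config(
    materialized='table',
    tblproperties={
        'delta.enableIcebergCompatV2': 'true',
        'delta.universalFormat.enabledFormats': 'iceberg'
    }
) }}

select * from {{ ref('my_seed') }}
```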

<VersionBlock firstVersion="1.7">

`tblproperties` can be specified for Python models, but they will be applied via an `ALTER` statement after table creation.
This is due to a limitation in PySpark.

</VersionBlock>
