Modernize legacy insert by period materialization #410
Conversation
select
    coalesce(max({{timestamp_field}}), '{{start_date}}')::timestamp as start_timestamp,
    coalesce(
        {{dbt_utils.dateadd('millisecond',
nanosecond is more accurate, i.e. it covers timestamp(9); millisecond only covers timestamp(3).
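A sketch of what that change might look like, assuming the truncated line follows the shape of the original boundary query (the `-1` interval, the `stop_date` variable, and the `current_timestamp` fallback are assumptions, not taken from the diff):

select
    coalesce(max({{timestamp_field}}), '{{start_date}}')::timestamp as start_timestamp,
    coalesce(
        -- assumed continuation of the truncated line: step back one unit from
        -- stop_date at nanosecond precision instead of millisecond precision
        {{dbt_utils.dateadd('nanosecond', -1, "'" ~ stop_date ~ "'::timestamp")}},
        current_timestamp
    ) as stop_timestamp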
select
    start_timestamp,
    stop_timestamp,
Downstream, load_result truncates these timestamp(9) values to timestamp(6), which breaks later timestamp comparisons and leads to duplicates being inserted if the incremental run is repeated after a completed load.
The fix is to cast both start_timestamp and stop_timestamp here to varchar, so that load_result receives a string and not a timestamp.
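A sketch of the suggested fix, mirroring the fragment above (illustrative only; any remaining columns in the select would stay unchanged):

select
    start_timestamp::varchar as start_timestamp,  -- load_result now receives a string,
    stop_timestamp::varchar as stop_timestamp,    -- so nothing is truncated to timestamp(6)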
Whew! Thanks for all your work on this @HorvathDanielMarton. I just spent some time talking about this very macro in the dbt Community Slack as well. In short, we (the dbt Labs Developer Experience team) are very interested in the future of this materialization because it does some important work. What we're not certain about yet is exactly how it fits into the wider dbt ecosystem:
I promise this PR won't languish forever, but please bear with us while we work out where we're headed 🧭
Worth noting: anyone else who comes along in the meantime is of course welcome to make a copy of the code in this PR and apply it to their own project!
@joellabes Thank you! To make it easier for the curious ones, just update your packages:

- git: "https://github.com/sspinc/dbt-utils.git"
  revision: fix-legacy-insert-by-period-macro

and execute
{%- set period_filter -%}
    ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' + interval '1 {{period}}' and
This line will not work properly if period = "month". For example:

select '2018-12-31 00:00:00'::timestamp + interval '4 month' + interval '1 month',
       '2018-12-31 00:00:00'::timestamp + interval '5 month'

returns 2019-05-30 00:00:00 and 2019-05-31 00:00:00, so one day will not be processed.
Redshift can deal with parentheses, like '{{start_timestamp}}'::timestamp + (interval '{{offset}} {{period}}' + interval '1 {{period}}'), but Snowflake will not like it.
Suggestion (that I cannot test right away): calculate offset_plus_one as offset + 1 and use it in setting the filter, like:

{%- set period_filter -%}
    ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
     "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
     "{{timestamp_field}}" < '{{stop_timestamp}}'::timestamp)
{%- endset -%}
I tested this block and it works ok. Feel free to use it.
{% macro get_period_sql(target_cols_csv, sql, timestamp_field, period, start_timestamp, stop_timestamp, offset) -%}

    {%- set offset_plus_one = offset + 1 -%}

    {%- set period_filter -%}
        ("{{timestamp_field}}" > '{{start_timestamp}}'::timestamp + interval '{{offset}} {{period}}' and
         "{{timestamp_field}}" <= '{{start_timestamp}}'::timestamp + interval '{{offset_plus_one}} {{period}}' and
         "{{timestamp_field}}" < '{{stop_timestamp}}'::timestamp)
    {%- endset -%}

    {%- set filtered_sql = sql | replace("__PERIOD_FILTER__", period_filter) -%}

    select
        {{target_cols_csv}}
    from (
        {{filtered_sql}}
    )

{%- endmacro %}
Folks on this thread will probably be interested in this Discussion: #487. I'd appreciate your input!
Looks like this can be moved to the experimental features repo :-)
Unsurprisingly, I can't transfer a PR from one repo to another, but if you want to reopen the PR over there I'll be happy to merge it 👍
Description & motivation
The goal of this change is to get the `insert_by_period` materialization compatible with Snowflake while using the currently latest dbt (0.21.0b2) and keeping the same functionality.

The `insert_by_period` materialization was introduced in this discourse post, and the intention behind it is to help with “problematic” initial builds of data models. Here, problematic means that some tables are just so large and/or complex that a simple `--full-refresh` is not adequate because the build will be incredibly slow, inefficient, and can even get stuck.

Unfortunately, the currently available version does not support Snowflake, and it’s advised to refactor it by using the refreshed code for the incremental materialization as a good starting point. As a result, this change resolves dbt-labs/dbt-labs-experimental-features#32.
The materialization itself was built entirely from the current incremental materialization, and the modifications made to it were inspired by the original `insert_by_period`.

The macros that were originally in `insert_by_period.sql` were moved to a `helpers.sql` file, where the Snowflake versions of `get_period_boundaries()` and `get_period_sql()` are introduced. Also, a new `check_for_period_filter()` macro has been added.

The materialization seems fully functional and so far works as expected; nevertheless, I'm still testing it against production models.
Ideas

- Currently, only `day`, `week`, `month`, etc. can be used as a unit. However, we might want to build our models fortnightly (i.e. in 2-week chunks). We could introduce an optional configuration that lets us use a configurable number of `period`s as a chunk; a fortnightly example could look like the sketch shown after this list.
- It would be useful to be able to offset the `__PERIOD_FILTER__` by some time (for example a month of lookback time, but it should be configurable). Currently, this materialization doesn't support this, but that is surely a shortcoming for some of our models where `insert_by_period` would come in handy because of their massive size & complexity. Also, it seems like our team is not the only one who bumped into this: Feature: Add optional `lookback_interval` param to `insert_by_period` materialization #394
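A hypothetical sketch of the fortnightly idea; `periods_per_chunk` is an invented option name used only for illustration, not an existing parameter, while the other options follow the materialization's existing config:

{{
    config(
        materialized = 'insert_by_period',
        period = 'week',
        periods_per_chunk = 2,  -- hypothetical: build in 2-week (fortnight) chunks
        timestamp_field = 'created_at',
        start_date = '2021-01-01'
    )
}}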
A weird behavior

The materialization can produce a weird behavior if it’s executed on an already existing model, because the “last incremental run” may have a very short overlap with the specified date range. Let’s take the following config as an example:
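A minimal sketch of the kind of config meant here, using the dates from the description; the model body and `ref('events')` are placeholders:

{{
    config(
        materialized = 'insert_by_period',
        period = 'day',
        timestamp_field = 'created_at',
        start_date = '2021-08-28',
        stop_date = '2021-08-31'
    )
}}

select * from {{ ref('events') }}  -- placeholder source model
where __PERIOD_FILTER__            -- replaced by the materialization at runtime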
This is going to build a model with events `2021-08-28 < created_at < 2021-08-31`, so it covers 3 days and is built in 3 steps, given it’s a fresh start and the run is uninterrupted. After building the 1st day, if the run is terminated, the next run will again have 3 steps. It's not unlikely that the last step will only insert a handful of events (many magnitudes smaller than the previous ones), like on this CLI output:

This may raise some eyebrows, but it makes perfect sense if we look at the query’s where clause, which is basically this:
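A hedged reconstruction of that where clause, based on the period_filter shape shown earlier and the timestamps quoted below; the column name comes from the example config and the exact values are illustrative:

where "created_at" >  '2021-08-30 23:59:59.852'::timestamp  -- start_timestamp + offset periods
  and "created_at" <= '2021-08-31 23:59:59.852'::timestamp  -- start_timestamp + (offset + 1) periods
  and "created_at" <  '2021-08-30 23:59:59.999'::timestamp  -- stop_timestamp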
So the last iteration is only looking for events that happened between `2021-08-30 23:59:59.852` and `2021-08-30 23:59:59.999`, which is far from even being a second. It would be nice to deal with this, so it won't cause suspicion that something is buggy. This is most conspicuous for large models with a lot of events.
Checklist