
Add source freshness checks to upstream assets #138

Open
wants to merge 5 commits into master
Conversation

cnolanminich (Contributor) commented Jan 2, 2025

This PR was a learning experience for me. Sometimes we hear from prospects that they really like using dbt source freshness and would like to continue doing so with Dagster. This PR demonstrates how you would add a custom dbt command to Dagster, namely:

  • a multi asset check that automatically adds a dbt source freshness check for every source with a freshness config in the sources.yml manifest
  • a blocking asset check, so that if a source isn't fresh it will not allow the downstream dbt step in refresh_analytics_job to run -- example below
  • a single dbt source freshness invocation for the selected set of dbt sources, which then reports individual asset check statuses

[image: individual asset check statuses shown in the Dagster UI]

  • a demonstration of how you can add an automation condition to an asset check -- this one uses the allow_outdated_and_missing_parents_condition, which I confess to not fully understanding, but it works. To demo it, run the refresh_analytics_model_job from the launchpad with the asset checks de-selected, and the check will then run on its own.

Totally fine if we choose not to add this in, but it was fun to work through! One thing I still want to investigate is adding the proper metadata to the checks.
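
For readers following along, the core of the approach looks roughly like the sketch below. This is a minimal illustration, not the PR's actual code: the dbt resource wiring and the FRESHNESS_SOURCES list (which the PR derives from sources.yml) are assumptions, and all names are hypothetical.

```python
from dagster import (
    AssetCheckResult,
    AssetCheckSeverity,
    AssetCheckSpec,
    AssetKey,
    multi_asset_check,
)
from dagster_dbt import DbtCliResource

# Hypothetical stand-in: in the PR, these (schema, source) pairs come from
# sources.yml entries that define a freshness config.
FRESHNESS_SOURCES = [("CLEANED", "orders_cleaned"), ("CLEANED", "users_cleaned")]

@multi_asset_check(
    specs=[
        AssetCheckSpec(
            name="dbt_source_freshness",
            asset=AssetKey([schema, name]),
            blocking=True,  # a failing check blocks the downstream dbt step
            # the PR also attaches an automation condition here, via its
            # allow_outdated_and_missing_parents_condition helper
        )
        for schema, name in FRESHNESS_SOURCES
    ],
    required_resource_keys={"dbt"},
)
def source_freshness_checks(context):
    dbt: DbtCliResource = context.resources.dbt
    # One `dbt source freshness` invocation covers all selected sources;
    # the results are then fanned out into individual AssetCheckResults.
    invocation = dbt.cli(["source", "freshness"], raise_on_error=False)
    for event in invocation.stream_raw_events():
        raw = event.raw_event
        node_info = raw.get("data", {}).get("node_info")
        if not node_info:
            continue  # skip log lines that aren't per-source results
        level = raw["info"]["level"]
        yield AssetCheckResult(
            asset_key=AssetKey([
                node_info["node_relation"]["schema"].upper(),
                node_info["node_name"],
            ]),
            passed=level == "info",
            severity=(
                AssetCheckSeverity.ERROR
                if level == "error"
                else AssetCheckSeverity.WARN
            ),
        )
```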

github-actions bot commented Jan 2, 2025

Your pull request at commit 0b97ffceef1804d3586c786cde49fc01f7b26696 is automatically being deployed to Dagster Cloud.

Location           | Status      | Link          | Updated
-------------------|-------------|---------------|-------------------------------
data-eng-pipeline  |             | View in Cloud | Jan 02, 2025 at 08:41 PM (UTC)
basics             | Building... |               | Jan 02, 2025 at 08:37 PM (UTC)
hooli_bi           | Building... |               | Jan 02, 2025 at 08:37 PM (UTC)
batch_enrichment   | Building... |               | Jan 02, 2025 at 08:37 PM (UTC)
hooli_data_ingest  | Building... |               | Jan 02, 2025 at 08:37 PM (UTC)
snowflake_insights | Building... |               | Jan 02, 2025 at 08:37 PM (UTC)

slopp (Collaborator) commented Jan 2, 2025

This is awesome! What is the intuition behind using declarative automation for these checks?

I would assume we'd kind of want to just run the source freshness check on a regular basis? (I guess maybe this example in hooli is a bit contrived, I imagine it'd be more interesting to have freshness on external source assets we're observing instead of upstreams we're running?)
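
For reference, the "run it on a regular basis" alternative could look roughly like this; source_freshness_checks is the hypothetical check from the sketch in the description above, and the job and schedule names are made up:

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# Hypothetical: execute only the freshness checks on a cron cadence,
# instead of tying them to declarative automation.
source_freshness_job = define_asset_job(
    name="source_freshness_job",
    selection=AssetSelection.checks(source_freshness_checks),
)

source_freshness_schedule = ScheduleDefinition(
    job=source_freshness_job,
    cron_schedule="0 * * * *",  # hourly, for example
)
```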

```python
log_freshness_result = dbt_event.raw_event['info']
context.log.info(f"Filtered LogFreshnessResult: {log_freshness_result}")
passed = log_freshness_result['level'] == 'info'
severity = AssetCheckSeverity.ERROR if log_freshness_result['level'] == 'error' else AssetCheckSeverity.WARN
```
Collaborator:

This is very cool and something I think we should make more of a default practice across the integration now that blocking asset checks are more generally available with DA

Contributor Author:

💯! I don't think they work with dbt tests on dbt assets, though? But I'll play around with this more, as I'd like to understand it better.

Collaborator:

IIUC dbt tests on dbt assets will pretty much always block other stuff in the same run, since they fail the dbt op... but idk about how they work with DA

```python
passed = log_freshness_result['level'] == 'info'
severity = AssetCheckSeverity.ERROR if log_freshness_result['level'] == 'error' else AssetCheckSeverity.WARN
yield AssetCheckResult(
    asset_key=AssetKey([
        dbt_event.raw_event['data']['node_info']['node_relation']['schema'].upper(),
        dbt_event.raw_event['data']['node_info']['node_name'],
    ]),
    passed=passed,
    severity=severity,
)
```
Collaborator:

One nit: there are a lot of asset-key-to-dbt-name transformations going on in this PR, and a nice refactor might be to pull those out into separate functions.

Contributor Author:

Will do. Another thing I didn't love about this was putting asset checks inside dbt_asset.py. Since I'm removing the automation condition, it would be straightforward to move this into a separate file (or at least the supporting functions).

Contributor Author:

On second look, none of these are repeated -- do you think it makes sense to have one-off functions for each of these use cases?

Collaborator:

Ultimately up to you. I think functions would help future-us remember why we're doing all this .upper/.lower stuff, lol, but I can also buy the case that you could figure it out reasonably well from the context.
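
A sketch of what such a helper could look like; the name is hypothetical, and the docstring's rationale for the .upper() call is an assumption based on the snippets above:

```python
from dagster import AssetKey

def asset_key_for_freshness_event(node_info: dict) -> AssetKey:
    """Map the node_info from a dbt source freshness event to the
    corresponding Dagster asset key.

    Centralizes the schema-uppercasing convention (presumably because the
    warehouse stores unquoted schema names uppercased, while dbt node
    names stay lowercase).
    """
    return AssetKey([
        node_info["node_relation"]["schema"].upper(),
        node_info["node_name"],
    ])
```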

cnolanminich (Contributor Author) commented:

> This is awesome! What is the intuition behind using declarative automation for these checks?
> I would assume we'd kind of want to just run the source freshness check on a regular basis? (I guess maybe this example in hooli is a bit contrived, I imagine it'd be more interesting to have freshness on external source assets we're observing instead of upstreams we're running?)

It's mostly a historical artifact. When I started, I imagined assessing freshness per dbt model (like: are all of `CLEANED/orders_cleaned`'s sources up to date?), implemented within the dbt_assets function body. But that wasn't going to work, so I abandoned it for this multi asset check approach.

Currently these will run when the Dagster assets are materialized. I agree it would be cool to set up an external source (maybe the S3 asset that we currently have a sensor for?) to show how you could use something like this plus an external asset to monitor state on that asset. I'll remove the automation condition from this PR, though.

slopp (Collaborator) commented Jan 2, 2025

> Currently these will run when the Dagster assets are materialized [...]

And just to sanity-check my own understanding: for these particular freshness checks, since we're always running orders and users and then the dbt downstreams, they should always pass, right?
