Retry watcher downstream models on upstream-failure recovery #2684
Merged
6 commits
bebe3f4 Forward extra args to test-integration hatch script (tatiana)
79c973f Retry watcher downstream models on upstream-failure recovery (tatiana)
025ed17 Address #2684 review: link public issue and add unit tests (tatiana)
4017c3e Cover deeper downstream chain in retry-recovery integration test (tatiana)
b7695d5 Merge branch 'main' into issue-customer (tatiana)
2fdffcf Apply BOSS-401 skip-rewrite fix to WATCHER_KUBERNETES mode (tatiana)
30 changes: 30 additions & 0 deletions
dev/dags/dbt/watcher_upstream_failure_recovery/dbt_project.yml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| name: 'watcher_upstream_failure_recovery' | ||
| | ||
| config-version: 2 | ||
| version: '0.1' | ||
| | ||
| profile: 'default' | ||
| | ||
| model-paths: ["models"] | ||
| seed-paths: ["seeds"] | ||
| test-paths: ["tests"] | ||
| macro-paths: ["macros"] | ||
| | ||
| target-path: "target" | ||
| clean-targets: | ||
| - "target" | ||
| - "dbt_modules" | ||
| - "logs" | ||
| | ||
| require-dbt-version: [">=1.0.0", "<2.0.0"] | ||
| | ||
| # Sequence used by model_flaky.sql to fail on first run and succeed on subsequent | ||
| # runs (see #2698 regression test). Using a project-specific sequence name | ||
| # avoids state leaking from / into watcher_downstream_not_skipped, which uses | ||
| # the same fail-once recipe but tests a different invariant. | ||
| on-run-start: | ||
| - "CREATE SEQUENCE IF NOT EXISTS {{ target.schema }}._cosmos_recovery_fail_once_seq" | ||
| | ||
| models: | ||
| watcher_upstream_failure_recovery: | ||
| +materialized: table |
5 changes: 5 additions & 0 deletions
dev/dags/dbt/watcher_upstream_failure_recovery/models/model_a.sql
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| select 1 as id, 'Alice' as first_name, 'Smith' as last_name, 'alice@example.com' as email | ||
| union all | ||
| select 2, 'Bob', 'Jones', 'bob@example.com' | ||
| union all | ||
| select 3, 'Charlie', 'Brown', 'charlie@example.com' |
4 changes: 4 additions & 0 deletions
dev/dags/dbt/watcher_upstream_failure_recovery/models/model_downstream.sql
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| select | ||
| id, | ||
| first_name | ||
| from {{ ref('model_flaky') }} |
4 changes: 4 additions & 0 deletions
dev/dags/dbt/watcher_upstream_failure_recovery/models/model_downstream_2.sql
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,4 @@ | ||
| select | ||
| id, | ||
| first_name | ||
| from {{ ref('model_downstream') }} |
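
The `ref()` calls in this project's models form a single chain: `model_a`, then `model_flaky`, then `model_downstream`, then `model_downstream_2`. As a minimal sketch (not dbt's own graph code; the `PARENTS` mapping and `downstream_of` helper are hypothetical names introduced here), the set of nodes dbt would skip when `model_flaky` fails can be computed by walking that chain:

```python
from collections import deque

# Edges taken from the ref() calls in the four model files of this project.
PARENTS = {
    "model_a": [],
    "model_flaky": ["model_a"],
    "model_downstream": ["model_flaky"],
    "model_downstream_2": ["model_downstream"],
}


def downstream_of(failed: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every node that transitively depends on ``failed``, in BFS order."""
    # Invert the parent mapping into a child mapping.
    children: dict[str, list[str]] = {name: [] for name in parents}
    for child, ups in parents.items():
        for up in ups:
            children[up].append(child)

    seen: list[str] = []
    queue = deque(children[failed])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(children[node])
    return seen


# Both downstream models are affected when model_flaky fails.
skipped = downstream_of("model_flaky", PARENTS)
```

Here `skipped` contains both `model_downstream` and `model_downstream_2`, which is why the integration test covers the deeper chain and not only the direct child.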
12 changes: 12 additions & 0 deletions
dev/dags/dbt/watcher_upstream_failure_recovery/models/model_flaky.sql
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,12 @@ | ||
| {{ | ||
| config( | ||
| pre_hook=[ | ||
| "DO $$ BEGIN IF nextval('{{ target.schema }}._cosmos_recovery_fail_once_seq') <= 1 THEN RAISE EXCEPTION 'fail_once: intentional first-run failure'; END IF; END $$" | ||
| ] | ||
| ) | ||
| }} | ||
| | ||
| select | ||
| id, | ||
| first_name | ||
| from {{ ref('model_a') }} |
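
The fail-once recipe relies on the pre-hook calling `nextval` on every attempt and raising only while the returned value is `<= 1`. Postgres does not roll back sequence advances when a transaction aborts, so the failed first attempt still consumes value 1. The following is an in-memory sketch of that behaviour only; `FailOnceSequence` and `run_pre_hook` are hypothetical stand-ins and do not talk to a database:

```python
import itertools


class FailOnceSequence:
    """In-memory stand-in for the Postgres sequence (hypothetical helper)."""

    def __init__(self) -> None:
        # A freshly created Postgres sequence starts at 1 by default.
        self._counter = itertools.count(1)

    def nextval(self) -> int:
        # Like Postgres, the advance is kept even if the caller later fails.
        return next(self._counter)


def run_pre_hook(seq: FailOnceSequence) -> None:
    """Mirror the DO block: raise while nextval() <= 1, pass afterwards."""
    if seq.nextval() <= 1:
        raise RuntimeError("fail_once: intentional first-run failure")


seq = FailOnceSequence()
outcomes = []
for attempt in range(1, 4):
    try:
        run_pre_hook(seq)
        outcomes.append((attempt, "success"))
    except RuntimeError:
        outcomes.append((attempt, "failed"))
```

Only the first attempt fails. Dropping the sequence, as the DAG's cleanup task does, resets the recipe for the next DAG run.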
113 changes: 113 additions & 0 deletions
dev/failed_dags/example_watcher_recovers_skipped_downstream.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,113 @@ | ||
| """ | ||
| Demonstrate watcher-mode recovery of a downstream model that dbt skipped | ||
| because its upstream failed on the producer's first attempt (#2698). | ||
| | ||
| Without the fix in ``cosmos/operators/_watcher/base.py``: | ||
| - A dbt model fails on the first producer attempt. | ||
| - dbt marks every downstream node ``skipped`` with the upstream-failure cause. | ||
| - The producer log parser pushes that ``"skipped"`` status to XCom. | ||
| - The downstream consumer sensor raises ``AirflowSkipException`` -- SKIPPED. | ||
| - Airflow retries the producer task (the producer's retry is a no-op by | ||
| design: Cosmos restores XCom and raises ``AirflowSkipException`` to avoid | ||
| re-running the whole dbt build). | ||
| - The consumer sensor for the failing upstream retries on its own and falls | ||
| back to running ``dbt --select <model>`` locally, which succeeds. | ||
| - The downstream consumer, however, was already SKIPPED. Airflow does not | ||
| retry skipped tasks, so the downstream model is never re-run even though | ||
| its upstream has now recovered. | ||
| - The DAG ends in ``success`` because Airflow treats SKIPPED as non-failure | ||
| -- a "false green" outcome with un-materialized downstream tables. | ||
| | ||
| With the fix, the producer parser rewrites the ``"skipped"`` status to | ||
| ``"failed"`` for any node that dbt skipped via ``SkippingDetails`` / | ||
| ``LogSkipBecauseError`` (the only paths reached when ``do_skip(cause=...)`` | ||
| fires -- i.e. exclusively on upstream-node failure). The downstream consumer | ||
| then fails on attempt 1, Airflow retries it, and the same consumer-fallback | ||
| path that recovers the failing upstream now runs the downstream locally. | ||
| | ||
| Models used (from ``dev/dags/dbt/watcher_upstream_failure_recovery``): | ||
| - ``model_a``: trivial source-style model, succeeds. | ||
| - ``model_flaky``: uses an ``on-run-start`` Postgres sequence to fail on | ||
| the first ``nextval`` call (``<= 1`` → ``RAISE EXCEPTION``) and succeed | ||
| on subsequent calls. | ||
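| - ``model_downstream``: depends on ``model_flaky``; dbt skips it on | ||
| attempt 1 because its upstream failed. | ||
| - ``model_downstream_2``: depends on ``model_downstream``; dbt skips it on | ||
| attempt 1 as well, covering a deeper downstream chain. | ||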
| | ||
| A ``post_dbt`` ``EmptyOperator`` downstream of the task group makes the | ||
| "green DAG" visible. A ``cleanup`` SQL task drops the sequence at the end | ||
| so the DAG is re-runnable from a clean state. | ||
| """ | ||
| | ||
| import os | ||
| from datetime import datetime, timedelta | ||
| from pathlib import Path | ||
| | ||
| from airflow.models import DAG | ||
| from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator | ||
| | ||
| try: | ||
| from airflow.providers.standard.operators.empty import EmptyOperator | ||
| except ImportError: | ||
| from airflow.operators.empty import EmptyOperator | ||
| | ||
| from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig | ||
| from cosmos.constants import ExecutionMode | ||
| from cosmos.profiles import PostgresUserPasswordProfileMapping | ||
| | ||
| DEFAULT_DBT_ROOT_PATH = Path(__file__).parent.parent / "dags/dbt" | ||
| DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH)) | ||
| DBT_PROJECT_PATH = DBT_ROOT_PATH / "watcher_upstream_failure_recovery" | ||
| | ||
| profile_config = ProfileConfig( | ||
| profile_name="default", | ||
| target_name="dev", | ||
| profile_mapping=PostgresUserPasswordProfileMapping( | ||
| conn_id="example_conn", | ||
| profile_args={"schema": "public"}, | ||
| disable_event_tracking=True, | ||
| ), | ||
| ) | ||
| | ||
| execution_config = ExecutionConfig( | ||
| execution_mode=ExecutionMode.WATCHER, | ||
| ) | ||
| | ||
| operator_args = { | ||
| "install_deps": True, | ||
| "execution_timeout": timedelta(seconds=120), | ||
| } | ||
| | ||
| if os.getenv("CI"): | ||
| operator_args["trigger_rule"] = "all_success" | ||
| | ||
| default_args = { | ||
| "retries": 2, | ||
| "retry_delay": timedelta(seconds=0), | ||
| } | ||
| | ||
| with DAG( | ||
| dag_id="example_watcher_recovers_skipped_downstream", | ||
| schedule="@daily", | ||
| start_date=datetime(2023, 1, 1), | ||
| catchup=False, | ||
| default_args=default_args, | ||
| ): | ||
| dbt_group = DbtTaskGroup( | ||
| group_id="watcher_upstream_failure_recovery", | ||
| execution_config=execution_config, | ||
| project_config=ProjectConfig(DBT_PROJECT_PATH), | ||
| profile_config=profile_config, | ||
| operator_args=operator_args, | ||
| ) | ||
| | ||
| post_dbt = EmptyOperator(task_id="post_dbt") | ||
| | ||
| cleanup = SQLExecuteQueryOperator( | ||
| task_id="drop_fail_once_marker", | ||
| conn_id="example_conn", | ||
| sql="DROP SEQUENCE IF EXISTS public._cosmos_recovery_fail_once_seq;", | ||
| trigger_rule="all_done", | ||
| ) | ||
| | ||
| dbt_group >> post_dbt | ||
| dbt_group >> cleanup | ||
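
The docstring's "without the fix / with the fix" contrast can be illustrated with a small sketch. This is not the code in ``cosmos/operators/_watcher/base.py``: the two event names come from the docstring, while `status_for_xcom`, `consumer_outcome`, and the flattened `(event_name, node_status)` inputs are hypothetical simplifications introduced here.

```python
# Event names taken from the docstring above; everything else is illustrative.
UPSTREAM_FAILURE_SKIP_EVENTS = {"SkippingDetails", "LogSkipBecauseError"}


def status_for_xcom(event_name: str, node_status: str) -> str:
    """Return the status a producer could publish for a node (hypothetical helper)."""
    # Rewrite only skips that were caused by an upstream failure.
    if node_status == "skipped" and event_name in UPSTREAM_FAILURE_SKIP_EVENTS:
        return "failed"
    return node_status


def consumer_outcome(status: str) -> str:
    """Simplified Airflow view: a skipped task is final, a failed task can retry."""
    if status == "skipped":
        return "skipped (not retried)"
    if status == "failed":
        return "failed (eligible for retry)"
    return "success"


# Without the rewrite, the downstream consumer ends up skipped for good.
before = consumer_outcome("skipped")
# With the rewrite, the same dbt skip becomes a retryable failure.
after = consumer_outcome(status_for_xcom("LogSkipBecauseError", "skipped"))
```

With ``retries`` set to 2 in ``default_args``, the "failed" outcome gives the downstream consumer a second attempt, which is the point at which the consumer fallback described in the docstring can run the model locally.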
Are these public dbt contracts? If internal, would it be possible to do a quick sanity check that these are consistent across the dbt versions we support? I guess for future releases we might be able to catch if these change when we test against newly released versions, but it would be nice to check for earlier versions once.
Addressed in https://github.com/astronomer/astronomer-cosmos/pull/2700/changes
Thanks @tatiana. I see the PR touching on the `--log-format` key. What I meant here was that the docstring talks about upstream failures: `SkippingDetails`/`LogSkipBecauseError`, and whether they are the same for the dbt versions we support.
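
One way to run the sanity check requested above is a small helper that reports which expected names are absent from a module, executed once per supported dbt version. This is only a sketch: `missing_names` is a hypothetical helper, and the module path used in the usage note below is an assumption that would need confirming against each dbt release.

```python
import importlib


def missing_names(module_path: str, names: list[str]) -> list[str]:
    """Return the names not found in ``module_path`` (all of them if the import fails)."""
    try:
        module = importlib.import_module(module_path)
    except ImportError:
        # Module absent in this environment: treat every name as missing.
        return list(names)
    return [name for name in names if not hasattr(module, name)]
```

For example, calling `missing_names("dbt.events.types", ["SkippingDetails", "LogSkipBecauseError"])` in an environment with a given dbt version installed would return an empty list if both names exist at that assumed location, and the missing names otherwise.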