Add watcher retry behaviour history documentation #2600
Merged
010a83b Add docs related to the watcher retry evolution (tatiana)
38c4f14 Restructure watcher retry history into per-goal tables (tatiana)
ed73fb4 Potential fix for pull request finding (tatiana)
fd81296 Address watcher retry history review feedback (tatiana)
36e4df7 Sharpen watcher retry history tables and intro (tatiana)
61df08e Merge branch 'main' into watcher-retry-history (tatiana)
268 changes: 268 additions & 0 deletions
docs/guides/run_dbt/airflow-worker/watcher-retry-history.rst
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,268 @@ | ||
| :orphan: | ||
|
|
||
| .. _watcher-retry-history: | ||
|
|
||
| Watcher retry behavior history | ||
| ------------------------------ | ||
|
|
||
| While ``ExecutionMode.WATCHER`` can significantly improve DAG run times, it is based on | ||
| non-idempotent `Apache Airflow® <https://airflow.apache.org/>`_ tasks and relies on a complex retry | ||
| mechanism in which one task's status can affect another task's status. This is why | ||
| ``ExecutionMode.WATCHER`` has remained marked as experimental for several months, and why it stays | ||
| experimental until retries behave correctly. This document traces how each aspect of retries has | ||
| evolved within the Cosmos watcher implementation across Cosmos releases. | ||
|
|
||
| Goals | ||
| +++++ | ||
|
|
||
| - The Airflow ``DbtDag`` / ``DbtTaskGroup`` state should match the dbt pipeline status, whether successful or failed | ||
| - Users should be able to retry individual tasks via Airflow retry | ||
| - Users should be able to retry the whole DAG via Airflow automatic retry — so humans do not need to intervene when the DAG fails | ||
| - Users should be able to retry the whole DAG via Airflow clear | ||
| - Avoid duplicate or concurrent runs of the same dbt transformation in the same DAG run | ||
|
|
||
| Does the Airflow state match dbt's? | ||
| +++++++++++++++++++++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Outcome | ||
| * - **1.11.0** | ||
| - **Yes**. | ||
| * - **1.11.1** | ||
| - **Yes**. Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - **Yes**. Same as 1.11.0. | ||
| * - **1.11.3** | ||
| - **Yes**. Same as 1.11.0. | ||
| * - **1.12.0** | ||
| - **Yes**. Same as 1.11.0. | ||
| * - **1.12.1** | ||
| - **Yes**. Same as 1.11.0. | ||
| * - **1.13.0** | ||
| - **Maybe**. Yes if the first run succeeds. No if retries happen, unless users manually | ||
| clear the producer task. | ||
| * - **1.13.1** | ||
| - **Maybe**. Same as 1.13.0. | ||
| * - **1.14.0** | ||
| - **No** — on producer retry, dbt model failures from the first attempt are silently dropped. | ||
| The consumer tasks for those models are marked successful instead of running their fallback | ||
| retry, so the DAG appears successful even though dbt failed. | ||
| * - **1.14.1** | ||
| - **Yes**. | ||
|
|
||
| Task-level retry — consumer | ||
| +++++++++++++++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Behavior | ||
| * - **1.11.0** | ||
| - Falls back to ``ExecutionMode.LOCAL`` behavior (``dbt run --select <model>``). | ||
| * - **1.11.1** | ||
| - Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - Same as 1.11.0. | ||
| * - **1.11.3** | ||
| - Same as 1.11.0. | ||
| * - **1.12.0** | ||
| - Similar to 1.11.0. Fixes rendering of dbt compiled SQL as a templated field | ||
| (`#2209 <https://github.com/astronomer/astronomer-cosmos/pull/2209>`_); consumers run | ||
| asynchronously when they behave as sensors | ||
| (`#2084 <https://github.com/astronomer/astronomer-cosmos/pull/2084>`_), letting them detect | ||
| producer failure faster and freeing worker slots sooner. | ||
| * - **1.12.1** | ||
| - Same as 1.12.0. | ||
| * - **1.13.0** | ||
| - Same as 1.12.0. | ||
| * - **1.13.1** | ||
| - Same as 1.12.0. | ||
| * - **1.14.0** | ||
| - Similar to 1.12.0. Affected by an Airflow limitation | ||
| (`#2554 <https://github.com/astronomer/astronomer-cosmos/issues/2554>`_): because the producer | ||
| returns success on retry and Airflow does not preserve XCom across retries, consumers lose | ||
| the model statuses from the first attempt and may silently mark failed models as successful. | ||
| * - **1.14.1** | ||
| - Similar to 1.12.0. Consumers always read correct model statuses thanks to the producer's | ||
| XCom backup mechanism — see *Task-level retry — producer*. | ||
|
|
||
| Task-level retry — producer | ||
| +++++++++++++++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Behavior | ||
| * - **1.11.0** | ||
| - Relaunches the entire ``dbt build``, risking dangerous duplicate or concurrent runs. | ||
| * - **1.11.1** | ||
| - Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - Manual clear still relaunches ``dbt build``. Auto-retry is now blocked because Cosmos forces | ||
| producer ``retries`` to ``0`` (see *Automatic retries*). | ||
| * - **1.11.3** | ||
| - Same as 1.11.2. | ||
| * - **1.12.0** | ||
| - Same as 1.11.2. | ||
| * - **1.12.1** | ||
| - Same as 1.11.2. | ||
| * - **1.13.0** | ||
| - Producer returns success on ``try_number > 1`` | ||
| (`#2283 <https://github.com/astronomer/astronomer-cosmos/pull/2283>`_) — logs an informational | ||
| message and does not re-run ``dbt build``. Also fixes empty/ephemeral models hanging | ||
| consumers (`#2279 <https://github.com/astronomer/astronomer-cosmos/pull/2279>`_). Only | ||
| reachable via manual clear, since ``retries`` is still forced to ``0``. | ||
| * - **1.13.1** | ||
| - Same as 1.13.0. | ||
| * - **1.14.0** | ||
| - Producer returns success on retry without re-running ``dbt build``. | ||
| * - **1.14.1** | ||
| - Producer raises ``AirflowSkipException`` on retry | ||
| (`#2559 <https://github.com/astronomer/astronomer-cosmos/pull/2559>`_) instead of returning | ||
| success, so the task shows an explicit "skipped" state rather than a misleading "successful" | ||
| one. The producer also backs up its XCom state (model statuses) to an Airflow Variable during | ||
| execution and restores it on retry, so consumers always read correct model statuses. The | ||
| backup Variable is cleaned up on success. | ||
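The backup-and-restore flow described for 1.14.1 can be sketched in plain Python. This is a minimal illustration, not the Cosmos implementation: the two dictionaries stand in for Airflow's XCom and Variable stores, ``ProducerSkipped`` stands in for ``AirflowSkipException``, and ``run_producer`` and its parameters are hypothetical names.

```python
from typing import Dict

# Assumption: plain dicts replace Airflow's XCom and Variable backends so the
# sketch runs without Airflow installed.
xcom_store: Dict[str, Dict[str, str]] = {}
variable_store: Dict[str, Dict[str, str]] = {}


class ProducerSkipped(Exception):
    """Stand-in for AirflowSkipException."""


def run_producer(key: str, try_number: int, model_statuses: Dict[str, str]) -> None:
    """Hypothetical producer: mirrors statuses to a backup, restores them on retry."""
    if try_number > 1:
        # On retry, the first attempt's XCom is gone: restore it from the backup
        # and skip instead of re-running the build.
        backup = variable_store.get(key)
        if backup is not None:
            xcom_store[key] = dict(backup)
        raise ProducerSkipped("retry detected; build not re-run")

    # First attempt: publish each model status and mirror it to the backup.
    for model, status in model_statuses.items():
        xcom_store.setdefault(key, {})[model] = status
        variable_store.setdefault(key, {})[model] = status

    if all(status == "success" for status in model_statuses.values()):
        # The backup is only needed when a retry may follow a failure.
        variable_store.pop(key, None)
    else:
        raise RuntimeError("at least one model failed")
```

In this sketch, a failed first attempt leaves the backup in place, so a later retry can repopulate the statuses even after the XCom entries have been dropped.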
|
|
||
| **Known issues with the XCom backup mechanism:** | ||
|
|
||
| - `#2619 <https://github.com/astronomer/astronomer-cosmos/issues/2619>`_ — backup Variable | ||
| key is not sanitized for ``:`` and ``+`` in Airflow 3 default ``run_id`` formats; strict-naming | ||
| secrets backends (e.g. GCP Secret Manager / AWS Secrets Manager) reject the name, breaking every | ||
| ``Variable.get`` / ``Variable.set`` from the producer. | ||
| - `#2625 <https://github.com/astronomer/astronomer-cosmos/issues/2625>`_ — on Airflow 2, | ||
| ``_get_task_group_id()`` returns ``None``, so multiple ``DbtTaskGroup`` producers in the | ||
| same DAG run share one backup key and log ``UniqueViolation`` on every model completion. | ||
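Issue #2619 comes down to characters that strict-naming secrets backends refuse in names. One possible sanitization step is sketched below. The helper ``sanitize_backup_key``, the key layout, and the assumed safe character set (letters, digits, ``_`` and ``-``) are all illustrative and not taken from the Cosmos codebase; the allowed characters should be checked against the backend in use.

```python
import re

# Assumed safe set: letters, digits, underscore, and hyphen.
_UNSAFE_CHARS = re.compile(r"[^A-Za-z0-9_-]")


def sanitize_backup_key(raw_key: str) -> str:
    """Hypothetical helper: replace every character outside the safe set with '_'."""
    return _UNSAFE_CHARS.sub("_", raw_key)


# Illustrative key embedding a run_id that contains ':' and '+'.
example = sanitize_backup_key("backup__manual__2025-01-01T00:00:00+00:00")
print(example)  # backup__manual__2025-01-01T00_00_00_00_00
```

A replacement like this is lossy: two different raw keys can map to the same sanitized name, so a real fix would also need to keep keys unique, for example by appending a short hash of the raw key.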
|
|
||
| Automatic retries | ||
| +++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Behavior | ||
| * - **1.11.0** | ||
| - **Unsafe.** No safeguard; producer auto-retries would relaunch ``dbt build``, while consumer | ||
| tasks may be running their own retries. | ||
| * - **1.11.1** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - **Failure.** Producer ``retries`` forced to ``0`` by Cosmos — no auto-retry possible on the | ||
| producer. | ||
| * - **1.11.3** | ||
| - **Failure.** Same as 1.11.2. | ||
| * - **1.12.0** | ||
| - **Failure.** Same as 1.11.2. | ||
| * - **1.12.1** | ||
| - **Failure.** Same as 1.11.2. | ||
| * - **1.13.0** | ||
| - **Failure.** Same as 1.11.2. | ||
| * - **1.13.1** | ||
| - **Failure.** Same as 1.11.2. | ||
| * - **1.14.0** | ||
| - **Incorrect status.** Forced ``retries=0`` on the producer is removed | ||
| (`#2479 <https://github.com/astronomer/astronomer-cosmos/pull/2479>`_), fixing | ||
| `#2429 <https://github.com/astronomer/astronomer-cosmos/issues/2429>`_. Producer auto-retries | ||
| return success without re-running ``dbt build``, but Airflow does not preserve XCom across | ||
| retries (`#2554 <https://github.com/astronomer/astronomer-cosmos/issues/2554>`_), so failed | ||
| dbt models can be silently marked successful. | ||
| * - **1.14.1** | ||
| - **Works.** Producer auto-retries raise ``AirflowSkipException``; XCom is restored from the | ||
| Variable backup so consumers read correct model statuses. Subject to the XCom backup known | ||
| issues — see *Task-level retry — producer*. | ||
|
|
||
| Full DAG / TaskGroup clear | ||
| ++++++++++++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Behavior | ||
| * - **1.11.0** | ||
| - **Unsafe.** Relaunches the entire ``dbt build``, risking dangerous duplicate or concurrent runs. | ||
| * - **1.11.1** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.11.3** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.12.0** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.12.1** | ||
| - **Unsafe.** Same as 1.11.0. | ||
| * - **1.13.0** | ||
| - **Works.** Producer returns success on retry without re-running ``dbt build``; consumers run | ||
| using ``ExecutionMode.LOCAL``. Works correctly on a manual full clear. | ||
| * - **1.13.1** | ||
| - **Works.** Same as 1.13.0. | ||
| * - **1.14.0** | ||
| - **Incorrect status.** Same as 1.13.0, but Airflow does not preserve XCom across retries | ||
| (`#2554 <https://github.com/astronomer/astronomer-cosmos/issues/2554>`_), so failed dbt | ||
| models can be silently marked successful. | ||
| * - **1.14.1** | ||
| - **Works.** Producer raises ``AirflowSkipException`` (skipped, not successful); XCom is | ||
| restored from the Variable backup so consumers read correct model statuses (subject to the | ||
| XCom backup known issues — see *Task-level retry — producer*). For ``DbtTaskGroup``, a | ||
| gateway task ``dbt_producer_watcher_done`` | ||
| (`#2597 <https://github.com/astronomer/astronomer-cosmos/pull/2597>`_) with | ||
| ``trigger_rule="none_failed"`` is added downstream of the producer to absorb its skip state | ||
| so it does not propagate to tasks downstream of the group | ||
| (`#2594 <https://github.com/astronomer/astronomer-cosmos/issues/2594>`_). The gateway is | ||
| only added for ``DbtTaskGroup`` — ``DbtDag`` does not need it. | ||
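Why a gateway with ``trigger_rule="none_failed"`` absorbs the producer's skip can be shown with a simplified model of two Airflow trigger rules. This is a sketch under stated assumptions: ``resolve_downstream`` is a hypothetical function that covers only ``all_success`` and ``none_failed``, and it ignores every other rule and scheduler detail.

```python
from typing import List

FAILED_STATES = {"failed", "upstream_failed"}


def resolve_downstream(trigger_rule: str, upstream_states: List[str]) -> str:
    """Simplified model: what happens to a task given its upstream task states."""
    if any(state in FAILED_STATES for state in upstream_states):
        return "upstream_failed"
    if trigger_rule == "all_success":
        # With the default rule, a skipped upstream skips the downstream task too.
        if any(state == "skipped" for state in upstream_states):
            return "skipped"
        return "run"
    if trigger_rule == "none_failed":
        # Skipped upstreams are tolerated: the task runs anyway.
        return "run"
    raise ValueError(f"rule not covered by this sketch: {trigger_rule}")


# Without a gateway: the producer's skip cascades to the task after the group.
without_gateway = resolve_downstream("all_success", ["skipped"])

# With a gateway: it runs despite the skipped producer, and once it succeeds
# the task after the group sees only a successful upstream.
gateway = resolve_downstream("none_failed", ["skipped"])
with_gateway = resolve_downstream("all_success", ["success"])
```

Here ``without_gateway`` evaluates to ``"skipped"``, while ``gateway`` and ``with_gateway`` both evaluate to ``"run"``.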
|
|
||
| Avoid duplicate or concurrent runs of the same dbt transformation in the same DAG run | ||
| ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 15 85 | ||
|
|
||
| * - Version | ||
| - Behavior | ||
| * - **1.11.0** | ||
| - **Not met.** Producer auto-retry, manual clear, or full DAG/TaskGroup clear relaunches | ||
| ``dbt build`` and re-runs all transformations — potentially in parallel with consumer | ||
| fallback runs (``ExecutionMode.LOCAL`` behavior) of the same models. | ||
| * - **1.11.1** | ||
| - **Not met.** Same as 1.11.0. | ||
| * - **1.11.2** | ||
| - **Not met.** Producer ``retries`` is forced to ``0``, so automatic re-runs are impossible. A | ||
| manual producer clear or full DAG/TaskGroup clear still relaunches ``dbt build`` and may run | ||
| concurrently with consumer fallbacks. | ||
| * - **1.11.3** | ||
| - **Not met.** Same as 1.11.2. | ||
| * - **1.12.0** | ||
| - **Not met.** Same as 1.11.2. | ||
| * - **1.12.1** | ||
| - **Not met.** Same as 1.11.2. | ||
| * - **1.13.0** | ||
| - **Not met.** Producer returns success on retry without re-running ``dbt build``, so retries | ||
| and full clears no longer relaunch the entire build. However, when a consumer sensor times | ||
| out and Airflow auto-retries it, the consumer's ``ExecutionMode.LOCAL`` fallback runs | ||
| unconditionally without checking whether the producer is still running — which can cause | ||
| concurrent runs of the same transformation. Fixed in 1.14.1 by | ||
| `#2592 <https://github.com/astronomer/astronomer-cosmos/pull/2592>`_. | ||
| * - **1.13.1** | ||
| - **Not met.** Same as 1.13.0. | ||
| * - **1.14.0** | ||
| - **Not met.** Same as 1.13.0 — forced ``retries=0`` is lifted, but the consumer-sensor-retry | ||
| concurrent run risk persists. Fixed in 1.14.1 by | ||
| `#2592 <https://github.com/astronomer/astronomer-cosmos/pull/2592>`_. | ||
| * - **1.14.1** | ||
| - **Met.** Producer raises ``AirflowSkipException`` on retry — no ``dbt build`` re-run. On | ||
| consumer sensor retry (`#2592 <https://github.com/astronomer/astronomer-cosmos/pull/2592>`_), | ||
| Cosmos now checks the producer's state first: if it is still running, the sensor keeps | ||
| polling instead of launching a duplicate ``dbt`` invocation; only after the producer reaches | ||
| a terminal state does the consumer fall back to ``ExecutionMode.LOCAL``. | ||
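The 1.14.1 consumer retry rule can be summarized as a small decision function. This is a minimal sketch of the behavior described above, not the Cosmos code: ``ProducerState``, ``TERMINAL_STATES``, and ``consumer_retry_action`` are hypothetical names, and the set of states is an assumption.

```python
from enum import Enum


class ProducerState(Enum):
    RUNNING = "running"
    SUCCESS = "success"
    FAILED = "failed"
    SKIPPED = "skipped"


# Assumption: these are the states treated as terminal for the producer.
TERMINAL_STATES = {ProducerState.SUCCESS, ProducerState.FAILED, ProducerState.SKIPPED}


def consumer_retry_action(producer_state: ProducerState) -> str:
    """Decide what a retried consumer sensor does, given the producer's state."""
    if producer_state in TERMINAL_STATES:
        # The producer is finished, so running the model locally cannot overlap it.
        return "fallback_local"
    # The producer may still be building this model: keep waiting.
    return "keep_polling"
```

Checking the producer first is what prevents the overlap described for 1.13.0 and 1.14.0, where the fallback ran unconditionally.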