Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
60 changes: 58 additions & 2 deletions docs/guides/run_dbt/airflow-worker/watcher-retry-history.rst
Original file line number Diff line number Diff line change
Expand Up @@ -52,7 +52,16 @@ Does the Airflow state match dbt's?
The consumer tasks for those models are marked successful instead of running their fallback
retry, so the DAG appears successful even though dbt failed.
* - **1.14.1**
- **Yes**.
- **Mostly**, except upstream-failure-skipped downstream nodes could produce a "false
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users should be able to retry the whole DAG via Airflow automatic retry — so humans do not need to intervene when the DAG fails

green" — fixed in 1.14.2 by
`#2684 <https://github.com/astronomer/astronomer-cosmos/pull/2684>`_.
Comment on lines 54 to +57
* - **1.14.2**
- **Yes**. Same as 1.14.1, plus a previously unreported "false green" gap is closed:
when a model fails on its first attempt and dbt skips downstream nodes via the
``upstream_failure`` path, those downstream models are now re-run when their upstream
recovers via consumer fallback, instead of being silently marked SKIPPED while the DAG
completes ``success`` with the downstream tables un-materialized
(`#2684 <https://github.com/astronomer/astronomer-cosmos/pull/2684>`_).

Task-level retry — consumer
+++++++++++++++++++++++++++++
Expand Down Expand Up @@ -91,6 +100,19 @@ Task-level retry — consumer
* - **1.14.1**
- Similar to 1.12.0. Consumers always read correct model statuses thanks to the producer's
XCom backup mechanism — see *Task-level retry — producer*.
* - **1.14.2**
- Similar to 1.14.1, with two additions. First, consumers for models that dbt marked as
skipped due to upstream failure (``upstream_failure``) now fail on attempt 1 rather than
skipping — Cosmos rewrites those ``"skipped"`` statuses to ``"failed"`` in the
producer-side log parser, so Airflow retries and the existing
``_fallback_to_non_watcher_run`` path runs the model locally once its upstream has
recovered (`#2684 <https://github.com/astronomer/astronomer-cosmos/pull/2684>`_).
Second, the consumer's fallback ``dbt`` invocation no longer inherits the producer's
internal ``--log-format json`` flag, so retry task logs default to dbt's normal text
output instead of one structured event per line
(`#2713 <https://github.com/astronomer/astronomer-cosmos/pull/2713>`_); users who want
JSON output on retry can opt in via
``operator_args={"dbt_cmd_flags": ["--log-format", "json"]}``.

Task-level retry — producer
+++++++++++++++++++++++++++++
Expand Down Expand Up @@ -132,7 +154,7 @@ Task-level retry — producer
restores it on retry, so consumers always read correct model statuses. The backup Variable
is cleaned up on success.

**Known issues with the XCom backup mechanism:**
**Known issues with the XCom backup mechanism (both fixed in 1.14.2):**

- `#2619 <https://github.com/astronomer/astronomer-cosmos/issues/2619>`_ — backup Variable
key is not sanitized for ``:`` and ``+`` in Airflow 3 default ``run_id`` formats; strict-naming
Expand All @@ -141,6 +163,18 @@ Task-level retry — producer
- `#2625 <https://github.com/astronomer/astronomer-cosmos/issues/2625>`_ — on Airflow 2,
``_get_task_group_id()`` returns ``None``, so multiple ``DbtTaskGroup`` producers in the
same DAG run share one backup key and log ``UniqueViolation`` on every model completion.
* - **1.14.2**
- Same as 1.14.1, with both XCom-backup known issues from 1.14.1 fixed. The key generator now
sanitises every component to ``[A-Za-z0-9_-]`` — the safest subset across the secrets
backends Airflow ships with — so ``:`` / ``+`` from default Airflow 3 ``run_id`` formats no
longer break strict-naming backends
(`#2629 <https://github.com/astronomer/astronomer-cosmos/pull/2629>`_, closes
`#2619 <https://github.com/astronomer/astronomer-cosmos/issues/2619>`_). And
``_get_task_group_id`` now reads ``task.task_group.group_id`` (the correct attribute) instead
of the non-existent ``task.task_group_id`` it was reading before, so each ``DbtTaskGroup``
producer in a DAG writes to a distinct backup Variable
(`#2683 <https://github.com/astronomer/astronomer-cosmos/pull/2683>`_, closes
`#2625 <https://github.com/astronomer/astronomer-cosmos/issues/2625>`_).

Automatic retries
+++++++++++++++++
Expand Down Expand Up @@ -180,6 +214,13 @@ Automatic retries
- **Works.** Producer auto-retries raise ``AirflowSkipException``; XCom is restored from the
Variable backup so consumers read correct model statuses. Subject to the XCom backup known
issues — see *Task-level retry — producer*.
* - **1.14.2**
- **Works.** Same as 1.14.1, with the XCom-backup known issues fixed
(`#2629 <https://github.com/astronomer/astronomer-cosmos/pull/2629>`_,
`#2683 <https://github.com/astronomer/astronomer-cosmos/pull/2683>`_ — see
*Task-level retry — producer*). Additionally, downstream models whose upstream failed on
the producer's first attempt are now re-run on Airflow retry instead of staying SKIPPED
(`#2684 <https://github.com/astronomer/astronomer-cosmos/pull/2684>`_).

Full DAG / TaskGroup clear
++++++++++++++++++++++++++
Expand Down Expand Up @@ -221,6 +262,11 @@ Full DAG / TaskGroup clear
so it does not propagate to tasks downstream of the group
(`#2594 <https://github.com/astronomer/astronomer-cosmos/issues/2594>`_). The gateway is
only added for ``DbtTaskGroup`` — ``DbtDag`` does not need it.
* - **1.14.2**
- **Works.** Same as 1.14.1, with the XCom-backup known issues fixed
(`#2629 <https://github.com/astronomer/astronomer-cosmos/pull/2629>`_,
`#2683 <https://github.com/astronomer/astronomer-cosmos/pull/2683>`_ — see
*Task-level retry — producer*).

Avoid duplicate or concurrent runs of the same dbt transformation in the same DAG run
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Expand Down Expand Up @@ -266,3 +312,13 @@ Avoid duplicate or concurrent runs of the same dbt transformation in the same DA
Cosmos now checks the producer's state first: if it is still running, the sensor keeps
polling instead of launching a duplicate ``dbt`` invocation; only after the producer reaches
a terminal state does the consumer fall back to ``ExecutionMode.LOCAL``.
* - **1.14.2**
- **Met.** Same as 1.14.1. When ``depends_on_past=True`` is set on a ``DbtTaskGroup``, the
cross-run risk where the next DAG run's producer could start before the previous run's
consumers had completed
(`#2596 <https://github.com/astronomer/astronomer-cosmos/issues/2596>`_) is now mitigated:
leaf consumer tasks are wired upstream of the producer-done gateway, and the producer
plus watcher tasks carry ``wait_for_downstream=True``
(`#2615 <https://github.com/astronomer/astronomer-cosmos/pull/2615>`_), so the task group
behaves as a single atomic unit across runs. Topology is unchanged for the default
``depends_on_past=False``.