Closed
26 commits
acf93ee
Try to reproduce downstream watcher tasks when producer skips
tatiana Apr 21, 2026
bec86e0
Reproduce problem of downstream task being skipped
tatiana Apr 21, 2026
4199c20
One option of solution for DbtTaskGroup watcher downstream when watch…
tatiana Apr 22, 2026
a2a7892
Add better names
tatiana Apr 22, 2026
c258efa
Add tests and make feature opt-in
tatiana Apr 22, 2026
44d8a9c
Update docs
tatiana Apr 22, 2026
56650f6
Fix integration test
tatiana Apr 22, 2026
58eaa09
Try to fix tests
tatiana Apr 22, 2026
822a6e5
improve tests so we don't need to install af posgres provider
tatiana Apr 22, 2026
260a450
Skip test due to AF 3.1 limitation https://github.com/astronomer/as…
tatiana Apr 22, 2026
526b30d
Improve test coverage
tatiana Apr 22, 2026
3e4feca
Fix behaviour for AF3
tatiana Apr 22, 2026
aba8c9d
Fix unittests
tatiana Apr 22, 2026
3af4f18
Address PR feedback
tatiana Apr 22, 2026
839ef9a
Address PR feedback
tatiana Apr 22, 2026
7247b32
Update dev/failed_dags/example_watcher_downstream_not_skipped.py
tatiana Apr 22, 2026
8f32df8
Address feedback
tatiana Apr 22, 2026
cf888ac
Address feedback
tatiana Apr 22, 2026
679007b
⏺ The # type: ignore[assignment] is needed because trigger_rule is ty…
tatiana Apr 22, 2026
cb2fc1e
Address PR feedback on tests
tatiana Apr 22, 2026
4f7e948
Fix unittests
tatiana Apr 22, 2026
7ce3478
improve docs
tatiana Apr 22, 2026
4e1ab94
Fix running integration tests
tatiana Apr 22, 2026
ecda234
Address feedback on none_failed_min_one_success
tatiana Apr 22, 2026
cd366d8
Close postgres connection
tatiana Apr 22, 2026
4a8b966
Fix running new integration tests
tatiana Apr 22, 2026
48 changes: 48 additions & 0 deletions cosmos/airflow/task_group.py
@@ -12,6 +12,9 @@
except ImportError:
    from airflow.utils.task_group import TaskGroup

from cosmos import settings
from cosmos.config import ExecutionConfig
from cosmos.constants import ExecutionMode
from cosmos.converter import DbtToAirflowConverter, airflow_kwargs, specific_kwargs


@@ -26,7 +29,52 @@ def __init__(
        *args: Any,
        **kwargs: Any,
    ) -> None:
        self._execution_config = kwargs.get("execution_config")
        kwargs["group_id"] = group_id
        TaskGroup.__init__(self, *args, **airflow_kwargs(**kwargs))
        kwargs["task_group"] = self
        DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))

    @property
    def is_watcher_mode(self) -> bool:
        """Whether this task group uses a watcher execution mode."""
        return isinstance(self._execution_config, ExecutionConfig) and self._execution_config.execution_mode in (
            ExecutionMode.WATCHER,
            ExecutionMode.WATCHER_KUBERNETES,
        )

    def _set_watcher_downstream_tasks_trigger_rule(self, downstream_tasks: Any) -> None:
        """In watcher mode the producer task may be skipped on retry.

        Set ``trigger_rule="none_failed_min_one_success"`` on downstream tasks so the skip
        does not propagate outside the task group.
        """
        if not self.is_watcher_mode or not settings.propagate_watcher_trigger_rule:
            return
        items = downstream_tasks if isinstance(downstream_tasks, (list, tuple)) else [downstream_tasks]
        for item in items:
            if isinstance(item, TaskGroup):
                # For downstream TaskGroups, only set trigger_rule on root tasks
                # (tasks with no upstream within the group) to avoid altering
                # the group's internal dependency semantics.
                for root_task in item.get_roots():
                    if hasattr(root_task, "trigger_rule"):
                        root_task.trigger_rule = "none_failed_min_one_success"  # type: ignore[assignment]
            elif hasattr(item, "trigger_rule"):
                item.trigger_rule = "none_failed_min_one_success"  # type: ignore[assignment]

    def __rshift__(self, other: Any) -> Any:
        # dbt_group >> post_dbt — post_dbt is downstream of dbt_group
        result = super().__rshift__(other)
        self._set_watcher_downstream_tasks_trigger_rule(other)
        return result

    def __rlshift__(self, other: Any) -> Any:
        # other << dbt_group — other is downstream of dbt_group
Copilot AI Apr 22, 2026

__rlshift__ is commented as supporting other << dbt_group (dependency created from the downstream task side), but the docs for propagate_watcher_trigger_rule explicitly call out that task << dbt_group / task.set_upstream(dbt_group) cannot be intercepted. Unless there is a concrete Airflow code path that actually invokes DbtTaskGroup.__rlshift__ for this expression, this method/comment is misleading and may give a false sense that the limitation is handled. Consider removing __rlshift__ (and its test) or updating the comment/docstring to clarify when (if ever) it is invoked.

Suggested change
# other << dbt_group — other is downstream of dbt_group
# Reflected ``<<`` hook used only when Python/Airflow dispatch resolves to this
# TaskGroup instance. This is not a general interception point for
# ``task << dbt_group`` / ``task.set_upstream(dbt_group)``; watcher trigger-rule
# propagation must not rely on this method being invoked for those cases.

        result = super().__rlshift__(other)
        self._set_watcher_downstream_tasks_trigger_rule(other)
        return result

    def set_downstream(self, task_or_task_list: Any, edge_modifier: Any = None) -> None:  # type: ignore[override]
        super().set_downstream(task_or_task_list, edge_modifier=edge_modifier)
        self._set_watcher_downstream_tasks_trigger_rule(task_or_task_list)
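
The review comment on `__rlshift__` hinges on how Python dispatches `<<`. As a minimal sketch with stand-in classes (`FakeTask` and `FakeGroup` are hypothetical and not part of Airflow or Cosmos), the reflected `__rlshift__` of the right-hand operand only runs when the left-hand operand does not handle `<<` itself, for example when it is a plain list:

```python
class FakeTask:
    """Hypothetical stand-in for an operator that implements ``<<`` itself."""

    def __init__(self, name):
        self.name = name

    def __lshift__(self, other):
        # The left operand handles ``task << group``; the group's hooks are not consulted.
        other.log.append(f"{self.name}.__lshift__")
        return other


class FakeGroup:
    """Hypothetical stand-in for a task group that overrides both hooks."""

    def __init__(self):
        self.log = []

    def __rshift__(self, other):
        # Runs for ``group >> anything``.
        self.log.append("group.__rshift__")
        return other

    def __rlshift__(self, other):
        # Runs only if the left operand has no usable ``__lshift__`` (e.g. a list).
        self.log.append("group.__rlshift__")
        return self


group = FakeGroup()
task = FakeTask("task")

group >> task    # dispatches to FakeGroup.__rshift__
task << group    # dispatches to FakeTask.__lshift__, not to the group
[task] << group  # list has no __lshift__, so FakeGroup.__rlshift__ runs
```

Under these assumptions, `task << dbt_group` would bypass the override while a list on the left-hand side would reach it. Whether Airflow's own classes dispatch the same way should be checked against the Airflow version in use.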
1 change: 1 addition & 0 deletions cosmos/settings.py
@@ -68,6 +68,7 @@
# this setting allows retries to run on a queue with larger resources, which is often necessary for larger dbt projects
# this would also be used to run the producer task
watcher_dbt_execution_queue = conf.get("cosmos", "watcher_dbt_execution_queue", fallback=None)
propagate_watcher_trigger_rule = conf.getboolean("cosmos", "propagate_watcher_trigger_rule", fallback=False)

# The following environment variable is populated in Astro Cloud
in_astro_cloud = os.getenv("ASTRONOMER_ENVIRONMENT") == "cloud"
26 changes: 26 additions & 0 deletions dev/dags/dbt/watcher_downstream_not_skipped/dbt_project.yml
@@ -0,0 +1,26 @@
name: 'watcher_downstream_not_skipped'

config-version: 2
version: '0.1'

profile: 'default'

model-paths: ["models"]
seed-paths: ["seeds"]
test-paths: ["tests"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_modules"
  - "logs"

require-dbt-version: [">=1.0.0", "<2.0.0"]

on-run-start:
  - "CREATE SEQUENCE IF NOT EXISTS {{ target.schema }}._cosmos_fail_once_seq"

models:
  watcher_downstream_not_skipped:
    +materialized: table
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
select 1 as id, 'Alice' as first_name, 'Smith' as last_name, 'alice@example.com' as email
union all
select 2, 'Bob', 'Jones', 'bob@example.com'
union all
select 3, 'Charlie', 'Brown', 'charlie@example.com'
Original file line number Diff line number Diff line change
@@ -0,0 +1,12 @@
{{
    config(
        pre_hook=[
            "DO $$ BEGIN IF nextval('{{ target.schema }}._cosmos_fail_once_seq') <= 1 THEN RAISE EXCEPTION 'fail_once: intentional first-run failure'; END IF; END $$"
        ]
    )
}}

select
    id,
    first_name
from {{ ref('model_a') }}
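
The pre-hook above fails only the first time it runs: `nextval` returns 1 on the first call and larger values afterwards, and sequence increments are not rolled back when the statement raises. A rough Python analogue of that fail-once pattern, with `itertools.count` standing in for the PostgreSQL sequence (the name `fail_once_pre_hook` is made up for illustration):

```python
import itertools

# Stand-in for the PostgreSQL sequence: yields 1, 2, 3, ... like nextval().
_fail_once_seq = itertools.count(1)


def fail_once_pre_hook():
    """Raise on the first call only, mirroring the ``nextval(...) <= 1`` check."""
    value = next(_fail_once_seq)
    if value <= 1:
        raise RuntimeError("fail_once: intentional first-run failure")
    return value


# Three consecutive attempts: only the first one is expected to fail.
outcomes = []
for attempt in (1, 2, 3):
    try:
        fail_once_pre_hook()
        outcomes.append("ok")
    except RuntimeError:
        outcomes.append("failed")
```

Dropping the sequence resets the counter, which is why the example DAG includes a cleanup task that does so before the DAG is re-triggered.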
99 changes: 99 additions & 0 deletions dev/failed_dags/example_watcher_downstream_not_skipped.py
@@ -0,0 +1,99 @@
"""
Airflow DAG to verify watcher downstream-task behavior when a producer task is skipped on retry.

In watcher mode, when a dbt model fails, the producer task fails and, on retry, raises
AirflowSkipException instead of re-running dbt. By default, the skip propagates to all tasks
downstream of the TaskGroup (e.g. post_dbt), even though the consumer tasks inside the group succeeded.

To prevent this, enable the ``propagate_watcher_trigger_rule`` Cosmos setting::

    export AIRFLOW__COSMOS__PROPAGATE_WATCHER_TRIGGER_RULE=True

When enabled, Cosmos automatically sets ``trigger_rule="none_failed_min_one_success"`` on tasks
downstream of a watcher DbtTaskGroup, so the producer skip does not propagate.

Without this setting, post_dbt will be skipped.

This DAG demonstrates the behaviour with the setting enabled:
- model_a succeeds in the producer and the consumer reads the result from XCom
- model_retry fails on the first attempt (via a fail_once sequence pre-hook) but succeeds
on the consumer retry fallback
- post_dbt (an EmptyOperator downstream of the group) runs successfully — it is NOT skipped

A cleanup task drops the PostgreSQL sequence so the DAG can be re-triggered.
"""

import os
from datetime import datetime, timedelta
from pathlib import Path

from airflow.models import DAG
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

try:
    from airflow.providers.standard.operators.empty import EmptyOperator
except ImportError:
    from airflow.operators.empty import EmptyOperator

from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig
from cosmos.constants import ExecutionMode
from cosmos.profiles import PostgresUserPasswordProfileMapping

DEFAULT_DBT_ROOT_PATH = Path(__file__).parent.parent / "dags/dbt"
DBT_ROOT_PATH = Path(os.getenv("DBT_ROOT_PATH", DEFAULT_DBT_ROOT_PATH))
DBT_PROJECT_PATH = DBT_ROOT_PATH / "watcher_downstream_not_skipped"

profile_config = ProfileConfig(
    profile_name="default",
    target_name="dev",
    profile_mapping=PostgresUserPasswordProfileMapping(
        conn_id="example_conn",
        profile_args={"schema": "public"},
        disable_event_tracking=True,
    ),
)

execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.WATCHER,
)

operator_args = {
    "install_deps": True,
    "execution_timeout": timedelta(seconds=120),
}

if os.getenv("CI"):
    operator_args["trigger_rule"] = "all_success"

default_args = {
    "retries": 2,
    "retry_delay": timedelta(seconds=0),
    "depends_on_past": True,
}

with DAG(
    dag_id="example_watcher_downstream_not_skipped",
    schedule="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
    default_args=default_args,
):
    dbt_group = DbtTaskGroup(
        group_id="watcher_downstream_not_skipped",
        execution_config=execution_config,
        project_config=ProjectConfig(DBT_PROJECT_PATH),
        profile_config=profile_config,
        operator_args=operator_args,
    )

    cleanup = SQLExecuteQueryOperator(
        task_id="drop_fail_once_marker",
        conn_id="example_conn",
        sql="DROP SEQUENCE IF EXISTS public._cosmos_fail_once_seq;",
        trigger_rule="all_done",
    )

    post_dbt = EmptyOperator(task_id="post_dbt")

    dbt_group >> post_dbt
    dbt_group >> cleanup
78 changes: 70 additions & 8 deletions docs/guides/run_dbt/airflow-worker/watcher-execution-mode.rst
@@ -224,23 +224,85 @@ If a branch of the DAG fails, users can clear the status of a failed consumer ta

**Producer retry behavior**

.. versionadded:: 1.12.2
.. versionchanged:: 1.14.1

When the ``DbtProducerWatcherOperator`` is triggered for a retry (try_number > 1), it will not re-run the dbt build command and will succeed. In previous versions of Cosmos, the producer task would fail during retries.
This behavior is designed to support TaskGroup-level retries, as reported in `#2282 <https://github.com/astronomer/astronomer-cosmos/issues/2282>`_.
When the ``DbtProducerWatcherOperator`` is triggered for a retry (``try_number > 1``), it raises
``AirflowSkipException`` instead of re-running the ``dbt build`` command. Before skipping, it restores
XCom values from a backup so that consumer sensors can still read model statuses from the first attempt.

**Why this matters:**
**XCom backup and restore:**

- In earlier versions, attempting to retry the producer task would raise an ``AirflowException``, causing the retry to fail immediately.
- Now, the producer gracefully skips execution on retries, logging an informational message explaining that the retry was skipped to avoid running a second ``dbt build``.
- This allows users to retry entire TaskGroups and/or DAGs without the producer task blocking the retry flow.
During execution, each XCom push is incrementally backed up to an Airflow Variable. This ensures that
when the producer fails and Airflow clears XCom entries before the retry, the backed-up values can be
restored. On a successful run, the backup Variable is automatically deleted to avoid stale data
accumulating over time.

**How consumer retries work:**

1. The producer runs ``dbt build`` — some models succeed, some fail.
2. The producer task fails, and XCom values are backed up to a Variable.
3. On retry, the producer restores XCom from the Variable and raises ``AirflowSkipException``.
4. Consumer sensors read model statuses from the restored XCom.
5. Consumers for successful models complete immediately.
6. Consumers for failed models detect the error status and raise ``AirflowException``.
7. On their own retry, failed consumers fall back to running dbt individually for their model
   (via ``_fallback_to_non_watcher_run``), behaving like ``ExecutionMode.LOCAL``.
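
The steps above can be sketched as a small, self-contained simulation. It does not use Airflow or Cosmos: `SkipTask`, `run_producer`, `run_consumer`, and the dictionaries standing in for XCom and the backup Variable are all hypothetical names chosen for illustration.

```python
class SkipTask(Exception):
    """Stand-in for Airflow's AirflowSkipException."""


def run_producer(try_number, xcom, backup, dbt_results):
    """Simulate the producer: build on the first try, restore and skip on retries."""
    if try_number > 1:
        xcom.update(backup)  # restore statuses recorded by the first attempt
        raise SkipTask("retry: not re-running dbt build")
    for model, status in dbt_results.items():
        xcom[model] = status
        backup[model] = status  # incremental backup of each push
    if "failed" in dbt_results.values():
        raise RuntimeError("dbt build failed")
    backup.clear()  # successful run: drop the backup


def run_consumer(model, try_number, xcom):
    """Simulate a consumer: trust XCom first, run the model itself on retry."""
    if try_number > 1:
        return "ran dbt for " + model  # fallback, similar to local execution
    if xcom[model] == "failed":
        raise RuntimeError(model + " failed in producer")
    return "success from xcom"


xcom, backup = {}, {}
results = {"model_a": "success", "model_retry": "failed"}

# First attempt: the build fails and XCom is cleared before the retry.
try:
    run_producer(1, xcom, backup, results)
except RuntimeError:
    xcom.clear()

# Second attempt: statuses are restored from the backup and the producer skips.
try:
    run_producer(2, xcom, backup, results)
    producer_state = "success"
except SkipTask:
    producer_state = "skipped"

consumer_a = run_consumer("model_a", 1, xcom)
try:
    consumer_retry = run_consumer("model_retry", 1, xcom)
except RuntimeError:
    consumer_retry = run_consumer("model_retry", 2, xcom)
```

In this sketch the producer ends in a skipped state while both consumers finish, which is the situation that the trigger-rule discussion in this section is about.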

**Important considerations:**

- Retries are no longer forced to ``0`` by Cosmos, since 1.14.0. Users may configure ``retries`` freely on the producer task. On any retry attempt (``try_number > 1``), the producer gracefully skips execution and returns success — it will not re-run the ``dbt build`` command. This means retrying the producer (or clearing an entire TaskGroup) is safe and will not cause duplicate dbt builds. During the retry of the sensor tasks, they will effectively run the corresponding dbt commands.
- Users may configure ``retries`` freely on the producer task. On any retry attempt (``try_number > 1``),
  the producer gracefully skips execution; it will not re-run the ``dbt build`` command.
- Retrying the producer (or clearing an entire TaskGroup) is safe and will not cause duplicate dbt builds.
- When the sensor (consumer) tasks retry, they run the corresponding dbt commands themselves.
- When using ``DbtTaskGroup``, the producer's skip state propagates to tasks downstream of the group
  by default: Airflow's default ``trigger_rule="all_success"`` causes them to be skipped even when all
  consumer tasks succeeded. To prevent this, either:

  - Set the ``propagate_watcher_trigger_rule`` Cosmos setting to ``True`` (see below), which automatically
    sets ``trigger_rule="none_failed_min_one_success"`` on tasks downstream of the watcher ``DbtTaskGroup``.
  - Or manually set ``trigger_rule="none_failed_min_one_success"`` on your downstream tasks.

  This issue does not affect ``DbtDag``, where the producer-to-consumer dependency is handled differently.

The overall retry behavior will be further improved once `#1978 <https://github.com/astronomer/astronomer-cosmos/issues/1978>`_ is implemented.
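
To see why the producer's skip propagates, it helps to compare the two trigger rules on the same set of upstream states. The function below is a simplified model of the rule semantics (all upstream tasks succeeded, versus none failed and at least one succeeded). `should_run` is a hypothetical helper, not an Airflow API, and it assumes the skipped producer is among the upstream tasks seen by the downstream task.

```python
def should_run(trigger_rule, upstream_states):
    """Return True if a task with this rule would run for the given upstream states."""
    none_failed = not any(s in ("failed", "upstream_failed") for s in upstream_states)
    if trigger_rule == "all_success":
        return all(s == "success" for s in upstream_states)
    if trigger_rule == "none_failed_min_one_success":
        return none_failed and "success" in upstream_states
    raise ValueError("rule not covered by this sketch: " + trigger_rule)


# Upstream states after a producer retry: producer skipped, consumers succeeded.
states = ["skipped", "success", "success"]

default_rule = should_run("all_success", states)
relaxed_rule = should_run("none_failed_min_one_success", states)

# A genuine upstream failure still blocks the downstream task.
with_failure = should_run("none_failed_min_one_success", ["skipped", "failed"])
```

The relaxed rule tolerates the skipped producer but still requires at least one success and no failures, so real dbt failures keep blocking downstream work.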

Propagate Watcher Trigger Rule
..............................

.. versionadded:: 1.14.1

When using ``DbtTaskGroup`` in watcher mode, the producer task may be skipped on retry. By default,
this causes Airflow to skip any tasks downstream of the task group (due to ``trigger_rule="all_success"``).

The ``propagate_watcher_trigger_rule`` setting makes Cosmos automatically set ``trigger_rule="none_failed_min_one_success"``
on tasks wired downstream of a watcher ``DbtTaskGroup`` when the dependency is created from the
``DbtTaskGroup`` side:

- ``dbt_group >> task``
- ``dbt_group.set_downstream(task)``

**Configuration:**

.. code-block:: ini

    [cosmos]
    propagate_watcher_trigger_rule = True

Or via environment variable:

.. code-block:: bash

    export AIRFLOW__COSMOS__PROPAGATE_WATCHER_TRIGGER_RULE=True

**Limitations:**

- This does not work when the dependency is created from the downstream task side
  (e.g., ``task.set_upstream(dbt_group)`` or ``task << dbt_group``), since Cosmos cannot
  intercept methods called on tasks it does not control.
- When enabled, it overrides any user-defined ``trigger_rule`` on downstream tasks with ``"none_failed_min_one_success"``.
  If you need a different ``trigger_rule`` on a downstream task, do not enable this setting and instead
  set ``trigger_rule="none_failed_min_one_success"`` manually on the specific tasks that need it.
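
The override limitation can be illustrated with stand-in objects. This is a sketch only: `FakeTask` and `propagate_trigger_rule` are hypothetical names that loosely mimic the opt-in behaviour described above, without importing Airflow or Cosmos.

```python
class FakeTask:
    """Hypothetical stand-in for a downstream operator."""

    def __init__(self, task_id, trigger_rule="all_success"):
        self.task_id = task_id
        self.trigger_rule = trigger_rule


def propagate_trigger_rule(downstream, enabled):
    """Mimic the opt-in behaviour: overwrite trigger_rule on every downstream task."""
    if not enabled:
        return
    tasks = downstream if isinstance(downstream, (list, tuple)) else [downstream]
    for task in tasks:
        task.trigger_rule = "none_failed_min_one_success"


post_dbt = FakeTask("post_dbt")
cleanup = FakeTask("cleanup", trigger_rule="all_done")  # user-defined rule

# With the setting enabled, both tasks end up with the same rule.
propagate_trigger_rule([post_dbt, cleanup], enabled=True)
```

Here the `all_done` rule chosen for `cleanup` is silently replaced, which is the case where setting the trigger rule manually on selected tasks is the safer choice.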

Watcher dbt Execution Queue
...........................

23 changes: 23 additions & 0 deletions docs/reference/configs/cosmos-conf.rst
@@ -355,6 +355,29 @@ This page lists all available Airflow configurations that affect ``astronomer-co
- Default: ``None``
- Environment Variable: ``AIRFLOW__COSMOS__WATCHER_DBT_EXECUTION_QUEUE``

.. _propagate_watcher_trigger_rule:

`propagate_watcher_trigger_rule`_:
(Introduced in Cosmos 1.14.1) When using ``DbtTaskGroup`` in watcher mode, the producer task may be
skipped on retry (via ``AirflowSkipException``). By default, this causes Airflow to skip any tasks
downstream of the task group due to the default ``trigger_rule="all_success"``.

When this setting is enabled, Cosmos automatically sets ``trigger_rule="none_failed_min_one_success"`` on tasks wired
downstream of a watcher ``DbtTaskGroup`` when the dependency is created from the ``DbtTaskGroup`` side:

- ``dbt_group >> task``
- ``dbt_group.set_downstream(task)``

**Limitations:**

- Does not work when the dependency is created from the downstream task side
(e.g., ``task.set_upstream(dbt_group)`` or ``task << dbt_group``), since Cosmos cannot
intercept methods called on tasks it does not control.
- When enabled, overrides any user-defined ``trigger_rule`` on downstream tasks with ``"none_failed_min_one_success"``.

- Default: ``False``
- Environment Variable: ``AIRFLOW__COSMOS__PROPAGATE_WATCHER_TRIGGER_RULE``
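
As a quick illustration of how the environment variable name relates to the section and key, the helper below builds the `AIRFLOW__{SECTION}__{KEY}` name and parses a boolean-like value. Both functions are hypothetical sketches rather than Airflow's implementation, and the set of accepted spellings is an assumption.

```python
import os


def airflow_env_var(section, key):
    """Build the environment variable name for an Airflow config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


def read_bool(section, key, fallback=False, environ=os.environ):
    """Read a boolean option from the environment, falling back when unset."""
    raw = environ.get(airflow_env_var(section, key))
    if raw is None:
        return fallback
    value = raw.strip().lower()
    if value in ("true", "t", "1"):
        return True
    if value in ("false", "f", "0"):
        return False
    raise ValueError(f"not a boolean: {raw!r}")


name = airflow_env_var("cosmos", "propagate_watcher_trigger_rule")

# Unset: the fallback (False) applies, matching the documented default.
unset = read_bool("cosmos", "propagate_watcher_trigger_rule", environ={})

# Set to "True": the feature is switched on.
enabled = read_bool("cosmos", "propagate_watcher_trigger_rule", environ={name: "True"})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.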

[openlineage]
+++++++++++++
