Move dataset emission for ExecutionMode.WATCHER from producer to consumer sensors#2507
Conversation
f00a0ba to
c1654b4
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2507 +/- ##
==========================================
+ Coverage 97.92% 97.97% +0.05%
==========================================
Files 103 103
Lines 7312 7455 +143
==========================================
+ Hits 7160 7304 +144
+ Misses 152 151 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR moves OpenLineage-compatible dataset/asset emission in ExecutionMode.WATCHER from the producer task to the individual consumer tasks by deriving per-node outlet URIs from the dbt manifest and propagating them through XCom/TriggerEvents.
Changes:
- Add adapter-aware dataset namespace resolution plus manifest-based per-node outlet URI computation utilities.
- Propagate outlet URIs from producer → consumer via compressed node-finished events (dbt-runner path) and XCom status payloads (subprocess path).
- Disable producer dataset emission and emit datasets/assets from consumer sensors on success; add/adjust tests accordingly.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
cosmos/dataset.py |
Adds namespace derivation, URI construction, and manifest parsing to compute per-node outlet URIs. |
cosmos/operators/watcher.py |
Producer injects outlet URIs into node-finished event payloads; consumer emits datasets/assets on success. |
cosmos/operators/_watcher/base.py |
Subprocess log parsing now propagates outlet URIs via XCom status payloads; consumer pulls URIs from XCom. |
cosmos/operators/_watcher/triggerer.py |
Trigger extracts/passes outlet URIs through TriggerEvent payloads for deferrable consumers. |
cosmos/airflow/graph.py |
Ensures WATCHER producer task does not emit datasets. |
tests/test_dataset.py |
Adds tests for namespace resolution, URI construction, and manifest-based URI computation. |
tests/operators/test_watcher.py |
Updates expectations for new event/XCom payload shapes including outlet URIs. |
tests/hooks/test_subprocess.py |
Updates XCom push assertions to match the new subprocess status payload shape. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tatiana
left a comment
There was a problem hiding this comment.
@pankajkoti thanks a lot for the work on this. Really excited for this improvement.
A few questions about the implementation:
- At which point is the
manifest.jsoncreated? - Is there a risk that the consumer nodes would have completed running before the dataset URIs had been computed and populated by the producer?
- Are consumer tasks blocked from completing until the producer task finishes? The screenshots didn't include the Gantt chart; if you could confirm, that would be great.
Please, could you rebase and let me know after that's completed, so I can take another look before approving? Also, if you could increase the test coverage, it would be great. I left some minor feedback inline.
c1654b4 to
85c4b77
Compare
ExecutionMode.WATCHER from producer to consumer sensors
fcad845 to
185e106
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Enable OpenLineage-compatible dataset emission in WATCHER execution mode by deriving per-model outlet URIs from the dbt manifest and propagating them from the producer to consumer tasks via XCom and TriggerEvents. Key changes: - Add adapter-aware namespace resolution and manifest-based URI computation to cosmos/dataset.py - Disable dataset emission on the producer task (consumers handle their own) - Propagate outlet URIs through XCom (subprocess path) and compressed event payloads (dbt-runner path) to consumers - Consumer sensors emit Airflow Dataset/Asset objects on successful completion - Add comprehensive tests for namespace resolution, URI construction, and manifest parsing Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Align adapter namespace resolvers with OL's extract_namespace (fix schemes for sqlserver, athena; add dremio, glue; use fix_account_name for Snowflake; match Spark port-by-method logic) - Remove adapters not supported by OL to keep WATCHER and LOCAL mode dataset emission consistent - Extract shared construct_dataset_uri and remove _create_asset_uri - Guard manifest read for non-dataset resource types - Use specific exceptions in get_dataset_namespace - Fix execute_complete event typing to dict[str, Any] - Document manifest unavailability and bump log to warning Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add thread-safe lock for manifest URI computation - Use real Version objects in tests instead of MagicMock - Add dataset URI pattern documentation and execution mode comparison - Update watcher doc for consumer-based emission - Improve test coverage for dataset and watcher modules Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Return None for unknown adapters instead of generic fallback to avoid emitting URIs that don't match LOCAL mode behavior - Use AIP-60 compliant URIs in emit_datasets tests - Update get_dataset_namespace docstring Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Return None for unknown adapters to skip dataset emission - Use AIP-60 compliant URIs in tests - Remove unreachable try/except for Dataset import - Add sentinel key to avoid repeated manifest reads on empty results - Fix Glue namespace to return empty string when account ID cannot be determined - Add triggerer tests for dict XCom and outlet URI propagation Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
XCom *_status values are always dicts now; remove isinstance checks and update all tests to use dict format consistently. Closes: #2382 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
11e5fe4 to
c7dfd01
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Use dict XCom format for skipped source freshness nodes - Reject incomplete scheme-only namespaces (e.g. databricks://) - Fix snowflake ImportError test to use builtins.__import__ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
_push_source_freshness_results was pushing plain strings to *_status XCom keys. Align with the dict format expected by _get_node_status in consumer sensors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Catch OSError (includes PermissionError) in profile and manifest reading to avoid breaking the producer task - Fix changelog bullet formatting to match existing style Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| def _emit_datasets(self, context: Context) -> None: | ||
| """Emit Airflow datasets for this consumer task's model using outlet URIs from the producer.""" | ||
| if not getattr(self, "emit_datasets", False): | ||
| return | ||
| outlet_uris = getattr(self, "_outlet_uris", []) | ||
| if not outlet_uris: | ||
| return | ||
|
|
||
| from cosmos import settings | ||
| from cosmos.constants import AIRFLOW_VERSION | ||
|
|
||
| if AIRFLOW_VERSION.major >= 3: | ||
| from airflow.sdk.definitions.asset import Asset | ||
| else: | ||
| from airflow.datasets import Dataset as Asset # type: ignore[no-redef] | ||
|
|
||
| outlets = [Asset(uri=uri) for uri in outlet_uris] | ||
| logger.info("Emitting %d dataset(s) for model '%s': %s", len(outlets), self.model_unique_id, outlet_uris) | ||
| self.register_dataset([], outlets, context) | ||
|
|
||
| if settings.enable_uri_xcom: | ||
| context["ti"].xcom_push(key="uri", value=outlet_uris) | ||
|
|
||
| def execute(self, context: Context, **kwargs: Any) -> None: # type: ignore[override] | ||
| super().execute(context, **kwargs) | ||
| # If we reach here without deferring, the model succeeded — emit datasets | ||
| self._emit_datasets(context) | ||
|
|
||
| def execute_complete(self, context: Context, event: dict[str, Any]) -> None: | ||
| # Extract outlet URIs from trigger event before parent handles status | ||
| self._outlet_uris = event.get("outlet_uris", []) | ||
| super().execute_complete(context, event) | ||
| # If we reach here without raising, the model succeeded — emit datasets | ||
| self._emit_datasets(context) |
There was a problem hiding this comment.
Hello @pankajkoti sorry to jump in, I was following these changes and I have a question (partly for curiosity and partly because I'm interested in porting this to other WATCHER modes).
Why not add this to BaseConsumerSensor, so that this logic is available for all WATCHER modes (and therefore implements dataset emission for all of them)?
Thank you!
There was a problem hiding this comment.
Hi @vricciardulli , yeah, right, I didn't think it deep enough. All I had in mind is we currently do not support datasets emission in Kubernetes execution modes and hence, did not consider the watcher kubernetes+ modes and went for the simple to start with approach. But happy to move it to BaseConsumerSensor and if it can be tested that it works on other extended execution modes that would be super nice. Would be willing to try it out @vricciardulli, that would be great help ?
There was a problem hiding this comment.
Thanks for your quick reply, appreciate it! sure I'll open a PR :)
1.14.0 (2026-04-07) --------------------- Breaking Changes * Drop support for Airflow versions earlier than **2.9** by @jedcunningham in #2288 * Fix inclusion of package models and selection/exclusion behavior by @pankajkoti in #2357 * ``ExecutionMode.WATCHER``: The per-node ``*_status`` XCom value is now a dict (``{"status": "<status>", "outlet_uris": [...]}``) instead of a plain string. Any custom code that reads these internal XCom keys directly will need to be updated by @pankajkoti in #2507 Features * Add cluster policy support for ``ExecutionMode.WATCHER`` sensor retries by @astro-anand in #2293 * Add debug mode to track memory utilization by @tatiana in #2327 * Add FQN selection support for ``LoadMode.DBT_MANIFEST`` by @pankajastro in #2375 * Introduce interceptors for Cosmos tasks by @tatiana in #2419 * Add config to allow disabling dag versioning by @pankajkoti in #2470 * Implement TaskGroups by models folder by @maximilianoarcieri and @tatiana in #1566, #2469, and #2420 * feat: implement DbtTestWatcherOperator by @michal-mrazek in #2447 * Add source freshness aware execution for ``ExecutionMode.WATCHER`` by @pankajastro and @tatiana in #2467 * Note: Like ``ExecutionMode.WATCHER``, this feature is experimental and its interface and implementation can change in the future. * Add Airflow 3.2 support by @pankajastro and @pankajkoti in #2472 Enhancements * Add watcher mode support for dbt test node states by @michal-mrazek in #2318 * Rename watcher-mode sensor retry queue and reuse it for producer tasks by @pankajastro in #2331 * Fix leaked semaphore warnings in Airflow 3 by resetting dbt adapters by @pankajkoti in #2335 * Improve dbt Fusion support and related tests by @tatiana in #2356 * Default Snowflake profile mappings to four threads by @tatiana in #2374 * Attempt to remove Pydantic as a dependency by @tatiana in #2377 * Log dbt-core and adapter versions in watcher consumer tasks by @pankajastro in #2412 * Log model errors in watcher consumer on dbt node failure by @pankajastro in #2431 * Reduce XCom read/write for tracking node state and errors in ConsumerWatcher task by @pankajastro in #2471 * Remove duplicate debug log in watcher subprocess path by @tatiana in #2494 * Simplify and unify WATCHER implementation regardless of InvocationMode by @tatiana in #2498 * Switch to lazy imports in cosmos/__init__.py by @pankajkoti in #2531 Bug Fixes * Handle invalid YAML errors with ``LoadMode.DBT_MANIFEST`` and ``RenderConfig.selector`` by @YourRoyalLinus in #2316 * Populate ``compiled_sql`` for ``InvocationMode.SUBPROCESS`` in ``ExecutionMode.WATCHER`` by @pankajkoti in #2319 * Fix select/exclude type mismatch by @tatiana in #2364 * Set ``emit_datasets=False`` for ``DbtTest*`` operators by @pankajastro in #2365 * Set correct queue priority for watcher producer tasks by @pankajastro in #2372 * Preserve ``extra_context`` for watcher consumer task instances by @pankajkoti in #2381 * Respect ``deferrable=False`` from ``operator_args`` on watcher consumer sensors by @pankajkoti in #2384 * Fix watcher queue precedence and add documentation by @pankajastro in #2391 * Do not set ``compiled_sql`` on ``ExecutionMode.WATCHER`` producers by @pankajkoti in #2440 * Remove const attribute for ``__cosmos_telemetry_metadata__`` dag param by @pankajkoti in #2466 * Remove timeout override from Cosmos watcher sensors by @tatiana and @claude in #2478 * Remove forced ``retries=0`` from watcher producer operators by @tatiana in #2479 * RFC: Add patch for newer versions of amazon provider when running dbt on EKS by @aoelvp94 in #2481 * Fix ``cosmos_debug_max_memory_mb`` XCom not pushed in Watcher sensor tasks by @tatiana in #2503 * Fix ``TestBehavior.NONE`` and ``TestBehavior.AFTER_ALL`` exclude ignored with selectors in ``ExecutionMode.WATCHER`` by @pankajkoti in #2511 * Move dataset emission for ``ExecutionMode.WATCHER`` from producer to consumer sensors by @pankajkoti in #2507 Docs * Document cluster policy configuration for ``ExecutionMode.WATCHER`` sensor tasks by @pankajastro in #2315 * Remove outdated docs for the dbt docs plugin with Airflow 3 by @pankajastro in #2353 * Make Watcher DBT Execution Queue heading clickable by @pankajastro in #2354 * Update ``ExecutionMode.WATCHER`` documentation regarding test node implementation by @jroachgolf84 in #2355 * Fix ``pre_dbt_fusion`` configuration rendering by @pankajastro in #2369 * Add documentation for including/excluding nodes based on FQN by @pankajastro in #2371 * Update watcher execution mode documentation by @tatiana in #2380 * Add documentation for ``DbtSeedLocalOperator`` by @jroachgolf84 in #2383 * Fix miscellaneous Sphinx warnings by @pankajastro in #2395 * Improve contributing documentation by @lzdanski in #2397 * Add **Get Started in 5 Minutes** guide by @lzdanski in #2398 * Add Sphinx redirects package for documentation redirects by @lzdanski in #2407 * Restructure **Getting Started** and **Guides** sections by @lzdanski in #2418 * Add open-source quickstart by @lzdanski in #2439 * Fix documentation redirects by @lzdanski in #2442 * Restructure and refactor reference documentation by @lzdanski in #2443 * Add execution modes decision documentation by @lzdanski in #2444 * Add **Core Concepts** page to Getting Started by @lzdanski in #2448 * Add guide: *How Cosmos Works* by @lzdanski in #2449 * Update **Getting Started** overview and index pages by @lzdanski in #2452 * Add guide: *How Cosmos Runs dbt* by @lzdanski in #2453 * Fix miscellaneous documentation links by @lzdanski in #2454 * Add Mermaid diagrams and execution mode diagrams by @lzdanski and @tatiana in #2459 * Add documentation for memory optimization options by @pankajastro in #2340 * Fix typo in watcher execution mode docs by @evanvolgas in #2485 * Fix minor documentation issues by @evanvolgas in #2489 * Add troubleshooting note for dbt debug logs in ExecutionMode.WATCHER by @tatiana in #2491 * docs: unify RST header styles across documentation by @jigangz in #2473 * docs: fix env var for rich logging by @vricciardulli in #2514 * docs: update dbt project path example for Airflow 3 Astro compatibility by @yeoreums in #2512 * Document missing Cosmos Airflow config settings in cosmos-conf.rst by @tatiana in #2515 * Split security-privacy policy doc and add dependency cooldown by @pankajkoti in #2519 * Add performance optimization and troubleshooting docs by @pankajkoti in #2521 * Update copyright year to 2026 by @tayloramurphy in #2527 * docs: Updating "Project Policies" to "Policies" in menu bar by @jroachgolf84 in #2526 Others * Fix tests after removing support for Airflow versions earlier than 2.9 by @tatiana in #2321 * Enable listener tests for Airflow 3.1 by @pankajastro in #2348 * Accept ``int`` or ``float`` for ``cosmos_debug_max_memory_mb`` in integration tests by @pankajkoti in #2352 * Update ``CODEOWNERS`` to prioritize ``oss-integrations`` by @tatiana in #2359 * Fix automatic reviewer assignment in GitHub by @tatiana and @phanikumv in #2360 * Improve PyPI tagging by @tatiana in #2363 * Add integration tests for dbt Fusion and ``ExecutionMode.WATCHER`` by @tatiana in #2373 * Fix Zizmor check by @tatiana in #2376 * Remove ``methodtools`` dependency by @tatiana in #2378 * Improve comments on #2389 by @evanvolgas in #2394 * Refactor ``load_from_dbt_manifest`` to reduce code complexity by @pankajkoti in #2399 * Refactor ``_handle_no_precursors_or_descendants`` to reduce complexity by @pankajkoti in #2400 * Improve issue templates by @tatiana in #2401 * Avoid running tests when only docs change by @tatiana in #2402 * Add ``no-reload`` target for serving docs locally by @pankajkoti in #2405 * Fix test hash checks on macOS by @tatiana in #2406 * Attempt deterministic dbt project copy in test fixtures by @pankajkoti in #2409 * Pin ``virtualenv <21`` due to hatch incompatibility in CI by @pankajkoti in #2410 * Revert virtualenv pin for hatch installation in CI by @pankajkoti in #2426 * Add version comments for commit SHA pinned GitHub Actions by @pankajkoti in #2436 * Fix ``hatch run docs:build`` issues by @tatiana in #2437 * Minor code improvements by @dnskr in #2446 * Pre-commit autoupdate by @pre-commit-ci in #2367, #2396, #2422, #2451, #2468, #2495, and #2516 * Add file to support Claude understanding the Cosmos repository by @tatiana in #2458 * Dependency updates by @dependabot in #2368, #2425, #2435, #2465, #2475, #2504, #2518, and #2528 * Isolate Scarf telemetry integration test into its own CI job by @pankajkoti and @claude in #2477 * ci: upgrade Airflow version to 3.1 in MyPy type-check job by @yeoreums in #2506 * Add commit message guidelines to CLAUDE.md by @pankajkoti in #2509 * Extend skipping tests in CI for more non-code file changes by @pankajkoti in #2510 * Add Dependabot pre-commit support with 7-day cooldown by @pankajkoti in #2517 * Enforce zero warnings policy for documentation by @dnskr in #2513 Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com> Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com> Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>





Enable OpenLineage-compatible dataset emission in WATCHER execution mode by deriving per-model outlet URIs from the dbt manifest and propagating them from the producer to consumer tasks via XCom and TriggerEvents.
Key changes:
Testing
I tested all combinations of the following and ensured there is no regression and the new intended work yields consistent expected results:
ProfileMappingvsprofiles_yml_filepathapproachesThis included all permutations across these configurations.
Closes: #2382