Support cross-referencing models across dbt projects using dbt-loom #2271
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff (main vs. #2271)
| | main | #2271 | +/- |
|---|---|---|---|
| Coverage | 97.99% | 97.99% | |
| Files | 100 | 100 | |
| Lines | 6431 | 6440 | +9 |
| Hits | 6302 | 6311 | +9 |
| Misses | 129 | 129 | |
Pull request overview
This PR adds support for multi-project dbt setups using dbt-loom, enabling Cosmos to handle cross-project references where downstream dbt projects reference models from upstream projects.
Changes:
- Cosmos now skips external nodes (those without file paths) injected by dbt-loom during DAG generation
- Added comprehensive documentation for multi-project setups with configuration examples
- Included test coverage for the external node skipping behavior
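The external node skipping described above can be sketched roughly as follows. This is a minimal illustration, not the code in `cosmos/dbt/graph.py`: the function name `split_local_and_external` and the sample node `stg_orders` are hypothetical, and it assumes that a node injected by dbt-loom shows up in the manifest without an `original_file_path` value.

```python
from typing import Any


def split_local_and_external(
    manifest_nodes: dict[str, dict[str, Any]],
) -> tuple[dict[str, dict[str, Any]], list[str]]:
    """Separate nodes that have a file path from nodes that do not."""
    local: dict[str, dict[str, Any]] = {}
    external: list[str] = []
    for unique_id, node in manifest_nodes.items():
        # Assumption: an external node injected by dbt-loom has no
        # (or an empty) "original_file_path" entry.
        if node.get("original_file_path"):
            local[unique_id] = node
        else:
            external.append(unique_id)
    return local, external


# Hypothetical manifest fragment: one local model that depends on one
# model owned by an upstream project.
nodes = {
    "model.finance.fct_revenue": {
        "original_file_path": "models/fct_revenue.sql",
        "depends_on": {"nodes": ["model.platform.stg_orders"]},
    },
    "model.platform.stg_orders": {"depends_on": {"nodes": []}},
}

local_nodes, external_ids = split_local_and_external(nodes)
print(sorted(local_nodes))
print(external_ids)
```

Only the entries in `local_nodes` would become tasks in this sketch; the external IDs are kept here just to make the split visible.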
Reviewed changes
Copilot reviewed 32 out of 34 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| cosmos/dbt/graph.py | Added logic to skip nodes without file paths in both manifest and dbt ls parsing methods |
| tests/dbt/test_graph.py | Added test to verify external nodes from dbt-loom are properly skipped |
| docs/configuration/multi-project.rst | New comprehensive documentation explaining multi-project setups with dbt-loom |
| docs/configuration/index.rst | Added multi-project documentation to the configuration index |
| pyproject.toml | Added dbt-loom as an optional dependency |
| scripts/test/pre-install-airflow.sh | Added dbt-loom installation to test setup |
| dev/dags/dbt_loom_dags.py | Example DAG demonstrating multi-project setup |
| dev/dags/dbt/dbt_loom_upstream_platform/* | Example upstream dbt project files |
| dev/dags/dbt/dbt_loom_downstream_finance/* | Example downstream dbt project files |
Hi @pankajkoti ! This work is very exciting - it's really cool to be able to have a dbt Mesh feature in Cosmos without the need to lock into a proprietary platform. I'd love it if you could address the following points, as we discussed yesterday:
Add two minimal dbt projects to demonstrate dbt Loom behavior:
- platform_project: upstream with 2 public models, 1 macro, 1 source
- finance_project: downstream with cross-project refs and dbt Loom config

These projects verify that dbt Loom only injects models (not macros or sources) cross-project.

Add two example DAGs to test Cosmos compatibility with dbt Loom:
- dbt_loom_platform_dag.py: Upstream project with public models
- dbt_loom_finance_dag.py: Downstream project using dbt Loom cross-project refs
dbt Projects:
- Rename platform_project -> dbt_loom_upstream_platform
- Rename finance_project -> dbt_loom_downstream_finance
- Add comprehensive seed data (customers, orders, order_items, products)
- Add staging models with public access for cross-project refs
- Add intermediate models (int_orders_enriched, int_customer_orders)
- Add finance models (fct_revenue, fct_customer_revenue, dim_payment_methods)
- Update profiles to use PostgreSQL
- Consolidate into single DAG with chained task groups

Documentation:
- Add docs/configuration/multi-project.rst with comprehensive guide
- Cover cross-project model references using dbt-loom
- Document patterns for cross-project sources and macros
- Include Cosmos DAG configuration examples
- Add troubleshooting and best practices sections
Remove curly braces from ref() example in docstring to prevent Airflow from trying to render it as a Jinja template.
Thanks for the detailed review @tatiana . I have addressed the feedback.
Done so, thanks!
Yes, verified this, works smoothly. Added snapshots in the docs as we discussed earlier today.
Yes, confirmed that the syntax is exactly the same as dbt Mesh.
Yes, I tested this with both projects using
Yes, highlighted this in our docs now.
No, this is not currently supported. Each DbtTaskGroup (or DbtDag) is configured with a single ProjectConfig that points to one dbt project. Requesting re-review, please!
Pull request overview
Copilot reviewed 35 out of 41 changed files in this pull request and generated 3 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Hi @pankajkoti, this looks great, thank you very much!
I have one last concern, discussed in the thread https://github.com/astronomer/astronomer-cosmos/pull/2271/changes#r2736991273.
Once you address this and the checks are passing, please feel free to merge the PR.
Features

* Support cross-referencing models across dbt projects using dbt-loom by @pankajkoti in #2271
* Support use of YAML selectors when using ``LoadMode.DBT_MANIFEST`` by @YourRoyalLinus in #2261
* Introduce ``ExecutionMode.WATCHER_KUBERNETES`` to use the watcher with ``KubernetesPodOperator`` by @tatiana in #2207
* Add support for StarRocks profile mapping by @kurkim0661 in #2256
* Allow pushing URIs as XComs for Cosmos tasks by @corsettigyg in #2275
* Support defining custom callbacks alongside the ``WATCHER_KUBERNETES`` callback by @johnhoran in #2307

Enhancements

* Refactor: remove duplicate ``_construct_dest_file_path`` by @jx2lee in #2077
* Leverage Airflow ``::group::`` to group logs associated with DAG parsing by @tatiana in #2235
* Refactor ``DbtConsumerWatcherSensor`` for reusability by @tatiana in #2245
* Restore plain text output when using ``ExecutionMode.WATCHER`` by @tiovader in #2241

Bug Fixes

* Fix running empty models or ephemeral nodes in ``ExecutionMode.WATCHER`` by @tatiana in #2279
* Improve watcher producer task priority in scheduling and the UI by @tatiana in #2237
* Fix typos and formatting issues in documentation by @pankajkoti in #2259
* Allow watcher producer retries without erroring by @tatiana in #2283
* Fix ``TestBehavior.AFTER_ALL`` is missing project_name information when loading project using manifest file by @tuantran0910 in #2242
* Fix duplicate log lines in watcher subprocess execution and format timestamps by @pankajkoti in #2301

Docs

* Add Watcher Kubernetes documentation by @tatiana in #2303
* Document newly added telemetry metrics in the privacy notice by @pankajkoti in #2249
* Add compatibility policy document by @pankajastro in #2251
* Improve watcher documentation related to dbt threads by @tatiana in #2273
* Fix link in watcher execution mode documentation by @jedcunningham in #2277
* Update Apache Airflow minimum compatibility policy by @tatiana in #2285
* Clarify Cosmos runtime support until "End of Basic Support" by @jedcunningham in #2286
* Update watcher docs by @tatiana in #2298
* Update watcher kubernetes documentation by @tatiana in #2306

Others

* Add Airflow 3 DAG versioning tests for Cosmos by @michal-mrazek in #2177
* Add dbt Core 1.11 to the test matrix by @tatiana in #2230
* Add integration tests using InvocationMode.SUBPROCESS and validate output by @tatiana in #2287
* Fix main branch failing tests by @tatiana in #2296
* Update pre-commit hooks to the latest versions by @jedcunningham in #2289
* Pre-commit autoupdates by @pre-commit in #2222, #2264, #2274 and #2290
* Dependabot updates by @dependabot in #2218, #2219, #2220, #2280 and #2284
* Add Scarf metrics to understand Cosmos feature usage patterns

  - Add telemetry tracking for dbt docs plugin usage by @pankajkoti in #2240
  - Add DAG run telemetry metrics for load mode, invocation, and render_config parameters by @pankajkoti in #2223
  - Collect profile metrics for DAG runs by @pankajastro in #2228
  - Compress telemetry metadata to reduce serialized DAG size by @pankajkoti in #2252
  - Skip storing telemetry metadata when emission is disabled by @pankajkoti in #2278
  - Hide telemetry metadata parameters from the Airflow trigger UI by @pankajkoti in #2247

closes: astronomer/oss-integrations-private#317

---------

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
…ternal nodes

When using the `+` (precursor) graph selector with dbt-loom cross-project references, `select_node_precursors` crashes with a `KeyError` because external nodes (injected by dbt-loom) are filtered out during manifest loading but local nodes still reference them in `depends_on`.

The dbt-loom support added in astronomer#2271 correctly skips external nodes (those without `original_file_path`) during manifest loading. However, when the `+` graph operator traverses upstream dependencies, it encounters `depends_on` entries pointing to these filtered-out external nodes and raises a `KeyError`.

This fix adds bounds checks in two locations:
- `GraphSelector.select_node_precursors`: skip node IDs not present in the nodes dict during upstream traversal
- `NodeSelector.select_nodes_ids_by_intersection`: skip external node IDs that were collected during graph traversal but are not in the nodes dict

This allows the graph traversal to gracefully stop at project boundaries, which is the correct behavior for cross-project setups where external dependencies are managed by their own DAGs/task groups.

Closes #<TBD>

Co-authored-by: Cursor <cursoragent@cursor.com>
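The membership check described in this commit message can be illustrated with a small, self-contained traversal. This is only a sketch of the idea under stated assumptions: `Node` and `select_precursors` are hypothetical stand-ins, not the Cosmos `GraphSelector.select_node_precursors` implementation, and the node IDs are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical minimal node: an ID plus the IDs it depends on.
    unique_id: str
    depends_on: list[str] = field(default_factory=list)


def select_precursors(nodes: dict[str, Node], root_id: str) -> set[str]:
    """Return root_id plus every upstream node that exists in `nodes`."""
    selected: set[str] = set()
    frontier = {root_id}
    while frontier:
        next_frontier: set[str] = set()
        for node_id in frontier:
            if node_id not in nodes:
                # External node (filtered out at load time): stop at the
                # project boundary instead of raising KeyError.
                continue
            selected.add(node_id)
            next_frontier.update(nodes[node_id].depends_on)
        # Never revisit nodes that were already selected.
        frontier = next_frontier - selected
    return selected


local_nodes = {
    "model.finance.fct_revenue": Node(
        "model.finance.fct_revenue",
        ["model.finance.int_payments", "model.upstream_project.external_model"],
    ),
    "model.finance.int_payments": Node("model.finance.int_payments"),
}

result = select_precursors(local_nodes, "model.finance.fct_revenue")
print(sorted(result))
```

Without the `node_id not in nodes` guard, the lookup `nodes[node_id]` on the external ID would raise the `KeyError` described above.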
#2389)

Fixes a `KeyError` when using the `+` (precursor) graph selector on a project that uses dbt-loom for cross-project references.

cc @pankajkoti @tatiana: This is a follow-up to your dbt-loom support in #2271. The external node skipping works great for basic rendering, but we hit a `KeyError` when combining it with the `+` graph selector. The `+` operator triggers `select_node_precursors`, which traverses `depends_on` entries, and those can point to external nodes that were already filtered out during manifest loading. This code path wasn't exercised by the tests in #2271 since the example DAGs don't use graph selectors.

## Problem

The dbt-loom support added in #2271 correctly skips external nodes (those without `original_file_path`) during manifest loading in `load_from_dbt_manifest`. However, local nodes still have `depends_on` entries pointing to these filtered-out external nodes. When the `+` graph operator traverses upstream dependencies via `select_node_precursors`, it does `nodes[node_id]` on these external node IDs and raises a `KeyError`:

    File "cosmos/dbt/selector.py", line 172, in select_node_precursors
      new_generation.update(set(nodes[node_id].depends_on))
                                ~~~~~^^^^^^^^^
    KeyError: 'model.upstream_project.external_model'

**Reproduction:** Use `select: ["+downstream_model"]` in `RenderConfig` with `LoadMode.DBT_MANIFEST` on a project that uses dbt-loom with cross-project `{{ ref('upstream_project', 'model_name') }}` references.

## Fix

Adds bounds checks in two locations in `cosmos/dbt/selector.py`:

1. **`GraphSelector.select_node_precursors`** (line 172): Skip node IDs not present in the `nodes` dict during upstream traversal
2. **`NodeSelector.select_nodes_ids_by_intersection`** (line 552): Skip external node IDs that were collected during graph traversal but don't exist in the `nodes` dict

This allows the `+` traversal to gracefully stop at project boundaries, which is the correct behavior for cross-project setups where external dependencies are managed by their own DAGs/task groups. This is consistent with how `select_node_descendants` already handles missing parents via `defaultdict(set)`.

## Test plan

- [x] Added `test_select_nodes_by_precursors_with_external_dependency`: creates a graph where a local node's `depends_on` includes an external node ID not in the `nodes` dict, verifies `+` selector returns local nodes without `KeyError`
- [x] All 166 existing selector tests pass
- [x] All existing dbt-loom tests in `test_graph.py` pass

Co-authored-by: Alex Ward <award@Mac.lan>
Co-authored-by: Cursor <cursoragent@cursor.com>
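To make the `defaultdict(set)` remark and the second bounds check more concrete, here is a rough sketch of both ideas. The helper names `build_children_index` and `keep_known_ids` and the sample node IDs are hypothetical and only illustrate the pattern; they are not the functions in `cosmos/dbt/selector.py`.

```python
from collections import defaultdict


def build_children_index(depends_on: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each parent ID to the IDs of the nodes that depend on it."""
    children: dict[str, set[str]] = defaultdict(set)
    for node_id, parent_ids in depends_on.items():
        for parent_id in parent_ids:
            # A parent that is not a local node (for example an external
            # one) simply becomes a key; nothing is looked up for it.
            children[parent_id].add(node_id)
    return children


def keep_known_ids(selected_ids: set[str], known_ids: set[str]) -> set[str]:
    """Drop IDs collected during traversal that are not local nodes."""
    return {node_id for node_id in selected_ids if node_id in known_ids}


# Hypothetical local project: fct_revenue depends on one local model and
# one model owned by an upstream project.
local_depends_on = {
    "model.finance.fct_revenue": [
        "model.finance.int_payments",
        "model.upstream_project.external_model",
    ],
    "model.finance.int_payments": [],
}

children = build_children_index(local_depends_on)
collected = {"model.finance.fct_revenue", "model.upstream_project.external_model"}
filtered = keep_known_ids(collected, set(local_depends_on))
print(sorted(children["model.upstream_project.external_model"]))
print(sorted(filtered))
```

Because `children` is a `defaultdict(set)`, asking for the children of an ID that was never seen returns an empty set rather than raising, which is the behavior the description compares the new checks to.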
(cherry picked from commit 4d86173)
This PR adds support for dbt-loom, enabling Cosmos to work with multi-project dbt architectures where downstream projects reference models from upstream projects.
When using dbt-loom, downstream projects reference upstream models via `{{ ref('upstream_project', 'model_name') }}`. dbt-loom injects these external model references into the downstream project's namespace by reading the upstream project's `manifest.json`.

Cosmos now automatically detects and skips external nodes (those without `original_file_path`) during DAG generation, while still creating tasks for the project's own models. This works for both:
- `LoadMode.DBT_LS` - parsing via `dbt ls`
- `LoadMode.DBT_MANIFEST` - parsing via manifest file

The PR adds the example projects (in `dev/dags/dbt/`):
- `dbt_loom_upstream_platform/` - staging & intermediate models with seeds
- `dbt_loom_downstream_finance/` - finance fact tables referencing upstream models
- `dbt_loom_dags.py` - combined DAG with chained task groups

The PR also adds a comprehensive guide for multi-project setups in `docs/configuration/multi-project.rst`.

closes: #2107

Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>