Skip to content

[Review] Support use of YAML selectors when using LoadMode.DBT_MANIFEST#3

Open
govambam wants to merge 2 commits into
base-for-pr-2261from
review-pr-2261
Open

[Review] Support use of YAML selectors when using LoadMode.DBT_MANIFEST#3
govambam wants to merge 2 commits into
base-for-pr-2261from
review-pr-2261

Conversation

@govambam
Copy link
Copy Markdown
Owner

Recreated from astronomer#2261 for Macroscope review.

Original PR: astronomer#2261 by @YourRoyalLinus
Status: closed (merged)

Recreated using squash merge commit - exact merged state preserved.

Original PR: astronomer#2261

…nomer#2261)

Add support for the use of yaml selectors when loading dbt projects
using `LoadMode.DBT_MANIFEST`. This works by parsing the YAML selectors
found in the manifest file that have already been processed by the dbt
parser into corresponding select and exclude selections that are passed
onto the `select_nodes()` function.

The performance overhead for the initial parse will depend on the user's
`selector.yaml` definition, but in testing it has been reasonable. The
parsing penalty is only paid once, as future selector to select/exclude
mapping access is done from the cache, if enabled (similar to
`dbt_ls_cache`).

#### Key Changes
- Implemented a parser to convert full YAML selector definitions into
corresponding `select` and `exclude` selections
- Is only used for graphs with `RenderConfig.load_method = DBT_MANIFEST`
and a defined `RenderConfig.selector`
- Implemented cache behavior to store and retrieve the parsed YAML
selector definitions
- Added a new `enable_cache_yaml_selectors` cosmos setting to
enable/disable caching
  - Invalidates if `YamlSelector` implementation change
- Ensures users do not have to manually clear the cache if the spec is
changed
- Renamed `dbt_ls_cache` to be more general so it can be used to store
parsed selector yaml definitions
- Cache should always be mutually exclusive for the `dbt_ls` output or
parsed selector yaml

#### Limitations
- I did not implement parser support for the `indirect_selection` or
`default` keywords.
- It seems plausible that both of these can be implemented if desired.
Omitting them was done to limit scope.
- This is not a full YAML selector parser, akin to what dbt implements.
- While much of the implementation is borrowed from the latest version
of `dbt core`, this expects the selector definitions to be in the state
that exists in an unmodified manifest file.
- This approach allows the Cosmos parser to only to handle fully
structured `method-value` selection definitions.
- I kept many high-level parser exceptions in the Cosmos implementation
that are technically redundant with the dbt parser
- This was done to try and catch user-modified manifest files to fail
fast with helpful error messages.
- Modifying the selector definitions found in the manifest file is
considered undefined behavior.

These limitations are reflected in the documentation.

## Related Issue(s)

Closes astronomer#2257

## Breaking Change?

- I reused the existing dbt_ls cache cache key and functionality,
generalizing where possible
- **This is an implementation detail and should not break the public API
nor alter any existing behavior**
- This requires it to be impossible to have a dbt_ls and yaml_selectors
cache at the same time due
- This is technically possible by setting the
`DbtGraph.cache_identifier` to be the same, but this can't be configured
by a user.

These are reflected in the documentation.
@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented Feb 10, 2026

Support YAML selectors in DbtGraph._manifest when RenderConfig.load_method is LoadMode.DBT_MANIFEST, and add caching via DbtGraph.save_yaml_selectors_cache and DbtGraph.get_yaml_selectors_cache

Add YamlSelectors parsing and selector-based filtering when loading from a dbt manifest, and introduce versioned, compressed caching for parsed YAML selectors keyed by DbtGraph.cache_key, with remote and Airflow Variable backends.

📍Where to Start

Start at the selector handling branch in DbtGraph._manifest in graph.py, then review the parsing in YamlSelectors.parse in selector.py and the cache versioning in _calculate_yaml_selectors_cache_current_version in cache.py.


Macroscope summarized 4931036.

Comment thread cosmos/dbt/selector.py
intersection_def_parts = cls._get_list_dicts(definition, "intersection")
include, exclude = cls._parse_include_exclude_subdefs(intersection_def_parts, cache=cache)

intersection = [",".join(include)]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low

dbt/selector.py:1068 _parse_intersection_definition returns [""] when include is empty (all exclusions). Consider checking if include is non-empty before joining, or filter empty strings in parse().

Suggested change
intersection = [",".join(include)]
intersection = [",".join(include)] if include else []

🚀 Want me to fix this? Reply ex: "fix it for me".

🤖 Prompt for AI
In file cosmos/dbt/selector.py around line 1068:

`_parse_intersection_definition` returns `[""]` when `include` is empty (all exclusions). Consider checking if `include` is non-empty before joining, or filter empty strings in `parse()`.

Comment thread cosmos/dbt/graph.py
return settings.enable_cache and settings.enable_cache_dbt_ls and bool(self.dbt_ls_cache_key)
return settings.enable_cache and settings.enable_cache_dbt_ls and bool(self.cache_key)

@functools.lru_cache
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low

dbt/graph.py:675 Using @functools.lru_cache on an instance method prevents garbage collection of the instance. Since this method takes no arguments, consider using @functools.cached_property instead.

Suggested change
@functools.lru_cache
@functools.cached_property

🚀 Want me to fix this? Reply ex: "fix it for me".

🤖 Prompt for AI
In file cosmos/dbt/graph.py around line 675:

Using `@functools.lru_cache` on an instance method prevents garbage collection of the instance. Since this method takes no arguments, consider using `@functools.cached_property` instead.

Comment thread cosmos/dbt/graph.py Outdated
Comment on lines +1077 to +1083
if not cache_dict:
logger.info(f"Cosmos performance: Cache miss for {self.cache_key}")

return self.parse_yaml_selectors(selector_definitions)

cache_version: str = cache_dict["version"]
yaml_selectors: YamlSelectors = cache_dict["yaml_selectors"]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium

dbt/graph.py:1077 Consider checking for yaml_selectors key before accessing it. get_yaml_selectors_cache() only adds this key when both compressed fields are present, so a corrupted or incomplete cache could cause a KeyError.

-            if not cache_dict:
-                logger.info(f"Cosmos performance: Cache miss for {self.cache_key}")
-
-                return self.parse_yaml_selectors(selector_definitions)
-
-            cache_version: str = cache_dict["version"]
-            yaml_selectors: YamlSelectors = cache_dict["yaml_selectors"]
+            if not cache_dict or "yaml_selectors" not in cache_dict:
+                logger.info(f"Cosmos performance: Cache miss for {self.cache_key}")
+                return self.parse_yaml_selectors(selector_definitions)
+
+            cache_version: str = cache_dict.get("version", "")
+            yaml_selectors: YamlSelectors = cache_dict["yaml_selectors"]

🚀 Want me to fix this? Reply ex: "fix it for me".

🤖 Prompt for AI
In file cosmos/dbt/graph.py around lines 1077-1083:

Consider checking for `yaml_selectors` key before accessing it. `get_yaml_selectors_cache()` only adds this key when both compressed fields are present, so a corrupted or incomplete cache could cause a `KeyError`.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it for me

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm working on the fix now. The changes to add the yaml_selectors key check and use .get() for safer version access are being prepared and will be pushed to a new branch.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated load_parsed_selectors method in cosmos/dbt/graph.py with two improvements: (1) Changed line 1077 to check for "yaml_selectors" key presence before accessing it, and (2) Changed line 1082 to use cache_dict.get("version", "") for safe version access with a default fallback. These changes prevent KeyError when the cache is corrupted or incomplete.
Commit c54a769 pushed to #4. When checks pass, it will merge automatically into this PR.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I automatically merged #4 into this PR (4931036).

…#4)

Co-authored-by: macroscopeapp[bot] <170038800+macroscopeapp[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant