
Cache manifest.json parsing across DbtGraph instances #2486

Open
evanvolgas wants to merge 4 commits into astronomer:main from evanvolgas:evanvolgas/cache-manifest-parse-v2


@evanvolgas
Contributor

Problem

When multiple DbtDag or DbtTaskGroup instances share the same manifest.json file, each one independently opens and parses the full JSON during every DagBag import cycle. For large dbt projects with sizable manifests, this redundant parsing adds meaningful overhead to DAG parse time — especially in deployments with many DAGs pointing at the same project.

Solution

Add an lru_cache-backed _load_manifest_cached(path, mtime) function that parses the manifest once and returns the cached result on subsequent calls within the same process. The cache key includes the file's st_mtime, so it auto-invalidates whenever the manifest is rewritten on disk.

Each caller receives a copy.deepcopy() of the cached dict. This is critical because downstream code constructs DbtNode instances that hold references to lists and dicts from the parsed manifest (e.g. tags, config). Without the deep copy, any future mutation of those structures by one DbtGraph instance would silently corrupt the shared cache and affect all subsequent consumers — a subtle aliasing bug.

maxsize=8 bounds memory for the uncommon case where a deployment has many distinct manifest files.
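
The approach described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual diff: `_load_manifest_cached(path, mtime)` is named in the description, while the `load_manifest` wrapper and the sample manifest content are hypothetical.

```python
import copy
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=8)
def _load_manifest_cached(path: str, mtime: float) -> dict:
    # mtime is not used in the body; it only takes part in the cache key,
    # so a rewritten file (new mtime) results in a cache miss.
    with open(path) as fp:
        return json.load(fp)


def load_manifest(path: str) -> dict:
    """Hypothetical wrapper: hand each caller a private copy of the cached dict."""
    mtime = os.path.getmtime(path)
    return copy.deepcopy(_load_manifest_cached(path, mtime))


# Usage: two loads of the same file parse it once and return independent dicts.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.json")
    with open(manifest_path, "w") as fp:
        json.dump({"nodes": {"model.a": {"tags": ["daily"]}}}, fp)

    first = load_manifest(manifest_path)
    second = load_manifest(manifest_path)

    # Mutating one caller's copy must not leak into the other caller's copy.
    first["nodes"]["model.a"]["tags"].append("mutated")
    cache_stats = _load_manifest_cached.cache_info()
```

The deep copy trades some CPU and memory per caller for isolation; whether that trade is worthwhile depends on manifest size, which is not measured here.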

Changes

  • cosmos/dbt/graph.py: Add _load_manifest_cached() with functools.lru_cache; replace open()/json.load() in load_from_dbt_manifest() with copy.deepcopy(_load_manifest_cached(...))
  • tests/dbt/test_graph.py: Three new tests:
    • Cache sharing: 3 DbtGraph instances loading the same manifest produce cache hits
    • Cache invalidation: rewriting the manifest with a different mtime returns fresh data (uses os.utime for deterministic mtime changes, avoiding flaky time.sleep on filesystems with 1s granularity)
    • Selector isolation: two graphs with different select filters sharing a cached manifest produce independent filtered_nodes, confirming the deep copy prevents cross-contamination
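
As a rough illustration of the invalidation test idea, the sketch below sets the mtime explicitly with `os.utime` instead of sleeping. The function names and timestamps are hypothetical and are not taken from the PR's test file.

```python
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=8)
def _load_cached(path: str, mtime: float) -> dict:
    # Stand-in for the cached loader; keyed on (path, mtime).
    with open(path) as fp:
        return json.load(fp)


def load(path: str) -> dict:
    return _load_cached(path, os.path.getmtime(path))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "manifest.json")

    with open(path, "w") as fp:
        json.dump({"version": 1}, fp)
    # Pin atime and mtime to a fixed value so the test does not depend
    # on the filesystem's timestamp granularity.
    os.utime(path, (1_000_000, 1_000_000))
    before = load(path)

    with open(path, "w") as fp:
        json.dump({"version": 2}, fp)
    # A different, explicit mtime forces a new cache key.
    os.utime(path, (2_000_000, 2_000_000))
    after = load(path)

    info = _load_cached.cache_info()
```

Both loads are cache misses because the key changed, and the second load sees the rewritten content.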

Test plan

  • test_load_manifest_cached_shares_across_dags passes
  • test_load_manifest_cached_invalidates_on_file_change passes
  • test_load_manifest_cached_different_selectors_no_interference passes
  • Existing test_load_from_dbt_manifest* tests remain green (no regression)

@tatiana
Collaborator

tatiana commented Mar 24, 2026

@evanvolgas This is a great idea: caching the parsed manifest would definitely help performance. I’m just a bit hesitant about doing so on the local Airflow nodes.

For some context, we actually tried something similar around May/June 2024, where we cached the dbt ls output in the local filesystem as a pickle: #992

While it did help in some cases, the behaviour proved inconsistent. The main issues we ran into were:

  • When users were running Airflow with the Kubernetes Executor instead of Celery, the cache was never reused.
  • With Airflow Celery Executor deployments that have an auto-scaling mechanism, workers would come and go frequently, which caused a lot of cache misses.
  • Depending on the deployment process, caches were often missed after deployments, meaning all Airflow nodes had to be “warmed up” again.

Because of that, we later did a PoC and implemented an Airflow Variable-based cache instead, which significantly improved dbt ls DAG parsing: #1014

It would be great if we could build on those learnings and stay consistent with that caching strategy for manifest.json as well. What do you think?

Add lru_cache to avoid re-parsing the same manifest.json when multiple
DbtDag/DbtTaskGroup instances share a manifest file during DagBag import.

The cache is keyed on (path, mtime) so it auto-invalidates when the file
changes. Each caller receives a deep copy to prevent aliasing bugs if
downstream code ever mutates the parsed dict or its nested structures.

Only local filesystem manifests are cached — remote paths (s3://, gs://,
abfs://) bypass the cache and are loaded directly via ObjectStoragePath,
since os.path.getmtime is not available for remote storage backends.
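
One way the local-versus-remote split could be decided is sketched below.
This is an assumption for illustration only: the helper name
is_local_manifest is hypothetical and the PR's actual check is not shown in
this thread.

```python
from urllib.parse import urlparse


def is_local_manifest(path: str) -> bool:
    """Hypothetical helper: treat paths with no scheme (or file://) as local.

    Caveat: a Windows path such as C:\\target\\manifest.json parses with
    scheme "c", so a real implementation would need extra handling there.
    """
    scheme = urlparse(path).scheme
    return scheme in ("", "file")


# Only local paths would go through the mtime-keyed cache.
local = is_local_manifest("/usr/app/target/manifest.json")
remote = is_local_manifest("s3://bucket/target/manifest.json")
```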
@evanvolgas evanvolgas force-pushed the evanvolgas/cache-manifest-parse-v2 branch from 6fca801 to eb80331 Compare March 24, 2026 15:07
@evanvolgas
Contributor Author

Hi @tatiana, thank you for the thoughtful feedback and for sharing the history behind PRs #992 and #1014 — that context is really helpful.

I completely agree that the Airflow Variable-based cache was the right call for dbt ls. The issues you described with local filesystem caching (K8s executor misses, Celery autoscaling, post-deployment warm-up) are real and well-documented pain points for anything that persists state on disk across process lifetimes.

I think this PR is actually solving a narrower, complementary problem, but I may not have explained that well enough in the description. Let me clarify the intent:

This lru_cache is purely in-process and ephemeral — it only lives for the duration of a single DagBag import cycle. The scenario it targets is when a user defines multiple DbtDag or DbtTaskGroup instances in their DAG files that all point to the same manifest.json. Without this cache, each instance calls json.load() on the same file during the same Python process. With it, the file is read once and the parsed dict is shared (via deepcopy to prevent mutation bugs).

It doesn't persist anything to disk or across processes, so the issues from PR #992 (K8s executor, Celery autoscaling, deployment cache misses) wouldn't apply here — when the process ends, the cache is gone.

That said, I see two options and I'm happy to go either direction:

  1. Keep the lru_cache as a lightweight in-process deduplication layer — it complements the Variable-based cache rather than replacing it. I could add a clear code comment explaining why this approach is used here vs. the Variable-based strategy, to avoid future confusion.

  2. Explore using Airflow Variables for manifest caching — though one concern is that full manifests can be quite large (sometimes multiple MB), which may exceed Airflow's Variable size limits and add DB serialization overhead that could outweigh the json.load() savings.
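
To put a number on the size concern in option 2, a quick check like the one below could be run against a real manifest. The synthetic manifest and the helper name are hypothetical, and no specific Variable size limit is assumed, since that depends on the metadata database in use.

```python
import json


def serialized_size_bytes(manifest: dict) -> int:
    """Hypothetical helper: size of the manifest once serialized to a JSON string."""
    return len(json.dumps(manifest).encode("utf-8"))


# Synthetic stand-in for a parsed manifest; a real one carries far more
# fields per node, so real sizes will usually be larger.
synthetic = {
    "nodes": {
        f"model.pkg.m{i}": {
            "tags": ["daily"],
            "config": {"materialized": "table"},
            "depends_on": {"nodes": []},
        }
        for i in range(5000)
    }
}

size = serialized_size_bytes(synthetic)
size_mb = size / (1024 * 1024)
```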

What do you think? I'm happy to adjust the approach to whatever fits best with the project's direction.

@evanvolgas
Contributor Author

@tatiana just wanted to follow up and see if you had any additional thoughts on this?

@codecov

codecov Bot commented Apr 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.02%. Comparing base (e34d415) to head (9718849).
⚠️ Report is 118 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2486   +/-   ##
=======================================
  Coverage   98.02%   98.02%           
=======================================
  Files         103      103           
  Lines        7173     7183   +10     
=======================================
+ Hits         7031     7041   +10     
  Misses        142      142           


@tatiana tatiana added this to the Cosmos 1.15.0 milestone Apr 7, 2026
Collaborator

@tatiana tatiana left a comment


Hey @evanvolgas - apologies for the delay getting back to you; it’s been a particularly busy period on our side.

Thanks very much for the detailed explanation. Unfortunately, we haven’t had sufficient bandwidth yet to properly review and test the proposed changes. We’re planning to go through the PR thoroughly and aim to include it in the Cosmos 1.15.0 release, which is currently targeted for about a month from now.

In the meantime, have you had a chance to run any benchmarks? It would be really helpful to understand the performance improvements this approach provides. Specifically, it would be great to know:

  • the size of the dbt project used for testing
  • the memory consumption per process

In Airflow, DAGs are parsed not only by the DAG processor/scheduler but also by each task on the worker nodes as part of executor behavior. Because of this, having concrete metrics would help us better assess the impact of these changes on resource utilization across both the DAG processor and worker processes.
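
A benchmark along these lines might start from something like the sketch below, which uses only the standard library. It runs on a synthetic manifest, so the numbers it produces say nothing about a real dbt project; the function names and node shape are hypothetical.

```python
import json
import time
import tracemalloc


def make_synthetic_manifest(n_nodes: int) -> str:
    # Hypothetical node shape; real manifests carry many more fields per node.
    nodes = {
        f"model.pkg.m{i}": {"tags": ["daily"], "config": {"materialized": "view"}}
        for i in range(n_nodes)
    }
    return json.dumps({"nodes": nodes})


def measure_parse(raw: str, repeats: int = 3) -> dict:
    # Timing pass, without tracemalloc, because tracing slows allocation.
    start = time.perf_counter()
    for _ in range(repeats):
        parsed = json.loads(raw)
    seconds = (time.perf_counter() - start) / repeats

    # Memory pass: peak bytes allocated by Python while parsing once.
    tracemalloc.start()
    parsed = json.loads(raw)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    return {
        "nodes": len(parsed["nodes"]),
        "seconds_per_parse": seconds,
        "peak_bytes": peak,
    }


result = measure_parse(make_synthetic_manifest(2000))
```

Running the same measurement against the cached path (including the `copy.deepcopy` cost) would show whether the cache reduces parse time enough to justify holding the parsed dict in memory per process.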

Thanks again for your work on this - looking forward to digging deeper soon.

@corsettigyg
Collaborator

Since we are facing a similar problem on our end regarding performance, I will give my 2 cents here.

The scenario where this cache would shine is very narrow: mostly when multiple Cosmos DAGs use the same manifest and are defined in the same file, which creates other problems (no modularity, etc.). Caching the manifest as a Variable is not feasible either, due to its size.

The solution we are using for now is to speed up parsing via orjson (I created PR #2552), and we have a POC that splits the manifest per DAG before loading it, instead of reusing the same one with different --select flags. The latter would be outside the scope of Cosmos, though (I think the dbt-loom idea should help here, but I have never tried it myself, so I can't say much).
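
For readers unfamiliar with the orjson idea, a minimal sketch is shown below. It is not the implementation from PR #2552, which is not shown in this thread; the function name is hypothetical, and the standard library fallback is an assumption added so the snippet runs without orjson installed.

```python
import json

try:
    import orjson  # third-party and optional
except ImportError:
    orjson = None


def parse_manifest_bytes(raw: bytes) -> dict:
    """Hypothetical helper: parse with orjson when available, else json."""
    if orjson is not None:
        return orjson.loads(raw)
    # json.loads also accepts bytes, so the fallback needs no decoding step.
    return json.loads(raw)


parsed = parse_manifest_bytes(b'{"nodes": {"model.a": {"tags": ["daily"]}}}')
```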

Anyway, I agree with Tatiana that caching is very niche and mostly trivial for the manifest, but I could be wrong 😃

@tatiana
Collaborator

tatiana commented Apr 15, 2026

@evanvolgas I've been working on other projects/tasks. WDYT of trying to use https://docs.python.org/3/library/functools.html#functools.partial to avoid the global vars?

@github-actions

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 15, 2026
@tatiana tatiana self-assigned this May 27, 2026
@github-actions github-actions Bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 29, 2026

4 participants