Cache manifest.json parsing across DbtGraph instances by evanvolgas · Pull Request #2486 · astronomer/astronomer-cosmos

evanvolgas · 2026-03-21T00:36:14Z

Problem

When multiple DbtDag or DbtTaskGroup instances share the same manifest.json file, each one independently opens and parses the full JSON during every DagBag import cycle. For large dbt projects with sizable manifests, this redundant parsing adds meaningful overhead to DAG parse time — especially in deployments with many DAGs pointing at the same project.

Solution

Add an lru_cache-backed _load_manifest_cached(path, mtime) function that parses the manifest once and returns the cached result on subsequent calls within the same process. The cache key includes the file's st_mtime, so it auto-invalidates whenever the manifest is rewritten on disk.
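A minimal sketch of the keying scheme described above, not the PR's actual code. The `load_manifest` wrapper and the throwaway manifest contents are hypothetical, and the deep copy discussed next is left out here to keep the focus on the cache key.

```python
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=8)
def _load_manifest_cached(path: str, mtime: float) -> dict:
    """Parse the manifest once per (path, mtime) pair."""
    # mtime is not used in the body; it only takes part in the cache key,
    # so a rewritten file (new mtime) maps to a new cache entry.
    with open(path) as f:
        return json.load(f)


def load_manifest(path: str) -> dict:
    # Hypothetical wrapper: read the current mtime and pass it as part of the key.
    return _load_manifest_cached(path, os.path.getmtime(path))


# Demo with a throwaway manifest file.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.json")
    with open(manifest_path, "w") as f:
        json.dump({"nodes": {"model.a": {"tags": ["nightly"]}}}, f)

    for _ in range(3):
        manifest = load_manifest(manifest_path)

    info = _load_manifest_cached.cache_info()
    print(info.hits, info.misses)  # 2 1
```

The first call parses the file; the next two are served from the cache because neither the path nor the mtime changed.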

Each caller receives a copy.deepcopy() of the cached dict. This is critical because downstream code constructs DbtNode instances that hold references to lists and dicts from the parsed manifest (e.g. tags, config). Without the deep copy, any future mutation of those structures by one DbtGraph instance would silently corrupt the shared cache and affect all subsequent consumers — a subtle aliasing bug.
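The aliasing risk can be illustrated with plain dicts. This is a sketch with a made-up manifest fragment, not code from the PR: the first half mutates a shared object, the second half mutates a deep copy.

```python
import copy

# Stands in for the dict held by the cache.
cached = {"nodes": {"model.a": {"tags": ["nightly"]}}}

# Without a copy, the consumer aliases the cached nested list.
first = cached
first["nodes"]["model.a"]["tags"].append("mutated")
shared_was_changed = "mutated" in cached["nodes"]["model.a"]["tags"]

# With a deep copy, the mutation stays local to the consumer.
cached = {"nodes": {"model.a": {"tags": ["nightly"]}}}
isolated = copy.deepcopy(cached)
isolated["nodes"]["model.a"]["tags"].append("mutated")
cache_still_clean = cached["nodes"]["model.a"]["tags"] == ["nightly"]

print(shared_was_changed, cache_still_clean)  # True True
```

A shallow `copy.copy()` would not be enough here, since the nested `tags` list would still be shared.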

maxsize=8 bounds memory for the uncommon case where a deployment has many distinct manifest files.
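To show what the bound means in practice, here is a small sketch using a stand-in `parse` function (hypothetical, it does not read any file). Ten distinct keys go through a cache of size 8, so the two least recently used entries are evicted.

```python
import functools


@functools.lru_cache(maxsize=8)
def parse(path: str, mtime: float) -> dict:
    # Stand-in for the real parse; returns a tiny dict instead of reading a file.
    return {"path": path}


# Ten distinct manifests pass through a cache that holds at most eight.
for i in range(10):
    parse(f"/tmp/project_{i}/manifest.json", 0.0)

info = parse.cache_info()
print(info.currsize, info.maxsize)  # 8 8

# project_0 was evicted, so asking for it again is a miss, not a hit.
parse("/tmp/project_0/manifest.json", 0.0)
misses_after = parse.cache_info().misses
print(misses_after)  # 11
```

One consequence of keying on mtime is that an entry for an older mtime of the same path is not removed when the file changes; it simply occupies a slot until LRU eviction pushes it out.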

Changes

- cosmos/dbt/graph.py: Add _load_manifest_cached() with functools.lru_cache; replace open()/json.load() in load_from_dbt_manifest() with copy.deepcopy(_load_manifest_cached(...))
- tests/dbt/test_graph.py: Three new tests:
  - Cache sharing: 3 DbtGraph instances loading the same manifest produce cache hits
  - Cache invalidation: rewriting the manifest with a different mtime returns fresh data (uses os.utime for deterministic mtime changes, avoiding flaky time.sleep on filesystems with 1s granularity)
  - Selector isolation: two graphs with different select filters sharing a cached manifest produce independent filtered_nodes, confirming the deep copy prevents cross-contamination
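The invalidation test idea can be sketched as follows. This is an illustration of the os.utime technique under assumed file contents and timestamps, not the test from tests/dbt/test_graph.py.

```python
import functools
import json
import os
import tempfile


@functools.lru_cache(maxsize=8)
def _load_manifest_cached(path: str, mtime: float) -> dict:
    with open(path) as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "manifest.json")

    # First version of the file, pinned to a fixed mtime.
    with open(path, "w") as f:
        json.dump({"version": 1}, f)
    os.utime(path, (1_000_000_000, 1_000_000_000))  # (atime, mtime)
    before = _load_manifest_cached(path, os.path.getmtime(path))

    # Rewrite the file and move the mtime forward explicitly, so the test
    # does not depend on the filesystem's timestamp granularity.
    with open(path, "w") as f:
        json.dump({"version": 2}, f)
    os.utime(path, (1_000_000_100, 1_000_000_100))
    after = _load_manifest_cached(path, os.path.getmtime(path))

misses = _load_manifest_cached.cache_info().misses
print(before, after, misses)  # {'version': 1} {'version': 2} 2
```

Setting the mtime by hand means both loads are guaranteed to use different cache keys, even if the two writes happen within the same second.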

Test plan

- test_load_manifest_cached_shares_across_dags passes
- test_load_manifest_cached_invalidates_on_file_change passes
- test_load_manifest_cached_different_selectors_no_interference passes
- Existing test_load_from_dbt_manifest* tests remain green (no regression)

tatiana · 2026-03-24T09:37:22Z

@evanvolgas This is a great idea; caching the parsed manifest would definitely help performance. I’m just a bit hesitant about doing so on the local Airflow nodes.

For some context, we actually tried something similar around May/June 2024, where we cached the dbt ls output in the local filesystem as a pickle: #992

While it did help in some cases, the behaviour proved inconsistent. The main issues we ran into were:

- When users were running Airflow with the Kubernetes Executor instead of Celery, the cache was never reused.
- With Airflow Celery Executor deployments that have an auto-scaling mechanism, workers would come and go frequently, which caused a lot of cache misses.
- Depending on the deployment process, caches were often missed after deployments, meaning all Airflow nodes had to be “warmed up” again.

Because of that, we later did a PoC and implemented an Airflow Variable-based cache instead, which significantly improved dbt ls DAG parsing: #1014

It would be great if we could build on those learnings and stay consistent with that caching strategy for manifest.json as well. What do you think?

Add lru_cache to avoid re-parsing the same manifest.json when multiple DbtDag/DbtTaskGroup instances share a manifest file during DagBag import. The cache is keyed on (path, mtime) so it auto-invalidates when the file changes. Each caller receives a deep copy to prevent aliasing bugs if downstream code ever mutates the parsed dict or its nested structures. Only local filesystem manifests are cached — remote paths (s3://, gs://, abfs://) bypass the cache and are loaded directly via ObjectStoragePath, since os.path.getmtime is not available for remote storage backends.
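One way the local-versus-remote split could be decided is by looking at the URL scheme. The helper name `is_local_manifest` and the scheme check are assumptions made for illustration; the PR may detect remote paths differently.

```python
from urllib.parse import urlparse


def is_local_manifest(path: str) -> bool:
    # Hypothetical helper: treat a path without a URL scheme as a local file.
    # Only such paths can be passed to os.path.getmtime to build a cache key.
    return urlparse(path).scheme == ""


paths = [
    "/usr/local/airflow/dbt/target/manifest.json",
    "s3://bucket/target/manifest.json",
    "gs://bucket/target/manifest.json",
    "abfs://container/target/manifest.json",
]
for p in paths:
    route = "lru_cache" if is_local_manifest(p) else "direct load"
    print(f"{p} -> {route}")
```

This simple check would misclassify Windows drive paths such as `C:\...`, whose drive letter parses as a scheme, so a real implementation would need more care.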

evanvolgas · 2026-03-24T17:14:06Z

Hi @tatiana, thank you for the thoughtful feedback and for sharing the history behind PRs #992 and #1014 — that context is really helpful.

I completely agree that the Airflow Variable-based cache was the right call for dbt ls. The issues you described with local filesystem caching (K8s executor misses, Celery autoscaling, post-deployment warm-up) are real and well-documented pain points for anything that persists state on disk across process lifetimes.

I think this PR is actually solving a narrower, complementary problem, but I may not have explained that well enough in the description. Let me clarify the intent:

This lru_cache is purely in-process and ephemeral — it only lives for the duration of a single DagBag import cycle. The scenario it targets is when a user defines multiple DbtDag or DbtTaskGroup instances in their DAG files that all point to the same manifest.json. Without this cache, each instance calls json.load() on the same file during the same Python process. With it, the file is read once and the parsed dict is shared (via deepcopy to prevent mutation bugs).

It doesn't persist anything to disk or across processes, so the issues from PR #992 (K8s executor, Celery autoscaling, deployment cache misses) wouldn't apply here — when the process ends, the cache is gone.

That said, I see two options and I'm happy to go either direction:

- Keep the lru_cache as a lightweight in-process deduplication layer. It complements the Variable-based cache rather than replacing it. I could add a clear code comment explaining why this approach is used here vs. the Variable-based strategy, to avoid future confusion.
- Explore using Airflow Variables for manifest caching. One concern is that full manifests can be quite large (sometimes multiple MB), which may exceed Airflow's Variable size limits and add DB serialization overhead that could outweigh the json.load() savings.

What do you think? I'm happy to adjust the approach to whatever fits best with the project's direction.

evanvolgas · 2026-03-30T18:54:54Z

@tatiana just wanted to follow up and see if you had any additional thoughts on this?

codecov · 2026-04-01T13:47:48Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.02%. Comparing base (e34d415) to head (9718849).
⚠️ Report is 118 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2486   +/-   ##
=======================================
  Coverage   98.02%   98.02%           
=======================================
  Files         103      103           
  Lines        7173     7183   +10     
=======================================
+ Hits         7031     7041   +10     
  Misses        142      142

tatiana

Hey @evanvolgas - apologies for the delay getting back to you; it’s been a particularly busy period on our side.

Thanks very much for the detailed explanation. Unfortunately, we haven’t had sufficient bandwidth yet to properly review and test the proposed changes. We’re planning to go through the PR thoroughly and aim to include it in the Cosmos 1.15.0 release, which is currently targeted for about a month from now.

In the meantime, have you had a chance to run any benchmarks? It would be really helpful to understand the performance improvements this approach provides. Specifically, it would be great to know:

- the size of the dbt project used for testing
- the memory consumption per process
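A rough starting point for such a benchmark, using only the standard library. The synthetic manifest from `make_manifest` is a hypothetical stand-in that is much smaller per node than a real dbt manifest, so the numbers it prints say nothing about real projects.

```python
import copy
import json
import time
import tracemalloc


def make_manifest(n_nodes: int) -> dict:
    # Synthetic stand-in for a dbt manifest; real ones carry far more per node.
    return {
        "nodes": {
            f"model.pkg.m{i}": {
                "tags": ["nightly"],
                "config": {"materialized": "view"},
                "depends_on": {"nodes": []},
            }
            for i in range(n_nodes)
        }
    }


raw = json.dumps(make_manifest(5_000))

# Timing: parse once, then deep copy once (the cost every cache hit pays).
start = time.perf_counter()
parsed = json.loads(raw)
parse_seconds = time.perf_counter() - start

start = time.perf_counter()
copied = copy.deepcopy(parsed)
copy_seconds = time.perf_counter() - start

# Memory: peak Python allocations while holding the cached dict plus one copy.
tracemalloc.start()
cached = json.loads(raw)
consumer = copy.deepcopy(cached)
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"manifest size : {len(raw) / 1e6:.2f} MB")
print(f"json.loads    : {parse_seconds * 1000:.1f} ms")
print(f"copy.deepcopy : {copy_seconds * 1000:.1f} ms")
print(f"peak memory   : {peak_bytes / 1e6:.2f} MB")
```

Comparing the parse time with the deep copy time matters here, because a cache hit only saves time if copying the parsed dict is cheaper than parsing the file again.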

In Airflow, DAGs are parsed not only by the DAG processor/scheduler but also by each task on the worker nodes as part of executor behavior. Because of this, having concrete metrics would help us better assess the impact of these changes on resource utilization across both the DAG processor and worker processes.

Thanks again for your work on this - looking forward to digging deeper soon.

corsettigyg · 2026-04-13T10:04:30Z

Since we are facing a similar problem on our end regarding performance, I will give my 2 cents here.

The scenario where this cache would shine is very narrow: mostly when multiple Cosmos DAGs use the same manifest and are defined in the same file, which creates other problems (no modularity, etc.). Caching the manifest as a Variable is not feasible either, due to its size.

The solution we are using for now is to speed up parsing via orjson (I created PR #2552), and we have a POC that splits the manifest per DAG before loading it, instead of reusing the same one with different --select flags. The latter would be outside the scope of Cosmos, though (I think the dbt-loom idea should help here, but I have never tried it myself, so I can't say much).
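For illustration, swapping the parser could look roughly like this. It is a sketch and not the content of PR #2552: `parse_manifest` is a hypothetical name, and the fallback to the standard library is an assumption so the snippet runs when orjson is not installed.

```python
import json

try:
    import orjson  # third-party; install separately

    def parse_manifest(data: bytes) -> dict:
        # orjson.loads accepts bytes directly, so the file can be read in binary mode.
        return orjson.loads(data)

except ImportError:

    def parse_manifest(data: bytes) -> dict:
        # Fallback: json.loads also accepts UTF-8 encoded bytes.
        return json.loads(data)


sample = b'{"nodes": {"model.a": {"tags": ["nightly"]}}}'
result = parse_manifest(sample)
print(result["nodes"]["model.a"]["tags"])  # ['nightly']
```

Both branches return plain dicts and lists, so downstream code would not need to change.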

Anyway, I agree with Tatiana that caching is very niche and mostly trivial for the manifest, but I could be wrong 😃

tatiana · 2026-04-15T10:27:34Z

@evanvolgas I've been working on other projects/tasks. WDYT of trying to use https://docs.python.org/3/library/functools.html#functools.partial to avoid the global vars?

github-actions · 2026-05-15T11:39:39Z

This PR is stale because it has been open for 30 days with no activity.

evanvolgas requested review from a team, corsettigyg, dwreeves and jbandoro as code owners March 21, 2026 00:36

evanvolgas requested review from pankajastro and tatiana March 21, 2026 00:36

evanvolgas had a problem deploying to external March 21, 2026 00:36 — with GitHub Actions Error

pre-commit-ci Bot had a problem deploying to external March 21, 2026 00:36 Error

derekgoering approved these changes Mar 21, 2026

evanvolgas temporarily deployed to external March 23, 2026 22:58 — with GitHub Actions Inactive

evanvolgas force-pushed the evanvolgas/cache-manifest-parse-v2 branch from 6fca801 to eb80331 Compare March 24, 2026 15:07

evanvolgas had a problem deploying to external March 24, 2026 15:07 — with GitHub Actions Error

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

2d1678b

pre-commit-ci Bot had a problem deploying to external March 24, 2026 15:08 Error

Evan Volgas added 2 commits March 24, 2026 14:50

Merge branch 'main' into evanvolgas/cache-manifest-parse-v2

4a2bf0f

Merge branch 'main' into evanvolgas/cache-manifest-parse-v2

9718849

evanvolgas temporarily deployed to external March 27, 2026 16:29 — with GitHub Actions Inactive

tatiana added this to the Cosmos 1.15.0 milestone Apr 7, 2026

tatiana reviewed Apr 7, 2026

github-actions Bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 15, 2026

tatiana self-assigned this May 27, 2026

github-actions Bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label May 29, 2026

evanvolgas wants to merge 4 commits into astronomer:main from evanvolgas:evanvolgas/cache-manifest-parse-v2

4 participants
