Cache manifest.json parsing across DbtGraph instances#2486
Conversation
|
@evanvolgas This is a great idea-caching the parsed manifest would definitely help performance. I’m just a bit hesitant about doing so on the local Airflow nodes. For some context, we actually tried something similar around May/June 2024, where we cached the dbt ls output in the local filesystem as a pickle: #992 While it did help in some cases, the behaviour proved inconsistent. The main issues we ran into were:
Because of that, we later did a PoC and implemented a dbt Variable–based cache instead, which significantly improved dbt ls DAG parsing: It would be great if we could build on those learnings and stay consistent with that caching strategy for |
Add lru_cache to avoid re-parsing the same manifest.json when multiple DbtDag/DbtTaskGroup instances share a manifest file during DagBag import. The cache is keyed on (path, mtime) so it auto-invalidates when the file changes. Each caller receives a deep copy to prevent aliasing bugs if downstream code ever mutates the parsed dict or its nested structures. Only local filesystem manifests are cached — remote paths (s3://, gs://, abfs://) bypass the cache and are loaded directly via ObjectStoragePath, since os.path.getmtime is not available for remote storage backends.
6fca801 to
eb80331
Compare
|
Hi @tatiana, thank you for the thoughtful feedback and for sharing the history behind PRs #992 and #1014 — that context is really helpful. I completely agree that the Airflow Variable-based cache was the right call for I think this PR is actually solving a narrower, complementary problem, but I may not have explained that well enough in the description. Let me clarify the intent: This It doesn't persist anything to disk or across processes, so the issues from PR #992 (K8s executor, Celery autoscaling, deployment cache misses) wouldn't apply here — when the process ends, the cache is gone. That said, I see two options and I'm happy to go either direction:
What do you think? I'm happy to adjust the approach to whatever fits best with the project's direction. |
|
@tatiana just wanted to follow up and see if you had any additional thoughts on this? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2486 +/- ##
=======================================
Coverage 98.02% 98.02%
=======================================
Files 103 103
Lines 7173 7183 +10
=======================================
+ Hits 7031 7041 +10
Misses 142 142 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Hey @evanvolgas - apologies for the delay getting back to you; it’s been a particularly busy period on our side.
Thanks very much for the detailed explanation. Unfortunately, we haven’t had sufficient bandwidth yet to properly review and test the proposed changes. We’re planning to go through the PR thoroughly and aim to include it in the Cosmos 1.15.0 release, which is currently targeted for about a month from now.
In the meantime, have you had a chance to run any benchmarks? It would be really helpful to understand the performance improvements this approach provides. Specifically, it would be great to know:
- the size of the dbt project used for testing
- the memory consumption per process
In Airflow, DAGs are parsed not only by the DAG processor/scheduler but also by each task on the worker nodes as part of executor behavior. Because of this, having concrete metrics would help us better assess the impact of these changes on resource utilization across both the DAG processor and worker processes.
Thanks again for your work on this - looking forward to digging deeper soon.
|
Since we are facing a similar problem on our end regarding performance, I will give my 2 cents here. the scenario where this cache would shine is very narrow and mostly if multiple cosmos DAGs are using the same manifest and are defined in the same file, which creates other problems (no modularity, etc). caching the manifest as a variable is not feasible at all either due to its size. The solution we are using here for now is to speed up the parsing via orjson (created a PR #2552) and we have a POC to try to split the manifest per dag before loading it instead of reusing the same one with different --select flags. For the later, it would be outside of the scope of cosmos though (although I think dbt-loom idea should help out here, but I have never tried it out myself so cant say much) Anyway, I agree with Tatiana that caching is very niche and mostly trivial for the manifest, but I could be wrong 😃 |
|
@evanvolgas I've been working on other projects/tasks. WDYT of trying to use https://docs.python.org/3/library/functools.html#functools.partial to avoid the global vars? |
|
This PR is stale because it has been open for 30 days with no activity. |
Problem
When multiple
DbtDagorDbtTaskGroupinstances share the samemanifest.jsonfile, each one independently opens and parses the full JSON during every DagBag import cycle. For large dbt projects with sizable manifests, this redundant parsing adds meaningful overhead to DAG parse time — especially in deployments with many DAGs pointing at the same project.Solution
Add an
lru_cache-backed_load_manifest_cached(path, mtime)function that parses the manifest once and returns the cached result on subsequent calls within the same process. The cache key includes the file'sst_mtime, so it auto-invalidates whenever the manifest is rewritten on disk.Each caller receives a
copy.deepcopy()of the cached dict. This is critical because downstream code constructsDbtNodeinstances that hold references to lists and dicts from the parsed manifest (e.g.tags,config). Without the deep copy, any future mutation of those structures by oneDbtGraphinstance would silently corrupt the shared cache and affect all subsequent consumers — a subtle aliasing bug.maxsize=8bounds memory for the uncommon case where a deployment has many distinct manifest files.Changes
cosmos/dbt/graph.py: Add_load_manifest_cached()withfunctools.lru_cache; replaceopen()/json.load()inload_from_dbt_manifest()withcopy.deepcopy(_load_manifest_cached(...))tests/dbt/test_graph.py: Three new tests:os.utimefor deterministic mtime changes, avoiding flakytime.sleepon filesystems with 1s granularity)selectfilters sharing a cached manifest produce independentfiltered_nodes, confirming the deep copy prevents cross-contaminationTest plan
test_load_manifest_cached_shares_across_dagspassestest_load_manifest_cached_invalidates_on_file_changepassestest_load_manifest_cached_different_selectors_no_interferencepassestest_load_from_dbt_manifest*tests remain green (no regression)