[Experimental feature] enable orjson parser for whole project#2552
Conversation
Uses orjson (a Rust-based JSON library) to parse manifest.json when the `enable_orjson_parser` setting is enabled, falling back to stdlib json by default. Benchmarks show ~1.9x speedup (≈47% faster) across manifest sizes from 400 KB to 30 MB. The feature is disabled by default to remain backwards-compatible. Enable with: AIRFLOW__COSMOS__ENABLE_ORJSON_PARSER=True pip install 'astronomer-cosmos[orjson]' Made-with: Cursor
Add optional orjson parser for faster dbt manifest loading
There was a problem hiding this comment.
Pull request overview
Adds an experimental configuration flag to parse dbt manifest.json with orjson for faster DAG parsing, along with documentation, packaging, and tests to support/validate the feature.
Changes:
- Introduce
settings.enable_orjson_parserand document the new Airflow config/env var. - Add optional dependency group
orjsonplus test-environment dependency installation. - Refactor
DbtGraph.load_from_dbt_manifest()to parse manifests via a new_load_manifest_from_file()helper, and add unit tests for behavior/equivalence.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
cosmos/dbt/graph.py |
Adds optional orjson import and _load_manifest_from_file() used by manifest-based graph loading. |
cosmos/settings.py |
Adds enable_orjson_parser setting read from Airflow config. |
docs/reference/configs/cosmos-conf.rst |
Documents the new experimental setting and installation instructions. |
pyproject.toml |
Adds orjson optional extra and installs orjson in the hatch tests environment. |
tests/dbt/test_orjson_parser.py |
New unit tests for default behavior, missing dependency errors, and parser equivalence. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Use open("rb")/open("r") instead of read_bytes() in _load_manifest_from_file
so the method works correctly with ObjectStoragePath (remote object storage),
which may not implement Path.read_bytes()
- Restore the `or {}` fallback in load_from_dbt_manifest to guard against
manifest.json files containing JSON null, preserving the previous behavior
Made-with: Cursor
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…atibility
The new `_load_manifest_from_file` raised `CosmosLoadDbtException` for any
non-dict root, breaking the pre-existing `test_load_from_dbt_manifest_handles_null_manifest`
which expects a JSON `null` manifest to be treated as an empty dict (the old
`json.load(fp) or {}` behavior).
Treat `None` as `{}` while still rejecting genuinely invalid roots (lists,
strings, numbers). Also tightens the null-root test and adds coverage for
invalid non-null, non-dict roots.
Made-with: Cursor
|
fixed CICD issues |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2552 +/- ##
=======================================
Coverage 98.02% 98.02%
=======================================
Files 105 105
Lines 7829 7844 +15
=======================================
+ Hits 7674 7689 +15
Misses 155 155 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
tatiana
left a comment
There was a problem hiding this comment.
@corsettigyg This looks great, thank you very much, and I'm sorry for all the re-work on the original PR. I left two minor comments - if you could address them, it would be great. You have superpowers to merge these comments, and the checks pass.
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The test assertions expected 'astronomer-cosmos[orjson]' in the error message, but the implementation raises an error suggesting 'pip install orjson'. Since no orjson extra is defined in pyproject.toml, update both assertions to match the actual installation guidance. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@tatiana agree with the comments :) it should be all good now 💪 |
Description
closed the old #2320 due to the code change getting convoluted and re-opened here.
Idea is the same, use orjson for better manifest parsing, but keep it only where it matters
Related Issue(s)
Breaking Change?
should not be
Checklist
Benchmark
I patched the graph generator integration test and ran 10 iterations on it for different manifest sizes. In total, I got
What is interesting is that in our real world manifest, which has not only nodes but also descriptions, tests, tags, etc.... the performance improvement of orjson is better than for the mock manifests I created. I read about it and the explanation I can think about is that even tho the real manifest has fewer graph objects, it is still a much fatter json per object, which is where orjson helps the most 🥇
If necessary, I can share the benchmark I ran. just did not commit it since it is not exactly a test per say.