[Experimental feature] enable orjson parser for whole project#2320
[Experimental feature] enable orjson parser for whole project#2320corsettigyg wants to merge 0 commit into
Conversation
✅ Deploy Preview for astronomer-cosmos ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
There was a problem hiding this comment.
Pull request overview
This PR introduces an experimental option to use orjson for dbt manifest parsing to reduce DAG parsing time for large dbt projects.
Changes:
- Add a new
enable_orjson_parserconfiguration flag (off by default) and document it in the Cosmos configuration docs. - Implement conditional manifest loading using
orjsoninDbtGraph.load_from_dbt_manifest, with a clear error when the feature is enabled butorjsonis not installed. - Add tests covering configuration behavior, error paths, and equivalence between
orjsonand the standardjsonparser, and define a neworjsonoptional-dependency extra inpyproject.toml.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
cosmos/dbt/graph.py |
Adds optional orjson-based manifest loading with a guarded feature flag and error handling when orjson is missing. |
cosmos/settings.py |
Introduces the enable_orjson_parser boolean setting wired to Airflow config/environment. |
tests/dbt/test_orjson_parser.py |
Adds tests for default/overridden settings, missing-orjson error, and equivalence between orjson and standard parsing. |
pyproject.toml |
Defines an orjson extra (astronomer-cosmos[orjson]) to install the orjson dependency. |
docs/configuration/cosmos-conf.rst |
Documents the new enable_orjson_parser config option, including env var and behavior when orjson is missing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
This is super exciting, @corsettigyg ! If this were the default, do you have any concerns regarding backwards compatibility - other than the dependency on this Rust JSON parser? |
|
hey @tatiana 😎 honestly, I would argue it is a pretty safe change in general. I have tested it on airflow 2.9 & airflow 2.11 and did not see any issues. The orjson does behave almost exactly like the classic json package. Of course, I came across orjson by chance so cannot say much of the project, but in the local sandbox I have been playing around with it, it has been working exactly the same with very good performances. I'd say to keep it as an experimental feature and if feedback is positive, use it as the only json parser from cosmos 2.0 On my local airflow, I have been testing this branch on our cosmos dags and could not see any issues. Only observation is that the scheduler is suffering less to parse them hahaha obs: to keep complexity <10, had to create the |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2320 +/- ##
==========================================
- Coverage 97.90% 97.90% -0.01%
==========================================
Files 102 103 +1
Lines 6818 6861 +43
==========================================
+ Hits 6675 6717 +42
- Misses 143 144 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@corsettigyg Sorry for commenting while this task is still in draft, but I had an idea for how we could expose this feature - and I thought it was good to share it early in the implementation. At the moment, it seems we're importing the What if we had a And we change the 15 import paths to import json to be And we did not have yet-another-configuration in Cosmos to set this up? This would allow us to remove:
This way, people could benefit from this optimisation - in all parts of Cosmos - by just installing orjson. |
|
hey @tatiana , dont worry, your observations are always more than welcome even if it is a draft 🚀 This is a very good take actually. I will try to do something like that, so the overall performance impact can reverberate across the whole codebase. orjson has some small peculiarities though that do not make it into an one to one equivalent of the standard json package though, but it is worth the try. will try to do it next week |
|
@tatiana I tried creating a global enabler for the orjson across the project. Some parts might not be that elegant and I had to use AI to come up with a solution in some components I am not familar with, but based on some basic tests I did, it is functional. I created a simple benchmark with claude to test some areas touched by the change to have a global idea of performance enhancement. The results are pretty good For now, I will enable it for review, but I am more than welcome to receive feedback on the implementation or changes I'd have to do 😄 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
|
@tatiana so I have rebased and fixed the issues in the code. To keep both json and orjson, I had to do some hacks here and there but overall the tests are passing. Not a beautiful solution code-wise and it would be much easier to just fully replace to orjson instead of keeping this switch OR just use orjson where it would in fact provide a performance boost, so I am open for suggestions, if any 😄 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
cosmos/_json.py:129
dump()takes**kwargsbut, when orjson is enabled andindentis supported, it always uses orjson and silently ignores any passed stdlib-only kwargs (e.g.ensure_ascii,default,cls). This can change behavior without warning. Consider matching theloads()/dumps_*()gating by falling back to stdlib wheneverkwargsis non-empty (and/or when unsupported options are requested).
else:
_json.dump(obj, fp, sort_keys=sort_keys, indent=indent, **kwargs)
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| env_vars = self.env_vars | ||
| if env_vars: | ||
| envvars_str = json.dumps(env_vars, sort_keys=True) | ||
| envvars_str = json.dumps_str(env_vars, sort_keys=True, separators=(",", ":")) |
| # Use JSON with sorted keys for deterministic hashing, resilient to dict ordering changes | ||
| yaml_selector_hash = hashlib.md5( | ||
| json.dumps(selector_definitions, sort_keys=True, separators=(",", ":")).encode() | ||
| json.dumps_bytes(selector_definitions, sort_keys=True, separators=(",", ":")) | ||
| ).hexdigest() |
|
|
||
|
|
| indent: int | None = None, | ||
| **kwargs: Any, | ||
| ) -> None: | ||
| if _use_orjson() and indent in _ORJSON_SUPPORTED_INDENTS: |
| pass | ||
|
|
||
| import cosmos.dbt.runner as dbt_runner | ||
| from cosmos import _json as json |
|
Wow! This is a good idea @corsettigyg and I can tell how much work you put into this. Really appreciate that :) I can also confirm that currently JSON parsing is slower than ideal (we are seeing ~2 seconds for our files) and in ways that we've had to hack around. orjson would materially make a difference. How can I help on this PR? |
|
A few additional thoughts on this PR, if I may:
|
|
hey @evanvolgas , I'd say that for now, the problem I see in this PR is that there is too much complexity on trying to use orjson everywhere while keeping the old json as an alternative/backup. I was actually considering to close this PR and re-open as the initial simple switch I idealised. If you have suggestions on how to improve the process, feel free to open PRs on my fork and they will end up reflected here. after playing with the implementation for some time however, my 2 cents here is: implement orjson only where it will give results and keep json on the other places. In many places, orjson has some limitations and has almost no value. the unit tests & integration tests are strong enough to deal with the internal switch without introducing it as experimental IMO. Once the maintainers have time to check it,I'd be willing to hear the desired direction here, since creating a complex switch with so many edge cases feels overly complex, which is why I stopped working on it for the time being |
|
Hi @corsettigyg, Apologies for the delay in getting back to you on this PR. We've had a huge amount of demand recently, including work not directly visible here, and we're a bit short on hands. I completely agree that extending the usage of orjson to all areas made the code more complex than intended - I regret suggesting it in the first place. I support closing this PR and opening a new one scoped only to Additionally, could you share some numbers on how this change affects parsing
Once we have these numbers alongside the original change, I’m happy to move forward and commit to releasing it as part of Cosmos 1.5.0. Thanks again for your patience and for helping improve this! |
|
@tatiana sure, opened a new PR for it 😄 |
closed the old #2320 due to the code change getting convoluted and re-opened here. Idea is the same, use orjson for better manifest parsing, **but keep it only where it matters** ## Related Issue(s) ## Breaking Change? should not be ## Checklist - [X] I have made corresponding changes to the documentation (if required) - [X] I have added tests that prove my fix is effective or that my feature works ## Benchmark I patched the graph generator integration test and ran 10 iterations on it for different manifest sizes. In total, I got Manifest | Size | Nodes | json mean | orjson mean | Speedup -- | -- | -- | -- | -- | -- sample | 0.4 MB | 28 | 0.003 s | 0.002 s | 1.44× sample | 5.1 MB | 2,436 | 0.052 s | 0.040 s | 1.31× sample | 10.0 MB | 4,984 | 0.101 s | 0.086 s | 1.18× sample | 15.0 MB | 7,532 | 0.169 s | 0.121 s | 1.39× sample | 20.0 MB | 10,080 | 0.234 s | 0.180 s | 1.30× sample | 25.0 MB | 12,628 | 0.345 s | 0.239 s | 1.44× sample | 30.0 MB | 15,176 | 0.425 s | 0.284 s | 1.50× real-world (the manifest we use for our cosmos runs) | 26.7 MB | 1,533 | 0.159 s | 0.078 s | 2.03× What is interesting is that in our real world manifest, which has not only nodes but also descriptions, tests, tags, etc.... the performance improvement of orjson is better than for the mock manifests I created. I read about it and the explanation I can think about is that even tho the real manifest has fewer graph objects, it is still a much fatter json per object, which is where orjson helps the most 🥇 If necessary, I can share the benchmark I ran. just did not commit it since it is not exactly a test per say. --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Description
At GetYourGuide, as we started having more and more cosmos dags, we noticed that - even though we use the dbt manifest parsing method - the total dag parsing time started increasing remarkably, which started affecting our metrics.
On a further investigation, the main offender is that the size of our manifest (which is used for many different dags) increased over 30 mb which makes the parsing a somehow heavy process, especially when we have over 50+ cosmos dags running, some of them with hundreds of models.
To work around that, I started playing with the orjson library, which is a rust-based json library for python. They advertise being much faster and "more correct" than the standard json library. Using our manifest, I executed a benchmark on it and got the following results
which basically showed an improvement of 40% on the parsing time.
Related Issue(s)
Breaking Change?
Should not be right now as it is disabled by default.
Checklist