Skip to content

[Experimental feature] enable orjson parser for whole project#2320

Closed
corsettigyg wants to merge 0 commit into
astronomer:mainfrom
corsettigyg:main
Closed

[Experimental feature] enable orjson parser for whole project#2320
corsettigyg wants to merge 0 commit into
astronomer:mainfrom
corsettigyg:main

Conversation

@corsettigyg
Copy link
Copy Markdown
Collaborator

Description

At GetYourGuide, as we started having more and more cosmos dags, we noticed that - even though we use the dbt manifest parsing method - the total dag parsing time started increasing remarkably, which started affecting our metrics.

On a further investigation, the main offender is that the size of our manifest (which is used for many different dags) increased over 30 mb which makes the parsing a somehow heavy process, especially when we have over 50+ cosmos dags running, some of them with hundreds of models.

To work around that, I started playing with the orjson library, which is a rust-based json library for python. They advertise being much faster and "more correct" than the standard json library. Using our manifest, I executed a benchmark on it and got the following results

Screenshot 2026-02-04 at 16 49 43

which basically showed an improvement of 40% on the parsing time.

Related Issue(s)

Breaking Change?

Should not be right now as it is disabled by default.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

Copilot AI review requested due to automatic review settings February 4, 2026 15:54
@netlify
Copy link
Copy Markdown

netlify Bot commented Feb 4, 2026

Deploy Preview for astronomer-cosmos ready!

Name Link
🔨 Latest commit 0d93a6b
🔍 Latest deploy log https://app.netlify.com/projects/astronomer-cosmos/deploys/698cff5c2575df00084233a9
😎 Deploy Preview https://deploy-preview-2320--astronomer-cosmos.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces an experimental option to use orjson for dbt manifest parsing to reduce DAG parsing time for large dbt projects.

Changes:

  • Add a new enable_orjson_parser configuration flag (off by default) and document it in the Cosmos configuration docs.
  • Implement conditional manifest loading using orjson in DbtGraph.load_from_dbt_manifest, with a clear error when the feature is enabled but orjson is not installed.
  • Add tests covering configuration behavior, error paths, and equivalence between orjson and the standard json parser, and define a new orjson optional-dependency extra in pyproject.toml.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cosmos/dbt/graph.py Adds optional orjson-based manifest loading with a guarded feature flag and error handling when orjson is missing.
cosmos/settings.py Introduces the enable_orjson_parser boolean setting wired to Airflow config/environment.
tests/dbt/test_orjson_parser.py Adds tests for default/overridden settings, missing-orjson error, and equivalence between orjson and standard parsing.
pyproject.toml Defines an orjson extra (astronomer-cosmos[orjson]) to install the orjson dependency.
docs/configuration/cosmos-conf.rst Documents the new enable_orjson_parser config option, including env var and behavior when orjson is missing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pyproject.toml Outdated
@tatiana
Copy link
Copy Markdown
Collaborator

tatiana commented Feb 4, 2026

This is super exciting, @corsettigyg ! If this were the default, do you have any concerns regarding backwards compatibility - other than the dependency on this Rust JSON parser?

@corsettigyg
Copy link
Copy Markdown
Collaborator Author

corsettigyg commented Feb 4, 2026

hey @tatiana 😎 honestly, I would argue it is a pretty safe change in general. I have tested it on airflow 2.9 & airflow 2.11 and did not see any issues. The orjson does behave almost exactly like the classic json package. Of course, I came across orjson by chance so cannot say much of the project, but in the local sandbox I have been playing around with it, it has been working exactly the same with very good performances. I'd say to keep it as an experimental feature and if feedback is positive, use it as the only json parser from cosmos 2.0

On my local airflow, I have been testing this branch on our cosmos dags and could not see any issues. Only observation is that the scheduler is suffering less to parse them hahaha

obs: to keep complexity <10, had to create the _load_manifest_from_file function

@codecov
Copy link
Copy Markdown

codecov Bot commented Feb 5, 2026

Codecov Report

❌ Patch coverage is 98.93617% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 97.90%. Comparing base (792c5c8) to head (0d93a6b).

Files with missing lines Patch % Lines
cosmos/_json.py 97.77% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2320      +/-   ##
==========================================
- Coverage   97.90%   97.90%   -0.01%     
==========================================
  Files         102      103       +1     
  Lines        6818     6861      +43     
==========================================
+ Hits         6675     6717      +42     
- Misses        143      144       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tatiana
Copy link
Copy Markdown
Collaborator

tatiana commented Feb 5, 2026

@corsettigyg Sorry for commenting while this task is still in draft, but I had an idea for how we could expose this feature - and I thought it was good to share it early in the implementation.

At the moment, it seems we're importing the json library in a few places in the Cosmos codebase:

$ grep -rn "import json" cosmos
cosmos/operators/local.py:5:import json
cosmos/operators/watcher.py:4:import json
cosmos/operators/_watcher/triggerer.py:5:import json
cosmos/operators/_watcher/base.py:1:import json
cosmos/dbt/graph.py:7:import json
cosmos/plugin/airflow3.py:4:import json
cosmos/cache.py:5:import json
cosmos/telemetry.py:3:import json
cosmos/profiles/bigquery/service_account_keyfile_dict.py:5:import json
cosmos/profiles/snowflake/user_pass.py:5:import json
cosmos/profiles/snowflake/user_encrypted_privatekey_env_variable.py:5:import json
cosmos/profiles/snowflake/user_privatekey.py:5:import json
cosmos/profiles/snowflake/user_encrypted_privatekey_file.py:5:import json
cosmos/profiles/base.py:9:import json
cosmos/listeners/dag_run_listener.py:5:import json

What if we had a cosmos/dependencies.py or something similar, where we did:

try:
    import orjson as json
except ImportError:  # pragma: no cover
    import json

And we change the 15 import paths to import json to be from cosmos. dependencies import json?

And we did not have yet-another-configuration in Cosmos to set this up? This would allow us to remove:

  • the new config
  • the new installation plugin

This way, people could benefit from this optimisation - in all parts of Cosmos - by just installing orjson.

@corsettigyg
Copy link
Copy Markdown
Collaborator Author

hey @tatiana , dont worry, your observations are always more than welcome even if it is a draft 🚀

This is a very good take actually. I will try to do something like that, so the overall performance impact can reverberate across the whole codebase. orjson has some small peculiarities though that do not make it into an one to one equivalent of the standard json package though, but it is worth the try. will try to do it next week

@corsettigyg corsettigyg changed the title [Experimental feature] enable orjson parser for manifest parsing mode [Experimental feature] enable orjson parser for whole project Feb 10, 2026
@corsettigyg
Copy link
Copy Markdown
Collaborator Author

corsettigyg commented Feb 10, 2026

@tatiana I tried creating a global enabler for the orjson across the project. Some parts might not be that elegant and I had to use AI to come up with a solution in some components I am not familar with, but based on some basic tests I did, it is functional.
I was wondering if it would be a good idea to enable the flag to true by default , run the integration tests & unit tests in the cicd with it, then put it back to false again. This way, it is an extra layer of validation.

I created a simple benchmark with claude to test some areas touched by the change to have a global idea of performance enhancement.

  1. Manifest parsing        – cosmos/dbt/graph.py  load_from_dbt_manifest()
  2. dbt ls line parsing     – cosmos/dbt/graph.py  run_command_with_dbt_runner()
  3. Compress + serialize    – cosmos/operators/local.py  store_compiled_sql() / watcher events
  4. Decompress + deserialize– cosmos/dbt/graph.py  _load_dbt_ls_cache_from_disk()

The results are pretty good

========================================================================
  Generating ~2 MB manifest ...
  Generated 4.1 MB  (886 nodes)  in 0.1s
  Running 5 iterations per scenario
========================================================================
  [Manifest parse        ] json= 101.1ms  orjson=  40.9ms  (2.5x)
  [dbt ls line parse     ] json=  28.9ms  orjson=  12.5ms  (2.3x)
  [Compress+serialize    ] json= 199.0ms  orjson=  36.8ms  (5.4x)
  [Decompress+deser.     ] json= 104.2ms  orjson=  49.1ms  (2.1x)

========================================================================
  Generating ~10 MB manifest ...
  Generated 23.4 MB  (5594 nodes)  in 0.4s
  Running 5 iterations per scenario
========================================================================
  [Manifest parse        ] json= 600.7ms  orjson= 287.3ms  (2.1x)
  [dbt ls line parse     ] json=  29.1ms  orjson=  29.5ms  (1.0x)
  [Compress+serialize    ] json=   1.17s  orjson= 215.7ms  (5.4x)
  [Decompress+deser.     ] json= 624.5ms  orjson= 299.2ms  (2.1x)

========================================================================
  Generating ~50 MB manifest ...
  Generated 120.4 MB  (29133 nodes)  in 2.2s
  Running 5 iterations per scenario
========================================================================
  [Manifest parse        ] json= 829.9ms  orjson= 340.5ms  (2.4x)
  [dbt ls line parse     ] json=   5.0ms  orjson=   2.2ms  (2.2x)
  [Compress+serialize    ] json=   1.44s  orjson=   1.05s  (1.4x)
  [Decompress+deser.     ] json=   1.01s  orjson= 393.9ms  (2.6x)

========================================================================
  Generating ~100 MB manifest ...
  Generated 241.7 MB  (58557 nodes)  in 4.7s
  Running 5 iterations per scenario
========================================================================
  [Manifest parse        ] json=   1.71s  orjson= 650.8ms  (2.6x)
  [dbt ls line parse     ] json=   5.4ms  orjson=  42.1ms  (0.1x)
  [Compress+serialize    ] json=   2.91s  orjson=   2.13s  (1.4x)
  [Decompress+deser.     ] json=   2.02s  orjson=   1.02s  (2.0x)

For now, I will enable it for review, but I am more than welcome to receive feedback on the implementation or changes I'd have to do 😄

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/dbt/test_orjson_parser.py Outdated
Comment thread docs/reference/configs/cosmos-conf.rst Outdated
Comment thread cosmos/_json.py Outdated
@corsettigyg
Copy link
Copy Markdown
Collaborator Author

corsettigyg commented Mar 14, 2026

@tatiana so I have rebased and fixed the issues in the code. To keep both json and orjson, I had to do some hacks here and there but overall the tests are passing. Not a beautiful solution code-wise and it would be much easier to just fully replace to orjson instead of keeping this switch OR just use orjson where it would in fact provide a performance boost, so I am open for suggestions, if any 😄

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread docs/reference/configs/cosmos-conf.rst Outdated
Comment thread cosmos/settings.py Outdated
Comment thread cosmos/_json.py Outdated
Comment thread cosmos/_json.py Outdated
Comment thread cosmos/_json.py Outdated
Comment thread cosmos/operators/_watcher/base.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

cosmos/_json.py:129

  • dump() takes **kwargs but, when orjson is enabled and indent is supported, it always uses orjson and silently ignores any passed stdlib-only kwargs (e.g. ensure_ascii, default, cls). This can change behavior without warning. Consider matching the loads()/dumps_*() gating by falling back to stdlib whenever kwargs is non-empty (and/or when unsupported options are requested).
    else:
        _json.dump(obj, fp, sort_keys=sort_keys, indent=indent, **kwargs)


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread tests/dbt/test_orjson_parser.py Outdated
Comment thread cosmos/profiles/base.py Outdated
Comment thread cosmos/dbt/graph.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment thread cosmos/dbt/graph.py Outdated
env_vars = self.env_vars
if env_vars:
envvars_str = json.dumps(env_vars, sort_keys=True)
envvars_str = json.dumps_str(env_vars, sort_keys=True, separators=(",", ":"))
Comment thread cosmos/cache.py
Comment on lines 297 to 300
# Use JSON with sorted keys for deterministic hashing, resilient to dict ordering changes
yaml_selector_hash = hashlib.md5(
json.dumps(selector_definitions, sort_keys=True, separators=(",", ":")).encode()
json.dumps_bytes(selector_definitions, sort_keys=True, separators=(",", ":"))
).hexdigest()
Comment thread tests/dbt/test_orjson_parser.py Outdated
Comment on lines +210 to +211


Comment thread cosmos/_json.py Outdated
indent: int | None = None,
**kwargs: Any,
) -> None:
if _use_orjson() and indent in _ORJSON_SUPPORTED_INDENTS:
Comment thread cosmos/dbt/graph.py Outdated
pass

import cosmos.dbt.runner as dbt_runner
from cosmos import _json as json
@evanvolgas
Copy link
Copy Markdown
Contributor

evanvolgas commented Mar 19, 2026

Wow! This is a good idea @corsettigyg and I can tell how much work you put into this. Really appreciate that :)

I can also confirm that currently JSON parsing is slower than ideal (we are seeing ~2 seconds for our files) and in ways that we've had to hack around. orjson would materially make a difference.

How can I help on this PR?

@evanvolgas
Copy link
Copy Markdown
Contributor

evanvolgas commented Mar 19, 2026

A few additional thoughts on this PR, if I may:

  1. dump() silently drops kwargs when orjson is active (Copilot noticed this too). If someone calls json.dump(..., default=my_serializer), orjson will ignore default without warning. The loads/dumps* functions fall back to stdlib when kwargs is non-empty; dump() should do the same.
  2. dbt_ls_cache_key_args changes separators — the PR quietly changes json.dumps_str(env_vars, sort_keys=True) to add separators=(",", ":"). This alters the cache key format for existing users, invalidating their dbt ls cache on upgrade. Minor, but worth a note in the changelog.
  3. The enable_orjson_parser setting path — it fails loudly with ImportError at call time if orjson isn't installed, which is good. But users who set the flag and then pip install astronomer-cosmos (without the extra) will only get the error during DAG parsing, not at startup.

@corsettigyg
Copy link
Copy Markdown
Collaborator Author

hey @evanvolgas , I'd say that for now, the problem I see in this PR is that there is too much complexity on trying to use orjson everywhere while keeping the old json as an alternative/backup. I was actually considering to close this PR and re-open as the initial simple switch I idealised. If you have suggestions on how to improve the process, feel free to open PRs on my fork and they will end up reflected here.

after playing with the implementation for some time however, my 2 cents here is: implement orjson only where it will give results and keep json on the other places. In many places, orjson has some limitations and has almost no value. the unit tests & integration tests are strong enough to deal with the internal switch without introducing it as experimental IMO.

Once the maintainers have time to check it,I'd be willing to hear the desired direction here, since creating a complex switch with so many edge cases feels overly complex, which is why I stopped working on it for the time being

@tatiana
Copy link
Copy Markdown
Collaborator

tatiana commented Apr 7, 2026

Hi @corsettigyg,

Apologies for the delay in getting back to you on this PR.

We've had a huge amount of demand recently, including work not directly visible here, and we're a bit short on hands. I completely agree that extending the usage of orjson to all areas made the code more complex than intended - I regret suggesting it in the first place.

I support closing this PR and opening a new one scoped only to manifest.json.

Additionally, could you share some numbers on how this change affects parsing manifest.json? Specifically, it would be helpful to see:

  • The size of the file (if you have multiple manifest.json files, 2–3 examples would be great)
  • The time Cosmos takes to parse the project using json vs orjson, based on the DAG processor logs, similar to:
Cosmos performance (<cache_id>) - [<hostname>|<pid>]: It took 0.068s to parse the dbt

Once we have these numbers alongside the original change, I’m happy to move forward and commit to releasing it as part of Cosmos 1.5.0.

Thanks again for your patience and for helping improve this!

@corsettigyg
Copy link
Copy Markdown
Collaborator Author

@tatiana sure, opened a new PR for it 😄

tatiana added a commit that referenced this pull request May 12, 2026
closed the old #2320 due to the code change getting convoluted and
re-opened here.

Idea is the same, use orjson for better manifest parsing, **but keep it
only where it matters**
## Related Issue(s)

## Breaking Change?

should not be

## Checklist

- [X] I have made corresponding changes to the documentation (if
required)
- [X] I have added tests that prove my fix is effective or that my
feature works

## Benchmark

I patched the graph generator integration test and ran 10 iterations on
it for different manifest sizes. In total, I got


Manifest | Size | Nodes | json mean | orjson mean | Speedup
-- | -- | -- | -- | -- | --
sample | 0.4 MB | 28 | 0.003 s | 0.002 s | 1.44×
sample | 5.1 MB | 2,436 | 0.052 s | 0.040 s | 1.31×
sample | 10.0 MB | 4,984 | 0.101 s | 0.086 s | 1.18×
sample | 15.0 MB | 7,532 | 0.169 s | 0.121 s | 1.39×
sample | 20.0 MB | 10,080 | 0.234 s | 0.180 s | 1.30×
sample | 25.0 MB | 12,628 | 0.345 s | 0.239 s | 1.44×
sample | 30.0 MB | 15,176 | 0.425 s | 0.284 s | 1.50×
real-world (the manifest we use for our cosmos runs) | 26.7 MB | 1,533 |
0.159 s | 0.078 s | 2.03×



What is interesting is that in our real world manifest, which has not
only nodes but also descriptions, tests, tags, etc.... the performance
improvement of orjson is better than for the mock manifests I created. I
read about it and the explanation I can think about is that even tho the
real manifest has fewer graph objects, it is still a much fatter json
per object, which is where orjson helps the most 🥇

If necessary, I can share the benchmark I ran. just did not commit it
since it is not exactly a test per say.

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tatiana Al-Chueyr <tatiana.alchueyr@gmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants