-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PR: Use fastjsonschema if installed and add tests #191
Conversation
9930a44
to
45ee420
Compare
ea9ae9d
to
5a51f83
Compare
This seems like a good step forward... but I have this feeling, much like with ujson/orjson on jupyterlab/jupyterlab_server#95, that just picking a second implementation and writing multiple code paths in many places will make the effort harder to manage in the future. I guess, to both ends, a configurable An additional thing it could provide would be an async interface to (de)serialization and validation, using (probably) an executor, as to my knowledge, all of these things are blocking, which is fine for, say, nbconvert, but not so fine in a server setting. Such a manager could be started here in nbformat, as it's as close as anything to the bits-on-disk, and moved at a later date. |
Hi @bollwyvl, thanks for the feedback.
Agreed, it does make the code path harder to follow and I actually tried it but for this specific case, that would require making way more involved changes and this was a "low hanging fruit fix". This fix is also kinda "urgent" for the use case we ran into. We were having jupyter_server taking 40 seconds to actually send a notebook!
Agreed, except that unlike the jupyter_server case we have at least 3-4 (more?) different competing implementations of JSON that are fast in general. Here we have just this one (in python), the other option is slow too. We could explore using other libraries written in other languages, but that would also require calling them via subprocess and rethinking a lot along the way.
At this point there are 2 things this manager would at least need to do. Reading and validation, and the bottleneck here is validation, not reading, although I could see benefits in that area as well. Also I think this is an interesting project enough (the JSONManager) that I would prefer starting a new package for it. Could you create a repo in deathbeds and I can start coding in there with more freedom :-) |
... unless if collecting the low-hanging fruit makes the tree unhealthy. Once For example, I frankly wouldn't see running both a Speaking of implementations...
Well, there's at least these: I have no benchmarks, but I'd wager, for certain workloads, these would be faster.
Not to down-sell our mad science experiments over there, but nobody looks at deathbeds, and a lot of the stuff in there is intentionally not "core ready". "How Jupyter Does JSON Standards" is a pretty big deal, and should probably be incubated more publicly in an official project... to me, that seems like So, like I said, right here in nbformat is probably already the right place, and could start with something far simpler, e.g. a |
Sounds good, will take a look ;-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to submit my morning review, which is I believe deprecated based on recent discussions.
nbformat/validator.py
Outdated
@@ -294,7 +327,8 @@ def iter_validate(nbdict=None, ref=None, version=None, version_minor=None, | |||
if version is None: | |||
version, version_minor = get_version(nbdict) | |||
|
|||
validator = get_validator(version, version_minor, relax_add_props=relax_add_props) | |||
validator = get_validator(version, version_minor, relax_add_props=relax_add_props, | |||
use_fast=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_fast
is always set to False. Anyway that a consumer of the nbformat
library would use it with use_fast
set to True
? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A consumer can can call validate(nb, use_fast=True)
, but what about the double validation question we have identified in jupyter_server
? The validation is always triggered on load in nbformat
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @echarles, thanks for the feedback. So:
use_fast is always set to False. Anyway that a consumer of the nbformat library would use it with use_fast set to True? I mean, it sounds to me reading the changes that it will not be possible for a consumer to benefit from this enhancement as it would always be disabled.
Yes, this is because there is some nested logic that required the "switch"
validate -> iter_validate -> better_validation_errors -> validate ....
I am looking into decoupling it so the use_fast
is no longer needed.
A consumer can can call validate(nb, use_fast=True), but what about the double validation question we have identified in jupyter_server? The validation is always triggered on load in nbformat?
The read
method which also validates (but does not raise errors) uses now use_fast=True
. Since we use validate on jupyter_server, we could add the use_fas=True
there, so we would still have double validation, but now both are fast.
We could also evaluate if there should be an extra parameter on the read
method, to actually skip validation, I think it should, but be True
by default.
Still, I would prefer no to have the use_fast
parameter, so I need to think of a better way to decouple the
validate -> iter_validate -> better_validation_errors -> validate ....
chain
a92153e
to
2b339e7
Compare
The force push makes it very hard to follow the work... just lost a comment to the effect of:
|
Sorry! |
Yes, it would be cleaner to have different classes, but as you say:
Yeah this is not possible since this is in not a 1-1 replacement, so it would not make sense to split more at this stage
See para above.
Yes, in the case of having more validators bundled, yes, but I would say that is out of the scope of this one. If more validators are bundled, yes, at the current state it would not make much sense since only fast validation is using the fast validator (on Also, the Thanks for the review @bollwyvl ! |
Will think a bit harder to decouple the |
It could just If someone is use iter_errors and treating "some" errors as warnings, or counting errors, then they are setting up their stuff for failure, anyway.
Ok, then I feel like adding |
Agreed, I just need to figure out splitting the logic, so working on that :-) |
@MSeal would it be ok to add pytest so I can add fixtures with parameters for the different libraries available? |
60bb71f
to
7b4274c
Compare
This one is ready for review @MSeal, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me, will give @bollwyvl some time to respond in case he has further thoughts on his side.
Sorry for the delay on review! Yeah, this is looking great. 🎉 This gets us to where a downstream can cleanly experiment already... that rust one sounds so promising! The only things missing from this PR that I would think could be fixed in this PR or on subsequent PRs:
Definitely out of scope, not blocking, but could be investigated:
|
Thanks @bollwyvl!
All great suggestions, I think they should be added on this PR, just wanted to do it after agreeing on the workflow.
Absolutely! And yes you owe me some issues :-p Will get this done later today! |
066e15c
to
49da28f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment, at this point I'm happy to merge though when that gets touched up.
@@ -90,6 +90,7 @@ | |||
] | |||
|
|||
extras_require = setuptools_args['extras_require'] = { | |||
'fast': ['fastjsonschema'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also add this to test
so tests can use it without issue,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
Made the suggested fixes @MSeal ! Thanks :-) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for following up on the PR reviews quickly!
We also have #189 open and ready for review -- I'd like to get it released too if we can. If that takes longer to merge we can ship a release. Very soon I won't be able to take much action (newborn coming) on these projects so someone else will likely need to do the release itself. |
Congratulations!
I would be happy to help anyway I can :-), I can set the release process to be tied to tags and run the pypinpm publication on CI, that would simplify the process for anyone mainting this. Cheers! |
Congrats @goanpeca for this great work and improvements. Really loved how things have moved forward. Thanks also to @MSeal and @bollwyvl for the reviews.
That would be another great contribution @goanpeca |
I went ahead and released 5.0.8 with this change since the other PR is likely going to have to sit until JupyterCon is done to get proper eyes on it |
Thanks for releasing @MSeal. It looks like pypi was missed when you published. Could you please update that too? |
Huh that's strange, I guess when I did the push originally there was an error message I missed. Looks fixed now. |
Fixes #190
This adds the possibility of using fastjsonschema if installed and other libraries if installed or defined by a
NBFORMAT_VALIDATOR
environment variable.The possible values are 1-1 to the module names so,
jsonschema
,fastjsonschema
, eventually (when stable) we could also addjsonschema_rs
, the rust implementation.better_error
will not work, so those 2 tests are skipped forfastjsonschema
. The rest works as expected. This means if usingfastjsonschema
printed errors will be less readable.Related to jupyter-server/jupyter_server#312
Pinging @echarles, @jasongrout, @Zsailer, @mlucool