Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nbformat asynchronous API? #194

Open
bollwyvl opened this issue Sep 25, 2020 · 10 comments
Open

nbformat asynchronous API? #194

bollwyvl opened this issue Sep 25, 2020 · 10 comments

Comments

@bollwyvl
Copy link
Contributor

Presumably the next major (or shortly thereafter) release of nbformat will be python 3.6+ (or 3.7).

side note: I actually would suggest to drop only CPython 3.6, pypy 3.6 is still a fine, supported target!

Among the other benefits, this will make use of proper async and await possible, without relying on tornado to get cooperative behavior with downstreams like notebook and jupyter_server, or embedded in more exotic things.

Some places, which are all blocking today, that would likely benefit from being (apparently) asynchronous for certain workloads:

  • loading JSON
  • parsing JSON
  • writing JSON
  • validating JSON

Such workloads would include

  • high-throughput use cases, such as nbviewer
    • blocking behavior increases overall latency, which likely compounds unless the downstream uses their own pool (as nbviewer does)
  • interactively, where a scientist who (even temporarily) generates notebooks with large/complex outputs and persist them to disk

Developer Experience

Typographically, it could be something like (assuming IPython-like top-level await for demonstration purposes):

param

import nbformat
nb = await nbformat.read('path/to/notebook.ipynb', as_version=4, async=True)
  • this is kind of icky from a typing point of view...

proxy

or perhaps, with a full mirroring:

import nbformat.async_ as nbformat
nb = await nbformat.read('path/to/notebook.ipynb', as_version=4)

prefix

just prefix with a

import nbformat
nb = await nbformat.aread('path/to/notebook.ipynb', as_version=4)

Async facade/Configuration

While future asynchronous parsers/validators may arise, given that all of the above APIs are actually implemented as blocking behaviors right now, e.g. json.load or jsonschema.validate, an initial facade would probably be needed, perhaps as a ThreadPoolExecutor, etc. which might require configuration, e.g. NBCONVERT_THREADS.

Not even sure what a good default would be, but i'm always a fan of multiprocessing.cpu_count.

Testing/Dependencies

  • Given (py3.6|3.7)+, no new runtime dependencies should be required
  • Adopting pytest-asyncio is very helpful in verifying async behavior
  • Following the lead of IPython/ipykernel, it should consider (and be tested against) alternate event loop implementations, e.g. uvloop, trio
  • tracking the performance of the sync/async x loop x workload with asv would be very demonstrative

Downstreams

Of course, no downstreams would be ready to use this today! Some coordination might be necessary to determine any other gotchas that might arise, but this also means it can be released as an "experimental" API relatively quickly. As long as the original API remains, basically unchanged, however, there should be no rush-to-implement to catch a compatibility window, even once it is considered "supported".

cc @goanpeca

@MSeal
Copy link
Contributor

MSeal commented Sep 27, 2020

Great idea. I'm 100% supportive of adding async calls to this library. I'll be unlikely to help with implementation / review, but other folks who have helped make jupyter_client and nbclient async can.

Presumably the next major (or shortly thereafter) release of nbformat will be python 3.6+ (or 3.7).

side note: I actually would suggest to drop only CPython 3.6, pypy 3.6 is still a fine, supported target!

Since 3.5 is no longer supported by Python org (no more security patches as of this month) I think 3.6+ is probably right for next release. Python 3.6 is still getting security patches through 2021, I don't see a reason to drop it yet, especially with some format changes coming that older setups may need to use this to parse.

Typographically, it could be something like (assuming IPython-like top-level await for demonstration purposes):

I'd vote adding an aread or async_read function and leave the synchronous ones available as is. We did this for nbclient to support both patterns by wrapping the async calls. Without preserving original function signatures it causes a lot of downstream trouble when you swap well used functions to async on folks.

@bollwyvl
Copy link
Contributor Author

Thanks for the thoughts!

preserving original function signatures

Oh, I would not dream of breaking those! And, as presently everything upstream is blocking, I wouldn't even touch them, and in fact have the initial, reference async implementations be wrappers of the existing functions inside an executor, which will require inspection for thread-safety issues, but as is a continuing mystery to me, file io still has no async implementations in stdlib, and should be considered table stakes for this work.

One concern... in my initial, very hacked-up PoC, I arrived at something not too dissimilar, but far worse, than aiofiles... down to allowing kwargs for the loop and executor. Part of the idea was to not take on new deps!

Python 3.6 is still getting security patches through 2021, I don't see a reason to drop it yet,

Right. This may warrant its own issue, altogether, or at least a signpost on a roadmap..

Going further, if we are going to 3.6, and to ensure no badness, perhaps a precondition to this work is to start adding typings. A quick run through of the test suite through monkeytype is a pretty good baseline... until it get to anything that touches traitlets, or the "yet another bunch" that is NotebookNode. If anything, the "fancy" member syntax enabled by 3.7 is one thing that would be missed.

@MSeal
Copy link
Contributor

MSeal commented Sep 29, 2020

Just use aiofiles, don't try to reinvent that wheel on that one. It's complicated and error prone to try and re-write that behavior and native options don't provide sufficient capabilities here.

perhaps a precondition to this work is to start adding typings

I don't see it as a requirement to do this work. Let's keep it as a separate issue. And yes traitlets don't play all that nice with type hinting.

@mwakaba2
Copy link

mwakaba2 commented Nov 6, 2020

Hi @bollwyvl, are there any updates on this? I have a draft PR open to make jupyter_server's contents api asynchronous, and I'm trying to make the FileContentsManager asynchronous with aiofiles. The FileContentsManager depends on nbformat, so making nbformat.read and other nbformat methods awaitable would be nice.
Jupyter_server Draft PR
nbformat.read example

@bollwyvl
Copy link
Contributor Author

bollwyvl commented Nov 6, 2020

Sorry, no updates yet, but thanks for the links... looks like there's going to be a flurry of long-overdue CI scut work on various projects due to travis CI changes, so fun feature stuff might be on the back burner for a tick..

If you're feeling aiofiles, that's great, and helps reduce my heartburn. Do you have any preferences on the api you would like from the issue description?

@mwakaba2
Copy link

mwakaba2 commented Nov 6, 2020

@bollwyvl no worries. I prefer the proxy option since that requires less code and doc changes, and it'll leave the synchronous methods as is with less chance of causing any problems downstream.

@bollwyvl
Copy link
Contributor Author

bollwyvl commented Nov 8, 2020

@mwakaba2 So i've been looking around some at this (still very experimental)... no guarantees it will stay like it is, but might be something to talk about at the nuts-and-bolts layer.

I am wondering if aiofiles is even the right play. As mentioned above, ipython supports multiple io loops, and this seems like a good system property, and aiofiles means you're on stdlib asyncio for good.

To that end, I've been looking into anyio, which seems actively developed, type annotated, and documented, and supports all kinds of handy things on trio, curio, uvloop, and asyncio with the same API.

@bollwyvl
Copy link
Contributor Author

bollwyvl commented Nov 8, 2020

Maybe even the bigger question: should nbformat be concerned with bits-on-disk at all? Perhaps it should just know about JSON (as utf-8) and dicts. That would be passing the buck pretty hard, but in looking over some of the things down dowstream, await reads, await writes, await validate might be the only things it should have to deal with, leaving the choice of file handling (if files are involved at all) to the caller.

This still leaves (potentially) long-running tasks, and these will still be (natively) blocking for the time being, but something like anyio.run_sync_in_worker_thread seems far more consistent than playing whack-a-mole with different types of open file handles, which JSON parsers and validators need to read in their entirety anyway.

@mwakaba2
Copy link

mwakaba2 commented Nov 8, 2020

Thanks for sharing the draft PR. Yeah, compared to anyio, aiofiles's documentation, readability, and support is a bit lacking. Using anyio seems like a better choice for this use case, especially if it avoids making some assumptions about the different types of file handles it'll receive.

@bollwyvl
Copy link
Contributor Author

bollwyvl commented Nov 8, 2020

I guess the downside of dropping read and write from the async API is memory consumption... json.parse is a dog, and isn't going to play nice with any of the above. Some more concise thoughts about this: https://sans-io.readthedocs.io/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants