Public API to get a view of the shared document model #270

krassowski · 2024-03-29T10:42:36Z

Problem

Other server extensions would like to use the shared document model to access the content of the notebook without the overhead of sending it over from the frontend, and without the risk of using an outdated (or even inaccessible) copy stored on the disk.

Currently, while this is possible as demonstrated in jupyterlab/jupyter-ai#708, it requires:

accessing a private _document attribute of the DocumentRoom
re-creating the room ID using encode_file_path and other some black magic (which has no guarantee of working if the id format changes in a future release of jupyter-collaboration)

Proposed Solution

Expose a new get_document() method on the YDocExtension. It could look like:

def get_document(
    self: YDocExtension,
    path: str,
    content_type: 'notebook' | 'file',
    file_format: 'json' | 'text'
) -> YBaseDoc | None:
    file_id_manager = self.settings["file_id_manager"]
    file_id = file_id_manager.index(request.path)

    # these two lines would be moved to an utility function to avoid drift from the code used in the handler.
    encoded_path = encode_file_path(file_format, content_type, file_id)
    room_id: str = encoded_path.split("/")[-1]

    try:
        room = await collaboration.ywebsocket_server.get_room(room_id)
    except RoomNotFound:
        return None

    if isinstance(room, DocumentRoom):
        document = room._document
        return document

One question raised by @3coins on the server call was whether such API should return a read-only view of the document:

In case of YNotebook subclass this could mean an encapsulation hiding methods such as append_cell and set_cell. This would require defining read-only variants/wrapper for each of the data structures defined in jupyter-ydoc.
Alternatively, we could use a copy of the model as implemented by @davidbrochart for forking in Support suggestions #239. Creating a copy might have performance implications; while we still do not send the notebook over wire, we are making an in-memory copy - which should be orders of magnitude faster, but still slower than not doing it.

Task list:

benchmark the performance penalty of cloning the document as done in forking (relatively to sending it over the wire)
establish consensus on the way forward with making the result read-only
implement the new method
add tests
add documentation

Additional context

This would follow the same pattern as used by jupyter-server-fileid which exposes a small (two methods) public API as documented here.

The text was updated successfully, but these errors were encountered:

krassowski · 2024-03-29T10:44:34Z

The content_type and file_format appear redundant to me. Notebooks will always have json file format and notebook content type. Can a file get a json file_format?

davidbrochart · 2024-03-29T11:28:22Z

Jupyter server's contents manager needs this information to access a file, so I guess the redundancy comes from there.
I've always been confused by that, and I don't even think the server should make a special case for notebooks.

krassowski · 2024-03-29T17:55:40Z

I compared using cloning vs not using it for the usecase from jupyterlab/jupyter-ai#708:

test case	mean	median	stdev
5 cell notebook, no cloning	0.59ms	0.56ms	0.18ms
5 cell notebook, cloning	0.79ms	0.77ms	0.06ms
100 cell notebook, no cloning	7.63ms	7.36ms	0.93ms
100 cell notebook, cloning	7.30ms	7.16ms	0.56ms

It appears that making a copy of the shared model has negligible cost compared to the cost of iterating the notebook and extracting the code / concatenating it into a prefix and suffix. Still, for 100 cells I expected it to be much faster than 7ms.

Maybe the cost is in calling to_py() repeatedly? On the other hand, I do not want to convert everything to a JSON/Python dict straightaway because outputs may be have GBs of irrelevant data. Something to investigate further.

Edit: here are the results - using YNotebook.get() is much more expensive even if the notebook has no outputs at all (because each cell has metadata).

test case	mean	median	stdev
100 cell notebook, cloning, `.get()`	29.68ms	28.80ms	2.78ms
100 cell notebook, cloning	8.46ms	7.87ms	1.48ms

krassowski · 2024-03-29T17:57:05Z

When using 10 000 iterations in a microbenchmark I sometimes see:

Traceback (most recent call last):
  File "/lib/python3.10/site-packages/distributed/profile.py", line 367, in _watch
    del frame
RuntimeError: pycrdt::map::Map is unsendable, but is being dropped on another thread

Does it mean that I need to respect locks even if only reading the content (in an async function)?

pycrdt                        0.8.17
pycrdt-websocket              0.12.7

davidbrochart · 2024-03-29T22:39:31Z

This only means that pycrdt objects should be created, used, and deleted in the same thread. This is stricter than just not being thread-safe.
In your case I see you're using Dask which probably uses threads, and the object is garbage-collected in a different thread than the one in which it was created. I also see this error in pytest sometimes. This issue gives some options to fix it.

davidbrochart · 2024-04-04T07:55:02Z

I'm thinking that a way to have read-only access to a shared document could be by using a transport layer that doesn't let a client send their changes. We could for instance create a document server/provider that uses Server-Sent Events.

krassowski added the enhancement New feature or request label Mar 29, 2024

krassowski mentioned this issue Mar 29, 2024

Use jupyter-collaboration to get the full notebook content for completions jupyterlab/jupyter-ai#708

Closed

krassowski mentioned this issue Mar 29, 2024

Separate out the frontend and the backend #269

Closed

This was referenced Apr 1, 2024

No tests for notebooks, fixture does not work when store = True #274

Closed

Add a public API for getting a read-only view of the shared model #275

Merged

davidbrochart closed this as completed in #275 Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Public API to get a view of the shared document model #270

Public API to get a view of the shared document model #270

krassowski commented Mar 29, 2024 •

edited

Loading

krassowski commented Mar 29, 2024

davidbrochart commented Mar 29, 2024

krassowski commented Mar 29, 2024 •

edited

Loading

krassowski commented Mar 29, 2024

davidbrochart commented Mar 29, 2024

davidbrochart commented Apr 4, 2024

Public API to get a view of the shared document model #270

Public API to get a view of the shared document model #270

Comments

krassowski commented Mar 29, 2024 • edited Loading

Problem

Proposed Solution

Additional context

krassowski commented Mar 29, 2024

davidbrochart commented Mar 29, 2024

krassowski commented Mar 29, 2024 • edited Loading

krassowski commented Mar 29, 2024

davidbrochart commented Mar 29, 2024

davidbrochart commented Apr 4, 2024

krassowski commented Mar 29, 2024 •

edited

Loading

krassowski commented Mar 29, 2024 •

edited

Loading