Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

draft: File ID Manager base implementation #921

Closed
wants to merge 15 commits into from

Conversation

dlqqq
Copy link
Contributor

@dlqqq dlqqq commented Jul 13, 2022

Overview

Draft implementation of File ID service proposed here.

File ID Service maintains a bi-directional mapping between file IDs and file names. This service permits bi-directional lookup, and preserves file ID across filesystem operations done through the ContentsManager.

  • Adds FileIdManager class with assoc. unit tests
  • Integrates FileIdManager methods into ContentsManager and AsyncContentsManager filesystem methods
  • Appends id property that contains the file ID to the Contents API GET/POST/PUT/PATCH /api/contents/{path} operations.
  • Adds benchmark under jupyter_server/benchmarks/fileidmanager_benchmark.py.

Testing

To test with JupyterLab stable, developers can run the following in a separate conda environment:

git clone [email protected]:jupyter-server/jupyter_server.git
cd jupyter_server/
git remote add dlqqq [email protected]:dlqqq/jupyter_server.git
git pull dlqqq file-id-service
pip install jupyterlab
pip install -e ".[dev,test]"

Benchmarks

To run File ID Manager benchmark:

python jupyter_server/benchmarks/fileidmanager_benchmark.py

Benchmarks run on m5.12xlarge AWS EC2 instance.

% python jupyter_server/benchmarks/fileidmanager_benchmark.py
Index benchmark (separate transactions)
100       files | 0.2932   s
1,000     files | 2.9968   s
Index benchmark (single transaction, atomic INSERTs)
100       files | 0.0032   s
1,000     files | 0.0072   s
10,000    files | 0.0362   s
100,000   files | 0.3430   s
1,000,000 files | 3.5570   s
Index benchmark (single transaction, batched INSERTs)
100       files | 0.2897   s
1,000     files | 0.2663   s
10,000    files | 0.2613   s
100,000   files | 0.2678   s
1,000,000 files | 2.7359   s
Recursive move benchmark
100       files | 0.0065   s
1,000     files | 0.0093   s
10,000    files | 0.0370   s
100,000   files | 0.3499   s
1,000,000 files | 3.6639   s
Recursive copy benchmark
100       files | 0.0106   s
1,000     files | 0.0121   s
10,000    files | 0.0233   s
100,000   files | 0.1632   s
1,000,000 files | 1.5505   s
Recursive delete benchmark
100       files | 0.0033   s
1,000     files | 0.0047   s
10,000    files | 0.0150   s
100,000   files | 0.1359   s
1,000,000 files | 1.4314   s



class FileIdManager(LoggingConfigurable):
db_path = Unicode(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend that we name this trait database_filepath to match the SessionManager:

database_filepath = Unicode(
default_value=":memory:",
help=(
"The filesystem path to SQLite Database file "
"(e.g. /path/to/session_database.db). By default, the session "
"database is stored in-memory (i.e. `:memory:` setting from sqlite3) "
"and does not persist when the current Jupyter Server shuts down."
),
).tag(config=True)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also some logic here you could borrow to validate this trait:

@validate("database_filepath")
def _validate_database_filepath(self, proposal):
value = proposal["value"]
if value == ":memory:":
return value
path = pathlib.Path(value)
if path.exists():
# Verify that the database path is not a directory.
if path.is_dir():
raise TraitError(
"`database_filepath` expected a file path, but the given path is a directory."
)
# Verify that database path is an SQLite 3 Database by checking its header.
with open(value, "rb") as f:
header = f.read(100)
if not header.startswith(b"SQLite format 3") and not header == b"":
raise TraitError("The given file is not an SQLite database file.")
return value

@Zsailer
Copy link
Member

Zsailer commented Jul 26, 2022

I think we could borrow quite a bit of SQLite connection logic in SessionManager here to avoid duplication of code. We'll need to abstract out some of these pieces and share them across classes.

For example, the definition of the connection, cursor, etc. could be abstracted out of the SessionManager and used by both of these managers:

@property
def cursor(self):
"""Start a cursor and create a database called 'session'"""
if self._cursor is None:
self._cursor = self.connection.cursor()
self._cursor.execute(
"""CREATE TABLE IF NOT EXISTS session
(session_id, path, name, type, kernel_id)"""
)
return self._cursor
@property
def connection(self):
"""Start a database connection"""
if self._connection is None:
# Set isolation level to None to autocommit all changes to the database.
self._connection = sqlite3.connect(self.database_filepath, isolation_level=None)
self._connection.row_factory = sqlite3.Row
return self._connection
def close(self):
"""Close the sqlite connection"""
if self._cursor is not None:
self._cursor.close()
self._cursor = None
def __del__(self):
"""Close connection once SessionManager closes"""
self.close()

@Zsailer Zsailer added this to the Future milestone Jul 26, 2022
@Zsailer
Copy link
Member

Zsailer commented Jul 26, 2022

Thank you for working on this, @dlqqq! I think a File/Document ID will be really useful for a lot of use cases! I'm excited to see this land.

My number one concern about the current implementation is that the file ID isn't strictly unique—i.e. it's just an index in the database. In order to make this service more broadly useful, I think these IDs must be unique.

Let me give a simple example...

This service could be useful for sharing notebooks between two running Jupyter Servers (not necessarily RTC, just basic copying between two servers)—an enhancement many people have asked us for.

In this example, each server has its own file ID database that maps file IDs to their path on the individual filesystems. If we "share" this file from one server to another, I would hope we could use this service to have a (admittedly, weak) link between
these two copies of the document. In Jupyter Server today, we can't create this link, because using name/path wasn't a viable solution given the lack of uniqueness. In this implementation, we also can't create this link, because we can't guarantee uniqueness from these IDs between the two servers. They will likely have conflicting/unavailable indexes in their database since have completely different filesystem structures.

@echarles
Copy link
Member

Thank you for working on this, @dlqqq!

Yes, thank you!

Let me give a simple example...

Sorry for jumping here without reading the complete history.... have the requirements for distributed systems/servers like jupyterhub be taken into account? (uniqueness, portability across servers...)

@dlqqq
Copy link
Contributor Author

dlqqq commented Jul 27, 2022

@Zsailer Hey Zach! Thank you for taking the time to review my progress so far. Right now, I'm working on implementing handling for out-of-band file system operations along with a design doc that dives deeper into how the implementation actually works along with some of its shortcomings.

I've addressed a few of your questions below and will address the others once I'm done with my current work. Keep in mind though, the logic for handling out-of-band operations gets a bit more tricky, so it'll likely need a re-review.

I think we could borrow quite a bit of SQLite connection logic in SessionManager here to avoid duplication of code. We'll need to abstract out some of these pieces and share them across classes.

Well, the SQL connection/statement logic is fairly simple and I'm not sure if it's worth abstracting. The SQL statements are all fairly self-contained and only take 2, maybe 3 lines per statement. Over-abstracting and splitting the source over too many different files hinders readability. But I'll take a closer look at this concern when I'm done with my current work.

My number one concern about the current implementation is that the file ID isn't strictly unique—i.e. it's just an index in the database. In order to make this service more broadly useful, I think these IDs must be unique.
...
This service could be useful for sharing notebooks between two running Jupyter Servers (not necessarily RTC, just basic copying between two servers)—an enhancement many people have asked us for.

I'm lost as to why a file ID needs to be, say, "globally unique" to support this use-case. This is just a file copy. If the receiving server needs data (e.g. comments) attached to that copy, the sending server sends that data in tandem. The receiving server is then free to assign a different ID to that new file.

IMO, there's no reason for these two servers to share a file ID for a given file on two separate filesystems because... well, they're not the same file. A copy of a file is a new file. This is how it's currently working in local filesystems. Circling back to commenting, if I add a comment in one file, I should not be adding a comment to another file just because it's a copy of the original and shares the same ID. Shared IDs across files leads to shared state across files, which gets complex really fast.

I'm mainly giving pushback here because using a UUID for the primary key is a performance detriment. I'm not giving a firm "no" here.

@echarles Great questions! Yes, these concerns will be taken into much further study, but only after we handle operations on local filesystems. Could you elaborate more on what you mean by "distributed [Jupyter] servers"? AFAIK JupyterHub is still a single server, so it seems more accurate to refer to it as "remote" rather than "distributed".

Regarding portability and uniqueness, could you elaborate more on these as well? @kevin-bates brought these points up too. Do you mean that file IDs should be globally unique as mentioned earlier? Again, I'm still a bit lost on the use case for this; to my knowledge Jupyter server only runs on a single machine, and a client can't connect to multiple Jupyter servers.

Rough sketch on how to handle remote Jservers/filesystems: The idea is that the ContentsManager implementation for remote Jservers/filesystems should expose methods like dir_exists(), file_exists(), stat(), etc. that the FileIdManager invokes via self.parent.X(). For local filesystems, self.parent refers to the default implementation for local filesystems, FileContentsManager.

@echarles
Copy link
Member

(disclaimer: I have not taken the needed time to read and digest the dense and very useful discussion around the file id service, but think it is not too early to lay down any requirements)

Great questions! Yes, these concerns will be taken into much further study, but only after we handle operations on local filesystems.

Well, if the design and the technical implementation shows to be fairly different when we would look after, this may be annoying (e.g. if the generated database in case of many/distributed/remote servers is different form le local database). To which extends, not sure, one could argue we don't really care as the local database would be anyway not migrated to any such distributed system. But playing devil advocate, I could imagine my local server has a custom server extension that sends to a corporate backend those IDs and that one day my notebooks would have to be served from JupyterHub. The reverse is also true. A JupyterHub instance could allow the downoad of the notebook that could be run on my local laptop.

In short, I would prefer to cover the JupyterHub case in the design.

Could you elaborate more on what you mean by "distributed [Jupyter] servers"? AFAIK JupyterHub is still a single server, so it seems more accurate to refer to it as "remote" rather than "distributed".

I am thinking to a case where the notebook files are loaded from a shared storage system, e.g. NFS. In that case, the user goes today to server 1 instance, and tomorrow to server 2 instance.

Another case is RTC where a server hosted by JupyterHub will be consumed by multiple users.

So true, "distributed" is a bad name, IMHO a server is always "remote", "ephemeral" may be better but still not very good..

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dlqqq - thanks for getting this started. Here are some initial comments.

I think there are some things to work out - primarily uniqueness in both the id (UUID vs. Integer) and path (how to deal with different root_dir values). FWIW, I took a stab at converting the implementation to use UUIDs here.

Is it expected that things like "comments", or anything that references a File ID, be in this same database? If not, I think that argues for UUIDs as well - otherwise external references could get a false positive and be "linked" to files other than what is intended.

@vidartf has a good point of perhaps decoupling this from the ContentsManager implementation and using the event bus.

@@ -1886,9 +1887,9 @@ def init_configurables(self):
connection_dir=self.runtime_dir,
kernel_spec_manager=self.kernel_spec_manager,
)
self.file_id_manager = FileIdManager(parent=self, log=self.log)
Copy link
Member

@kevin-bates kevin-bates Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably define a file_id_manager_class configurable like the other managers define so folks can BYO their own FID manager. This also suggests defining an abstract base class that is then referenced in the klass attribute which allows folks to either derive from FileIdManager or essentially implement their own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I forgot to define FileIdManager as a trait on ServerApp. Before I fix this, is there any reason to prefer

self.file_id_manager_class = Type(klass=FileIdManager)
...
self.file_id_manager = self.file_id_manager_class(*args)

over just using

self.file_id_manager = Instance(klass=FileIdManager, args=(*args))

Copy link
Member

@kevin-bates kevin-bates Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think either is what we want.

Both imply that one has to derive from FileIdManager - which we shouldn't impose on anyone. So klass should reference an ABC (e.g., FileIdManagerABC) that only defines the public APIs as abstract. It could also contain validation logic for the db-path trait, etc.

So the user would invoke the server using --ServerApp.file_id_manager_class = my.package.MyFileIdManager and the trait would ensure that my.package.MyFileIdManager is an instance of FileIdManagerABC and whatever interacts with the file ID manager class would be relegated to the methods defined in the ABC.

Folks that want to slightly adjust the OOTB implementation of FileIdManager (or override all of its methods entirely) are free to do so via subclassing as they are still an instance of the ABC.

For example, with the last set of changes, an entirely different implementation is required if any non-filesystem-based ContentsManager is configured (unfortunately).

jupyter_server/services/contents/fileidmanager.py Outdated Show resolved Hide resolved
jupyter_server/services/contents/fileidmanager.py Outdated Show resolved Hide resolved
jupyter_server/services/contents/fileidmanager.py Outdated Show resolved Hide resolved
jupyter_server/services/contents/fileidmanager.py Outdated Show resolved Hide resolved
jupyter_server/services/contents/fileidmanager.py Outdated Show resolved Hide resolved
jupyter_server/services/contents/manager.py Show resolved Hide resolved
@dlqqq dlqqq mentioned this pull request Aug 2, 2022
8 tasks
@dlqqq
Copy link
Contributor Author

dlqqq commented Aug 2, 2022

Hey @kevin-bates! Thanks for the review. You can make some excellent points. However, I literally was just about to push my latest changes on top of this PR. I'll do my best to address your comments soon.

To the rest of the team, I wanted to segment design discussion to a separate tracking issue, since this project will likely have to span multiple PRs due to its size. See here: #940

@dlqqq
Copy link
Contributor Author

dlqqq commented Aug 2, 2022

I'm pushing my latest changes (which track out-of-band filesystem operations) now.

EDIT: oh geez, the commit history is weirdly interweaved with review comments. yikes. LMK if you all want me to rebase and edit the commit dates.

@dlqqq dlqqq changed the title draft: File ID service draft: File ID Manager base implementation Aug 2, 2022
# uniqueness constraint relaxed here because we need to keep records
# of deleted files which may occupy same path
"path TEXT NOT NULL, "
"ino INTEGER NOT NULL UNIQUE, "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inode (st_ino) is not unique by itself and needs to be coupled with st_dev to gain uniqueness within a set of devices managed by a given server. I also think we will have issues when mounted file systems are in play (e.g., I think I recall that the inode reflects that of the NFS server for example, not necessarily the client mounting the filesystem, besides other idiosyncrasies that will creep in as these kinds of things vary with OS), not to mention non-file systems like Databases or Cloud Object Storage, etc. So this feels like a bit of a slippery slope, IMHO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... right. Good catch. I'll see what I can do about this. I think by default, the File ID Manager will only support local filesystems and NFS. For other cloud services, it's possible that they already expose a file ID, so the custom File ID manager implementation should be straightforward.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For other cloud services, it's possible that they already expose a file ID, so the custom File ID manager implementation should be straightforward.

I believe this is saying that non-file system-based ContentsManager derivations must implement their own FileIdManager. Is that correct?

Regarding NFS, I wouldn't be surprised if the st_dev value changes if a previously mounted FS is unmounted, then remounted (following the mount of others or even a reboot). (This hunch may be confirmed if I'm reading things correctly.)

Copy link
Member

@echarles echarles Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is saying that non-file system-based ContentsManager derivations must implement their own FileIdManager. Is that correct?

If this is the case (prolly it is), the FileIdManager implementation should be configurable (like the ContentManager...). It may already be the case (have not looked at the code, sorry).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echarles - this is discussed here: #921 (comment) although I don't think we should require that.

Personal benchmarks show a very significant performance gain when
calling get_path(). This is because we are no longer scanning all the
contents of every indexed-but-moved directory in _sync_all(), but rather
intelligently appending the new paths of the directories to check for
dirtiness to a deque used in the body of _sync_all().

Also fixes a few bugs (sorry, couldn't split this commit up):
- Indexing symlinks always indexes the real path that symlink refers to
- Fixes bug with the way new paths were calculated in recursive moves
@codecov-commenter
Copy link

Codecov Report

Merging #921 (86fcaf3) into main (1ec1aee) will increase coverage by 1.21%.
The diff coverage is 100.00%.

❗ Current head 86fcaf3 differs from pull request most recent head fb8cc40. Consider uploading reports for the commit fb8cc40 to get more accurate results

@@            Coverage Diff             @@
##             main     #921      +/-   ##
==========================================
+ Coverage   71.44%   72.66%   +1.21%     
==========================================
  Files          65       66       +1     
  Lines        7705     8084     +379     
  Branches     1289     1339      +50     
==========================================
+ Hits         5505     5874     +369     
+ Misses       1805     1804       -1     
- Partials      395      406      +11     
Impacted Files Coverage Δ
jupyter_server/pytest_plugin.py 88.88% <100.00%> (+0.70%) ⬆️
jupyter_server/serverapp.py 65.88% <100.00%> (+0.60%) ⬆️
jupyter_server/services/contents/fileidmanager.py 100.00% <100.00%> (ø)
jupyter_server/services/contents/filemanager.py 72.25% <100.00%> (+0.05%) ⬆️
jupyter_server/services/contents/handlers.py 86.55% <100.00%> (ø)
jupyter_server/services/contents/manager.py 83.29% <100.00%> (+0.56%) ⬆️
jupyter_server/auth/identity.py 82.93% <0.00%> (-7.27%) ⬇️
jupyter_server/auth/security.py 75.67% <0.00%> (+0.33%) ⬆️
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@dlqqq
Copy link
Contributor Author

dlqqq commented Aug 22, 2022

@kevin-bates Thank you for your suggestions! I addressed as many as I could, but I believe it's best if I start working on separating my work into a standalone server extension. The great thing about this is that we can split up each of your concerns into a separate PR to address them.

@dlqqq
Copy link
Contributor Author

dlqqq commented Sep 7, 2022

Hey team! This is now migrated into a separate server extension: https://github.com/jupyter-server/jupyter_server_fileid

Thank you all for your review comments! I've left open review questions as issues on that repo.

@dlqqq dlqqq closed this Sep 7, 2022
@ellisonbg
Copy link
Contributor

ellisonbg commented Oct 11, 2022 via email

@kevin-bates
Copy link
Member

  • The scoping of a given FileID service is a ContentManager. Thus, if a
    user has multiple ContentManagers serving different file systems, they
    would presumably have 1 file ID service for each of this. For example if a
    user has a local file system and an S3 bucket exposed through 2
    ContentManagers, they would have two File ID services.

(This is why it would have been nice if FileIDs were a function of (and emitted from) the ContentsManager.)
Since the FileID service is now a consumer of ContentsManager events, how will a given FileID service instance determine that the received event was relative to a LargeFileContentsManager, an S3ContentsManager, or a FooContentsManager?

  • Related to this case, I can imagine situations where a user wants to move
    or copy a file between different ContentManager (copy file to S3 for
    example). In that case, it may help to have UUIDs that are globally unique.

Yes, but more importantly, is that once objects referencing FileIDs find their way to other systems (via sharing or whatever), they will be incorrectly associated with the wrong files until UUIDs are used as the primary ID.

@ellisonbg
Copy link
Contributor

ellisonbg commented Oct 11, 2022 via email

@kevin-bates
Copy link
Member

I see, thank you for your response. Just to be clear, I'm not as worried about the filesystem issues as I am about avoiding data integrity emergencies - per my second point above - which we can continue discussing in the linked issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants