
add boolean sync argument to get_path() #20

Merged
merged 1 commit into from
Oct 19, 2022

Conversation

dlqqq
Collaborator

@dlqqq dlqqq commented Oct 19, 2022

Allows get_path() to be called multiple times in serial without degrading performance.

Whenever get_path() is called, the Files table must be synced with the local filesystem beforehand via sync_all(). This scans over the entire file tree at the server root and essentially ensures each path is mapped to the correct file ID. This can be an expensive operation, and thus we want to minimize calls to sync_all().

This PR allows users to provide sync=False and simply call sync_all() manually beforehand. This is useful for scenarios where developers are calling get_path() multiple times in serial, and sync_all() does not need to be called between invocations. For example:

file_id_manager.sync_all()
for id in ids:
  file_id_manager.get_path(id, sync=False)

Given that sync_all() represents almost all of the performance cost of get_path(), this approach is roughly N times faster than calling file_id_manager.get_path(id) in serial, where N is the length of ids.
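The effect of the sync flag described above can be sketched with a minimal stand-in class. This is only an illustration of the flag's semantics, not the actual jupyter_server_fileid implementation; the counter attribute is invented so the number of expensive scans is visible.

```python
# Minimal sketch of the sync flag added in this PR; the class only mimics
# the shape of the real FileIdManager so the flag's effect is observable.
class FileIdManager:
    def __init__(self, paths_by_id):
        self._paths_by_id = dict(paths_by_id)
        self.sync_calls = 0  # counts expensive full-tree scans (invented)

    def sync_all(self):
        # In the real manager this walks the server root and reconciles
        # every path with its file ID; here we only count invocations.
        self.sync_calls += 1

    def get_path(self, id, sync=True):
        # sync=True (the default) preserves the old behavior of syncing
        # before every lookup; sync=False assumes the caller already synced.
        if sync:
            self.sync_all()
        return self._paths_by_id.get(id)

manager = FileIdManager({1: "a.txt", 2: "b.txt", 3: "c.txt"})
manager.sync_all()  # one explicit sync up front
paths = [manager.get_path(id, sync=False) for id in (1, 2, 3)]
```

With sync=False, the loop performs only cheap lookups, so exactly one scan happens regardless of how many IDs are resolved.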

@codecov-commenter

codecov-commenter commented Oct 19, 2022

Codecov Report

Base: 78.59% // Head: 78.49% // Decreases project coverage by 0.10% ⚠️

Coverage data is based on head (a254942) compared to base (0dd8ca0).
Patch coverage: 83.33% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #20      +/-   ##
==========================================
- Coverage   78.59%   78.49%   -0.10%     
==========================================
  Files           5        5              
  Lines         369      372       +3     
  Branches       51       52       +1     
==========================================
+ Hits          290      292       +2     
  Misses         65       65              
- Partials       14       15       +1     
Impacted Files Coverage Δ
jupyter_server_fileid/manager.py 88.33% <83.33%> (-0.28%) ⬇️


@dlqqq dlqqq merged commit 16535e2 into jupyter-server:main Oct 19, 2022
@kevin-bates
Member

Shouldn't a complete system synchronization operation be performed in the background and not on each "get" call? I don't think surfacing this in the API is correct - nor is requiring users to call sync_all() to ensure the service is up to date.

(I also see that get_id() does perform a transaction - per the PUT vs. GET discussion.)

For now, would it make sense to expose a "list-based" API (e.g., get_paths(ids: List[str]) -> List[Tuple[str, str]]) that takes a list of Ids and returns a list of tuples (Id, path)? This way you're not exposing a sync flag that would become obsolete once synchronization is moved into the background as an internal optimization.

A mirrored get_ids() would be a bit more clunky and probably less popular, but should probably exist for consistency.

If this approach was adopted, it might make more sense to return a Dict[str, str] since that can be more naturally represented in JSON (in the REST response).
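The list-based API proposed above might take roughly the following shape. This is a sketch under the assumptions in the comment (a Dict[str, str] return for natural JSON serialization); the class body is invented and not the merged implementation.

```python
from typing import Dict, List

# Hypothetical shape of the suggested get_paths() API. All internals here
# are assumptions made for illustration only.
class FileIdManager:
    def __init__(self, paths_by_id: Dict[str, str]):
        self._paths_by_id = dict(paths_by_id)

    def sync_all(self) -> None:
        pass  # stand-in for the expensive filesystem scan

    def get_paths(self, ids: List[str]) -> Dict[str, str]:
        # One sync up front, then cheap lookups for every requested ID.
        self.sync_all()
        return {id: self._paths_by_id[id]
                for id in ids if id in self._paths_by_id}

manager = FileIdManager({"a": "nb1.ipynb", "b": "nb2.ipynb"})
result = manager.get_paths(["a", "b", "missing"])  # unknown IDs are dropped
```

Returning a dict keyed by ID also sidesteps the question of result ordering that a list of tuples would raise.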

@dlqqq
Collaborator Author

dlqqq commented Oct 20, 2022

@kevin-bates Thank you for your input; I had actually considered exactly the get_paths() approach you suggested, where it would accept a list of IDs and return a dictionary mapping ID => path. However, there's not much of a readability gain, given that both approaches require some initialization before iterating over a list of IDs. Compare the following examples:

Using get_paths():

paths_by_id = file_id_manager.get_paths([record.id for record in records])

for record in records:
  model = Model.from_orm(record)
  model.path = paths_by_id[model.id]

Using sync_all() with get_path(..., sync=False):

file_id_manager.sync_all()

for record in records:
  model = Model.from_orm(record)
  model.path = file_id_manager.get_path(model.id, sync=False)

Which is more readable is debatable, in my opinion. While get_paths() is much clearer in its purpose than sync_all(), it's not obvious why get_paths() is any faster than calling get_path() several times in serial. Furthermore, it would be confusing that get_ids() is absent from the methods we expose.

At the end of the day, distinguishing between identical procedures with different performance characteristics inherently relies on knowledge of implementation details. We believe it's better to be explicit: developers should be aware that get_path() calls sync_all(), which is expensive, and that they can achieve serious performance gains by calling sync_all() once before getting multiple paths in serial. This is similar to asking SQL developers to commit transactions manually rather than automatically for performance gains, even though it's not obvious how "committing" relates to reading and writing data in the database.
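The SQL analogy above can be made concrete with the standard-library sqlite3 module: batching many writes into one explicit transaction, just as sync_all() can be called once before many get_path() lookups. The table and data are invented for illustration.

```python
import sqlite3

# One explicit transaction for a whole batch of writes, instead of one
# commit per statement. Schema and rows are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id TEXT PRIMARY KEY, path TEXT)")

with conn:  # the connection context manager commits once on exit
    conn.executemany(
        "INSERT INTO files VALUES (?, ?)",
        [("1", "a.txt"), ("2", "b.txt"), ("3", "c.txt")],
    )

row_count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

In both cases the expensive bookkeeping step is hoisted out of the loop and performed once, explicitly, by the caller.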

@dlqqq
Collaborator Author

dlqqq commented Oct 20, 2022

Shouldn't a complete system synchronization operation be performed in the background and not on each "get" call?

Well, if you're willing to sacrifice consistency between the filesystem and the Files DB table, then sure. We could call sync_all() on a time interval, say every 10 seconds, in the background. However, if get_path() doesn't call sync_all() every time, then a file moved after the last sync interval would cause get_path() to return the wrong path.

Essentially this is a tradeoff between consistency and performance. My objection to adding this feature as a configurable trait is that I don't see a demonstrable use case where you want a file path given a file ID and are somehow OK with it being out-of-date. If there is such a use case, I still believe it should not be the default. We should prioritize making our methods as correct as possible out-of-the-box, while allowing users to opt in to features that sacrifice correctness.

@dlqqq
Collaborator Author

dlqqq commented Oct 20, 2022

(I also see that get_id() does perform a transaction

Yes. Basically, we delete a record when we find one with the same inode number but different timestamps, since we assume the file at the current path is a completely different file from the one recorded. Otherwise, if both the inode number and timestamps match, we update the record with the current path.

I believe this is an implementation detail that just helps maintain consistency between the filesystem and the Files table. I don't think it should change the HTTP verb we use.
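The reconciliation rule described above can be sketched as a pure function: a record whose inode matches but whose timestamps differ is treated as a different file and dropped, while a full match has its path updated. Field names and the dict representation are invented for illustration.

```python
# Hypothetical sketch of the inode/timestamp reconciliation rule from the
# comment above. 'ino', 'mtime', and 'path' are invented field names.
def reconcile(record, stat):
    """Return the updated record, or None if it should be deleted."""
    if record["ino"] == stat["ino"] and record["mtime"] != stat["mtime"]:
        # Same inode, different timestamps: assume a different file now
        # occupies this inode, so drop the stale record.
        return None
    if record["ino"] == stat["ino"] and record["mtime"] == stat["mtime"]:
        # Same inode and timestamps: the file was moved; update its path.
        return {**record, "path": stat["path"]}
    return record  # unrelated record: leave as-is

rec = {"ino": 7, "mtime": 100, "path": "old.txt"}
moved = reconcile(rec, {"ino": 7, "mtime": 100, "path": "new.txt"})
replaced = reconcile(rec, {"ino": 7, "mtime": 200, "path": "old.txt"})
```

Keeping this logic inside the manager is what lets it remain an implementation detail rather than part of the HTTP contract.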

@ellisonbg

The proposal that I had was to add a get_paths() API and have both get_path() and get_paths() accept a time delta argument. The implementation would then look at how long it has been since sync_all() has been called and call it again if that time is longer than the time delta argument. That would let the caller determine the time interval over which it wants to make sure that filesystem changes are detected, while allowing multiple clients to call these APIs without worrying about calling sync_all() needlessly. Alternatively, we could make that time delta a configurable trait on the local file system file id manager.
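The time-delta idea above might look roughly like the following. This is a sketch with invented names and defaults, not the actual jupyter_server_fileid API; the counter attribute exists only to make the behavior observable.

```python
import time

# Hypothetical time-delta variant: resync only when the last sync_all()
# is older than max_age seconds. All names here are assumptions.
class FileIdManager:
    def __init__(self, paths_by_id):
        self._paths_by_id = dict(paths_by_id)
        self._last_sync = None
        self.sync_calls = 0  # invented, for observability

    def sync_all(self):
        # Stand-in for the expensive full-tree scan.
        self.sync_calls += 1
        self._last_sync = time.monotonic()

    def get_path(self, id, max_age=10.0):
        # Skip the scan if a recent-enough sync already happened.
        stale = (self._last_sync is None
                 or time.monotonic() - self._last_sync > max_age)
        if stale:
            self.sync_all()
        return self._paths_by_id.get(id)

manager = FileIdManager({"1": "a.txt", "2": "b.txt"})
first = manager.get_path("1")   # triggers the first (and only) sync
second = manager.get_path("2")  # well within max_age: no resync
```

Because the staleness check is shared, any number of callers can invoke get_path() without coordinating: at most one scan happens per max_age window.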

The big picture, though, is that the file ID APIs will likely need to update state on a regular interval, and we need to make sure we do that in a manner that allows multiple clients to call these APIs without calling sync_all() too many times. We should also probably localize the calls to get_paths() made from JupyterLab to one place, so we can update the paths for all open files in a single network call.

@ellisonbg

I should note that the sync flag wouldn't be able to optimize the case of multiple clients calling these APIs in a manner that doesn't lead to calling sync_all() too many times.

@kevin-bates
Member

Essentially this is a tradeoff between consistency and performance

Yes, and I believe you're optimizing for the wrong thing. You've decided to penalize the most common API calls (get_xxx()) to address use cases that, frankly, aren't going to be that common, especially when syncing at startup. (Heck, out-of-band updates don't really exist in Cloud storage or DB-based CMs.) I'm not saying you need to discard synchronization, just that it should be moved into the background, since users should not have to care or worry about their files being synchronized, or specify a time delta!

(I apologize. I will no longer comment on implementation details. I will, however, continue to comment on API decisions since that is the contract by which other FileID Services implementations must adhere and they should be able to hide these kinds of decisions from their users.)

@ellisonbg

ellisonbg commented Oct 20, 2022

Thinking a bit more overnight. If there are N RTC clients, those clients would need to poll get_paths() on a regular interval, which causes a lot of issues. I think a better approach is as follows:

  • The file id manager should call sync_all in a subprocess on a regular interval.
  • Each time there is a path change, it should publish an event on the event bus.

This way, each client won't have to poll the server to get all these updates. I think this aligns with how @kevin-bates is thinking in his last comment as well.
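The background-sync design sketched in the bullets above could diff each periodic snapshot against known state and publish an event per detected move. In this sketch the "event bus" is just a list of callbacks, not the real Jupyter event system, and all names are invented.

```python
# Hypothetical sketch of periodic sync + path-change events. A real
# implementation would run the scan in a subprocess on an interval; here
# apply_snapshot() stands in for one tick of that loop.
class FileIdManager:
    def __init__(self):
        self._paths_by_id = {}
        self._listeners = []

    def on_path_change(self, callback):
        """Register a listener for (id, old_path, new_path) events."""
        self._listeners.append(callback)

    def apply_snapshot(self, snapshot):
        """Called by the periodic background sync with {id: path}."""
        for id, new_path in snapshot.items():
            old_path = self._paths_by_id.get(id)
            if old_path is not None and old_path != new_path:
                for cb in self._listeners:
                    cb(id, old_path, new_path)  # publish the move
            self._paths_by_id[id] = new_path

events = []
manager = FileIdManager()
manager.on_path_change(lambda id, old, new: events.append((id, old, new)))
manager.apply_snapshot({"1": "a.txt"})          # initial state: no events
manager.apply_snapshot({"1": "moved/a.txt"})    # move detected: event fired
```

Pushing events this way inverts the polling model: N clients each subscribe once instead of each polling get_paths() on its own interval.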

@dlqqq
Collaborator Author

dlqqq commented Oct 20, 2022

@kevin-bates

I'm not saying you need to discard synchronization, just saying it should be moved into the background since users should not care or have to worry about their files being synchronized or specify a time delta!

Hm, ever since this project was started, our north star has been as follows:

  1. file ID should be as correct as possible out-of-the-box
  2. file ID should be optimized for the most common use-case, single users on local instances accessing local filesystems

If we do add a sync interval and make it the default (i.e. get_path() no longer calls sync_all() by default), we're essentially changing both points to read "file ID balances correctness and performance while attempting to accommodate multiple common use cases". I think that's fine (there are pros and cons to both), but it requires agreement from others who may want to use file ID. I'll chat with our team members working on the notebook scheduling extensions and get back to you on what we think.

(I apologize. I will no longer comment on implementation details. I will, however, continue to comment on API decisions since that is the contract by which other FileID Services implementations must adhere and they should be able to hide these kinds of decisions from their users.)

No, I think your comments are super valuable! To be honest, you've been one of the only other contributors critiquing our design. We value your feedback a lot.

@davidbrochart
Contributor

  • The file id manager should call sync_all in a subprocess on a regular interval.

Yes, this would offload the server quite a bit, since the current implementation is not async (see my concerns about performance in #13).
