-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cache for Query.query('exists') #317
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -49,6 +49,7 @@ def __init__(self, data_dir, repo_dir): | |
self.data_dir = data_dir | ||
self.dts_comp_support = int(self.script('dts-comp')) | ||
self.db = data.DB(data_dir, readonly=True, dtscomp=self.dts_comp_support) | ||
self.file_cache = {} | ||
|
||
def script(self, *args): | ||
return script(*args, env=self.getEnv()) | ||
|
@@ -124,22 +125,22 @@ def query(self, cmd, *args): | |
return decode(self.script('get-type', version, path)).strip() | ||
|
||
elif cmd == 'exist': | ||
|
||
# Returns True if the requested file exists, otherwise returns False | ||
|
||
version = args[0] | ||
path = args[1] | ||
|
||
dirname, filename = os.path.split(path) | ||
|
||
entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1] | ||
for entry in entries: | ||
fname = entry.split(" ")[1] | ||
if version not in self.file_cache: | ||
version_cache = set() | ||
last_dir = None | ||
for _, path in self.db.vers.get(version).iter(): | ||
dirname, filename = os.path.split(path) | ||
if dirname != last_dir: | ||
last_dir = dirname | ||
version_cache.add(dirname) | ||
version_cache.add(path) | ||
|
||
if fname == filename: | ||
return True | ||
self.file_cache[version] = version_cache | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So the cache is never cleared. As you mentioned in the PR description (but not commit message), this cache is local to the query. What are the memory usage considerations? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Around 12MB for Linux v6.9.4. And then it's collected. It would be great to cache this for longer, but it's not going to be possible without some locking. Another thing to try in the future would be to switch to a Python git implementation, for example dulwich. It does ls-tree faster, although I still see it takes around 5ms, while this takes orders of magnitude less once cache is created. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Another issue with keeping this cache across requests is that we need to detect reindexing to retrigger the generation. This is complex and error prone. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please add a mention of this memory usage somewhere in the commit message so that it doesn't disappear into the void. Then this can be marked as resolved. |
||
|
||
return False | ||
return path.strip('/') in self.file_cache[version] | ||
|
||
elif cmd == 'dir': | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you change from calling
./script get-dir
to using the versions database to get the information. The commit message is lying and hiding important information. Please make it two commits. Both should expand on why each change is being done (why switch approach + why use a cache, with numbers). Thanks!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The commit message maybe is misleading, but I don't see how to split this into two commits.
Also note that 'exist' mode was added with a Makefile filter ac491ab
And is not actually used outside of these filters https://github.com/search?q=repo%3Abootlin%2Felixir+query%28%27exist%27&type=code
Part of this "cache" is created during update
elixir/update.py
Lines 169 to 176 in 0de9a05
elixir/script.sh
Lines 122 to 135 in 0de9a05
git ls-tree
It's also used in definitions getter, so if it's incorrect, we have much more important issues to fix than the Makefile filters.
elixir/query.py
Lines 316 to 328 in 0de9a05
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I don't get:
My understanding was that you have two ways to do the same test:
Are those two not equivalent? If they are then this could be done in commit 1, then commit 2 would introduce the caching mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some benchmarks for all Makefiles in Linux v6.9.4
Total request times with no cache:
Total request times with cache:
I did this twice, and results were similar. I don't have any HTTP caches set up for that server. Max processing time went down, but mean and total processing time for all Makefiles went up. I will try implementing some heuristics to pick between the cache and the old implementation depending on number of files to search. Each 'exists' operation took ~20-30 ms, setting up the cache takes ~200-300ms, so (for Linux v6.9.4, on my server) the cache becomes worth it if more than ~12 checks will be made.