Add cache for Query.query('exists') #317
base: master
Conversation
7262618 to 955a2b6
if dirname != last_dir:
    last_dir = dirname
    version_cache.add(dirname)
version_cache.add(path)
Here you change from calling `./script get-dir` to using the versions database to get the information. The commit message is lying and hiding important information. Please make it two commits, and have both explain why each change is being made (why switch approach, and why use a cache, with numbers). Thanks!
The commit message may be misleading, but I don't see how to split this into two commits.
Also note that 'exist' mode was added with a Makefile filter in ac491ab,
and it is not actually used outside of these filters: https://github.com/search?q=repo%3Abootlin%2Felixir+query%28%27exist%27&type=code
Part of this "cache" is created during update
Lines 169 to 176 in 0de9a05
blobs = scriptLines('list-blobs', '-p', tag)
buf = []
for blob in blobs:
    hash, path = blob.split(b' ', maxsplit=1)
    with blobs_lock:
        idx = db.blob.get(hash)
        buf.append((idx, path))
Lines 122 to 135 in 0de9a05
if [ "$opt1" = '-p' ]; then
    # "path" option: return blob hash and full path
    format='\1 \2'
elif [ "$opt1" = '-f' ]; then
    # "file" option: return blob hash and file name (without its path)
    format='\1 \4'
else
    # default option: return only blob hash
    format='\1'
    v=`echo $opt1 | version_rev`
fi
git ls-tree -r "$v" |
    sed -r "s/^\S* blob (\S*)\t(([^/]*\/)*(.*))$/$format/; /^\S* commit .*$/d"
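For readers less comfortable with sed, here is a hypothetical Python rendition of the same parsing (the function name and `mode` values are illustrative; the real project does this in shell):

```python
import re

# One line of `git ls-tree -r` output looks like:
#   "<mode> blob <hash>\t<dir>/<file>"
LS_TREE_RE = re.compile(r'^\S+ blob (\S+)\t(([^/]*/)*(.*))$')

def parse_ls_tree_line(line, mode='hash'):
    """Mimic the sed formats above: '\\1 \\2' (-p), '\\1 \\4' (-f), '\\1' (default).
    Returns None for non-blob entries, like the sed `d` command for commits."""
    m = LS_TREE_RE.match(line)
    if m is None:
        return None
    blob_hash, full_path, _, file_name = m.groups()
    if mode == 'path':    # -p: blob hash and full path
        return f'{blob_hash} {full_path}'
    if mode == 'file':    # -f: blob hash and file name only
        return f'{blob_hash} {file_name}'
    return blob_hash      # default: hash only

# parse_ls_tree_line('100644 blob abc\tkernel/fork.c', 'path') -> 'abc kernel/fork.c'
```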
`git ls-tree` is also used in the definitions getter, so if it's incorrect, we have much more important issues to fix than the Makefile filters.
Lines 316 to 328 in 0de9a05
def get_idents_defs(self, version, ident, family):
    symbol_definitions = []
    symbol_references = []
    symbol_doccomments = []
    if not self.db.defs.exists(ident):
        return symbol_definitions, symbol_references, symbol_doccomments
    if not self.db.vers.exists(version):
        return symbol_definitions, symbol_references, symbol_doccomments
    files_this_version = self.db.vers.get(version).iter()
Sorry, I don't get:
The commit message maybe is misleading, but I don't see how to split this into two commits.
My understanding was that you have two ways to do the same test:
# Old approach (rewritten a bit)
dirname, filename = os.path.split(path)
entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1]
entries = [entry.split(" ")[1] for entry in entries]
return filename in entries
# New approach
version_cache = set()
last_dir = None
for _, path in self.db.vers.get(version).iter():
dirname, filename = os.path.split(path)
if dirname != last_dir:
last_dir = dirname
version_cache.add(dirname)
version_cache.add(path)
return path.strip('/') in version_cache
Are those two not equivalent? If they are, then this could be done in commit 1, and commit 2 would introduce the caching mechanism.
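Put together with the `self.file_cache` line visible further down in the diff, the cached check could look roughly like the following sketch (class and attribute names are assumed from the snippets in this thread, and `FakeVersDb` is an invented stand-in for `self.db.vers`, not part of the project):

```python
import os

class ExistsCache:
    """Sketch of the per-Query 'exists' cache discussed in this PR."""

    def __init__(self, vers_db):
        self.vers_db = vers_db        # stands in for self.db.vers
        self.file_cache = {}          # version -> set of paths and dirnames

    def exists(self, version, path):
        if version not in self.file_cache:
            version_cache = set()
            last_dir = None
            for _, p in self.vers_db.get(version).iter():
                dirname, _filename = os.path.split(p)
                if dirname != last_dir:   # cheap dedupe for sorted paths
                    last_dir = dirname
                    version_cache.add(dirname)
                version_cache.add(p)
            self.file_cache[version] = version_cache
        return path.strip('/') in self.file_cache[version]

class FakeVersDb:
    """Invented test double yielding (blob_id, path) pairs."""
    def get(self, version):
        return self
    def iter(self):
        return iter([(0, 'Makefile'), (1, 'kernel/Makefile'), (2, 'kernel/fork.c')])
```

The first call per version pays the full iteration cost; every later call is a set membership test, which is what makes repeated Makefile-filter checks cheap.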
I did some benchmarks for all Makefiles in Linux v6.9.4
Total request times with no cache:
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_no_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
"min": 91.74161,
"max": 4178.306757,
"sum": 563964.5041110016,
"mean": 187.1770674115505
}
Total request times with cache:
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
"min": 91.613575,
"max": 1283.572011,
"sum": 716206.7588370013,
"mean": 237.15455590629182
}
I did this twice, and results were similar. I don't have any HTTP caches set up for that server. Max processing time went down, but the mean and total processing time for all Makefiles went up. I will try implementing some heuristics to pick between the cache and the old implementation depending on the number of files to search. Each 'exists' operation took ~20-30 ms and setting up the cache takes ~200-300 ms, so (for Linux v6.9.4, on my server) the cache becomes worth it if more than ~12 checks will be made.
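The break-even point quoted above follows from simple arithmetic on the measured timings (the function name is illustrative; the exact threshold obviously depends on the tree and the machine):

```python
import math

def break_even_calls(setup_ms, per_call_saved_ms):
    """Number of 'exists' calls after which building the cache pays off,
    assuming cached lookups are essentially free compared to git ls-tree."""
    return math.ceil(setup_ms / per_call_saved_ms)

# ~200-300 ms setup, each call saves roughly 20-30 ms:
optimistic = break_even_calls(200, 30)    # cheap setup, slow old calls
pessimistic = break_even_calls(300, 20)   # expensive setup, fast old calls
```

With the measured ranges this gives roughly 7 to 15 calls, which is consistent with the ~12 quoted above.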
if fname == filename:
    return True
self.file_cache[version] = version_cache
So the cache is never cleared. As you mentioned in the PR description (but not in the commit message), this cache is local to the query. What are the memory usage considerations?
>>> q = query.Query('/srv/elixir-data/linux/data', '/srv/elixir-data/linux/repo')
>>> version = 'v6.9.4'
>>> version_cache = set()
>>> last_dir = None
>>> for _, path in q.db.vers.get(version).iter():
... dirname, filename = os.path.split(path)
... if dirname != last_dir:
... last_dir = dirname
... version_cache.add(dirname)
... version_cache.add(path)
...
# (recursive sizeof recipe from https://docs.python.org/3/library/sys.html#sys.getsizeof)
>>> total_size(version_cache)
11937861
Around 12 MB for Linux v6.9.4, and then it's collected. It would be great to cache this for longer, but that's not going to be possible without some locking. Another thing to try in the future would be to switch to a Python git implementation, for example dulwich. It does ls-tree faster, although I still see it take around 5 ms, while this cache takes orders of magnitude less once it's created.
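The `total_size` helper used in the session above is not in the standard library; a minimal sketch of the recursive-sizeof recipe, simplified to the container types that actually appear in this cache, is:

```python
import sys

def total_size(obj, seen=None):
    """Recursively sum sys.getsizeof over an object and its contents.
    Simplified sketch of the recipe referenced above: handles dict, set,
    list, tuple; counts each shared object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(total_size(k, seen) + total_size(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (set, frozenset, list, tuple)):
        size += sum(total_size(item, seen) for item in obj)
    return size
```

Unlike a bare `sys.getsizeof(version_cache)`, which only counts the set's internal table, this also counts the path strings themselves, which is where most of the ~12 MB goes.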
Another issue with keeping this cache across requests is that we need to detect reindexing to retrigger the generation. This is complex and error prone.
Please add a mention of this memory usage somewhere in the commit message so that it doesn't disappear into the void. Then this can be marked as resolved.
Tested on staging, can see the perf improvement when loading the root
This PR adds a simple cache for the 'exists' query. Currently, 'exists' calls `git ls-tree` and parses the results to check whether a file exists. A single call takes around 20-30 ms. Some Makefile filters use 'exists' to check whether a file referred to from a Makefile exists. Large Makefiles can cause a filter to make hundreds of these calls, making filter processing take seconds. Examples:
This patch makes Makefile processing much faster for large files, although it introduces a ~200-300 ms penalty on the first 'exists' call. Subsequent calls are basically free (<0.01 ms).
Note that the cache is never explicitly freed; currently a new Query object is created for each request, so this shouldn't be an issue. In the future it would be a good idea to share Query objects between requests that arrive at the same WSGI process. That will require some changes related to multithreading, and is hence out of scope here. update.py does not seem to use Query at all, so it shouldn't be affected by this change.