
Add cache for Query.query('exists') #317

Open · wants to merge 1 commit into base: master
Conversation

@fstachura (Collaborator) commented Aug 16, 2024:

This PR adds a simple cache for the 'exists' query. Currently, 'exists' calls git ls-tree and parses the results to check whether a file exists. A single call takes around 20-30 ms. Some Makefile filters use 'exists' to check whether a file referred to from a Makefile exists. Large Makefiles can cause a filter to make hundreds of these calls, making filter processing take seconds.


This patch makes Makefile processing much faster for large files, although it introduces a ~200-300 ms penalty on the first 'exists' call. Subsequent calls are essentially free (<0.01 ms).

Note that the cache is never explicitly freed. Currently a new Query object is created for each request, so this shouldn't be an issue. In the future it would be a good idea to share Query objects between requests (that arrive at the same WSGI process), but this will require some changes related to multithreading and is therefore out of scope here. update.py does not seem to use Query at all, so it shouldn't be affected by this change.
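As a rough illustration of the approach (a hypothetical sketch, not the actual patch code), a per-Query path-set cache might look like this; `list_paths` is a stand-in for iterating the versions database:

```python
import os

class ExistsCache:
    """Sketch of the per-Query cache described above: build the full
    path set for a version once, then answer exists() lookups from the
    in-memory set."""

    def __init__(self, list_paths):
        # list_paths(version) -> iterable of file paths, sorted by
        # directory; stands in for the versions-database iteration
        self.list_paths = list_paths
        self.file_cache = {}  # version -> set of paths and dirnames

    def exists(self, version, path):
        if version not in self.file_cache:
            version_cache = set()
            last_dir = None
            for p in self.list_paths(version):
                dirname, _ = os.path.split(p)
                if dirname != last_dir:  # paths arrive grouped by dir,
                    last_dir = dirname   # so this dedupes consecutive
                    version_cache.add(dirname)  # dirnames cheaply
                version_cache.add(p)
            self.file_cache[version] = version_cache
        return path.strip('/') in self.file_cache[version]
```

The first call for a version pays the full build cost; every later call is a set lookup.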

@fstachura marked this pull request as ready for review August 19, 2024 12:21
```python
if dirname != last_dir:
    last_dir = dirname
    version_cache.add(dirname)
version_cache.add(path)
```
Member commented:

Here you change from calling ./script get-dir to using the versions database to get the information. The commit message is lying and hiding important information. Please make it two commits. Both should expand on why each change is being done (why switch approach + why use a cache, with numbers). Thanks!

fstachura (Collaborator, Author) replied:

The commit message may be misleading, but I don't see how to split this into two commits.

Also note that the 'exist' mode was added together with a Makefile filter (ac491ab) and is not actually used outside of these filters: https://github.com/search?q=repo%3Abootlin%2Felixir+query%28%27exist%27&type=code

Part of this "cache" is created during update

elixir/update.py (Lines 169 to 176 in 0de9a05):

```python
blobs = scriptLines('list-blobs', '-p', tag)
buf = []
for blob in blobs:
    hash, path = blob.split(b' ', maxsplit=1)
    with blobs_lock:
        idx = db.blob.get(hash)
    buf.append((idx, path))
```

elixir/script.sh (Lines 122 to 135 in 0de9a05):

```sh
if [ "$opt1" = '-p' ]; then
    # "path" option: return blob hash and full path
    format='\1 \2'
elif [ "$opt1" = '-f' ]; then
    # "file" option: return blob hash and file name (without its path)
    format='\1 \4'
else
    # default option: return only blob hash
    format='\1'
    v=`echo $opt1 | version_rev`
fi
git ls-tree -r "$v" |
    sed -r "s/^\S* blob (\S*)\t(([^/]*\/)*(.*))$/$format/; /^\S* commit .*$/d"
```

Note the -r in git ls-tree.

It's also used in the definitions getter, so if it's incorrect, we have much more important issues to fix than the Makefile filters.

elixir/query.py (Lines 316 to 328 in 0de9a05):

```python
def get_idents_defs(self, version, ident, family):
    symbol_definitions = []
    symbol_references = []
    symbol_doccomments = []
    if not self.db.defs.exists(ident):
        return symbol_definitions, symbol_references, symbol_doccomments
    if not self.db.vers.exists(version):
        return symbol_definitions, symbol_references, symbol_doccomments
    files_this_version = self.db.vers.get(version).iter()
```

Member replied:

Sorry, I don't get:

> The commit message may be misleading, but I don't see how to split this into two commits.

My understanding was that you have two ways to do the same test:

```python
# Old approach (rewritten a bit)
dirname, filename = os.path.split(path)
entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1]
entries = [entry.split(" ")[1] for entry in entries]
return filename in entries

# New approach
version_cache = set()
last_dir = None
for _, path in self.db.vers.get(version).iter():
    dirname, filename = os.path.split(path)
    if dirname != last_dir:
        last_dir = dirname
        version_cache.add(dirname)
    version_cache.add(path)
return path.strip('/') in version_cache
```

Are those two not equivalent? If they are, then this could be done in commit 1, and commit 2 would introduce the caching mechanism.

fstachura (Collaborator, Author) replied:

I did some benchmarks for all Makefiles in Linux v6.9.4

Total request times with no cache:

```
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_no_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.74161,
  "max": 4178.306757,
  "sum": 563964.5041110016,
  "mean": 187.1770674115505
}
```

Total request times with cache:

```
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.613575,
  "max": 1283.572011,
  "sum": 716206.7588370013,
  "mean": 237.15455590629182
}
```

I did this twice, and the results were similar. I don't have any HTTP caches set up for that server. Max processing time went down, but mean and total processing time for all Makefiles went up. I will try implementing some heuristics to pick between the cache and the old implementation depending on the number of files to search. Each 'exists' operation took ~20-30 ms and setting up the cache takes ~200-300 ms, so (for Linux v6.9.4, on my server) the cache becomes worth it if more than ~12 checks will be made.
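Such a heuristic could be sketched as follows. This is hypothetical code, not part of the patch; the constants are the rough timings quoted in this thread (~20-30 ms per ls-tree call, ~200-300 ms to build the cache), not measured defaults:

```python
# Break-even sketch: fall back to the per-file git ls-tree check for
# small Makefiles, and only build the full path-set cache once enough
# lookups are expected to amortize its one-time build cost.

CACHE_BUILD_MS = 250    # assumed one-time cost of building the path set
LS_TREE_CALL_MS = 25    # assumed cost of one git ls-tree 'exists' check

def should_use_cache(expected_checks):
    # The cache pays off once N lookups at per-call cost exceed the
    # one-time build cost (break-even around 12 checks with the
    # numbers above).
    return expected_checks * LS_TREE_CALL_MS > CACHE_BUILD_MS
```

The hard part in practice is estimating `expected_checks` up front, e.g. from the Makefile's size or the number of path-like tokens found while filtering.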


```python
if fname == filename:
    return True
self.file_cache[version] = version_cache
```
Member commented:

So the cache is never cleared. As you mentioned in the PR description (but not commit message), this cache is local to the query. What are the memory usage considerations?

fstachura (Collaborator, Author) replied:

```python
>>> q = query.Query('/srv/elixir-data/linux/data', '/srv/elixir-data/linux/repo')
>>> version = 'v6.9.4'
>>> version_cache = set()
>>> last_dir = None
>>> for _, path in q.db.vers.get(version).iter():
...     dirname, filename = os.path.split(path)
...     if dirname != last_dir:
...         last_dir = dirname
...         version_cache.add(dirname)
...     version_cache.add(path)
...
>>> # (recursive sizeof recipe from https://docs.python.org/3/library/sys.html#sys.getsizeof)
>>> total_size(version_cache)
11937861
```

Around 12 MB for Linux v6.9.4. And then it's collected. It would be great to cache this for longer, but that's not going to be possible without some locking. Another thing to try in the future would be to switch to a Python git implementation, for example dulwich. It performs ls-tree faster, although I still see it take around 5 ms, while this takes orders of magnitude less once the cache is created.

Member replied:

Another issue with keeping this cache across requests is that we would need to detect reindexing to retrigger the generation. This is complex and error-prone.

Member commented:

Please add a mention of this memory usage somewhere in the commit message so that it doesn't disappear into the void. Then this can be marked as resolved.

@tleb (Member) commented Oct 29, 2024:

Tested on staging, can see the perf improvement when loading the root Makefile. Sorry for the pending comments that I didn't send at the time. Will merge once memory usage is mentioned in the commit message.
