Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cache for Query.query('exists') #317

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 12 additions & 11 deletions query.py
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ def __init__(self, data_dir, repo_dir):
self.data_dir = data_dir
self.dts_comp_support = int(self.script('dts-comp'))
self.db = data.DB(data_dir, readonly=True, dtscomp=self.dts_comp_support)
self.file_cache = {}

def script(self, *args):
return script(*args, env=self.getEnv())
Expand Down Expand Up @@ -124,22 +125,22 @@ def query(self, cmd, *args):
return decode(self.script('get-type', version, path)).strip()

elif cmd == 'exist':

# Returns True if the requested file exists, otherwise returns False

version = args[0]
path = args[1]

dirname, filename = os.path.split(path)

entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1]
for entry in entries:
fname = entry.split(" ")[1]
if version not in self.file_cache:
version_cache = set()
last_dir = None
for _, path in self.db.vers.get(version).iter():
dirname, filename = os.path.split(path)
if dirname != last_dir:
last_dir = dirname
version_cache.add(dirname)
version_cache.add(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you change from calling ./script get-dir to using the versions database to get the information. The commit message is lying and hiding important information. Please make it two commits. Both should expand on why each change is being done (why switch approach + why use a cache, with numbers). Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message maybe is misleading, but I don't see how to split this into two commits.

Also note that 'exist' mode was added with a Makefile filter ac491ab
And is not actually used outside of these filters https://github.com/search?q=repo%3Abootlin%2Felixir+query%28%27exist%27&type=code

Part of this "cache" is created during update

elixir/update.py

Lines 169 to 176 in 0de9a05

blobs = scriptLines('list-blobs', '-p', tag)
buf = []
for blob in blobs:
hash, path = blob.split(b' ', maxsplit=1)
with blobs_lock:
idx = db.blob.get(hash)
buf.append((idx, path))

elixir/script.sh

Lines 122 to 135 in 0de9a05

if [ "$opt1" = '-p' ]; then
# "path" option: return blob hash and full path
format='\1 \2'
elif [ "$opt1" = '-f' ]; then
# "file" option: return blob hash and file name (without its path)
format='\1 \4'
else
# default option: return only blob hash
format='\1'
v=`echo $opt1 | version_rev`
fi
git ls-tree -r "$v" |
sed -r "s/^\S* blob (\S*)\t(([^/]*\/)*(.*))$/$format/; /^\S* commit .*$/d"
- note -r in git ls-tree

It's also used in definitions getter, so if it's incorrect, we have much more important issues to fix than the Makefile filters.

elixir/query.py

Lines 316 to 328 in 0de9a05

def get_idents_defs(self, version, ident, family):
symbol_definitions = []
symbol_references = []
symbol_doccomments = []
if not self.db.defs.exists(ident):
return symbol_definitions, symbol_references, symbol_doccomments
if not self.db.vers.exists(version):
return symbol_definitions, symbol_references, symbol_doccomments
files_this_version = self.db.vers.get(version).iter()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get:

The commit message maybe is misleading, but I don't see how to split this into two commits.

My understanding was that you have two ways to do the same test:

# Old approach (rewritten a bit)
dirname, filename = os.path.split(path)
entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1]
entries = [entry.split(" ")[1] for entry in entries]
return filename in entries

# New approach
version_cache = set()
last_dir = None
for _, path in self.db.vers.get(version).iter():
   dirname, filename = os.path.split(path)
   if dirname != last_dir:
      last_dir = dirname
      version_cache.add(dirname)
   version_cache.add(path)
return path.strip('/') in version_cache

Are those two not equivalent? If they are then this could be done in commit 1, then commit 2 would introduce the caching mechanism.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some benchmarks for all Makefiles in Linux v6.9.4

Total request times with no cache:

root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_no_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.74161,
  "max": 4178.306757,
  "sum": 563964.5041110016,
  "mean": 187.1770674115505
}

Total request times with cache:

root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.613575,
  "max": 1283.572011,
  "sum": 716206.7588370013,
  "mean": 237.15455590629182
}

I did this twice, and results were similar. I don't have any HTTP caches set up for that server. Max processing time went down, but mean and total processing time for all Makefiles went up. I will try implementing some heuristics to pick between the cache and the old implementation depending on number of files to search. Each 'exists' operation took ~20-30 ms, setting up the cache takes ~200-300ms, so (for Linux v6.9.4, on my server) the cache becomes worth it if more than ~12 checks will be made.


if fname == filename:
return True
self.file_cache[version] = version_cache
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the cache is never cleared. As you mentioned in the PR description (but not commit message), this cache is local to the query. What are the memory usage considerations?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> q = query.Query('/srv/elixir-data/linux/data', '/srv/elixir-data/linux/repo')
>>> version = 'v6.9.4'
>>> version_cache = set()
>>> last_dir = None
>>> for _, path in q.db.vers.get(version).iter():
...     dirname, filename = os.path.split(path)
...     if dirname != last_dir:
...         last_dir = dirname
...         version_cache.add(dirname)
...     version_cache.add(path)
... 
# (recursive sizeof recipe from https://docs.python.org/3/library/sys.html#sys.getsizeof)
>>> total_size(version_cache)
11937861

Around 12MB for Linux v6.9.4. And then it's collected. It would be great to cache this for longer, but it's not going to be possible without some locking. Another thing to try in the future would be to switch to a Python git implementation, for example dulwich. It does ls-tree faster, although I still see it takes around 5ms, while this takes orders of magnitude less once cache is created.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another issue with keeping this cache across requests is that we need to detect reindexing to retrigger the generation. This is complex and error prone.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a mention of this memory usage somewhere in the commit message so that it doesn't disappear into the void. Then this can be marked as resolved.


return False
return path.strip('/') in self.file_cache[version]

elif cmd == 'dir':

Expand Down