
Add cache for Query.query('exists') #317

Open · wants to merge 1 commit into base: master
Conversation

@fstachura (Collaborator) commented Aug 16, 2024:

This PR adds a simple cache for the 'exists' query. Currently, 'exists' calls git ls-tree and parses the results to check whether a file exists. A single call takes around 20-30 ms. Some Makefile filters use 'exists' to check whether a file referred to from a Makefile exists. Large Makefiles can cause a filter to make hundreds of these calls, making filter processing take seconds.


This patch makes Makefile processing much faster for large files, although it introduces a ~200-300 ms penalty on the first 'exists' call. Subsequent calls are essentially free (<0.01 ms).

Note that the cache is never explicitly freed. Currently a new Query object is created for each request, so this shouldn't be an issue. In the future it would be a good idea to share Query objects between requests (that arrive at the same WSGI process), but this will require some changes related to multithreading and is therefore out of scope here. update.py does not seem to use Query at all, so it shouldn't be affected by this change.
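As a rough illustration of the approach (a hypothetical sketch, not the actual patch code), a per-Query path-set cache might look like this; `list_paths` is a stand-in for iterating the versions database:

```python
import os

class ExistsCache:
    """Sketch of the per-Query cache described above: build the full
    path set for a version once, then answer exists() lookups from the
    in-memory set."""

    def __init__(self, list_paths):
        # list_paths(version) -> iterable of file paths, sorted by
        # directory; stands in for the versions-database iteration
        self.list_paths = list_paths
        self.file_cache = {}  # version -> set of paths and dirnames

    def exists(self, version, path):
        if version not in self.file_cache:
            version_cache = set()
            last_dir = None
            for p in self.list_paths(version):
                dirname, _ = os.path.split(p)
                if dirname != last_dir:  # paths arrive grouped by dir,
                    last_dir = dirname   # so this dedupes consecutive
                    version_cache.add(dirname)  # dirnames cheaply
                version_cache.add(p)
            self.file_cache[version] = version_cache
        return path.strip('/') in self.file_cache[version]
```

The first call for a version pays the full build cost; every later call is a set lookup.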

@fstachura marked this pull request as ready for review August 19, 2024 12:21
```python
if dirname != last_dir:
    last_dir = dirname
    version_cache.add(dirname)
version_cache.add(path)
```
Member commented:

Here you change from calling ./script get-dir to using the versions database to get the information. The commit message is lying and hiding important information. Please make it two commits. Both should expand on why each change is being done (why switch approach + why use a cache, with numbers). Thanks!

fstachura (Collaborator, Author) replied:

The commit message may be misleading, but I don't see how to split this into two commits.

Also note that the 'exist' mode was added together with a Makefile filter (ac491ab) and is not actually used outside of these filters: https://github.com/search?q=repo%3Abootlin%2Felixir+query%28%27exist%27&type=code

Part of this "cache" is created during update

elixir/update.py (Lines 169 to 176 in 0de9a05):

```python
blobs = scriptLines('list-blobs', '-p', tag)
buf = []
for blob in blobs:
    hash, path = blob.split(b' ', maxsplit=1)
    with blobs_lock:
        idx = db.blob.get(hash)
    buf.append((idx, path))
```

elixir/script.sh (Lines 122 to 135 in 0de9a05):

```sh
if [ "$opt1" = '-p' ]; then
    # "path" option: return blob hash and full path
    format='\1 \2'
elif [ "$opt1" = '-f' ]; then
    # "file" option: return blob hash and file name (without its path)
    format='\1 \4'
else
    # default option: return only blob hash
    format='\1'
    v=`echo $opt1 | version_rev`
fi
git ls-tree -r "$v" |
    sed -r "s/^\S* blob (\S*)\t(([^/]*\/)*(.*))$/$format/; /^\S* commit .*$/d"
```

Note the -r in git ls-tree.

It's also used in the definitions getter, so if it's incorrect, we have much more important issues to fix than the Makefile filters.

elixir/query.py (Lines 316 to 328 in 0de9a05):

```python
def get_idents_defs(self, version, ident, family):
    symbol_definitions = []
    symbol_references = []
    symbol_doccomments = []
    if not self.db.defs.exists(ident):
        return symbol_definitions, symbol_references, symbol_doccomments
    if not self.db.vers.exists(version):
        return symbol_definitions, symbol_references, symbol_doccomments
    files_this_version = self.db.vers.get(version).iter()
```

Member replied:

Sorry, I don't get:

> The commit message may be misleading, but I don't see how to split this into two commits.

My understanding was that you have two ways to do the same test:

```python
# Old approach (rewritten a bit)
dirname, filename = os.path.split(path)
entries = decode(self.script('get-dir', version, dirname)).split("\n")[:-1]
entries = [entry.split(" ")[1] for entry in entries]
return filename in entries

# New approach
version_cache = set()
last_dir = None
for _, path in self.db.vers.get(version).iter():
    dirname, filename = os.path.split(path)
    if dirname != last_dir:
        last_dir = dirname
        version_cache.add(dirname)
    version_cache.add(path)
return path.strip('/') in version_cache
```

Are those two not equivalent? If they are, then this could be done in commit 1, and commit 2 would introduce the caching mechanism.

fstachura (Collaborator, Author) replied:

I did some benchmarks for all Makefiles in Linux v6.9.4

Total request times with no cache:

```
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_no_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.74161,
  "max": 4178.306757,
  "sum": 563964.5041110016,
  "mean": 187.1770674115505
}
```

Total request times with cache:

```
root@5e4ed6698975:/usr/local/elixir# cat /tmp/simple_profiler_logs_makefile_test_cache_complete_take_2 | jq 'select(.tags[0] | contains("Makefile")).total' | jq -s '{min:min,max:max,sum:add,mean:(add/length)}'
{
  "min": 91.613575,
  "max": 1283.572011,
  "sum": 716206.7588370013,
  "mean": 237.15455590629182
}
```

I did this twice, and the results were similar. I don't have any HTTP caches set up for that server. Max processing time went down, but mean and total processing time for all Makefiles went up. I will try implementing some heuristics to pick between the cache and the old implementation depending on the number of files to search. Each 'exists' operation took ~20-30 ms and setting up the cache takes ~200-300 ms, so (for Linux v6.9.4, on my server) the cache becomes worth it if more than ~12 checks will be made.
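Such a heuristic could be sketched as follows. This is hypothetical code, not part of the patch; the constants are the rough timings quoted in this thread (~20-30 ms per ls-tree call, ~200-300 ms to build the cache), not measured defaults:

```python
# Break-even sketch: fall back to the per-file git ls-tree check for
# small Makefiles, and only build the full path-set cache once enough
# lookups are expected to amortize its one-time build cost.

CACHE_BUILD_MS = 250    # assumed one-time cost of building the path set
LS_TREE_CALL_MS = 25    # assumed cost of one git ls-tree 'exists' check

def should_use_cache(expected_checks):
    # The cache pays off once N lookups at per-call cost exceed the
    # one-time build cost (break-even around 12 checks with the
    # numbers above).
    return expected_checks * LS_TREE_CALL_MS > CACHE_BUILD_MS
```

The hard part in practice is estimating `expected_checks` up front, e.g. from the Makefile's size or the number of path-like tokens found while filtering.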


```python
if fname == filename:
    return True
self.file_cache[version] = version_cache
```
Member commented:

So the cache is never cleared. As you mentioned in the PR description (but not commit message), this cache is local to the query. What are the memory usage considerations?

fstachura (Collaborator, Author) replied:

```python
>>> q = query.Query('/srv/elixir-data/linux/data', '/srv/elixir-data/linux/repo')
>>> version = 'v6.9.4'
>>> version_cache = set()
>>> last_dir = None
>>> for _, path in q.db.vers.get(version).iter():
...     dirname, filename = os.path.split(path)
...     if dirname != last_dir:
...         last_dir = dirname
...         version_cache.add(dirname)
...     version_cache.add(path)
...
>>> # (recursive sizeof recipe from https://docs.python.org/3/library/sys.html#sys.getsizeof)
>>> total_size(version_cache)
11937861
```

Around 12 MB for Linux v6.9.4. And then it's collected. It would be great to cache this for longer, but that's not going to be possible without some locking. Another thing to try in the future would be to switch to a Python git implementation, for example dulwich. It performs ls-tree faster, although I still see it take around 5 ms, while this takes orders of magnitude less once the cache is created.

Member replied:

Another issue with keeping this cache across requests is that we would need to detect reindexing to retrigger the generation. This is complex and error-prone.

Member commented:

Please add a mention of this memory usage somewhere in the commit message so that it doesn't disappear into the void. Then this can be marked as resolved.

@tleb (Member) commented Oct 29, 2024:

Tested on staging, can see the perf improvement when loading the root Makefile. Sorry for the pending comments that I didn't send at the time. Will merge once memory usage is mentioned in the commit message.
