File Search - Improve performance of file search #15384

chrmarti · 2016-11-11T21:55:28Z

Part of #55.

We will focus on simple text search. Regex text searches might be more difficult to optimize and are used less often than simple text searches. We will initially explore our options (e.g., refining our own implementation, command line tools, text search libraries/engines).

We will include full-text search libraries/engines in our investigation and see if any of these can do simple text searches while at the same time open the door for also supporting a more natural 'fuzzy' text search at a later time.

The goal is raw search performance with library size (if any) and disk usage of an index (if any) being potential trade-offs.

roblourens · 2016-11-15T23:20:09Z

Subjectively, file search in VS Code is slower than in other editors, and this has been noticed by users. Here are a couple example issues:

Slow search
#14157
#14012
#8680
#63

Slow, cancelling is not discoverable, progress is not helpful
#9853

The first step is

Admit we have a problem

To nail down the issue, I'm doing a round of performance tests in controlled conditions to better understand the perf characteristics of our current file search, and find the areas that will benefit the most from optimization. I'm doing the same search using other tools for comparison.

Goals

VS Code file search should not be, or feel, significantly slower than other editors. Being faster than some is probably achievable. Results should appear quickly and the UI should remain responsive throughout the duration of the search.

We also have the goal of investigating whether tools that might provide a very fast indexed search, and a more "fuzzy" search that goes beyond literal text, and evaluating the costs associated with it.

Setup

I'm running these tests in two workspaces, workspace A and workspace B. I'm trying to run a search in each that produces about the same number of results, where workspace A is a very large workspace with roughly evenly-distributed results, and workspace B is a very small workspace with densely packed results.

Workspace A is a clone of the chromium repo, by the instructions here. This is workspace of 234,655 files, about 1.8 GB. In this workspace I'm doing a case-insensitive search for the phrase static_library(, producing 447 results in 282 files, which are well-distributed across the repo. This phrase shows up in BUILD.gn files which are all over the repo.

Workspace B is a folder containing 4 files, which together have just the phrase static_library("foo") 447 times.

The regex search uses the regex /static_...rary\(/.

Timing is done using a stopwatch from pressing 'enter' on the query, to the results being fully loaded into the UI and ready for interaction. Except for vscode, into which I injected timing instrumentation, and grep, using the 'time' command. This is pretty adhoc testing on my MBP and Windows desktop with SSDs.

The Grep search is using this command: time find . -type f -print0 | xargs -0 -P 4 fgrep -i -I "static_library(" | wc -l.

Results

MacOS
- Mid-2015 MacBook Pro
- 2.2 GHz Intel Core i7
- 16 GB 1600 MHz DDR3
- APPLE SSD SM0512G

Tool	Workspace A	Workspace A regex	Workspace B	Notes
Webstorm	3.5s	8s	instant	Indexed search*
grep	18s	19s	.02s	Obviously doesn't need to build a UI, etc etc
git grep	10s	9.3	n/a	Only searches files that git knows about. Multithreaded where normal grep is not.
Sublime	28s	28s	instant	Streams results into a document in lexicographical order
Atom	2m50s	2m47s	instant	Shows results during the search, but scrolls back to the top every time new results are added
VS Code	1m49s	1m49s	.25s	No results appear until the end

Webstorm indexes the workspace on launch. If you search before the index is finished building (takes a minute or so for workspace A) it takes about 50s to search. Also has a search "Context" option to search just in string literals, comments, etc. I want that.

Windows
- Intel Xeon E3-1240 v3 @3.40 GHz
- 16 GB RAM
- Crucial 500GB SSD

Tool	Workspace A	Workspace A regex	Workspace B	Notes
Webstorm	2.5s	2.8s	instant	Indexed search, see above notes.
findstr	18s	29s	instant
Sublime	29s	29s	instant
VS	55s	3m45s	1s	No progress bar. Apparently doesn't exclude binary files like other tools, as it spent awhile searching a .mp4 file.
VS Code	2m18s	2m18s	instant

Analysis

VS Code time breakdown

What happened when VS Code spent 1m49s on a search:
- 0s - Search started
- .2s - First candidate file found in the search process
- .5s - First match found in the search process
- 108.289s - Batch of 282 FileMatch objects received in the frontend (447 matches total)
- 108.324s - Final SearchResult.add call, updating TreeView UI begins
- 109.03s - TreeView UI finished updating

The time from starting the search to finding a match is very short, but then we sit on that match for 100+ seconds, because we don't send results to the frontend until we've filled a batch of 512. Since there are fewer than a full batch of results, nothing is visible until the search is over. So we could increase the perceived performance quite a lot by adding a timeout to send intermediate results over if the batch doesn't fill up.

I changed the batch size to 20 and got results immediately, which is much nicer. But scrolling around through the list, it jumped around a lot as results were inserted into the middle. Results are returned as files are finished, which may or may not be in order. We could only show results in order (Sublime does this), or show results as they come in. This is also a problem with a batch size of 512, but there are only 4 batches in our max results limit of 2048, so less of a problem.

On the end of making search faster, we should consider using the find command (we do this for cmd+p, but not text search) to get a list of files, and grep to search them. We can use findstr on Windows. I'm trying to understand how much time the filesystem walking is responsible for vs the search, but it's difficult to say exactly since the two flows are interleaved. I also want to profile the search process more to understand where it's spending time.

We can also try to run our search in more than one process. I saw a near 4x speedup when running 4 processes of grep instead of 3. Depending on disk speed, number of cores, and amount of time we spend just sending information between processes.

Fuzzy search

We also wanted to investigate options for enabling search-engine-style indexed fuzzy search in a workspace. 'Indexed' meaning an index is built for the files in the workspace, on disk or in memory, and a search in a large workspace can complete in seconds or milliseconds. 'Fuzzy' meaning a search based on not literal text or regex, but on word stemming, word proximity, etc. Here's an example use case, based on a true story:

I'm exploring the vscode code base, trying to find the code responsible for file search. I cmd+shift+F and search search files. No results. file search. No results. filesearch? That's it, there are 50 results with that literal. But it would be useful to find a comment that says // This code searches the text in files.

There are a bunch of options for this, but any third-party tool here will increase our download size, so we plan to consider exposing search as an extension API, and implementing an optional extension that would provide this advanced search behavior for people who want it, while making sure that our built-in search is still usable and fast.

roblourens · 2016-11-23T19:09:46Z

These results come from running the TextSearch.performance test a couple times and averaging (where each run averages 3 runs of the same search). It executes the search and returns results to the frontend process, but doesn't do UI things with them.

	MacOS	Win
With Buffer.slice (original code)	100s	125.5s
Replacing Buffer.slice with buffer.toString(null, start, end)	71.1s	97.05s
With 4 worker processes (for a total of 5 (very very rough))	26.4	19.0
With code in PR	17.6	16.3

…tart, end). This is much faster although it's not totally obvious why. But it's run for every line of every file in the workspace so it's very sensitive. See #15384.

chrmarti added plan-item VS Code - planned item for upcoming search Search widget and operation issues labels Nov 11, 2016

chrmarti added this to the November 2016 milestone Nov 11, 2016

chrmarti assigned roblourens and chrmarti Nov 11, 2016

kieferrm mentioned this issue Nov 11, 2016

November/December Iteration Plan #15099

Closed

datanotion mentioned this issue Nov 22, 2016

Searching for text (CTRL+Shift+F) is too slow #14157

Closed

This was referenced Nov 30, 2016

Improve batching between search process and frontend #16284

Closed

Parallelize text search #16286

Merged

roblourens closed this as completed in #16286 Dec 1, 2016

roblourens mentioned this issue Dec 3, 2016

Test: Text search #16416

Closed

3 tasks

chrmarti mentioned this issue Dec 5, 2016

Improve search performance. #16448

Closed

ramya-rao-a added the on-testplan label Dec 5, 2016

roblourens mentioned this issue Dec 14, 2016

Slow Search #14012

Closed

egamma mentioned this issue Dec 20, 2016

January Iteration Plan #17608

Closed

56 tasks

roblourens mentioned this issue Feb 26, 2017

Explore using a native third-party search tool such as ripgrep or Silver Searcher #19983

Closed

vscodebot bot locked and limited conversation to collaborators Nov 18, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File Search - Improve performance of file search #15384

File Search - Improve performance of file search #15384

chrmarti commented Nov 11, 2016 •

edited

Loading

roblourens commented Nov 15, 2016 •

edited

Loading

roblourens commented Nov 23, 2016 •

edited

Loading

File Search - Improve performance of file search #15384

File Search - Improve performance of file search #15384

Comments

chrmarti commented Nov 11, 2016 • edited Loading

roblourens commented Nov 15, 2016 • edited Loading

Goals

Setup

Results

Analysis

VS Code time breakdown

Fuzzy search

roblourens commented Nov 23, 2016 • edited Loading

chrmarti commented Nov 11, 2016 •

edited

Loading

roblourens commented Nov 15, 2016 •

edited

Loading

roblourens commented Nov 23, 2016 •

edited

Loading