speedup scanning by 3.7x using gitoxide #2
Conversation
That way the library can be built on non-x64 platforms.
Hi @Byron. This is terrific; thank you for submitting this PR! I had seen `gitoxide` before. Let me kick the tires on this a bit. I have a corpus of over 7000 Git repos that I'd like to run these changes on.
Indeed, the current version of Nosey Parker requires an x86_64 CPU to easily build and run. It is not impossible to run Hyperscan on ARM, however. There is a fork that adds support for additional platforms: https://github.com/vectorcamp/vectorscan. I have some local changes to use that fork instead of Hyperscan, and it seems to work on a MacBook Pro with an M1 Max. I'm hoping to get these changes merged into `main` before long. I have not tried this PR on that setup yet. The experiment you describe in the PR also sounds worth exploring.
Awesome, I am very much looking forward to seeing the verdict of your corpus run! Vectorscan sounds like the way to go on M1, and I am really looking forward to running it properly for the first time on my own machine, with an actual scan. Then it should become clear whether it's worth getting this sped up to 12 GB/s.
If you are impatient and ambitious, you could build with vectorscan yourself, as I sketched in #5. Big idea: build vectorscan from source, set the appropriate environment variable so the build can find it, and go from there.
@Byron I have run some tests with the modified blob-reading code from this PR that uses `gitoxide`. I ran with actual scanning via Hyperscan still enabled. I see a 2-2.5x overall speedup from your changes on a few repos I tried. For example, when running over 100GB of Linux history on my 10-core M1 Max: 2.51GiB/s scan rate with your changes vs 1.18GiB/s without. I am still in the process of validating this with my corpus of 7000+ Git repos. I suspect it will run 2-3x faster in total with your changes. The speed is great; I just want to battle-test it some more. A question for you from your original description:
What do you mean here? Can you expand on this? Are there particular gitoxide crates or APIs I should take a look at?
@Byron can you also please elaborate a little more on what you have in mind here? Are there crates or APIs I should be looking at?
That's fantastic news! I'd be keen to hear the final numbers, to learn how long it takes to go through all these 7k repositories and maybe how much data was churned. Again, depending on the pack compression and with the right machine, 35GB/s is achievable. To do that, you'd have to use a special cache that accelerates decoding of packs; there is such a cache built into `gitoxide`. Lastly, the enumeration step is something I'd make optional, as instead you can rely on, and visualize, the progress information that `gitoxide` provides. I have also opened a tracking issue with features I think you will need to fully replace `git2`. I hope that helps.
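For concreteness, a minimal, hedged sketch of what such a cache looks like when used through the `gix` crate; the method names (`gix::open`, `object_cache_size`, `find_object`) are written from memory and may differ between gitoxide versions, the cache size is an arbitrary placeholder, and `anyhow` is assumed for error handling:

```rust
// Hedged sketch: give a repository handle a memory-capped object cache so that
// delta bases needed repeatedly during pack decoding are not decoded over and over.
use gix::ObjectId;

fn read_blob(repo_path: &str, id: ObjectId) -> anyhow::Result<Vec<u8>> {
    let mut repo = gix::open(repo_path)?;
    // Arbitrary 64 MiB cache; the right size depends on the packs and the workload.
    repo.object_cache_size(64 * 1024 * 1024);
    let blob = repo.find_object(id)?; // decodes the (possibly packed) object
    Ok(blob.data.clone())
}
```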
@Byron some data for you!

Benchmarks from scanning 7.5k Git repositories

Test System
Note that these numbers are from a different computer than my earlier comments in this PR. This system is a 2019 MacBook Pro with a 2.6 GHz 6-core Intel Core i7, 64 GB of 2667 MHz DDR4, and a 1TB NVMe disk.

Input Data
The inputs are 7499 fully-packed, bare Git repositories that take up 182GiB on disk. Within those repositories are 45M blobs in total, comprising 1.78TiB. Of those, there are 27.5M distinct blobs comprising 1.3TiB. (The difference between "total" and "distinct" is because some of the repos are forks of others, and many repos have vendored source code from certain dependencies.) Nosey Parker detects the duplicates based on blob id and only scans each distinct blob a single time. So the figure to keep in mind: 27.5M distinct blobs comprising 1.3TiB.

Benchmark Configurations
Both variants were built in release mode.

Results

Summary of findings
These are the same in both baseline and candidate.

Baseline performance
@Byron some more benchmark data!

Benchmarks from scanning 100GiB of Linux Kernel history

Benchmark setup
This uses the same system and the same baseline and candidate versions as the 7.5k repository benchmarks.

Results

Baseline
Thanks so much for sharing this elaborate and exhaustive performance report :)! I would have hoped for more of a speedup in the first case with the 7.5k repos, but I can see why workloads dominated by the scanning itself might not benefit as much. Then again, I'd think that the performance for scanning a file is mostly determined by the size of the blob, but @BurntSushi would definitely know better as the author of the `regex` crate.

I want to get back to this PR once it's clear how you'd like to proceed. What are your plans with it? It seems to have diverged from `main` by now.

Thanks and cheers.
Nosey Parker indeed uses Hyperscan rather than the `regex` crate for its rule matching. The performance of the scanning phase mostly comes down to how much distinct blob data has to be scanned.

Thanks for asking. Yes, this PR has diverged from `main`. Let me know when it has been brought up to date and I will take another look.
Thanks for the explainer on how the scanning works.
Yes, that sounds great, thank you!
I feel like having a little bit of extra fun this weekend and will try to get something along those lines working.
@Byron: all sounds good! Looking forward to it! FYI I did a bit more performance investigation. In Nosey Parker, on the 7.5k repositories input, I was mistaken about where time was being spent: even with the second stage (the actual rule matching) disabled, the runtime barely changes. My hypothesis: the way that Nosey Parker currently scans blobs is rather suboptimal when dealing with big packfiles. Nosey Parker uses Rayon to parallelize over Git repositories and also over the blobs within each. But the way it splits up the work for scanning blobs ends up with multiple threads doing redundant work to extract the blobs they are scanning. I spot-checked a handful of repositories and saw the same pattern in each. There is probably another 2-3x speedup possible if Nosey Parker instead extracted each blob's contents exactly once.
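One way to picture the fix hypothesized here is sketched below: share a thread-safe repository, give each Rayon worker its own thread-local handle plus an object cache, and decode each listed blob exactly once. This is an illustrative sketch only, not Nosey Parker's actual code; the `gix` method names (`into_sync`, `to_thread_local`, `object_cache_size`, `find_object`) are from memory and the cache size is arbitrary.

```rust
// Illustrative sketch: parallel blob extraction with one gix handle (and cache) per thread.
use gix::ObjectId;
use rayon::prelude::*;

fn total_blob_bytes(repo_path: &std::path::Path, blob_ids: &[ObjectId]) -> anyhow::Result<usize> {
    let shared = gix::open(repo_path)?.into_sync(); // thread-safe handle to share

    blob_ids
        .par_iter()
        .map_init(
            || {
                // One thread-local handle and one object cache per Rayon worker.
                let mut local = shared.to_thread_local();
                local.object_cache_size(32 * 1024 * 1024);
                local
            },
            |repo, id| -> anyhow::Result<usize> {
                let blob = repo.find_object(*id)?; // decode the blob once
                Ok(blob.data.len()) // stand-in for handing the bytes to the scanner
            },
        )
        .try_reduce(|| 0, |a, b| Ok(a + b))
}
```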
This doesn't speed up the enumeration phase (which might not be necessary) but shows how much faster gitoxide can be if used like that.
So everything is great, right ;)? Well, some oddity happened :/. When enumerating objects (which may include duplicates) with `gitoxide`, the counts didn't look like what I expected. Unfortunately, the oddities don't stop there: the performance for decoding objects drops considerably as well. Now that I read the paragraph above, that issue doesn't seem all that odd anymore, and I just pushed an update that accounts for it. I can't wait :D!
I think this acquisition of the current time could also be something that costs much more than it's worth. Something that's far more scalable is to use atomic counters and a separate thread for display. It can be as simple as this example (in principle).
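As a minimal sketch of that idea (not the exact example referenced above), assuming plain `std` atomics and threads:

```rust
// Minimal sketch: worker threads increment shared atomic counters; one display
// thread periodically reads them and prints progress, so no per-item time lookups.
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    const TOTAL: u64 = 1_000_000;
    let scanned = Arc::new(AtomicU64::new(0));

    // Display thread: the only place that sleeps, looks at the clock, or prints.
    let progress = Arc::clone(&scanned);
    let display = thread::spawn(move || loop {
        let n = progress.load(Ordering::Relaxed);
        eprintln!("scanned {n} blobs");
        if n >= TOTAL {
            break;
        }
        thread::sleep(Duration::from_millis(250));
    });

    // Worker threads: just do the work and bump the counter.
    let workers: Vec<_> = (0..4)
        .map(|_| {
            let scanned = Arc::clone(&scanned);
            thread::spawn(move || {
                for _ in 0..(TOTAL / 4) {
                    scanned.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    display.join().unwrap();
}
```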
Here is an update on what I think might be the end of my wonderful quest for maximal performance. The latest changes are pushed. Additionally, I opened up the acquisition of git repositories.

The biggest issue I still see is the discovery of git repositories, which leads to repositories being opened and enumerated multiple times. Maybe enumeration can be changed to something like what's used in `ein tool find`, which can enumerate deep trees quite fast as well. The dev folder is rather large.

Please note that the git-dir enumeration above has a current shortcoming: not only does it not follow symlinks, it also won't recognize a git directory that is right behind a symlink.

And the biggest potential is certainly in using pack traversal directly, and maybe in making the enumeration step optional at the expense of not knowing the upper bound for the progress bar.

The best of success with the further development of this project! Please let me know if anything else is missing as you finish the migration. Cheers
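To make the double-open problem concrete, here is a hedged sketch (not gitoxide's or Nosey Parker's actual discovery code) that de-duplicates discovered repositories by their git directory; it assumes the `gix::discover` entry point and the `git_dir()` accessor, both written from memory:

```rust
// Hedged sketch: resolve each candidate path to its enclosing repository and
// keep every git directory only once, so a repository is opened a single time.
use std::collections::HashSet;
use std::path::PathBuf;

fn unique_repositories(candidates: &[PathBuf]) -> Vec<gix::Repository> {
    let mut seen: HashSet<PathBuf> = HashSet::new();
    let mut repos = Vec::new();
    for path in candidates {
        // `gix::discover` walks upward from `path` until it finds a repository.
        let Ok(repo) = gix::discover(path) else { continue };
        // Use the resolved git directory as the repository's identity.
        if seen.insert(repo.git_dir().to_path_buf()) {
            repos.push(repo);
        }
    }
    repos
}
```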
It's just single-threaded, but appears to be faster nonetheless.
Thanks @Byron; tremendous! I will try to get this merged back before the holidays. Thank you!
This is probably the right explanation!
Good catch! I had added that hacked-up caching layer on top of `git2` to avoid repeated work. I had hoped that calling into it repeatedly would be cheap enough. I had been considering switching to something like a proper LRU cache instead.
Thinking about it, maybe the situation changed and by now it's not an issue anymore to make repeated calls there. By the looks of it, the extra caching layer may not be needed at all anymore.
Co-authored by Sebastian Thiel <[email protected]>.
Oh, I see now what happens when using it that way.
Anyway, I was able to work around that in a revised and expanded version of this PR in #20.
That's true, but this case is all I need to implement a similar option, which now makes perfect sense. I will let you know in the follow-up PR when it's ready.
* Tweak `summarize` column names for clarity
* Use gitoxide instead of git2, adapted from PR #2
* Use tracing-log to emit `log` records as `tracing` events
* Open repos just once; strip out git2

Depending on your input repositories and how many cores you have, this change can give more than a 3x speedup in overall Nosey Parker runtime.

Co-authored by Sebastian Thiel <[email protected]>.
This implementation is the minimal set of changes required to use `gitoxide` instead of `git2` to extract blob data.
Please note that the first commit exists so the project can build on ARM; it effectively disables the actual scanning.
With some work, I think the rules database could be made to run with `RegexSet`, making this useful tool more accessible.
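For illustration, a small sketch of the `RegexSet` idea using the `regex` crate; the two patterns are made-up stand-ins, not actual Nosey Parker rules:

```rust
// Sketch: evaluate many rules against a blob in one pass with regex::RegexSet.
use regex::RegexSet;

fn main() {
    let rules = RegexSet::new([
        r"AKIA[0-9A-Z]{16}",                 // stand-in for an AWS-key-shaped rule
        r"-----BEGIN RSA PRIVATE KEY-----",  // stand-in for a private-key rule
    ])
    .expect("patterns should compile");

    let blob = "token = AKIAABCDEFGHIJKLMNOP";
    let matched: Vec<usize> = rules.matches(blob).into_iter().collect();
    println!("rules that matched: {matched:?}");
}
```

Note that `RegexSet` only reports which patterns matched, not where; locating the actual match spans would still require a second pass with the individual regexes.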
Lastly, if the scanning step were skipped, one could go straight to decoding packs directly so as not to waste a cycle on decoding packed objects, further speeding up decoding to reach light speed (i.e., the maximum possible speed :D).
Results
* `git2` blob extraction: 4.67 GiB/s
* `gitoxide` blob extraction (3.2x)
* `gitoxide` blob extraction (3.7x)

The speedup above will be even more significant on machines with more cores - `git2` has globally locked state somewhere and can't scale well even when opening multiple repository handles. Note that the above numbers were created on ARM and I had to disable the actual scanning.

There are still ways to speed this up: trivially, by avoiding cloning the blob data, for a few percent maybe; and more radically, by changing the algorithm to leverage `gitoxide` pack resolution, which can bring data decompression performance up to 12GB/s on my machine for the kernel pack, and I have seen 36GB/s on a Ryzen. That means in theory, if scanning is free, we are looking at 2.5s for scanning the entire Linux kernel (on a Ryzen).