Conversation

@Byron commented Dec 9, 2022

This implementation makes only the minimal changes required to use gitoxide instead of git2 to extract blob data.

Please note that the first commit exists to make the project build on ARM, effectively disabling the actual scanning.
With some work, I think the rules database could be made to run with RegexSet, making this useful tool more accessible.
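
For illustration, a minimal sketch of what RegexSet-based multi-pattern matching looks like (the two rules below are placeholders, not Nosey Parker's actual database):

    use regex::bytes::RegexSet;

    // Placeholder rules, not Nosey Parker's actual database. A RegexSet
    // scans the input once and reports which of its patterns matched.
    // In real use the set would be compiled once, not per call.
    fn matching_rules(blob: &[u8]) -> Vec<usize> {
        let rules = RegexSet::new([
            r"AKIA[0-9A-Z]{16}",                   // shaped like an AWS key id
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----", // shaped like a PEM header
        ])
        .expect("static patterns are valid");
        rules.matches(blob).into_iter().collect()
    }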

Lastly, if the scanning step were skipped, one could go straight to decoding packs directly, not wasting a cycle on decoding packed objects, further speeding up decoding
to reach light speed (i.e., the maximum possible speed :D).

Results

  • 1.46 GiB/s git2 blob extraction
  • 4.67 GiB/s gitoxide blob extraction (3.2x)
  • 5.47 GiB/s gitoxide blob extraction (3.7x)

The speedup above will be even more significant on machines with more cores: git2 has globally locked state somewhere and can't scale well even when multiple repository handles are opened.

Note that the above numbers were produced on ARM, where I had to disable the actual scanning. There are still ways to speed this up: trivially, by avoiding the clone of the blob data (for a few percent, maybe), and more radically, by changing the algorithm to leverage gitoxide's pack resolution, which can bring data decompression performance up to 12GB/s on my machine for the kernel pack; I have seen 36GB/s on a Ryzen.

That means in theory, if scanning is free, we are looking at 2.5s for scanning the entire linux kernel (on a Ryzen).

❯ cargo run --release  --no-default-features  -- scan  -d datastore-delme ~/dev/github.com/git/git/
    Finished release [optimized] target(s) in 0.08s
     Running `target/release/noseyparker scan -d datastore-delme /Users/byron/dev/github.com/git/git/`
Found 17.06 GiB from 6,056 plain files and 535,038 blobs from 5 Git repos [00:00:02]
Scanning content  ████████████████████ 100%  17.06 GiB/17.06 GiB  [00:00:04]
Scanned 6.16 GiB from 168,667 blobs in 4 seconds (1.46 GiB/s); 0/0 new matches

Run the `report` command next to show finding details.

noseyparker ( gitoxide) took 7s
❯ cargo run --release  --no-default-features  -- scan  -d datastore-delme ~/dev/github.com/git/git/
   Compiling once_cell v1.16.0
   Compiling smallvec v1.10.0
   Compiling lock_api v0.4.9
   Compiling parking_lot_core v0.9.5
   Compiling thiserror-impl v1.0.37
   Compiling cmake v0.1.49
   Compiling quick-error v2.0.1
   Compiling num-traits v0.2.15
   Compiling crossbeam-queue v0.3.8
   Compiling sha1-asm v0.5.1
   Compiling crc32fast v1.3.2
   Compiling bstr v1.0.1
   Compiling adler v1.0.2
   Compiling libz-sys v1.1.8
   Compiling miniz_oxide v0.6.2
   Compiling ahash v0.7.6
   Compiling human_format v1.0.3
   Compiling thiserror v1.0.37
   Compiling bytesize v1.1.0
   Compiling time-core v0.1.0
   Compiling hashbrown v0.12.3
   Compiling num_threads v0.1.6
   Compiling time-macros v0.2.6
   Compiling crossbeam v0.8.2
   Compiling prodash v21.1.0
   Compiling jwalk v0.6.0
   Compiling sha1_smol v1.0.0
   Compiling parking_lot v0.12.1
   Compiling minimal-lexical v0.2.1
   Compiling signal-hook v0.3.14
   Compiling signal-hook-registry v1.4.0
   Compiling serde v1.0.147
   Compiling git-hash v0.10.1
   Compiling git-validate v0.7.0
   Compiling nom v7.1.1
   Compiling git-path v0.6.0
   Compiling fastrand v1.8.0
   Compiling btoi v0.4.2
   Compiling time v0.3.17
   Compiling remove_dir_all v0.5.3
   Compiling tempfile v3.3.0
   Compiling dashmap v5.4.0
   Compiling memmap2 v0.5.8
   Compiling hash_hasher v2.0.3
   Compiling ahash v0.8.2
   Compiling rustversion v1.0.9
   Compiling git-quote v0.4.0
   Compiling git-tempfile v3.0.0
   Compiling git-config-value v0.9.0
   Compiling git-glob v0.5.0
   Compiling git-sec v0.5.0
   Compiling git-lock v3.0.0
   Compiling imara-diff v0.1.5
   Compiling unicode-bom v1.1.4
   Compiling git-date v0.3.0
   Compiling arrayvec v0.7.2
   Compiling git-actor v0.14.1
   Compiling castaway v0.2.2
   Compiling atoi v1.0.0
   Compiling git-chunk v0.4.0
   Compiling uluru v3.0.0
   Compiling git-command v0.2.0
   Compiling nix v0.25.1
   Compiling tracing-core v0.1.30
   Compiling git-bitmap v0.2.0
   Compiling compact_str v0.6.1
   Compiling filetime v0.2.19
   Compiling clru v0.5.0
   Compiling home v0.5.4
   Compiling thread_local v1.1.4
   Compiling io-close v0.3.7
   Compiling utf8-width v0.1.6
   Compiling arc-swap v1.5.1
   Compiling git-prompt v0.2.0
   Compiling tracing-log v0.1.3
   Compiling indexmap v1.9.2
   Compiling git-mailmap v0.6.0
   Compiling hashlink v0.8.1
   Compiling tracing v0.1.37
   Compiling clap v4.0.29
   Compiling openssl v0.10.43
   Compiling tracing-subscriber v0.3.16
   Compiling rusqlite v0.28.0
   Compiling sha1 v0.10.5
   Compiling bstr v0.2.17
   Compiling byte-unit v4.0.17
   Compiling serde_json v1.0.88
   Compiling serde_yaml v0.9.14
   Compiling globset v0.4.9
   Compiling csv v1.1.6
   Compiling prettytable-rs v0.9.0
   Compiling ignore v0.4.18
   Compiling libssh2-sys v0.2.23
   Compiling flate2 v1.0.25
   Compiling git-features v0.24.1
   Compiling git-object v0.23.0
   Compiling git-attributes v0.6.0
   Compiling git-url v0.11.0
   Compiling libgit2-sys v0.14.0+1.5.0
   Compiling git-credentials v0.7.0
   Compiling git-traverse v0.19.0
   Compiling git-ref v0.20.0
   Compiling git-diff v0.23.0
   Compiling git-revision v0.7.0
   Compiling git-pack v0.27.0
   Compiling git-index v0.9.1
   Compiling git-refspec v0.4.0
   Compiling git-config v0.12.0
   Compiling git-worktree v0.9.0
   Compiling git-discover v0.9.0
   Compiling git-odb v0.37.0
   Compiling git-repository v0.29.0
   Compiling git2 v0.15.0
   Compiling noseyparker v0.10.0 (/Users/byron/dev/github.com/praetorian-inc/noseyparker)
    Finished release [optimized] target(s) in 25.99s
     Running `target/release/noseyparker scan -d datastore-delme /Users/byron/dev/github.com/git/git/`
Found 17.06 GiB from 6,056 plain files and 535,038 blobs from 5 Git repos [00:00:02]
Scanning content  ████████████████████ 100%  17.06 GiB/17.06 GiB  [00:00:01]
Scanned 6.12 GiB from 165,099 blobs in 1 second (4.67 GiB/s); 0/0 new matches

Run the `report` command next to show finding details.

noseyparker ( gitoxide) +1081 -62 [!] took 29s

Commit note: "That way the library can be built on non-x64 platforms."
@Byron marked this pull request as ready for review December 9, 2022 12:41
@Byron changed the title from "speedup scanning by 3.2x using gitoxide" to "speedup scanning by 3.7x using gitoxide" Dec 9, 2022
@bradlarsen (Contributor)

Hi @Byron. This is terrific; thank you for submitting this PR!

I had seen gitoxide previously, and considered it for reading Git objects instead of git2, but had not had a chance to investigate further. The use of git2 is currently a performance bottleneck when scanning, particularly when using many parallel workers.

Let me kick the tires on this a bit. I have a corpus of over 7000 Git repos that I'd like to run these changes on.

@bradlarsen (Contributor)

> With some work, I think the rules database could be made to run with RegexSet, making this useful tool more accessible.

Indeed, the current version of Nosey Parker requires an x86_64 CPU to easily build and run.

It is not impossible to run Hyperscan on ARM, however. There is a fork that adds support for additional platforms: https://github.com/vectorcamp/vectorscan. I have some local changes to use that fork instead of Hyperscan, and it seems to work on a MacBook Pro with an M1 Max. I'm hoping to get these changes merged into main in the next week or so.

I have not tried RegexSet specifically. But in an earlier implementation of Nosey Parker, I did briefly try re2, which similarly supports scanning for many patterns simultaneously. That experiment resulted in an order of magnitude slower scanning. I suspect RegexSet would also be slower than Hyperscan (which at least with certain rulesets can scan at >25Gbps on a single core).

The experiment using RegexSet would be interesting, but at this point we will probably be sticking with Hyperscan (and soon moving to the vectorscan fork that supports ARM).

@Byron (Author) commented Dec 9, 2022

Awesome, I am very much looking forward to seeing the verdict of your corpus run!

Vectorscan sounds like the way to go on M1, and I am really looking forward to running it for the first time on my own machine proper, with an actual scan. Then it should become clear whether it's worth getting this sped up to 12GB/s.

@bradlarsen (Contributor)

> Vectorscan sounds like the way to go on M1, and I am really looking forward to running it for the first time on my own machine proper, with an actual scan. Then it should become clear whether it's worth getting this sped up to 12GB/s.

If you are impatient and ambitious, you could build with vectorscan yourself, as I sketched in #5. Big idea: build vectorscan from source, set the HYPERSCAN_ROOT environment variable appropriately, then cargo build, and it should work.

@bradlarsen (Contributor)

@Byron I have run some tests with the modified blob-reading code from this PR that uses gitoxide instead of git2.
I want this in Nosey Parker.

I ran with actual scanning via Hyperscan still enabled. I see a 2-2.5x overall speedup from your changes across the few repos I tried. For example, when running over 100GB of Linux history on my 10-core M1 Max: 2.51GiB/s scan rate with your changes vs 1.18GiB/s without.

I am still in the process of validating this with my corpus of 7000+ Git repos. I suspect it will run 2-3x faster in total with your changes. The speed is great; I just want to battle-test it some more.

A question for you from your original description:

> if the scanning step were skipped, one could go straight to decoding packs directly, not wasting a cycle on decoding packed objects, further speeding up decoding to reach light speed

What do you mean here—can you expand on this? Are there particular gitoxide crates or APIs I should take a look at?

@bradlarsen (Contributor)

> changing the algorithm to leverage gitoxide pack resolution

@Byron can you also please elaborate a little more on what you have in mind here? Are there crates or APIs I should be looking at?

@Byron (Author) commented Dec 13, 2022

That's fantastic news! I'd be keen to hear the final numbers, to learn how long it takes to go through all these 7k repositories and maybe how much data was churned. Again, depending on the pack compression and with the right machine, 35GB/s is achievable.

To do that, you'd have to use a special cache that accelerates the decoding of packs. There is gix verify, which I believe already decodes each and every object of a repository, with data rates as high as 12GB/s on the linux kernel on M1. Doing it like this requires some changes to the core algorithm, but the linked code should show how it's done.

Lastly, I'd make the enumeration step optional, as you can instead rely on and visualize the progress information that gitoxide delivers. There is a portion in the upcoming cargo integration that shows how to feed your own progress bars with it. With prior enumeration you get upper bounds for the progress bars, but those numbers are expensive to compute.

I have also opened a tracking issue with features I think you will need to fully replace git2 in this codebase.

I hope that helps.

@bradlarsen (Contributor)

@Byron some data for you!

Benchmarks from scanning 7.5k Git repositories

Test System

Note that these numbers are from a different computer than my earlier comments in this PR. This system is a 2019 MacBook Pro with a 2.6 GHz 6-Core Intel Core i7, 64 GB 2667 MHz DDR4, and a 1TB NVMe disk.

Input Data

The inputs are 7499 fully-packed, bare Git repositories that take up 182GiB on disk. Within those repositories are 45M blobs in total, comprising 1.78TiB. Of those, 27.5M are distinct blobs comprising 1.3TiB. (The difference between "total" and "distinct" is because some of the repos are forks of others, and many repos have vendored source code from certain dependencies.)

Nosey Parker detects the duplicates based on blob id, and only scans each distinct blob a single time. So the figure to keep in mind: 27.5M distinct blobs comprising 1.3TiB.
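
For illustration, a minimal sketch of that deduplication idea (not the actual implementation):

    use std::collections::HashSet;

    // A blob id is a hash of the blob's content, so a set of already-seen
    // ids is enough to ensure each distinct blob is scanned exactly once,
    // no matter how many repositories contain it.
    fn should_scan(seen: &mut HashSet<[u8; 20]>, blob_id: [u8; 20]) -> bool {
        // `insert` returns true only the first time an id is added.
        seen.insert(blob_id)
    }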

Benchmark Configurations

% cargo --version
cargo 1.61.0 (a028ae42f 2022-04-29)
% rustc --version
rustc 1.61.0 (fe5b13d68 2022-05-18)
  • Baseline: git2-based Nosey Parker from commit dcf5cb3
  • Candidate: git-repository-based Nosey Parker from dcf5cb3 with the relevant changes from this PR

Both variants were built in release mode with cargo build --release, linking against libhyperscan-dev installed from Homebrew.

Results

Summary of findings

These are the same in both baseline and candidate.

 Rule                                        Distinct Matches   Total Matches
──────────────────────────────────────────────────────────────────────────────
 Generic Secret                                         5,837         113,560
 Generic API Key                                        4,151         420,914
 PEM-Encoded Private Key                                4,065          12,592
 bcrypt Hash                                            3,816          12,985
 Azure Connection String                                3,016         110,232
 JSON Web Token (base64url-encoded)                     2,487         187,062
 Azure App Configuration Connection String              2,078          24,414
 Google API Key                                         1,287          16,625
 AWS API Key                                              750          24,398
 Google Client ID                                         428           3,120
 Credentials in ODBC Connection String                    257           5,748
 netrc Credentials                                        189           9,637
 AWS Secret Access Key                                    154           2,493
 Facebook Secret Key                                       93             579
 Stripe API Test Key                                       85           7,256
 Google OAuth Access Token                                 71             486
 Mapbox Public Access Token                                67             491
 md5crypt Hash                                             53             217
 Twitter Client ID                                         53             515
 Slack Webhook                                             40             507
 Slack                                                     38             331
 MailChimp API Key                                         38             108
 Jenkins Token or Crumb                                    36             119
 GitHub Secret Key                                         35           1,197
 GitHub Client ID                                          34           1,169
 Sauce Token                                               33             210
 GitHub Personal Access Token                              32           4,219
 CodeClimate                                               30              87
 AWS Account ID                                            30             582
 Google Client Secret                                      26              60
 Stripe API Key                                            25             379
 SendGrid API Key                                          24             192
 LinkedIn Client ID                                        22             558
 Twitter Secret Key                                        21             415
 Mailgun API Key                                           13              67
 Microsoft Teams Webhook                                   12             327
 Amazon MWS Auth Token                                     12              19
 SonarQube Token                                           11              82
 Credentials in PsExec                                     11             168
 Slack Token                                                9              10
 AWS Session Token                                          9              90
 LinkedIn Secret Key                                        8             312
 Twilio API Key                                             7              16
 Facebook Access Token                                      5              10
 NuGet API Key                                              4               4
 Hardcoded Gradle Credentials                               4             462
 GitHub OAuth Access Token                                  4             466
 Square OAuth Secret                                        3               4
 Square Access Token                                        3              10
 PyPI Upload Token                                          3              28
 Okta API Token                                             3               7
 Heroku API Key                                             3              47
 GitHub App Token                                           3          31,467
 Artifactory API Key                                        3             453
 Mapbox Secret Access Token                                 2              16
 Dynatrace Token                                            2               6
 StackHawk API Key                                          1               1
 GitHub Refresh Token                                       1             175

Baseline performance (git2)

blarsen@Bradfords-MacBook-Pro noseyparker % (INPUTS=(~/clones); NP=(cargo run -r --); export NP_DATASTORE=benchmark.baseline.np; rm -rf "$NP_DATASTORE" && time $NP scan $INPUTS -d "$NP_DATASTORE")
...
Found 1.78 TiB from 135,119 plain files and 45,306,315 blobs from 7,499 Git repos [00:05:21]
...
Scanning content  ████████████████████ 100%  1.78 TiB/1.78 TiB  [00:29:31]
Scanned 1.33 TiB from 27,636,004 blobs in 30 minutes (785.92 MiB/s); 997,704/997,704 new matches
...
$NP scan $INPUTS -d "$NP_DATASTORE"  15723.50s user 2200.15s system 844% cpu 35:21.30 total

Candidate performance (gitoxide)

blarsen@Bradfords-MacBook-Pro noseyparker [130] % (INPUTS=(~/clones); NP=(cargo run -r --); export NP_DATASTORE=benchmark.candidate.np; rm -rf "$NP_DATASTORE" && time $NP scan $INPUTS -d "$NP_DATASTORE")
...
Found 1.78 TiB from 135,119 plain files and 45,306,315 blobs from 7,499 Git repos [00:05:12]
...
Scanning content  ████████████████████ 100%  1.78 TiB/1.78 TiB  [00:23:40]
Scanned 1.30 TiB from 27,535,959 blobs in 24 minutes (960.06 MiB/s); 997,704/997,704 new matches
...
$NP scan $INPUTS -d "$NP_DATASTORE"  14687.55s user 1212.05s system 899% cpu 29:28.20 total

Commentary

In this experiment on ~7.5k repositories, the gitoxide implementation is noticeably faster than the git2 implementation, about 1.25x faster scanning in aggregate wall clock time. The total application runtime is 1.2x faster.

From a bit of profiling investigation on different workloads, it looks like Nosey Parker spends the bulk of its time either in extracting Git blob content (either git2 or gitoxide depending on the variant) or in its second-stage matching using the regex crate. This depends on the input data; some inputs end up with ~80% of total time spent in regex, and other inputs end up with ~80% of total time spent extracting Git blob content. This ~7.5k repo input data is skewed more toward time spent in regex.

(The profiling investigation is hindered by the use of rayon, which ends up putting dozens or hundreds of plumbing-type frames in the profile samples above the code that does actual work. This seems to be a known issue, without there being a good solution for it at present.)

It does seem like speeding up the extraction of Git blobs in Nosey Parker would speed up the program as a whole, though to be most effective, the second-stage matching performance needs to be improved too. I'm looking into that at the regex level, and will also eventually take a close look at the 60-something patterns that Nosey Parker currently uses, figuring out which ones are to blame.

@bradlarsen (Contributor)

@Byron some more benchmark data!

Benchmarks from scanning 100GiB of Linux Kernel history

Benchmark setup

This uses the same system, baseline, and candidate versions as the 7.5k repository benchmarks.

Results

Baseline (git2)

(INPUTS=(clones/linux.git); NP=(cargo run -r --); export NP_DATASTORE=benchmark.linux.baseline.np; rm -rf "$NP_DATASTORE" && time $NP scan $INPUTS -d "$NP_DATASTORE")
...
Found 102.02 GiB from 18 plain files and 2,829,263 blobs from 1 Git repos [00:00:50]
Scanning content  ████████████████████ 100%  102.02 GiB/102.02 GiB  [00:02:46]
Scanned 102.02 GiB from 2,829,281 blobs in 3 minutes (628.78 MiB/s); 50/50 new matches

 Rule                      Distinct Matches   Total Matches
────────────────────────────────────────────────────────────
 PEM-Encoded Private Key                 34              34
 md5crypt Hash                            5               6
 netrc Credentials                        3               7
 bcrypt Hash                              2               2
 Generic Secret                           1               1

Run the `report` command next to show finding details.
$NP scan $INPUTS -d "$NP_DATASTORE"  789.53s user 217.94s system 461% cpu 3:38.22 total

Candidate (gitoxide)

blarsen@Bradfords-MacBook-Pro noseyparker % (INPUTS=(clones/linux.git); NP=(cargo run -r --); export NP_DATASTORE=benchmark.linux.candidate.np; rm -rf "$NP_DATASTORE" && time $NP scan $INPUTS -d "$NP_DATASTORE")
...
Found 102.02 GiB from 18 plain files and 2,829,263 blobs from 1 Git repos [00:00:27]
Scanning content  ████████████████████ 100%  102.02 GiB/102.02 GiB  [00:00:58]
Scanned 102.02 GiB from 2,829,281 blobs in 59 seconds (1.74 GiB/s); 50/50 new matches

 Rule                      Distinct Matches   Total Matches
────────────────────────────────────────────────────────────
 PEM-Encoded Private Key                 34              34
 md5crypt Hash                            5               6
 netrc Credentials                        3               7
 bcrypt Hash                              2               2
 Generic Secret                           1               1

Run the `report` command next to show finding details.
$NP scan $INPUTS -d "$NP_DATASTORE"  678.12s user 7.82s system 786% cpu 1:27.25 total

Discussion

In the case of scanning the Linux kernel, the performance difference between the git2 baseline and the gitoxide candidate is much more pronounced. The gitoxide candidate scans ~2.75x faster, and total application wall clock time is 2.5x faster.

In this input, far fewer matches are found compared to the 7.5k repositories, and so the application speed is dictated much more by the Git blob extraction code. Looking briefly at a profiler, it looks like ~80% of total application runtime is spent in blob extraction.

Note that the 7.5k repository input is going to be an atypical workload: many of the repositories there were chosen for the presence of hardcoded secrets.

Takeaway 1: For inputs like this, using gitoxide instead of git2 will be hugely beneficial.

Takeaway 2: On a system with more cores and higher available parallelism, using gitoxide instead of git2 shows bigger speedups.

Takeaway 3: Nosey Parker's second-stage regex matching could benefit from performance improvements.

@Byron (Author) commented Dec 16, 2022

Thanks so much for sharing this elaborate and exhaustive performance report :)! I would have hoped for more of a speedup in the first case with the 7.5k repos, but I can see why workloads dominated by the scanning might not benefit as much. Then again, I'd think that the performance for scanning a file is mostly determined by the size of the blob, but @BurntSushi would definitely know better as the author of the regex crate, which appears to be used now 🎉.

I want to get back to this PR once gitoxide supports reading object headers efficiently, something I'd still like to see happen this year.

What are your plans with this PR? It seems main has changed quite a bit in the meantime. Would you be willing to adjust this PR to your liking so it can be merged, while I set up a new one that replaces the read_header() logic with gitoxide? The latter would probably allow ditching git2 entirely, and we could run the comparison again, maybe even on a bigger machine with many more cores.

Thanks and cheers.

@bradlarsen (Contributor)

@Byron:

> Then again, I'd think that the performance for scanning a file is mostly determined by the size of the blob, but @BurntSushi would definitely know better as the author of the regex crate, which appears to be used now 🎉.

Nosey Parker indeed uses regex, but that's for its second-stage matching. It still uses Hyperscan for the first stage, which very quickly reports for any blob a set of (pattern, match end offset) pairs. These pairs are then used to run a second stage of matching through the regex crate to get match start offsets and capture groups.

The performance of the regex matching should be dependent not on the size of the input blob, but rather on the length of the match. In its implementation, regex uses an optimization in the presence of $ anchors that allows it to match input backward from the end. I need to verify for certain, but this optimization should be firing with the patterns Nosey Parker is using.
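
For illustration, a hedged sketch of that second stage; the function shape is an assumption, not Nosey Parker's actual code:

    use regex::bytes::Regex;

    // Assumed shape: resolve a Hyperscan report of (rule pattern, match end
    // offset) into a start offset and capture groups. Appending `$` and
    // searching only the prefix that ends at the reported offset lets the
    // regex engine anchor at the end and work backward.
    fn second_stage<'b>(
        rule_pattern: &str,
        blob: &'b [u8],
        end_offset: usize,
    ) -> Option<regex::bytes::Captures<'b>> {
        // In real code the anchored regex would be compiled once per rule.
        let anchored = Regex::new(&format!("(?:{rule_pattern})$")).ok()?;
        anchored.captures(&blob[..end_offset])
    }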

> I want to get back to this PR once gitoxide supports reading object headers efficiently, something I'd still like to see happen this year.

> What are your plans with this PR? It seems main has changed quite a bit in the meantime. Would you be willing to adjust this PR to your liking so it can be merged, while I set up a new one that replaces the read_header() logic with gitoxide? The latter would probably allow ditching git2 entirely, and we could run the comparison again, maybe even on a bigger machine with many more cores.

Thanks for asking. Yes, this PR has diverged from main a bit. I'm going to make a new PR based on the changes here, mark you as a coauthor, and merge that back. Does that sound okay?

Let me know when gitoxide can read object headers efficiently and I will be happy to strip out git2 entirely from Nosey Parker in a future PR. 🚀

@Byron (Author) commented Dec 17, 2022

Thanks for the explainer on how the regex matching works; it's amazing to see how much it takes to be this fast!

> Thanks for asking. Yes, this PR has diverged from main a bit. I'm going to make a new PR based on the changes here, mark you as a coauthor, and merge that back. Does that sound okay?

Yes, that sounds great, thank you!

> Let me know when gitoxide can read object headers efficiently and I will be happy to strip out git2 entirely from Nosey Parker in a future PR. 🚀

I feel like having a little bit of extra fun this weekend and will try to get something like read_header() done; I will let you know once it's available so you can do the final integration. I will probably add something like gix odb total-size to exercise it plainly but in a multi-threaded fashion, and post some results here too.

@bradlarsen (Contributor)

@Byron: all sounds good! Looking forward to the read_header() stuff, whenever you get to it.

FYI I did a bit more performance investigation. In Nosey Parker, on the 7.5k repositories input, I was mistaken about where time was being spent: even with the second-stage regex scanning fully enabled, the bulk of the time is spent extracting Git blobs. I ran an experiment that disabled scanning entirely, leaving just blob enumeration and extraction, and the total runtime was hardly reduced.

My hypothesis: the way that Nosey Parker currently scans blobs is rather suboptimal when dealing with big packfiles. Nosey Parker uses Rayon to parallelize over Git repositories and also over the blobs within each, as sketched below. But the way it splits up the work for scanning blobs ends up with multiple threads doing redundant work to extract the blobs they are scanning. I spot-checked a handful of repositories, and gix verify was usually 2-3x faster than noseyparker scan.
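
For illustration, the shape of that work-splitting in a minimal sketch (the types are placeholders for actual repositories and blobs):

    use rayon::prelude::*;

    // A parallel iterator over repositories with a nested parallel iterator
    // over each repository's blobs: every blob is extracted independently,
    // so delta chains inside a big packfile get resolved again and again.
    fn scan_repos(repos: &[Vec<Vec<u8>>]) {
        repos.par_iter().for_each(|blobs| {
            blobs.par_iter().for_each(|blob| {
                scan_blob(blob); // extract + scan, with no shared pack cache
            });
        });
    }

    fn scan_blob(_blob: &[u8]) { /* scanning elided */ }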

There is probably another 2-3x speedup possible if Nosey Parker instead had gitoxide drive the blob extraction process (like gix verify).

Commit note: "This doesn't speed up the enumeration phase (which might not be necessary) but shows how much faster gitoxide can be if used like that."
@Byron (Author) commented Dec 18, 2022

odb.header() is now available 🎉. By the looks of it, it's quite a bit faster than what's provided by git2, but it could definitely use much more thorough testing. I have implemented gix odb stats to exercise the new functionality as well, and with gix -t1 odb stats it can be forced into 'perfect single-threaded' mode. There it's clear that it spends most of its time parsing pack headers (~75%), and the rest is spent decompressing the first pack entry of an object. This makes reading the header of a packed object roughly 8 times faster than decompressing it. As it doesn't implement any caching, it does a lot of redundant work as well, similar to what you have discovered when decoding the entire object.
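
For illustration, a hedged sketch of how a scanner could use such a header call to filter objects without decompressing them; the types and the header() signature are stand-ins inferred from the description above, not the actual gitoxide API:

    // Stand-in types, not the actual gitoxide API. The point: reading just
    // an object's kind and size (roughly 8x cheaper than decompressing it,
    // per the numbers above) is enough to decide whether to extract it.
    #[derive(PartialEq)]
    enum Kind { Commit, Tree, Blob, Tag }

    struct Header { kind: Kind, size: u64 }

    trait Odb {
        fn header(&self, id: &[u8; 20]) -> std::io::Result<Header>;
    }

    fn worth_extracting(odb: &dyn Odb, id: &[u8; 20], max_size: u64) -> bool {
        match odb.header(id) {
            // Only blobs get scanned, and oversized ones can be skipped.
            Ok(h) => h.kind == Kind::Blob && h.size <= max_size,
            Err(_) => false,
        }
    }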

So everything is great, right ;)? Well, some oddity happened :/: when enumerating objects (which may include duplicates) with gitoxide, the number of blobs it claims to find is quite a bit different from what git2 presents. I thought it was because git2 implicitly deduplicates them, but even when doing that manually the numbers don't match at all. On top of that, for some reason, when invoking it with cargo run --release --no-default-features -- scan -d delme ~/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux, it does enumerate that (and any repo) twice. It's like it's finding …/linux and then …/linux/.git separately. I tested this behaviour against git2, and the latter somehow doesn't seem to iterate any objects when path() is not the .git repo. Maybe that's why the object counts are so vastly different; after all, gitoxide traverses both repos just fine.

Unfortunately, the oddities don't stop there. The performance for decoding objects drops to git2 levels! Instead of 6.06GB/s for the linux kernel I am seeing 2.2GB/s. Admittedly, the fantastic values surprised me a bit, and it seems they are possible due to the order in which git2 iterates the objects. It appears to do that in pack order, which would explain why the pack cache is so well used (and thus a lot of duplicate work is avoided).

Now that I have read the paragraph above, that issue doesn't seem all that odd anymore, and I just pushed an update that uses git2 for object iteration. Just to finish this quest, I think tomorrow gitoxide will be able to do the same using a new iteration flag.

I can't wait :D!

@Byron (Author) commented Dec 18, 2022

I think this acquisition of the current time could also be something that costs much more than it's worth. Something far more scalable is to use atomic counters and a separate thread for display. It can be as simple as this example (in principle).
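
For illustration, a minimal sketch of that scheme (names are illustrative): workers only bump an atomic counter on the hot path, and a single display thread renders it at a fixed interval.

    use std::sync::atomic::{AtomicU64, Ordering};
    use std::sync::Arc;
    use std::thread;
    use std::time::Duration;

    fn main() {
        let scanned = Arc::new(AtomicU64::new(0));

        // Only this thread ever formats output or asks for the time.
        let display = {
            let scanned = Arc::clone(&scanned);
            thread::spawn(move || {
                for _ in 0..10 { // a real loop would run until shutdown
                    thread::sleep(Duration::from_millis(250));
                    eprintln!("scanned {} bytes", scanned.load(Ordering::Relaxed));
                }
            })
        };

        // What each worker does per blob: a single relaxed atomic add.
        scanned.fetch_add(4096, Ordering::Relaxed);

        display.join().unwrap();
    }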

@Byron (Author) commented Dec 19, 2022

Here is an update on what I think might be the end of my wonderful quest for maximal performance.

The latest git-repository release (v0.30) has support for iteration in pack order, to better leverage pack delta caches; this is similar to what git2 does, but appears to be quite a bit faster. On main there is gix odb stats to try: it can read the headers of all objects of the linux kernel at 6.7 million objects/s on 10 cores of an M1 Pro. git2 shouldn't be needed here anymore with this upgrade.

Additionally, I opened up the acquisition of git repositories via git::open_opts(…), which is allowed to look at certain environment variables to configure the pack cache size for deltas. With an invocation like the following, one can increase the delta-cache size for a potential boost in performance at the cost of memory:

GITOXIDE_PACK_CACHE_MEMORY=512MB cargo run --release  --no-default-features  -- scan  -d delme ~/dev/github.com/git/git/.git

The biggest issue I still see is the discovery of git repositories, which leads to repositories being opened and enumerated multiple times. Maybe enumeration can be changed to something like what's used in ein tool find, which can enumerate deep trees quite fast as well.

❯ time ein tool find ~/dev | wc -l
     457
ein tool find ~/dev  0.03s user 0.08s system 381% cpu 0.028 total
wc -l  0.00s user 0.00s system 5% cpu 0.028 total

The dev folder is rather large:

❯ dua a --stats  ~/dev
 213.49 GB /Users/byron/dev
Statistics { entries_traversed: 3597942, smallest_file_in_bytes: 0, largest_file_in_bytes: 16111816704 }

gitoxide (main) +1 -1 [$!] took 54s

Please note that the git dir enumeration above has a current shortcoming: it does not follow symlinks, and it won't recognize a git directory that is right behind a symlink.

And the biggest potential is certainly in using pack traversal directly, and maybe in making the enumeration step optional, at the expense of not knowing the upper bound for the progress bar.

The best of success with the further development of this project; please let me know if anything else is missing as you finish the migration.

Cheers

Commit note: "It's just single-threaded, but appears to be faster nonetheless."
@bradlarsen (Contributor)

Thanks @Byron; tremendous! I will try to get this merged back before the holidays. Thank you!

@bradlarsen (Contributor)

> when invoking it with cargo run --release --no-default-features -- scan -d delme ~/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux, it does enumerate that (and any repo) twice. It's like it's finding …/linux and then …/linux/.git separately

This is probably the right explanation! noseyparker uses the ignore crate to enumerate the filesystem, and attempts to open each directory as a Git repo. The logic noseyparker is using for that may need some rework; it is not intended to be opening Git repos twice. I'll take a look.

@bradlarsen (Contributor)

> I think this acquisition of the current time could also be something that costs much more than it's worth. Something far more scalable is to use atomic counters and a separate thread for display. It can be as simple as this example (in principle).

Good catch! I had added that hacked-up caching layer on top of indicatif because I had seen noticeable overhead from using indicatif directly from many threads. So instead of updating the indicatif progress every iteration, each worker keeps thread-local progress state and only periodically updates the underlying progress bar.
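
For illustration, a minimal sketch of that batching layer (names are illustrative; the real code differs):

    use indicatif::ProgressBar;
    use std::time::{Duration, Instant};

    // Each worker accumulates progress locally and flushes it to the shared
    // bar at most every ~100ms, so the bar's internal synchronization is
    // hit rarely instead of once per scanned blob.
    struct LocalProgress {
        pending: u64,
        last_flush: Instant,
    }

    impl LocalProgress {
        fn new() -> Self {
            Self { pending: 0, last_flush: Instant::now() }
        }

        fn inc(&mut self, bar: &ProgressBar, n: u64) {
            self.pending += n;
            if self.last_flush.elapsed() >= Duration::from_millis(100) {
                bar.inc(self.pending); // the only synchronized call
                self.pending = 0;
                self.last_flush = Instant::now();
            }
        }
    }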

I had hoped that calling std::time::Instant::elapsed() would avoid entering the kernel each time, perhaps through something like vDSO in the implementation. But I hadn't looked into this very much.

I had been considering switching to something like statusline instead, which is much lighter weight and seems to have been written to address scalability issues with indicatif. Thanks for the pointer to another approach!

@Byron (Author) commented Dec 20, 2022

Thinking about it, maybe the situation has changed and by now it's not an issue anymore to make repeated calls to Instant::elapsed(); I was clearly stigmatised years ago when I tried it 😅. And by the looks of it, the global lock is gone and it's nothing more than a libc call. Probably worth keeping an eye on it as the core algorithm gets faster and faster, to see if it starts showing up in profile runs.

By the looks of it, vDSO is a neat way to make this very, very fast, and maybe that's why this isn't a problem anymore. Good to know; I don't have to be afraid of calling elapsed() anymore.

bradlarsen added a commit that referenced this pull request Dec 21, 2022
@bradlarsen (Contributor)

> when invoking it with cargo run --release --no-default-features -- scan -d delme ~/dev/git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux, it does enumerate that (and any repo) twice. It's like it's finding …/linux and then …/linux/.git separately
>
> This is probably the right explanation! noseyparker uses the ignore crate to enumerate the filesystem, and attempts to open each directory as a Git repo. The logic noseyparker is using for that may need some rework; it is not intended to be opening Git repos twice. I'll take a look.

Oh, I see now. When using git2, repositories were opened with its NO_DOTGIT option, which prevented Nosey Parker from opening and enumerating repositories twice when using that library.

gitoxide doesn't seem to have a comparable option at present. It looks like appending .git to a path is always attempted?
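
For reference, a minimal sketch of the git2 call in question:

    use git2::{Repository, RepositoryOpenFlags};
    use std::ffi::OsStr;
    use std::path::Path;

    // NO_DOTGIT tells libgit2 not to also try `path/.git`, which is what
    // kept a working tree and its `.git` directory from both being opened.
    fn open_exact(path: &Path) -> Result<Repository, git2::Error> {
        Repository::open_ext(path, RepositoryOpenFlags::NO_DOTGIT, std::iter::empty::<&OsStr>())
    }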

Anyway, I was able to work around that in a revised and expanded version of this PR in #20.

@bradlarsen (Contributor)

Closing this PR in favor of #20, which is an updated and expanded version of this that I'm working on merging back to main.

Thank you again @Byron; tremendous work with this! 🍻

@bradlarsen closed this Dec 21, 2022
@Byron (Author) commented Dec 21, 2022

> gitoxide doesn't seem to have a comparable option at present. It looks like appending .git to a path is always attempted?

That's true, but this case is all I need to implement a similar option, which now makes perfect sense. I will let you know in the follow-up PR when it's ready.

@Byron deleted the gitoxide branch December 21, 2022 18:49
bradlarsen pushed a commit that referenced this pull request Dec 22, 2022
* Tweak `summarize` column names for clarity
* Use gitoxide instead of git2, adapted from PR #2
* Use tracing-log to emit `log` records as `tracing` events
* Open repos just once; strip out git2

Depending on your input repositories and how many cores you have, this change can give more than a 3x speedup in overall Nosey Parker runtime.

Co-authored by Sebastian Thiel <[email protected]>.