switch to custom Hasher implementation #630

pascalkuthe · 2022-11-27T22:00:45Z

Speeds up pack creation by about 1% and ensures
that the Hasher will remain robust when
more variants (like sha256) are added to ObjectId in the future.

Looking at the flamegraph, the HashMap doesn't even show up tough.
It seems pack creation is almost entirely bound by zlip decoding during commit transversal.
So this might be a larger improvement in hashing performance but just doesn't matter as much overall.
Since git is still quite a bit faster during (single-threaded) pack creation it
would definitely be interesting to find out why gitoxide is slower.
Since gitoxide uses a zlib-ng I would expect it to be faster than git if decoding is really the bottleneck.
Without that advantage the gap would be even larger.

Perhaps git somehow manages to do less work during transversal compared to gitoxide?

Ok for Byron review the PR on video?

I give my permission to record review and upload on YouTube publicly

If I think the review will be helpful for the community, then I might record and publish a video.

Byron · 2022-11-28T08:22:14Z

Perhaps git somehow manages to do less work during transversal compared to gitoxide?

I would not be surprised. Now, having forgotten most of the history collected in the tracking issue I wonder why it decompresses entries to learn about the size instead of querying the index for that. Oh, right, because it first has to discover all objects and that tree traversal takes a long time even though it already skips over trees it has already seen.

- rename `git-shamap` to `git-hash-collections`. That way more of these can be added as we go, starting with hashmaps (and maybe even ending with them). - remove unused 'raw' module. It can be brought back should it become relevant again. - change names to mirror `std::collections::*` and rely on prefixing with the whole cratename more. - pledge to not using unsafe code - prepare for adding 'breaking change' commit message to `git-revision`

…ons`. Due to the importance of best-suited data structures for maximizing performance we need to take control over them. This is best done using a dedicated crate that can cater to our very needs. That crate is named `git-hash-collections`.

Byron

Thanks a lot for contributing! Every percent of performance is appreciated!

As a general note, it's very important to split commits and use the correct commit message to signal breaking changes. I have created a write-up here: 932c353 .

Besides that I think it's ready for being merged, but wanted to see if you are OK with the changes first.

Thank you

git-pack/Cargo.toml

pascalkuthe · 2022-11-28T15:15:51Z

Looks good to me, I will follow the commit conventions in the future.
The documentation will certanly help with that.

A few comnments:

pledge to not using unsafe code

Fine for now but custom collections (and hashtables in particular) are a prime candidate for unsafe code so we might have to remove that in the future.

rename git-shamap to git-hash-collections

Bikeshed: Maybe git-hashtable (or git-hashtables) instead?
All hash based collections I know of are hashtables internally and just have some extra abstractions on top so hash-collections sounds a bit odd to me (but it's also fine)

Byron · 2022-11-28T15:32:41Z

Fine for now but custom collections (and hashtables in particular) are a prime candidate for unsafe code so we might have to remove that in the future.

That sounds absolutely fair. All new crates (should) start out with these deny's (except for missing_docs), and that should help to also remind us that we try to avoid unsafe, and write safe code first if at all possible.

Bikeshed: Maybe git-hashtable (or git-hashtables) instead?

Yes, let's do that instead! git-hashtable sounds good to me.

I think once the naming is fixed, we can merge this one.

Byron · 2022-11-28T18:04:38Z

Closing as it's already merged, but I had to rewrite all commits but the first one in the end.

switch to custom Hasher implementation

269d59e

pascalkuthe force-pushed the optimize_hashtables branch from 1e783cf to 269d59e Compare November 27, 2022 22:28

Byron added 4 commits November 28, 2022 09:29

fix build warnings; adjust author information of git-shamap

3503ee2

bring git-shamap into the family by referencing it in various places.

63a1adf

Byron approved these changes Nov 28, 2022

View reviewed changes

git-pack/Cargo.toml Outdated Show resolved Hide resolved

pascalkuthe added 2 commits November 28, 2022 16:39

chore: rename git-hash-collections to git-hashtable

bc06202

chore forbid(unsafe) in git-hashtable

08d7847

pascalkuthe marked this pull request as ready for review November 28, 2022 15:51

Byron closed this Nov 28, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

switch to custom Hasher implementation #630

switch to custom Hasher implementation #630

pascalkuthe commented Nov 27, 2022

Byron commented Nov 28, 2022

Byron left a comment

pascalkuthe commented Nov 28, 2022

Byron commented Nov 28, 2022

Byron commented Nov 28, 2022

switch to custom Hasher implementation #630

switch to custom Hasher implementation #630

Conversation

pascalkuthe commented Nov 27, 2022

Byron commented Nov 28, 2022

Byron left a comment

Choose a reason for hiding this comment

pascalkuthe commented Nov 28, 2022

Byron commented Nov 28, 2022

Byron commented Nov 28, 2022