Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

switch to custom Hasher implementation #630

Closed
wants to merge 7 commits into from

Conversation

pascalkuthe
Copy link
Contributor

Speeds up pack creation by about 1% and ensures
that the Hasher will remain robust when
more variants (like sha256) are added to ObjectId in the future.

Looking at the flamegraph, the HashMap doesn't even show up tough.
It seems pack creation is almost entirely bound by zlip decoding during commit transversal.
So this might be a larger improvement in hashing performance but just doesn't matter as much overall.
Since git is still quite a bit faster during (single-threaded) pack creation it
would definitely be interesting to find out why gitoxide is slower.
Since gitoxide uses a zlib-ng I would expect it to be faster than git if decoding is really the bottleneck.
Without that advantage the gap would be even larger.

Perhaps git somehow manages to do less work during transversal compared to gitoxide?


Ok for Byron review the PR on video?

  • I give my permission to record review and upload on YouTube publicly

If I think the review will be helpful for the community, then I might record and publish a video.

@Byron
Copy link
Owner

Byron commented Nov 28, 2022

Perhaps git somehow manages to do less work during transversal compared to gitoxide?

I would not be surprised. Now, having forgotten most of the history collected in the tracking issue I wonder why it decompresses entries to learn about the size instead of querying the index for that. Oh, right, because it first has to discover all objects and that tree traversal takes a long time even though it already skips over trees it has already seen.

- rename `git-shamap` to `git-hash-collections`. That way more of these
 can be added as we go, starting with hashmaps (and maybe even ending
 with them).
- remove unused 'raw' module. It can be brought back should it become
  relevant again.
- change names to mirror `std::collections::*` and rely on prefixing
  with the whole cratename more.
- pledge to not using unsafe code
- prepare for adding 'breaking change' commit message to `git-revision`
…ons`.

Due to the importance of best-suited data structures for maximizing
performance we need to take control over them. This is best done using
a dedicated crate that can cater to our very needs. That crate is
named `git-hash-collections`.
Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for contributing! Every percent of performance is appreciated!

As a general note, it's very important to split commits and use the correct commit message to signal breaking changes. I have created a write-up here: 932c353 .

Besides that I think it's ready for being merged, but wanted to see if you are OK with the changes first.

Thank you

git-pack/Cargo.toml Outdated Show resolved Hide resolved
@pascalkuthe
Copy link
Contributor Author

Looks good to me, I will follow the commit conventions in the future.
The documentation will certanly help with that.

A few comnments:

  • pledge to not using unsafe code

Fine for now but custom collections (and hashtables in particular) are a prime candidate for unsafe code so we might have to remove that in the future.

rename git-shamap to git-hash-collections

Bikeshed: Maybe git-hashtable (or git-hashtables) instead?
All hash based collections I know of are hashtables internally and just have some extra abstractions on top so hash-collections sounds a bit odd to me (but it's also fine)

@Byron
Copy link
Owner

Byron commented Nov 28, 2022

Fine for now but custom collections (and hashtables in particular) are a prime candidate for unsafe code so we might have to remove that in the future.

That sounds absolutely fair. All new crates (should) start out with these deny's (except for missing_docs), and that should help to also remind us that we try to avoid unsafe, and write safe code first if at all possible.

Bikeshed: Maybe git-hashtable (or git-hashtables) instead?

Yes, let's do that instead! git-hashtable sounds good to me.

I think once the naming is fixed, we can merge this one.

@pascalkuthe pascalkuthe marked this pull request as ready for review November 28, 2022 15:51
@Byron
Copy link
Owner

Byron commented Nov 28, 2022

Closing as it's already merged, but I had to rewrite all commits but the first one in the end.

@Byron Byron closed this Nov 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants