Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade crates-index-diff & git-repository #1935

Merged
merged 4 commits into from
Dec 1, 2022

Conversation

syphar
Copy link
Member

@syphar syphar commented Nov 27, 2022

This updates the crates-index-diff crates, including thread-names for all the threads that crates-index-diff spawns, to get visibility into our CPU-load problem.

related release notes:

Also:

@github-actions github-actions bot added the S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed label Nov 27, 2022
@syphar
Copy link
Member Author

syphar commented Nov 27, 2022

r? @Byron

what do you thing about the workaround @Nemo157 @jyn514 ?

src/build_queue.rs Outdated Show resolved Hide resolved
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration itself looks good to me even though I can't tell what's best to do with the new Change variants.

There are a few more improvements to consider here.

  • set the GITOXIDE_PACK_CACHE_MEMORY=256MB environment variable to increase the maximum cache to use for pack entries, which typically improves diffing performance by a factor of 4. Please note that the number given here is a rough guess and the performance improvement flatlines quickly once the optimal value is reached. Setting it to any value higher than a couple of megabytes is probably going to be worth an improvement already.
  • if you can, set the max-performance feature to get a 2.5x performance boost. Yes, 10x is what's possible if both of these are used. An example both in action is the baseline test run on CI. Note that it uses an overly large value for the pack cache which is me not wanting to try to find an optimum that is lower.
  • You can use peek_changes_ordered() to single-step through the history that has accumulated and thus maintain the order of user-generated changes, leading to a fair queuing order. I benchmarked it and typical changes of two days (~1500) take about 20s to obtain (90 changes/s), thus polling every 5 minutes or so wouldn't take more than a second of compute time given the amount of changes in that duration is much lower than 1500.

I am particularly excited about the fair/correct ordering of changes and would love that to land as part of this PR as well. Maybe it's something for a follow-up one though as it's not required to upgrade.

@syphar
Copy link
Member Author

syphar commented Nov 27, 2022

Thank you for checking this out! @Byron

About your points:
Since we're checking for changes every 60 seconds, differences between peek_changes_ordered and peek_changes should be edge-cases, and similarly we probably don't run into performance issues fetching or diffing. ( please correct me I'm wrong).

Still I changed to using peek_changes_ordered, which seems more like the thing we want.

Also I added the max-performance feature. What's the reason why this is optional? What are the downsides?
GITOXIDE_PACK_CACHE_MEMORY is something we can think about on prod.

@pascalkuthe
Copy link

Also I added the max-performance feature. What's the reason why this is optional? What are the downsides?

Among other things this adds dependencies on zlib-ng and an sha1 implementation that uses assembly for better performance. These dependencies are not available on all platforms (like WASM for example). zlib-ng in particular also causes problems if other crates link to standard zlib implementation because it uses the same symbol names as zlib.
Without the max-performance feature (slower) pure-rust alternatives are used.

Some downstream crates where performance isn't critical that don't want to introduce additional runtime dependencies or that want to support nieche targets might therefore want to opt out.
I am not sure why it is not a default feature for this crate as that is how git-repository handles this. Maybe an oversight @Byron?

@Byron
Copy link
Member

Byron commented Nov 27, 2022

and similarly we probably don't run into performance issues fetching or diffing. ( please correct me I'm wrong).

I agree, I wouldn't expect performance issues no matter which configuration is chosen.

Also I added the max-performance feature. What's the reason why this is optional? What are the downsides?

I second @pascalkuthe, and am happy he joined in as my explanation would have been more hand-wavy for sure 😁.

I am not sure why it is not a default feature for this crate as that is how git-repository handles this. Maybe an oversight @Byron?

Thanks for starting this conversation.

It's one of these trade-offs where we ask what's more important - optimal performance but it might not work at all for some, or reduced performance but best compatibility. git-repository and with it gitoxide has to favor compatibility and crates-index-diff aligns with that stance because I have learned this the hard way.

Since crates-index-diff is standalone it could make its own choices though and if given the choice, I'd leave it as is. The reason is simple, and a little bit selfish: if something doesn't work I get bug reports, if something isn't as fast as it could be, thus far, I didn't get any reports for that. The latter might be because feature flags are quite discoverable by now or that people generally don't have performance issues anyway. Generally I'd hope that we can make the performance options easy to find and maybe documentation can be improved to assure that.

@syphar
Copy link
Member Author

syphar commented Nov 30, 2022

ping @Nemo157 @jyn514 does one of you have time for a review?

I would really love to deploy this to see if crates-index-diff is the reason for the high cpu load in prod.

Copy link
Member

@jyn514 jyn514 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the docs.rs side of things looks good to me.

set the GITOXIDE_PACK_CACHE_MEMORY=256MB environment variable to increase the maximum cache to use for pack entries, which typically improves diffing performance by a factor of 4. Please note that the number given here is a rough guess and the performance improvement flatlines quickly once the optimal value is reached. Setting it to any value higher than a couple of megabytes is probably going to be worth an improvement already.
if you can, set the max-performance feature to get a 2.5x performance boost. Yes, 10x is what's possible if both of these are used. An example both in action is the baseline test run on CI. Note that it uses an overly large value for the pack cache which is me not wanting to try to find an optimum that is lower.

👍 seems good - we have quite a lot of memory slack currently so I'm not too worried about the memory limit being too high, 256 MB is low enough it doesn't affect whether we'll have production issues.
image

I do worry that we're landing this without tests, since it sounds like there were behavior changes on the crates-index-diff side. ideally we would land those before the update, but I know they're a pain to write ... @Byron does crates-index-diff have its own internal tests we can trust?

src/build_queue.rs Outdated Show resolved Hide resolved
@syphar syphar merged commit 754fd98 into rust-lang:master Dec 1, 2022
@syphar syphar deleted the upgrade-git branch December 1, 2022 15:56
@github-actions github-actions bot added S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it and removed S-waiting-on-review Status: This pull request has been implemented and needs to be reviewed labels Dec 1, 2022
@Byron
Copy link
Member

Byron commented Dec 1, 2022

@Byron does crates-index-diff have its own internal tests we can trust?

By now I trust the test suite, it's as good as I could make it. It was the reason we could rewrite the diffing engine so painlessly. Now there is also new baseline tests which validates that no matter how you step through the history to obtain diffs, you will end up at the same state as iterating through all crates using crates-index. This is to assure myself that we don't see these "didn't process my crate version" bugs again and that the diffing logic is consistent. The baseline tests also run once a day against the actual crates-io index to be sure.

@jyn514
Copy link
Member

jyn514 commented Dec 1, 2022

amazing, thank you! ❤️

@syphar syphar removed the S-waiting-on-deploy This PR is ready to be merged, but is waiting for an admin to have time to deploy it label Dec 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants