Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(cargo-package): match certain path prefix with pathspec #14997

Merged
merged 1 commit into from
Jan 13, 2025

Conversation

weihanglo
Copy link
Member

@weihanglo weihanglo commented Jan 1, 2025

What does this PR try to resolve?

This revives #14962. See benchmark chart in #14962 (comment).
#14962 was closed because we found more bugs in cargo package, and #14962 could potentially make them even harder to fix. Two of them have been fixed so this is good to ship IMO with its own good.


An improvement #14955.

check_repo_state checks the entire git repo status. This is usually fine if you have only a few packages in a workspace.

For huge monorepos, it may hit performance issues.

For example,
on awslabs/aws-sdk-rust@2cbd34d
the workspace has roughly 434 members to publish.
git ls-files reported us 204379 files in this Git repository. That means git may need to check status of all files 434 times. That would be 204379 * 434 = 88,700,486 checks!

Moreover, the current algorithm is finding the intersection of PathSource::list_files and git status.
It is an O(n^2) check.
Let's assume files are evenly distributed into each package, so roughly 470 files per package.
If we're unlucky to have some dirty files, say 100 files. We will have to do 470 * 100 = 47,000 times of path comparisons.

Even worse, because we git status everything in the repo, we'll have to it for all members,
even when those dirty files are not part of the current package in question. So it becomes 470 * 100 * 434 = 20,398,000!

Solution

Instead of comparing with the status of the entire repository, this patch use the magic pathspec1 to tell git only reports paths that match a certain path prefix.

This wouldn't help the O(n^2) algorithm,
but at least it won't check dirty files outside the current package. Also, we don't git status against entire git worktree/index anymore.

How should we test and review this PR?

Run this command against awslabs/aws-sdk-rust@2cbd34d,
and see if it is getting better.

CARGO_LOG_PROFILE=1 cargor package --no-verify --offline --allow-dirty -p aws-sdk-accessanalyzer -p aws-sdk-apigateway

I've verified checksums of .crate files generated from master (d85d761) and this commit (3dabdcd). They are the same.

Additional information

There are some other alternatives, like making PathSource::list_files additionally reports dirty files. While we already have rooms to do it, this approach should be the most straightforward one at this moment.

Some other approaches like

  • Switch to gitoxide (I tried and it didn't as good as expected. Maybe I did something wrong).
  • A flag --no-vcs to skip vcs at all
  • Improve the O(n^2) algorithm

`check_repo_state` checks the entire git repo status.
This is usually fine if you have only a few packages in a workspace.

For huge monorepos, it may hit performance issues.

For example,
on awslabs/aws-sdk-rust@2cbd34d
the workspace has roughly 434 members to publish.
`git ls-files` reported us 204379 files in this Git repository.
That means git may need to check status of all files 434 times.
That would be `204379 * 434 = 88,700,486` checks!

Moreover, the current algorithm is finding the intersection of
`PathSource::list_files` and `git status`.
It is an `O(n^2)` check.
Let's assume files are evenly distributed into each package,
so roughly 470 files per package.
If we're unlucky to have some dirty files, say 100 files.
We will have to do `470 * 100 = 47,000` times of path comparisons.

Even worse, because we `git status` everything in the repo,
we'll have to it for all members,
even when those dirty files are not part of the current package in question.
So it becomes `470 * 100 * 434 = 20,398,000`!

Instead of comparing with the status of the entire repository,
this patch use the magic pathspec[1] to tell git only reports
paths that match a certain path prefix.

This wouldn't help the `O(n^2)` algorithm,
but at least it won't check dirty files outside the current package.
Also, we don't `git status` against entire git worktree/index anymore.

[1]: https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefpathspecapathspec
@rustbot
Copy link
Collaborator

rustbot commented Jan 1, 2025

r? @epage

rustbot has assigned @epage.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added A-git Area: anything dealing with git Command-package S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 1, 2025
@epage epage added this pull request to the merge queue Jan 13, 2025
Merged via the queue into rust-lang:master with commit 54df3c7 Jan 13, 2025
20 checks passed
@weihanglo weihanglo deleted the package-perf branch January 13, 2025 20:19
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 18, 2025
Update cargo

13 commits in 088d496082726091024f1689c124a0c3dccbd775..045bf21b36a2e1f3ed85e38278d1c3cc4305e134
2025-01-10 20:10:21 +0000 to 2025-01-17 14:59:36 +0000
- created a function for user defined aliases (rust-lang/cargo#15076)
- took the functionality of the third party subcommand from the list_commands function (rust-lang/cargo#15075)
- fix: wrong concat and field name (rust-lang/cargo#15074)
- fix(publish): Report all unpublishable packages  (rust-lang/cargo#15070)
- docs(cargo-clippy): correct typo (rust-lang/cargo#15072)
- docs(cargo-package): alwasy include the lockfile (rust-lang/cargo#15067)
- docs(ref): Deprecate 'package.authors'  (rust-lang/cargo#15068)
- fix(build-std): parse as comma-separated list (rust-lang/cargo#15065)
- Fix benchsuite issue with newer versions of git (rust-lang/cargo#15069)
- Document that cargo automatically registers variables used in env! macro to trigger rebuilds (rust-lang/cargo#15062)
- perf(cargo-package): match certain path prefix with pathspec (rust-lang/cargo#14997)
- Clarify note in example (rust-lang/cargo#15054)
- chore(deps): update msrv (3 versions) to v1.82 (rust-lang/cargo#15050)
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 18, 2025
Update cargo

13 commits in 088d496082726091024f1689c124a0c3dccbd775..045bf21b36a2e1f3ed85e38278d1c3cc4305e134
2025-01-10 20:10:21 +0000 to 2025-01-17 14:59:36 +0000
- created a function for user defined aliases (rust-lang/cargo#15076)
- took the functionality of the third party subcommand from the list_commands function (rust-lang/cargo#15075)
- fix: wrong concat and field name (rust-lang/cargo#15074)
- fix(publish): Report all unpublishable packages  (rust-lang/cargo#15070)
- docs(cargo-clippy): correct typo (rust-lang/cargo#15072)
- docs(cargo-package): alwasy include the lockfile (rust-lang/cargo#15067)
- docs(ref): Deprecate 'package.authors'  (rust-lang/cargo#15068)
- fix(build-std): parse as comma-separated list (rust-lang/cargo#15065)
- Fix benchsuite issue with newer versions of git (rust-lang/cargo#15069)
- Document that cargo automatically registers variables used in env! macro to trigger rebuilds (rust-lang/cargo#15062)
- perf(cargo-package): match certain path prefix with pathspec (rust-lang/cargo#14997)
- Clarify note in example (rust-lang/cargo#15054)
- chore(deps): update msrv (3 versions) to v1.82 (rust-lang/cargo#15050)
@rustbot rustbot added this to the 1.86.0 milestone Jan 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-git Area: anything dealing with git Command-package S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants