optimize dep queue by recording and using crate build times #7396
@matthiaskrgr good suggestion, I've thought about this a little myself. The problem that cargo has to solve is known as the multiprocessor DAG scheduling problem, which is NP-complete, so we can only apply approximations. On the bright side, the problem is well known in the literature and you can find a wealth of papers about the topic. E.g. this one (doi link) is fairly recent, or this one (doi link). The heuristic currently used in cargo is based on depth only, and it already has a TODO that mentions possible extensions. A project to find a better scheduling algorithm is a great opportunity for experimentation. One could e.g. parse crater logs to find out how long each crate takes to build, and then use that information to build a "virtual" crater where the results of different scheduling algorithms are compared using different CPU counts etc. Then we'd find the scheduling algorithm that fits the cargo use case best. The final implementation of the scheduling algorithm could even be put into its own crate, for use in the ecosystem, similar to polonius. It's possibly a great topic for an academic thesis if someone is looking for one.
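The depth-only heuristic mentioned above has a classic extension in the DAG-scheduling literature: prioritize each unit by its "bottom level", i.e. its own cost plus the cost of the longest chain of units that depend on it. A minimal sketch of that idea, where the `Unit` struct and cost numbers are invented for illustration and are not cargo's actual data structures:

```rust
/// Hypothetical unit: an estimated build cost and the indices of the
/// units it depends on. Not cargo's real data structures.
struct Unit {
    cost: u64,
    deps: Vec<usize>,
}

/// "Bottom level" of each unit: its own cost plus the cost of the longest
/// chain of units that (transitively) depend on it. Scheduling ready units
/// with the largest bottom level first is the classic critical-path
/// list-scheduling heuristic; depth-only is the special case where every
/// cost is 1.
fn bottom_levels(units: &[Unit]) -> Vec<u64> {
    // rdeps[i] = units that depend directly on unit i
    let mut rdeps: Vec<Vec<usize>> = vec![Vec::new(); units.len()];
    for (j, u) in units.iter().enumerate() {
        for &d in &u.deps {
            rdeps[d].push(j);
        }
    }
    // Memoized longest-path computation over the reverse edges.
    fn level(units: &[Unit], rdeps: &[Vec<usize>], i: usize, memo: &mut [Option<u64>]) -> u64 {
        if let Some(v) = memo[i] {
            return v;
        }
        let mut above = 0;
        for &j in &rdeps[i] {
            above = above.max(level(units, rdeps, j, memo));
        }
        let v = units[i].cost + above;
        memo[i] = Some(v);
        v
    }
    let mut memo = vec![None; units.len()];
    (0..units.len()).map(|i| level(units, &rdeps, i, &mut memo)).collect()
}

fn main() {
    // Diamond graph: units 1 and 2 both depend on 0, unit 3 depends on both.
    let units = vec![
        Unit { cost: 5, deps: vec![] },
        Unit { cost: 1, deps: vec![0] },
        Unit { cost: 3, deps: vec![0] },
        Unit { cost: 2, deps: vec![1, 2] },
    ];
    // The heavier branch (unit 2) gets a higher priority than unit 1.
    println!("{:?}", bottom_levels(&units));
}
```

On the diamond above this ranks unit 2 (cost 3) ahead of unit 1 (cost 1), which is exactly what a depth-only heuristic cannot distinguish.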
cc also #5125.
@est31 that was very interesting reading. It was not clear to me how much the general problem gets simplified by the fact that Cargo has basically no weight/edge/communication costs. Also, all of the algorithms assume that the cost of each process is (at least approximately) known in advance. So a deeper dive on which algorithm to use for Cargo is somewhat premature until, as the OP suggests, we find some way to estimate the build time.
At a meetup last night we were talking about this. We realized there's a tiny version of this that is still useful. If we record the full
As for measuring the scheduling efficiency, could we display something like the average number of waiting edges over the entire time of the build process (a higher number means more potential parallelism)?
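One cheap way to get at such a metric: a toy discrete-event simulator (entirely made up here, not cargo code) that replays a unit graph with estimated costs on N virtual CPUs and reports the makespan together with the time-weighted average size of the ready set, i.e. units that were runnable but had no free CPU:

```rust
/// Simulate a build of `costs.len()` units on `cpus` virtual workers,
/// starting ready units in FIFO order. `deps[i]` lists the units that
/// unit `i` depends on. Returns (makespan, time-weighted average number
/// of units that were ready but not running). A larger second number
/// means the schedule or CPU count left more parallelism unexploited.
fn simulate(costs: &[u64], deps: &[Vec<usize>], cpus: usize) -> (u64, f64) {
    let n = costs.len();
    let mut remaining: Vec<usize> = deps.iter().map(|d| d.len()).collect();
    let mut rdeps: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (j, ds) in deps.iter().enumerate() {
        for &d in ds {
            rdeps[d].push(j);
        }
    }
    let mut ready: Vec<usize> = (0..n).filter(|&i| remaining[i] == 0).collect();
    let mut running: Vec<(u64, usize)> = Vec::new(); // (finish_time, unit)
    let mut now = 0u64;
    let mut weighted_ready = 0f64;
    loop {
        // Start as many ready units as there are free CPUs.
        while running.len() < cpus && !ready.is_empty() {
            let u = ready.remove(0);
            running.push((now + costs[u], u));
        }
        if running.is_empty() {
            break;
        }
        // Advance to the next completion, accounting for units that sat
        // in the ready queue over that interval.
        running.sort();
        let (finish, done) = running.remove(0);
        weighted_ready += ready.len() as f64 * (finish - now) as f64;
        now = finish;
        for &j in &rdeps[done] {
            remaining[j] -= 1;
            if remaining[j] == 0 {
                ready.push(j);
            }
        }
    }
    (now, if now > 0 { weighted_ready / now as f64 } else { 0.0 })
}

fn main() {
    // Diamond graph: 1 and 2 depend on 0, 3 depends on 1 and 2.
    let costs = vec![5, 1, 3, 2];
    let deps = vec![vec![], vec![0], vec![0], vec![1, 2]];
    let (makespan, avg_waiting) = simulate(&costs, &deps, 2);
    println!("makespan {makespan}, avg waiting {avg_waiting:.2}");
}
```

Running the same graph with different CPU counts (or different orderings of the ready queue) gives directly comparable numbers, which is roughly the "virtual crater" idea from the earlier comment in miniature.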
Yeah, there is no sure way to know the complexity of a unit without actually doing it, at which point that piece of data becomes useless. Therefore, one of the needed tasks would be to come up with heuristics to inform guesses about the complexity of the build. There is a large set of resources one can use:
All of these correlate with build time. Of course the prediction won't be 100% accurate. However, right now (unless there has been work I'm not aware of), the prediction cargo works with is that every unit has the same cost, so every crate has a build cost of 1, which means that coming up with a heuristic better than that should be pretty easy :). For the processing, one could think about basic data science stuff like weighted sums or simple fully connected 2/3-layer neural networks, nothing fancy.
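As a concrete example of the "weighted sum" end of that spectrum, here is a linear model over cheap features. All feature names and weights below are invented placeholders; real weights would be fit offline (e.g. by least squares) against a corpus of recorded build times:

```rust
/// Made-up feature vector for a unit; a real heuristic would extract
/// these from the crate source and metadata.
struct Features {
    lines_of_code: f64,
    num_dependencies: f64,
    uses_proc_macros: f64, // 0.0 or 1.0
}

/// Weighted-sum predictor. The weights are placeholders for
/// illustration, not measured values.
fn predict_secs(f: &Features) -> f64 {
    0.5 // fixed per-unit overhead
        + 0.001 * f.lines_of_code
        + 0.2 * f.num_dependencies
        + 3.0 * f.uses_proc_macros
}

fn main() {
    let f = Features {
        lines_of_code: 1000.0,
        num_dependencies: 5.0,
        uses_proc_macros: 1.0,
    };
    println!("predicted build time: {:.1}s", predict_secs(&f));
}
```

Even a model this crude distinguishes a proc-macro-heavy crate from a tiny leaf crate, which already beats the current "every unit costs 1" assumption.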
Hey, I've recently worked on Dice_box, which I hope will serve as a playground for different scheduling algorithms. It relies on JSON timings and the unit graph. It is still in the works - the output is not the prettiest and it would be cool to choose the algorithm to run via the command line - but the core idea is there and it works. I have already implemented a pipelining simulation based on a separate rustc process, which was a point of discussion in #6660. It also supports simulating builds with different thread counts - it does not (in theory; things like #7689 happen) depend on the data being captured at any particular parallelism level.
At RustConf, we discussed this a little at the Unconf. Someone observed that they've seen the optimal order of packages be dependent on what platform they are built either on or for (not quite sure which).
Also, #7437 would likely be a prerequisite, at least internally / unstable |
Describe the problem you are trying to solve
There are a couple of things in the dep queue that could be improved.
Sometimes we end up waiting on a single dep where things could run partly in parallel if the dep queue were ordered differently.
Describe the solution you'd like
If we knew how long a crate usually takes to build on the build host, we could estimate how long parts of the dep graph take to build and make better decisions about which parts of the tree to start building first, or whether it pays off to artificially delay a package and build it at a later stage to improve parallelism.
Perhaps we could save build times, the compiler version used, build flags, crate metadata, etc. for each crate to a file inside $CARGO_HOME and later read it to improve the ordering of the dep queue.
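A minimal sketch of such a store, using only std (the plain-text format, file name, and key layout here are invented for illustration; a real implementation would need file locking, a versioned format, and the pruning discussed below):

```rust
use std::collections::HashMap;
use std::io::Write;

/// Build a lookup key from the properties that affect build time.
/// The fields and separator are made up; keys must not contain tabs
/// because the on-disk format below is tab-delimited.
fn cache_key(name: &str, version: &str, rustc: &str, flags: &str) -> String {
    format!("{name} {version} {rustc} {flags}")
}

/// Write the recorded times as "key<TAB>seconds" lines.
fn save(path: &std::path::Path, times: &HashMap<String, f64>) -> std::io::Result<()> {
    let mut out = std::fs::File::create(path)?;
    for (k, secs) in times {
        writeln!(out, "{k}\t{secs}")?;
    }
    Ok(())
}

/// Read the file back, silently skipping malformed lines.
fn load(path: &std::path::Path) -> std::io::Result<HashMap<String, f64>> {
    let text = std::fs::read_to_string(path)?;
    let mut times = HashMap::new();
    for line in text.lines() {
        if let Some((k, v)) = line.rsplit_once('\t') {
            if let Ok(secs) = v.parse() {
                times.insert(k.to_string(), secs);
            }
        }
    }
    Ok(times)
}

fn main() -> std::io::Result<()> {
    // Round-trip a single recorded timing through a temp file.
    let path = std::env::temp_dir().join("crate-build-times.txt");
    let mut times = HashMap::new();
    times.insert(cache_key("serde", "1.0.0", "rustc 1.40.0", "debug"), 2.5);
    save(&path, &times)?;
    println!("loaded {} entries", load(&path)?.len());
    Ok(())
}
```

Keying on compiler version and flags means a toolchain upgrade naturally misses the cache rather than feeding the scheduler stale estimates.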
Notes
We probably need to make sure that the file does not grow too large and prune old entries from the crate-build-time database from time to time.