
Faster resolver: Cache past conflicting_activations, prevent doing the same work repeatedly. #5168

Merged
10 commits merged on Mar 14, 2018

Conversation

@Eh2406 (Contributor) commented Mar 12, 2018

This work is inspired by @alexcrichton's comment that a slow resolver can be caused by all versions of a dependency being yanked. Which stuck in my brain, as I did not understand why it would happen. If a dependency has no candidates then it will be the most constrained and will trigger backtracking in the next tick. Eventually I found a reproducible test case. If the bad dependency is deep in the tree of dependencies then we activate and backtrack O(versions^depth) times. Even though identifying the problem is fast, that is a lot of work.

The set up:

  1. Every time we backtrack, cache the (dep, conflicting_activations) pair.
  2. Build on the work in "backtrack if cannot activate" (#5000): fail to activate a candidate if any of its dependencies will just backtrack to this frame. I.e. for each dependency, check whether any of its cached conflicting_activations are already all activated; if so, we can skip straight to the next candidate. We also add that bad conflicting_activations to our own set of conflicting_activations, so that we can...

The pay off:
If we fail to find any candidates that we can activate in light of 2, then we cannot be activated in this context, so we add our (dep, conflicting_activations) to the cache so that next time our parent will not bother trying us.

I hear you saying "but the error messages, what about the error messages?" So if we are at the end (!has_another) then we disable this optimization: after we mark our dep as not activatable, we activate anyway. It won't resolve, but it will produce the same error message as before this PR. If we have been activated only for the error messages, then we skip straight to the last candidate, as that is the only backtrack that will end with the user.
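To make the caching mechanics concrete, here is a minimal, self-contained sketch of the idea. The names (PastConflicts, known_to_fail) and the string-typed Dependency/PackageId are hypothetical stand-ins, not cargo's actual resolver types:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for the resolver's internal types.
type Dependency = String;            // e.g. `foo-sys = "^2.0"`
type PackageId = String;             // e.g. `foo-sys 3.0.0`
type ConflictingActivations = HashSet<PackageId>;

/// A cache of "notes": for a dependency, sets of already-activated packages
/// that are known to make that dependency unresolvable.
#[derive(Default)]
struct PastConflicts {
    cache: HashMap<Dependency, Vec<ConflictingActivations>>,
}

impl PastConflicts {
    /// Step 1 of the set up: record the conflict every time we backtrack.
    fn insert(&mut self, dep: &Dependency, conflict: &ConflictingActivations) {
        let past = self.cache.entry(dep.clone()).or_insert_with(Vec::new);
        if !past.contains(conflict) {
            past.push(conflict.clone());
        }
    }

    /// Step 2 of the set up: before activating a candidate, check whether any
    /// cached conflict for one of its dependencies is already fully activated.
    fn known_to_fail(
        &self,
        dep: &Dependency,
        activated: &HashSet<PackageId>,
    ) -> Option<&ConflictingActivations> {
        self.cache
            .get(dep)
            .and_then(|past| past.iter().find(|c| c.is_subset(activated)))
    }
}

fn main() {
    let mut past = PastConflicts::default();
    let conflict: ConflictingActivations =
        ["foo-sys 3.0.0".to_string()].into_iter().collect();
    past.insert(&"foo-sys ^2.0".to_string(), &conflict);

    let activated: HashSet<PackageId> =
        ["foo-sys 3.0.0".to_string()].into_iter().collect();
    // `foo-sys ^2.0` is known to fail while `foo-sys 3.0.0` is activated, so a
    // candidate depending on it can be skipped without descending into it.
    assert!(past
        .known_to_fail(&"foo-sys ^2.0".to_string(), &activated)
        .is_some());
}
```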

I added a test in the vein of #4834. With the old code the run time was O(BRANCHING_FACTOR ^ DEPTH) and took ~3 min with DEPTH = 10 and BRANCHING_FACTOR = 5; with the new code it runs almost instantly with 200 and 100.
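For a sense of the shape of such a test case, here is a hedged sketch (the level{i}/missing names and the flat (name, version, dep) representation are purely illustrative, not the actual test code): every version at one level depends on the next level, and the deepest level depends on a crate with no candidates, so a naive resolver walks all BRANCHING_FACTOR ^ DEPTH paths before failing.

```rust
const DEPTH: usize = 10;
const BRANCHING_FACTOR: usize = 5;

fn main() {
    // Synthetic registry as (crate name, version, dependency requirement) strings.
    let mut registry: Vec<(String, String, String)> = Vec::new();
    for level in 0..DEPTH {
        for patch in 0..BRANCHING_FACTOR {
            let dep = if level + 1 < DEPTH {
                // Every version of `level{i}` depends on `level{i+1}`.
                format!("level{} = \"*\"", level + 1)
            } else {
                // Nothing satisfies this, so resolution must fail, but a naive
                // resolver only discovers that at the bottom of every branch
                // (unless the failure is cached, as in this PR).
                "missing = \"1.0\"".to_string()
            };
            registry.push((format!("level{}", level), format!("1.0.{}", patch), dep));
        }
    }
    assert_eq!(registry.len(), DEPTH * BRANCHING_FACTOR);
}
```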

@rust-highfive

r? @matklad

(rust_highfive has picked a reviewer for you, use r? to override)

@Eh2406 (Contributor, Author) commented Mar 12, 2018

I hear you saying "but does this slow the happy path?" It should not, and it doesn't for me, but I have been wrong before, so please test with servo and x.py. The reason is that past_conflicting_activations is only added to if we backtrack, and we don't backtrack in the happy path, so all the new loops are looping over nothing.

@matklad (Member) commented Mar 12, 2018

r? @alexcrichton

@alexcrichton (Member)

Thanks so much for this @Eh2406, this is looking fantastic!

I've got a few high level questions and a more meta question. The higher level questions are:

  • Could the test suite be expanded to ensure we exercise all this? It looks like the test case is a successful resolve, right? The original test case was a failed resolve, I think?
  • Otherwise, if you can think of any other ways to stress this, I'm always a fan of more tests :)

I agree the performance here looks like we won't take too much of a hit on the fast path. The major added costs I can tell are:

  • past_conflicting_activations.entry(dep.clone()).or_insert_with(Vec::new); - this only happens O(N) times though where N is the size of the whole graph, a pretty small hash map
  • past.contains(&conflicting_activations) - on the fast path we're always searching an empty vector, so this should be speedy

But otherwise the only remaining question I'd have is that I've only got a tenuous grasp at best on what's happening here. Mind expanding the doc comments at the top or the comments inline as to what's happening? I've personally found that "comment stories" which walk through examples can be quite helpful!

@Eh2406 (Contributor, Author) commented Mar 12, 2018

I just pushed two small performance gains, and an improvement to the test. The performance gains make sure we deal with known errors first. The test now checks with both a successful and a failed resolve.
I should add a test like resolving_with_deep_backtracking, just hundreds of levels deep. I'm not sure what other cases I should test. Maybe when this is ready to merge @aidanhs would be generous enough to try before/after over all of crates.io as they did for #4834.

I will work on comments next time I have a minute.

@alexcrichton (Member)

Looks and sounds great to me, thanks so much again @Eh2406!

@Eh2406 (Contributor, Author) commented Mar 13, 2018

If you happen to have time... If you have leading questions then I will endeavour to ensure that the comments I add address them. If not I will try and guess at what is confusing to someone new to this mess.

@alexcrichton (Member)

Oh certainly! In general I think it'd be good to get a refreshed high-level overview of what the resolver is doing, particularly with respect to backtracking. Currently (AFAIK) there are two primary optimizations when backtracking:

  • When backtracking we skip over all past frames which won't actually fix whatever caused the conflict at the bottom. For example (I think) if the very bottom frame links in a crate that says links = "foo" then we'll zoom all the way to the point that links = "foo" was first introduced. (This I believe was @aidanhs's optimization)
  • This PR's optimization of caching a dependency's conflicting activations, so if we see it again we can zoom past it (I think? I'm a bit fuzzy on this PR specifically)

I think that'd be more than enough for now, but eventually I'd also love to have a scenario or two in the documentation and then in the code it can mention "we get here in scenario 1 when we do this work..." or things like that (but that can always happen later)

@Eh2406 (Contributor, Author) commented Mar 13, 2018

I added two more commits. One extended the test to include cases where the redundant backtracking is caused by previous activations. The other was a small performance gain in which we only disable this optimization if the backtracking will get to the user.

I will try to work on comments. At some point we may want to restructure / add new types / helper functions so that the random comments can become doc comments. (Editor's note: this got kinda long, I think the first draft ended up in this comment. :-) )

I think you describe @aidanhs's optimization well, with one small but relevant caveat. We will zoom all the way back to the point where links = "foo" was first introduced, or to our parent, whichever we remove first. This makes sense: no parent, no requirement, no conflict.

Let's say that the dep we can't find a candidate for is foo-sys = "^2.0". We can't find a candidate because links = "foo" was set by the activation of foo-sys: 3.0.0. We prepare to jump / zoom all the way back, but the first backtrack frame we hit is our parent foo-rs: 2.0.

But this is a little unsatisfactory. Probably all versions of foo-rs = "^2.0" will depend on foo-sys = "^2.0" and will conflict with foo-sys: 3.0. We can't just jump / zoom all the way back, as there are some ways the next thing we try will work. (Maybe the next version of foo-rs actually depends on foo-sys = "^3.0" specifically to get us out of this mess. Or some old version was a port to Rust and does not depend on foo-sys at all. Both rare, but not impossible.) So what can we do? (Here is where the new work starts.)

We can leave a note:

Note: we cannot find a candidate for dep `foo-sys = "^2.0"` if `foo-sys: 3.0.0` is already activated due to a conflict on `links = "foo"`.

and we continue on our way. If we find a resolution then making a note is not much work. But next time we try to activate foo-rs: 2.0, before we add its dependencies to the list of requirements to resolve, we look through our notes for each dependency. We see that one of the dependencies is foo-sys = "^2.0", that we have a note about it, and that the note still applies: foo-sys: 3.0.0 is still active. So we decide that foo-rs: 2.0 cannot be activated, we mark that we hit a problem with foo-sys: 3.0.0, and we try the next candidate. If we run out of candidates for foo-rs: "^2.0", and nothing too complicated is involved, then before we backtrack we add a new note:

Note: we cannot find a candidate for dep `foo-rs = "^2.0"` if `foo-sys: 3.0.0` is already activated due to a conflict on `links = "foo"`.

and we continue on our way. If we find a resolution then making a note is not much work. But next time we try and activate ...
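A tiny self-contained sketch of that walkthrough, using plain strings in place of the real types (all names here are illustrative only):

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    // Notes: "we cannot find a candidate for <dep> if all of <packages> are activated".
    let mut notes: HashMap<&str, Vec<HashSet<&str>>> = HashMap::new();
    notes
        .entry("foo-sys ^2.0")
        .or_insert_with(Vec::new)
        .push(["foo-sys 3.0.0"].into_iter().collect());

    // `foo-sys 3.0.0` is still active when we next try `foo-rs 2.0`.
    let activated: HashSet<&str> = ["foo-sys 3.0.0"].into_iter().collect();

    // Before pushing `foo-rs 2.0`'s dependencies onto the list of requirements,
    // look through the notes for each of them.
    let foo_rs_deps = ["foo-sys ^2.0"];
    let doomed = foo_rs_deps.iter().any(|dep| {
        notes
            .get(dep)
            .map_or(false, |past| past.iter().any(|c| c.is_subset(&activated)))
    });

    if doomed {
        // `foo-rs 2.0` cannot work in this context. If all its candidates fail
        // this way, record a note about `foo-rs ^2.0` too, so next time its
        // parent will not even bother trying it.
        notes
            .entry("foo-rs ^2.0")
            .or_insert_with(Vec::new)
            .push(["foo-sys 3.0.0"].into_iter().collect());
    }
    assert!(doomed);
}
```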

@alexcrichton (Member)

@Eh2406 ah yes indeed, and your explanation sounds great!

I think it'd also be great to put links to these PR discussions. Whenever someone feels the need to understand the resolver they're probably in for the long haul (aka willing to read a lot) anyway, and the discussions on GitHub I've found are treasure troves of information :)

@Eh2406 (Contributor, Author) commented Mar 14, 2018

I have been working hard to add lots of links in my PRs, to hopefully make it possible for someone in the future to find all the discussions. I pushed a commit with a bunch more comments. I hope I did not go too far in over-commenting. Let me know what you think. If there are more questions, I will do my best to answer. If you see how this would be slow or would use too much RAM with a different test, I'd love to know.

Note that the benchmark from #4810 (comment) is largely unaffected by this PR. I have several small changes that do fix that case. Some are almost acceptable, but I don't have an isolated, independent test. So that will wait for the next PR.

@alexcrichton (Member)

This looks great to me, thanks so much @Eh2406!

@bors: r+

@bors (Collaborator) commented Mar 14, 2018

📌 Commit 58d7cf1 has been approved by alexcrichton

bors added a commit that referenced this pull request Mar 14, 2018
Faster resolver: Cache past conflicting_activations, prevent doing the same work repeatedly.

@bors (Collaborator) commented Mar 14, 2018

⌛ Testing commit 58d7cf1 with merge 3a06fde...

@bors (Collaborator) commented Mar 14, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: alexcrichton
Pushing 3a06fde to master...

bors merged commit 58d7cf1 into rust-lang:master on Mar 14, 2018
@Eh2406 (Contributor, Author) commented Mar 14, 2018

@aidanhs would you be willing to redo the tests you ran in #4834 (comment) ? This PR should not change the output, and should make things faster. But I am nervous that I missed something and we will see regressions.

@alexcrichton (Member)

@Eh2406 ok I'm trying to understand what's going on here in more depth (adding comments with my current understanding as I go along). I was wondering if you could help me out on a few points?

Right now one question I'd have is why there's two locations we update past_conflicting_activations. The first one makes sense to me (when remaining_candidates.next returns an error), but I'm not quite grasping why we also perform similar logic when an activation succeeds but we predict it will fail.

Can you elaborate on why we have the duplicate logic?

@Eh2406 (Contributor, Author) commented Mar 16, 2018

Thank you for doing an in-depth review! If your comments are available for public view I can give them a look. I will happily do my best to explain anything I can. :-) IMO the best outcome of the discussion is a different way to write the code so that it makes more sense / relies less on how all the pieces fit together.

That specific question is an excellent one. I have tried unsuccessfully to remove that duplication several times. I will try again when I am done with this email. For now let me tell the story from the beginning.

The first attempt at this had a cache for every time remaining_candidates.next returned an error, and never activated a candidate with a dep in the cache. This worked, but had epically bad error messages.

So then we decided that we would activate if !has_another, even if we know it would fail: never backtrack only because of cached conflicts. Error messages were back, but so was the O(versions^depth). Why? Only the bottom layer was being entered into the cache.

Solution: when we know we cannot be successful, add ourselves to the cache, and only then decide whether or not to activate. This is why we have 2 places where we add ourselves to the cache. (A rough sketch of the two insertion points follows at the end of this comment.)

If that is why we need it, then why do I think it may be removable now?

We have to activate all the way to the bottom every time we activate but know it won't work. If we do that every time then we have O(versions*depth^2), which is better but not great. So we added a check: before we "activate but know it won't work", try running find_candidate on a copy of the backtrack_stack. If we have something in the stack to try, then we can start the backtracking now and the user won't see it.

I think we should end up with only 2 cases:

  1. We add to the cache, but then don't activate, leading to the backtracking code adding to the cache again.
  2. We add to the cache, but then activate because we are ready to show an error to the user, never looking at the cache again.
  3. Bugs that have made it not work for me in the past, that should be identified and stamped out for their own good.
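To illustrate the two insertion points discussed above, here is a rough structural skeleton. All names and signatures here are hypothetical stand-ins, not the real resolver code:

```rust
#[derive(Clone)]
struct Conflict; // stand-in for a set of conflicting activations

struct Cache(Vec<(String, Conflict)>);

impl Cache {
    fn insert(&mut self, dep: &str, c: &Conflict) {
        self.0.push((dep.to_string(), c.clone()));
    }
}

// Stand-in: yields (candidate, has_another) or the conflicting activations.
fn next_candidate(_dep: &str) -> Result<(&'static str, bool), Conflict> {
    Err(Conflict)
}

// Stand-in for the check against cached conflicts (step 2 of the PR).
fn known_to_fail(_candidate: &str, _cache: &Cache) -> bool {
    false
}

fn step(dep: &str, cache: &mut Cache) {
    match next_candidate(dep) {
        // Insertion point 1: `remaining_candidates.next` found nothing at all;
        // record the conflict as we prepare to backtrack.
        Err(conflict) => cache.insert(dep, &conflict),
        Ok((candidate, has_another)) => {
            if known_to_fail(candidate, cache) {
                // Insertion point 2: a candidate exists, but the cache predicts
                // its dependencies will only backtrack to this frame. Record
                // *this* dep's conflict too, so the failure propagates up the
                // graph instead of staying on the bottom layer...
                cache.insert(dep, &Conflict);
                // ...then either try the next candidate, or, if this is the
                // last one (`!has_another`) and no backtrack frame can save us,
                // activate anyway purely to produce the old error message.
                let _ = has_another;
            }
        }
    }
}

fn main() {
    let mut cache = Cache(Vec::new());
    step("foo-sys ^2.0", &mut cache);
    assert_eq!(cache.0.len(), 1);
}
```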

@alexcrichton (Member)

Oh sure! I was gonna wait on #5187 to avoid conflicts, but we can go ahead and get started; I opened up #5195 with some of the comments/tweaks I've got so far.

Thanks for the explanation, though! I think that makes sense to me in terms of propagating the past_conflicting_activations cache changes up the dependency graph instead of having them stick around only on the bottom layer. I haven't thought too much about how to deduplicate these (if that's even possible), though.

bors added a commit that referenced this pull request Mar 17, 2018
Faster resolver: clean code and the `backtrack_stack`

This is a small extension to #5168 and is inspired by #4834 (comment)

After #5168 these work (and don't with cargo from nightly):
- `safe_core = "=0.22.4"`
- `safe_vault = "=0.13.2"`

But these don't work (and do with cargo from this PR):
- `crust = "=0.24.0"`
- `elastic = "=0.3.0"`
- `elastic = "=0.4.0"`
- `elastic = "=0.5.0"`
- `safe_vault = "=0.14.0"`

It took some work to figure out why they are not working, and make a test case.

This PR removes the use of `conflicting_activations` before it is extended with the conflicting activations from next (#5187 (comment)).
However, the `find_candidate` call is still needed, so it now gets the conflicting activations from next before being called.

It often happens that a candidate whose child will fail, leading to its own failure, will have older siblings that have already set up `backtrack_frame`s. The candidate knows that its failure cannot be saved by its siblings, but sometimes we activate the child anyway for the error messages. Unfortunately the child does not know that its uncles can't save it, so it backtracks to one of them, leading to a combinatorial loop.

The solution is to clear the `backtrack_stack` if we are activating just for the error messages.

Edit: the original end of this message, no longer accurate.
#5168 means that when we find a permanent problem we will never **activate** its parent again. In practice there often is a lot of work and `backtrack_frame`s between the problem and reactivating its parent. This PR removes `backtrack_frame`s where both the parent and the problem are present. This means that when we find a permanent problem we will never **backtrack** into it again.

An alternative is to scan all cached problems while backtracking, but this seemed more efficient.
ehuss added this to the 1.26.0 milestone on Feb 6, 2022