solver: fix issue with double merged edges #4285
Conversation
Force-pushed from e427e85 to b86d50c:
"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain. In that case, one edge is released and is set to point to another. The reference count for the second one is increased. An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again. Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed. Signed-off-by: Tonis Tiigi <[email protected]>
Force-pushed from b86d50c to 100d3cb.
I get the problem and the general idea of the solution, but I'd be lying if I said I was confident in my understanding of the overall code that is connected to the changes here 🙂
Along those lines, this feels like it would greatly benefit from unit tests if at all possible.
```go
func (e *edge) takeOwnership(old *edge) {
	e.releaserCount += old.releaserCount + 1 // e takes over old's reference count, plus one for old itself
	old.owner = e                            // chain old to its new owner
	old.releaseResult()                      // old's own held result is released
```
Question for my own understanding: in the scheduler, there's a loop which calls `dispatch` on edges. Once we've done a merge from the `old` edge -> `target` edge, will that mean that `dispatch` is only ever called on the `target` edge (or, if `target` itself has owners, the "ultimate" owner edge after following the linked list) and never called on the `old` edge anymore?

Just wondering because, if true, that would massively simplify trying to understand the overall logic. I was initially concerned that we make this `old.releaseResult()` call, but then there's still a ton of methods on `edge` that access `e.result` without checking whether `e.owner != nil`, which felt like an avenue to accidentally using an already released result.

But, if we only ever call `dispatch` on edges with `e.owner == nil`, I think most of those concerns go away, since all those other methods should never be called once `e.owner != nil`.
the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?
Yes, the scheduler only accesses edges via getEdge()
method, and I have added owner
check into that method in this PR so that once the owner is set, it is returned instead of the old definition.
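As a rough sketch of the kind of lookup being described (reusing the simplified `edge` type from the earlier sketch; buildkit's real `getEdge` works against its own solver state and has a different signature):

```go
// lookupEdge returns the live edge for a key. The owner check means callers
// such as the scheduler's dispatch loop never see an edge that has already
// been merged away.
func lookupEdge(edges map[string]*edge, key string) *edge {
	e, ok := edges[key]
	if !ok {
		return nil
	}
	for e.owner != nil { // follow the merge chain instead of returning a released edge
		e = e.owner
	}
	return e
}
```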
Cool, thanks, that helps.
I really think we should add one or more unit tests for this case if at all possible. Partially for regression but also partially because the unit tests are pretty helpful for understanding the expected behavior too.
Other than that LGTM
I'm not sure how tricky it is to repro this with a unit test. You need to synchronize the requests not only to cause the triple edge merge, but also to be able to trigger the evaluation of the second one only after the triple merge has already happened.
@alexcb Maybe your repro can be turned into an integration test?
I can definitely give it a try -- I'll use `func TestMergedEdgesLookup(t *testing.T)` in `buildkit/solver/scheduler_test.go` (line 3040 at 86e25b3).
still no luck triggering this via a test; I've been hacking at this function in a branch based off master (i.e. without this fix)
here's what I have so far: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/solver/scheduler_test.go#L3081-L3163 (with the intention of moving it to a new function if I succeed in reproducing it)
I've added in printfs to the scheduler to trace the order needed to reproduce the problem. Here's an example of the error: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/prints-from-reproduction-program
but unfortunately my test isn't triggering the same code path -- one noticeable difference is that in the test, I only see newRequestWithFunc called for ... once per edge, when I believe I need it to occur twice in order to trigger the bad merge.
I'll keep hacking at the test, but wanted to share an update in case there's any obvious parts my test is skipping.
I was able to spend some more time with the scheduler_test.go code -- it's pretty neat.
I created some functions to make it easier to wait on cacheKey() returning, and am able to create some edge merges more easily, but I still haven't been able to reproduce the multi-edge merge.
here's the updated attempt: https://github.com/alexcb/buildkit/blob/e6da21940e81a000a56c06a59664bd9ab28ea480/solver/scheduler_test.go#L3123
I'm running out of steam on it though; I suspect I'm missing something timing-related.
I think the problem is that you don't only need to create the merge case, but there also needs to be a vertex that depends on the edge that gets merged, and that edge needs to go to the completed state only after the merge has happened. Merging itself does not trigger the bug. What triggers the bug is a new request coming into the vertex that already had the 3-level merge and should never unpark again.
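For intuition only (this is not the previous buildkit implementation), a flat single-level mapping shows why the second merge leaves a stale entry behind, which is what the chained-owner approach in this PR avoids:

```go
package main

import "fmt"

func main() {
	// A flat "merged into" map records only one level of indirection.
	merged := map[string]string{}
	merged["A"] = "B" // first merge: A is released and points at B
	merged["B"] = "C" // second merge: B is released too, but A's entry still says B

	// A later request arriving for A resolves to B, an edge that has already
	// been released, so it could get scheduled (unparked) again. Following a
	// chain A -> B -> C instead always ends at the live target C.
	fmt.Println(merged["A"]) // B
}
```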
Ok, while digging into #4347 (comment), I found something that seems to contradict something here.
the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?
Yes, the scheduler only accesses edges via
getEdge()
method, and I have addedowner
check into that method in this PR so that once the owner is set, it is returned instead of the old definition.
My understanding of this is that `dispatch` should never be called with an `*edge` that has an `owner`? However, I do seem to see this in the logs (using my little investigation branch from my linked comment, master...jedevc:buildkit:wip-repro-the-deadlock):
>>> sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383
sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383 owned by sha256:ccb9d6e0f552230dba1da5f989d813984ddb3e5a44cf674e5df1c0e4524f4463
Is my understanding wrong here? Is an edge with an `owner` a valid input to `dispatch`? And if it isn't, we should `panic` when we detect one, since some internal assumption seems to be violated then.
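A sketch of the kind of defensive check being suggested, again using the simplified `edge` type from the earlier sketches rather than the real `dispatch` signature:

```go
// dispatch should only ever run on live edges; reaching it with owner != nil
// would mean the owner-following lookup was bypassed, so fail loudly.
func dispatch(e *edge) {
	if e.owner != nil {
		panic(fmt.Sprintf("dispatch called on merged edge %s (owned by %s)",
			e.name, e.owner.name))
	}
	// ... normal dispatch work ...
}
```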
Seems like a test may be pretty tough in practice, so don't want to block this on that
cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from moby/buildkit#4285 fixes #2957 Signed-off-by: Alex Couture-Beil <[email protected]>
Going to bring this in. If we can get a test then post a follow-up.
I looked at writing a test that can repro this and got some way to it:

```go
func TestMergedEdgesFoo(t *testing.T) {
	t.Parallel()

	for i := 0; i < 1; i++ {
		ctx := context.TODO()

		cacheManager := newTrackingCacheManager(NewInMemoryCacheManager())

		l := NewSolver(SolverOpt{
			ResolveOpFunc: testOpResolver,
			DefaultCache:  cacheManager,
		})
		defer l.Close()

		j0, err := l.NewJob("j0")
		require.NoError(t, err)

		defer func() {
			if j0 != nil {
				j0.Discard()
			}
		}()

		va := vtxAdd(2, vtxOpt{name: "va", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vb := vtxAdd(2, vtxOpt{name: "vb", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vc := vtxAdd(2, vtxOpt{name: "vc", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})

		g0 := Edge{
			Vertex: vtxSum(1, vtxOpt{inputs: []Edge{
				{Vertex: va, Index: 0},
				{Vertex: vb, Index: 0},
				{Vertex: vb, Index: 1},
				{Vertex: vc, Index: 1},
			}}),
		}
		g0.Vertex.(*vertexSum).setupCallCounters()

		res, err := j0.Build(ctx, g0)
		require.NoError(t, err)
		_ = res

		require.NoError(t, j0.Discard())
		j0 = nil
	}
}
```

I can get the 3-level merge:

After this,
fix #4278
"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain.
In that case, one edge is released and is set to point to another. The reference count for the second one is increased.
An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again.
Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed.