Beef up and harden import/export tests #7140
Conversation
Force-pushed 7f52d1b to 8dc2a2a
Are we talking about two separate issues here? There's:
Your tests prove that this issue can't be related to GC locks by successfully verifying that none of the pinned blocks in the CARs are garbage collected. If this were a GC race, we'd be deleting pinned data. We know what at least part of the GC issue is. The issue is that we're garbage collecting parts of the pin set. We could be generating and garbage collecting additional data, but we'd have to look into that by looking at the blocks GC is deleting.
@Stebalien not a GC race - a GC deadlock. I am 99% sure this is what you observed in #7123. What I am changing is this: if we trigger a generous timeout, we dump a full stacktrace in places where I otherwise "expect nothing". The other thing ( ephemeral pinsets showing up ) is tracked in a different ticket.
Ah, got it. The bug we were running into there was that we were invoking the CoreAPI while already holding the PinLock. However, we could definitely have a different deadlock somewhere.
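For illustration, a minimal Go sketch of that class of bug, assuming a simplified, made-up pin lock and CoreAPI call (the real go-ipfs types and names differ): the caller holds a non-reentrant lock, and the callee tries to take it again.

```go
package main

import "sync"

// pinLock stands in for a coarse, non-reentrant lock guarding the pin set.
var pinLock sync.Mutex

// coreAPIAdd stands in for a CoreAPI call that also needs the pin lock.
func coreAPIAdd() {
	pinLock.Lock() // blocks forever: the caller already holds pinLock
	defer pinLock.Unlock()
	// ... write blocks, update pins ...
}

func pinAndImport() {
	pinLock.Lock()
	defer pinLock.Unlock()
	coreAPIAdd() // sync.Mutex is not reentrant, so this never returns
}

func main() {
	pinAndImport() // the runtime aborts: "all goroutines are asleep - deadlock!"
}
```

Unlike a GC race, nothing gets deleted in this failure mode; everything simply stops, which is why a stack dump on timeout is the useful diagnostic.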
Doing some spelunking: I don't think it's a starvation issue. That is, repeated GC should eventually allow a
@@ -0,0 +1,46 @@
From 6a43df1919da862f56c9af867a28bf2d7858e5ca Mon Sep 17 00:00:00 2001
I agree that the behavior of sharness is surprising, but I'm concerned that this will cause more problems than it fixes. It already has a bunch of special cases to avoid rejecting valid code.
I'd rather document this in the sharness README and move on unless we can find a more reliable way to do this.
Nobody will ever discover the sharness README :)
No worries though - I will pull this commit out, and use it locally instead.
Unless you are open to having it gated on an environment variable...?
Force-pushed a removal ( it was the first commit )
Ideally, we'd trigger it when we get a failure. However, in this case, that wouldn't have helped regardless because we weren't checking for errors.
You're right that nobody will read the README... Honestly, it's not terrible given that we can catch and deal with any false positives. What if we include it but put in a comment explaining exactly what it does, where it comes from (it's non-obvious that we're patching it in), and that it's a heuristic and that we should remove it if it causes problems?
I'll open a separate PR once I've had a chance to write up better prose. The error will link to that.
For now let's merge the rest to catch the lock if it happens again ( all the "$(" stuff, while annoying, was a red herring )
Force-pushed 89d333b to 768c287
Fixes ( or at least makes cleaner ) #7123
At this point I am convinced that this comment I made way back is in fact real: https://github.com/ipfs/go-ipfs/pull/7011/files/aa5019413741f2d34c9723d376591cf8c4b6b465#diff-15172926c4147422f472139313467179R330-R341
We did switch away from the gclock a while back, and other changes I made likely cause it to trigger less frequently. Reproducing this locally is nigh impossible, however :(
So instead the timeout+QUIT contraptions should tell us more the next time it happens on the slower CircleCI boxes.
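As a rough sketch of what "timeout+QUIT" buys us, assuming a hypothetical helper (the function name, deadline, and command are placeholders, not the PR's actual sharness code): if the process is still wedged after a generous deadline, SIGQUIT makes the Go runtime print every goroutine stack before exiting, which is exactly the trace needed to locate a deadlock.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// runWithQuitDump runs a command and, if it outlives the deadline, sends
// SIGQUIT; a Go binary reacts by dumping all goroutine stacks and exiting.
func runWithQuitDump(deadline time.Duration, name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // finished (or failed) within the deadline
	case <-time.After(deadline):
		// Ask for a goroutine dump instead of a plain kill.
		_ = cmd.Process.Signal(syscall.SIGQUIT)
		return fmt.Errorf("timed out after %s: %w", deadline, <-done)
	}
}

func main() {
	// Illustrative invocation; the CID and deadline are placeholders.
	if err := runWithQuitDump(5*time.Minute, "ipfs", "dag", "export", "bafy..."); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```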