Ensure delays from transactions don't cause cache race condition #25673
Conversation
@jmcclelland @MegaphoneJon The change looks simple enough and I'd be happy to merge if someone can say "Yes I'm running this in production for 1 month, it hasn't caused any problems and it looks like it has improved(/cleared?) the cache errors" |
Oh, and it looks like this PR is causing test failures :-) |
Whoops - that was a silly mistake. Let's see if this passes. If we pass all tests and neither you nor @eileenmcnaughton see any obvious problems, I would be willing to try this in production. |
Hmm - interestingly the test is one of our concurrency tests, api_v3_JobProcessMailingTest::testConcurrency with data set #0 (array(20, 3, 10, 4, 1), array(2, 1), 4). @totten, is this one that would otherwise be flaky? |
@jmcclelland yes - I think this is worth trying in prod |
Actually this fail might suggest it's working - ie it failed to grab the lock & hence can't run at exactly the same time |
@eileenmcnaughton - yes - it does seem that is the case. But now I wonder: what should happen if two processes compete for the same lock? Right now the code throws an exception. Is that the right thing to happen? Or should it instead wait 1 second and try again, and only throw an exception after failing for X seconds? |
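One way to get that "wait, then give up" behaviour is MySQL's own GET_LOCK() timeout argument, which blocks for up to the given number of seconds before returning 0. A minimal sketch, assuming a hypothetical lock name and an arbitrary 3-second wait, and using CRM_Core_DAO::singleValueQuery the same way the code quoted later in this thread does:

```php
<?php
// Illustrative only: GET_LOCK()'s second argument is a wait timeout in
// seconds. If another connection already holds the named lock, the call
// blocks for up to that long and then returns 0; on success it returns 1.
$lockName = 'civicrm.example.cache'; // hypothetical name, not CiviCRM's real one

$acquired = CRM_Core_DAO::singleValueQuery(
  'SELECT GET_LOCK(%1, 3)',
  [1 => [$lockName, 'String']]
);

if (!$acquired) {
  // Waited up to 3 seconds and still could not get the lock: give up here.
  throw new CRM_Core_Exception("Could not acquire lock '{$lockName}' within 3 seconds");
}

try {
  // ... do the protected work, e.g. write the cache row ...
}
finally {
  // Named locks belong to the connection and must be released explicitly.
  CRM_Core_DAO::singleValueQuery('SELECT RELEASE_LOCK(%1)', [1 => [$lockName, 'String']]);
}
```

Whether that wait should live in the SQL timeout or in a PHP-side retry loop is exactly the policy question raised above.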
I'm not currently seeing any waiting here. I don't think it's happening in the |
Oh - I see the So... hm. I'm still back to the question of what to do. I guess the next step is to figure out why the email is taking longer than 3 seconds to complete the SQL transaction. I'm not sure if, when sending email, the transaction lasts the entire length of sending all the emails, or just one email? In other words, should we legitimately expect a SQL transaction to take longer than 3 seconds or is the mail test failure a real failure that needs to be fixed? |
I'm struggling a bit to understand the test - my suspicion is the test failure represents a fix for the bug you are trying to resolve - ie the first process has maybe not committed the transaction so the second process can't jump in but - @totten probably needs to look - it's a crazy complicated test - which I guess makes sense cos it's trying to create something in one process that should not be possible in one process
|
@jmcclelland if #22013 applies to you / the test servers then you won't see any waiting because you'll always get the lock granted and it effectively won't work. With 22013 you'll get an immediate rejection of the lock. So maybe some combination of the two? Where it waits for the 3 seconds if it was already locked. |
Hmmm - I think if you are going to test on production you should try just this - we have a clear theory about a sequence of events that would lead to this. For the other I guess it occurs when php statics are not being used & the lock is being relied on to compensate for that - but I don't have enough information to really assess how that could arise |
Yes, here is the job that is failing. It seems to be calling And based on @eileenmcnaughton's feedback I'll try this out in a limited fashion in production tomorrow. If all goes well I will push it out to all of our sites, which should give us reasonable real world data. |
@jmcclelland & I suspect we are unable to acquire the lock cos the transaction isn't finished - ie cos the patch is working & the test is not completing the transaction to simulate the right scenario - but am trying to get @totten to look |
The change is live on our first site. I'll keep you all posted :). |
thanks @jmcclelland |
I've applied the patch to a client's site that has been having deadlock errors for a specific recurring contribution since last year, and the payment went through this month! Thanks for the patch! |
Oh good - glad to hear! My update:
|
test this please |
Curious - I tried running
And it ran fine for all of them. But then I started a full run of |
Interesting - pretty sure it failed consistently for me |
There's an interaction between this patch,
(To reproduce quickly, delete all the other api3 tests and then run the suite.) |
(force-pushed from 0169eb1 to 99a1aa0)
FWIW, I usually have to turn off some local customizations to run it. Specifically, it fails because I hard-code email options into
Found the problem - I've rebased to include a fix and force-pushed, and I'll open another PR for a bigger preventive measure. |
(force-pushed from 99a1aa0 to ec7964f)
OK, so it's passing, now I'm looking at the patch itself a bit more. The patch provides for According to MySQL docs, it doesn't do anything automatically:

So I guess that means you keep a lingering hold on the lock? But I'm not sure: from runtime POV, is that good or bad?

Ah ha. Looking back to the test-interaction, it may have been telling us something after all! Theory: When Which lock? No idea. But I tried a couple alternative patches - either of these fixes the spooky interaction.

One patch makes for an explicit "release after rollback":

```diff
diff --git a/CRM/Core/Lock.php b/CRM/Core/Lock.php
index abe42f73f3..f72c655d59 100644
--- a/CRM/Core/Lock.php
+++ b/CRM/Core/Lock.php
@@ -216,6 +216,9 @@ class CRM_Core_Lock implements \Civi\Core\Lock\LockInterface {
       CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_COMMIT, function ($query, $params) {
         return CRM_Core_DAO::singleValueQuery($query, $params);
       }, [$query, $params]);
+      CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_ROLLBACK, function ($query, $params) {
+        return CRM_Core_DAO::singleValueQuery($query, $params);
+      }, [$query, $params]);
     }
     else {
       return CRM_Core_DAO::singleValueQuery($query, $params);
```

The other patch does a generic cleanup in between tests:

```diff
diff --git a/tests/phpunit/CiviTest/CiviUnitTestCase.php b/tests/phpunit/CiviTest/CiviUnitTestCase.php
index 9930747914..da8396a655 100644
--- a/tests/phpunit/CiviTest/CiviUnitTestCase.php
+++ b/tests/phpunit/CiviTest/CiviUnitTestCase.php
@@ -453,6 +453,7 @@ class CiviUnitTestCase extends PHPUnit\Framework\TestCase {
       $this->quickCleanup($tablesToTruncate);
       $this->createDomainContacts();
     }
+    CRM_Core_DAO::executeQuery('SELECT RELEASE_ALL_LOCKS()');
     $this->cleanTempDirs();
     $this->unsetExtensionSystem();
```

But, I haven't got my head fully into the issue. So I'm not sure if "release after rollback" policy is good or bad. |
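For reference on the "lingering hold" question: MySQL's named locks from GET_LOCK() belong to the connection, not to the transaction, so COMMIT or ROLLBACK by itself never releases them; only RELEASE_LOCK() / RELEASE_ALL_LOCKS() or closing the connection does. A minimal sketch of that behaviour (the lock name is made up, and the raw BEGIN/ROLLBACK is purely for demonstration rather than how CiviCRM normally manages transactions):

```php
<?php
// Demonstration only: a named lock survives a rollback on the same connection.
CRM_Core_DAO::executeQuery('BEGIN');
CRM_Core_DAO::singleValueQuery("SELECT GET_LOCK('demo.lock', 0)"); // returns 1

CRM_Core_DAO::executeQuery('ROLLBACK');

// The lock is still held by this connection after the rollback:
$free = CRM_Core_DAO::singleValueQuery("SELECT IS_FREE_LOCK('demo.lock')"); // 0 = still held

// It only goes away on an explicit release (or when the connection closes).
CRM_Core_DAO::singleValueQuery("SELECT RELEASE_LOCK('demo.lock')");
```

That is why deferring the release only to PHASE_POST_COMMIT leaves the lock held after a rollback, and why either the POST_ROLLBACK callback or the test-suite RELEASE_ALL_LOCKS() cleanup closes that gap.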
Just noting the risk with keeping the lock too long after a rollback is probably not that bad in the real world - ie the process normally crashes pretty soon after that & the lock is released at that point |
Thanks for the extra eyes @totten. I think I prefer the explicit release after rollback patch. I agree with @eileenmcnaughton that in practice it really doesn't matter, since once we have rolled back we probably have bigger problems. However, including the explicit release after rollback is, I think, what should happen - the transaction is rolled back, so there's no reason not to allow a new lock to happen. And it's easier to explain why we would include it (it clearly answers @totten's reasonable question) compared with explaining why we have to release all locks in the test code. |
(force-pushed from ec7964f to 8db447b)
Tangential - but I've also had problems with this setting. However, I have my local Postfix set with:
This means anything sending to |
@totten was this the next in your merge order? |
OK, I've read more of the discussion - and I agree that release-after-rollback makes sense in the context of the problematic story/scenario. And the test-suite passed with that. Now that I've got my head into that story, I do like it. And there's been some r-run and several people seem inclined toward the patch. Given the parameters of our current process/capabilities, this seems ready to merge. (Just as an aside, I did try poking holes with some mental exercises -- e.g. consider whether/how it plays differently for UPDATE; whether/how the InnoDB row-locks are interacting; whether/how it affects anything outside of SqlGroup. These cases are complicated, so take this with a grain of salt, but I couldn't spot any obvious problematic effects. Problems could exist, because it's a complicated area, but I don't think we're going to find them in PR review.) |
Overview
CRM_Core_Lock should ensure that two processes cannot get the same lock.
However, if each process is writing to the database via a SQL Transaction (which delays the insert), then two processes could get two consecutive locks before either one's transaction completes. That's because we are releasing the lock before the transactions complete.
This PR resolves that problem by checking for an active transaction and, if one is found, delaying the release of the lock until after the currently active transaction is complete.
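As a rough sketch of that approach (not the literal patch): when a transaction is open, the RELEASE_LOCK() query is deferred via transaction callbacks; otherwise it runs immediately. The $transactionIsActive check and the lock name below are placeholders, while the callback pattern is the one quoted in the review diffs earlier in this thread:

```php
<?php
// Sketch of the release path described above. $transactionIsActive is a
// hypothetical stand-in for however the real patch detects an open transaction.
$query = 'SELECT RELEASE_LOCK(%1)';
$params = [1 => ['civicrm.example.cache', 'String']]; // illustrative lock name

if ($transactionIsActive) {
  // Release only once the surrounding transaction has committed...
  CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_COMMIT, function ($query, $params) {
    return CRM_Core_DAO::singleValueQuery($query, $params);
  }, [$query, $params]);
  // ...or rolled back (the follow-up change discussed in the conversation).
  CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_ROLLBACK, function ($query, $params) {
    return CRM_Core_DAO::singleValueQuery($query, $params);
  }, [$query, $params]);
}
else {
  // No transaction in flight, so the lock can be released right away.
  CRM_Core_DAO::singleValueQuery($query, $params);
}
```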
More discussion is here:
https://lab.civicrm.org/dev/core/-/issues/3988
Before
Two processes that set the exact same cache item from within a transaction might both successfully get a lock while each transaction is in process. When each transaction is complete, the first cache set will succeed but the second one will fail.
After
If two processes try to set the exact same cache item during a transaction, the second one will be blocked by the lock until the first one's transaction has completed.
Technical Details
There is some debate in issue 3988 about whether the cause of the error being encountered is two different processes writing the same cache item OR one process writing the same cache item recursively.
This PR is based on the idea that two separate processes are writing the same cache item. A fix for the recursive problem is here.
Comments
I haven't tested this code in production.