
Ensure delays from transactions don't cause cache race condition #25673

Merged
merged 3 commits into civicrm:master on Mar 21, 2023

Conversation

jmcclelland
Contributor

Overview

CRM_Core_Lock should ensure that two processes cannot get the same lock.

However, if each process is writing to the database via a SQL Transaction (which delays the insert), then two processes could get two consecutive locks before either one's transaction completes. That's because we are releasing the lock before the transactions complete.

This PR resolves that problem by checking for an active transaction and, if one is found, delaying the release of the lock until the currently active transaction is complete.

More discussion is here:

https://lab.civicrm.org/dev/core/-/issues/3988
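The deferred-release idea can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python, not CiviCRM code; the `Transaction` and `Lock` classes are toy stand-ins for `CRM_Core_Transaction` and `CRM_Core_Lock`, and the post-commit callback mirrors the approach the PR takes.

```python
import threading

class Transaction:
    """Toy transaction that queues callbacks to run after commit."""
    def __init__(self):
        self.active = True
        self._post_commit = []

    def add_post_commit(self, fn):
        self._post_commit.append(fn)

    def commit(self):
        self.active = False
        for fn in self._post_commit:
            fn()

class Lock:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        return self._lock.acquire(blocking=False)

    def release(self, txn=None):
        # The key idea of the patch: if a transaction is active, defer the
        # actual release until the transaction commits.
        if txn is not None and txn.active:
            txn.add_post_commit(self._lock.release)
        else:
            self._lock.release()

lock = Lock()
txn = Transaction()
assert lock.acquire()       # process 1 takes the lock inside a transaction
lock.release(txn)           # "release" is deferred, not immediate
assert not lock.acquire()   # process 2 still cannot take the lock
txn.commit()                # transaction completes -> lock actually released
assert lock.acquire()       # now process 2 can take it
```

The middle assertion is the whole point: a second worker stays blocked while the first worker's transaction is still open, instead of grabbing a lock whose protected writes haven't landed yet.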

Before

Two processes that set the exact same cache item from within a transaction might both successfully get a lock while each transaction is in progress. When each transaction completes, the first cache set will succeed but the second will fail.

After

If two processes try to set the exact same cache item during a transaction, the second one will be blocked by the lock until the first one's transaction has completed.

Technical Details

There is some debate in issue 3988 about whether the error is caused by two different processes writing the same cache item or by one process writing the same cache item recursively.

This PR is based on the idea that two separate processes are writing the same cache item. A fix for the recursive problem is here.

Comments

I haven't tested this code in production.

@civibot

civibot bot commented Feb 27, 2023

(Standard links)

@civibot civibot bot added the master label Feb 27, 2023
@mattwire
Contributor

@jmcclelland @MegaphoneJon The change looks simple enough and I'd be happy to merge if someone can say "Yes I'm running this in production for 1 month, it hasn't caused any problems and it looks like it has improved(/cleared?) the cache errors"

@mattwire
Contributor

Oh, and it looks like this PR is causing test failures :-)

@jmcclelland
Contributor Author

Whoops - that was a silly mistake. Let's see if this passes. If we pass all tests and neither you nor @eileenmcnaughton see any obvious problems I would be willing to try this in production.

@eileenmcnaughton
Contributor

Hmm - interestingly the test is one of our concurrency tests

api_v3_JobProcessMailingTest::testConcurrency with data set #0 (array(20, 3, 10, 4, 1), array(2, 1), 4)
API tallies should match.Array

@totten this is one that would otherwise be flaky?

@eileenmcnaughton
Contributor

@jmcclelland yes - I think this is worth trying in prod

@eileenmcnaughton
Contributor

Actually this fail might suggest it's working - ie it failed to grab the lock & hence can't run at exactly the same time

@jmcclelland
Contributor Author

Actually this fail might suggest it's working - ie it failed to grab the lock & hence can't run at exactly the same time

@eileenmcnaughton - yes - it does seem that is the case.

But now I wonder: what should happen if two processes compete for the same lock? Right now the code throws an exception. Is that the right thing to do? Or should it instead wait 1 second and try again, and only throw an exception after failing for X seconds?

@mattwire
Contributor

But now I wonder: what should happen if two processes compete for the same lock?
Well they shouldn't be competing for the same lock.. The second process should wait and only fail/timeout/throw exception if the wait time is excessive (I think core defaults to 3 seconds which seems reasonable given how fast computers are meant to be).
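The wait-then-fail behavior being described can be sketched like this, assuming the 3-second default mentioned in the thread. This is illustrative Python, not core code; `acquire_with_timeout` is a hypothetical helper, not the actual `acquire()` implementation.

```python
import time
import threading

def acquire_with_timeout(lock, timeout=3.0, interval=0.05):
    """Retry a non-blocking acquire until the deadline, then give up loudly."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):
            return True
        time.sleep(interval)  # back off briefly before retrying
    raise TimeoutError("Failed to acquire lock within %.1fs" % timeout)

lock = threading.Lock()
assert acquire_with_timeout(lock, timeout=0.2)   # free lock: acquired at once
try:
    acquire_with_timeout(lock, timeout=0.2)      # already held: times out
    raised = False
except TimeoutError:
    raised = True
assert raised
```

The second caller only errors out after exhausting the timeout, rather than rejecting immediately, which matches the "wait and only fail if the wait time is excessive" behavior described above.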

@jmcclelland
Contributor Author

But now I wonder: what should happen if two processes compete for the same lock?
Well they shouldn't be competing for the same lock.. The second process should wait and only fail/timeout/throw exception if the wait time is excessive (I think core defaults to 3 seconds which seems reasonable given how fast computers are meant to be).

I'm not currently seeing any waiting here. I don't think it's happening in the isAcquired call either - is it happening somewhere else up the line? Or is it just missing here and I should add a while loop with sleep directly in that set function?

@jmcclelland
Contributor Author

Oh - I see the acquire call has a default timeout of 3 seconds. Which suggests the test is failing because it takes longer than 3 seconds for one of the SQL transactions in the parallel mailing jobs to complete.

So... hm. I'm still back to the question of what to do. I guess the next step is to figure out why the email is taking longer than 3 seconds to complete the SQL transaction. I'm not sure if, when sending email, the transaction lasts the entire length of sending all the emails, or just one email? In other words, should we legitimately expect a SQL transaction to take longer than 3 seconds or is the mail test failure a real failure that needs to be fixed?

@eileenmcnaughton
Contributor

@jmcclelland

I'm struggling a bit to understand the test. My suspicion is the test failure represents a fix for the bug you are trying to resolve, ie the first process has maybe not committed the transaction so the second process can't jump in. But @totten probably needs to look - it's a crazy complicated test, which I guess makes sense cos it's trying to create something in one process that should not be possible in one process.

In SqlGroup.php line 114:SqlGroup: Failed to acquire lock on cache key.

@mattwire
Contributor

@jmcclelland if #22013 applies to you / the test servers then you won't see any waiting because you'll always get the lock granted and it effectively won't work. With 22013 you'll get an immediate rejection of the lock. So maybe some combination of the two? Where it waits for the 3 seconds if it was already locked.

@eileenmcnaughton
Contributor

Hmmm - I think if you are going to test on production you should try just this - we have a clear theory about a sequence of events that would lead to this.

For the other I guess it occurs when php statics are not being used & the lock is being relied on to compensate for that - but I don't have enough information to really assess how that could arise

@jmcclelland
Contributor Author

Yes, here is the job that is failing. It seems to be calling createExternalAPI which I believe launches separate processes. I think that means we are not experiencing a problem with the cache set being called recursively so I'm not sure #22013 is involved. Also, with #22013 I think we would be experiencing a SQL error about a constraint violation because we would be successfully acquiring the lock. Instead we are experiencing an error because we are unable to acquire the lock.

And based on @eileenmcnaughton's feedback I'll try this out in a limited fashion in production tomorrow. If all goes well I will push it out to all of our sites, which should give us reasonable real-world data.

@eileenmcnaughton
Contributor

@jmcclelland & I suspect we are unable to acquire the lock cos the transaction isn't finished - ie cos the patch is working & the test is not completing the transaction to simulate the right scenario - but am trying to get @totten to look

@jmcclelland
Contributor Author

The change is live on our first site. I'll keep you all posted :).

@eileenmcnaughton
Contributor

thanks @jmcclelland

@tbember

tbember commented Mar 16, 2023

I've applied the patch to a client's site that has been having deadlock errors for a specific recurring contribution since last year, and the payment went through this month! Thanks for the patch!

@jmcclelland
Contributor Author

Oh good - glad to hear! My update:

  1. I applied it to our own internal CiviCRM database which ran for 10 days without problems.
  2. Last Monday morning (four days ago) I applied it to two of our highest volume CiviCRM groups (both of whom are running multiple mailings in parallel) and so far no complaints. One of these groups has been sending me about 4 - 8 authorize recurring payments every month that are recorded in the system log but don't get processed because -- I'm fairly sure -- of this bug we are hopefully fixing. (And shout out to you @eileenmcnaughton for notificationlog which I use every month to re-run these.... maybe I won't need it after this??)
  3. I plan to roll it out to all of our groups on Monday.

@eileenmcnaughton
Contributor

test this please

@totten
Member

totten commented Mar 17, 2023

Curious - I tried running api_v3_JobProcessMailingTest locally a few ways:

  • My regular dmaster with min(php73/mysql57)
  • My regular dmaster with max(php81/mysql80)
  • A clean build-0 with dfl(php81/mysql57)

And it ran fine for all of them. But then I started a full run of phpunit-api3 -- and it failed. So there seems to be a test-interaction.

@eileenmcnaughton
Contributor

Interesting - pretty sure it failed consistently for me

@totten
Member

totten commented Mar 17, 2023

There's an interaction between this patch, api_v3_ExceptionTest, and api_v3_JobProcessMailingTest; i.e.

  • (Without patch) After running ExceptionTest, the subsequent JobProcessMailingTest will pass.
  • (With patch) After running ExceptionTest, the subsequent JobProcessMailingTest will fail.

(To reproduce quickly, delete all the other api3 tests and then run the suite.)

@totten totten force-pushed the transaction-lock branch from 0169eb1 to 99a1aa0 Compare March 17, 2023 03:58
@totten
Member

totten commented Mar 17, 2023

(@eileenmcnaughton) Interesting - pretty sure it failed consistently for me

FWIW, I usually have to turn off some local customizations to run it. Specifically, it fails because I hard-code email options into /etc/civicrm.settings.d -- which is nice for my own manual testing (everything is delivered to mailhog), but it prevents the test from capturing emails. So in that sense, it can be a bit picky.

(@me) There's an interaction between this patch, api_v3_ExceptionTest, and api_v3_JobProcessMailingTest;

Found the problem - ExceptionTest calls parent::setUp() and $this->useTransaction(). However, the order was flipped (relative to norm). After fixing the order, JobProcessMailingTest started to work for me.

I've rebased to include a fix and force-pushed, and I'll open another PR for a bigger preventive measure.

@totten
Member

totten commented Mar 17, 2023

OK, so it's passing, now I'm looking at the patch itself a bit more.

The patch provides for RELEASE_LOCK() when either (a) there's no transaction or (b) there's a successful commit. OK. So what happens when (c) there's a rollback?

According to MySQL docs, it doesn't do anything automatically:

Locks obtained with GET_LOCK() are not released when transactions commit or roll back.

So I guess that means you keep a lingering hold on the lock? But I'm not sure: from runtime POV, is that good or bad?


Ah ha. Looking back to the test-interaction, it may have been telling us something after all! Theory: When ExceptionTest ran, it left the lock engaged beyond its intuitive lifetime. When testConcurrency() fired up parallel workers, they got stifled by the lingering lock.

Which lock? No idea. But I tried a couple alternative patches - either of these fixes the spooky interaction. One patch makes for an explicit "release after rollback":

diff --git a/CRM/Core/Lock.php b/CRM/Core/Lock.php
index abe42f73f3..f72c655d59 100644
--- a/CRM/Core/Lock.php
+++ b/CRM/Core/Lock.php
@@ -216,6 +216,9 @@ class CRM_Core_Lock implements \Civi\Core\Lock\LockInterface {
         CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_COMMIT, function ($query, $params) {
           return CRM_Core_DAO::singleValueQuery($query, $params);
         }, [$query, $params]);
+        CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_ROLLBACK, function ($query, $params) {
+          return CRM_Core_DAO::singleValueQuery($query, $params);
+        }, [$query, $params]);
       }
       else {
         return CRM_Core_DAO::singleValueQuery($query, $params);

The other patch does a generic cleanup in between tests:

diff --git a/tests/phpunit/CiviTest/CiviUnitTestCase.php b/tests/phpunit/CiviTest/CiviUnitTestCase.php
index 9930747914..da8396a655 100644
--- a/tests/phpunit/CiviTest/CiviUnitTestCase.php
+++ b/tests/phpunit/CiviTest/CiviUnitTestCase.php
@@ -453,6 +453,7 @@ class CiviUnitTestCase extends PHPUnit\Framework\TestCase {
       $this->quickCleanup($tablesToTruncate);
       $this->createDomainContacts();
     }
+    CRM_Core_DAO::executeQuery('SELECT RELEASE_ALL_LOCKS()');
 
     $this->cleanTempDirs();
     $this->unsetExtensionSystem();

But, I haven't got my head fully into the issue. So I'm not sure if "release after rollback" policy is good or bad.
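The "release after rollback" policy can be illustrated in isolation with a toy Python sketch (not CiviCRM code): the release callback is registered for both phases, so a rollback releases the lock just like a commit would. The phase names mirror PHASE_POST_COMMIT / PHASE_POST_ROLLBACK from the diff above; everything else here is an invented stand-in.

```python
import threading

class Transaction:
    """Toy transaction with per-phase callbacks; only one phase ever fires."""
    def __init__(self):
        self.active = True
        self._callbacks = {"post_commit": [], "post_rollback": []}

    def add_callback(self, phase, fn):
        self._callbacks[phase].append(fn)

    def _finish(self, phase):
        self.active = False
        for fn in self._callbacks[phase]:
            fn()

    def commit(self):
        self._finish("post_commit")

    def rollback(self):
        self._finish("post_rollback")

lock = threading.Lock()
lock.acquire()
txn = Transaction()
# Mirror of the patch above: register the release for *both* phases,
# so neither outcome leaves a lingering hold on the lock.
txn.add_callback("post_commit", lock.release)
txn.add_callback("post_rollback", lock.release)
txn.rollback()               # rollback still releases the lock
assert not lock.locked()
```

Without the second registration, the rollback path would end with the lock still held, which is the lingering-lock scenario suspected in the ExceptionTest interaction.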

@eileenmcnaughton
Contributor

Just noting: the risk of keeping the lock too long after a rollback is probably not that bad in the real world - ie the process normally crashes pretty soon after that & the lock is released at that point

@jmcclelland
Contributor Author

Thanks for the extra eyes @totten.

I think I prefer the explicit release after rollback patch.

I agree with @eileenmcnaughton that in practice it really doesn't matter, since once we have rolled back we probably have bigger problems. However, including the explicit release after rollback is, I think, what should happen - the transaction is rolled back, so there's no reason not to allow a new lock.

And it's easier to explain why we would include it (it clearly answers @totten's reasonable question) compared with explaining why we have to release all locks in the test code.

@totten totten force-pushed the transaction-lock branch from ec7964f to 8db447b Compare March 17, 2023 21:43
@MegaphoneJon
Contributor

FWIW, I usually have to turn off some local customizations to run it. Specifically, it fails because I hard-code email options into /etc/civicrm.settings.d -- which is nice for my own manual testing (everything is delivered to mailhog), but it prevents the test from capturing emails. So in that sense, it can be a bit picky.

Tangential - but I've also had problems with this setting. However, I have my local Postfix set with:

relayhost = [127.0.0.1]:1025

This means anything sending to mail() (or SMTP to localhost) gets delivered to mailhog and it works for the CMS too, not just Civi.

@eileenmcnaughton
Contributor

@totten was this the next in your merge order?

@totten
Member

totten commented Mar 21, 2023

OK, I've read more of the discussion - and I agree that release-after-rollback makes sense in the context of the problematic story/scenario. And the test-suite passed with that.

Now that I've got my head into that story, I do like it. And there's been some r-run and several people seem inclined toward the patch. Given the parameters of our current process/capabilities, this seems ready to merge.

(Just as an aside, I did try poking holes with some mental exercises -- e.g. consider whether/how it plays differently for UPDATE; whether/how the InnoDB row-locks are interacting; whether/how it affects anything outside of SqlGroup. These cases are complicated, so take this with a grain of salt, but I couldn't spot any obvious problematic effects. Problems could exist, because it's a complicated area, but I don't think we're going to find them in PR review.)

@eileenmcnaughton eileenmcnaughton changed the title ensure delays from transactions don't cause cache race condition Ensure delays from transactions don't cause cache race condition Mar 21, 2023
@totten totten merged commit 7c744da into civicrm:master Mar 21, 2023