
Ensure delays from transactions don't cause cache race condition #25673

Merged
merged 3 commits into civicrm:master on Mar 21, 2023

Conversation

jmcclelland
Contributor

Overview

CRM_Core_Lock should ensure that two processes cannot get the same lock.

However, if each process is writing to the database via a SQL Transaction (which delays the insert), then two processes could get two consecutive locks before either one's transaction completes. That's because we are releasing the lock before the transactions complete.

This PR resolves that problem by checking for an active transaction and, if one is found, delaying the release of the lock until the currently active transaction is complete.

More discussion is here:

https://lab.civicrm.org/dev/core/-/issues/3988
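The deferred-release idea can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python, not CiviCRM code; the `Transaction` and `Lock` classes are toy stand-ins for `CRM_Core_Transaction` and `CRM_Core_Lock`, and the post-commit callback mirrors the approach the PR takes.

```python
import threading

class Transaction:
    """Toy transaction that queues callbacks to run after commit."""
    def __init__(self):
        self.active = True
        self._post_commit = []

    def add_post_commit(self, fn):
        self._post_commit.append(fn)

    def commit(self):
        self.active = False
        for fn in self._post_commit:
            fn()

class Lock:
    def __init__(self):
        self._lock = threading.Lock()

    def acquire(self):
        return self._lock.acquire(blocking=False)

    def release(self, txn=None):
        # The key idea of the patch: if a transaction is active, defer the
        # actual release until the transaction commits.
        if txn is not None and txn.active:
            txn.add_post_commit(self._lock.release)
        else:
            self._lock.release()

lock = Lock()
txn = Transaction()
assert lock.acquire()       # process 1 takes the lock inside a transaction
lock.release(txn)           # "release" is deferred, not immediate
assert not lock.acquire()   # process 2 still cannot take the lock
txn.commit()                # transaction completes -> lock actually released
assert lock.acquire()       # now process 2 can take it
```

The middle assertion is the whole point: a second worker stays blocked while the first worker's transaction is still open, instead of grabbing a lock whose protected writes haven't landed yet.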

Before

Two processes that set the exact same cache item from within a transaction might both successfully get a lock while each transaction is in progress. When each transaction completes, the first cache set will succeed but the second will fail.

After

If two processes try to set the exact same cache item during a transaction, the second one will be blocked by the lock until the first one's transaction has completed.

Technical Details

There is some debate in issue 3988 about whether the error is caused by two different processes writing the same cache item or by one process writing the same cache item recursively.

This PR is based on the idea that two separate processes are writing the same cache item. A fix for the recursive problem is here.

Comments

I haven't tested this code in production.

@civibot

civibot bot commented Feb 27, 2023

(Standard links)

@civibot civibot bot added the master label Feb 27, 2023
@mattwire
Contributor

@jmcclelland @MegaphoneJon The change looks simple enough and I'd be happy to merge if someone can say "Yes I'm running this in production for 1 month, it hasn't caused any problems and it looks like it has improved(/cleared?) the cache errors"

@mattwire
Contributor

Oh, and it looks like this PR is causing test failures :-)

@jmcclelland
Contributor Author

Whoops - that was a silly mistake. Let's see if this passes. If we pass all tests and neither you nor @eileenmcnaughton see any obvious problems I would be willing to try this in production.

@eileenmcnaughton
Contributor

Hmm - interestingly the test is one of our concurrency tests

api_v3_JobProcessMailingTest::testConcurrency with data set #0 (array(20, 3, 10, 4, 1), array(2, 1), 4)
API tallies should match.Array

@totten this is one that would otherwise be flaky?

@eileenmcnaughton
Contributor

@jmcclelland yes - I think this is worth trying in prod

@eileenmcnaughton
Contributor

Actually this fail might suggest it's working - ie it failed to grab the lock & hence can't run at exactly the same time

@jmcclelland
Contributor Author

Actually this fail might suggest it's working - ie it failed to grab the lock & hence can't run at exactly the same time

@eileenmcnaughton - yes - it does seem that is the case.

But now I wonder: what should happen if two processes compete for the same lock? Right now the code throws an exception. Is that the right thing to do? Or should it instead wait 1 second and try again, and only throw an exception after failing for X seconds?

@mattwire
Contributor

But now I wonder: what should happen if two processes compete for the same lock?
Well they shouldn't be competing for the same lock.. The second process should wait and only fail/timeout/throw exception if the wait time is excessive (I think core defaults to 3 seconds which seems reasonable given how fast computers are meant to be).
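The wait-then-fail behavior being described can be sketched like this, assuming the 3-second default mentioned in the thread. This is illustrative Python, not core code; `acquire_with_timeout` is a hypothetical helper, not the actual `acquire()` implementation.

```python
import time
import threading

def acquire_with_timeout(lock, timeout=3.0, interval=0.05):
    """Retry a non-blocking acquire until the deadline, then give up loudly."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if lock.acquire(blocking=False):
            return True
        time.sleep(interval)  # back off briefly before retrying
    raise TimeoutError("Failed to acquire lock within %.1fs" % timeout)

lock = threading.Lock()
assert acquire_with_timeout(lock, timeout=0.2)   # free lock: acquired at once
try:
    acquire_with_timeout(lock, timeout=0.2)      # already held: times out
    raised = False
except TimeoutError:
    raised = True
assert raised
```

The second caller only errors out after exhausting the timeout, rather than rejecting immediately, which matches the "wait and only fail if the wait time is excessive" behavior described above.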

@jmcclelland
Contributor Author

But now I wonder: what should happen if two processes compete for the same lock?
Well they shouldn't be competing for the same lock.. The second process should wait and only fail/timeout/throw exception if the wait time is excessive (I think core defaults to 3 seconds which seems reasonable given how fast computers are meant to be).

I'm not currently seeing any waiting here. I don't think it's happening in the isAcquired call either - is it happening somewhere else up the line? Or is it just missing here and I should add a while loop with sleep directly in that set function?

@jmcclelland
Contributor Author

Oh - I see the acquire call has a default timeout of 3 seconds. Which suggests the test is failing because it takes longer than 3 seconds for one of the SQL transactions in the parallel mailing jobs to complete.

So... hm. I'm still back to the question of what to do. I guess the next step is to figure out why the email is taking longer than 3 seconds to complete the SQL transaction. I'm not sure if, when sending email, the transaction lasts the entire length of sending all the emails, or just one email? In other words, should we legitimately expect a SQL transaction to take longer than 3 seconds or is the mail test failure a real failure that needs to be fixed?

@eileenmcnaughton
Contributor

@jmcclelland

I'm struggling a bit to understand the test. My suspicion is the test failure represents a fix for the bug you are trying to resolve, ie the first process has maybe not committed the transaction so the second process can't jump in. But @totten probably needs to look - it's a crazy complicated test, which I guess makes sense cos it's trying to create something in one process that should not be possible in one process.

In SqlGroup.php line 114:SqlGroup: Failed to acquire lock on cache key.

@mattwire
Contributor

@jmcclelland if #22013 applies to you / the test servers then you won't see any waiting because you'll always get the lock granted and it effectively won't work. With 22013 you'll get an immediate rejection of the lock. So maybe some combination of the two? Where it waits for the 3 seconds if it was already locked.

@eileenmcnaughton
Contributor

Hmmm - I think if you are going to test on production you should try just this - we have a clear theory about a sequence of events that would lead to this.

For the other I guess it occurs when php statics are not being used & the lock is being relied on to compensate for that - but I don't have enough information to really assess how that could arise

@jmcclelland
Contributor Author

Yes, here is the job that is failing. It seems to be calling createExternalAPI which I believe launches separate processes. I think that means we are not experiencing a problem with the cache set being called recursively so I'm not sure #22013 is involved. Also, with #22013 I think we would be experiencing a SQL error about a constraint violation because we would be successfully acquiring the lock. Instead we are experiencing an error because we are unable to acquire the lock.

And based on @eileenmcnaughton's feedback I'll try this out in a limited fashion in production tomorrow. If all goes well I will push it out to all of our sites, which should give us reasonable real-world data.

@eileenmcnaughton
Contributor

@jmcclelland & I suspect we are unable to acquire the lock cos the transaction isn't finished - ie cos the patch is working & the test is not completing the transaction to simulate the right scenario - but am trying to get @totten to look

@jmcclelland
Contributor Author

The change is live on our first site. I'll keep you all posted :).

@eileenmcnaughton
Contributor

thanks @jmcclelland

@tbember

tbember commented Mar 16, 2023

I've applied the patch to a client's site that has been having deadlock errors for a specific recurring contribution since last year, and the payment went through this month! Thanks for the patch!

@jmcclelland
Contributor Author

Oh good - glad to hear! My update:

  1. I applied it to our own internal CiviCRM database which ran for 10 days without problems.
  2. Last Monday morning (four days ago) I applied it to two of our highest volume CiviCRM groups (both of whom are running multiple mailings in parallel) and so far no complaints. One of these groups has been sending me about 4 - 8 authorize recurring payments every month that are recorded in the system log but don't get processed because -- I'm fairly sure -- of this bug we are hopefully fixing. (And shout out to you @eileenmcnaughton for notificationlog which I use every month to re-run these.... maybe I won't need it after this??)
  3. I plan to roll it out to all of our groups on Monday.

@eileenmcnaughton
Contributor

test this please

@totten
Member

totten commented Mar 17, 2023

Curious - I tried running api_v3_JobProcessMailingTest locally a few ways:

  • My regular dmaster with min(php73/mysql57)
  • My regular dmaster with max(php81/mysql80)
  • A clean build-0 with dfl(php81/mysql57)

And it ran fine for all of them. But then I started a full run of phpunit-api3 -- and it failed. So there seems to be a test-interaction.

@eileenmcnaughton
Contributor

Interesting - pretty sure it failed consistently for me

@totten
Member

totten commented Mar 17, 2023

There's an interaction between this patch, api_v3_ExceptionTest, and api_v3_JobProcessMailingTest; i.e.

  • (Without patch) After running ExceptionTest, the subsequent JobProcessMailingTest will pass.
  • (With patch) After running ExceptionTest, the subsequent JobProcessMailingTest will fail.

(To reproduce quickly, delete all the other api3 tests and then run the suite.)

@totten totten force-pushed the transaction-lock branch from 0169eb1 to 99a1aa0 Compare March 17, 2023 03:58
@totten
Member

totten commented Mar 17, 2023

(@eileenmcnaughton) Interesting - pretty sure it failed consistently for me

FWIW, I usually have to turn off some local customizations to run it. Specifically, it fails because I hard-code email options into /etc/civicrm.settings.d -- which is nice for my own manual testing (everything is delivered to mailhog), but it prevents the test from capturing emails. So in that sense, it can be a bit picky.

(@me) There's an interaction between this patch, api_v3_ExceptionTest, and api_v3_JobProcessMailingTest;

Found the problem - ExceptionTest calls parent::setUp() and $this->useTransaction(). However, the order was flipped (relative to norm). After fixing the order, JobProcessMailingTest started to work for me.

I've rebased to include a fix and force-pushed, and I'll open another PR for a bigger preventive measure.

@totten
Member

totten commented Mar 17, 2023

OK, so it's passing, now I'm looking at the patch itself a bit more.

The patch provides for RELEASE_LOCK() when either (a) there's no transaction or (b) there's a successful commit. OK. So what happens when (c) there's a rollback?

According to MySQL docs, it doesn't do anything automatically:

Locks obtained with GET_LOCK() are not released when transactions commit or roll back.

So I guess that means you keep a lingering hold on the lock? But I'm not sure: from runtime POV, is that good or bad?


Ah ha. Looking back to the test-interaction, it may have been telling us something after all! Theory: When ExceptionTest ran, it left the lock engaged beyond its intuitive lifetime. When testConcurrency() fired up parallel workers, they got stifled by the lingering lock.

Which lock? No idea. But I tried a couple alternative patches - either of these fixes the spooky interaction. One patch makes for an explicit "release after rollback":

diff --git a/CRM/Core/Lock.php b/CRM/Core/Lock.php
index abe42f73f3..f72c655d59 100644
--- a/CRM/Core/Lock.php
+++ b/CRM/Core/Lock.php
@@ -216,6 +216,9 @@ class CRM_Core_Lock implements \Civi\Core\Lock\LockInterface {
         CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_COMMIT, function ($query, $params) {
           return CRM_Core_DAO::singleValueQuery($query, $params);
         }, [$query, $params]);
+        CRM_Core_Transaction::addCallback(CRM_Core_Transaction::PHASE_POST_ROLLBACK, function ($query, $params) {
+          return CRM_Core_DAO::singleValueQuery($query, $params);
+        }, [$query, $params]);
       }
       else {
         return CRM_Core_DAO::singleValueQuery($query, $params);

The other patch does a generic cleanup in between tests:

diff --git a/tests/phpunit/CiviTest/CiviUnitTestCase.php b/tests/phpunit/CiviTest/CiviUnitTestCase.php
index 9930747914..da8396a655 100644
--- a/tests/phpunit/CiviTest/CiviUnitTestCase.php
+++ b/tests/phpunit/CiviTest/CiviUnitTestCase.php
@@ -453,6 +453,7 @@ class CiviUnitTestCase extends PHPUnit\Framework\TestCase {
       $this->quickCleanup($tablesToTruncate);
       $this->createDomainContacts();
     }
+    CRM_Core_DAO::executeQuery('SELECT RELEASE_ALL_LOCKS()');
 
     $this->cleanTempDirs();
     $this->unsetExtensionSystem();

But, I haven't got my head fully into the issue. So I'm not sure if "release after rollback" policy is good or bad.
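The "release after rollback" policy can be illustrated in isolation with a toy Python sketch (not CiviCRM code): the release callback is registered for both phases, so a rollback releases the lock just like a commit would. The phase names mirror PHASE_POST_COMMIT / PHASE_POST_ROLLBACK from the diff above; everything else here is an invented stand-in.

```python
import threading

class Transaction:
    """Toy transaction with per-phase callbacks; only one phase ever fires."""
    def __init__(self):
        self.active = True
        self._callbacks = {"post_commit": [], "post_rollback": []}

    def add_callback(self, phase, fn):
        self._callbacks[phase].append(fn)

    def _finish(self, phase):
        self.active = False
        for fn in self._callbacks[phase]:
            fn()

    def commit(self):
        self._finish("post_commit")

    def rollback(self):
        self._finish("post_rollback")

lock = threading.Lock()
lock.acquire()
txn = Transaction()
# Mirror of the patch above: register the release for *both* phases,
# so neither outcome leaves a lingering hold on the lock.
txn.add_callback("post_commit", lock.release)
txn.add_callback("post_rollback", lock.release)
txn.rollback()               # rollback still releases the lock
assert not lock.locked()
```

Without the second registration, the rollback path would end with the lock still held, which is the lingering-lock scenario suspected in the ExceptionTest interaction.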

@eileenmcnaughton
Contributor

Just noting: the risk of keeping the lock too long after a rollback is probably not that bad in the real world - ie the process normally crashes pretty soon after that & the lock is released at that point

@jmcclelland
Contributor Author

Thanks for the extra eyes @totten.

I think I prefer the explicit release after rollback patch.

I agree with @eileenmcnaughton that in practice it really doesn't matter, since once we have rolled back we probably have bigger problems. However, including the explicit release after rollback is, I think, what should happen - the transaction is rolled back, so there's no reason not to allow a new lock.

And it's easier to explain why we would include it (it clearly answers @totten's reasonable question) compared with explaining why we have to release all locks in the test code.

@totten totten force-pushed the transaction-lock branch from ec7964f to 8db447b Compare March 17, 2023 21:43
@MegaphoneJon
Contributor

FWIW, I usually have to turn off some local customizations to run it. Specifically, it fails because I hard-code email options into /etc/civicrm.settings.d -- which is nice for my own manual testing (everything is delivered to mailhog), but it prevents the test from capturing emails. So in that sense, it can be a bit picky.

Tangential - but I've also had problems with this setting. However, I have my local Postfix set with:

relayhost = [127.0.0.1]:1025

This means anything sending to mail() (or SMTP to localhost) gets delivered to mailhog and it works for the CMS too, not just Civi.

@eileenmcnaughton
Contributor

@totten was this the next in your merge order?

@totten
Member

totten commented Mar 21, 2023

OK, I've read more of the discussion - and I agree that release-after-rollback makes sense in the context of the problematic story/scenario. And the test-suite passed with that.

Now that I've got my head into that story, I do like it. And there's been some r-run and several people seem inclined toward the patch. Given the parameters of our current process/capabilities, this seems ready to merge.

(Just as an aside, I did try poking holes with some mental exercises -- e.g. consider whether/how it plays differently for UPDATE; whether/how the InnoDB row-locks are interacting; whether/how it affects anything outside of SqlGroup. These cases are complicated, so take this with a grain of salt, but I couldn't spot any obvious problematic effects. Problems could exist, because it's a complicated area, but I don't think we're going to find them in PR review.)

@eileenmcnaughton eileenmcnaughton changed the title ensure delays from transactions don't cause cache race condition Ensure delays from transactions don't cause cache race condition Mar 21, 2023
@totten totten merged commit 7c744da into civicrm:master Mar 21, 2023