
Cron cleanup repeatedly hits deadlocks on large environments where groups can overlap #28007

Merged

Conversation

driskell
Contributor

@driskell driskell commented Apr 28, 2020

We have two large installations, one Enterprise and one Community, both on 2.3.4, with some long-running crons. The groups frequently overlap, and the DELETE operations in the cron cleanup cause deadlocks.

Examining the database environment, we can see that some DELETE queries use the job_code index whereas others perform a scan as they attempt to use the status column. The ones using the index take exclusive locks on the index, then proceed to filter, release, and lock the relevant PRIMARY key entries to take the rows. The queries not using the index take the PRIMARY key locks first, then perform the filtering and release them. This may not be an exact description, but it's my current understanding of what's happening. As you can see, when two queries run at the same time for two different groups, this can cause a deadlock.
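To illustrate the overlap (purely illustrative SQL, not the literal Magento queries; the job codes and lifetimes are placeholders, and the lock description is a simplification of InnoDB behaviour):

```sql
-- Session 1, cleaning group A: the planner uses the job_code index, so it
-- locks job_code index entries first and the matching PRIMARY rows second.
DELETE FROM cron_schedule
 WHERE job_code IN ('group_a_job_one', 'group_a_job_two')
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- Session 2, cleaning group B at the same time: the planner prefers a scan,
-- so it locks PRIMARY rows first and filters on status/job_code afterwards.
DELETE FROM cron_schedule
 WHERE job_code IN ('group_b_job_one')
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- Because the two statements acquire index and row locks in opposite order,
-- overlapping runs can block each other and one fails with
-- SQLSTATE[40001]: 1213 Deadlock found when trying to get lock.
```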

Description (*)

I've tested this patch over time and it alleviates the deadlocks I'm seeing.

The patch changes the DELETE for cleanup so it does not use a range for the job_code. This makes the query more likely to use the job_code index, mostly on MySQL 5.7. However, depending on database contents, it can still ignore the index if it decides the status column is better, mostly on MySQL 5.6, which I tested too. Adding the status column to the job_code index resolves this, and I have yet to see incorrect index behaviour since.
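As a rough before/after sketch (not the exact PR diff; the job codes, status value, and lifetime are placeholders):

```sql
-- Before (approximate shape): one DELETE covering a list/range of job codes.
-- Depending on cardinality, MySQL may prefer the status column or a scan.
DELETE FROM cron_schedule
 WHERE status = 'success'
   AND job_code IN ('job_a', 'job_b', 'job_c')
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- After (approximate shape): one DELETE per job code, with a plain equality
-- predicate that the job_code index can satisfy directly.
DELETE FROM cron_schedule
 WHERE status = 'success'
   AND job_code = 'job_a'
   AND finished_at < NOW() - INTERVAL 1 HOUR;
```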

The additional index also helps with the atomic job status update. Previously the LEFT JOIN used to get the existing job status would not always use an index, and could also deadlock with the aforementioned DELETE cleanup queries. The new index on job_code and status makes this join always use the index.
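A minimal sketch of the two pieces involved (the index name and the join are approximations of the cron_schedule logic, not the exact code in this PR):

```sql
-- Composite index so both the cleanup DELETE and the status join can be
-- resolved from (job_code, status) instead of falling back to a scan.
ALTER TABLE cron_schedule
  ADD INDEX CRON_SCHEDULE_JOB_CODE_STATUS (job_code, status);

-- Approximate shape of the atomic status update: only move this schedule to
-- 'running' if no other schedule for the same job_code is already running.
UPDATE cron_schedule AS cs
LEFT JOIN cron_schedule AS other
       ON other.job_code = cs.job_code
      AND other.status = 'running'
      AND other.schedule_id <> cs.schedule_id
   SET cs.status = 'running',
       cs.executed_at = NOW()
 WHERE cs.schedule_id = 42            -- placeholder schedule id
   AND cs.status = 'pending'
   AND other.schedule_id IS NULL;
```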

Furthermore, I had issues on development environments that were built from snapshots and frequently shut down and restarted, where "running" jobs that never finish would accumulate because those jobs were terminated whilst running, or were running at the time of the snapshot. This PR also adds a cleanup of those, so that after 24 hours they become failed. This matches the logic in the atomic status update, which ignores running jobs older than 24 hours, allowing a new job to start (and was the cause of the buildup, as every day there was a chance of one additional running entry per job).
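A minimal sketch of that extra cleanup, assuming the 24-hour cutoff described above (the target status and message text are placeholders, not necessarily what the PR uses):

```sql
-- Mark schedules stuck in 'running' for more than 24 hours as failed so they
-- no longer block the "is this job already running?" check.
UPDATE cron_schedule
   SET status = 'error',
       messages = 'Marked as failed after running for more than 24 hours'
 WHERE status = 'running'
   AND executed_at < NOW() - INTERVAL 1 DAY;
```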

Related Pull Requests

  1. This PR provides an alternative to the fix in #24789 for cleaning up running jobs that failed (issue #23054: Cron job not running after crashed once).
  2. This PR adds an index similar to #27391 (Add a new index for cron_schedule table). I'm not sure what issue that PR addressed, but it could be related: the UPDATE/LEFT JOIN slowness can come from the query avoiding the index, which is what this PR fixes.

Fixed Issues (if relevant)

  1. Does not fix the OP's issue but fixes the issue reported in some of the comments: #8933 (1213 Deadlock found when trying to get lock)
  2. Fixes #22438 (Magento 2.3.1 Cron Deadlocks for cron_schedule)
  3. Fixes #18409 (Magento 2.2.5 - Cron Job Error. SQLSTATE[40001]: Serialization failure: 1213 Deadlock)
  4. Fixes #23054 (Cron job not running after crashed once)
  5. Related: #25634 (Magento 2.3.3 Cronjob use too much CPU) - high CPU can occur due to a growing cron_schedule table, which this PR resolves
  6. Related: #26507 (Cronjobs increasing CPU usage and slow queries) - high CPU can occur due to a growing cron_schedule table, which this PR resolves
  7. Related: #26809 (Cron using too many resources.) - high CPU can occur due to a growing cron_schedule table, which this PR resolves

Manual testing scenarios (*)

  1. Unable to reproduce on demand, but these steps hopefully capture what we understand to be the cause
  2. Set up a new Enterprise site and install several new modules with their own cron groups, performing tasks that take at least 5 minutes and start every 10 minutes - the cron groups here are sync tasks with remote systems, stock updates, and product updates
  3. Populate thousands of products and thousands of customers
  4. Run cron every minute and set cron to email on failure
  5. Leave it running for a week and notice emails with deadlocks

For running stale jobs:

  1. Take a database snapshot whilst jobs are running and start it up in a new environment
  2. Periodically shut down the server whilst jobs are running, several times
  3. Note that even when cron is not running, there are running jobs in the cron_schedule table
  4. Note that after a week these are still there

Questions or comments

N/A

Contribution checklist (*)

  • Pull request has a meaningful description of its purpose
  • All commits are accompanied by meaningful commit messages
  • All new or changed code is covered with unit/integration tests (if applicable)
  • All automated tests passed successfully (all builds are green)

@m2-assistant

m2-assistant bot commented Apr 28, 2020

Hi @driskell. Thank you for your contribution
Here are some useful tips on how you can test your changes using the Magento test environment.
Add a comment under your pull request to deploy a test or vanilla Magento instance:

  • @magento give me test instance - deploy test instance based on PR changes
  • @magento give me 2.4-develop instance - deploy vanilla Magento instance

For more details, please review the Magento Contributor Guide documentation.

@driskell
Contributor Author

driskell commented Apr 28, 2020

Will aim to look at the tests in the next week or so if they need updating, but please go ahead and review in the meantime.

@ghost ghost assigned kandy Apr 28, 2020
Contributor

@VladimirZaets VladimirZaets left a comment


@kandy Hi, do you have any concerns regarding the queries?
@hostep I agree with you, but we can also deliver it as is and, in a second PR, refactor it and make it configurable.
@driskell thanks for the collaboration. Can you please take care of the failed tests?

@kandy
Contributor

kandy commented May 6, 2020

I prefer not to change indexes in the database, but to use the existing index and not split the single cleanup query.
Other stuff looks good to me

@driskell
Contributor Author

> I prefer not to change indexes in the database, but to use the existing index and not split the single cleanup query.
> Other stuff looks good to me

That sounds very fair and it will definitely reduce the impact. I hope to get onto this this week and also look at the tests. I'm fairly busy with paid work, though, so I can't guarantee it, but that's my goal!

@ihor-sviziev ihor-sviziev added the Severity: S1 label Jun 5, 2020
@ihor-sviziev ihor-sviziev self-assigned this Jun 5, 2020
Contributor

@ihor-sviziev ihor-sviziev left a comment


Hi @driskell,
I've just updated the list of fixed issues - looks like it's quite a big number!

I see that your changes now conflict with the 2.4-develop branch; related changes were merged in a6c93fd. Could you confirm whether this issue still reproduces after applying them?
If it does, please resolve the conflict and look at the failing tests.

This is definitely an important fix, thank you for it!

@ihor-sviziev
Contributor

Hi @driskell,
Will you be able to continue working on your PR?

@ihor-sviziev
Contributor

Hi @engcom-Charlie,
Could you confirm whether the linked issues still reproduce on 2.4-develop?
It looks to me like a really important fix, but due to the conflict I'm not sure whether the issue has already been fixed.

@driskell
Contributor Author

I'm hoping to check into this today.

@driskell
Contributor Author

driskell commented Jun 22, 2020

@ihor-sviziev

I guess there's a conflict here in goals:

  • 2.4-develop aims to suppress deadlocks and retry them - this amounts to a limited failsafe mechanism, and the error could still surface if the deadlock repeats 5 times. I honestly believe the deadlocks are better off fixed. It seems strange to me to accept greater code complexity in an attempt to hide an error 99% of the time, rather than attempt to fix it fully and predictably.

  • This PR aimed to resolve the deadlock completely, for both MySQL 5.6 and MySQL 5.7 behaviours, and not require any deadlock retry code.

Secondly, with regard to the deadlock issues I see in our environments, there are two kinds. The first is during the DELETE. The second is during the taking of a job lock, as per ResourceModel/Schedule.php - there is only an index on job_code, and even though it's included in the query, we were seeing deadlocks because in some cases MySQL believed a scan would be more efficient given the index cardinalities, and the locking then overlapped with that of other jobs. This is not covered in the current 2.4-develop code. As I remember, this was mostly an issue with MySQL 5.6.

Overall, the above job lock deadlock is why I modified the index, yet there was a request to remove that change.

If we continue with this PR I'd much prefer to make it do the following (almost the opposite of what @kandy requested):

  • Add status to the job_code index, to resolve the deadlock in attempting to get a lock.
  • Split out the cleanup so it cleans each job individually and does not rely on unpredictable index selection over the job_code range (a quick way to verify the chosen plan is sketched after this list)
  • Keep the deadlock retry from 2.4-develop but add warning logs to it - with a note to remove it if no deadlocks occur anymore
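For what it's worth, a quick way to check which plan MySQL picks for the per-job cleanup on a given dataset (diagnostic only; the job code and lifetime are placeholders):

```sql
-- The "key" column of the plan should show the job_code (or job_code, status)
-- index rather than a scan of the PRIMARY key. EXPLAIN on DELETE requires
-- MySQL 5.6.3 or later.
EXPLAIN
DELETE FROM cron_schedule
 WHERE job_code = 'indexer_reindex_all_invalid'
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;
```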

@ihor-sviziev
Contributor

@kandy
I agree with the solution suggested by @driskell in #28007 (comment). Could you review it and give your input?

@ihor-sviziev
Contributor

@magento run all tests

@engcom-Hotel
Contributor

@magento run Functional Tests B2B, Functional Tests CE, Magento Health Index, WebAPI Tests

@m2-assistant

m2-assistant bot commented Dec 22, 2020

Hi @driskell, thank you for your contribution!
Please complete the Contribution Survey; it will take less than a minute.
Your feedback will help us improve the contribution process.

@ihor-sviziev
Contributor

🎉 finally, this amazing fix got merged! Thank you so much @driskell for your contribution!
I hope it will not introduce any new issues.

@kilis

kilis commented Jan 20, 2021

This issue still exists after this fix.
Just tested on live and it hung up again.

@hostep
Contributor

hostep commented Jan 21, 2021

@kilis: you might need the changes from MC-25132 as well; this was mentioned above in #28007 (comment)

@kilis

kilis commented Jan 24, 2021

@hostep
I moved the entire cron module from 2.4 to 2.3.5-p2 as a patch. I do not understand what else is missing.
deadlock-fix.patch.txt

@hostep
Contributor

hostep commented Jan 24, 2021

@kilis: me neither, sorry, I just tried to point you to a certain comment in case you didn't notice it.

If you can reproduce the problem on a Magento 2.4-develop installation, I'd strongly suggest you open a new issue with detailed steps about how to reproduce it (even though that's probably hard to get right).

Labels
Auto-Tests: Covered, Award: advanced, Award: bug fix, Award: category of expertise, Award: special achievement, Component: Cron, Priority: P1, Priority: P2, Progress: accept, Release Line: 2.4, Risk: high, Severity: S1, Severity: S2, Severity: S3