
Cron cleanup repeatedly hits deadlocks on large environments where groups can overlap #28007

Merged

Conversation

driskell
Contributor

@driskell driskell commented Apr 28, 2020

We have two large installations, one Enterprise and one Community, both on 2.3.4, with some long-running crons. The groups frequently overlap, and the DELETE operations in the cron cleanup cause deadlocks.

Examining the database environment, we can see that some DELETE queries use the job_code index whereas others perform a scan as they attempt to use the status column. The ones using the index take exclusive locks on the index, then proceed to filter, release, and lock the relevant PRIMARY key entries to take the rows. The queries not using the index take the PRIMARY key locks first, then perform the filtering and release them. This may not be an exact description, but it's my current understanding of what's happening. As you can see, when two queries run at the same time for two different groups, this can cause a deadlock.
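To illustrate the overlap (purely illustrative SQL, not the literal Magento queries; the job codes and lifetimes are placeholders, and the lock description is a simplification of InnoDB behaviour):

```sql
-- Session 1, cleaning group A: the planner uses the job_code index, so it
-- locks job_code index entries first and the matching PRIMARY rows second.
DELETE FROM cron_schedule
 WHERE job_code IN ('group_a_job_one', 'group_a_job_two')
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- Session 2, cleaning group B at the same time: the planner prefers a scan,
-- so it locks PRIMARY rows first and filters on status/job_code afterwards.
DELETE FROM cron_schedule
 WHERE job_code IN ('group_b_job_one')
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- Because the two statements acquire index and row locks in opposite order,
-- overlapping runs can block each other and one fails with
-- SQLSTATE[40001]: 1213 Deadlock found when trying to get lock.
```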

Description (*)

I've tested this patch over time and it alleviates the deadlocks I'm seeing.

The patch changes the DELETE for cleanup so it does not use a range for the job_code. This makes the query more likely to use the job_code index, mostly on MySQL 5.7. However, depending on database contents, it can still ignore the index if it decides the status column is better, mostly on MySQL 5.6, which I tested too. Adding the status column to the job_code index resolves this, and I have yet to see incorrect index behaviour since.
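As a rough before/after sketch (not the exact PR diff; the job codes, status value, and lifetime are placeholders):

```sql
-- Before (approximate shape): one DELETE covering a list/range of job codes.
-- Depending on cardinality, MySQL may prefer the status column or a scan.
DELETE FROM cron_schedule
 WHERE status = 'success'
   AND job_code IN ('job_a', 'job_b', 'job_c')
   AND finished_at < NOW() - INTERVAL 1 HOUR;

-- After (approximate shape): one DELETE per job code, with a plain equality
-- predicate that the job_code index can satisfy directly.
DELETE FROM cron_schedule
 WHERE status = 'success'
   AND job_code = 'job_a'
   AND finished_at < NOW() - INTERVAL 1 HOUR;
```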

The additional index also helps with the atomic job status update. Previously the LEFT JOIN used to get the existing job status would not always use an index, and could also deadlock with the aforementioned DELETE cleanup queries. The new index on job_code and status makes this join always use the index.
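A minimal sketch of the two pieces involved (the index name and the join are approximations of the cron_schedule logic, not the exact code in this PR):

```sql
-- Composite index so both the cleanup DELETE and the status join can be
-- resolved from (job_code, status) instead of falling back to a scan.
ALTER TABLE cron_schedule
  ADD INDEX CRON_SCHEDULE_JOB_CODE_STATUS (job_code, status);

-- Approximate shape of the atomic status update: only move this schedule to
-- 'running' if no other schedule for the same job_code is already running.
UPDATE cron_schedule AS cs
LEFT JOIN cron_schedule AS other
       ON other.job_code = cs.job_code
      AND other.status = 'running'
      AND other.schedule_id <> cs.schedule_id
   SET cs.status = 'running',
       cs.executed_at = NOW()
 WHERE cs.schedule_id = 42            -- placeholder schedule id
   AND cs.status = 'pending'
   AND other.schedule_id IS NULL;
```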

Furthermore, I had issues on development environments that were built from snapshots and frequently shut down and restarted, where "running" jobs that never finish would accumulate because those jobs were terminated whilst running, or were running at the time of the snapshot. This PR also adds a cleanup of those, so that after 24 hours they become failed. This matches the logic in the atomic status update, which ignores running jobs older than 24 hours, allowing a new job to start (and was the cause of the buildup, as every day there was a chance of one additional running entry per job).
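A minimal sketch of that extra cleanup, assuming the 24-hour cutoff described above (the target status and message text are placeholders, not necessarily what the PR uses):

```sql
-- Mark schedules stuck in 'running' for more than 24 hours as failed so they
-- no longer block the "is this job already running?" check.
UPDATE cron_schedule
   SET status = 'error',
       messages = 'Marked as failed after running for more than 24 hours'
 WHERE status = 'running'
   AND executed_at < NOW() - INTERVAL 1 DAY;
```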

Related Pull Requests

  1. This PR provides an alternative to the fix in #24789 for cleaning up running jobs that failed (issue #23054: Cron job not running after crashed once).
  2. This PR adds an index similar to #27391 (Add a new index for cron_schedule table). I'm not sure what issue that PR addressed, but it could be related: the UPDATE/LEFT JOIN slowness can come from the query avoiding the index, which is what this PR fixes.

Fixed Issues (if relevant)

  1. Does not fix the OP's issue but fixes the issue reported in some of the comments: #8933 (1213 Deadlock found when trying to get lock)
  2. Fixes #22438 (Magento 2.3.1 Cron Deadlocks for cron_schedule)
  3. Fixes #18409 (Magento 2.2.5 - Cron Job Error. SQLSTATE[40001]: Serialization failure: 1213 Deadlock)
  4. Fixes #23054 (Cron job not running after crashed once)
  5. Related: #25634 (Magento 2.3.3 Cronjob use too much CPU) - high CPU can occur due to a growing cron_schedule table, which this PR resolves
  6. Related: #26507 (Cronjobs increasing CPU usage and slow queries) - high CPU can occur due to a growing cron_schedule table, which this PR resolves
  7. Related: #26809 (Cron using too many resources.) - high CPU can occur due to a growing cron_schedule table, which this PR resolves

Manual testing scenarios (*)

  1. Unable to reproduce on demand, but these steps hopefully capture what we understand to be the cause
  2. Set up a new Enterprise site and install several new modules with their own cron groups, performing tasks that take at least 5 minutes and start every 10 minutes - the cron groups here are sync tasks with remote systems, stock updates, and product updates
  3. Populate thousands of products and thousands of customers
  4. Run cron every minute and set cron to email on failure
  5. Leave it running for a week and notice emails with deadlocks

For running stale jobs:

  1. Take a database snapshot whilst jobs are running and start it up in a new environment
  2. Periodically shut down the server whilst jobs are running, several times
  3. Note that even when cron is not running, there are running jobs in the cron_schedule table
  4. Note that after a week these are still there

Questions or comments

N/A

Contribution checklist (*)

  • Pull request has a meaningful description of its purpose
  • All commits are accompanied by meaningful commit messages
  • All new or changed code is covered with unit/integration tests (if applicable)
  • All automated tests passed successfully (all builds are green)

@m2-assistant

m2-assistant bot commented Apr 28, 2020

Hi @driskell. Thank you for your contribution
Here are some useful tips on how you can test your changes using the Magento test environment.
Add a comment under your pull request to deploy a test or vanilla Magento instance:

  • @magento give me test instance - deploy test instance based on PR changes
  • @magento give me 2.4-develop instance - deploy vanilla Magento instance

For more details, please review the Magento Contributor Guide documentation.

@driskell
Contributor Author

driskell commented Apr 28, 2020

Will aim to look at the tests in the next week or so if they need updating, but please go ahead and review in the meantime.

@ghost ghost assigned kandy Apr 28, 2020
Contributor

@VladimirZaets VladimirZaets left a comment


@kandy Hi, do you have any concerns regarding the queries?
@hostep I agree with you, but we can also deliver it as is and, in a second PR, refactor it and make it configurable.
@driskell thanks for the collaboration. Can you please take care of the failed tests?

@kandy
Contributor

kandy commented May 6, 2020

I prefer not to change indexes in the database, but to use the existing index and not split the single cleanup query.
Other stuff looks good to me

@driskell
Contributor Author

> I prefer not to change indexes in the database, but to use the existing index and not split the single cleanup query.
> Other stuff looks good to me

That sounds very fair and it will definitely reduce the impact. I hope to get onto this this week and also look at the tests. I'm fairly busy with paid work, though, so I can't guarantee it, but that's my goal!

@ihor-sviziev ihor-sviziev added the Severity: S1 label Jun 5, 2020
@ihor-sviziev ihor-sviziev self-assigned this Jun 5, 2020
Contributor

@ihor-sviziev ihor-sviziev left a comment


Hi @driskell,
I've just updated the list of fixed issues - looks like it's quite a big number!

I see that your changes now conflict with the 2.4-develop branch; related changes were merged in a6c93fd. Could you confirm whether this issue still reproduces after applying them?
If it does, please resolve the conflict and look at the failing tests.

This is definitely an important fix, thank you for it!

@ihor-sviziev
Contributor

Hi @driskell,
Will you be able to continue working on your PR?

@ihor-sviziev
Contributor

Hi @engcom-Charlie,
Could you confirm whether the linked issues still reproduce on 2.4-develop?
It looks to me like a really important fix, but due to the conflict I'm not sure whether the issue has already been fixed.

@driskell
Contributor Author

I'm hoping to check into this today.

@driskell
Contributor Author

driskell commented Jun 22, 2020

@ihor-sviziev

I guess there's a conflict here in goals:

  • 2.4-develop aims to suppress deadlocks and retry them - this amounts to a limited failsafe mechanism, and the error could still surface if the deadlock repeats 5 times. I honestly believe the deadlocks are better off fixed. It seems strange to me to accept greater code complexity in an attempt to hide an error 99% of the time, rather than attempt to fix it fully and predictably.

  • This PR aimed to resolve the deadlock completely, for both MySQL 5.6 and MySQL 5.7 behaviours, and not require any deadlock retry code.

Secondly, with regard to the deadlock issues I see in our environments, there are two kinds. The first is during the DELETE. The second is during the taking of a job lock, as per ResourceModel/Schedule.php - there is only an index on job_code, and even though it's included in the query, we were seeing deadlocks because in some cases MySQL believed a scan would be more efficient given the index cardinalities, and the locking then overlapped with that of other jobs. This is not covered in the current 2.4-develop code. As I remember, this was mostly an issue with MySQL 5.6.

Overall, the above job lock deadlock is why I modified the index, yet there was a request to remove that change.

If we continue with this PR I'd much prefer to make it do the following (almost the opposite of what @kandy requested):

  • Add status to the job_code index, to resolve the deadlock in attempting to get a lock.
  • Split out the cleanup so it cleans each job individually and does not rely on unpredictable index selection over the job_code range (a quick way to verify the chosen plan is sketched after this list)
  • Keep the deadlock retry from 2.4-develop but add warning logs to it - with a note to remove it if no deadlocks occur anymore
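For what it's worth, a quick way to check which plan MySQL picks for the per-job cleanup on a given dataset (diagnostic only; the job code and lifetime are placeholders):

```sql
-- The "key" column of the plan should show the job_code (or job_code, status)
-- index rather than a scan of the PRIMARY key. EXPLAIN on DELETE requires
-- MySQL 5.6.3 or later.
EXPLAIN
DELETE FROM cron_schedule
 WHERE job_code = 'indexer_reindex_all_invalid'
   AND status = 'success'
   AND finished_at < NOW() - INTERVAL 1 HOUR;
```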

@ihor-sviziev
Contributor

@kandy
I agree with the solution suggested by @driskell in #28007 (comment). Could you review it and give your input?

@ihor-sviziev
Contributor

@magento run all tests

@engcom-Hotel
Contributor

@magento run Functional Tests B2B, Functional Tests CE, Magento Health Index, WebAPI Tests

@m2-assistant

m2-assistant bot commented Dec 22, 2020

Hi @driskell, thank you for your contribution!
Please complete the Contribution Survey; it will take less than a minute.
Your feedback will help us improve the contribution process.

@ihor-sviziev
Contributor

🎉 finally, this amazing fix got merged! Thank you so much @driskell for your contribution!
I hope it will not introduce any new issues.

@kilis

kilis commented Jan 20, 2021

This issue still exists after this fix.
Just tested on live and it hung up again.

@hostep
Contributor

hostep commented Jan 21, 2021

@kilis: you might need the changes from MC-25132 as well; this was mentioned above in #28007 (comment)

@kilis

kilis commented Jan 24, 2021

@hostep
I moved the entire cron module from 2.4 to 2.3.5-p2 as a patch. I do not understand what else is missing.
deadlock-fix.patch.txt

@hostep
Contributor

hostep commented Jan 24, 2021

@kilis: me neither, sorry, I just tried to point you to a certain comment in case you didn't notice it.

If you can reproduce the problem on a Magento 2.4-develop installation, I'd strongly suggest you open a new issue with detailed steps about how to reproduce it (even though that's probably hard to get right).

Labels
Auto-Tests: Covered, Award: advanced, Award: bug fix, Award: category of expertise, Award: special achievement, Component: Cron, Priority: P1, Priority: P2, Progress: accept, Release Line: 2.4, Risk: high, Severity: S1, Severity: S2, Severity: S3