Cron cleanup repeatedly hits deadlocks on large environments where groups can overlap #28007
Hi @driskell. Thank you for your contribution
For more details, please review the Magento Contributor Guide documentation.
I will aim to look at the tests in the next week or so, in case they need updating, but please go ahead and review anyway.
I would prefer not to change the database indexes, but to use the existing index and not split the single cleanup query into several.
That sounds very fair and it will definitely reduce the impact. I hope to get onto this this week and also look at the test. I'm fairly busy with paid work, though, so I can't guarantee it, but that's my goal!
Hi @driskell,
I've just updated the list of fixed issues. It looks like quite a big number!
I see that your changes now conflict with the 2.4-develop branch, since related changes were merged in a6c93fd. Could you confirm whether this issue still reproduces after applying them?
If so, please resolve the conflict and look at the failing tests.
This is definitely an important fix, thank you for it!
Hi @driskell,
Hi @engcom-Charlie,
I'm hoping to look into this today.
I guess there's a conflict here in goals:
Secondly, with regard to the deadlock issues I see in our environments, there are two kinds. The first occurs during the DELETE. The second occurs while taking a job lock. Overall, the job lock deadlock is why I modified the index. Yet there was a request to remove that change. If we continue with this PR, I'd much prefer to be able to make it do the following (almost the opposite of what @kandy requested):
efe2435 to bf23f57
@kandy
@magento run all tests
@magento run Functional Tests B2B, Functional Tests CE, Magento Health Index, WebAPI Tests
Hi @driskell, thank you for your contribution!
🎉 finally, this amazing fix got merged! Thank you so much @driskell for your contribution! |
This issue still exists after this fix.
@kilis: you might need the changes from MC-25132 as well, this was mentioned above in #28007 (comment) |
@hostep
@kilis: me neither, sorry, I just tried to point you to a certain comment in case you didn't notice it. If you can reproduce the problem on a Magento
We have two large installations, one Enterprise, one Community, both on 2.3.4, with some long running crons. Frequently the groups overlap and the DELETE operations in the cron cleanup cause deadlocks.
Examining the database, we can see that some DELETE queries use the job_code index, whereas others perform a scan as they attempt to filter on the status column. The queries that use the index take exclusive locks on the index entries, then filter, release, and lock the relevant PRIMARY key entries to take the rows. The queries that do not use the index take the PRIMARY key locks first, then filter and release them. This may not be an exact description, but it is my current understanding of what is happening. Because the two kinds of query acquire locks in different orders, two of them running at the same time for two different groups can deadlock.
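The lock-ordering problem described above can be illustrated with a deliberately simplified model. This is only a sketch and not InnoDB's actual locking algorithm: the resource names (`idx_job_code`, `primary`) and the function `can_deadlock` are hypothetical, and each query is reduced to the order in which it acquires two locks.

```python
from itertools import combinations


def can_deadlock(order_a, order_b):
    """Return True if two transactions acquire some pair of shared
    resources in opposite orders (a classic lock-order inversion)."""
    shared = set(order_a) & set(order_b)
    for x, y in combinations(sorted(shared), 2):
        a_takes_x_first = order_a.index(x) < order_a.index(y)
        b_takes_x_first = order_b.index(x) < order_b.index(y)
        if a_takes_x_first != b_takes_x_first:
            return True
    return False


# Hypothetical lock orders, loosely following the description above.
indexed_delete = ["idx_job_code", "primary"]   # goes through the job_code index first
scanning_delete = ["primary", "idx_job_code"]  # scans the PRIMARY key first

print(can_deadlock(indexed_delete, scanning_delete))  # opposite orders
print(can_deadlock(indexed_delete, indexed_delete))   # same order
```

In this model, making every cleanup query go through the same index gives all of them the same lock order, which is the effect the patch aims for.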
Description (*)
I've tested this patch over time and it alleviates the deadlocks I'm seeing.
The patch changes the cleanup DELETE so that it does not use a range for the job_code. This makes the query more likely to use the job_code index, mostly on MySQL 5.7. However, depending on the database contents, the optimizer can still ignore that index if it decides the status column is the better choice, mostly on MySQL 5.6, which I also tested. Adding the status column to the job_code index resolves this, and I have yet to see incorrect index behaviour.
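One reading of "does not use a range for the job_code" is that the cleanup issues one equality-matched DELETE per job code instead of a single DELETE over a list of codes. The sketch below shows that query shape against an in-memory SQLite table. It is an illustration only: SQLite's planner is not MySQL's, the column set is reduced, and the index name `idx_job_code_status`, the `scheduled_at` cutoff column, and the `cleanup` helper are all assumptions rather than the actual patch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cron_schedule ("
    " schedule_id INTEGER PRIMARY KEY,"
    " job_code TEXT NOT NULL,"
    " status TEXT NOT NULL,"
    " scheduled_at TEXT NOT NULL)"
)
# Composite index covering both filter columns (assumed name).
conn.execute(
    "CREATE INDEX idx_job_code_status ON cron_schedule (job_code, status)"
)
conn.executemany(
    "INSERT INTO cron_schedule (job_code, status, scheduled_at) VALUES (?, ?, ?)",
    [
        ("job_a", "success", "2020-01-01 00:00:00"),
        ("job_a", "success", "2020-06-01 00:00:00"),
        ("job_b", "success", "2020-01-01 00:00:00"),
        ("job_b", "pending", "2020-01-01 00:00:00"),
    ],
)


def cleanup(conn, job_codes, status, cutoff):
    """Delete old rows with one equality-matched DELETE per job code."""
    deleted = 0
    for job_code in job_codes:
        cur = conn.execute(
            "DELETE FROM cron_schedule"
            " WHERE job_code = ? AND status = ? AND scheduled_at < ?",
            (job_code, status, cutoff),
        )
        deleted += cur.rowcount
    return deleted


deleted = cleanup(conn, ["job_a", "job_b"], "success", "2020-03-01 00:00:00")
remaining = conn.execute("SELECT COUNT(*) FROM cron_schedule").fetchone()[0]
print(deleted, remaining)
```

Each statement filters on an exact job_code and status, so the leading columns of the composite index match the predicate directly.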
The additional index also helps with the atomic job status update. Previously, the left join that fetches the existing job status would not always use an index, and it could also deadlock with the aforementioned DELETE cleanup queries. The new index on job_code and status makes this join always use the index.
Furthermore, I had issues on development environments that were built from snapshots and were frequently shut down and restarted. There, "running" jobs that never finish would accumulate, because those jobs were terminated whilst running or were running at the time of the snapshot. This PR also adds a cleanup of those jobs so that after 24 hours they become failed. This matches the logic in the atomic status update, which ignores running jobs older than 24 hours and allows a new job to start. That logic was the cause of the buildup, as every day there was a chance of one additional running entry per job.
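The stale-job rule can be sketched in a few lines. This is a minimal illustration of the 24-hour cutoff only, not the PR's implementation: the in-memory job records, the `executed_at` field used as the start time, the `error` status value standing in for "failed", and the function name `fail_stale_running` are all assumptions.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)


def fail_stale_running(jobs, now):
    """Mark 'running' jobs that started more than 24 hours ago as failed.

    Returns the number of jobs whose status was changed.
    """
    changed = 0
    for job in jobs:
        if job["status"] == "running" and now - job["executed_at"] > STALE_AFTER:
            job["status"] = "error"  # assumed name for the failed status
            changed += 1
    return changed


now = datetime(2020, 5, 10, 12, 0, 0)
jobs = [
    {"job_code": "indexer", "status": "running", "executed_at": now - timedelta(hours=30)},
    {"job_code": "indexer", "status": "running", "executed_at": now - timedelta(hours=1)},
    {"job_code": "sitemap", "status": "success", "executed_at": now - timedelta(days=3)},
]
changed = fail_stale_running(jobs, now)
print(changed, [job["status"] for job in jobs])
```

Only the entry that has been running for 30 hours is changed. The recent running job and the finished job are left alone.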
Related Pull Requests
magento/magento2#27391: Add a new index for cron_schedule table
Fixed Issues (if relevant)
Manual testing scenarios (*)
For running stale jobs:
Questions or comments
N/A
Contribution checklist (*)