Messaging: Avoid deadlocks related to 0 receiver behavior#10132
Merged
mattlord merged 16 commits intovitessio:mainfrom Apr 28, 2022
Merged
Messaging: Avoid deadlocks related to 0 receiver behavior#10132mattlord merged 16 commits intovitessio:mainfrom
mattlord merged 16 commits intovitessio:mainfrom
Conversation
Signed-off-by: Matt Lord <mattalord@gmail.com>
d4ff217 to
f31eefa
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
f680779 to
d177a21
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
derekperkins
approved these changes
Apr 25, 2022
See: vitessio/website#1015 Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Member
|
Looks like the majority of tests are failing on |
3 tasks
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
This is for safe concurrency with the last receiver unsubscribing Signed-off-by: Matt Lord <mattalord@gmail.com>
Member
|
@mattlord you really rolled through quite a few iterations, thanks again for digging in so deep. I don't think I have enough context around vstream to provide much feedback on the latest design. Once @rohit-nayak-ps approves it, I'm ready to roll it out in prod to test it out. |
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Member
Author
|
Note that the |
Member
|
@mattlord is this still waiting on @rohit-nayak-ps, or was pairing with @sougou sufficient review? |
sougou
approved these changes
Apr 28, 2022
Member
|
🎉 Thanks so much for tracking this down!! I'll be rolling it out later today |
Member
Author
Excellent! Please let me know how things go. |
tanjinx
added a commit
to slackhq/vitess
that referenced
this pull request
Aug 30, 2022
3 tasks
tanjinx
added a commit
to slackhq/vitess
that referenced
this pull request
Aug 31, 2022
This reverts commit 5e02d6e.
rsajwani
pushed a commit
to planetscale/vitess
that referenced
this pull request
Nov 8, 2022
* Prevent deadlocks related to 0 receiver behavior Signed-off-by: Matt Lord <mattalord@gmail.com> * Update test tables to use poller_idx Signed-off-by: Matt Lord <mattalord@gmail.com> * Minor changes after mutex usage review in message manager + cache Signed-off-by: Matt Lord <mattalord@gmail.com> * Use atomics for receiver count and messages pending Signed-off-by: Matt Lord <mattalord@gmail.com> * Don't take exclusive lock when in fast path Signed-off-by: Matt Lord <mattalord@gmail.com> * Update tests to use the new recommended message table structure See: vitessio/website#1015 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Update e2e test to use new recommended table structure Signed-off-by: Matt Lord <mattalord@gmail.com> * Fix TestMessageStreamingPlan test Signed-off-by: Matt Lord <mattalord@gmail.com> * Fix godriver/TestStreamMessaging test Signed-off-by: Matt Lord <mattalord@gmail.com> * Split streamMu into streamProcessingMu and lastPollPositionMu Signed-off-by: Matt Lord <mattalord@gmail.com> * Poller cannot take main lock w/o having X stream processing lock Signed-off-by: Matt Lord <mattalord@gmail.com> * Improve the comments a bit Signed-off-by: Matt Lord <mattalord@gmail.com> * Hold the main mutex during Add This is for safe concurrency with the last receiver unsubscribing Signed-off-by: Matt Lord <mattalord@gmail.com> * Changes after pair reviewing with Sugu Signed-off-by: Matt Lord <mattalord@gmail.com> * Use my GitHub handle for the self reference Signed-off-by: Matt Lord <mattalord@gmail.com> Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com>
harshit-gangal
pushed a commit
that referenced
this pull request
Nov 22, 2022
…11619) * Move towards MySQL 8.0 as the default template generation (#11153) * Move towards MySQL 8.0 as the default template generation This upgrades the remaining things to Ubuntu 20.04 and makes MySQL 8.0 the default we run tests against. We still have tests for MySQL 5.7 but those are now explicitly opted into. This should finish up the Ubuntu 20.04 upgrade and also makes things easier for the future when we need to upgrade again. Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> * CI: rename shard vtorc_8.0 to vtorc_5.7, change expected test output for 8.0 Signed-off-by: deepthi <deepthi@planetscale.com> * CI: increase timeout for 8.0 tests on the actual test step from 30 to 45 mins Signed-off-by: deepthi <deepthi@planetscale.com> * CI: increase timeout to 45 minutes for mysql57 tests too. We really only need this for vtorc, but I've made the change to the template so all tests get it. Signed-off-by: deepthi <deepthi@planetscale.com> * CI: fix vtorc test to work with both 5.7 and 8.0 Signed-off-by: deepthi <deepthi@planetscale.com> * CI: missed docker flag in mysql57 template, one more fix to vtorc test Signed-off-by: deepthi <deepthi@planetscale.com> * removing spaces from pb file Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * removing spaces in pb file part 2 Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Signed-off-by: deepthi <deepthi@planetscale.com> Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> Co-authored-by: deepthi <deepthi@planetscale.com> Co-authored-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fixing template files to default to mysql80 Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * changing update statement to alter Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Avoid deadlocks related to 0 receiver behavior (#10132) * Prevent deadlocks related to 0 receiver behavior Signed-off-by: Matt Lord <mattalord@gmail.com> * Update test tables to use poller_idx Signed-off-by: Matt Lord <mattalord@gmail.com> * Minor changes after mutex usage review in message manager + cache Signed-off-by: Matt Lord <mattalord@gmail.com> * Use atomics for receiver count and messages pending Signed-off-by: Matt Lord <mattalord@gmail.com> * Don't take exclusive lock when in fast path Signed-off-by: Matt Lord <mattalord@gmail.com> * Update tests to use the new recommended message table structure See: vitessio/website#1015 Signed-off-by: Matt Lord <mattalord@gmail.com> * Correct tests Signed-off-by: Matt Lord <mattalord@gmail.com> * Update e2e test to use new recommended table structure Signed-off-by: Matt Lord <mattalord@gmail.com> * Fix TestMessageStreamingPlan test Signed-off-by: Matt Lord <mattalord@gmail.com> * Fix godriver/TestStreamMessaging test Signed-off-by: Matt Lord <mattalord@gmail.com> * Split streamMu into streamProcessingMu and lastPollPositionMu Signed-off-by: Matt Lord <mattalord@gmail.com> * Poller cannot take main lock w/o having X stream processing lock Signed-off-by: Matt Lord <mattalord@gmail.com> * Improve the comments a bit Signed-off-by: Matt Lord <mattalord@gmail.com> * Hold the main mutex during Add This is for safe concurrency with the last receiver unsubscribing Signed-off-by: Matt Lord <mattalord@gmail.com> * Changes after pair reviewing with Sugu Signed-off-by: Matt Lord <mattalord@gmail.com> * Use my GitHub handle for the self reference Signed-off-by: Matt Lord <mattalord@gmail.com> Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * remove unwanted sleep Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Fix failures Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Fix backup tests Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fix vtgate_gen4 error Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * upgrading to mysql 8.0 using vinalla mysql Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * pin to specific version 8.0.25 Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Try to pin mysql version to 8.0.25 using tar file Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Fix bug in last commit for 8.0.25 Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fixing template files to use mysql8.0.25 Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fixing community version Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * removing all mysql version before installation Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * moving cluster 11 to self host Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * trying different combination since mysql is not getting installed correctly Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * use tar instead of deb file Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * remove mysql stop statement Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * setting vt_mysql_root Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * moving export to right place Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * change template to accomodate tar file download Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fixing mysql80 cluster Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * Adding mysql to the path env variable Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fixing shardedpitr_tls test Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * fix upgrade-downgrade test Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * adjust epected AUTO_INCREMENT value Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * removed 'expected_table_structure' files because there are different outputs in mysql57 and in mysql80 Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> * adding mysql version in workflow logs Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> * move to utuntu 18 for upgrade-downgrade test Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> Signed-off-by: Dirkjan Bussink <d.bussink@gmail.com> Signed-off-by: deepthi <deepthi@planetscale.com> Signed-off-by: Rameez Sajwani <rameezwazirali@hotmail.com> Signed-off-by: Matt Lord <mattalord@gmail.com> Signed-off-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com> Co-authored-by: Dirkjan Bussink <d.bussink@gmail.com> Co-authored-by: deepthi <deepthi@planetscale.com> Co-authored-by: Matt Lord <mattalord@gmail.com> Co-authored-by: Shlomi Noach <2607934+shlomi-noach@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
There's conditional behavior and code paths in the message manager when the number of receivers is 0. The main mutex and the stream mutex locks are used to ensure overall consistency. But:
vitess/go/vt/vttablet/tabletserver/messager/message_manager.go
Lines 752 to 787 in c985379
These two factors can lead to deadlocks when the receiver count goes to/from 0 while the the poller runs:
In
unsubscribe()we have the main mutex, and while holding it we then callstopVStream()if the subscriber/receiver count is at 0. ThenstopVStream()tries to get the stream mutex.Concurrently,
runPoller()— which by default, or more accurately when using the documented message table structure here via thevt_poller_intervalcomment directive, runs on a 30 second ticker — gets the stream mutex first, then it later tries to get the main mutex.Now we are deadlocked. They're both waiting on each other.
This then in turn causes things like
PlannedReparentShardto hang because during the transition from primary serving we stop/close the message service, which in turn will wait trying to get the main mutex.Changes
cacheManagementMulock. The explicit usage of this new mutex allows us to avoid the deadlocks seen before because it's taken immediately when the poller or binlog streamer process data coming from the database (which is used to update the cache), and we no longer need to take this lock when starting or stopping the binlog streamer itself.processRowEvent()->Add()— we were at risk of a deadlock. The main mutex became VERY hot because we were checking the receiver count all the time to ensure the fast path (do nothing when there's no receivers).time_acked is null— to try and limit the size of scans done on the underlying message table to try and ensure that the poller runs quickly. This aspect is important because the binlog event processing is blocked during this time. It also prevents us from doing unnecessary work as InnoDB does most of the work — all query (WHERE) predicates become index condition pushdowns and the newpoller_idxindex can be used for the ordering to apply theLIMITas well — and passes the minimum amount of potentially useful data up the stack InnoDB->MySQL->MessageManager->Cache.Notes
This is able to complete on
vitessio/main:Whatever changes are made here we should continue to pass that test (it does). It's even more reliable now — you can run it 1,000 times and no flakiness.
Related Issue(s)
Checklist