From 249c6510d44088bc96fa667209f7950b364b3a1a Mon Sep 17 00:00:00 2001
From: Graham Christensen
Date: Tue, 30 May 2017 10:41:12 -0400
Subject: [PATCH] Enable monitoring after each node has already caught up

When cloning a slave, the newly cloned slave used to page immediately
after the clone finishes, because replication was started and then
monitoring was enabled. The hardest part about these pages is they
trigger several hours after the original page, making it a particularly
punishing overnight page double-tap.

This change waits for slaves to catch up before enabling monitoring.

Here is a log of what was happening:

From Jetpants:

    02:39:36 [10.5.2.11:3306] Listening with netcat.
    02:39:36 [10.5.2.12] Sending files over to 10.5.2.11:3306: mysql test myapp ibdata1
    05:34:30 [10.5.2.12] File copy complete.
    05:34:30 [10.5.2.12] Verifying file sizes and types on all destinations.
    05:34:31 [10.5.2.12] Verification successful.
    <...>
    05:39:04 [10.5.2.12:3306] Attempting to enable monitoring on service MySQL-Slave
    05:39:04 [10.5.2.12:3306] Successfully enabled monitoring for service MySQL-Slave

Then from Icinga:

    External Command[2017-05-30 05:39:04] EXTERNAL COMMAND: ENABLE_SVC_NOTIFICATIONS;10.5.2.12;MySQL-Slave
    Service Notification[2017-05-30 05:39:05] SERVICE NOTIFICATION: 10.5.2.12;MySQL-Slave;CRITICAL;pagerduty-service-notification;SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 9018;

Then from Jetpants:

    05:39:19 [10.5.2.11:3306] Waiting to catch up to master
    05:39:20 [10.5.2.12:3306] Waiting to catch up to master
    05:39:20 [10.5.2.12:3306] Currently 8675 seconds behind master.
    05:41:13 [10.5.2.11:3306] Currently 8829 seconds behind master.
    <...>
    05:46:02 [10.5.2.12:3306] Caught up to master.
    05:48:05 [10.5.2.11:3306] Caught up to master.
    05:48:06 [10.5.2.12:3306] Attempting to enable monitoring on service MySQL-Slave
    05:48:06 [10.5.2.12:3306] Successfully enabled monitoring for service MySQL-Slave
---
 lib/jetpants/db/replication.rb | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/jetpants/db/replication.rb b/lib/jetpants/db/replication.rb
index 153fe47..3177079 100644
--- a/lib/jetpants/db/replication.rb
+++ b/lib/jetpants/db/replication.rb
@@ -295,13 +295,12 @@ def enslave_siblings!(targets)
       clone_to!(targets)
 
       targets.each do |t|
-        t.enable_monitoring
         t.change_master_to(master, change_master_options)
         t.enable_read_only!
       end
       [ self, targets ].flatten.each(&:resume_replication) # should already have happened from the clone_to! restart anyway, but just to be explicit
       [ self, targets ].flatten.concurrent_each{|n| n.catch_up_to_master 21600 }
-      enable_monitoring
+      [ self, targets ].flatten.each(&:enable_monitoring)
     end
 
     # Shortcut to call DB#enslave_siblings! on a single target