From 249c6510d44088bc96fa667209f7950b364b3a1a Mon Sep 17 00:00:00 2001
From: Graham Christensen <graham@tumblr.com>
Date: Tue, 30 May 2017 10:41:12 -0400
Subject: [PATCH] Enable monitoring after each node has already caught up

When cloning a slave, the newly cloned slave used to page immediately
after the clone finished, because monitoring was enabled while the
node was still thousands of seconds behind its master.

The worst part about these pages is that they fire several hours after
the original page, making for a particularly punishing overnight
double-tap.

This change waits for every node (the source and all targets) to catch
up before enabling monitoring.

Here is a log of what was happening:

From Jetpants:

    02:39:36 [10.5.2.11:3306] Listening with netcat.
    02:39:36 [10.5.2.12] Sending files over to 10.5.2.11:3306: mysql test myapp ibdata1
    05:34:30 [10.5.2.12] File copy complete.
    05:34:30 [10.5.2.12] Verifying file sizes and types on all destinations.
    05:34:31 [10.5.2.12] Verification successful.
    <...>
    05:39:04 [10.5.2.12:3306] Attempting to enable monitoring on service MySQL-Slave
    05:39:04 [10.5.2.12:3306] Successfully enabled monitoring for service MySQL-Slave

Then from Icinga:

    External Command[2017-05-30 05:39:04] EXTERNAL COMMAND: ENABLE_SVC_NOTIFICATIONS;10.5.2.12;MySQL-Slave
    Service Notification[2017-05-30 05:39:05] SERVICE NOTIFICATION: 10.5.2.12;MySQL-Slave;CRITICAL;pagerduty-service-notification;SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 9018;

Then from Jetpants:

    05:39:19 [10.5.2.11:3306] Waiting to catch up to master
    05:39:20 [10.5.2.12:3306] Waiting to catch up to master
    05:39:20 [10.5.2.12:3306] Currently 8675 seconds behind master.
    05:41:13 [10.5.2.11:3306] Currently 8829 seconds behind master.
    <...>
    05:46:02 [10.5.2.12:3306] Caught up to master.
    05:48:05 [10.5.2.11:3306] Caught up to master.
    05:48:06 [10.5.2.12:3306] Attempting to enable monitoring on service MySQL-Slave
    05:48:06 [10.5.2.12:3306] Successfully enabled monitoring for service MySQL-Slave
---
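Note: the ordering problem can be reproduced in isolation with the
minimal sketch below. It is illustrative only and is not Jetpants code.
FakeNode, old_order, new_order and build_nodes are hypothetical names,
the lag values are borrowed from the log above, and the sketch assumes
that enabling monitoring triggers an immediate lag check, as the Icinga
log suggests. It also simplifies the old behavior by enabling
monitoring on every node up front.

```ruby
# Hypothetical stand-in for a database node; NOT the Jetpants DB class.
class FakeNode
  attr_reader :name, :pages

  def initialize(name, seconds_behind)
    @name = name
    @seconds_behind = seconds_behind
    @monitoring = false
    @pages = []
  end

  # Assumption: turning monitoring on runs a lag check right away.
  def enable_monitoring
    @monitoring = true
    check_lag
  end

  # Pretend replication has fully caught up, then re-check.
  def catch_up_to_master
    @seconds_behind = 0
    check_lag
  end

  private

  # Record a page only when the node is monitored and still lagging.
  def check_lag
    return unless @monitoring && @seconds_behind > 0
    @pages << "SLOW_SLAVE CRITICAL: Seconds Behind Master: #{@seconds_behind}"
  end
end

# Old ordering (simplified): monitoring first, catch up afterwards.
def old_order(nodes)
  nodes.each(&:enable_monitoring)
  nodes.each(&:catch_up_to_master)
end

# New ordering: catch up first, enable monitoring afterwards.
def new_order(nodes)
  nodes.each(&:catch_up_to_master)
  nodes.each(&:enable_monitoring)
end

# Lag values taken from the log excerpt, for illustration only.
def build_nodes
  [FakeNode.new("10.5.2.11", 8829), FakeNode.new("10.5.2.12", 8675)]
end

old_nodes = build_nodes
old_order(old_nodes)
new_nodes = build_nodes
new_order(new_nodes)

puts "old ordering pages: #{old_nodes.sum { |n| n.pages.size }}"
puts "new ordering pages: #{new_nodes.sum { |n| n.pages.size }}"
```

Under these assumptions the old ordering records one page per lagging
node and the new ordering records none.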
 lib/jetpants/db/replication.rb | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/jetpants/db/replication.rb b/lib/jetpants/db/replication.rb
index 153fe47..3177079 100644
--- a/lib/jetpants/db/replication.rb
+++ b/lib/jetpants/db/replication.rb
@@ -295,13 +295,12 @@ def enslave_siblings!(targets)
 
       clone_to!(targets)
       targets.each do |t|
-        t.enable_monitoring
         t.change_master_to(master, change_master_options)
         t.enable_read_only!
       end
       [ self, targets ].flatten.each(&:resume_replication) # should already have happened from the clone_to! restart anyway, but just to be explicit
       [ self, targets ].flatten.concurrent_each{|n| n.catch_up_to_master 21600 }
-      enable_monitoring
+      [ self, targets ].flatten.each(&:enable_monitoring)
     end
 
     # Shortcut to call DB#enslave_siblings! on a single target