Cut-over should wait for heartbeat lag to be low enough to succeed #14
Conversation
Another day passed by and we learned something new about this problem. I added details to the original issue github#799 (comment) about how the Aurora setting in question behaves.

My team are not convinced that it's safe for us to change the Aurora default value of that setting.
Reviewed upstream: github#921 (review)
Will merge here. I have no ownership to merge upstream.
I am currently experiencing something similar to what you described in this issue: github#1081. I read your comment here.

Does this mean that... I would appreciate any insight you have here, @ccoffey.
I have even forcibly throttled myself and the HeartbeatLag keeps growing:

```
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m20s(total), 3h45m20s(copy); streamer: mysql-bin-changelog.012031:41829585; Lag: 318.42s, HeartbeatLag: 2229.29s, State: throttled, commanded by user; ETA: 23h6m56s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m22s(total), 3h45m21s(copy); streamer: mysql-bin-changelog.012031:47634096; Lag: 319.92s, HeartbeatLag: 2230.75s, State: throttled, commanded by user; ETA: 23h7m5s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m23s(total), 3h45m22s(copy); streamer: mysql-bin-changelog.012031:50747420; Lag: 320.72s, HeartbeatLag: 2231.53s, State: throttled, commanded by user; ETA: 23h7m10s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m23s(total), 3h45m23s(copy); streamer: mysql-bin-changelog.012031:53188812; Lag: 321.32s, HeartbeatLag: 2232.15s, State: throttled, commanded by user; ETA: 23h7m14s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m24s(total), 3h45m23s(copy); streamer: mysql-bin-changelog.012031:55407229; Lag: 321.82s, HeartbeatLag: 2232.71s, State: throttled, commanded by user; ETA: 23h7m17s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m24s(total), 3h45m24s(copy); streamer: mysql-bin-changelog.012031:57378604; Lag: 322.32s, HeartbeatLag: 2232.41s, State: throttled, commanded by user; ETA: 23h7m20s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m25s(total), 3h45m24s(copy); streamer: mysql-bin-changelog.012031:59364269; Lag: 322.82s, HeartbeatLag: 2232.91s, State: throttled, commanded by user; ETA: 23h7m23s
myusuf3@tee:~$ echo "sup" | nc -U /tmp/gh-ost.b5prod.vault_items.sock
Copy: 126480000/904947480 14.0%; Applied: 0; Backlog: 0/1000; Time: 3h45m25s(total), 3h45m25s(copy); streamer: mysql-bin-changelog.012031:61323482; Lag: 323.32s, HeartbeatLag: 2233.40s, State: throttled, commanded by user; ETA: 23h7m26s
```
Bumping again.
It's possible. In my company we figured this out scientifically. We made a copy of our production DB and experimented with it.
This makes sense: the writes done by the copy operation appear on the binlog, so at some copy speed you are going to experience HeartbeatLag. In my post here I mentioned the following:

This is no longer true. We rolled this setting change out to all of our production DBs and we no longer experience significant HeartbeatLag. This was basically a silver bullet for us.

We later followed up with another improvement that you might want to try. We noticed that certain migrations (on our largest table) still caused increased Aurora replication latency, sometimes multiple seconds, which was unacceptable. My colleague solved this by running...

The reason this works is that...

Best of luck!
Description
Related issue: github#799
In the above issue, we see migrations which fail at the cut-over phase with `ERROR Timeout while waiting for events up to lock`. These migrations fail cut-over many times and eventually exhaust all retries.

Root cause
Lag experienced by an external replica is not the same as lag experienced by gh-ost while processing the binlog. The replica's lag is measured by running `show slave status` against an external replica and extracting the value of `Seconds_Behind_Master`.

For example: imagine that both of these lags were ~0 seconds. Then imagine that you throttle gh-ost for N minutes. At this point the external replica's lag will still be ~0 seconds, but gh-ost's lag will be N minutes.

This is important because it's gh-ost's lag (not the external replica's lag) that determines whether cut-over succeeds or times out.
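To make the distinction concrete, here is a minimal Go sketch (illustrative only, not gh-ost's actual code; names are hypothetical) of the two quantities being compared:

```go
package lagsketch

import "time"

// ReplicaLagSeconds models "Lag" as it is normally measured: the
// Seconds_Behind_Master value reported by an external replica via
// SHOW SLAVE STATUS. It says how far the replica is behind the master,
// not how far gh-ost is behind on the binlog.
type ReplicaLagSeconds int64

// HeartbeatLag models the lag this PR cares about: the age of the most
// recent heartbeat that gh-ost itself has read back off the binlog.
// If gh-ost stops consuming the binlog (for example while throttled),
// this value keeps growing even though the external replica reports ~0 lag.
func HeartbeatLag(lastProcessedHeartbeat time.Time) time.Duration {
	return time.Since(lastProcessedHeartbeat)
}
```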
More Detail

During cut-over, a token of the form `AllEventsUpToLockProcessed:time.Now()` is inserted into the changelog table, and gh-ost then waits up to `--cut-over-lock-timeout-seconds` (default: 3 seconds) for this token to appear on the binlog.

Problem: It's possible to enter this cut-over phase when gh-ost is so far behind on processing the binlog that it could not possibly catch up during the timeout.
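As a rough illustration, here is a hedged Go sketch of that handshake (the type and function names are hypothetical stand-ins, not gh-ost's real ones):

```go
package cutoversketch

import (
	"errors"
	"fmt"
	"time"
)

// migrationContext is a stand-in for gh-ost's real context; only the pieces
// needed by this sketch are modelled.
type migrationContext struct {
	cutOverLockTimeout time.Duration                      // --cut-over-lock-timeout-seconds
	writeChangelog     func(token string) error           // INSERT the token into the changelog table
	tokenSeenOnBinlog  func(token string) <-chan struct{} // closed once gh-ost reads the token back off the binlog
}

// waitForAllEventsUpToLock sketches the cut-over handshake described above.
func waitForAllEventsUpToLock(ctx *migrationContext) error {
	// 1. Insert an AllEventsUpToLockProcessed token into the changelog table.
	token := fmt.Sprintf("AllEventsUpToLockProcessed:%d", time.Now().UnixNano())
	if err := ctx.writeChangelog(token); err != nil {
		return err
	}
	// 2. Wait up to --cut-over-lock-timeout-seconds (default 3s) for gh-ost's
	//    own binlog reader to see that token. If gh-ost is minutes behind on
	//    the binlog, this cannot possibly succeed within the timeout.
	select {
	case <-ctx.tokenSeenOnBinlog(token):
		return nil
	case <-time.After(ctx.cutOverLockTimeout):
		return errors.New("Timeout while waiting for events up to lock")
	}
}
```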
What this PR proposes

This PR adds a new property `CurrentHeartbeatLag` to the `MigrationContext`. `CurrentHeartbeatLag` is updated every time we intercept a binlog event for the changelog table of type heartbeat. Before cut-over, gh-ost waits until `CurrentHeartbeatLag` is less than `--max-lag-millis` (a sketch of this follows the note below).

Note: This PR is best reviewed commit by commit.
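Here is a minimal Go sketch of the proposed behaviour (the field and method names are illustrative assumptions and may not match the merged code exactly):

```go
package heartbeatsketch

import (
	"sync/atomic"
	"time"
)

// migrationContext models just the pieces needed for this sketch.
type migrationContext struct {
	CurrentHeartbeatLag int64 // latest heartbeat lag in nanoseconds, accessed atomically
	maxLagMillis        int64 // --max-lag-millis
}

// onChangelogHeartbeat would be called for every binlog event on the changelog
// table of type heartbeat; the event payload carries the time the heartbeat
// row was written.
func (ctx *migrationContext) onChangelogHeartbeat(writtenAt time.Time) {
	atomic.StoreInt64(&ctx.CurrentHeartbeatLag, int64(time.Since(writtenAt)))
}

// waitForHeartbeatLag blocks the start of cut-over until the heartbeat lag has
// dropped below --max-lag-millis, so the cut-over token has a realistic chance
// of appearing on the binlog within the cut-over lock timeout.
func (ctx *migrationContext) waitForHeartbeatLag() {
	maxLag := time.Duration(ctx.maxLagMillis) * time.Millisecond
	for time.Duration(atomic.LoadInt64(&ctx.CurrentHeartbeatLag)) >= maxLag {
		time.Sleep(100 * time.Millisecond)
	}
}
```

With a check like this in place, cut-over is only attempted once gh-ost has essentially caught up on the binlog, instead of burning through its cut-over retries.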
An example
It's best to demonstrate the value of this change by example.
I am able to reliably reproduce the cut-over problem (40+ failed cut-over attempts) when running gh-ost against an Amazon RDS Aurora DB.
Test setup:
Test process:
Both migrations are run using the following params:

Note: `<TABLE>` and `<UNIQUE_ID>` must be different per migration.

The following logs came from one of the many experiments I ran.
This log was output by the smaller of the two migrations when it got to `13.9%` for row copy:

Important: Notice that `Lag` is `0.01s` but `HeartbeatLag` is `17.92s`. The value of `Lag` is actually meaningless here because we are running with `--allow-on-master`, so we are computing `Lag` by reading a heartbeat row directly from the same table we wrote it to. This explains the extremely low value of `0.01s`.

A few minutes later, when row copy completed, `Lag` was `0.01s` and `HeartbeatLag` was `100.79s`:

This PR causes gh-ost to wait until the heartbeat lag is less than `--max-lag-millis` before continuing with the cut-over.

Note: If we had tried to cut-over during this period where `HeartbeatLag` was greater than `100 seconds`, we would have failed many times.

The heartbeat lag only started to reduce (a few minutes later) when the larger migration's row copy completed. At this point the following message was output:

And then the table cut-over succeeded in a single attempt:
Final Thoughts
This problem is likely exacerbated by Aurora because readers in Aurora do not use binlog replication. This means there is no external replica lag that gh-ost can use to throttle itself so gh-ost ends up backing up the binlog. If gh-ost is the only consumer of the binlog (typical for applications that use Aurora) then only gh-ost's cut-over would suffer. Any observer looking at key health metrics on Aurora's dashboard would conclude that the DB was completely healthy.
The specific problem of backing up the binlog could be solved by getting gh-ost to copy rows much slower. However, gh-ost would still be susceptible to this problem in other ways, for example if gh-ost was throttled heavily just before cut-over. Also, why should we artificially slow down if the DB is perfectly capable of handling the extra load without hurting our SLAs?