Heartbeat Lag with Aurora blocking cutover. #1081

myusuf3 · 2022-01-23T20:52:33Z

Hello everyone!

I am currently running a long running migration on Aurora MySQL 5.7. I running it on a copy of my production database. With Aurora style replication (meaning not actual replication) I have successfully run it on smaller tables, but when I run a migration on the scale of 20-24 hours I keep running into HeartbeatLag that always increases and never returns once it starts to drift.

Here is a sample output from a recent run.

[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:137795900 to mysql-bin-changelog.011973
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:0 to mysql-bin-changelog.011973
Copy: 111620000/904947480 12.3%; Applied: 0; Backlog: 0/1000; Time: 3h11m30s(total), 3h11m30s(copy); streamer: mysql-bin-changelog.011973:41926866; Lag: 0.02s, HeartbeatLag: 871.02s, State: migrating; ETA: 22h41m3s
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:138079395 to mysql-bin-changelog.011974
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:0 to mysql-bin-changelog.011974
Copy: 111880000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m0s(total), 3h12m0s(copy); streamer: mysql-bin-changelog.011974:23055116; Lag: 0.02s, HeartbeatLag: 892.02s, State: migrating; ETA: 22h41m0s
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:139227492 to mysql-bin-changelog.011975
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:0 to mysql-bin-changelog.011975
Copy: 112140000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m30s(total), 3h12m30s(copy); streamer: mysql-bin-changelog.011975:2853151; Lag: 0.02s, HeartbeatLag: 911.92s, State: migrating; ETA: 22h40m56s
Copy: 112400000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m0s(total), 3h13m0s(copy); streamer: mysql-bin-changelog.011975:121871428; Lag: 0.02s, HeartbeatLag: 931.22s, State: migrating; ETA: 22h40m52s
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:148195656 to mysql-bin-changelog.011976
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:0 to mysql-bin-changelog.011976
Copy: 112650000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m30s(total), 3h13m30s(copy); streamer: mysql-bin-changelog.011976:92692665; Lag: 0.02s, HeartbeatLag: 951.42s, State: migrating; ETA: 22h40m56s
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:140138087 to mysql-bin-changelog.011977
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:0 to mysql-bin-changelog.011977
Copy: 112910000/904947480 12.5%; Applied: 0; Backlog: 0/1000; Time: 3h14m0s(total), 3h14m0s(copy); streamer: mysql-bin-changelog.011977:71588220; Lag: 0.02s, HeartbeatLag: 971.02s, State: migrating; ETA: 22h40m51s

I have completed this same migration up until the cut over phase and it was just blocked there due to HeartbeatLag being so high by the time the migration completed. If really odd when the HeartbeatLag kicks in and simply doesn't return to lower values ever increasing from then on out.

I have also set the binlog retention to 72 hours.

call mysql.rds_set_configuration('binlog retention hours', 72);

I even tried to forcibly trigger the cutover and it replied with nothing is blocking the cutover.

I have followed all the recommended Aurora settings aside from running the migration on master directly.

Database backup retention is greater than 1 day to enable binlogs
The parameter aurora_enable_repl_bin_log_filtering is set to 0
I have checked clocks to make sure there was no drift.
row based replication
I have also run this same command smaller migrations without issue.

Here is the command I ran.

 gh-ost \
--max-load=Threads_running=100 \
--critical-load=Threads_running=1000 \
--chunk-size=10000 \
--throttle-control-replicas="" \
--max-lag-millis=1500 \
--user="" \
--password="" \
--host=db \
--allow-on-master \
--database="" \
--table="" \
--verbose \
--alter="" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--exact-rowcount \
--concurrent-rowcount \
--default-retries=120 \
--panic-flag-file=/tmp/ghost.panic.flag \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
--initially-drop-ghost-table \
--execute

Any guidance from other aurora users would be appreciated.

The text was updated successfully, but these errors were encountered:

zhang17310521020 · 2022-04-13T06:05:08Z

hello,i have same issue ,the question is saved?

myusuf3 · 2022-04-19T15:45:15Z

hello,i have same issue ,the question is saved?

Nope.

Dasiu · 2024-11-21T09:57:19Z

Bump.

I am testing gh-ost and observed very similar issue. In my case it was CloudSQL MySQL (MySQL in GCP). I executed gh-ost on DB host without replica. Migration was long ~20 hours and was throttled a few times and connection was lost and recovered a few times. At some point "Lag" and "HeartbeatLag" drifted and never went down. Ultimately all rows were copied and all binglogs changes applied but gh-ost was forever applying binglogs events (the only ones left were the ones it itself generated, there was no more traffic on this DB host anymore).

zhang17310521020 · 2024-11-21T10:03:25Z

Finally, I dealt with this problem by using online ddl. ------------------------------------------------------------------ 发件人：Dasiu ***@***.***> 发送时间：2024年11月21日(星期四) 17:57 收件人：github/gh-ost ***@***.***> 抄　送：zhang17310521020 ***@***.***>; Comment ***@***.***> 主　题：Re: [github/gh-ost] Heartbeat Lag with Aurora blocking cutover. (Issue #1081) Bump. I am testing gh-ost and observed very similar issue. In my case it was CloudSQL MySQL (MySQL in GCP). I executed gh-ost on DB host without replica. Migration was long ~20 hours and was throttled a few times and connection was lost and recovered a few times. At some point "Lag" and "HeartbeatLag" drifted and never went down. Ultimately all rows were copied and all binglogs changes applied but gh-ost was forever applying binglogs events (the only ones left were the ones it itself generated, there was no more traffic on this DB host anymore). — Reply to this email directly, view it on GitHub <#1081 (comment) >, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AYWCVYN6XKXERUB4BZI7XSD2BWVBLAVCNFSM6AAAAABSGSUOI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDIOJQGYZDCOBYGA >. You are receiving this because you commented.Message ID: ***@***.***>

myusuf3 mentioned this issue Jan 23, 2022

Cut-over should wait for heartbeat lag to be low enough to succeed openark/gh-ost#14

Merged

timvaillancourt added the question label Jan 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heartbeat Lag with Aurora blocking cutover. #1081

Heartbeat Lag with Aurora blocking cutover. #1081

myusuf3 commented Jan 23, 2022

zhang17310521020 commented Apr 13, 2022

myusuf3 commented Apr 19, 2022

Dasiu commented Nov 21, 2024

zhang17310521020 commented Nov 21, 2024 via email

Heartbeat Lag with Aurora blocking cutover. #1081

Heartbeat Lag with Aurora blocking cutover. #1081

Comments

myusuf3 commented Jan 23, 2022

zhang17310521020 commented Apr 13, 2022

myusuf3 commented Apr 19, 2022

Dasiu commented Nov 21, 2024

zhang17310521020 commented Nov 21, 2024 via email