Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Heartbeat Lag with Aurora blocking cutover. #1081

Open
myusuf3 opened this issue Jan 23, 2022 · 4 comments
Open

Heartbeat Lag with Aurora blocking cutover. #1081

myusuf3 opened this issue Jan 23, 2022 · 4 comments
Labels

Comments

@myusuf3
Copy link

myusuf3 commented Jan 23, 2022

Hello everyone!

I am currently running a long running migration on Aurora MySQL 5.7. I running it on a copy of my production database. With Aurora style replication (meaning not actual replication) I have successfully run it on smaller tables, but when I run a migration on the scale of 20-24 hours I keep running into HeartbeatLag that always increases and never returns once it starts to drift.

Here is a sample output from a recent run.

[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
[2022/01/23 20:42:52] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011973, 4)
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:137795900 to mysql-bin-changelog.011973
2022-01-23 20:42:52 INFO rotate to next log from mysql-bin-changelog.011973:0 to mysql-bin-changelog.011973
Copy: 111620000/904947480 12.3%; Applied: 0; Backlog: 0/1000; Time: 3h11m30s(total), 3h11m30s(copy); streamer: mysql-bin-changelog.011973:41926866; Lag: 0.02s, HeartbeatLag: 871.02s, State: migrating; ETA: 22h41m3s
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
[2022/01/23 20:43:27] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011974, 4)
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:138079395 to mysql-bin-changelog.011974
2022-01-23 20:43:27 INFO rotate to next log from mysql-bin-changelog.011974:0 to mysql-bin-changelog.011974
Copy: 111880000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m0s(total), 3h12m0s(copy); streamer: mysql-bin-changelog.011974:23055116; Lag: 0.02s, HeartbeatLag: 892.02s, State: migrating; ETA: 22h41m0s
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
[2022/01/23 20:44:02] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011975, 4)
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:139227492 to mysql-bin-changelog.011975
2022-01-23 20:44:02 INFO rotate to next log from mysql-bin-changelog.011975:0 to mysql-bin-changelog.011975
Copy: 112140000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h12m30s(total), 3h12m30s(copy); streamer: mysql-bin-changelog.011975:2853151; Lag: 0.02s, HeartbeatLag: 911.92s, State: migrating; ETA: 22h40m56s
Copy: 112400000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m0s(total), 3h13m0s(copy); streamer: mysql-bin-changelog.011975:121871428; Lag: 0.02s, HeartbeatLag: 931.22s, State: migrating; ETA: 22h40m52s
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
[2022/01/23 20:44:40] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011976, 4)
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:148195656 to mysql-bin-changelog.011976
2022-01-23 20:44:40 INFO rotate to next log from mysql-bin-changelog.011976:0 to mysql-bin-changelog.011976
Copy: 112650000/904947480 12.4%; Applied: 0; Backlog: 0/1000; Time: 3h13m30s(total), 3h13m30s(copy); streamer: mysql-bin-changelog.011976:92692665; Lag: 0.02s, HeartbeatLag: 951.42s, State: migrating; ETA: 22h40m56s
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
[2022/01/23 20:45:15] [info] binlogsyncer.go:791 rotate to (mysql-bin-changelog.011977, 4)
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:140138087 to mysql-bin-changelog.011977
2022-01-23 20:45:15 INFO rotate to next log from mysql-bin-changelog.011977:0 to mysql-bin-changelog.011977
Copy: 112910000/904947480 12.5%; Applied: 0; Backlog: 0/1000; Time: 3h14m0s(total), 3h14m0s(copy); streamer: mysql-bin-changelog.011977:71588220; Lag: 0.02s, HeartbeatLag: 971.02s, State: migrating; ETA: 22h40m51s

I have completed this same migration up until the cut over phase and it was just blocked there due to HeartbeatLag being so high by the time the migration completed. If really odd when the HeartbeatLag kicks in and simply doesn't return to lower values ever increasing from then on out.

I have also set the binlog retention to 72 hours.

call mysql.rds_set_configuration('binlog retention hours', 72);

I even tried to forcibly trigger the cutover and it replied with nothing is blocking the cutover.

I have followed all the recommended Aurora settings aside from running the migration on master directly.

  • Database backup retention is greater than 1 day to enable binlogs
  • The parameter aurora_enable_repl_bin_log_filtering is set to 0
  • I have checked clocks to make sure there was no drift.
  • row based replication
  • I have also run this same command smaller migrations without issue.

Here is the command I ran.

 gh-ost \
--max-load=Threads_running=100 \
--critical-load=Threads_running=1000 \
--chunk-size=10000 \
--throttle-control-replicas="" \
--max-lag-millis=1500 \
--user="" \
--password="" \
--host=db \
--allow-on-master \
--database="" \
--table="" \
--verbose \
--alter="" \
--switch-to-rbr \
--allow-master-master \
--cut-over=default \
--exact-rowcount \
--concurrent-rowcount \
--default-retries=120 \
--panic-flag-file=/tmp/ghost.panic.flag \
--postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
--initially-drop-ghost-table \
--execute

Any guidance from other aurora users would be appreciated.

@zhang17310521020
Copy link

hello,i have same issue ,the question is saved?

@myusuf3
Copy link
Author

myusuf3 commented Apr 19, 2022

hello,i have same issue ,the question is saved?

Nope.

@Dasiu
Copy link

Dasiu commented Nov 21, 2024

Bump.

I am testing gh-ost and observed very similar issue. In my case it was CloudSQL MySQL (MySQL in GCP). I executed gh-ost on DB host without replica. Migration was long ~20 hours and was throttled a few times and connection was lost and recovered a few times. At some point "Lag" and "HeartbeatLag" drifted and never went down. Ultimately all rows were copied and all binglogs changes applied but gh-ost was forever applying binglogs events (the only ones left were the ones it itself generated, there was no more traffic on this DB host anymore).

@zhang17310521020
Copy link

zhang17310521020 commented Nov 21, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants