-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Describing safe, blocking, pure-mysql cut-over phase #65
Comments
This was referenced Jun 14, 2016
Merged
Worth notingOn replica, this is a non atomic, two step table swap. The
|
This is implemented in code; so closing. But: there's now a new suggestion: #82 which I believe to be superior. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
UPDATE
#82 overrides this. This is no longer in use.
Finally here's a blocking cut-over phase that will, at worst case (connections die throughout cut-over), create a table-outage (easily reversed).
Here are the steps to a safe solution:
We note different connections as
C1
,C2
, ...Cn
We assume original table is
tbl
, ghost table isghost
.In the below we note
C1
,C18
,C19
as out own, controlling connections. We first assume no error in the below flow:lock tables tbl write
insert
,delete
,update
ontbl
. Because of the lock, they are naturally blocked.ghost
. No new events are coming our way becausetbl
is blocked.rename table tbl to tbl_old
. This gets blocked.rename
is in place (viashow processlist
), and that C1 is still alive; then issues:rename table ghost to tbl
. This gets blocked.tbl
, it doesn't matter, they all get deprioritized, same as C3...C17)unlock tables
What just happened? Let's first explain some stuff:
rename
gets prioritized over the DMLs, even though it came later. That is how MySQL prioritizes queries on metadata-locked tables.tbl
still exists. You'd expect it to fail immediately -- but it does not. It's valid. This is becausetbl
's metadata lock is in use.tbl
even though it doesn't actually exist, because of C19.What happens on failures?
rename
, we get an outage:tbl
is renamed totbl_old
, and the queries get released and complain the table is just not there.rename tbl_old to tbl;
and go drink coffee, then give the entire process another try.rename
, there's no data integrity issue: at this point we've already asserted the tables are in sync. As C1 dies, C18 will immediatelyrename tbl to tbl_old
. An outage will occur, but not for long, because C19 will next issuerename ghost to tbl
, and close the gap. We suffered a minor outage, but no rollback. We roll forward.rename
, nothing bad happens: C19 is still blocking for as long as C1 is running. We find out C18 died, and release C1. C19 attempts torename ghost to tbl
, buttbl
exists (we assume C18 failed) and the query fails. The metadata lock is released and all the queries resume operation on the originaltbl
. The queries suffered a short block, but resume operation automatically. The operation failed but without error. We will need to try the entire cycle again.rename
, same as above.rename tbl_old to tbl
.I'm grateful to reviews over this logic.
The text was updated successfully, but these errors were encountered: