Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Atomic cut-over failing due to RENAME running on an unexpected connection id #1449

Closed
qsweber opened this issue Sep 6, 2024 · 1 comment
Closed

Comments

@qsweber
Copy link

qsweber commented Sep 6, 2024

When running gh-ost version 1.1.6 with the the following arguments, I am encountering an issue with the RENAME being stuck in a state of "Waiting for table metadata lock".

Gh-ost Command

gh-ost \
    -max-load=Threads_running=25 \
    -critical-load=Threads_running=100 \
    -allow-on-master \
    -user="foo" \
    -password="foo" \
    -host="foo" \
    -verbose \
    -database="qwebertest" \
    -table="activity_log_small" \
    -alter="ADD COLUMN test_column_eight BINARY(16)" \
    -cut-over=atomic \
    -initially-drop-ghost-table \
    -initially-drop-old-table \
    -initially-drop-socket-file \
    -chunk-size=500 \
    -exact-rowcount \
    -switch-to-rbr \
    -concurrent-rowcount \
    -default-retries=120 \
    -panic-flag-file=/tmp/ghost.panic.flag \
    -postpone-cut-over-flag-file=/tmp/ghost.postpone.flag \
    -cut-over-lock-timeout-seconds=10 \
    -execute 2>&1 | tee logs_attempt_8.log

Logs

While this command is running, it emits these logs:

2024-09-05 12:32:33 INFO Grabbing voluntary lock: gh-ost.110215429.lock
2024-09-05 12:32:33 INFO Setting LOCK timeout as 20 seconds
2024-09-05 12:32:33 INFO Looking for magic cut-over table
2024-09-05 12:32:33 INFO Creating magic cut-over table `qwebertest`.`_activity_log_small_del`
2024-09-05 12:32:33 INFO Magic cut-over table created
2024-09-05 12:32:33 INFO Locking `qwebertest`.`activity_log_small`, `qwebertest`.`_activity_log_small_del`
2024-09-05 12:32:33 INFO Tables locked
2024-09-05 12:32:33 INFO Session locking original & magic tables is 110215429
2024-09-05 12:32:33 INFO Writing changelog state: AllEventsUpToLockProcessed:1725564753915176000
2024-09-05 12:32:34 INFO Intercepted changelog state AllEventsUpToLockProcessed
2024-09-05 12:32:34 INFO Handled changelog state AllEventsUpToLockProcessed
2024-09-05 12:32:35 INFO Waiting for events up to lock
2024-09-05 12:32:35 INFO Waiting for events up to lock: got AllEventsUpToLockProcessed:1725564753915176000
2024-09-05 12:32:35 INFO Done waiting for events up to lock; duration=1.135605542s
2024-09-05 12:32:35 INFO Setting RENAME timeout as 10 seconds
2024-09-05 12:32:35 INFO Session renaming tables is 110198088
2024-09-05 12:32:36 INFO Issuing and expecting this to block: rename /* gh-ost */ table `qwebertest`.`activity_log_small` to `qwebertest`.`_activity_log_small_del`, `qwebertest`.`_activity_log_small_gho` to `qwebertest`.`activity_log_small`
2024-09-05 12:32:46 ERROR Error 1205: Lock wait timeout exceeded; try restarting transaction
2024-09-05 12:32:46 INFO Will now proceed to drop magic table and unlock tables
2024-09-05 12:32:46 INFO Dropping magic cut-over table
2024-09-05 12:32:46 INFO Releasing lock from `qwebertest`.`activity_log_small`, `qwebertest`.`_activity_log_small_del`
2024-09-05 12:32:46 INFO Tables unlocked
2024-09-05 12:32:46 INFO Looking for magic cut-over table
2024-09-05 12:32:47 INFO Checking session lock: gh-ost.110215429.lock
2024-09-05 12:32:48 ERROR Session lock gh-ost.110215429.lock expected to be found but wasn't
2024-09-05 12:32:48 INFO Removing socket file: /tmp/gh-ost.qwebertest.activity_log_small.sock
2024-09-05 12:32:48 FATAL Session lock gh-ost.110215429.lock expected to be found but wasn't

Unable to find the expected entry in the processlist

After this log was emitted Issuing and expecting this to block..., I looked at the processlist:

mysql> show processlist;
+----------+---------+-------------------+--------------------+-------------+------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| Id       | User    | Host              | db                 | Command     | Time | State                                                           | Info                                                                                                 |
+----------+---------+-------------------+--------------------+-------------+------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
| 16851464 | foouser | 00.00.0.003:43342 | qwebertest         | Query       |    2 | Waiting for table metadata lock                                 | rename /* gh-ost */ table `qwebertest`.`activity_log_small` to `qwebertest`.`_activity_log_small_del |
| 16852076 | foouser | 00.00.00.00:54836 | qwebertest         | Sleep       |    0 |                                                                 | NULL                                                                                                 |
| 16852531 | foouser | 00.00.0.005:39916 | qwebertest         | Sleep       |    4 |                                                                 | NULL                                                                                                 |
| 16852552 | foouser | 00.00.0.005:39958 | qwebertest         | Sleep       |  135 |                                                                 | NULL                                                                                                 |
| 16852793 | foouser | 00.00.00.00:58424 | information_schema | Sleep       |  132 |                                                                 | NULL                                                                                                 |
| 16852812 | foouser | 00.00.00.00:58478 | information_schema | Query       |    0 | init                                                            | show processlist                                                                                     |
| 16852995 | foouser | 00.00.00.00:59596 | information_schema | Binlog Dump |  136 | Source has sent all binlog to replica; waiting for more updates | NULL                                                                                                 |
+----------+---------+-------------------+--------------------+-------------+------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------------+
7 rows in set (0.09 sec)

You can clearly see the RENAME is there with the state of "waiting for table metadata lock", which is what the tool expects:

return this.applier.ExpectProcess(renameSessionId, "metadata lock", "rename")

However, we never see the the log that comes immediately after that:

this.migrationContext.Log.Infof("Found atomic RENAME to be blocking, as expected. Double checking the lock is still in place (though I don't strictly have to)")

Therefore, the tool gets stuck in retrying the query for SHOW PROCESSLIST.

I think this is because of the following oddity of my session:

mysql> select id, command, info from information_schema.processlist;
+----------+---------+--------------------------------------------------------------+
| id       | command | info                                                         |
+----------+---------+--------------------------------------------------------------+
| 16858880 | Sleep   | NULL                                                         |
| 16858899 | Sleep   | NULL                                                         |
| 16858891 | Sleep   | NULL                                                         |
| 16858533 | Query   | select id, command, info from information_schema.processlist |
+----------+---------+--------------------------------------------------------------+
4 rows in set (0.08 sec)

mysql> select connection_id();
+-----------------+
| connection_id() |
+-----------------+
|       125465300 |
+-----------------+
1 row in set (0.09 sec)

mysql> SELECT @@pseudo_thread_id;
+--------------------+
| @@pseudo_thread_id |
+--------------------+
|           16858533 |
+--------------------+
1 row in set (0.09 sec)

You can see that the entry in the processlist is using the @@pseudo_thread_id, not the connection_id().

But the connection_id() is what gets passed in to the query:

if err := tx.QueryRow(`select connection_id()`).Scan(&sessionId); err != nil {

and ? in (0, id)

So in conclusion, the query is running on on an unexpected connection identifier.

I believe this also explains why I'm getting this error log:

Session lock gh-ost.110215429.lock expected to be found but wasn't

The connection ID returned by is_used_lock is not the value we are expecting:

if err := this.db.QueryRow(query, lockName).Scan(&result); err != nil || result != sessionId {

@qsweber qsweber closed this as completed Sep 6, 2024
@qsweber qsweber reopened this Sep 6, 2024
@qsweber qsweber changed the title Atomic cut-over failing due to RENAME being stuck "waiting for table metadata lock" Atomic cut-over failing due to RENAME running on an unexpected connection id Sep 6, 2024
@qsweber
Copy link
Author

qsweber commented Sep 6, 2024

I figured this out. It is related to me running the command against a ProxySQL host, which caused the divergence between the thread_id and connection_id. I was able to fix by connecting directly the MySQL host.

@qsweber qsweber closed this as completed Sep 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant