Ecto migrations do not prevent concurrent migrations for PostgreSQL #276
Comments
Thank you for the report. I was able to reproduce this, which was very surprising to me because I knew for a fact that we're locking the table (grep for "FOR UPDATE" in the ecto_sql project) before running the migration. Turns out this behaviour only happens when …
I think a simple solution would be to create an empty migration when we create the table. WDYT?
I like this. I think it would show up as … with no corresponding migration file in …
Oh, you meant that we generate an empty migration as usual? Totally …
I also saw this behaviour when there were existing migrations in the table, which was the reason for opening this bug. I'll try to create another test case to reproduce the issue.
Yes, exactly.
Like @wojtekmach, I could not reproduce it after any migration had run.
So it looks like the migration locking logic is something like:
So it's possible for two processes to both try to execute the same list of migrations, but usually this race doesn't happen: one process gets the list of migrations it needs to execute and then mostly holds the lock on the migration table, preventing the other process from making progress (at least until the first migration runs and the lock is presumably released). It is also important to note that if this race does happen, there is a further lock-and-check that is meant to prevent it from being a problem. However, this check is bugged and doesn't prevent the same migration from being run twice.

I think this is a visibility issue: when you run the 'lock for update' query and return the existing migrations, you are querying at transaction time T, but the new migration row is added to the database at transaction time T+N, so the new row is not visible to the query. [I believe that in PG with locks you end up seeing some mix of records at time T and records at time >T where you have lock conflicts under read committed isolation, but that is not relevant here because the missing row was not a target of a lock conflict.] [1]

In databases this problem is usually solved by 'locking the aggregate root': you have some other record that is a logical parent of all the records in the migration table, and you lock around that record [but you need to carefully review the behaviour your database and isolation level give you to see whether this solution is correct]. In Postgres you could probably work around this with advisory locks, or with a lock on the table, though locking the table might cause other problems. You should also be able to perform the query a second time under read committed isolation in PG and see the newly inserted row, since under read committed I believe PG gives you a fresh snapshot for each query you run.

To reproduce:
The second process doesn't see the version added by the first process because of the transaction visibility issue described above. Here is my output:
[1] https://www.postgresql.org/docs/9.5/transaction-iso.html
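The stale-snapshot race described above can be sketched without Postgres at all. Below is a hypothetical Python simulation (not Ecto code): an in-memory set stands in for the `schema_migrations` table, a thread lock stands in for the row lock, and each worker checks a snapshot taken at "transaction start" instead of re-reading under the lock, so both decide the same migration is pending.

```python
import threading

applied = set()               # stands in for the schema_migrations table
row_lock = threading.Lock()   # stands in for the FOR UPDATE row lock
ran = []                      # records which workers "ran" the migration

def migrate(worker, version, snapshot):
    # BUG: the pending check consults the snapshot taken before the lock,
    # mirroring the query-at-time-T visibility issue described above.
    with row_lock:
        if version not in snapshot:
            applied.add(version)   # "insert" the version row
            ran.append(worker)     # the migration body runs here

# Both workers snapshot the table before either one takes the lock.
snapshot_a, snapshot_b = set(applied), set(applied)
workers = [
    threading.Thread(target=migrate, args=("a", 20201010, snapshot_a)),
    threading.Thread(target=migrate, args=("b", 20201010, snapshot_b)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(ran))  # ['a', 'b'] -- the same migration ran twice
```

Re-reading `applied` inside the lock, instead of consulting the snapshot, would let the second worker see the first worker's insert.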
Thanks for the detailed write-up. This is weird because before migrating each version we do acquire the lock again and check whether it has been migrated yet. I am thinking that the lock we are using is wrong altogether... it locks only the records that already exist, but it doesn't re-read the whole table once the lock is free. Can you please try this diff?

```diff
diff --git a/lib/ecto/migrator.ex b/lib/ecto/migrator.ex
index 6cfd25c..5bbc525 100644
--- a/lib/ecto/migrator.ex
+++ b/lib/ecto/migrator.ex
@@ -499,7 +499,11 @@ defmodule Ecto.Migrator do
     all_opts = [prefix: opts[:prefix], timeout: :infinity, log: false]

     {migration_repo, query} = SchemaMigration.versions(repo, config)
-    callback = &fun.(config, migration_repo.all(&1, all_opts))
+
+    callback = fn locked_query ->
+      _ = migration_repo.all(locked_query, all_opts)
+      fun.(config, migration_repo.all(query, all_opts))
+    end

     result =
       if should_lock? do
```
I think that fixes the issue, from looking at the code and also from failing to reproduce it after applying the change. I think it's a bit of a weird way of doing locking, but it seems to work, so :/ We are not always locking all the records in the table, but it seems we don't really rely on locking all of them, so this is OK. I'm not sure if you consider the other issue I brought up a problem or not:
|
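A hypothetical Python analogue (not Ecto code) of what the diff a few comments above changes: with a fresh read of the shared state taken only after the lock is held, the second process sees the first one's insert and skips the migration.

```python
import threading

applied = set()               # stands in for the schema_migrations table
row_lock = threading.Lock()   # stands in for the lock the migrator takes
ran = []                      # records which workers "ran" the migration

def migrate(worker, version):
    with row_lock:
        # Fresh read inside the lock: the second worker now sees the
        # version inserted by the first worker and does nothing.
        if version not in applied:
            applied.add(version)
            ran.append(worker)

workers = [threading.Thread(target=migrate, args=(w, 20201010)) for w in ("a", "b")]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(ran))  # 1 -- only one worker ran the migration
```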
That step should not be an issue because the processes are linked, so if the one holding the lock fails, the one migrating should also exit immediately.
The migration can complete and the 'migration' transaction can commit. At this point the Elixir process can exit, and then the insertion of the 'migration completed' record into the schema migrations table will not happen. If you then run the migrations again, the migration will run twice. This seems to violate the user's expectation that migrations only run once. I'm not sure the documentation says they won't run twice, but seeing as you try to prevent them from running multiple times in parallel, it seems reasonable for a user to assume they would not run twice. EDIT: Maybe I'm missing the exact order of things. But if you have two independent transactions, you need some kind of two-phase commit plus a recovery system to ensure integrity between them, or you are going to violate whatever invariant you are trying to protect.
The migration and the insertion of the migration completion happen inside the same transaction. So if the insertion doesn't happen, then none of the migration should happen either.
In an earlier post I mentioned that the insertion of the version was done in a separate transaction, but looking at the code I don't think this is correct, and you might have been misled by that statement of mine. This is my current, improved understanding of the order of events:
Child process:
Under this order, two degenerate cases are possible:
I think that without some kind of two-phase commit you can only exclude one of the two cases above. If you move the insertion of the schema record into the async transaction (or if that is what actually happens already, which is what you think is happening at the moment) before the migration starts, then it should not be possible for either case to happen. Note: if you do the insert afterwards, then it might be possible for the migration to 'run twice' because of a race (this is a bit difficult to reason about and I might be wrong). Though in the case where it runs 'twice', the second transaction will be aborted because of the failed insert, so that might not be considered a problem: the insert creates mutual exclusion because the version is a primary key.
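The "insert creates mutual exclusion because it's a primary key" point can be illustrated with a small hypothetical sketch using SQLite from Python's standard library (not Ecto or Postgres; the table and column names are only illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")

def claim(version):
    """Try to claim a migration version; only one caller can succeed."""
    try:
        db.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        return True    # we inserted the row, so we "own" this migration
    except sqlite3.IntegrityError:
        return False   # duplicate primary key: someone else claimed it

print(claim(20201010))  # True  -- first claimant wins
print(claim(20201010))  # False -- second attempt is excluded
```

In a real system the insert and the migration would run inside one transaction, so a failed migration rolls the claim back as well.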
As I mentioned above, the running of migrations and the insertion of completion happen in the same transaction/process, so your degenerate cases cannot happen. Here are the events:
As I said, I'm not very good at reading Ecto code, so I might have misread it. But I'll try to explain why I think it is running in a separate transaction (the 'parent process' transaction), and then you can explain why you think it is running in the same transaction (the 'child process' transaction), and hopefully we can come to an agreement on what is actually going on. After reading your reply, I commented out the following 3 lines, which appear to run in the 'parent process': https://github.com/elixir-ecto/ecto_sql/blob/master/lib/ecto/migrator.ex#L300
After commenting out those 3 lines, it no longer inserts the version into the schema_migrations table, which seems to be evidence that the 'version' is inserted into the 'schema_migrations' table in the parent process. I also have an ASCII dump from the TCP connection that did the insert into the schema_migrations table. My migration did a create table operation, but there is no 'create table' string in the dump, which also suggests that the schema_migrations insertion is done in a different transaction from the one the migration runs in.
whereas in a different connection there is this ASCII dump:
Instead of locking rows, we lock the table resource altogether, either using app locks or table locks. Closes #276.
Hi @bmteller, you were right, apologies. When I said they run in the same transaction, I was looking at my local checkout, which had already moved it inside due to your bug report. My bad! 🤦 In any case, here is a PR that moves the whole approach from row locks to table/app locks: #276. This solves the problem where the first migration would not lock, and also allows us to keep migrating and adding the migration in the same transaction. More feedback is welcome!
* Use proper locks on migration tables: instead of locking rows, we lock the table resource altogether, either using app locks or table locks. Closes #276.
* Add docs
* Remove unused var
* Pass all relevant parameters to SQLServer lock
Environment

- Erlang/OTP 23 [erts-11.0.2] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe] [dtrace]
- Elixir 1.10.4 (compiled with Erlang/OTP 23)
- PostgreSQL 12.3
- ecto 3.5.1 (Hex package) (mix)
- ecto_sql 3.5.0 (Hex package) (mix)
- Database adapter and version (mix deps): postgrex 0.15.6 (Hex package) (mix)
- Operating system: osx
Current behavior
I create a migration:
I concurrently try to execute the migration in two processes:
Both of them try to execute the migration:
Expected behavior
In the documentation it says:
I would expect that if you tried to run two migrations in parallel, one would block waiting for the other to finish, or one would fail. I would not expect both of them to run, with one getting a weird error.
Having a quick look at the code, it looks like the migration does run in a transaction, but no lock is taken on the schema migrations table.