Ecto Migrations does not prevent concurrent migrations for Postgresql #276

Closed
bmteller opened this issue Oct 9, 2020 · 16 comments · Fixed by #277
@bmteller commented Oct 9, 2020

Environment

  • Elixir version (elixir -v):

Erlang/OTP 23 [erts-11.0.2] [source] [64-bit] [smp:16:16] [ds:16:16:10] [async-threads:1] [hipe] [dtrace]

Elixir 1.10.4 (compiled with Erlang/OTP 23)

  • Database and version (PostgreSQL 9.4, MongoDB 3.2, etc.):

PostgreSQL 12.3

  • Ecto version (mix deps):

ecto 3.5.1 (Hex package) (mix)
ecto_sql 3.5.0 (Hex package) (mix)

  • Database adapter and version (mix deps):
    postgrex 0.15.6 (Hex package) (mix)

  • Operating system:

macOS

Current behavior

I create a migration:

defmodule Locky.Repo.Migrations.Slow do
  use Ecto.Migration

  def change do
    IO.puts("KEKW SLEEP")
    Process.sleep(30_000)
    IO.puts("KEKW FINISHED")
  end
end

I concurrently run the migration from two processes, and both of them execute it:

➜  locky mix ecto.migrate
KEKW SLEEP

20:00:23.844 [info]  == Running 20201009184940 Locky.Repo.Migrations.Slow.change/0 forward
KEKW FINISHED

20:00:53.851 [info]  == Migrated 20201009184940 in 30.0s
➜  locky mix ecto.migrate
KEKW SLEEP

20:00:25.344 [info]  == Running 20201009184940 Locky.Repo.Migrations.Slow.change/0 forward
KEKW FINISHED

20:00:55.352 [info]  == Migrated 20201009184940 in 30.0s

20:00:55.379 [error] Could not update schema migrations. This error usually happens due to the following:

  * The database does not exist
  * The "schema_migrations" table, which Ecto uses for managing
    migrations, was defined by another library
  * There is a deadlock while migrating (such as using concurrent
    indexes with a migration_lock)

To fix the first issue, run "mix ecto.create".

To address the second, you can run "mix ecto.drop" followed by
"mix ecto.create". Alternatively you may configure Ecto to use
another table and/or repository for managing migrations:

    config :locky, Locky.Repo,
      migration_source: "some_other_table_for_schema_migrations",
      migration_repo: AnotherRepoForSchemaMigrations

The full error report is shown below.

** (Ecto.ConstraintError) constraint error when attempting to insert struct:

    * schema_migrations_pkey (unique_constraint)

If you would like to stop this constraint violation from raising an
exception and instead add it as an error to your changeset, please
call `unique_constraint/3` on your changeset with the constraint
`:name` as an option.

The changeset has not defined any constraint.

    (ecto 3.5.1) lib/ecto/repo/schema.ex:688: anonymous fn/4 in Ecto.Repo.Schema.constraints_to_errors/3
    (elixir 1.10.4) lib/enum.ex:1396: Enum."-map/2-lists^map/1-0-"/2
    (ecto 3.5.1) lib/ecto/repo/schema.ex:673: Ecto.Repo.Schema.constraints_to_errors/3
    (ecto 3.5.1) lib/ecto/repo/schema.ex:654: Ecto.Repo.Schema.apply/4
    (ecto 3.5.1) lib/ecto/repo/schema.ex:262: anonymous fn/15 in Ecto.Repo.Schema.do_insert/4
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:661: Ecto.Migrator.verbose_schema_migration/3
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:300: Ecto.Migrator.async_migrate_maybe_in_transaction/7
    (ecto_sql 3.5.0) lib/ecto/adapters/sql.ex:959: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.2.2) lib/db_connection.ex:1427: DBConnection.run_transaction/4
    (ecto_sql 3.5.0) lib/ecto/adapters/sql.ex:786: Ecto.Adapters.SQL.lock_for_migrations/5
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:511: Ecto.Migrator.lock_for_migrations/4
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:639: anonymous fn/5 in Ecto.Migrator.migrate/4
    (elixir 1.10.4) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:638: Ecto.Migrator.migrate/4
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:145: Ecto.Migrator.with_repo/3
    (ecto_sql 3.5.0) lib/mix/tasks/ecto.migrate.ex:133: anonymous fn/5 in Mix.Tasks.Ecto.Migrate.run/2
    (elixir 1.10.4) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto_sql 3.5.0) lib/mix/tasks/ecto.migrate.ex:121: Mix.Tasks.Ecto.Migrate.run/2
    (mix 1.10.4) lib/mix/task.ex:330: Mix.Task.run_task/3
    (mix 1.10.4) lib/mix/cli.ex:82: Mix.CLI.run_task/2

Expected behavior

In the documentation it says:

Ecto also locks the schema_migrations table when running migrations, guaranteeing two different servers cannot run the same migration at the same time.

I would expect that if you tried to run two migrations in parallel, one would block waiting for the other to finish, or one would fail. I would not expect both of them to run, with one getting a confusing error.


Having a quick look at the code, it looks like the migration does run in a transaction, but no lock is taken on the schema_migrations table.

@wojtekmach (Member) commented Oct 10, 2020

Thank you for the report.

I was able to reproduce this, which was very surprising to me, because I knew for a fact that we lock the table (grep for "FOR UPDATE" in the ecto_sql project) before running the migration.

It turns out this behaviour only happens when the schema_migrations table is empty. If there is at least one migration, we do acquire the lock and wait until the other process that started the migration finishes.
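
For context, the row-level locking in question amounts to something like the following (a simplified sketch assuming a Repo module, not the exact ecto_sql code):

# Hypothetical sketch of the row-locking idea: each migrator does roughly
# this inside a transaction before running migrations.
Repo.transaction(fn ->
  # FOR UPDATE locks only the rows that currently exist. On an empty
  # schema_migrations table there is nothing to lock, so a concurrent
  # migrator's identical query returns immediately and both proceed.
  Ecto.Adapters.SQL.query!(
    Repo,
    "SELECT version FROM schema_migrations FOR UPDATE",
    []
  )

  # ... compute pending migrations and run them ...
end)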

@josevalim (Member) commented Oct 10, 2020 via email

@wojtekmach (Member):

I like this. I think it would show up as a version with no corresponding migration file in mix ecto.migrations, but we could special-case it there.

@wojtekmach (Member):

Oh, you meant that we generate an empty migration as usual? Totally.

@bmteller (Author):

I also saw this behaviour when there were existing migrations in the table, which was the reason for opening this bug. I'll try to create another test case to reproduce the issue.

@josevalim (Member):

> I like this. I think it would show up as a version with no corresponding migration file in mix ecto.migrations, but we could special-case it there.

Yes, exactly.

> I also saw this behaviour when there were existing migrations in the table, which was the reason for opening this bug. I'll try to create another test case to reproduce the issue.

Like @wojtekmach, I could not reproduce it once any migration had run.

@bmteller (Author) commented Oct 11, 2020

So it looks like the migration locking logic is something like:

  1. lock FOR UPDATE on the schema_migrations table
  2. read the list of current versions
  3. release the lock
  4. for each migration:
  5. lock FOR UPDATE on the schema_migrations table
  6. check that the migration has not already run
  7. run the migration in a separate transaction and insert the migration version in that transaction?? [There is another potential issue with this step: if the transaction holding the lock aborts (e.g. due to a network issue), then two processes can execute the same migration. There is an unwritten assumption that the second transaction is guaranteed to complete before the first, but this is not true.]

So it's possible for two processes to both try to execute the same list of migrations, but usually this race doesn't happen, because one process gets the list of migrations it needs to execute and then mostly holds the lock on the migration table, preventing the other process from making progress (at least until the first migration runs and the lock is presumably released).

Also, it is important to note that if this race does happen, there is a further lock-and-check that is meant to prevent it from being a problem. However, this check is bugged and doesn't prevent the same migration from running twice. I think this is a visibility issue: when you do the 'lock for update' query that returns the existing migrations, you are querying at transaction time T, but the new migration row is added to the database at transaction time T+N, so the new row is not visible to the query. [I believe that in PG, with locks, you can end up seeing a mix of records from time T and records from time >T where you have lock conflicts under read committed isolation, but that is not relevant here, because the missing row was not the target of a lock conflict.] [1]

Usually in databases this problem is solved by 'locking the aggregate root': you have some other record that is a logical parent of all the records in the migration table, and you lock around that record [but you need to carefully review the behaviour your database + isolation level gives you to see whether this solution is correct]. In Postgres you can probably use advisory locks to work around this, or a lock on the table, though locking the table might cause other problems. You should also be able to perform the query a second time under read committed isolation in PG and see the newly inserted row, since under read committed I believe PG gives you a fresh snapshot for each query you run.
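
For illustration, a minimal sketch of the advisory-lock workaround mentioned above (the lock key 42 and the Repo module are arbitrary placeholders, not part of ecto_sql):

# Hypothetical sketch: serialize migrators with a transaction-scoped
# Postgres advisory lock instead of row-level FOR UPDATE.
Repo.transaction(fn ->
  # Blocks until no other transaction holds the same advisory lock;
  # the lock is released automatically on commit or abort.
  Ecto.Adapters.SQL.query!(Repo, "SELECT pg_advisory_xact_lock(42)", [])

  # Re-read the versions *after* acquiring the lock: under read committed
  # isolation this fresh query sees rows committed by a concurrent migrator.
  %{rows: versions} =
    Ecto.Adapters.SQL.query!(Repo, "SELECT version FROM schema_migrations", [])

  # ... use `versions` to compute and run only the still-pending migrations ...
end)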

To reproduce:

  1. have 1 migration already ran and the 'locky' migration pending from above.
  2. In ecto_sql/lib/migrator.ex in the run method add a sleep after lock_for_migrations and before migrate. I've added my listing below for how my file looks like. this just makes it possibly to deterministically trigger the first race condition.
  3. run one mix ecto.migrate process and wait until it says: 'GOT VERSIONS: SLEEPING'
  4. run second mix ecto.migrate process
    pending =
      lock_for_migrations true, repo, opts, fn _config, versions ->
        cond do
          opts[:all] ->
            pending_all(versions, migration_source, direction)
          to = opts[:to] ->
            pending_to(versions, migration_source, direction, to)
          step = opts[:step] ->
            pending_step(versions, migration_source, direction, step)
          true ->
            {:error, ArgumentError.exception("expected one of :all, :to, or :step strategies")}
        end
      end

    IO.puts("GOT VERSIONS: SLEEPING")
    Process.sleep(10_000)
    ensure_no_duplication!(pending)
    migrate(Enum.map(pending, &load_migration!/1), direction, repo, opts)

The second process doesn't see the version added by the first process because of the transaction-visibility issue described above.

Here is my output:

➜  locky mix ecto.migrate
GOT VERSIONS: SLEEPING
version 20201009184940 in versions: [20201009184939]

15:47:28.390 [info]  == Running 20201009184940 Locky.Repo.Migrations.Slow.up/0 forward

15:47:28.392 [info]  create table foo
KEKW SLEEP
KEKW FINISHED

15:47:58.399 [info]  == Migrated 20201009184940 in 30.0s
➜  locky mix ecto.migrate
GOT VERSIONS: SLEEPING
version 20201009184940 in versions: [20201009184939]

15:47:58.444 [info]  == Running 20201009184940 Locky.Repo.Migrations.Slow.up/0 forward

15:47:58.446 [info]  create table foo
** (Postgrex.Error) ERROR 42P07 (duplicate_table) relation "foo" already exists
    (ecto_sql 3.5.0) lib/ecto/adapters/sql.ex:683: Ecto.Adapters.SQL.raise_sql_call_error/1
    (elixir 1.10.4) lib/enum.ex:1396: Enum."-map/2-lists^map/1-0-"/2
    (ecto_sql 3.5.0) lib/ecto/adapters/sql.ex:770: Ecto.Adapters.SQL.execute_ddl/4
    (ecto_sql 3.5.0) lib/ecto/migration/runner.ex:343: Ecto.Migration.Runner.log_and_execute_ddl/3
    (ecto_sql 3.5.0) lib/ecto/migration/runner.ex:117: anonymous fn/6 in Ecto.Migration.Runner.flush/0
    (elixir 1.10.4) lib/enum.ex:2111: Enum."-reduce/3-lists^foldl/2-0-"/3
    (ecto_sql 3.5.0) lib/ecto/migration/runner.ex:116: Ecto.Migration.Runner.flush/0
    priv/repo/migrations/20201009184940_slow.exs:6: Locky.Repo.Migrations.Slow.up/0
    (ecto_sql 3.5.0) lib/ecto/migration/runner.ex:279: Ecto.Migration.Runner.perform_operation/3
    (stdlib 3.13) timer.erl:166: :timer.tc/1
    (ecto_sql 3.5.0) lib/ecto/migration/runner.ex:25: Ecto.Migration.Runner.run/8
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:349: Ecto.Migrator.attempt/8
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:249: anonymous fn/5 in Ecto.Migrator.do_up/5
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:331: anonymous fn/3 in Ecto.Migrator.run_maybe_in_transaction/6
    (ecto_sql 3.5.0) lib/ecto/adapters/sql.ex:959: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
    (db_connection 2.2.2) lib/db_connection.ex:1427: DBConnection.run_transaction/4
    (ecto_sql 3.5.0) lib/ecto/migrator.ex:330: Ecto.Migrator.run_maybe_in_transaction/6
    (elixir 1.10.4) lib/task/supervised.ex:90: Task.Supervised.invoke_mfa/2
    (elixir 1.10.4) lib/task/supervised.ex:35: Task.Supervised.reply/5
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

[1] https://www.postgresql.org/docs/9.5/transaction-iso.html

Read Committed is the default isolation level in PostgreSQL. When a transaction uses this isolation level, a SELECT query (without a FOR UPDATE/SHARE clause) sees only data committed before the query began; it never sees either uncommitted data or changes committed during query execution by concurrent transactions. In effect, a SELECT query sees a snapshot of the database as of the instant the query begins to run. However, SELECT does see the effects of previous updates executed within its own transaction, even though they are not yet committed. Also note that two successive SELECT commands can see different data, even though they are within a single transaction, if other transactions commit changes after the first SELECT starts and before the second SELECT starts.

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row. The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition. If so, the second updater proceeds with its operation using the updated version of the row. In the case of SELECT FOR UPDATE and SELECT FOR SHARE, this means it is the updated version of the row that is locked and returned to the client.

@josevalim (Member):

Thanks for the detailed write-up. This is weird, because before migrating each version we do acquire the lock again and check whether it has been migrated yet. I am thinking that the lock we are using is wrong altogether... it locks only the records that exist, but it doesn't re-read the whole table once the lock is free. Can you please try this diff?

diff --git a/lib/ecto/migrator.ex b/lib/ecto/migrator.ex
index 6cfd25c..5bbc525 100644
--- a/lib/ecto/migrator.ex
+++ b/lib/ecto/migrator.ex
@@ -499,7 +499,11 @@ defmodule Ecto.Migrator do
 
       all_opts = [prefix: opts[:prefix], timeout: :infinity, log: false]
       {migration_repo, query} = SchemaMigration.versions(repo, config)
-      callback = &fun.(config, migration_repo.all(&1, all_opts))
+
+      callback = fn locked_query ->
+        _ = migration_repo.all(locked_query, all_opts)
+        fun.(config, migration_repo.all(query, all_opts))
+      end
 
       result =
         if should_lock? do

@bmteller (Author):

I think that fixes the issue, based on reading the code and on failing to reproduce it after applying the change. I think it's a bit of a weird way of doing locking, but it seems to work, so :/ We are not always locking all the records in the table, but since we don't really rely on locking all of them, this is OK.

I'm not sure if you consider the other issue I brought up a problem or not:

> run the migration in a separate transaction and insert the migration version in that transaction?? [There is another potential issue with this step: if the transaction holding the lock aborts (e.g. due to a network issue), then two processes can execute the same migration. There is an unwritten assumption that the second transaction is guaranteed to complete before the first, but this is not true.]

@josevalim (Member):

That step should not be an issue, because the processes are linked: if the one holding the lock fails, the one migrating should also exit immediately.

@bmteller (Author) commented Oct 11, 2020

The migration can complete and the 'migration' transaction can commit. At this point the Elixir process can exit, and then the insertion of the 'migration completed' record into the schema_migrations table will not happen. If you then run the migrations again, the migration will run twice. This seems to violate users' expectation that migrations only run once. I'm not sure the documentation says they won't run twice, but seeing as you try to prevent them from running multiple times in parallel, it seems reasonable for a user to assume they would not run twice.

EDIT: Maybe I'm missing the exact order of things. But if you have two independent transactions, you need some kind of two-phase commit + recovery system to ensure integrity between them, or you are going to violate whatever invariant you are trying to protect.

@josevalim (Member):

The migration and the insertion of the migration completion happen inside the same transaction. So if the insertion doesn't happen, then none of the migration should happen.

@bmteller (Author) commented Oct 11, 2020

In an earlier post I mentioned that the insertion of the version was done in a separate transaction; looking at the code, I don't think this is correct, and you might have been misled by that statement of mine.

This is my current, improved understanding of the order of events:

  1. lock FOR UPDATE on the schema_migrations table
  2. read the list of current versions
  3. release the lock
  4. for each migration:
  5. lock FOR UPDATE on the schema_migrations table
  6. check that the migration has not already run
  7. start a child process to run the migration in a separate transaction
  8. insert the schema migration into the table (I'm guessing this happens in the same TX that is holding the lock, but I'm not very good at reading Ecto code, so I might be wrong here)
  9. signal the child to commit
  10. commit??

Child process:

  1. run the migration in a separate transaction
  2. tell the parent process we have completed successfully
  3. wait for the parent to reply
  4. commit

Under this ordering, two degenerate cases are possible:

  • the schema migration is inserted but the migration did not happen
  • the schema migration is not inserted but the migration did happen

I think without some kind of two-phase commit you can only exclude one of the two cases above. If you move the insertion of the schema record into the async transaction (or if that is what actually happens, which is what you think is happening at the moment) before the migration starts, then it should not be possible for either case to happen. Note: if you do the insert afterwards, it might be possible for the migration to 'run twice' because of a race (this is a bit difficult to reason about and I might be wrong); though in the case where it runs 'twice', the second transaction will be aborted because of the failed insert, so that might not be considered a problem. The insert creates a mutual exclusion because the version is a primary key.

@josevalim (Member):

As I mentioned above, the running of the migration and the insertion of its completion happen in the same transaction/process, so your degenerate cases cannot happen. Here are the events:

  1. Parent: Lock migrations
  2. Parent: Query migrations
  3. Parent: Spawn child process
  4. Child: Run migration
  5. Child: Insert completion
  6. Child: exit
  7. Parent: Complete transaction and release lock

@bmteller (Author):

As I said, I'm not very good at reading Ecto code, so I might have misread it. But I'll try to explain why I think the version insert runs in a separate transaction (the 'parent process' transaction), and then you can explain why you think it runs in the same transaction (the 'child process' transaction), and hopefully we can agree on what is actually going on.

So after reading your reply I commented out the following three lines, which appear to run in the 'parent process':

https://github.com/elixir-ecto/ecto_sql/blob/master/lib/ecto/migrator.ex#L300

        verbose_schema_migration repo, "update schema migrations", fn ->
          apply(SchemaMigration, direction, [repo, config, version, opts[:prefix]])
        end

After commenting out those three lines, the version is no longer inserted into the schema_migrations table, which suggests that the version is being inserted in the parent process.

I also have an ASCII dump from the TCP connection that did the insert into the schema_migrations table. My migration did a create table operation, but there is no 'CREATE TABLE' string in this dump, which also suggests that the schema_migrations insertion is done in a different transaction from the one the migration runs in.

BEGIN.C...
BEGIN.Z....TB.....ecto_1412.........E...	.....S....2....D............._i...C...
SELECT 1.Z....TB.....ecto_582.........E...	.....S....2....D............._i...C...
SELECT 1.Z....TC...#Secto_insert_schema_migrations.P...mecto_insert_schema_migrations.INSERT INTO "schema_migrations" ("version","inserted_at") VALUES ($1,$2)...D...#Secto_insert_schema_migrations.H....3....1....t.............Zn....B...E.ecto_insert_schema_migrations................_i.........Tiw.S...E...	.....S....2....C....INSERT 0 1.Z....TQ....COMMIT.C....COMMIT.Z....I

whereas in a different connection there is this ascii dump:

BEGIN.Z....TP...?.CREATE TABLE "foo" ("id" bigserial, PRIMARY KEY ("id"))...D....S.H....1....t......n....B............E...	.....C....S.S....2....C....CREATE TABLE.3....Z....TQ....COMMIT

josevalim added a commit that referenced this issue Oct 11, 2020
Instead of locking rows, we lock the table resource
altogether, either using app locks or table locks.

Closes #276.
@josevalim (Member):
Hi @bmteller, you were right, apologies. When I said they run in the same transaction, I was looking at my local checkout, which had already moved it inside due to your bug report. My bad! 🤦

In any case, here is a PR that moves the whole approach from row locks to table/app locks: #277.

This solves the problem where the first migration would not lock and also allows us to keep migrating and adding the migration in the same transaction. More feedback is welcome!
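
For reference, the table-lock variant of that approach amounts to something like this (a sketch under an assumed lock mode and Repo module, not the exact PR code):

# Hypothetical sketch: lock the table resource itself rather than its
# rows, so concurrent migrators are excluded even when the table is empty.
Repo.transaction(fn ->
  Ecto.Adapters.SQL.query!(
    Repo,
    "LOCK TABLE schema_migrations IN SHARE UPDATE EXCLUSIVE MODE",
    []
  )

  # ... read versions, run pending migrations, and insert completion
  # rows, all while holding the table lock ...
end)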

josevalim added a commit that referenced this issue Oct 12, 2020
* Use proper locks on migration tables

Instead of locking rows, we lock the table resource
altogether, either using app locks or table locks.

Closes #276.

* Add docs

* Remove unused var

* Pass all relevant parameters to SQLServer lock