- Feature Name: Transactional Schema Changes
- Status: draft
- Start Date: 2020-09-09
- Authors: Andrew Werner, Lucy Zhang
- RFC PR: #55047
- Cockroach Issue: #42061, #43057
CockroachDB's online schema change architecture necessitates stepping through multiple descriptor states. Today, all of these state changes occur asynchronously, after rather than before the transaction commits, preventing the writing transaction from interacting with schema elements undergoing changes and opening the door to non-atomicity in the face of failure. This proposal seeks to pull the steps of an online schema change into the user transaction so that, by default, they occur during its execution, remedying the above shortcomings and providing schema changes which cohere with the transactional model. New syntax is proposed to retain the benefits of the existing asynchronous operations.
The lack of transactional schema changes boils down to two fundamental problems:

- When a user transaction commits, its DMLs all commit but its DDLs are merely queued. It's possible that the async parts of its DDLs will fail and result in a partial commit. The behavior to roll back these failures is, at best, surprising.
- Some changes to the schema performed in a transaction are not visible for subsequent writes inside of that transaction.
This proposal relies on always creating a new primary index when changing table
column sets (#47989, Schema Change Backfills).
That architectural change will lead to a solution to #35738, a
severe usability issue with CHANGEFEED
in the face of schema changes.
Non-atomic DDLs (#42061)
Many schema changes require O(rows)
work. Generally we've wanted to avoid
performing this work directly on the user's transaction (perhaps partly out of
fear that a long-running transaction may not commit, see #52768, #44645).
Many of these operations may fail (see the table in #42061). In today's world,
when one of these operations fails, we attempt to roll back all of the
DDLs that had been a part of that transaction (this also includes cases where
there are multiple clauses in an ALTER TABLE). There are some particularly
pernicious interactions when mixing the dropping of columns with other
operations which can fail (see #46541).
Non-visible DDLs (#43057)
This is easiest to motivate with examples. Here we'll look at adding and removing columns, though there are other important cases like adding enum values and constraints. Constraints and constraint enforcement are particularly interesting cases that will get more attention later.
When dropping a column, the column is logically dropped immediately when the
transaction which is performing the drop commits; the column goes from
PUBLIC -> WRITE_ONLY -> DELETE_ONLY -> DROPPED.
The important thing to note is that the writing transaction uses its provisional
view of the table descriptor. The following is fine:
root@127.179.54.55:40591/movr> CREATE TABLE foo (
k STRING PRIMARY KEY,
c1 INT NOT NULL CHECK (c1 > 5),
c2 INT NOT NULL,
c3 INT
);
CREATE TABLE
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo DROP COLUMN c2, DROP COLUMN c3;
ALTER TABLE
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k | c1
+---+----+
(0 rows)
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
When the above transaction commits, foo is at version 2 and has pending
mutations to drop the columns c1, c2, and c3. One thing to note is that
transactions which happen after the above transaction may still observe those
columns due to schema leasing. Another thing to note is that under versions 1
and 2 of foo, the physical layout has not changed, but the logical layout has.
When version 3 (DELETE_ONLY) comes into effect, the physical layout may
differ from version 1. It's critical that by the time the physical layout begins
to change we know that no transactions are using version 1.
While the above scenario on the surface seems desirable, it actually papers over very real complications related to the dropping of columns with constraints (#47719). The problem this poses is that later statements in the transaction do not provide a value for the dropped column, yet, in order to preserve our two-version correctness properties, must write a value into that column that does not violate any constraints assumed by concurrent transactions using the previous version.
Let's explore the implications of the above issue as it pertains both to the
executing transaction and to concurrently executing transactions. Imagine that
the following statements were issued in the above session immediately
after dropping c2 and c3.
root@127.179.54.55:40591/movr OPEN> INSERT INTO foo VALUES ('foo', 42);
INSERT 1
Now imagine a concurrent connection attempts to read from foo. It will issue
a select which will block until the root
transaction concludes. At that point
it will read using the schema version which existed prior to the root
transaction.
other_user@127.179.54.55:37812/movr> SELECT * FROM foo;
k | c1 | c2 | c3
------+----+----+-------
foo | 42 | 0 | NULL
Notice that this concurrent transaction will observe a zero value for
c2, a NOT NULL column. This is to deal with the fact that the execution engine
really cannot deal with NULL values in NOT NULL columns. What's perhaps
worse is that the same zero value would be written into c1 if an insert
were to be performed after it had been dropped in the transaction. This would
lead the concurrent transaction to observe the check constraint being violated.
There may be cases where this could transiently result in a correctness
violation due to a plan which assumed the check constraint.
Now let's look at the ADD COLUMN case.
root@127.179.54.55:40591/movr> CREATE TABLE foo (k STRING PRIMARY KEY);
root@127.179.54.55:40591/movr> BEGIN;
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo ADD COLUMN c1 INT;
ALTER TABLE
Time: 2.285371ms
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
Time: 526.967µs
root@127.179.54.55:40591/movr OPEN> ALTER TABLE foo ADD COLUMN c2 INT;
ALTER TABLE
Time: 3.795652ms
root@127.179.54.55:40591/movr OPEN> SELECT * FROM foo;
k
+---+
(0 rows)
Time: 516.892µs
root@127.179.54.55:40591/movr OPEN> COMMIT;
root@127.179.54.55:40591/movr> SELECT * FROM foo;
k | c1 | c2
+------+------+------+
(0 rows)
How bizarre is that! In the DROP COLUMN case we observe the effect immediately,
within the transaction. In the ADD COLUMN case we don't observe the effect
during the transaction. The reason for this is relatively straightforward: in
order to observe and act on the effect of the ADD COLUMN
we would need the
physical layout of the table at the timestamp at which this transaction commits
to differ from the physical layout of the table when the transaction began. In
order for the schema change protocol to maintain correctness, it must be the
case that for any given point in time (MVCC time) that a table has exactly one
physical layout. We achieve this through the use of the invariant that at most
there are two versions of a table leased at a time: the latest version and the
preceding version. Between two versions of a table descriptor either the
logical layout or the physical layout may change but never both. If both the
logical and physical layout changed in one version step, lessees of the older
version may not know how to interpret table data. Put differently, by the time
a physical layout changes, all lessees of the table are guaranteed to know how
to interpret the new physical layout.
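To make the invariant concrete, here is a minimal sketch; the Descriptor type and its layout fingerprints are hypothetical stand-ins, since the real descriptor encodes this information across its columns, column families, and indexes:

```go
package invariant

import "fmt"

// Descriptor is a stand-in for a table descriptor. The layout
// fingerprints are hypothetical; they summarize the user-visible
// schema and the KV encoding respectively.
type Descriptor struct {
	Version        int
	LogicalLayout  string // fingerprint of the user-visible schema
	PhysicalLayout string // fingerprint of the KV encoding
}

// CheckVersionStep enforces that between two adjacent descriptor
// versions either the logical layout or the physical layout may
// change, but never both, so lessees of the older version always know
// how to interpret data written under the newer one.
func CheckVersionStep(prev, next Descriptor) error {
	if next.Version != prev.Version+1 {
		return fmt.Errorf("non-adjacent versions %d -> %d", prev.Version, next.Version)
	}
	logicalChanged := prev.LogicalLayout != next.LogicalLayout
	physicalChanged := prev.PhysicalLayout != next.PhysicalLayout
	if logicalChanged && physicalChanged {
		return fmt.Errorf("version %d changes both logical and physical layout", next.Version)
	}
	return nil
}
```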
The upshot of this design is that we can achieve compatibility with postgres and meet user expectations for what is possible within a SQL transaction. This proposal is not, however, without its risks. In order to regain transactionality, in some cases, we'll need to actually perform the work of a schema change underneath the user's transaction, prior to it committing. This exposes the system to potentially a lot of wasted work in the face of rollbacks. Additionally, as we'll see, coordinating with concurrent transactions through the key-value store will require some delicate dances to allow the transaction to achieve serializable isolation even as its implementation literally observes state changes.
Let's walk through the idealization of the above example as experienced in postgres to try to fill in some of the details of this proposal. The ideas proposed in this discussion are intended to be seen as both a sketch and a straw-man. More concrete and precise language will arrive in the reference-level explanation.
postgres=> BEGIN;
BEGIN
postgres=> ALTER TABLE foo ADD COLUMN c1 INT NOT NULL DEFAULT 1;
ALTER TABLE
Ideally, at this point, the ongoing transaction could treat c1 as a PUBLIC
column and could perform reads and writes against it. Let's say that at the
outset foo is at version 1 (foo@v1). Recall from the online schema change
RFC and the F1 schema change paper that the steps to adding a column are
DELETE_ONLY -> WRITE_ONLY -> (maybe backfill) -> PUBLIC.
In this proposal, during the execution of the above statement we'd first perform
a single-step schema change in a child transaction
(new concept) rather than on the user transaction. This single-step schema
change would add the new column in the DELETE_ONLY state. In the DELETE_ONLY
state no other transaction will write to that column, but other transactions would
know how to interpret the column if they were to come across it. After the child
transaction commits the new version, the user transaction will use child
transactions to wait for all leases on the old version to be dropped (see
Condition Observation).
Once the child transaction commits and has evidence that all leases on all nodes
are up-to-date, the statement execution can continue. At this point, a new child
transaction can write and commit the table descriptor which has the column in
WRITE_ONLY with a pending mutation to perform a backfill. Once all
nodes have adopted this latest version, a backfill of the new index can be
performed. So long as the column does not become PUBLIC before the user
transaction commits, other transactions won't observe its side effects (if there
were UNIQUE or CHECK constraints that might not be the case, but that comes
later). The user transaction, following execution of the ALTER statement, will
hold an in-memory version of the table descriptor which has the new column as
PUBLIC, which it will use for the execution of the rest of the transaction.
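The following sketch restates that protocol in Go. Every name here (ChildTxnRunner, PublishVersion, WaitForOneVersion, Backfill) is a hypothetical stand-in intended only to show the ordering of version steps, adoption waits, and the backfill, not an existing API:

```go
package schemachange

import "context"

// ColumnState mirrors the descriptor states discussed above.
type ColumnState int

const (
	DeleteOnly ColumnState = iota
	WriteOnly
	Public
)

type Descriptor struct{ /* elided */ }

// ChildTxnRunner is a hypothetical interface over the child
// transaction machinery proposed by this RFC.
type ChildTxnRunner interface {
	// PublishVersion commits a new descriptor version in a child
	// transaction, independently of the parent user transaction.
	PublishVersion(ctx context.Context, update func(*Descriptor)) error
	// WaitForOneVersion blocks until all leases on the previous
	// version have been dropped (see Condition Observation).
	WaitForOneVersion(ctx context.Context) error
	// Backfill populates the new, not-yet-public index, including the
	// parent transaction's provisional writes.
	Backfill(ctx context.Context) error
}

// addColumn sketches the statement-time protocol for ALTER TABLE ...
// ADD COLUMN: each version step is committed by a child transaction
// and adopted cluster-wide before the next step begins. The column
// becomes PUBLIC only in the user transaction's in-memory descriptor,
// so other transactions cannot observe it until the user txn commits.
func addColumn(
	ctx context.Context,
	r ChildTxnRunner,
	userDesc *Descriptor,
	setState func(*Descriptor, ColumnState),
) error {
	// Step 1: add the column in DELETE_ONLY and wait for adoption.
	if err := r.PublishVersion(ctx, func(d *Descriptor) { setState(d, DeleteOnly) }); err != nil {
		return err
	}
	if err := r.WaitForOneVersion(ctx); err != nil {
		return err
	}
	// Step 2: move it to WRITE_ONLY and wait for adoption again.
	if err := r.PublishVersion(ctx, func(d *Descriptor) { setState(d, WriteOnly) }); err != nil {
		return err
	}
	if err := r.WaitForOneVersion(ctx); err != nil {
		return err
	}
	// Step 3: backfill the new (non-public) index.
	if err := r.Backfill(ctx); err != nil {
		return err
	}
	// Step 4: the rest of the user transaction executes against an
	// in-memory descriptor that treats the column as PUBLIC.
	setState(userDesc, Public)
	return nil
}
```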
This simplification doesn't deal with rollbacks, concurrent DDL operations, or really anything going wrong. More details on how those scenarios are handled will be introduced in the reference-level explanation. Let's proceed with the example:
postgres=> INSERT INTO foo VALUES ('a', 1);
INSERT 0 1
postgres=> SELECT * FROM foo;
k | c1
---+----
a | 1
(1 row)
Here we see the ongoing transaction using the PUBLIC view of the new column.
postgres=> ALTER TABLE foo DROP COLUMN c1;
ALTER TABLE
To execute the above statement we'll need to:
- Write a new provisional table descriptor in a child txn which has no reference to the new column (functionally equivalent to foo@v1) and wait for it to be leased.
- Perform the "backfill" on rows written in this transaction which might have used a different physical layout. In the current implementation this means rewriting every row in the primary index to no longer contain the column in question. In the proposal, we'll construct a new primary index which does not include this column.
postgres=> ALTER TABLE foo ADD COLUMN c2 INT NOT NULL DEFAULT 1;
ALTER TABLE
When we do another mutation we need to go through the same exact dance. All
rows written during the current transaction to the table will need to be
re-written such that they have the physical layout of the table as
described by the descriptor which will be committed. The backfill process
determines which rows to rewrite based on their MVCC timestamp. Specifically,
it will only backfill rows which have an MVCC timestamp earlier than the
ModificationTime (i.e. commit timestamp, as of #40581) of the descriptor
version which moved the newly added column to WRITE_ONLY.
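A minimal sketch of that filtering rule, with simplified stand-ins for rows and HLC timestamps (the real backfiller operates on raw key/value pairs):

```go
package backfill

// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Row pairs a key with the MVCC timestamp at which its current
// version was written.
type Row struct {
	Key  string
	MVCC Timestamp
}

// rowsToRewrite selects the rows the backfill must rewrite into the
// new physical layout: only those whose MVCC timestamp precedes the
// ModificationTime of the descriptor version that moved the new
// column to WRITE_ONLY. Rows written after that point were already
// produced in the new layout and can be left alone.
func rowsToRewrite(rows []Row, modificationTime Timestamp) []Row {
	var stale []Row
	for _, r := range rows {
		if r.MVCC.Less(modificationTime) {
			stale = append(stale, r)
		}
	}
	return stale
}
```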
postgres=> COMMIT;
COMMIT
At this point we'll need to write the table descriptor to a new version which has the newly added column c2. This is safe because the user transaction will have very carefully avoided interacting directly with the descriptor, which has changed over the course of the transaction. Well, this glosses over a bunch of details but maybe gives a glimpse of the idea.
Up to this point we've talked in broad strokes about the basic ideas: adjacent descriptor versions are carefully crafted to be compatible, and the leasing protocol ensures at most two adjacent versions are concurrently active. The problem, so to speak, is that many user-level operations require more than one descriptor version step. Now we need to get into the nuts and bolts of how to coordinate committing these multiple steps through descriptor states while avoiding any user-visible side-effects prior to the commit of the user transaction.
This RFC utilizes the language of the online schema change RFC.
- Child Transaction - a new concept in this RFC. A child transaction is a transaction which is associated with a parent transaction. This is handy for the purposes of deadlock detection and critical to implementing backfills of data and reads from indexes which have writes from the parent.
- Descriptor coordination key - a key in the system.descriptor_coordination table which SQL transactions attempting to perform a DDL will write an intent to prior to commencing a transactional schema change.
- Synthetic descriptors - a descriptor utilized by a transaction that does not correspond to a committed or written descriptor state. This is not a particularly new or novel concept; we do something akin to this already in the process of validating constraints. This proposal will make rather judicious use of this technique and ideally we can formalize it better in code.
- Constraint invalidation table - a table which can be utilized by transactions to invalidate the addition of a constraint which might have otherwise succeeded.
- Transaction self-pushing - a KV level mechanism to push a transaction manually to a timestamp. Today transactions are only pushed in response to kv-level interactions.
- Condition observation - After a new version of a descriptor is published by a child transaction, the execution must wait for that version to be adopted before proceeding to the next version. We need a tool to observe that all leases have been dropped that interacts appropriately with deadlock detection. This mechanism will retry a closure with a child transaction until that closure indicates that the condition has been satisfied.
During the course of certain schema changes, data stored in the KV needs to change. This may be due to the construction of new indexes or to changes in the layout of an existing index. We'll first discuss the two existing backfill mechanisms, column and index, and then talk about the plan to move all schema changes to utilize the index backfiller.
Today, there are two types of "backfills" as we call them, the "column" backfill and the "index" backfill. The column backfiller is largely a vestige of a time when tables' primary indexes could not be changed. This proposal relies on moving column changes to creating new indexes as outlined in #47989 in order to make rollbacks reasonable and subsequent backfills possible.
The column backfill is used, as you might have guessed, to add and drop columns.
The way that the column backfiller works is that it updates the physical layout
of an index in-place. Each row is overwritten transactionally during the
backfill to include the new WRITE_AND_DELETE_ONLY column. It must do this as
the index is concurrently receiving transactional reads and writes.
This should be juxtaposed with the index backfiller which is used to backfill
indexes that are not live for read traffic (though notably are live for
writes). The index backfiller reads from a source index from a fixed snapshot
and populates its target by synthesizing SSTs using the bulkAdder. All rows
written by the index backfill have a timestamp that precedes any write due to
transactional traffic, thus preserving subsequent writes as they are. This is
needed because AddSSTable does not interact with the usual transaction
mechanisms; it just injects the keys into the keyspace.
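The timestamp discipline can be illustrated with a toy MVCC map: if the backfill ingests its values below every transactional write, an MVCC read at any later timestamp returns the transactional value, so the injected keys never clobber subsequent writes. None of these types exist in the codebase; this is purely illustrative:

```go
package backfill

import "sort"

// mvccStore maps key -> timestamp -> value, a toy model of MVCC.
type mvccStore map[string]map[int64]string

func (s mvccStore) put(key string, ts int64, val string) {
	if s[key] == nil {
		s[key] = map[int64]string{}
	}
	s[key][ts] = val
}

// read returns the newest version at or below ts, mimicking an MVCC
// scan. Because AddSSTable bypasses the transactional machinery, the
// backfill must choose a timestamp below any concurrent write so that
// transactional values shadow backfilled ones.
func (s mvccStore) read(key string, ts int64) (string, bool) {
	var tss []int64
	for t := range s[key] {
		if t <= ts {
			tss = append(tss, t)
		}
	}
	if len(tss) == 0 {
		return "", false
	}
	sort.Slice(tss, func(i, j int) bool { return tss[i] < tss[j] })
	return s[key][tss[len(tss)-1]], true
}

// Example: the backfill ingests ("k1", 10, "backfilled") while a
// transaction later commits ("k1", 20, "txn-write"); read("k1", 25)
// returns "txn-write", so the ingested value never wins.
```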
What's important to note is that the process of changing the primary key of a table operates by performing an index backfill into a new index and then swapping over to that new index. Similarly, we've now changed TRUNCATE from creating a new table to swapping over to new copies of the indexes in the same table. There is no reason why all column changes couldn't be done this way, and doing so opens the door for other benefits (including fixing a very irritating CDC behavior in the face of column changes, #35738).
For reasons that will hopefully become clear, utilizing index backfills for all column change operations is an important stake in this design. This is crucial so that the process of rolling back a failed schema change can be as lightweight as scheduling an asynchronous GC job. If we, somehow, continued with the column backfill during a transactional schema change, rolling back a failed change would require synchronously rewriting the index to remove any detritus rows before being able to proceed.
Today's schema changer has evolved semi-organically through a number of iterations. Where to draw the circle around the schema changer is a bit amorphous. We have created new job types for type changes but for practical purposes, that too is the schema changer. Furthermore, we've got logic in the schema changer for applying schema change steps for descriptors created in the current transaction. That, importantly, is also a part of the schema changer.
In "olden times" (19.2 and prior), the schema changer would watch the set of table descriptors and respond to changes to them. At some point long ago when the jobs infrastructure was added, the schema changer began correlating mutations with jobs as a means to expose some UI. This proved horribly problematic, both because the jobs API was, well, bad (it still is but not nearly as bad as it was) and because of general mishandling of transactions. In 20.1, this story got better as job life-cycles matured and the schemachanger was re-imagined as a job. In that re-imaging we created the notion of the schema change job that is roughly, for the most part, the steps of a descriptor mutation with a one-to-one relationship to a job, executed by the jobs subsystem.
This scheme is way better than the old one, but still problematic in its own
ways. For one, we can only talk about modifying a single descriptor in a
job (err, we can also talk about DROP DATABASE CASCADE and
DROP SCHEMA CASCADE but that's only further evidence that it's a bad
abstraction). In the 20.1 cycle we added those primary key changes that, for
reasons, rely on creating multiple jobs that are poorly coordinated and lead
to a less than ideal UX.
Another important observation is that the business logic is tied to these things we call descriptor mutations. A mutation is both an implied, almost declarative member of a descriptor and a specification for changes to be made. This overloading is confusing and very poorly represented in code. What's perhaps worse is that the business logic of handling these mutations proceeds entirely procedurally, walking through the same implied state machine as a mess of switches and if statements that hide various operations like backfills, descriptor reference updates, and the like.
In order to move forward we need:
- More flexibility to re-arrange and control the logic of a schema change
- More observability: what really are the states and transitions?
- Decoupled from jobs, still usable by jobs
- Capable of expressing changes over multiple descriptors
This and its coordination in the connExecutor represent the largest code changes required to realize this vision.
The primary new KV concept required by this proposal is the child transaction. A child transaction is related to a snapshot of a parent transaction. A child transaction is able to observe intents written by the parent (at visible sequence numbers). Critically a child transaction is associated with its parent for the purposes of deadlock detection. A child transaction differs from a nested transaction as implemented today primarily because it can be committed independently of the parent transaction and because its side effects will become visible to other transactions when it is committed.
The work required to realize child transactions is discussed in #54633.
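A hypothetical shape for such an API is sketched below; none of these names exist today, and #54633 tracks the actual design:

```go
package kv

import "context"

// Txn is a stand-in for the KV transaction handle.
type Txn struct{ /* elided */ }

// ChildTxnOption controls whether a child transaction may observe the
// provisional writes (intents) of its ancestors.
type ChildTxnOption int

const (
	// ReadAncestorWrites lets the child read the parent's intents at
	// visible sequence numbers, e.g. so an index backfill includes
	// the parent's earlier writes.
	ReadAncestorWrites ChildTxnOption = iota
	// ForbidAncestorWrites asserts that the child will never
	// encounter an ancestor's intent; doing so is treated as an error.
	ForbidAncestorWrites
)

// ChildTxn runs fn in a child transaction associated with the parent
// for deadlock-detection purposes. Unlike today's nested transactions,
// the child commits independently of the parent, and its effects
// become visible to other transactions once it commits.
func (parent *Txn) ChildTxn(
	ctx context.Context,
	opt ChildTxnOption,
	fn func(ctx context.Context, child *Txn) error,
) error {
	// Elided: fork a child tied to a snapshot of the parent (and, for
	// ReadAncestorWrites, the parent's current sequence number), run
	// fn, and commit the child.
	return fn(ctx, &Txn{})
}
```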
Child transactions will be used to read all descriptors in the transaction and to perform (and commit) all writes to descriptors except the final write which makes the schema change visible. This is critical to ensure that the user transaction does not violate serializable isolation. Fundamentally, if the user transaction were to observe the descriptor and then that descriptor were to change during the transaction (which this proposal mandates), then the transaction could not both write to the descriptor and commit.
The process of managing descriptor resolution and ensuring that it occurs inside
of child transactions properly will be delegated to the descs.Collection and
name resolution infrastructure. One added bit of complexity is ensuring that
the user transaction can read from the various virtual tables which read the
system.descriptor and system.namespace tables.
It is also important for child transactions to be able to read provisional values written by their parents. This is important so that index backfills to new indexes will include previous writes from the transaction. Designs in which the parent buffers writes to be re-executed were explored but rejected. Backfilling intents into new indexes as committed values is safe because the index will only become public if the parent transaction commits. In this case, the child transaction will be constructed with the sequence number of the parent transaction.
In other uses of child transactions, it is not desirable to observe writes of the parent transaction. In these cases (like Condition Observation), we'll want to construct a child transaction which is not permitted to encounter an intent due to an ancestor transaction. Child transactions will thus have two forms: explicitly opting into reading ancestor writes or asserting that they will never do so.
Some time was spent exploring an idea whereby child transactions could avoid observing ancestor writes by, effectively, ignoring all sequence numbers of the parent. This solution is unsatisfying as it would be nice for ancestor transactions to causally follow children when they overlap in their key sets. That property is difficult to maintain if the child proceeds without observing the parent and without always pushing the parent.
In order to prevent concurrent transactional schema changes on the same table,
we want to achieve exclusive locking semantics on a given descriptor. However,
given that the execution of a schema change will require writing multiple
versions of the descriptor and waiting for those versions to be observed,
the user transaction cannot interact with the descriptor key directly. Doing so
would ensure that serializability was violated. Instead, we introduce a new
table system.descriptor_coordination
which the user transaction can write to
prior to initiating any transactional schema changes.
CREATE TABLE system.descriptor_coordination (
id INT PRIMARY KEY
)
If a DDL executes and, after writing the descriptor coordination key, encounters a descriptor in a state that indicates that it has a partially applied transactional schema change, then it knows that those changes can be rolled back.
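As a sketch, taking the coordination "lock" could amount to laying down an intent on the descriptor's row before touching any descriptor versions; the helper below is illustrative only and is phrased through database/sql for brevity, even though the real write would happen inside the server:

```go
package schemachange

import (
	"context"
	"database/sql"
)

// acquireDescriptorCoordination writes (an intent on) the descriptor's
// row in system.descriptor_coordination from within the user
// transaction. A concurrent schema change on the same descriptor
// blocks on this intent rather than on the descriptor key itself,
// which the user transaction must never read or write directly.
func acquireDescriptorCoordination(ctx context.Context, tx *sql.Tx, descID int64) error {
	// UPSERT lays down an intent whether or not the row already exists.
	_, err := tx.ExecContext(ctx,
		`UPSERT INTO system.descriptor_coordination (id) VALUES ($1)`, descID)
	return err
}
```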
In order to transactionally add constraints, we need to coordinate with concurrent writes to the table. The process of adding the constraint will validate that all existing data meets that constraint. Concurrent writes, however, may violate that constraint. In order to deal with that, we'll combine serializability with explicit execution action to ensure that constraints being added are not concurrently invalidated.
When a constraint is being added, it is added in a VALIDATING state. In this
state, executing transactions detect when they would violate this constraint as
though it were public, but, upon detecting the violation, determine whether
the constraint has been committed. If the transaction determines that the
constraint has not been committed, it must ensure that it will not be committed.
Detecting this error and recovering from it gracefully will require changes to the KV layer. Currently, condition violation errors leave a batch in an undefined state, unable to be handled in a recoverable fashion. Instead, we should return a response with a flag rather than an error for the batch.
The way that the transaction ensures that the constraint will not be committed
is by writing an entry into system.constraint_coordination, a new table. When
a schema change adds a new constraint in the validating state, it reads the
entire relevant span where other transactions might write. Upon commit, the
user transaction will issue a delete range command over this span. If any
rows have been written, the transaction will fail.
CREATE TABLE system.constraint_coordination (
descriptor_id INT,
violation_txn_id UUID,
violation_meta STRING NOT NULL,
PRIMARY KEY (descriptor_id, violation_txn_id)
)
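A sketch of the writer-side half of the protocol, again illustrative and phrased through database/sql for brevity:

```go
package schemachange

import (
	"context"
	"database/sql"
)

// recordConstraintViolation is what a transaction does when it would
// violate a VALIDATING (not yet committed) constraint: instead of
// failing outright, it records the violation. If the constraint-adding
// transaction later commits, its commit-time delete range over this
// span encounters the row and the schema change fails; if the
// constraint is abandoned, the row is harmless.
func recordConstraintViolation(
	ctx context.Context, tx *sql.Tx, descID int64, txnID, meta string,
) error {
	_, err := tx.ExecContext(ctx,
		`INSERT INTO system.constraint_coordination
		   (descriptor_id, violation_txn_id, violation_meta)
		 VALUES ($1, $2, $3)`,
		descID, txnID, meta)
	return err
}
```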
There are many other protocols which could be imagined for this sort of coordination but this one seems reasonably simple and should work. If we were to implement the Pessimistic Mode extension, things may not work as intended.
When a child transaction in a transactional schema change commits a new version of a descriptor, it must wait for that version to be adopted before proceeding. The process of observing this condition is fundamentally problematic in that it exposes the system to deadlock.
We'll call this topic "condition observation". The sketch of the proposal here is re-iterated in #55981. The idea is that the KV layer will expose functionality to run a closure in a retry loop with backoff until that closure indicates that the condition has been satisfied. The condition will utilize a child transaction at each iteration to check the condition. The reason we give this concept a name is that it interacts directly with deadlock detection.
The hazard we need to deal with is the fact that a transaction with an
outstanding lease we're waiting to drop may be blocked on an intent of the user
transaction (or of another transaction transitively blocked on the user
transaction). The blocking transaction is either writing, in which case it will
(eventually) be in the txn wait queue for the user transaction with a
PUSH_ABORT request, or reading, in which case it will have issued a
PUSH_TIMESTAMP request. We prefer to abort blocked writers rather than let
them abort the user transaction. The way we'll deal with this is to self-push
the user transaction above all blocked readers, and we'll issue PUSH_ABORT
requests against all of the writing transactions. We'll need to extend
QueryTxn to return the type and timestamp of pushers.
This requirement is somewhat unfortunate in that the isolation of the deadlock detection layer from the KV API was nice, but it seems important as, at the end of the day, this condition is a resource on which a transaction is blocked.
Upon fulfilling the condition, the user transaction will push itself above the timestamp at which the condition was observed to be true.
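A minimal sketch of the retry primitive, eliding the child-transaction plumbing and the final self-push; the names are hypothetical:

```go
package kv

import (
	"context"
	"time"
)

// checkFn runs once per attempt inside a fresh child transaction
// (elided here) and reports whether the condition -- e.g. "no leases
// remain on the old descriptor version" -- has been satisfied.
type checkFn func(ctx context.Context) (done bool, err error)

// ObserveCondition retries check with backoff until the condition
// holds. Associating each attempt's child transaction with the parent
// matters for deadlock detection, which is why this is proposed as a
// KV-level concept (#55981) rather than a bare retry loop in SQL code.
// On success, the parent pushes itself above the timestamp at which
// the condition was observed to hold (elided).
func ObserveCondition(ctx context.Context, check checkFn) error {
	backoff := 10 * time.Millisecond
	for {
		done, err := check(ctx)
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < time.Second {
			backoff *= 2
		}
	}
}
```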
In order to add the new column, we’ll assume that the user transaction needs to ensure that the new indexes have been built before the transaction can commit.
CREATE TABLE foo (i INT PRIMARY KEY);
BEGIN;
INSERT INTO foo VALUES (1); -- these contending writes are interesting.
ALTER TABLE foo ADD COLUMN j INT NOT NULL DEFAULT 42;
INSERT INTO foo VALUES (2, 2);
SELECT * FROM foo; -- (1, 42), (2, 2)
COMMIT;
During the transaction, during the execution of the ALTER statement, we need
to construct the new indexes. Let's assume that we're in a world where
table alterations happen as a form of index creations and swaps, as described
already.
The challenge is going to be that all transactions with a commit timestamp before the transaction which adds the constraint should not observe it, and all those that commit after should.
CREATE TABLE foo (i INT PRIMARY KEY, j INT);
INSERT INTO foo VALUES (2, 2);
BEGIN;
ALTER TABLE foo ADD CHECK (i >= j); -- should succeed
INSERT INTO foo VALUES (1, 2); -- should fail
From a high level, we need to ensure that concurrent transactions know that the constraint may exist, but does not definitely exist. This means the constraint needs to be evaluated, but upon failure, special logic needs to run to either check if the constraint has been added, or prevent it from being added.
CREATE TABLE foo (i INT PRIMARY KEY, j INT);
INSERT INTO foo VALUES (2, 2);
BEGIN;
ALTER TABLE foo ADD UNIQUE (j);
INSERT INTO foo VALUES (1, 2); -- should fail
- Adding a constraint means adding a unique index and validating it, with all of the added fun of the above check constraint.
- Major differences:
  - Before the index backfill can begin, you need to wait two versions: one for the index in DELETE_ONLY to be adopted and then another for it to be in WRITE_ONLY.
  - At that point you perform the index backfill on the new unique index.
  - Then you have to validate that the row counts between the primary and new index are the same. This is a full scan of both the primary and new index (which can be performed concurrently; see the sketch below).
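A sketch of that validation step, assuming the illustrative names foo and foo_unique_i and glossing over the child-transaction machinery needed to also see the parent transaction's provisional writes:

```go
package schemachange

import (
	"context"
	"database/sql"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// validateUniqueIndexCounts compares row counts between the primary
// index and the new unique index at a single fixed timestamp. The two
// scans are independent and can run concurrently. A mismatch means
// duplicate keys collapsed in the unique index, i.e. the existing
// data violates the constraint.
func validateUniqueIndexCounts(ctx context.Context, db *sql.DB, asOf string) error {
	var primaryCount, indexCount int64
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		return db.QueryRowContext(ctx, fmt.Sprintf(
			`SELECT count(*) FROM foo AS OF SYSTEM TIME %s`, asOf,
		)).Scan(&primaryCount)
	})
	g.Go(func() error {
		// foo@foo_unique_i forces the scan to use the new index.
		return db.QueryRowContext(ctx, fmt.Sprintf(
			`SELECT count(*) FROM foo@foo_unique_i AS OF SYSTEM TIME %s`, asOf,
		)).Scan(&indexCount)
	})
	if err := g.Wait(); err != nil {
		return err
	}
	if primaryCount != indexCount {
		return fmt.Errorf("unique constraint violation: %d rows in primary index, %d in new index",
			primaryCount, indexCount)
	}
	return nil
}
```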
At the time when the constraint-adding transaction commits (some ambiguity) all future transactions must see the constraint. Writes which precede the timestamp should not experience any side-effects of the constraint.
To implement this, we must set the table descriptor to a state where writes which would violate the constraint transactionally ensure that the verification will fail by using the constraint coordination span.
Should be an almost identical paradigm, just involves more descriptors.
When a transactional schema change fails to commit (due either to an explicit rollback or due to a session closing), the provisional descriptor changes will need to be reverted and any partially created indexes will be removed. The process is quite like intent resolution for aborted transactions. In the case of an explicit rollback, the connection will eagerly perform this cleanup. However, we cannot rely on this eager cleanup. In cases where future transactions attempt to perform schema changes, the existing "pending" mutations will need to be cleaned up first. This can be done using the child transaction in the same way as any other change that will be made in the course of a transactional schema change.
The process of cleaning up partially created indexes will fall to the GC Job as it does today. The GC Job will not need to respect a GC TTL as the indexes will never have been made public. The cost of these cleanups in the case of frequent rollbacks is a point of concern in this design. Avoiding spurious rollbacks and restarts of expensive transactional schema changes will be important for this proposal to be successful (see Pessimistic mode discussion).
Note that the above paradigm should generalize to savepoints. The fact that we
write intents to new indexes as committed values is not a problem because if a
transaction rolls back to a SAVEPOINT prior to a write it must also drop any
subsequently added indexes. In that way, there should not be any hazards related
to exposing values that arise due to this proposal. This work is deferred and
discussed in more detail in the SAVEPOINTs and "Nested Transactions" future
work section.
As we've mentioned before and will mention again, transactional schema changes as proposed here are not a panacea. For one, if the user connection dies while a long-running operation occurs, that work will be wholly wasted. However, it's worth noting that other systems don't offer such a mechanism, so it must not be that important. One case where we do expect it to be important, and where we also can regain the asynchronous behavior without too much work, is the case where the schema change is a single statement. In that case, instead of performing the stages using the newly described transaction protocol, all the stages will be pushed into an asynchronous job, as they are today.
Today schema changes are rather lightweight operations. We write out a job and a small change to the descriptor and then wait for the asynchronous set of operations to complete. In this proposal, index creation and validation will occur underneath the user transaction. This means that any reads (not due to backfills or validations) will need to remain valid throughout the course of the transaction in order to maintain serializability (see pessimistic mode discussion below).
One arguably nice property of Cockroach's schema changes today is that they
will proceed even if the client connection breaks before the operation finishes.
Determining when a schema change has reached the point at which it will proceed
is not an easy thing to programmatically determine, but, in practice, most schema
changes enter this phase quite soon after statement execution (or COMMIT in
the case of explicit transactions). The hope of this RFC is to retain this behavior
for single-statement schema changes. In the future work section there is a
proposal to regain this behavior more explicitly and usefully with the
DETACHED form of schema changes, though, admittedly, that remains the most
vague and open-ended portion of the proposal.
Some discussion of making execution and storage aware of physical layout. It seems possible to encode the structure of the table into the row encoding directly. In that case, the execution engine might be able to detect that the row encoding does not match its expectation and it could upgrade its view of the table metadata during execution. This style of approach seemed extremely cross-cutting and complex. It also does not deal with constraint addition.
Almost certainly we'll need to retain the existing behavior in the mixed version states. We'll need to carefully sequence the rollout of this functionality to ensure that the upgrade process is smooth. The details of the mixed version state remain an unresolved issue.
See #52768. Transactional schema changes will require pushing their timestamp, which will require read refreshes. In normal times such refreshes can be catastrophic when considering large transactions. One good thing is that transactional schema changes are not necessarily going to interact transactionally (in terms of refresh spans) with all that much data. That being said, the cost of a rollback will be potentially immense.
- In the first pass, we could imagine using in-memory, best-effort locks.
  - Are these exclusive? All that exists today are exclusive locks, but best-effort shared locks are a small task. Distributed, durable upgrade locks are a much bigger task.
  - There is some talk about making these much higher reliability.
    - Right now they live in-memory on the leaseholder.
    - They don't get transferred with the lease.
    - There is a work item to transfer them with cooperative lease transfers.
    - Most lease transfers are cooperative.
- For all tracked read spans, acquire locks.
  - This implies the existence of ranged shared locks, which are currently not supported.
- This will require refreshes but those refreshes are likely to succeed.
  - This is purely an optimization to prevent restarts.
  - Certain workloads with concurrent access to the table will likely always invalidate reads, making this a liveness problem.
  - Not a safety concern.
A likely acceptable first step would be to implement pessimistic mode on top of best-effort, non-replicated, upgrade, ranged read locks. These may be a goal of the 21.1 release.
Currently, attempts to roll back to a savepoint in a transaction which would discard a schema change are not allowed (#10735). This proposal does not intend to remedy this limitation in its first pass. However, it would be problematic if the proposal precluded a solution.
Any solution would need to keep track of transactional schema change state with each savepoint. Upon rollback, we will need to use a child transaction to move the table descriptor back to the logical point at which it existed as of that savepoint, then update the TableCollection. This should "just work".
As we've mentioned before and will mention again, transactional schema changes as proposed here are not a panacea. For one, if the user connection dies while a long-running operation occurs, that work will be wholly wasted.
The proposal is that by default we won't ever do the asynchronous work for
schema changes in an explicit transaction. We instead propose that we add a
DETACHED keyword which can apply to individual statements that will have
better semantics on failure.
We choose the DETACHED keyword because it is utilized by background jobs and
it is a distinct keyword from that used by postgresql for this sort of
asynchronous schema change. Postgres's CONCURRENTLY only applies to index
addition, and the usability of that keyword is severely restricted. See the
Postgres CONCURRENTLY documentation for the limitations. We'll summarize them
below:
Regular index builds permit other regular index builds on the same table to occur in parallel, but only one concurrent index build can occur on a table at a time. In both cases, no other types of schema modification on the table are allowed meanwhile. Another difference is that a regular CREATE INDEX command can be performed within a transaction block, but CREATE INDEX CONCURRENTLY cannot.
Concurrent builds for indexes on partitioned tables are currently not supported. However, you may concurrently build the index on each partition individually and then finally create the partitioned index non-concurrently in order to reduce the time where writes to the partitioned table will be locked out. In this case, building the partitioned index is a metadata only operation.
Why would you want this? Why wouldn’t you just not use a transaction? This seems to be allowing a user to opt in to non-transactional behavior.
If you want transactionality then you'd put them in a transaction and it would
work. DETACHED would help here for ergonomic reasons because you don't have to
wait for one statement to finish before issuing the next.
A reason for the DETACHED keyword is to provide a mechanism for queuing up
schema changes without needing to wait potentially hours to tell the system
about it. Maybe this provides an opportunity for optimization with regards to
work scheduling. The semantics here are up in the air.
It is still very unclear what it means for a DETACHED DDL to fail.
DETACHED means that we should return to the user immediately without
performing the change.
Should you be able to set up dependent mutations?
Sequencing DETACHED DDLs after synchronous DDLs seems more-or-less
well-defined. Maybe we say that you can't interact with schema changes which
are queued.
We don’t today have a sense of dependencies. Some schema changes on the same table could even, in theory, run in parallel.
If any DETACHED step fails, then all "subsequent" async DDLs fail. This
is similar to the Spanner semantics, and it seems much better than what we do
today from a simplicity perspective. Spanner has a notion of batches of DDL
statements. It appears that different batches do not observe changes which
have been queued but not yet applied.
Cloud Spanner applies statements from the same batch in order, stopping at the first error. If applying a statement results in an error, that statement is rolled back. The results of any previously applied statements in the batch are not rolled back.
The language is confusing when it comes to the interaction between batches.
Lastly, Spanner is not set up for frequent schema changes; this is not an acceptable limitation for cockroach:
Avoid more than 30 DDL statements that require validation or index backfill in a given 7-day period, because each statement creates multiple versions of the schema internally.
How do we expect users to use this DETACHED behavior?
- This isn't for the tools; this is for today's users of schema changes who want their schema changes to happen even if their connection disconnects.
- Assuming some people are expecting this async behavior, are we going to trash existing users' schema change workflows?
  - Everything should look roughly the same.
  - The main differences are in behavior during failure.
- The reason for DETACHED is to provide customers with a mechanism to issue long-running schema changes and not have to fear their connection dropping leading to the change not happening.
Summary of questions:

- What are the semantics of DETACHED in transactions?
- What are the semantics of DDLs, specifically interactions with previously issued DETACHED DDLs, for subsequent sync or async DDLs?
  - For sync DDLs, it makes sense to wait for all previously issued DETACHED DDLs to clear before attempting to run.
  - For DETACHED, assume success and queue; on any failure, the tail of the queue all fails. The system is free to flatten the queue into fewer steps but not to reorder?
Add support for DETACHED schema changes to a table; these cannot be used
inside of a transaction. DETACHED schema changes must be single statements.
A DETACHED schema change operates on the table as though all previously
enqueued DETACHED schema changes have been completed.

DETACHED schema changes return a table containing a job ID corresponding
to the schema change. For multi-clause ALTER statements, multiple jobs may be
returned. If any async schema change fails, all subsequent queued schema changes
will fail. Non-DETACHED schema changes may not proceed on a table until
all DETACHED schema changes finish.
A problem with this proposal is that it means that there is no efficient way to add multiple columns. This seems likely to be unacceptable. Perhaps we need to make an attempt to combine adjacent concurrent operations where easy. This would complicate the failure semantics.
> CREATE TABLE foo();
CREATE TABLE
> ALTER TABLE foo ADD COLUMN i INT NOT NULL DEFAULT 42 DETACHED;
job_id
----------------------
594409267822002177
> CREATE INDEX foo_idx_i ON foo(i) DETACHED;
job_id
----------------------
594409269143273473
> ALTER TABLE foo ADD CONSTRAINT foo_unique_i UNIQUE (i), ADD COLUMN j INT DETACHED;
job_id
----------------------
594409267848740865
594409270501670913
- What should the semantics be when adding a column with a default value and using the DETACHED keyword?
  - Seems like DETACHED operations should not take effect at all.
- How big of a problem is it for transactions to observe side-effects of non-committed DML transactions?
  - In particular, the observation would cause an error when success might be possible if the concurrent DML fails to commit.
  - E.g. enforce a constraint before it's completely added.
  - Long-running schema changes which perform writes seem uncommon.
- Is it okay to not deal with the SetSystemConfigTrigger?
  - One oddity in the cockroach transactional schema change landscape is that if you want to perform a DDL in a transaction, it must be the first writing statement. This is due to implementation details of zone configurations but is thematically aligned with regards to limitations of schema changes in transactions.
  - #54477
- Should jobs be used in any way for dealing with backfills and validations underneath a user transaction?
  - What are the pros? Lots of cons.
- How exactly will the mixed version state work and how will this proposal be sequenced across releases?
  - The current thinking is that the new schema change infrastructure and child transactions will be delivered in 21.1. Other surrounding infrastructure, including, perhaps, DETACHED schema changes may also be delivered. Nevertheless, it is likely that in the mixed 21.1-21.2 state, schema changes will behave as in 21.1.
  - Does this change of behavior upon upgrade imply that the change should be gated by some setting? How do we ensure that this setting is flipped on prior to 22.1?
- Privileges: roles and role membership are stored in a system table which has its version incremented on every write in order to ensure properly synchronized dissemination of information. How will that code be affected by this work?
- Dropping schema elements: How should drops interact with ongoing schema changes? Should they cancel them? How would we implement this?