Rebuild SrvVSchema during InitTablet.#6276
Conversation
We used to rebuild SrvVSchema in all cells as part of running GetOrCreateShard from any cell. That caused VSchema problems to spread instantly to all cells at once, so we changed EnsureVSchema to only rebuild SrvVSchema in the local cell. We believed that at least one tablet in a given cell would still end up rebuilding SrvVschema in the cell since they all call GetOrCreateShard at startup. However, I didn't realize that GetOrCreateShard short-circuits if the shard already exists, so it was skipping EnsureVSchema entirely. This moves the rebuild of SrvVSchema to be an explicit step in InitTablet alongside the rebuild of the keyspace graph. This keeps GetOrCreateShard as a task that only needs to be done by one tablet globally because it affects global topo. The rebuilding of per-cell records should be done by a tablet in that cell. Signed-off-by: Anthony Yeh <enisoc@planetscale.com>
| switch { | ||
| case err == nil: | ||
| // Check if vschema was rebuilt after the initial creation of the keyspace. | ||
| if _, keyspaceExists := srvVSchema.GetKeyspaces()[*initKeyspace]; !keyspaceExists { |
There was a problem hiding this comment.
Doesn't this also need a RebuildKeyspaceGraph?
There was a problem hiding this comment.
I don't understand. Are you saying if a keyspace is missing from SrvVSchema, it implies that RebuildKeyspace has not run for that keyspace?
There was a problem hiding this comment.
I thought that SrvKeyspace and SrvVSchema were completely independent. I'll double check.
There was a problem hiding this comment.
You're probably right. I think they are independent.
But I also think that a missing SrvKeyspace is going to be a problem. And now that I think about it, there are some workflows where this has led to odd errors.
However, I think that's an orthogonal issue, and it's better to resolve that separately.
| switch { | ||
| case err == nil: | ||
| // Check if vschema was rebuilt after the initial creation of the keyspace. | ||
| if _, keyspaceExists := srvVSchema.GetKeyspaces()[*initKeyspace]; !keyspaceExists { |
There was a problem hiding this comment.
You're probably right. I think they are independent.
But I also think that a missing SrvKeyspace is going to be a problem. And now that I think about it, there are some workflows where this has led to odd errors.
However, I think that's an orthogonal issue, and it's better to resolve that separately.
We used to rebuild SrvVSchema in all cells as part of running GetOrCreateShard from any cell. That caused VSchema problems to spread instantly to all cells at once, so we changed EnsureVSchema to only rebuild SrvVSchema in the local cell (#5930).
We believed that at least one tablet in a given cell would still end up rebuilding SrvVschema in the cell since they all call GetOrCreateShard at startup. However, I didn't realize that GetOrCreateShard
short-circuits if the shard already exists, so it was skipping EnsureVSchema entirely. As a result, only one cell (corresponding to the one tablet that won the race) actually had its SrvVSchema rebuilt.
This moves the rebuild of SrvVSchema to be an explicit step in InitTablet alongside the rebuild of the keyspace graph. This keeps GetOrCreateShard as a task that only needs to be done by one tablet globally because it affects global topo. The rebuilding of per-cell records should be done by a tablet in that cell.