vreplication: change to use new _vt.vreplication #4034
sougou merged 23 commits into vitessio:master from sougou:resharding
Conversation
AFAIK, @alainjobart introduced strategy flags early in the implementation such that he could experiment with the code and skip certain parts of it. In general, the code is proven and therefore I don't see a need for these flags. Sugu, do you plan to make changes to the schema of the vreplication table in the future? That's one problem I hit when I introduced the "max_tps" and "max_replication_lag" columns: when I changed the schema, I had to manually delete the table from all YouTube keyspaces which were previously resharded. I also tried to write some code to automatically detect and update this, but I ran into some limitations on how we can run the admin queries through tabletmanager. I don't remember the details.
I'm adding columns to vreplication as we speak. For now, I'm just adding them to
"BinlogPlayer is handling" -> "BinlogPlayer handles".
jvaidya
left a comment
This change makes binlog replication a subsystem in its own right, with resharding as the current use case. This has the potential to enable a bunch of exciting use cases.
Overall, I think it will be useful if you described the high-level changes to the code, either in the description of the CR or as comments in key files.
More comments inline.
It might make sense to delete the old 'blp_checkpoint' table if it exists.
Is this always gtid? How is file:pos style replication supported? Or is it not supported?
The term gtid made sense when this code was written because both MariaDB and MySQL overloaded the term, which was conveniently reused by us. But now, we're shoehorning other things in this. The ideal future term for this would be VReplicationPosition. The refactor is not worth it at this moment.
Is there a comment someplace where we explain what each column stands for and how it is used? I would find such a comment very useful.
Added it here. This seems to be the best place.
On line 485 above, "Retrying txn in 1 second" would be a more useful error message.
Could you add a comment here describing the logic behind this? Maybe also state where else the VREngine is started?
Added a comment for now. The action agent has an unusual state machine that I'm still not too familiar with. It's possible that RefreshState is the one where things are first started. So, I'll refrain from commenting until I fully understand.
If at all, it should be a comment describing the tablet manager state machine. Not specifically for VREngine.
Is this new code? Seems to be similar to vitess/go/vt/vtgate/planbuilder. What has made this necessary?
It's the same pattern, but the plan is very different. It's only for handling VReplication. I think this is the third planbuilder pattern modeled after tabletserver and vtgate.
I really like that this functionality is explicit now. As we discussed, it might make sense to make this a higher level module in a future iteration.
Would we not need to communicate this in the new way of doing things? Perhaps by giving example queries to run with VReplicationExec?
Yes. I'll add the "user guide" at the end when I polish things up.
go/vt/worker/split_diff.go
Outdated
Could you add a comment for step // 2 here?
Could you add comments about why 'strategy' is not needed any longer?
This is in the message of the commit that removes it: 4ea87b9
sougou
left a comment
I'll edit the PR comments when the change is done. All other review comments addressed/answered.
@demmer @rafael @alainjobart @michael-berlin @jvaidya This is now ready for review. Given that everyone is short on time and this is a big change, I propose the following strategy:
rafael
left a comment
@sougou - This is really cool. I did the following review:
- High-level review of vt_replication. Most of this part makes sense to me. Didn't look in detail at how the binlog player works, but since the bulk of it didn't change, I didn't want to focus on that. I added a few comments in this section.
- I tried to dig deeper in the MigrateServedTypes changes, most of my comments are in that section.
- Did not look in detail at the changes in the tests files.
Comment does not seem accurate anymore. startPosition/stopPosition are no longer part of this method.
nit - now that this table is not really tied to what we call workflows in other places, I think this should be something like: stream_creator, stream_manager, stream_owner
This API is still low level, and should not be used by a human directly unless they know what they're doing. I expect that there will be newer and more flexible workflows that will eventually wrap this.
nit - Maybe: readVRSettings retrieves the throttler settings from the vreplication table.
I think we should avoid references to checkpoint as it could be confusing in the future.
go/vt/worker/legacy_split_clone.go
Outdated
@sougou - This change makes the failure scenario a bit different. I'm not sure if it matters, but wanted to get your thoughts on this.
Before your change:
- Sets source shards.
- Gets all tablet alias for destination
- It refreshes destination tablets.
Now it looks like it's doing all these steps one shard at a time in a loop.
Correct. It was necessary because the uid is now returned by the vreplication create command. So, using it right away to create the source shard makes sense. The UID was previously inferred by the order of the shards at a different location. That approach felt too flimsy, and didn't offer any particular advantages.
go/vt/worker/split_diff.go
Outdated
Ah nice. This is a good validation related to what we were talking about in Slack. With this, current workflows that merge shards will fail, which is better than the current state, where they pass while checking only one side.
go/vt/worker/split_diff.go
Outdated
Out of curiosity, why defer cancel()? In many other places I always see just cancel().
The defer gives you the guarantee that it will get called even if there was a panic in the code. However, it's not a good idea to defer inside a for loop. Also, defer is slightly more expensive than regular function calls. So, it may be worth avoiding in tight loops, like byte-by-byte operations.
go/vt/wrangler/keyspace.go
Outdated
There is a risk here that we've been discussing internally at Slack. If enabling query service in destination shards fails, you could end up in a state where source shards won't have any replicas if they get refreshed.
This change does not make it worse, it is the same behavior we have today. But it is something that I think we should change.
Yeah. I want to make a separate PR to fix this.
go/vt/wrangler/keyspace.go
Outdated
I think this might be missing in masterMigrateServedType. We need to RefreshState in source/destination shards so they restart and don't have query service disabled in replicas.
The refresh of the source shards is done at the beginning of the function, as soon as their shard records are updated. This is because we have to first stop serving, then wait for replication to catch up, and only then enable the destinations. The diff tool makes this hard to see; the code reads a lot better if you look at it without the diff.
Oh, my comment was incorrect. It should have been: refresh destination shards. Before the change, when migrating masters:
	if err := si.UpdateServedTypesMap(servedType, cells, reverse); err != nil {
		return err
	}
	if tc := si.GetTabletControl(servedType); !reverse && tc != nil && tc.DisableQueryService {
		// This is a forwards migration, and the destination query
		// service was already in a disabled state. We need to enable
		// and force a refresh, otherwise it's possible that both the
		// source and destination will have query service disabled at
		// the same time, and queries would have nowhere to go.
		if err := si.UpdateDisableQueryService(ctx, servedType, cells, false); err != nil {
			return err
		}
		needToRefreshDestinationTablets = true
	}
	// .... then
	if needToRefreshDestinationTablets {
		event.DispatchUpdate(ev, "refreshing destination shard tablets so they restart their query service")
		for _, si := range destinationShards {
			wr.RefreshTabletsByShard(ctx, si, []topodatapb.TabletType{servedType}, cells)
		}
	}
After the change, I don't see this happening. I'm only seeing a refresh of the destination master (which the existing version also does).
We synced on this offline, and concluded that the slightly different new behavior makes more sense than the original logic.
go/vt/wrangler/keyspace.go
Outdated
Maybe isSource? We call them source/destination everywhere. It took me a bit of cognitive load to switch context here and think in terms of to/from.
isFrom is used here because the direction can be reversed with the -reverse flag. You can see in replicaMigrateServedTypes that we swap the order based on that flag. Hence the new terminology.
go/vt/wrangler/keyspace.go
Outdated
Given that you are doing all this refactoring now, would you like to take a stab at making this safer and re-enabling query service on masters if we can't catch up in filtered replication?
Yes. As mentioned earlier, I'll do this in a new PR.
This change deprecates _vt.blp_checkpoint in favor of vreplication, which stands for Vitess Replication. The goal is to make vreplication a standalone sub-module of vitess that other services can use, including the resharding workflow. This first step just changes to using vreplication instead of blp_checkpoint. The table and APIs will be extended to perform more autonomous functions. Also, the split strategy flags were unsafe and untested, so they have been deleted. More holistic operations will be added to better manage resharding and vreplication. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Added the following fields to vreplication:
- workflow: who created and manages the replication
- source: source info for the replication (proto)
- state: Running, Stopped, etc.
- message: any message relevant to current state
BinlogSource contains the source info. It currently supports only a keyrange or a table list, but can be extended to support other rules. The information is stored in proto 'String' format because its keyrange rendering is more readable than JSON. The current change only populates the fields. The next change will use the values. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
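Putting the new fields together with the throttler columns mentioned earlier in the thread (max_tps, max_replication_lag), the table shape is roughly as sketched below. This is an illustration only: the exact DDL, column types, and sizes are assumptions, not taken from the PR.

```sql
-- Rough sketch of _vt.vreplication as described in this change.
-- Types and sizes are assumptions, not the PR's exact DDL.
CREATE TABLE _vt.vreplication (
  id INT AUTO_INCREMENT,                    -- stream id, generated by vreplication
  workflow VARBINARY(1000),                 -- who created and manages the stream
  source VARBINARY(10000),                  -- BinlogSource in proto text format
  pos VARBINARY(10000),                     -- current replication position
  max_tps BIGINT NOT NULL,                  -- throttler setting
  max_replication_lag BIGINT NOT NULL,      -- throttler setting
  state VARBINARY(100),                     -- Running, Stopped, etc.
  message VARBINARY(1000),                  -- message relevant to current state
  PRIMARY KEY (id)
)
```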
Add VReplicationCreate method to TabletManager API. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
The number of vreplication control functions is too high; I counted seven at the minimum. So, I'm now trying this new approach: a single VReplicationExec function that can execute queries. The goal is to accept SQL statements that access a 'vreplication' table, which will be interpreted by the engine, which will in turn update the _vt.vreplication table and also act upon the change accordingly, like start, stop, etc. For now, these queries will only be understood by VReplicationExec. In the future, we can look at allowing these as part of the QueryService API. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
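As an illustration of the kind of statements this single entry point is meant to interpret, the following are plausible examples; the exact column names and accepted syntax here are assumptions, not the PR's final grammar.

```sql
-- Create a stream; the engine acts on the insert and returns the new id:
INSERT INTO _vt.vreplication (workflow, source, pos, state)
  VALUES ('resharding', '<BinlogSource proto text>', 'MariaDB/0-1-1083', 'Running');

-- Stop a stream (the engine stops the binlog player, then records the state):
UPDATE _vt.vreplication SET state = 'Stopped' WHERE id = 1;

-- Tear down a stream:
DELETE FROM _vt.vreplication WHERE id = 1;
```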
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Parts of binlog_players.go have been resurrected in these two new types. controller is a recreation of BinlogPlayerController, but it will effect state management and changes through _vt.vreplication. For now, only the run loop is implemented. tabletPicker pulls out the part of BinlogPlayerController that picks a tablet to replicate from. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Tests for tabletPicker, controller and Engine have also been added. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
When blp terminates, it now saves its state in vreplication. This way, a restart of vttablet will not cause replication to unexpectedly restart. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
WaitForPos polls _vt.vreplication until the desired position is reached. A more responsive algorithm could be used where we ask the binlog player to notify of such events, but the design becomes more intricate. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Also fixed some typos. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
stats are implemented using independent types to minimize interference with the operation of the main engine. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Switch split clones, split diffs and migrate served types to use the new replication module. Artifacts of old code need to be removed, and tests need to be updated. For now, manual resharding works. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Had to make a few changes for this: * binlog_player reads its own start & stop position. This is a more correct design. Previously, the caller supplied this. * On VReplication update, the new controller inherits the previous controller's blp stats. This gives better continuity. * split clone workers still need to call refresh state to make the tablets non-serving. * MigrateServedTypes/From need to delete the VReplication streams for master. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
The binlog player's db client mock has improved functionality. vtClient has been aptly renamed to dbClient. The test ActionAgent initializes VReplication if needed. binlog_players_map_test has been proactively deleted; it's too far from the VReplication model. Tests that expected binlog players to start when a source shard is added have been changed to not expect that; the binlog player is now started only through explicit vreplication commands. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
* Unit tests for BinlogPlayer, and some code improvements around error handling and canceled contexts. * Improved other tests, and consolidated mocks and fakes under binlogplayer. * Fixed binlog.py flakiness caused by an extra vreplication statement inserted on start of streaming. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
The code was unnecessarily complex. Separating master from non-master greatly simplifies each flow, and it's a lot more readable. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Polish up and ready the code for review. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Fix for proto version related changes. Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
sougou
left a comment
Comments are all addressed.
Also adjust the split_diff logic after rebase from master Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
jvaidya
left a comment
LGTM. This is a pretty exciting feature. Thanks for getting this to the finish line.
rafael
left a comment
Thanks for all the clarifications. This is exciting. LGTM.
This change deprecates _vt.blp_checkpoint in favor of vreplication, which stands for Vitess Replication. The goal is to make vreplication a standalone sub-module of vitess that other services can use, including the resharding workflow.
The big change in the model is that vreplication is not owned by the resharding workflow. The workflow instead creates vreplication streams as needed, and controls them individually. The stream id for a replication is now generated by vreplication, which the resharding workflow stores and tracks.
This also means that a vreplication stream can be directly created and managed by anyone as needed. This allows for newer and more flexible workflows in the future.
The following parts of the system underwent major changes:
VReplicationExec has been added.