workflow: resharding workflow: Implement checkpointing. #2495
michael-berlin merged 7 commits into vitessio:master
49d6d5f to 9a8eea5
I've only reviewed the updated proto definitions. Looks very good. I've left a couple minor comments which are easy to fix. In particular, I like that you documented each message and field :) Reviewed 1 of 1 files at r1, 1 of 4 files at r3. proto/workflow.proto, line 64 at r3 (raw file):
This can be removed, see below. proto/workflow.proto, line 73 at r3 (raw file):
Please move this message after its first use, i.e. below "WorkflowCheckpoint". This makes the file easier to read from top to bottom, and the reader does not have to mentally "cache" message definitions which aren't used yet. proto/workflow.proto, line 74 at r3 (raw file):
Reusing the message for the task state sounds great. But in that case you should rename it to "State". For now I suggest not to reuse it. Instead, please duplicate the "WorkflowState" message as a new message "TaskState". Once you've finished the implementation, you'll know if you can actually reuse the same message :) proto/workflow.proto, line 75 at r3 (raw file):
Please change this to a map of strings. This way it's more generic and not resharding specific. Then also rename the field to "attributes" because it will store multiple attributes. proto/workflow.proto, line 78 at r3 (raw file):
nit: WorkflowCheckpoint (without the s) proto/workflow.proto, line 79 at r3 (raw file):
nit: This is missing a word between "is" and "to". Change it e.g. to "is used to". I would add stronger language here e.g.: proto/workflow.proto, line 81 at r3 (raw file):
Let's use an int for this. It should be a simple constant in your code which will only be incremented when you're changing the code and it becomes out of sync with older state versions. proto/workflow.proto, line 86 at r3 (raw file):
Keep the comment slightly more generic: "settings contains workflow specific data e.g. the resharding workflow would store the source and destination shards". (It currently sounds like this message is specific to the resharding workflow, but it doesn't have to be.) Comments from Reviewable |
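A minimal Go sketch of the shape these proto comments point towards: generic string attributes per task, a generic settings map, and an integer code version. The struct names, field names and the `codeVersion` constant are assumptions for illustration and are not taken from the generated protobuf code.

```go
package main

import "fmt"

// codeVersion is a hypothetical constant. It would only be incremented when a
// code change makes previously stored checkpoints incompatible.
const codeVersion = 1

// Task mirrors the suggested task message: generic string attributes instead
// of resharding specific fields.
type Task struct {
	ID         string
	Attributes map[string]string
}

// WorkflowCheckpoint mirrors the suggested checkpoint message.
type WorkflowCheckpoint struct {
	CodeVersion int32
	Tasks       map[string]*Task
	// Settings contains workflow specific data e.g. the resharding workflow
	// would store the source and destination shards.
	Settings map[string]string
}

// checkVersion rejects checkpoints written by an incompatible code version.
func checkVersion(c *WorkflowCheckpoint) error {
	if c.CodeVersion != codeVersion {
		return fmt.Errorf("checkpoint has code version %d, expected %d", c.CodeVersion, codeVersion)
	}
	return nil
}
```

Keeping the version a plain integer makes the compatibility check a single comparison when a stored checkpoint is loaded.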
Review status: 2 of 6 files reviewed at latest revision, 8 unresolved discussions, some commit checks broke. proto/workflow.proto, line 64 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 73 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 74 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 75 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 78 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 79 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 81 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 86 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
9a8eea5 to a74b031
Reviewed 3 of 5 files at r2, 5 of 5 files at r4. go/vt/workflow/resharding/checkpoint.go, line 15 at r4 (raw file):
For now you don't need this interface. Instead, let's have only the one implementation and use it everywhere. (Instead of using CheckpointFile, you can use the in-memory topology in unit tests.) go/vt/workflow/resharding/checkpoint.go, line 16 at r4 (raw file):
Please remove the suffix go/vt/workflow/resharding/parallel_runner.go, line 16 at r4 (raw file):
nit: Please use go/vt/workflow/resharding/parallel_runner.go, line 16 at r4 (raw file):
nit: Please move this above the fields which are guarded by the mutex. go/vt/workflow/resharding/parallel_runner.go, line 20 at r4 (raw file):
nit: As agreed during the last team meeting, please put the complete function signature on a single line. go/vt/workflow/resharding/parallel_runner.go, line 20 at r4 (raw file):
See my comment below in "extractTasks". This function must have the list of tasks as input and not just the step/phase name. Generating the list of tasks is outside the scope of this function. go/vt/workflow/resharding/parallel_runner.go, line 28 at r4 (raw file):
Please use our Semaphore implementation from the go/vt/workflow/resharding/parallel_runner.go, line 39 at r4 (raw file):
Please move this into its own class (struct). That class should have a method with which you can update the state of a particular task. This way your ParallelRunner does not need to know the WorkflowCheckpoint protobuf message at all and you can also remove the mutex from ParallelRunner. Instead, that mutex should be part of the new class. go/vt/workflow/resharding/parallel_runner.go, line 56 at r4 (raw file):
go/vt/workflow/resharding/parallel_runner_test.go, line 24 at r4 (raw file):
Let's not use time.Sleep() in tests because it prolongs the execution of this test and therefore the overall execution of tests. Instead, unit tests should be as fast as possible. go/vt/workflow/resharding/parallel_runner_test.go, line 66 at r4 (raw file):
This test is missing checks e.g. you could check in the checkpoint that all tasks are changed to "Done". proto/workflow.proto, line 73 at r3 (raw file): Previously, wangyipei01 wrote…
This is not done yet. proto/workflow.proto, line 79 at r3 (raw file): Previously, wangyipei01 wrote…
Please add the comment I wrote or a similar one. It's necessary to document the API here. proto/workflow.proto, line 17 at r4 (raw file):
Please remove this again. proto/workflow.proto, line 63 at r4 (raw file):
No need for this indirection. Please embed it in the task message instead. proto/workflow.proto, line 67 at r4 (raw file):
No need for this explanation. This can be removed. Instead, please make sure that all your comments follow the style
i.e. it must start with proto/workflow.proto, line 95 at r4 (raw file):
typo: store Comments from Reviewable |
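A rough sketch of the "own class (struct)" suggestion above: the checkpoint state and its mutex live in one type with a method to update a single task, so the runner never touches the checkpoint message directly. `CheckpointWriter`, `TaskState` and the `save` callback (a stand-in for writing to the topology) are hypothetical names chosen for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskState is a simplified stand-in for the task state in the checkpoint.
type TaskState int

const (
	TaskNotStarted TaskState = iota
	TaskRunning
	TaskDone
)

// CheckpointWriter owns the checkpoint copy and the mutex which guards it.
type CheckpointWriter struct {
	// save is a hypothetical stand-in for writing the checkpoint to the topology.
	save func(tasks map[string]TaskState) error

	// mu guards tasks.
	mu    sync.Mutex
	tasks map[string]TaskState
}

// NewCheckpointWriter registers all tasks as not started.
func NewCheckpointWriter(taskIDs []string, save func(map[string]TaskState) error) *CheckpointWriter {
	tasks := make(map[string]TaskState, len(taskIDs))
	for _, id := range taskIDs {
		tasks[id] = TaskNotStarted
	}
	return &CheckpointWriter{save: save, tasks: tasks}
}

// UpdateTask updates the state of one task in the local copy and then
// persists the full checkpoint through save.
func (c *CheckpointWriter) UpdateTask(id string, state TaskState) error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if _, ok := c.tasks[id]; !ok {
		return fmt.Errorf("unknown task: %v", id)
	}
	c.tasks[id] = state
	return c.save(c.tasks)
}
```

In a unit test, `save` can simply record its calls, which also makes it easy to check that every task ends up as done without any sleeping.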
a74b031 to 5d9d6c0
Review status: all files reviewed at latest revision, 17 unresolved discussions, some commit checks failed. go/vt/workflow/resharding/checkpoint.go, line 15 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 16 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 16 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 16 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 20 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 24 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
I've kept only the print function there for now. go/vt/workflow/resharding/parallel_runner_test.go, line 66 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 73 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 79 at r3 (raw file):
Done. proto/workflow.proto, line 17 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 63 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 67 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 95 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
5d9d6c0 to 633fcfe
Changes look good to me. I've left many small comments which are mostly nits and should be easy to address. Can you please use it in the workflow itself as well and publish that code? Reviewed 5 of 5 files at r5. go/vt/workflow/resharding/checkpoint.go, line 14 at r5 (raw file):
I suggest to call this go/vt/workflow/resharding/checkpoint.go, line 16 at r5 (raw file):
Please add a newline before this. That's because all fields guarded by one mutex should be in one group with the mutex definition at the top. See also: https://talks.golang.org/2014/readability.slide#21 go/vt/workflow/resharding/checkpoint.go, line 18 at r5 (raw file):
Let's not use abbreviations. Instead, this field can be called go/vt/workflow/resharding/checkpoint.go, line 23 at r5 (raw file):
Please rename this to go/vt/workflow/resharding/checkpoint.go, line 26 at r5 (raw file):
Please add a newline here. That makes it easier to see what's the mutex boiler plate code and what's the actual core of the function. go/vt/workflow/resharding/checkpoint.go, line 31 at r5 (raw file):
Since it's called go/vt/workflow/resharding/checkpoint.go, line 34 at r5 (raw file):
You can simplify this. See go/vt/workflow/resharding/checkpoint.go, line 42 at r5 (raw file):
Please remove deleted code. If you use new commits, you can get it back from the history instead. go/vt/workflow/resharding/parallel_runner.go, line 23 at r5 (raw file):
nit: Missing verb: probably "have"? go/vt/workflow/resharding/parallel_runner.go, line 30 at r5 (raw file):
Just go/vt/workflow/resharding/parallel_runner.go, line 31 at r5 (raw file):
Please name this go/vt/workflow/resharding/parallel_runner.go, line 31 at r5 (raw file):
Having the code deadlock won't be nice. Instead, you could add a default clause below as follows:
    default:
        panic(fmt.Sprintf("BUG: Invalid concurrency level: %v", concurrencyLevel))
go/vt/workflow/resharding/parallel_runner.go, line 55 at r5 (raw file):
go/vt/workflow/resharding/parallel_runner.go, line 69 at r5 (raw file):
optional: While it's neat that you can reuse the semaphore for this, I think it is more straightforward to use a WaitGroup here. go/vt/workflow/resharding/parallel_runner_test.go, line 22 at r5 (raw file):
By default, tests should not output anything. Sometimes, this is unavoidable e.g. when testing an error case which will also log. But here it's easy to avoid. Please use "t.Logf" instead where "t" is go/vt/workflow/resharding/parallel_runner_test.go, line 26 at r5 (raw file):
Please avoid go/vt/workflow/resharding/parallel_runner_test.go, line 26 at r5 (raw file):
Make this an int please. go/vt/workflow/resharding/parallel_runner_test.go, line 38 at r5 (raw file):
Let's use lower case for all task attributes to be consistent. go/vt/workflow/resharding/parallel_runner_test.go, line 52 at r5 (raw file):
In this example, the settings entry should be called "count" and it should only have the number of tasks in it. You could make this task count a parameter (type "int") of this function and then you can reuse it for the initialization and running the tasks. go/vt/workflow/resharding/parallel_runner_test.go, line 59 at r5 (raw file):
Same comment as for the protobuf file: Please organize the elements from top to bottom where it makes sense. go/vt/workflow/resharding/parallel_runner_test.go, line 68 at r5 (raw file):
Use go/vt/workflow/resharding/parallel_runner_test.go, line 76 at r5 (raw file):
Please define a constructor instead i.e. go/vt/workflow/resharding/parallel_runner_test.go, line 83 at r5 (raw file):
This is going to be nil. Instead, I suggest the following: p := &ParallelRunner{} go/vt/workflow/resharding/parallel_runner_test.go, line 86 at r5 (raw file):
This should be a go/vt/workflow/resharding/parallel_runner_test.go, line 89 at r5 (raw file):
nit: Missing space after proto/workflow.proto, line 73 at r3 (raw file): Previously, wangyipei01-bot wrote…
Please see my first comment. I want you to move this message further down the file. proto/workflow.proto, line 69 at r5 (raw file):
Just Comments from Reviewable |
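The test-related comments in this round (a "count" settings entry, a task count parameter of type int, lower case attribute names, and checking that all tasks are done) could be combined roughly as in the sketch below. `Task`, `Checkpoint`, `initCheckpoint` and `allDone` are simplified, hypothetical stand-ins and not the types used in the PR.

```go
package main

import (
	"fmt"
	"strconv"
)

// Task is a simplified stand-in for the task message.
type Task struct {
	ID         string
	Done       bool
	Attributes map[string]string
}

// Checkpoint is a simplified stand-in for the checkpoint message.
type Checkpoint struct {
	Tasks    map[string]*Task
	Settings map[string]string
}

// initCheckpoint creates taskCount tasks. The same taskCount can be reused by
// the test when it runs the tasks.
func initCheckpoint(taskCount int) *Checkpoint {
	tasks := make(map[string]*Task, taskCount)
	for i := 0; i < taskCount; i++ {
		id := fmt.Sprintf("task_%d", i)
		tasks[id] = &Task{
			ID: id,
			// Attribute names are lower case to be consistent.
			Attributes: map[string]string{"number": strconv.Itoa(i)},
		}
	}
	return &Checkpoint{
		Tasks:    tasks,
		Settings: map[string]string{"count": strconv.Itoa(taskCount)},
	}
}

// allDone reports whether every task in the checkpoint is marked as done.
func allDone(c *Checkpoint) bool {
	for _, task := range c.Tasks {
		if !task.Done {
			return false
		}
	}
	return true
}
```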
go/vt/workflow/resharding/parallel_runner.go, line 31 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
This name is the same as the concurrency package, go/vt/concurrency, which I use in parallel_runner. If I change to this name, it will lead to an error. Comments from Reviewable
go/vt/workflow/resharding/parallel_runner.go, line 69 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Is there a neat way to support both parallel and sequential calls using WaitGroup? I only know that WaitGroup is easy to use when all tasks run in parallel. Comments from Reviewable
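One possible shape, sketched under assumptions: the sequential case needs no synchronization at all, the parallel case waits with a sync.WaitGroup, and a default clause panics on an unknown level instead of deadlocking. `ConcurrencyLevel`, its two constants and the `Run` signature are hypothetical and only illustrate the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// ConcurrencyLevel is a hypothetical type selecting how tasks are executed.
type ConcurrencyLevel int

const (
	Sequential ConcurrencyLevel = iota
	Parallel
)

// Run executes all tasks and returns one error slot per task.
func Run(tasks []func() error, level ConcurrencyLevel) []error {
	errs := make([]error, len(tasks))
	switch level {
	case Sequential:
		// Run the tasks one after another in the calling Go routine.
		for i, task := range tasks {
			errs[i] = task()
		}
	case Parallel:
		// Start one Go routine per task and wait for all of them.
		var wg sync.WaitGroup
		for i, task := range tasks {
			wg.Add(1)
			go func(i int, task func() error) {
				defer wg.Done()
				// Each Go routine writes only to its own slot.
				errs[i] = task()
			}(i, task)
		}
		wg.Wait()
	default:
		panic(fmt.Sprintf("BUG: Invalid concurrency level: %v", level))
	}
	return errs
}
```

Because every Go routine writes to its own index of `errs`, no mutex is needed for collecting the results.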
go/vt/workflow/resharding/parallel_runner_test.go, line 86 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Should I also change it to Fatalf when p.Run generates an error? It seems that the job should not continue in that case either. Comments from Reviewable
Review status: 9 of 10 files reviewed at latest revision, 33 unresolved discussions, some commit checks failed. go/vt/workflow/resharding/checkpoint.go, line 14 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 16 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 18 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 23 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 26 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 31 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 34 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 42 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 20 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 28 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
this is delayed. (the current API made my implementation more complex without any benefit) go/vt/workflow/resharding/parallel_runner.go, line 39 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 56 at r4 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 23 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 30 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 31 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 55 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 22 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 26 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 26 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 38 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 68 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 76 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 83 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 89 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. proto/workflow.proto, line 73 at r3 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. I forgot to check-in this file. proto/workflow.proto, line 69 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
70446e3 to e80214b
3595abd to e0cd73d
First round of comments. Overall, the structure looks good. I would like to see some changes to how the tasks are stored and how the migrate tasks are executed. Reviewed 1 of 2 files at r6, 3 of 5 files at r8, 5 of 8 files at r9. go/vt/topo/workflow.go, line 30 at r6 (raw file):
Please remove this again. Note that this version cannot be used for the version which you have to store in the proto. It's only used for interactions with the topology. go/vt/workflow/node.go, line 105 at r9 (raw file):
Please undo this change. I'll fix the documentation in a separate PR. go/vt/workflow/resharding/checkpoint.go, line 13 at r9 (raw file):
nit: saves go/vt/workflow/resharding/checkpoint.go, line 32 at r9 (raw file):
This doesn't reflect that you're writing the complete checkpoint. I suggest to change it as follows: // UpdateTask updates the task status in the checkpoint copy and writes the full checkpoint to the topology. go/vt/workflow/resharding/checkpoint.go, line 43 at r9 (raw file):
You assume that the lock on the mutex is already held when this is called. We signal this in the code by adding the suffix "Locked". Given that you want to export this method as well, I suggest the following:
go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 33 at r9 (raw file):
Please create an enum type for these phases. I like that you're already using a common suffix ("Name"). But let's use a prefix instead and let's name it "Phase" to be more clear what this thing is. Use this prefix for the enum type name as well. Example:
    type PhaseType string
    // Different phases the resharding workflow goes through.
    const (
        PhaseCopySchema PhaseType = "copy_schema"
        PhaseClone PhaseType = "clone"
        ...
    )
go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 35 at r9 (raw file):
nit: Let's not shorten this one because there's a difference between the MySQL replication and the Vitess filtered replication. Therefore, please call it "PhaseWaitForFilteredReplication" and "wait_for_filtered_replication". go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 61 at r9 (raw file):
Please rename this to "cloneUINode" to be consistent with the phase name. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 63 at r9 (raw file):
Please rename this to "diffUINode" to be consistent with the phase name. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 94 at r9 (raw file):
There is a better way to generalize this: Pass in the name of the phase and the list of shards. This way you don't need any type switch in the method itself and it's clear in the calling code over which list you're iterating. Please also rename the method to "getTasks" for example because the task objects are already generated. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 96 at r9 (raw file):
Please move these two parameters into the constructor of ParallelRunner as well. At the end, go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 140 at r9 (raw file):
HorizontalResharding go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 144 at r9 (raw file):
Please use the same name as in the interface i.e. "w". go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 159 at r9 (raw file):
Please move these two lines into "initCheckpoint". You don't need the "ts" object here and this way you avoid having to pass it. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 161 at r9 (raw file):
That's too verbose/redundant. Let's go with "checkpoint". go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 174 at r9 (raw file):
Please use the same name as in the interface i.e. "w". go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 177 at r9 (raw file):
checkpoint go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 191 at r9 (raw file):
Please be consistent and use the constants everywhere. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 238 at r9 (raw file):
Please update the comment. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 263 at r9 (raw file):
You're mixing here two different things in one method. Instead, please:
go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 267 at r9 (raw file):
a) Please change this to the same structure as the original workflow i.e. create tasks in the same order. If you want to generalize things, I suggest the following: Write an "addTasks" function which has the phase name and the list of shards as arguments. Additionally, it should have "func (i int, name string) map[string]string" as an argument which returns the attributes depending on the index in the list. With that, you can write the task creation like this:
    addTasks(tasks, PhaseClone, sourceShards, func (i int, name string) map[string]string {
        return map[string]string{
            "vtworker": vtworkers[i],
            "keyspace": keyspace,
            "source_shard": name,
        }
    })
go/vt/workflow/resharding/parallel_runner.go, line 30 at r8 (raw file):
typo: actionRegistry go/vt/workflow/resharding/parallel_runner.go, line 37 at r8 (raw file):
As discussed offline, the mutex can be removed here. go/vt/workflow/resharding/parallel_runner.go, line 91 at r8 (raw file):
The scope of this variable is longer than a couple of lines. Please call it "node" instead. That will make it easier to read. go/vt/workflow/resharding/parallel_runner.go, line 116 at r8 (raw file):
Instead of closing the retry channel here, please delete the control object from the registry. go/vt/workflow/resharding/parallel_runner.go, line 134 at r8 (raw file):
Access to actionRegistry requires a mutex because two different threads (the execution Go routine and the http handler which calls Action()) can access it. go/vt/workflow/resharding/parallel_runner.go, line 142 at r8 (raw file):
This should be part of the "closeRetryChannel" method. go/vt/workflow/resharding/parallel_runner.go, line 155 at r8 (raw file):
I suggest to name this more generic e.g. "triggerRetry" because the implementation details are not relevant in the method name. go/vt/workflow/resharding/parallel_runner.go, line 155 at r8 (raw file):
This method should be part of "Control" and not "ParallelRunner". go/vt/workflow/resharding/task_helper.go, line 1 at r9 (raw file):
Please rename this file to go/vt/workflow/resharding/task_helper.go, line 16 at r9 (raw file):
shardType is not needed. Let's use / as name instead. go/vt/workflow/resharding/task_helper.go, line 19 at r9 (raw file):
This is not a copy. Instead, you're just returning the list of selected tasks. go/vt/workflow/resharding/task_helper.go, line 37 at r9 (raw file):
Fix comment (wrong method name). go/vt/workflow/resharding/task_helper.go, line 39 at r9 (raw file):
attributes (no abbreviations please) go/vt/workflow/resharding/task_helper.go, line 46 at r9 (raw file):
Please change this to go/vt/workflow/resharding/task_helper.go, line 118 at r9 (raw file):
This is going to change the execution order. Your code would migrate all types of a shard, one shard at a time. But we want to migrate all RDONLY types first, then REPLICA and then MASTER. Given that, please change the code as follows: Split the "migrate" phase into three phases i.e. "migrate_rdonly", "migrate_replica" and "migrate_master". Each phase is going to need its own UI nodes and its own ParallelRunner for the execution. Comments from Reviewable |
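A compact sketch of how the phase enum and the suggested "addTasks" helper might fit together. The phase strings come from the comments above; the constant names for the three migrate phases, the simplified `Task` type and the "phase/shard" task ID format are assumptions made for this example.

```go
package main

import "fmt"

// PhaseType identifies a phase of the resharding workflow.
type PhaseType string

// Different phases the resharding workflow goes through (not exhaustive).
const (
	PhaseCopySchema                 PhaseType = "copy_schema"
	PhaseClone                      PhaseType = "clone"
	PhaseWaitForFilteredReplication PhaseType = "wait_for_filtered_replication"
	PhaseMigrateRdonly              PhaseType = "migrate_rdonly"
	PhaseMigrateReplica             PhaseType = "migrate_replica"
	PhaseMigrateMaster              PhaseType = "migrate_master"
)

// Task is a simplified stand-in for the task message.
type Task struct {
	ID         string
	Attributes map[string]string
}

// addTasks creates one task per shard for the given phase. getAttributes
// returns the attributes depending on the index in the list.
func addTasks(tasks map[string]*Task, phase PhaseType, shards []string, getAttributes func(i int, name string) map[string]string) {
	for i, shard := range shards {
		// Assumed ID format: "<phase>/<shard>".
		id := fmt.Sprintf("%s/%s", phase, shard)
		tasks[id] = &Task{ID: id, Attributes: getAttributes(i, shard)}
	}
}
```

With separate phases for RDONLY, REPLICA and MASTER, the caller can run one phase to completion before starting the next, which keeps the migration order independent of how tasks are stored in the map.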
Comments for the changes to ParallelRunner. The structure of the test for the retry looks very good. Please don't forget to add the file which has the code for the retry controller. (Note: It may not be worth having that in a separate file.) Reviewed 1 of 8 files at r9. go/vt/workflow/resharding/parallel_runner.go, line 30 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Typo is still there. go/vt/workflow/resharding/parallel_runner.go, line 155 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Please rename the method as suggested. go/vt/workflow/resharding/parallel_runner.go, line 36 at r9 (raw file):
controller go/vt/workflow/resharding/parallel_runner.go, line 47 at r9 (raw file):
Please add this to the call above and return it directly without the intermediate variable go/vt/workflow/resharding/parallel_runner.go, line 97 at r9 (raw file):
Before you start the retry, you need to check if the context is still valid. That's because the error could have been caused by an expired context. In that case you should not retry. You can check for that with a non blocking read on the channel returned by go/vt/workflow/resharding/parallel_runner.go, line 104 at r9 (raw file):
Please move this code into a separate method e.g. "addRetryButton". Then, that method can return your retry channel. go/vt/workflow/resharding/parallel_runner.go, line 122 at r9 (raw file):
Unguarded access to go/vt/workflow/resharding/parallel_runner.go, line 125 at r9 (raw file):
You need to obtain the lock here. Besides that, unassigning the map could go wrong when another Go routine tries to add a controller object into the map after this. Instead of deleting the map, please delete only the controller object for this particular retry. go/vt/workflow/resharding/parallel_runner.go, line 141 at r9 (raw file):
Right now the actionRegistry is only relevant for the retry and not other actions. Given that, please change the structure of this method: First have the switch block which checks the "name". In case of a retry, call a new method which actually triggers the retry. The end result should be that this Action() doesn't access go/vt/workflow/resharding/parallel_runner.go, line 144 at r9 (raw file):
nit: Please stay consistent. In the other function you're naming this go/vt/workflow/resharding/parallel_runner_test.go, line 57 at r9 (raw file):
As discussed offline, it would be better if you re-use the workflow library and don't create parts of it (like the NodeManager) yourself. "TestManagerSimpleRun" is an example for such a test. You should write a second, minimal retry workflow which implements the Factory interface. That workflow can fail on the first attempt, expect a "Retry" action and then finally succeed. go/vt/workflow/resharding/parallel_runner_test.go, line 88 at r9 (raw file):
I like this test and what it's testing. I suggest to do the following two changes:
i.e.:
go/vt/workflow/resharding/parallel_runner_test.go, line 124 at r9 (raw file):
typo: monitor go/vt/workflow/resharding/parallel_runner_test.go, line 159 at r9 (raw file):
better:
(This is how our code does it everywhere else.) Comments from Reviewable |
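The retry-related comments above could be sketched as follows: the registry is guarded by a mutex, a failed task is only retried if the context is still valid (checked with a non-blocking read on ctx.Done()), and only the entry for the retried task is deleted. This is an illustration under assumptions, not the PR's code: it uses the standard library context package, and `retryChannels`, `addRetryButton`, `triggerRetry` and `runWithRetry` are hypothetical names and signatures.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ParallelRunner is reduced here to the retry bookkeeping.
type ParallelRunner struct {
	// mu guards retryChannels because the execution Go routine and the
	// handler which triggers the retry both access it.
	mu sync.Mutex
	// retryChannels maps a task ID to the channel which is closed on retry.
	retryChannels map[string]chan struct{}
}

// NewParallelRunner returns a runner with an initialized registry.
func NewParallelRunner() *ParallelRunner {
	return &ParallelRunner{retryChannels: make(map[string]chan struct{})}
}

// addRetryButton registers a retry channel for the task and returns it.
func (p *ParallelRunner) addRetryButton(taskID string) chan struct{} {
	p.mu.Lock()
	defer p.mu.Unlock()

	retry := make(chan struct{})
	p.retryChannels[taskID] = retry
	return retry
}

// triggerRetry deletes only the entry for this task and wakes up the waiter.
func (p *ParallelRunner) triggerRetry(taskID string) error {
	p.mu.Lock()
	defer p.mu.Unlock()

	retry, ok := p.retryChannels[taskID]
	if !ok {
		return fmt.Errorf("no retry pending for task: %v", taskID)
	}
	delete(p.retryChannels, taskID)
	close(retry)
	return nil
}

// runWithRetry runs the task until it succeeds or the context is done.
func (p *ParallelRunner) runWithRetry(ctx context.Context, taskID string, task func(context.Context) error) error {
	for {
		if err := task(ctx); err == nil {
			return nil
		}
		// Non-blocking check: don't offer a retry if the context expired.
		select {
		case <-ctx.Done():
			return ctx.Err()
		default:
		}
		retry := p.addRetryButton(taskID)
		select {
		case <-retry:
			// The user asked for a retry. Loop and run the task again.
		case <-ctx.Done():
			p.mu.Lock()
			delete(p.retryChannels, taskID)
			p.mu.Unlock()
			return ctx.Err()
		}
	}
}
```

Closing the channel in `triggerRetry` while holding the lock means a second trigger for the same task returns an error instead of closing the channel twice.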
…e.proto. Implement ParallelRunner and Checkpointer. Create a simple test for ParallelRunner.
workflow. Complete the unit test and E2E test for happy path.
unit test to verify this function. Implemented Horizontal Resharding workflow and tested in unit test and e2e test for the happy path.
for change on node.go in workflow folder.
e0cd73d to 72f499a
Review status: 9 of 11 files reviewed at latest revision, 57 unresolved discussions, some commit checks broke. go/vt/topo/workflow.go, line 30 at r6 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/node.go, line 105 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 13 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 32 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/checkpoint.go, line 43 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 33 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 35 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 61 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 63 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 94 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 96 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 140 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 144 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 159 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 161 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 174 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 177 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 191 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 238 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 263 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 267 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 30 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 37 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 91 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 116 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
I clean up the entire registry here. go/vt/workflow/resharding/parallel_runner.go, line 134 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 142 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 155 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 155 at r8 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 36 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 47 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 97 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 104 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 122 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 125 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 141 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 144 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 52 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 59 at r5 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 57 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 88 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 124 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner_test.go, line 159 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 1 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 16 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 19 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 37 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 39 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 46 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task_helper.go, line 118 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
|
First round of comments. Everything is looking good so far and I think it's coming together nicely :) Reviewed 3 of 10 files at r12, 4 of 8 files at r15. go/vt/workflow/node.go, line 223 at r15 (raw file):
Two things: I would use the term "relative path" in here and rewrite the fact that it's not thread safe. With that, I suggest: go/vt/workflow/node.go, line 228 at r15 (raw file):
Move this line right before the for statement to make it clearer that it's actually a loop variable in some sense. go/vt/workflow/node.go, line 228 at r15 (raw file):
This name is not very meaningful. How about go/vt/workflow/node.go, line 234 at r15 (raw file):
When you rename go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 159 at r9 (raw file): Previously, wangyipei01-bot wrote…
This is not done yet. Opening and closing the topology is only needed in When you move it, please also add a TODO comment that we should extend the factory interface to pass in the topo as well. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 231 at r15 (raw file):
nit: Please swap these two lines because I find it more intuitive that the source comes first and then the destination. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 277 at r15 (raw file):
nit: Please remove the prefix "List" because it's redundant. From the type we already know that it's a string slice :) Please change this throughout the file. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 312 at r15 (raw file):
nit: Similar comment as above. Just call it "tasks" and do not include the data type in the variable name. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 312 at r15 (raw file):
Please add a check here that the number of source shards is equal to the number of vtworkers. If not, this function must return an error. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 317 at r15 (raw file):
nit: Please move the line with the keyspace to the beginning because it's the coarsest dimension when addressing shards. Please make this change for the other occurrences as well. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 376 at r15 (raw file):
Please make this the first parameter. It's common across all invocations. This will make it easier to read the method above because the reader sees that "tasks" is the same for each call. go/vt/workflow/resharding/parallel_runner.go, line 19 at r15 (raw file):
nit: Let's use CamelCase here and not ALLCAPS, i.e. it should be "Sequential" and "Parallel". go/vt/workflow/resharding/parallel_runner.go, line 34 at r15 (raw file):
Please add a newline before this line. That's because all fields protected by a mutex must be in a separate group. By following this convention, the code becomes easier to read. go/vt/workflow/resharding/parallel_runner.go, line 39 at r15 (raw file):
typo: workflow go/vt/workflow/resharding/parallel_runner.go, line 41 at r15 (raw file):
typo: synchronizing go/vt/workflow/resharding/parallel_runner.go, line 87 at r15 (raw file):
Before executing the task, please set its state in the checkpoint to go/vt/workflow/resharding/parallel_runner.go, line 100 at r15 (raw file):
nit: Our code base uses the spelling "canceled" with a single l because Go changed to that as well. Please change it as well. go/vt/workflow/resharding/parallel_runner.go, line 108 at r15 (raw file):
Please remove debug statements. go/vt/workflow/resharding/parallel_runner.go, line 160 at r15 (raw file):
nit: Please move this right after the go/vt/workflow/resharding/task.go, line 1 at r15 (raw file):
Please rename this file to go/vt/workflow/resharding/task.go, line 23 at r15 (raw file):
This is a method of "hw" which has the checkpoint as field. You don't need to pass it in as well. go/vt/workflow/resharding/task.go, line 30 at r15 (raw file):
Please add a default case here. In it, you can add an assertion. For that, we use panic with a string that starts with "BUG: ". Example: panic(fmt.Sprintf("BUG: unknown phase type: %v", phase)) go/vt/workflow/resharding/task.go, line 41 at r15 (raw file):
nit: Using short variables is a good idea when the scope is short. But here you're mixing long ("keyspace") and short names ("s"). To be consistent, I suggest renaming the two shard variables to go/vt/workflow/resharding/task.go, line 43 at r15 (raw file):
nit: Same comment as above. Please always list keyspace first and then the shards. go/vt/workflow/resharding/task.go, line 46 at r15 (raw file):
Please move the logging into the parallel runner. This way, these functions become shorter. e.g. here you can just write Please change this throughout the file. Comments from Reviewable |
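For reference, two of the requests in this round (returning an error when the number of source shards differs from the number of vtworkers, and adding a default case that panics with a "BUG: " prefix) could look roughly like the sketch below. The names PhaseType, PhaseCopySchema, PhaseClone, validateWorkerCount and phaseName are made up for illustration and are not taken from the PR.

```go
package main

import "fmt"

// PhaseType is a hypothetical stand-in for the workflow's phase enum.
type PhaseType int

const (
	PhaseCopySchema PhaseType = iota
	PhaseClone
)

// validateWorkerCount returns an error unless there is exactly one
// vtworker per source shard. (Hypothetical helper, not from the PR.)
func validateWorkerCount(sourceShards, vtworkers []string) error {
	if len(sourceShards) != len(vtworkers) {
		return fmt.Errorf("number of source shards (%d) must be equal to number of vtworkers (%d)",
			len(sourceShards), len(vtworkers))
	}
	return nil
}

// phaseName maps a phase to a label. The default case is an assertion:
// reaching it means the code is out of sync with the enum, so it panics
// with the "BUG: " prefix instead of returning an error.
func phaseName(phase PhaseType) string {
	switch phase {
	case PhaseCopySchema:
		return "copy_schema"
	case PhaseClone:
		return "clone"
	default:
		panic(fmt.Sprintf("BUG: unknown phase type: %v", phase))
	}
}

func main() {
	// Keyspace first, then shards, as requested in the review.
	keyspace := "test_keyspace"
	sourceShards := []string{"0"}
	vtworkers := []string{"worker1", "worker2"}

	if err := validateWorkerCount(sourceShards, vtworkers); err != nil {
		fmt.Printf("keyspace %v: %v\n", keyspace, err)
	}
	fmt.Println(phaseName(PhaseClone))
}
```

Ordinary input problems go through the error return; panic is reserved for states that can only be reached through a programming mistake.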
72f499a to
02d7a1e
Compare
|
Review status: 7 of 12 files reviewed at latest revision, 52 unresolved discussions, some commit checks broke. go/vt/workflow/node.go, line 223 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/node.go, line 228 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/node.go, line 228 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/node.go, line 234 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 159 at r9 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. TODO comment is added to the interface definition. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 231 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 277 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 312 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 312 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 317 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 376 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 19 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 34 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 39 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 41 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 87 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 100 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 108 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 160 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 1 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 30 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 41 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 43 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 46 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
02d7a1e to
f201601
Compare
|
Last round of small comments. Once they're addressed, this is LGTM. As discussed offline, I'll review the test code separately when you switch it to the new mocking library. Please also re-run "make proto". Travis is currently failing because of that: https://travis-ci.org/youtube/vitess/jobs/204284333 Reviewed 4 of 6 files at r16. go/vt/workflow/node.go, line 223 at r15 (raw file): Previously, wangyipei01-bot wrote…
Sorry to nitpick here, but I think the changed comment is not clear enough e.g. it doesn't properly express that "subPath" is a relative path which is relative to this node. Another minor issue is that it should be "thread safe" and not "concurrency safe" and that this comment is not detailed enough: You could actually call this method concurrently, but the node tree must not change during each call. Given that, why not just use the comment I wrote? go/vt/workflow/node.go, line 228 at r15 (raw file): Previously, wangyipei01-bot wrote…
I meant the declaration of the node should be right before the
// Find the subnode if needed.
parts := strings.Split(subPath, "/")
currentNode := n
for i := 0; i < len(parts); i++ {
go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 6 at r16 (raw file):
Please remove this TODO comment. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 35 at r16 (raw file):
typo: workflow go/vt/workflow/resharding/parallel_runner.go, line 35 at r16 (raw file):
typo: Registry go/vt/workflow/resharding/parallel_runner.go, line 89 at r16 (raw file):
nit: This should be "to": Change something to something. Same below ("to done" instead of "of done"). go/vt/workflow/resharding/parallel_runner.go, line 97 at r16 (raw file):
nit: Missing period. go/vt/workflow/resharding/parallel_runner.go, line 104 at r16 (raw file):
nit: has finished go/vt/workflow/resharding/task.go, line 1 at r15 (raw file): Previously, wangyipei01-bot wrote…
File is not renamed yet. Comments from Reviewable |
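To make the node.go discussion above concrete, here is a minimal sketch of a relative-path lookup with the loop variable declared right before the loop. The Node type and the method name GetChildByPath are simplified assumptions and not the actual definitions in go/vt/workflow/node.go.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified, hypothetical stand-in for a workflow UI node.
type Node struct {
	PathName string
	Children []*Node
}

// GetChildByPath returns the descendant addressed by subPath, which is a
// relative path, i.e. relative to n. The method takes no lock: it may be
// called concurrently, but the node tree must not change during a call.
func (n *Node) GetChildByPath(subPath string) (*Node, error) {
	// Find the subnode if needed.
	parts := strings.Split(subPath, "/")
	currentNode := n
	for i := 0; i < len(parts); i++ {
		childPathName := parts[i]
		found := false
		for _, child := range currentNode.Children {
			if child.PathName == childPathName {
				found = true
				currentNode = child
				break
			}
		}
		if !found {
			return nil, fmt.Errorf("node %v has no child named %v",
				currentNode.PathName, childPathName)
		}
	}
	return currentNode, nil
}

func main() {
	leaf := &Node{PathName: "clone_0"}
	phase := &Node{PathName: "clone", Children: []*Node{leaf}}
	root := &Node{PathName: "root", Children: []*Node{phase}}

	node, err := root.GetChildByPath("clone/clone_0")
	fmt.Println(node.PathName, err)
}
```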
f201601 to
b3c5cef
Compare
|
Review status: 8 of 13 files reviewed at latest revision, 25 unresolved discussions, some commit checks broke. go/vt/workflow/node.go, line 223 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/node.go, line 228 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 6 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/horizontal_resharding_workflow.go, line 35 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 35 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 89 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 97 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/parallel_runner.go, line 104 at r16 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 1 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. go/vt/workflow/resharding/task.go, line 23 at r15 (raw file): Previously, mberlin-bot (Michael Berlin) wrote…
Done. Comments from Reviewable |
|
Reviewed 5 of 5 files at r17. Comments from Reviewable |
horizontal resharding workflow. First round comments resolved. (addressing race condition warnings)
931d1dc to
b1b0a9a
Compare
This is a follow-up fix for vitessio#2495. We were not able to fix Yipei's workstation such that it would use the same protobuf generator as Travis and our setups do. Therefore, I'm re-generating the files separately on my machine.
|
With the more coarse grained locking I think we can further simplify the code for the retry controller. Since this PR is merged, please address my comments in a new PR. Thanks! :) Reviewed 1 of 2 files at r19. go/vt/workflow/resharding/parallel_runner.go, line 35 at r19 (raw file):
Please add a comment here that it's also used to serialize the UI node changes. go/vt/workflow/resharding/parallel_runner.go, line 154 at r19 (raw file):
I think this can be kept simpler now: You can remove the unregister function completely and replace this call with a delete of the controller. Since you're still holding the lock, you know for sure that the controller is still in the map and you can just delete it. The other usage of unregister, when a context is done, is most likely overkill and you can just remove that. When the context is done, all goroutines should end and we don't care about the state of the action registry. go/vt/workflow/resharding/parallel_runner.go, line 162 at r19 (raw file):
Note that we do not use panic for error handling and instead always have However, I can see that this is more an assertion than an error. In that case, please change it as follows:
go/vt/workflow/resharding/parallel_runner.go, line 173 at r19 (raw file):
Similar comment as above: With the locking gone, this can be folded into go/vt/workflow/resharding/retry_controller.go, line 6 at r19 (raw file):
Let's remove this structure and store only the channel in the registry. The only other field here is "node". But you can also get that by using the The path which Action() returns unfortunately also includes the root node. Since no other workflow is using the path so far, please change the This way you can use the returned path to look up the node and you don't need to store it in the registry. The only thing left then is adding and removing the Actions to the node object. You can add this code to your ParallelRunner object. go/vt/workflow/resharding/retry_controller.go, line 15 at r19 (raw file):
This should be: NewRetryController() go/vt/workflow/resharding/retry_controller.go, line 32 at r19 (raw file):
This should never happen, should it? If this happens, this is a bug with your registry mechanism? Comments from Reviewable |
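As a rough illustration of the simplification suggested in this round (store only the channel in the registry and delete the entry while the lock is still held), a sketch could look like the following. The type and method names (ParallelRunner fields, addRetry, triggerRetry) are assumptions for illustration; the real runner also manages UI nodes and Actions, which are left out here.

```go
package main

import (
	"fmt"
	"sync"
)

// ParallelRunner is a pared-down, hypothetical version of the runner.
type ParallelRunner struct {
	name string

	// mu protects the fields below. Fields guarded by a mutex are kept
	// in their own group, separated by a blank line. In the PR the same
	// mutex is also used to serialize the UI node changes.
	mu sync.Mutex
	// retryChannels maps a task ID to the channel that is closed when
	// a retry is requested for that task.
	retryChannels map[string]chan struct{}
}

// NewParallelRunner returns a runner with an empty registry.
func NewParallelRunner(name string) *ParallelRunner {
	return &ParallelRunner{
		name:          name,
		retryChannels: make(map[string]chan struct{}),
	}
}

// addRetry registers a retry channel for taskID and returns it.
func (p *ParallelRunner) addRetry(taskID string) <-chan struct{} {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan struct{})
	p.retryChannels[taskID] = ch
	return ch
}

// triggerRetry wakes up the waiting task and removes its entry. Because
// the lock is held for the whole call, the lookup and the delete cannot
// race with another caller, so no separate unregister function is needed.
func (p *ParallelRunner) triggerRetry(taskID string) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch, ok := p.retryChannels[taskID]
	if !ok {
		return fmt.Errorf("no retry action registered for task: %v", taskID)
	}
	delete(p.retryChannels, taskID)
	close(ch)
	return nil
}

func main() {
	p := NewParallelRunner("demo")
	retry := p.addRetry("task1")
	fmt.Println(p.triggerRetry("task1"))
	<-retry // Returns immediately because the channel was closed.
	fmt.Println(p.triggerRetry("task1"))
}
```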
Defined a structure to track the status of each task in the workflowstatus proto.
Demonstrated its usage in the function Run. In each step, the workflow
first checks the status, then generates the task parameters for
unfinished tasks. After each task executes, it updates the
status to Done/Failed. This status update is verified in the unit test
through a happy path test.