vtexplain rework and improve#3267
Conversation
Previously, when returning results from queries, vtexplain used the default SingleRowResult that was used in the various tablet tests and as such had a fixed schema of two fields. This caused confusing results for queries involving econdary indexes as the results are used as part of the plan generation. Instead, rework the implementation so that instead it tries to parse the query sent to the fake mysql and in the cases of simple select expressions, uses the knowledge of the table schema to produce Result objects with the correct types and synthetic values. As part of this work, replace the QueryLogger hook in fakesqldb with a more generic hook that enables overriding the query handler.
Modify the fake healthcheck module to allow other types of fake connections to be created in addition to SandboxConn. Remove the Executor interface from SandboxConn. Both of these pave the way for reworking how vtexplain is plumbed to avoid using SandboxConn at all and instead just implement the QueryService interface directly.
The previous method of hijacking the SandboxConn didn't properly execute all of the logic for transactions and instead relied on autocommit behavior at the tablet. Instead, implement the QueryService directly and route the calls to the tabletserver implementation. This does a better job of emulating the actual vtgate transaction flow and opens up the opportunity to demonstrate other things such as 2PC.
To better illustrate the network interactions between vtgate and the shards, use the sync batcher to advance a logical "time" step as part of the simulated RPC layer between the vttablet. Then in the text mode output, sort the queries by time so that the user can better see the ordering and parallelism of the various queries.
Refresh vtexplain test output to match the new implementation. Change default test configuration to use 4 shards instead of 2. Update the test harness to put all the output files into one temp directory to make it easier to copy them into the data/test directory after implementation changes.
sougou
left a comment
There was a problem hiding this comment.
I've reviewed everything except the tests so far.
| colTypes = append(colTypes, querypb.Type_VARCHAR) | ||
| break | ||
| default: | ||
| return fmt.Errorf("unsupported sql value %s", sqlparser.String(node)) |
There was a problem hiding this comment.
This as well as the error below are too restrictive. I feel like we should return some default type, like you're doing for functions.
Also, you can use sqlparser.String(node) for the field name (if there's no alias specified).
| return nil, err | ||
| } | ||
| // Begin is part of the QueryService interface. | ||
| func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) { |
There was a problem hiding this comment.
You should be able to remove all these boiler-plates by anonymous embedding of tabletserver:
type fakeTablet struct {
*tabletserver.TabletServer
...
}
Then just define the functions that need changing. You can call into the embedded tabletserver using: tablet.(TabletServer).Execute(...)
There was a problem hiding this comment.
I thought about doing that but I'd rather keep the current approach that will explicitly panic if an unimplemented method gets called.
There was a problem hiding this comment.
You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29
This will it will return an error for each interface method unless you override it here.
There was a problem hiding this comment.
Thanks for the suggestion -- I actually went a slightly different route with the same effect. I changed the tablet to embed an anonymous QueryService interface that is initialized with a wrapped QueryService that throws an error if any method is called (though prints the method name), but the overridden methods still get called as expected.
go/vt/vtexplain/vtexplain.go
Outdated
| } | ||
|
|
||
| if sql != "" { | ||
| batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000)) |
| // newBatch starts a new batch | ||
| func (b *Batcher) newBatch() { | ||
| go func() { | ||
| time.Sleep(b.interval) |
There was a problem hiding this comment.
For containers, we've seen cases where a process can lose CPU for long periods of time, in which case this may fail. I'm wondering if runtime.Gosched may do a better job, and possibly be faster.
Of course, we'll need to run these tests on a heavily loaded machine. For now, I think this is ok. We may have to revisit this if we see flakyness.
There was a problem hiding this comment.
Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)
Our usual trick is to mock time.Now() when it comes to comparing timestamps.
In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.
In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)
In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.
Does that make sense?
For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.
There was a problem hiding this comment.
I had a long conversation with @sougou on Slack about ways to get rid of the Sleep. Unfortunately there doesn't seem to be any way to do this given how vtgate works, since there's no way to reliably determine when a query execution is blocked waiting for response(s) from the tablet(s) or when it might issue additional queries before waiting.
To that end, I'll try to make the timeouts a bit more forgiving when running in Travis, and if that doesn't work we may have to mark the tests as flaky.
| // newBatch starts a new batch | ||
| func (b *Batcher) newBatch() { | ||
| go func() { | ||
| time.Sleep(b.interval) |
There was a problem hiding this comment.
Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)
Our usual trick is to mock time.Now() when it comes to comparing timestamps.
In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.
In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)
In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.
Does that make sense?
For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.
|
|
||
| // test single waiter | ||
| go expectBatch("single waiter", b, 1, t) | ||
| time.Sleep(interval * 2) |
There was a problem hiding this comment.
You probably put the sleeps here because you noticed that the test method doesn't wait for the Go routines to finish :) Go quits the test and the binary anyway.
Here the same comments as above applies: time.Sleep() is still prone to flakes and unnecessarily makes your test slower. The solution here is to use a WaitGroup instead to wait for your launched Go routines.
There was a problem hiding this comment.
I can certainly make these changes for the batcher unit test itself, but the vtexplain tests will still be subject to the races.
There was a problem hiding this comment.
Yes, I talked about this with Sugu and I agree: This unit test can be changed in such a way that you have full control over the timing. But for an integration test like the vtexplain test this is not possible.
There was a problem hiding this comment.
Given that we have a timing dependency already with the vtexplain tests, I decided to keep the batcher unit test as-is but relaxed the timeouts a bit and marked the test as flaky to appease travis.
go/vt/vtexplain/vtexplain.go
Outdated
| } | ||
|
|
||
| if sql != "" { | ||
| batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000)) |
There was a problem hiding this comment.
batchTime is not used and not declared before?
I didn't look at the rest of the code, but it's not clear to me yet how this batcher will be used here :)
There was a problem hiding this comment.
This probably warrants a comment, but it's resetting the batch counter declared in vtexplain_vttablet.go so that each query starts from batch 1.
go/sync2/batcher.go
Outdated
| } | ||
|
|
||
| // Wait adds a new waiter to the queue and blocks until the next batch | ||
| func (b *Batcher) Wait() int32 { |
There was a problem hiding this comment.
I think int32 is an implementation detail and having int in the API instead would be nicer, especially since you're adding it in a common package.
Go guarantees you that int fits at least int32, so just casting at the return should be fine.
| return nil, err | ||
| } | ||
| // Begin is part of the QueryService interface. | ||
| func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) { |
There was a problem hiding this comment.
You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29
This will it will return an error for each interface method unless you override it here.
| return nil, err | ||
| } | ||
| // Begin is part of the QueryService interface. | ||
| func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) { |
There was a problem hiding this comment.
nit: Receivers must be 1 or 2 characters only: https://github.com/golang/go/wiki/CodeReviewComments#receiver-names
I would just go here with t (or maybe ft`).
9698fa6 to
2839210
Compare
Three major improvements to vtexplain, building upon #3266 in light of the comments in #3265:
Previously, when returning results from queries, vtexplain used the default SingleRowResult that was used in the various tablet tests and as such had a fixed schema of two fields. This caused confusing results for queries involving secondary indexes as the results are used as part of the plan generation.
Instead, rework the implementation so that it tries to parse the query sent to the fake mysql and in the cases of simple select expressions, uses the knowledge of the table schema to produce Result objects with the correct types and synthetic values.
As part of this work, replace the QueryLogger hook in fakesqldb with a more generic hook that enables overriding the query handler.
The previous method of hijacking the SandboxConn didn't properly execute all of the logic for transactions and instead relied on autocommit behavior at the tablet which I didn't properly understand before.
Instead, change vtexplain to implement the QueryService directly and route the calls to the tabletserver implementation rather than using SandboxConn at all. As part of this, remove the hooks added to SandboxConn for supporting vtexplain.
This does a better job of emulating the actual vtgate transaction flow and opens up the opportunity to demonstrate other things such as 2PC.
Add a logical clock simulation to the vtexplain engine to make it clearer to the user which queries between vtgate and the various tables run in parallel vs depending on the results of one other.
As part of this, rework the text mode output to sort the queries in time order as opposed to per tablet.