Skip to content

vtexplain rework and improve#3267

Merged
sougou merged 16 commits intovitessio:masterfrom
tinyspeck:vtexplain-big-rework
Oct 12, 2017
Merged

vtexplain rework and improve#3267
sougou merged 16 commits intovitessio:masterfrom
tinyspeck:vtexplain-big-rework

Conversation

@demmer
Copy link
Copy Markdown
Member

@demmer demmer commented Oct 1, 2017

Three major improvements to vtexplain, building upon #3266 in light of the comments in #3265:

  1. Types in query results

Previously, when returning results from queries, vtexplain used the default SingleRowResult that was used in the various tablet tests and as such had a fixed schema of two fields. This caused confusing results for queries involving secondary indexes as the results are used as part of the plan generation.

Instead, rework the implementation so that it tries to parse the query sent to the fake mysql and in the cases of simple select expressions, uses the knowledge of the table schema to produce Result objects with the correct types and synthetic values.

As part of this work, replace the QueryLogger hook in fakesqldb with a more generic hook that enables overriding the query handler.

  1. Proper handling of transactions

The previous method of hijacking the SandboxConn didn't properly execute all of the logic for transactions and instead relied on autocommit behavior at the tablet which I didn't properly understand before.

Instead, change vtexplain to implement the QueryService directly and route the calls to the tabletserver implementation rather than using SandboxConn at all. As part of this, remove the hooks added to SandboxConn for supporting vtexplain.

This does a better job of emulating the actual vtgate transaction flow and opens up the opportunity to demonstrate other things such as 2PC.

  1. Time simulation

Add a logical clock simulation to the vtexplain engine to make it clearer to the user which queries between vtgate and the various tables run in parallel vs depending on the results of one other.

As part of this, rework the text mode output to sort the queries in time order as opposed to per tablet.

demmer added 7 commits October 1, 2017 10:24
Previously, when returning results from queries, vtexplain used
the default SingleRowResult that was used in the various tablet
tests and as such had a fixed schema of two fields. This caused
confusing results for queries involving econdary indexes as the
results are used as part of the plan generation.

Instead, rework the implementation so that instead it tries to
parse the query sent to the fake mysql and in the cases of simple
select expressions, uses the knowledge of the table schema to
produce Result objects with the correct types and synthetic
values.

As part of this work, replace the QueryLogger hook in fakesqldb
with a more generic hook that enables overriding the query handler.
Modify the fake healthcheck module to allow other types of fake
connections to be created in addition to SandboxConn.

Remove the Executor interface from SandboxConn.

Both of these pave the way for reworking how vtexplain is plumbed
to avoid using SandboxConn at all and instead just implement the
QueryService interface directly.
The previous method of hijacking the SandboxConn didn't properly
execute all of the logic for transactions and instead relied on
autocommit behavior at the tablet.

Instead, implement the QueryService directly and route the calls
to the tabletserver implementation.

This does a better job of emulating the actual vtgate transaction
flow and opens up the opportunity to demonstrate other things such
as 2PC.
To better illustrate the network interactions between vtgate and the
shards, use the sync batcher to advance a logical "time" step as part
of the simulated RPC layer between the vttablet.

Then in the text mode output, sort the queries by time so that the user
can better see the ordering and parallelism of the various queries.
Refresh vtexplain test output to match the new implementation.

Change default test configuration to use 4 shards instead of 2.

Update the test harness to put all the output files into one temp
directory to make it easier to copy them into the data/test
directory after implementation changes.
@demmer demmer requested review from rafael and sougou October 1, 2017 17:32
Copy link
Copy Markdown
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've reviewed everything except the tests so far.

colTypes = append(colTypes, querypb.Type_VARCHAR)
break
default:
return fmt.Errorf("unsupported sql value %s", sqlparser.String(node))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This as well as the error below are too restrictive. I feel like we should return some default type, like you're doing for functions.
Also, you can use sqlparser.String(node) for the field name (if there's no alias specified).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

return nil, err
}
// Begin is part of the QueryService interface.
func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to remove all these boiler-plates by anonymous embedding of tabletserver:

type fakeTablet struct {
  *tabletserver.TabletServer
  ...
}

Then just define the functions that need changing. You can call into the embedded tabletserver using: tablet.(TabletServer).Execute(...)

Ref: https://golang.org/ref/spec#Struct_types

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about doing that but I'd rather keep the current approach that will explicitly panic if an unimplemented method gets called.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29

This will it will return an error for each interface method unless you override it here.

Copy link
Copy Markdown
Member Author

@demmer demmer Oct 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion -- I actually went a slightly different route with the same effect. I changed the tablet to embed an anonymous QueryService interface that is initialized with a wrapped QueryService that throws an error if any method is called (though prints the method name), but the overridden methods still get called as expected.

}

if sql != "" {
batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use 10*time.Millisecond.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// newBatch starts a new batch
func (b *Batcher) newBatch() {
go func() {
time.Sleep(b.interval)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For containers, we've seen cases where a process can lose CPU for long periods of time, in which case this may fail. I'm wondering if runtime.Gosched may do a better job, and possibly be faster.

Of course, we'll need to run these tests on a heavily loaded machine. For now, I think this is ok. We may have to revisit this if we see flakyness.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)

Our usual trick is to mock time.Now() when it comes to comparing timestamps.

In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.

In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)

In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.

Does that make sense?

For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a long conversation with @sougou on Slack about ways to get rid of the Sleep. Unfortunately there doesn't seem to be any way to do this given how vtgate works, since there's no way to reliably determine when a query execution is blocked waiting for response(s) from the tablet(s) or when it might issue additional queries before waiting.

To that end, I'll try to make the timeouts a bit more forgiving when running in Travis, and if that doesn't work we may have to mark the tests as flaky.

// newBatch starts a new batch
func (b *Batcher) newBatch() {
go func() {
time.Sleep(b.interval)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)

Our usual trick is to mock time.Now() when it comes to comparing timestamps.

In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.

In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)

In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.

Does that make sense?

For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.


// test single waiter
go expectBatch("single waiter", b, 1, t)
time.Sleep(interval * 2)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably put the sleeps here because you noticed that the test method doesn't wait for the Go routines to finish :) Go quits the test and the binary anyway.

Here the same comments as above applies: time.Sleep() is still prone to flakes and unnecessarily makes your test slower. The solution here is to use a WaitGroup instead to wait for your launched Go routines.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can certainly make these changes for the batcher unit test itself, but the vtexplain tests will still be subject to the races.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I talked about this with Sugu and I agree: This unit test can be changed in such a way that you have full control over the timing. But for an integration test like the vtexplain test this is not possible.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have a timing dependency already with the vtexplain tests, I decided to keep the batcher unit test as-is but relaxed the timeouts a bit and marked the test as flaky to appease travis.

}

if sql != "" {
batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batchTime is not used and not declared before?

I didn't look at the rest of the code, but it's not clear to me yet how this batcher will be used here :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably warrants a comment, but it's resetting the batch counter declared in vtexplain_vttablet.go so that each query starts from batch 1.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a comment here.

}

// Wait adds a new waiter to the queue and blocks until the next batch
func (b *Batcher) Wait() int32 {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think int32 is an implementation detail and having int in the API instead would be nicer, especially since you're adding it in a common package.

Go guarantees you that int fits at least int32, so just casting at the return should be fine.

return nil, err
}
// Begin is part of the QueryService interface.
func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29

This will it will return an error for each interface method unless you override it here.

return nil, err
}
// Begin is part of the QueryService interface.
func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Receivers must be 1 or 2 characters only: https://github.com/golang/go/wiki/CodeReviewComments#receiver-names

I would just go here with t (or maybe ft`).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks.

@demmer demmer force-pushed the vtexplain-big-rework branch from 9698fa6 to 2839210 Compare October 9, 2017 03:32
@sougou
Copy link
Copy Markdown
Contributor

sougou commented Oct 12, 2017

LGTM
By the fact this is WIP, and limitations are known, I'll merge this in.

Approved with PullApprove

@sougou sougou merged commit ea3b980 into vitessio:master Oct 12, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants