vtexplain rework and improve by demmer · Pull Request #3267 · vitessio/vitess

demmer · 2017-10-01T17:32:46Z

Three major improvements to vtexplain, building upon #3266 in light of the comments in #3265:

Types in query results

Previously, when returning results from queries, vtexplain used the default SingleRowResult that was used in the various tablet tests and as such had a fixed schema of two fields. This caused confusing results for queries involving secondary indexes as the results are used as part of the plan generation.

Instead, rework the implementation so that it tries to parse the query sent to the fake mysql and in the cases of simple select expressions, uses the knowledge of the table schema to produce Result objects with the correct types and synthetic values.

As part of this work, replace the QueryLogger hook in fakesqldb with a more generic hook that enables overriding the query handler.

Proper handling of transactions

The previous method of hijacking the SandboxConn didn't properly execute all of the logic for transactions and instead relied on autocommit behavior at the tablet which I didn't properly understand before.

Instead, change vtexplain to implement the QueryService directly and route the calls to the tabletserver implementation rather than using SandboxConn at all. As part of this, remove the hooks added to SandboxConn for supporting vtexplain.

This does a better job of emulating the actual vtgate transaction flow and opens up the opportunity to demonstrate other things such as 2PC.

Time simulation

Add a logical clock simulation to the vtexplain engine to make it clearer to the user which queries between vtgate and the various tables run in parallel vs depending on the results of one other.

As part of this, rework the text mode output to sort the queries in time order as opposed to per tablet.

Previously, when returning results from queries, vtexplain used the default SingleRowResult that was used in the various tablet tests and as such had a fixed schema of two fields. This caused confusing results for queries involving econdary indexes as the results are used as part of the plan generation. Instead, rework the implementation so that instead it tries to parse the query sent to the fake mysql and in the cases of simple select expressions, uses the knowledge of the table schema to produce Result objects with the correct types and synthetic values. As part of this work, replace the QueryLogger hook in fakesqldb with a more generic hook that enables overriding the query handler.

Modify the fake healthcheck module to allow other types of fake connections to be created in addition to SandboxConn. Remove the Executor interface from SandboxConn. Both of these pave the way for reworking how vtexplain is plumbed to avoid using SandboxConn at all and instead just implement the QueryService interface directly.

The previous method of hijacking the SandboxConn didn't properly execute all of the logic for transactions and instead relied on autocommit behavior at the tablet. Instead, implement the QueryService directly and route the calls to the tabletserver implementation. This does a better job of emulating the actual vtgate transaction flow and opens up the opportunity to demonstrate other things such as 2PC.

To better illustrate the network interactions between vtgate and the shards, use the sync batcher to advance a logical "time" step as part of the simulated RPC layer between the vttablet. Then in the text mode output, sort the queries by time so that the user can better see the ordering and parallelism of the various queries.

Refresh vtexplain test output to match the new implementation. Change default test configuration to use 4 shards instead of 2. Update the test harness to put all the output files into one temp directory to make it easier to copy them into the data/test directory after implementation changes.

sougou

I've reviewed everything except the tests so far.

sougou · 2017-10-02T01:47:55Z

go/vt/vtexplain/vtexplain_vttablet.go

+						colTypes = append(colTypes, querypb.Type_VARCHAR)
+						break
+					default:
+						return fmt.Errorf("unsupported sql value %s", sqlparser.String(node))


This as well as the error below are too restrictive. I feel like we should return some default type, like you're doing for functions.
Also, you can use sqlparser.String(node) for the field name (if there's no alias specified).

sougou · 2017-10-02T02:10:41Z

go/vt/vtexplain/vtexplain_vttablet.go

-		return nil, err
-	}
+// Begin is part of the QueryService interface.
+func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {


You should be able to remove all these boiler-plates by anonymous embedding of tabletserver:

type fakeTablet struct { *tabletserver.TabletServer ... }

Then just define the functions that need changing. You can call into the embedded tabletserver using: tablet.(TabletServer).Execute(...)

Ref: https://golang.org/ref/spec#Struct_types

I thought about doing that but I'd rather keep the current approach that will explicitly panic if an unimplemented method gets called.

You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29

This will it will return an error for each interface method unless you override it here.

Thanks for the suggestion -- I actually went a slightly different route with the same effect. I changed the tablet to embed an anonymous QueryService interface that is initialized with a wrapped QueryService that throws an error if any method is called (though prints the method name), but the overridden methods still get called as expected.

sougou · 2017-10-02T02:43:25Z

go/vt/vtexplain/vtexplain.go

 		}

 		if sql != "" {
+			batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000))


use 10*time.Millisecond.

sougou · 2017-10-02T03:00:58Z

go/sync2/batcher.go

+// newBatch starts a new batch
+func (b *Batcher) newBatch() {
+	go func() {
+		time.Sleep(b.interval)


For containers, we've seen cases where a process can lose CPU for long periods of time, in which case this may fail. I'm wondering if runtime.Gosched may do a better job, and possibly be faster.

Of course, we'll need to run these tests on a heavily loaded machine. For now, I think this is ok. We may have to revisit this if we see flakyness.

Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)

Our usual trick is to mock time.Now() when it comes to comparing timestamps.

In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.

In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)

In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.

Does that make sense?

For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.

I had a long conversation with @sougou on Slack about ways to get rid of the Sleep. Unfortunately there doesn't seem to be any way to do this given how vtgate works, since there's no way to reliably determine when a query execution is blocked waiting for response(s) from the tablet(s) or when it might issue additional queries before waiting.

To that end, I'll try to make the timeouts a bit more forgiving when running in Travis, and if that doesn't work we may have to mark the tests as flaky.

michael-berlin · 2017-10-02T06:35:10Z

go/sync2/batcher.go

+// newBatch starts a new batch
+func (b *Batcher) newBatch() {
+	go func() {
+		time.Sleep(b.interval)


Yeah, I also think that this will result into flaky tests. (That's the case when you launch multiple Go routines in your test code and not all of them will finish their time.Sleep(n) before the test finishes its time.Sleep(2*n) and moves on to the next test case. Weird delays of single routines, but not all of them, and therefore a re-ordering can happen on slow machines.)

Our usual trick is to mock time.Now() when it comes to comparing timestamps.

In general, we try to avoid time.Sleep() in test code because it makes the tests slower and flaky.

In this case, you could make time.Sleep() configurable and your test could provide a more sophisticated version for which you control when it unblocks. (Personally, I like to provide a second private constructor which the test can use to inject their implementation. Example for time.Now(): https://github.com/youtube/vitess/blob/master/go/vt/vtgate/buffer/buffer.go#L107)

In your case, your fake Sleep() could block on reading a channel that was created by the test method. When you close the channel in the test method, time.Sleep() unblocks and triggers the rest of your code.

Does that make sense?

For the channel, please use an empty struct since the values are irrelevant and in fact you don't need to put anything into it. Instead, you'll only use the closure of the channel as signal. I looked around and think that this test is a similar example for using such a channel as notification: https://github.com/youtube/vitess/blob/master/go/stats/counters_test.go#L177 We also have actual code overall the place where we use this pattern.

michael-berlin · 2017-10-02T06:42:31Z

go/sync2/batcher_test.go

+
+	// test single waiter
+	go expectBatch("single waiter", b, 1, t)
+	time.Sleep(interval * 2)


You probably put the sleeps here because you noticed that the test method doesn't wait for the Go routines to finish :) Go quits the test and the binary anyway.

Here the same comments as above applies: time.Sleep() is still prone to flakes and unnecessarily makes your test slower. The solution here is to use a WaitGroup instead to wait for your launched Go routines.

I can certainly make these changes for the batcher unit test itself, but the vtexplain tests will still be subject to the races.

Yes, I talked about this with Sugu and I agree: This unit test can be changed in such a way that you have full control over the timing. But for an integration test like the vtexplain test this is not possible.

Given that we have a timing dependency already with the vtexplain tests, I decided to keep the batcher unit test as-is but relaxed the timeouts a bit and marked the test as flaky to appease travis.

michael-berlin · 2017-10-02T06:44:20Z

go/vt/vtexplain/vtexplain.go

 		}

 		if sql != "" {
+			batchTime = sync2.NewBatcher(time.Duration(10 * 1000 * 1000))


batchTime is not used and not declared before?

I didn't look at the rest of the code, but it's not clear to me yet how this batcher will be used here :)

This probably warrants a comment, but it's resetting the batch counter declared in vtexplain_vttablet.go so that each query starts from batch 1.

I added a comment here.

michael-berlin · 2017-10-02T06:46:50Z

go/sync2/batcher.go

+}
+
+// Wait adds a new waiter to the queue and blocks until the next batch
+func (b *Batcher) Wait() int32 {


I think int32 is an implementation detail and having int in the API instead would be nicer, especially since you're adding it in a common package.

Go guarantees you that int fits at least int32, so just casting at the return should be fine.

michael-berlin · 2017-10-03T00:17:26Z

go/vt/vtexplain/vtexplain_vttablet.go

-		return nil, err
-	}
+// Begin is part of the QueryService interface.
+func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {


You could embed the ErrorQueryService instead: https://github.com/youtube/vitess/blob/master/go/vt/vttablet/queryservice/fakes/error_query_service.go#L29

This will it will return an error for each interface method unless you override it here.

michael-berlin · 2017-10-03T00:18:42Z

go/vt/vtexplain/vtexplain_vttablet.go

-		return nil, err
-	}
+// Begin is part of the QueryService interface.
+func (tablet *fakeTablet) Begin(ctx context.Context, target *querypb.Target, options *querypb.ExecuteOptions) (int64, error) {


nit: Receivers must be 1 or 2 characters only: https://github.com/golang/go/wiki/CodeReviewComments#receiver-names

I would just go here with t (or maybe ft`).

Fixed. Thanks.

sougou · 2017-10-12T00:08:16Z

LGTM
By the fact this is WIP, and limitations are known, I'll merge this in.

demmer added 7 commits October 1, 2017 10:24

add a sync2.Batcher utility for batching concurrent operations

8d9db0e

disable tablet autocommit for vtexplain

8ee046a

googlebot added the cla: yes label Oct 1, 2017

demmer requested review from rafael and sougou October 1, 2017 17:32

sougou reviewed Oct 2, 2017

View reviewed changes

michael-berlin reviewed Oct 2, 2017

View reviewed changes

michael-berlin reviewed Oct 3, 2017

View reviewed changes

demmer added 9 commits October 2, 2017 20:54

rename the vtexplain test to indicate to travis it could be flaky

bfce0bd

use time.Milliscond as suggested in PR review

fec19ea

be more permissive when it comes to types

d79aca0

rework to use int as per PR feedback and extend test timing

3c36810

rename the batcher test so travis treats it as flaky

830ea81

update vtexplain to changed Batcher API

0e87ad5

rename receivers from tablet to t as per PR feedback

c70b8b6

add batchTime comment

e89c2a8

rework the explain tablet to avoid QueryService boilerplate

2839210

demmer force-pushed the vtexplain-big-rework branch from 9698fa6 to 2839210 Compare October 9, 2017 03:32

sougou merged commit ea3b980 into vitessio:master Oct 12, 2017

Conversation

demmer commented Oct 1, 2017

Uh oh!

sougou left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

demmer Oct 9, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sougou commented Oct 12, 2017 • edited by alainjobart Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

demmer Oct 9, 2017 •

edited

Loading

sougou commented Oct 12, 2017 •

edited by alainjobart

Loading