Move QueryStats tracking into the QueryEngine and out of the TabletPlan by zmagg · Pull Request #4093 · vitessio/vitess

zmagg · 2018-07-19T00:46:05Z

Description

The QueryStats metrics (including QueryCounts) are incorrect and unreliable for our keyspaces with large amounts of bulk inserts. This is because the TabletPlan cache evicts frequently and the query stats are driven out of the TabletPlan cache. So on any given metric scrape, the plan in question may not be in the cache, and its stats are missing along with it. That leaves false gaps in the metrics, which Prometheus struggles to compute rate()s from.

This results in Grafana/Prometheus graphs showing spikes in query rates that we did not actually have on that tablet.
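One way such false spikes can arise is sketched below. It assumes the exported counter is a sum over whatever plans are currently cached, so an eviction makes the sum drop. The `increase` function is a simplified model of Prometheus counter-reset handling (any decrease is treated as a reset, and the post-reset value counts as new increase); it ignores extrapolation and is not Prometheus code. The sample values are made up for illustration.

```go
package main

import "fmt"

// increase is a simplified model of how a counter increase is computed
// with reset handling: a drop between samples is assumed to be a reset,
// so the whole post-drop value is counted as increase.
func increase(samples []float64) float64 {
	var total float64
	for i := 1; i < len(samples); i++ {
		if samples[i] >= samples[i-1] {
			total += samples[i] - samples[i-1]
		} else {
			// Treated as a counter reset.
			total += samples[i]
		}
	}
	return total
}

func main() {
	// Hypothetical scrapes of one counter: 10 real queries per interval.
	steady := []float64{1000, 1010, 1020, 1030}
	// Same traffic, but an eviction removes 400 counts before the third scrape.
	evicted := []float64{1000, 1010, 620, 630}

	fmt.Println(increase(steady))
	fmt.Println(increase(evicted))
}
```

In this model the real traffic is identical in both series, yet the second one reports a far larger increase because the drop is read as a reset.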

One solution to this problem is proposed in #3667 . If we normalized bulk insert queries, there would be fewer unique queries in the cache and they would rarely get evicted.

This PR proposes that we do something simpler for now, which is just to stash the QueryStats in the QueryEngine and not in the TabletPlan, therefore not evicting the stats when we evict tablet plans.
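A minimal sketch of the idea, not the code in this PR: the counters live in a map owned by the engine, keyed by table and plan type, so evicting a plan from the cache leaves them untouched. `planCache`, `engineStats`, the `Table.Plan` key format and the crude eviction policy are all hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// planCache stands in for the TabletPlan cache. It evicts everything
// once it is full, which is cruder than an LRU but enough to show churn.
type planCache struct {
	capacity int
	plans    map[string]struct{}
}

func (c *planCache) put(sql string) {
	if len(c.plans) >= c.capacity {
		c.plans = map[string]struct{}{}
	}
	c.plans[sql] = struct{}{}
}

// engineStats is owned by the engine, not by any cached plan.
type engineStats struct {
	mu     sync.Mutex
	counts map[string]int64
}

func (s *engineStats) add(table, plan string, n int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[table+"."+plan] += n
}

func (s *engineStats) get(table, plan string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[table+"."+plan]
}

// run issues numQueries distinct bulk-insert-like statements and reports
// how many plans are still cached and how many queries were counted.
func run(numQueries int) (cached int, counted int64) {
	cache := &planCache{capacity: 2, plans: map[string]struct{}{}}
	stats := &engineStats{counts: map[string]int64{}}
	for i := 0; i < numQueries; i++ {
		cache.put(fmt.Sprintf("insert into t values (%d)", i))
		stats.add("t", "INSERT", 1)
	}
	return len(cache.plans), stats.get("t", "INSERT")
}

func main() {
	cached, counted := run(5)
	fmt.Println("plans cached:", cached, "queries counted:", counted)
}
```

The cache keeps churning, but the count for the table and plan type only ever grows, which is what a counter scraped by Prometheus needs.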

sougou · 2018-07-19T02:30:01Z

I think this is potentially dangerous because there's no upper limit on the number of unique queries. So it can cause vttablet to OOM.

One possible hack-around till we get to #3667 is as follows: After building the plan, if the number of inserted rows > X, don't add it to the query stats at all.
The downside is that we won't know the stats for bulk inserts.

We could also create a fake BULK_INSERT entry. But it may not be worth it.
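Both options from this comment can be sketched as a single decision about which stats key, if any, a query is recorded under. This is only an illustration: `maxRowsForStats` (the "X" above), `statsKey` and the `useBulkBucket` switch are hypothetical names, and the plan type strings are placeholders.

```go
package main

import "fmt"

// maxRowsForStats is the hypothetical threshold "X" from the discussion.
const maxRowsForStats = 100

// statsKey returns the key to record stats under and whether to record
// at all. Large inserts are either skipped or folded into one shared
// BULK_INSERT bucket, depending on useBulkBucket.
func statsKey(planType string, insertedRows int, useBulkBucket bool) (string, bool) {
	if planType == "INSERT" && insertedRows > maxRowsForStats {
		if useBulkBucket {
			return "BULK_INSERT", true
		}
		return "", false
	}
	return planType, true
}

func main() {
	fmt.Println(statsKey("INSERT", 5, false))
	fmt.Println(statsKey("INSERT", 5000, false))
	fmt.Println(statsKey("INSERT", 5000, true))
}
```

Skipping loses all visibility into bulk inserts, as noted above; the shared bucket keeps an aggregate count at the cost of one extra entry.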

zmagg · 2018-07-19T21:54:03Z

@sougou That makes sense. Yeah. For our particular case it seemed safe-ish because the cardinality of query plan string + table name is very low on all our keyspaces but I definitely see that it doesn't make sense generically.

Hm. I also think that #3667 is just a better solution (should have some perf wins as well).

What do you mean by a fake BULK_INSERT? A fake general BULK_INSERT type in the query plan cache?

zmagg · 2018-08-14T01:26:34Z

@sougou Curious what you think about this and if this sounds right to you / I'm missing anything:

@demmer and I have been talking about this again recently, as it continues to cause pain for us.

He pointed out that there is a baked-in upper limit on the number of entries in QueryStats (keyed off of PlanType+TableName), as only queries against tables in the database schema will ever be registered in QueryStats. So a user couldn't erroneously issue a large number of junk queries on junk tables and OOM the vttablet via the QueryStats data structure.

I both tested this with a junk query on an undefined table name and traced the code and this seems right to me.

He also suggested some perf improvements to the QueryStats data structure, which I've committed to this PR.

Thoughts?

zmagg · 2018-08-14T02:38:01Z

!! The endtoend test failures taught me about the existence of the debug/query_stats page which I had never seen before.

My previous comment is still true, but might be moot--I've now added the sql query string to the key of the QueryStats data structure in order to power the debug/query_stats page.

The other idea that @demmer and I were talking about was just applying an LRU to the QueryStats data structure. As there are significantly fewer objects in QueryStats than there are in TabletPlans, a reasonable LRU should mitigate OOM risks and still keep our metrics from over-evicting. What do you think?
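The LRU idea could be sketched as below, using `container/list` from the Go standard library. This is a hypothetical illustration (`lruStats`, `entry` and the capacity are made up), it assumes a capacity of at least 1, and it leaves out locking for brevity.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one stats record stored in the LRU list.
type entry struct {
	key   string
	count int64
}

// lruStats bounds the number of tracked keys. The front of the list is
// the most recently updated entry. Not safe for concurrent use.
type lruStats struct {
	capacity int
	order    *list.List
	items    map[string]*list.Element
}

func newLRUStats(capacity int) *lruStats {
	return &lruStats{
		capacity: capacity,
		order:    list.New(),
		items:    map[string]*list.Element{},
	}
}

func (l *lruStats) add(key string, n int64) {
	if el, ok := l.items[key]; ok {
		el.Value.(*entry).count += n
		l.order.MoveToFront(el)
		return
	}
	if l.order.Len() >= l.capacity {
		// Evict the least recently updated entry.
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(*entry).key)
	}
	l.items[key] = l.order.PushFront(&entry{key: key, count: n})
}

func (l *lruStats) get(key string) (int64, bool) {
	el, ok := l.items[key]
	if !ok {
		return 0, false
	}
	return el.Value.(*entry).count, true
}

func main() {
	s := newLRUStats(2)
	s.add("a", 1)
	s.add("b", 1)
	s.add("a", 1) // "a" becomes most recent
	s.add("c", 1) // evicts "b"
	fmt.Println(s.get("a"))
	fmt.Println(s.get("b"))
}
```

An evicted key still loses its counts, so this only helps if the capacity comfortably exceeds the number of distinct keys seen in practice.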

zmagg · 2018-08-14T04:37:14Z

(And I’ve just noticed that some of the tests are still failing, I’ll take a look and fix that in the morning.)

Drive the QueryCounts / QueryErrorCounts / QueryRowCounts prom/expvar stats out of the QueryEngine tracked statistics keyed by Table/Plan name. Continue to use the existing query stats information in the TabletPlan to drive queryz / query_stats. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

demmer

The core implementation looks good to me! I made a couple of minor cleanup suggestions but I think other than that this is good to merge.

demmer · 2018-08-23T16:50:41Z

go/vt/vttablet/tabletserver/query_engine.go

 	return 1
 }

+// buildAuthorized builds 'Authorized', which is the runtime part for 'Permissions'.


Looks like this code ended up moving as part of the re-addition of AddStats. Can you move it back to where it was to make the history cleaner?

demmer · 2018-08-23T16:56:02Z

go/vt/vttablet/tabletserver/query_engine.go

+	stats.mu.Unlock()
+}
+
+func (qe *QueryEngine) getQueryCountByTablePlan() map[string]int64 {


I don't think you need new functions for all of these since the other ones (getQueryCount, getQueryTime, etc) are only used for the metric exports, so you should use this implementation for them instead of adding new ones.

demmer · 2018-08-23T17:00:54Z

go/vt/vttablet/tabletserver/query_executor.go

 			return
 		}
-		qre.plan.AddStats(1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0)
+		qre.addStats(planName, 1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0)


This is somewhat stylistic, but IMO we should move the qre.logStats and tabletenv.ResultStats updating inside the new qre.addStats function to keep all the query metrics and log recording together.

(Or don't have the function at all and just keep all the updating in-line... it's having a helper function that does some but not all of the stats work which I feel like could be cleaned up)

demmer

One thing I noticed, then I think this is good to go.

demmer · 2018-08-29T23:38:26Z

go/vt/vttablet/tabletserver/query_executor.go

 }

+func (qre *QueryExecutor) addStats(planName string, queryCount int64, duration, mysqlTime time.Duration, rowCount, errorCount int64) {
+	qre.tsv.qe.AddStats(planName, qre.plan.TableName().String(), queryCount, duration, mysqlTime, rowCount, errorCount)


We need some handling for the case where the table name is unknown...

For consistency it should use "Join" as the "table name" in that case.

zmagg · 2018-08-30

Hmm. Where there's an unknown table identifier, it looks like qre.plan.TableName().String() is just the empty string, which I think will be fine for our metrics. Would you prefer "Join" over that?
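The fallback being discussed amounts to a small substitution before the stats key is built. A sketch, assuming the table name arrives as a plain string that is empty when unknown; `statsTableName` is a hypothetical helper, and the exact placeholder spelling is whatever the reviewers settle on.

```go
package main

import "fmt"

// statsTableName substitutes a placeholder when the plan has no single
// known table, so the metric label is never empty.
func statsTableName(tableName string) string {
	if tableName == "" {
		return "Join"
	}
	return tableName
}

func main() {
	fmt.Println(statsTableName("users"))
	fmt.Println(statsTableName(""))
}
```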

demmer

LGTM

sougou · 2018-08-31T17:48:47Z

Was the OOM concern addressed?

demmer · 2018-08-31T19:53:40Z

Echoing a conversation from Slack, there isn't really an OOM concern here since the key is TableName + PlanType and we already have data structures in the tablet that are O(Number of Tables).

So I think this is fine.

sougou

As mentioned before, I now agree that this is not an OOM concern.

I would have personally done it differently: just use a simple (and single) Mutex for just the map, because map traversal is so fast that there will never be contention (based on past experience).

However, this code is also good, and better than what was there before.

Keep track of QueryStats in the QueryEngine instead of the TabletPlan…

4eea157

… so that it doesn't get LRU evicted as part of the TabletPlanCache. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

zmagg force-pushed the fix-query-counts-spikiness-move-cache branch from dedfdc4 to 4eea157 on July 19, 2018 00:47

golint

ef35d38

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

Respond to @demmer's IRL feedback to use a different locking strategy

1347656

for the QueryStats map that optimizes for the fact that there are many writers to the shared QueryStats map but few readers. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

zmagg force-pushed the fix-query-counts-spikiness-move-cache branch from 6d147b1 to 1347656 on August 14, 2018 00:54

Use the per-stat mutex on getters as well.

202066d

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

Key QueryStats including the sql query string, to power

ac0f65f

`/debug/query_stats` Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

Fix data race by acquiring read lock on the QueryStats datastructure.

8b57180

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

zmagg added 2 commits August 14, 2018 16:06

Get the query string correctly.

5f3aaae

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

zmagg requested a review from sougou August 15, 2018 18:18

demmer requested changes Aug 23, 2018

Put back the buildAuthorized method where it used to be

7d7716d

Get rid of unnecessary duplication of stats methods Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

demmer reviewed Aug 29, 2018

zmagg added 3 commits August 30, 2018 14:31

Get rid of helper function.

fc5f4de

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

Fix error introduced in previous commit where I missed the right params

8a5f811

for errorCounts when unrolling the helper function. Also, sub "join" for unknown table names, as talked about in code review. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

s/table/tableName

a000403

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>

demmer approved these changes Aug 31, 2018

sougou approved these changes Sep 1, 2018

sougou merged commit faa1a87 into vitessio:master Sep 1, 2018

zmagg mentioned this pull request Sep 7, 2018

Slack vitess upstream sync 2018 09 06.r0 tinyspeck/vitess#109

Merged
