Skip to content

Move QueryStats tracking into the QueryEngine and out of the TabletPlan #4093

Merged
sougou merged 12 commits intovitessio:masterfrom
tinyspeck:fix-query-counts-spikiness-move-cache
Sep 1, 2018
Merged

Move QueryStats tracking into the QueryEngine and out of the TabletPlan #4093
sougou merged 12 commits intovitessio:masterfrom
tinyspeck:fix-query-counts-spikiness-move-cache

Conversation

@zmagg
Copy link
Copy Markdown
Contributor

@zmagg zmagg commented Jul 19, 2018

Description

The QueryStats metrics (including QueryCounts) are incorrect and unreliable for our keyspaces with large amounts of bulk inserts. This is because the TabletPlan cache evicts frequently and the query stats are driven out of the TabletPlan cache. So, on every given metric scrape, the plan in question is not reliably in the cache (and therefore its stats are not there) to be scraped, leading to false gaps in metrics that Prometheus struggles to compute rate()s off of.

This results in graphs that look like this (out of Grafana/Prometheus), when we did not actually have those spikes in query rates on that tablet:
screen shot 2018-07-18 at 5 42 46 pm

One solution to this problem is proposed in #3667 . If we normalized bulk insert queries, there would be fewer unique queries in the cache and they would rarely get evicted.

This PR proposes that we do something simpler for now, which is just to stash the QueryStats in the QueryEngine and not in the TabletPlan, therefore not evicting the stats when we evict tablet plans.

… so that it doesn't get

LRU evicted as part of the TabletPlanCache.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the fix-query-counts-spikiness-move-cache branch from dedfdc4 to 4eea157 Compare July 19, 2018 00:47
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@sougou
Copy link
Copy Markdown
Contributor

sougou commented Jul 19, 2018

I think this is potentially dangerous because there's no upper limit on the number of unique queries. So it can cause vttablet to OOM.

One possible hack-around till we get to #3667 is as follows: After building the plan, if the of number inserted rows >X, don't add it to the query stats at all.
The downside is that we won't know the stats for bulk inserts.

We could also create a fake BULK_INSERT entry. But it may not be worth it.

@zmagg
Copy link
Copy Markdown
Contributor Author

zmagg commented Jul 19, 2018

@sougou That makes sense. Yeah. For our particular case it seemed safe-ish because the cardinality of query plan string + table name is very low on all our keyspaces but I definitely see that it doesn't make sense generically.

Hm. I also think that #3667 is just a better solution (should also have some perf wins as well).

What do you mean by a fake BULK_INSERT? A fake general BULK_INSERT type in the query plan cache?

for the QueryStats map that optimizes for the fact that there are many
writers to the shared QueryStats map but few readers.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg force-pushed the fix-query-counts-spikiness-move-cache branch from 6d147b1 to 1347656 Compare August 14, 2018 00:54
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg
Copy link
Copy Markdown
Contributor Author

zmagg commented Aug 14, 2018

@sougou Curious what you think about this and if this sounds right to you / I'm missing anything:

@demmer and I have been talking about this again recently, as it continues to cause pain for us.

He pointed out that there is a baked in upper limit on the number of queries in QueryStats (keyed off of PlanType+TableName) as only queries that are against tables in the database schema will ever be registered in QueryStats. So, a user couldn't erroneously issue a large number of junk queries on junk tables and OOM the vttablet via the QueryStats data structure.

I both tested this with a junk query on an undefined table name and traced the code and this seems right to me.

He also suggested some perf improvements to the QueryStats data structure, which I've committed to this PR.

Thoughts?

`/debug/query_stats`

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg
Copy link
Copy Markdown
Contributor Author

zmagg commented Aug 14, 2018

!! The endtoend test failures taught me about the existence of the debug/query_stats page which I had never seen before.

My previous comment is still true, but might be moot--I've now added the sql query string to the key of the QueryStats datastructure in order to power the debug/query_stats page.

The other idea that @demmer and I were talking about was just applying a LRU to the QueryStats datastructure. As there are significantly fewer objects in QueryStats than there are in TabletPlans, a reasonable LRU should mitigate OOM-risks and still keep our metrics from over-evicting. What do you think?

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg
Copy link
Copy Markdown
Contributor Author

zmagg commented Aug 14, 2018

(And I’ve just noticed that some of the tests are still failing, I’ll take a look and fix that in the morning.)

zmagg added 2 commits August 14, 2018 16:06
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Drive the QueryCounts / QueryErrorCounts / QueryRowCounts prom/expvar
stats out of the QueryEngine tracked statistics keyed by Table/Plan
name.

Continue to use the existing query stats information in the TabletPlan
to drive queryz / query_stats.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
@zmagg zmagg requested a review from sougou August 15, 2018 18:18
Copy link
Copy Markdown
Member

@demmer demmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core implementation looks good to me! I made a couple of minor cleanup suggestions but I think other than that this is good to merge.

return 1
}

// buildAuthorized builds 'Authorized', which is the runtime part for 'Permissions'.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this code ended up moving as part of the re-addition of AddStats. Can you move it back to where it was to make the history cleaner?

stats.mu.Unlock()
}

func (qe *QueryEngine) getQueryCountByTablePlan() map[string]int64 {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need new functions for all of these since the other ones (getQueryCount, getQueryTime, etc) are only used for the metric exports, so you should use this implementation for them instead of adding new ones.

return
}
qre.plan.AddStats(1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0)
qre.addStats(planName, 1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat stylistic, but IMO we should move the qre.logStats and tabletenv.ResultStats updating inside the new qre.addStats function to keep all the query metrics and log recording together.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Or don't have the function at all and just keep all the updating in-line... it's having a helper function that does some but not all of the stats work which I feel like could be cleaned up)

Get rid of unneccessary duplication of stats methods

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Copy link
Copy Markdown
Member

@demmer demmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I noticed, then I think this is good to go.

}

func (qre *QueryExecutor) addStats(planName string, queryCount int64, duration, mysqlTime time.Duration, rowCount, errorCount int64) {
qre.tsv.qe.AddStats(planName, qre.plan.TableName().String(), queryCount, duration, mysqlTime, rowCount, errorCount)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need some handling for the case where the table name is unknown...

For consistency it should use "Join" as the "table name" in that case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Where there's an unknown table identifier, it looks like qre.plan.TableName().String() is just the empty string which I think will be fine for our metrics. Would you prefer "Join" over that?

zmagg added 3 commits August 30, 2018 14:31
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
for errorCounts when unrolling the helper function.

Also, sub "join" for unknown table names, as talked about in code
review.

Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Copy link
Copy Markdown
Member

@demmer demmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sougou
Copy link
Copy Markdown
Contributor

sougou commented Aug 31, 2018

Was the OOM concern addressed?

@demmer
Copy link
Copy Markdown
Member

demmer commented Aug 31, 2018

Echoing a conversation from Slack, there isn't really an OOM concern here since the key is TableName + PlanType and we already have data structures in the tablet that are O(Number of Tables).

So I think this is fine.

Copy link
Copy Markdown
Contributor

@sougou sougou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned before, I now agree that this is not an OOM concern.

I would have personally done it differently: just use a simple (and single) Mutex for just the map, because map traversal is so fast that there will never be a contention (based on past experience).

However, this code is also good, and better than what was there before.

@sougou sougou merged commit faa1a87 into vitessio:master Sep 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants