Move QueryStats tracking into the QueryEngine and out of the TabletPlan #4093
Conversation
… so that it doesn't get LRU evicted as part of the TabletPlanCache. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
dedfdc4 to
4eea157
Compare
|
I think this is potentially dangerous because there's no upper limit on the number of unique queries. So it can cause vttablet to OOM. One possible hack-around till we get to #3667 is as follows: After building the plan, if the of number inserted rows >X, don't add it to the query stats at all. We could also create a fake |
|
@sougou That makes sense. Yeah. For our particular case it seemed safe-ish because the cardinality of query plan string + table name is very low on all our keyspaces but I definitely see that it doesn't make sense generically. Hm. I also think that #3667 is just a better solution (should also have some perf wins as well). What do you mean by a fake |
for the QueryStats map that optimizes for the fact that there are many writers to the shared QueryStats map but few readers. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
6d147b1 to
1347656
Compare
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
|
@sougou Curious what you think about this and if this sounds right to you / I'm missing anything: @demmer and I have been talking about this again recently, as it continues to cause pain for us. He pointed out that there is a baked in upper limit on the number of queries in QueryStats (keyed off of PlanType+TableName) as only queries that are against tables in the database schema will ever be registered in QueryStats. So, a user couldn't erroneously issue a large number of junk queries on junk tables and OOM the vttablet via the QueryStats data structure. I both tested this with a junk query on an undefined table name and traced the code and this seems right to me. He also suggested some perf improvements to the QueryStats data structure, which I've committed to this PR. Thoughts? |
`/debug/query_stats` Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
|
!! The endtoend test failures taught me about the existence of the My previous comment is still true, but might be moot--I've now added the sql query string to the key of the The other idea that @demmer and I were talking about was just applying a LRU to the QueryStats datastructure. As there are significantly fewer objects in QueryStats than there are in TabletPlans, a reasonable LRU should mitigate OOM-risks and still keep our metrics from over-evicting. What do you think? |
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
|
(And I’ve just noticed that some of the tests are still failing, I’ll take a look and fix that in the morning.) |
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Drive the QueryCounts / QueryErrorCounts / QueryRowCounts prom/expvar stats out of the QueryEngine tracked statistics keyed by Table/Plan name. Continue to use the existing query stats information in the TabletPlan to drive queryz / query_stats. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
demmer
left a comment
There was a problem hiding this comment.
The core implementation looks good to me! I made a couple of minor cleanup suggestions but I think other than that this is good to merge.
| return 1 | ||
| } | ||
|
|
||
| // buildAuthorized builds 'Authorized', which is the runtime part for 'Permissions'. |
There was a problem hiding this comment.
Looks like this code ended up moving as part of the re-addition of AddStats. Can you move it back to where it was to make the history cleaner?
| stats.mu.Unlock() | ||
| } | ||
|
|
||
| func (qe *QueryEngine) getQueryCountByTablePlan() map[string]int64 { |
There was a problem hiding this comment.
I don't think you need new functions for all of these since the other ones (getQueryCount, getQueryTime, etc) are only used for the metric exports, so you should use this implementation for them instead of adding new ones.
| return | ||
| } | ||
| qre.plan.AddStats(1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0) | ||
| qre.addStats(planName, 1, duration, qre.logStats.MysqlResponseTime, int64(reply.RowsAffected), 0) |
There was a problem hiding this comment.
This is somewhat stylistic, but IMO we should move the qre.logStats and tabletenv.ResultStats updating inside the new qre.addStats function to keep all the query metrics and log recording together.
There was a problem hiding this comment.
(Or don't have the function at all and just keep all the updating in-line... it's having a helper function that does some but not all of the stats work which I feel like could be cleaned up)
Get rid of unneccessary duplication of stats methods Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
demmer
left a comment
There was a problem hiding this comment.
One thing I noticed, then I think this is good to go.
| } | ||
|
|
||
| func (qre *QueryExecutor) addStats(planName string, queryCount int64, duration, mysqlTime time.Duration, rowCount, errorCount int64) { | ||
| qre.tsv.qe.AddStats(planName, qre.plan.TableName().String(), queryCount, duration, mysqlTime, rowCount, errorCount) |
There was a problem hiding this comment.
We need some handling for the case where the table name is unknown...
For consistency it should use "Join" as the "table name" in that case.
There was a problem hiding this comment.
Hmm. Where there's an unknown table identifier, it looks like qre.plan.TableName().String() is just the empty string which I think will be fine for our metrics. Would you prefer "Join" over that?
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
for errorCounts when unrolling the helper function. Also, sub "join" for unknown table names, as talked about in code review. Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
Signed-off-by: Maggie Zhou <mzhou@slack-corp.com>
|
Was the OOM concern addressed? |
|
Echoing a conversation from Slack, there isn't really an OOM concern here since the key is TableName + PlanType and we already have data structures in the tablet that are O(Number of Tables). So I think this is fine. |
sougou
left a comment
There was a problem hiding this comment.
As mentioned before, I now agree that this is not an OOM concern.
I would have personally done it differently: just use a simple (and single) Mutex for just the map, because map traversal is so fast that there will never be a contention (based on past experience).
However, this code is also good, and better than what was there before.
Description
The QueryStats metrics (including
QueryCounts) are incorrect and unreliable for our keyspaces with large amounts of bulk inserts. This is because the TabletPlan cache evicts frequently and the query stats are driven out of the TabletPlan cache. So, on every given metric scrape, the plan in question is not reliably in the cache (and therefore its stats are not there) to be scraped, leading to false gaps in metrics that Prometheus struggles to computerate()s off of.This results in graphs that look like this (out of Grafana/Prometheus), when we did not actually have those spikes in query rates on that tablet:

One solution to this problem is proposed in #3667 . If we normalized bulk insert queries, there would be fewer unique queries in the cache and they would rarely get evicted.
This PR proposes that we do something simpler for now, which is just to stash the QueryStats in the QueryEngine and not in the TabletPlan, therefore not evicting the stats when we evict tablet plans.