RFC for inverse distribution functions (PERCENTILE, MEDIAN) #55
Conversation
Why put more queries through query buffers when they have known performance issues?
If the selection is an aggregate then the […]
In order to obtain the percentile, we need to count the rows in a selection sorted by an arbitrary field. The query buffers facility was an easy choice. I didn't quite like this myself, though (https://basho.slack.com/archives/eng_time-series/p1485792968005660).

An alternative way would be to create 100 bins and, for each column's value in the records collected from vnodes, increment the appropriate bin's counter. The records themselves would not be stored, only the count of rows with values fitting in each bin. Memory usage would thus stay constant.

One drawback of the fixed-bins method is that we need to decide upfront on the number of bins. Given a sufficiently large population of randomly distributed floats, we won't be able to distinguish between percentile(x, 0.33333) and percentile(x, 0.33334). But that's something we can live with.

Thanks for your suggestion @andytill.
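A minimal sketch of the fixed-bins idea (module and function names are hypothetical, and it assumes the min/max of the value range are known upfront, which is exactly the assumption that proves problematic later in this thread):

```erlang
%% Hypothetical sketch of the fixed-bins method: 100 counters,
%% no per-row storage. Assumes Min < Max are known in advance.
-module(fixed_bins).
-export([new/2, add/2, percentile/2]).

new(Min, Max) ->
    {Min, Max, array:new(100, {default, 0})}.

%% Map a value into one of the 100 bins and bump that bin's counter;
%% the value itself is discarded, so memory stays constant.
add(Value, {Min, Max, Bins}) ->
    Idx = min(99, trunc((Value - Min) / (Max - Min) * 100)),
    {Min, Max, array:set(Idx, array:get(Idx, Bins) + 1, Bins)}.

%% Walk the bins until the cumulative count reaches the target row,
%% then return the lower bound of that bin (hence the bounded error
%% discussed above).
percentile(P, {Min, Max, Bins}) ->
    Counts = array:to_list(Bins),
    Target = P * lists:sum(Counts),
    BinWidth = (Max - Min) / 100,
    find_bin(Target, 0, 0, Counts, Min, BinWidth).

find_bin(Target, Idx, Acc, [C | Rest], Min, W) when Acc + C < Target ->
    find_bin(Target, Idx + 1, Acc + C, Rest, Min, W);
find_bin(_Target, Idx, _Acc, _Counts, Min, W) ->
    Min + Idx * W.
```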
There was no intention to make inverse distribution functions work with window aggregation functions or grouping.

Is this the new proposal for percentile now?

Sort of. I am in the process of collecting feedback.
Just keeping a sorted bag in some kind of data structure should be sufficient.

Exactly. It would be a new […]. Let's hear from @ph07 on this matter, specifically on the loss of precision the fixed-bins method has.
OK, I am not suggesting we need a fixed number of bins, like the bins used in dict. We can still keep accuracy by keeping a sorted bag.
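For illustration, a sorted bag can be kept as an ordered value-to-count map, which preserves duplicates and ordering without storing each row separately (a hypothetical sketch, not code from the PRs):

```erlang
%% Hypothetical sketch of a "sorted bag": an ordered Value -> Count
%% map, so duplicates bump a counter rather than occupying a row each.
-module(sorted_bag).
-export([new/0, add/2, nth/2]).

new() ->
    gb_trees:empty().

add(Value, Bag) ->
    case gb_trees:lookup(Value, Bag) of
        {value, N} -> gb_trees:update(Value, N + 1, Bag);
        none       -> gb_trees:insert(Value, 1, Bag)
    end.

%% K-th smallest element (1-based), found by walking keys in order;
%% this is the row-picking step an inverse distribution function needs.
nth(K, Bag) ->
    nth_(K, gb_trees:iterator(Bag)).

nth_(K, Iter0) ->
    case gb_trees:next(Iter0) of
        {Value, Count, _Iter} when K =< Count -> Value;
        {_Value, Count, Iter}                 -> nth_(K - Count, Iter);
        none                                  -> error(out_of_range)
    end.
```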
Two more Qs.

State terms are effectively histograms, which could easily be combined, either incrementally or all at once at the final stage. Thanks for another suggestion, Andy!
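To illustrate the "easily combined" point: if each vnode's state term is a value-to-count histogram, combining two of them is just summing counts key by key (a sketch, assuming orddict-based state terms; the actual representation in the PRs may differ):

```erlang
%% Hypothetical sketch: combining two per-vnode histogram state terms.
%% Each state is an orddict of Value -> Count; merging sums the counts.
merge_histograms(H1, H2) ->
    orddict:merge(fun(_Value, C1, C2) -> C1 + C2 end, H1, H2).

%% All-at-once variant over a list of per-chunk histograms.
merge_all(Histograms) ->
    lists:foldl(fun merge_histograms/2, orddict:new(), Histograms).
```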
@andytill There's a problem with PERCENTILE as an aggregate function. It just dawned on me (I could claim I had a suspicion earlier on that it wasn't going to work, but that's 20/20 hindsight of course). In order to put a given cell value into one of the 100 percentile slots in the function's running state, we need to know the max and min of the range -- and that we have no way of knowing before we get the full picture, that is, before we receive all the chunks down to the last. So, with this realisation, I have no workable ideas on how to implement PERCENTILE differently from how I did it in the PRs posted.
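To make the problem concrete with made-up numbers: if the first chunk delivers values in the range [0, 10], the 100 slots would have to span that range; if a later chunk then delivers a value of 1000, every previously binned value belongs in what has now become the first slot, and the running state cannot be re-binned because the original values were already discarded.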
InfluxDB claims to use a sorted set in its docs, which would be even easier, but that sounds wrong to me because it skews results when there are duplicates. https://docs.influxdata.com/influxdb/v0.8/api/aggregate_functions/#percentile
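A small made-up illustration of that skew: over the values 1, 1, 1, 9, a sorted bag yields MEDIAN = 1, whereas a sorted set collapses the duplicates to {1, 9} and so loses the fact that three quarters of the observations equal 1.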
@andytill For me the choice is this: to obtain an inverse distribution function, which requires counting rows and picking one by its position in a sorted selection, I need to either (a) collect all rows in one place or (b) somehow avoid doing that. In my previous comment I am arguing that (b) is not feasible.
I have never suggested option (b) is feasible; I suggested, three days ago, that it was unnecessary.

My original comment, which I am still concerned about, is this.
Because they provide facilities for collecting all rows in one place. Selections fitting into memory (per a configurable limit) will be held in memory: no performance issues there. Selections exceeding the limit will still be usable for the purposes of PERCENTILE, albeit with a performance hit. So the choice is: accept the performance hit and still serve the query, or run into overload protection issues when the coordinator receives loads of bags.
I'm still not convinced that putting all the data onto disk when it grows too large is a good idea, because of the work that must be done before the query can continue and the pressure it might put on other requests; and, as we have discovered, storage can be much, much slower than computation on AWS. Are there any overload tests that prove this strategy is effective, and is any other DB doing this?
But what is the alternative? Refuse to serve the query? Network-mounted, lowest-tier AWS storage is indeed suboptimal for query buffers, but certainly there exist other, faster options.

In `ts_simple_query_buffers:query_orderby_inmem2ldb` (https://github.com/basho/riak_test/blob/develop/tests/ts_simple_query_buffers_SUITE.erl#L271-L278) I am testing the code paths for all-in-memory, all-leveldb-backed, as well as mixed operation (that is, some data are first accumulated in memory, then dumped to disk). Similar tests exist for PERCENTILE (https://github.com/basho/riak_test/pull/1270/files#diff-ff1938dc629fdc5a892006e05f96d733R93, ..R96 and ..R99).

Other queries not involving query buffers will continue to be served by other workers (up to […]
That may be investigated in its own right, separately. The fundamental problem remains: what to do with queries with […]
Yes, this is called load shedding. We already use it when the user issues queries and there are not enough coordinators available. This technique is described here by Fred Hebert: http://ferd.ca/queues-don-t-fix-overload.html. He is describing a queue as a buffer, whereas we have an in-process buffer. This paragraph stood out for me: […]
As discussed on the Mumble call, riak_test cannot verify that storing data onto disk when there is too much of it can save Riak TS from an overload/outage scenario. The pressure on other requests I was referring to was mainly about disk and CPU, which are shared. I hadn't thought about exhaustion of coordinators, but you're right, that could lead to queries being refused as well. I know that InfluxDB and Cassandra do not use temporary tables in this way, probably for these reasons.
I can't help agreeing that whatever we do, disk-based query buffers/temp tables will remain a source of increased latency, stalls, dropped queries and all that headache. At the same time, refusing to serve queries which could be served, given increased timeouts and faster storage for temp tables, is probably not a satisfactory answer either.

As a possible resolution, we can implement a configuration switch to allow or disable the fallback to leveldb in the query buffer manager. Users who only occasionally issue requests that would require accumulation of data in temp tables in excess of a certain limit -- and who can tolerate a 10-sec latency -- will have the ability to use that. Conversely, those who find it easier to retry a query with smaller WHERE ranges than to stall the query buffer manager will be happy keeping the switch turned off.

I would also like to note that the discussion has veered into the larger scope of temp tables in general, and away from the core question of the proposed implementation of PERCENTILE involving riak_kv_qry_compiler.
> ## Abstract
>
> Riak TS needs to support *inverse distribution functions*, at least including `PERCENTILE` and `MEDIAN`. This RFC details how this can be implemented using *query buffers*.
Can we get an explanation of the naming, please? Other SQL implementations have PERCENTILE_DISC and PERCENTILE_CONT; which of these does our implementation correspond to?

The documentation team will need these details to write up the documentation correctly.
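For reference, here is the standard distinction on the four values 10, 20, 30, 40 in column `x` (generic SQL syntax from other implementations, not Riak TS's):

```sql
-- PERCENTILE_DISC returns an actual row value: the smallest x whose
-- cumulative distribution is >= 0.3, which is 20 here (10 sits at 0.25).
SELECT PERCENTILE_DISC(0.3) WITHIN GROUP (ORDER BY x) FROM t;

-- PERCENTILE_CONT interpolates between neighbouring rows:
-- position 0.3 * (4 - 1) + 1 = 1.9, so 10 + 0.9 * (20 - 10) = 19.
SELECT PERCENTILE_CONT(0.3) WITHIN GROUP (ORDER BY x) FROM t;
```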
> - `LIMIT` is assigned a same-length list of `1`s.
> 4. Finally, multiple columns in `'SELECT'` will be collapsed into a single column. Thus, for functions in `riak_kv_qry_worker`, `SELECT PERCENTILE(x, 0.33), MEDIAN(x)` will become `SELECT x`.
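A sketch of the rewrite this step describes, with the computed row positions shown as placeholders (illustrative only, not the literal compiler output):

```sql
-- Original query:
SELECT PERCENTILE(x, 0.33), MEDIAN(x) FROM t WHERE ...;

-- Conceptual query-buffer form: a single sorted column, each function
-- picking one row by its position -- hence LIMIT becoming a
-- same-length list of 1s.
SELECT x FROM t WHERE ... ORDER BY x LIMIT 1 OFFSET <row at 0.33>;
SELECT x FROM t WHERE ... ORDER BY x LIMIT 1 OFFSET <row at 0.50>;
```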
I would like a diagram of how the query is being rewritten and rolled up and down, as per the riak_ql documentation. I mean this kind of diagram:
```
                          <---Network--->
+ FROM <-------------------------+        + FROM mytable on vnode X
|                                |        |
| SELECT SUM(STemp)/SUM(NoTemp)  |        | SELECT SUM(temp) AS STemp, COUNT(temp) AS NoTemp
|                                | Chunk1 |
| GROUP BY []                    +--------+ GROUP BY []
|                                |        |
| ORDER BY []                    |        | ORDER BY []
|                                |        |
+ WHERE []                       |        + WHERE + start_key = {myfamily, myseries, 1233}
                                 |                | end_key   = {myfamily, myseries, 4000}
                                 |                + temp > 18
                                 |
                                 |        + FROM mytable on vnode Y
                                 |        |
                                 |        | SELECT SUM(temp) AS STemp, COUNT(temp) AS NoTemp
                                 | Chunk2 |
                                 +--------+ GROUP BY []
                                          |
                                          | ORDER BY []
                                          |
                                          + WHERE + start_key = {myfamily, myseries, 4001}
                                                  | end_key   = {myfamily, myseries, 6789}
                                                  + temp > 18
```
Diagram added in 054590f.
> The presence of `ORDER BY` will direct `riak_kv_qry_worker` to use query buffers for the query with inverse distribution functions.
> By way of illustration:
This diagram is incorrect: the `ORDER BY`, `LIMIT` and `OFFSET` clauses are not evaluated on the vnode.
+1
@gordonguthrie per our discussions and hand-holding, the diagram was updated in bf499b6.
RTS-1736, RTS-1740 (documentation), RTS-545, RTS-1553 (code)
The proposed RFC describes the approach and implementation details of the two inverse distribution functions, PERCENTILE and MEDIAN, as submitted in basho/riak_ql#167 and basho/riak_kv#1624.
MODE to follow separately.