Conversation

@hmmr
Contributor

@hmmr hmmr commented Feb 7, 2017

RTS-1736, RTS-1740 (documentation), RTS-545, RTS-1553 (code)

The proposed RFC describes the approach and implementation details of the two inverse distribution functions, PERCENTILE and MEDIAN, as submitted in basho/riak_ql#167 and basho/riak_kv#1624.

MODE to follow separately.

@andytill
Contributor

andytill commented Feb 7, 2017

Why put more queries through query buffers when they have known performance issues?

@andytill
Contributor

andytill commented Feb 7, 2017

If the selection is an aggregate, then the ORDER BY clause could be removed, because it does not matter in which order the results are accumulated.

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Why put more queries through query buffers when they have known performance issues?

In order to obtain the percentile, we need to count the rows in a selection sorted by an arbitrary field. The query buffers facility was an easy choice. I didn't quite like this myself, though (https://basho.slack.com/archives/eng_time-series/p1485792968005660).

An alternative would be to create 100 bins and, for each value of the column in the records collected from vnodes, increment the appropriate bin's counter. The records themselves will not be stored, only the count of rows with values fitting in each bin. Memory usage will stay at 100 * external_size({Bin, Count, Value}).

One drawback of the fixed-bins method is that we need to decide upfront on the number of bins. Given a sufficiently large population of randomly distributed floats, we won't be able to distinguish between percentile(x, 0.33333) and percentile(x, 0.33334). But that's something we can live with.
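A minimal Python sketch of the fixed-bins idea (illustrative only, not the proposed Erlang implementation; it assumes the value range is known up front and keeps nothing but the per-bin counts):

```python
def make_bins(n_bins=100):
    # one counter per bin; no rows are retained
    return [0] * n_bins

def bin_index(value, lo, hi, n_bins):
    # map a value in [lo, hi] onto one of n_bins equal-width bins
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

def add_value(bins, value, lo, hi):
    bins[bin_index(value, lo, hi, len(bins))] += 1

def percentile(bins, lo, hi, p):
    # walk the cumulative counts up to the p-th fraction and return the
    # midpoint of the bin reached -- hence the bounded precision
    total = sum(bins)
    seen = 0
    width = (hi - lo) / len(bins)
    for i, count in enumerate(bins):
        seen += count
        if seen >= p * total:
            return lo + (i + 0.5) * width
    return hi
```

With 100 bins, percentile(x, 0.33333) and percentile(x, 0.33334) resolve to the same bin, which is exactly the loss of precision noted above.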

Thanks for your suggestion @andytill.

@hmmr hmmr changed the title RFC for inverse distribution functions (PERCENTILE, MEDIAN) [WIP] RFC for inverse distribution functions (PERCENTILE, MEDIAN) Feb 7, 2017
@hmmr
Contributor Author

hmmr commented Feb 7, 2017

If the selection is an aggregate

There was no intention to make inverse distribution functions work with window aggregation functions or grouping.

@andytill
Contributor

andytill commented Feb 7, 2017

In order to obtain the percentile, we need to count the rows in a selection sorted by an arbitrary field. The query buffers facility was an easy choice. I didn't quite like this myself, though (https://basho.slack.com/archives/eng_time-series/p1485792968005660).

An alternative would be to create 100 bins and, for each value of the column in the records collected from vnodes, increment the appropriate bin's counter. The records themselves will not be stored, only the count of rows with values fitting in each bin. Memory usage will stay at 100 * external_size({Bin, Count, Value}).

One drawback of the fixed-bins method is that we need to decide upfront on the number of bins. Given a sufficiently large population of randomly distributed floats, we won't be able to distinguish between percentile(x, 0.33333) and percentile(x, 0.33334). But that's something we can live with.

Is this the new proposal for percentile now?

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Is this the new proposal for percentile now?

Sort of. I am in the process of collecting feedback.

@andytill
Contributor

andytill commented Feb 7, 2017

Just keeping a sorted bag in some kind of data structure should be sufficient.

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Just keeping a sorted bag in some kind of data structure should be sufficient.

Exactly. It would be a new calc_type = invdist, with its own run_select_on_invdist_chunk. No query buffer will be harmed.

Let's hear from @ph07 on this matter, specifically on the loss of precision inherent in the fixed-bins method.

@andytill
Contributor

andytill commented Feb 7, 2017

ok, I am not suggesting we need a fixed number of bins, like bins used in dict.

We can still keep accuracy by keeping a sorted bag.
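An exact version along these lines might be sketched as follows (illustrative Python, assuming the discrete, nearest-rank definition of percentile; duplicates are preserved, so repeated values carry their full weight):

```python
import bisect
import math

class SortedBag:
    """A sorted multiset of values; duplicates are kept."""

    def __init__(self):
        self._values = []

    def add(self, v):
        bisect.insort(self._values, v)  # insertion keeps the bag sorted

    def percentile(self, p):
        # nearest-rank: the value at 1-based position ceil(p * n),
        # clamped to the valid index range
        n = len(self._values)
        idx = min(n - 1, max(0, math.ceil(p * n) - 1))
        return self._values[idx]
```

Unlike the fixed-bins sketch, this keeps every row in memory, which is precisely the trade-off under discussion.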

@andytill
Contributor

andytill commented Feb 7, 2017

Two more Qs.

  1. Why not use the aggregates we already have, so it can be used in GROUP BY?
  2. Does this support multiple percentiles, e.g. SELECT PERCENTILE(latency, 98), PERCENTILE(latency, 99.9)?

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

  1. Yes, in the (quickly becoming obsolete) qbuf-based implementation I do support multiple calls to PERCENTILE, but only as long as they all refer to the same column.

  2. It's getting more interesting by the minute. Going with aggregates, we will be receiving 100-element state terms* from each of the vnodes covered, for every distinct-column PERCENTILE entry in the query. With a separate, dedicated calc_type, we will be receiving all chunks raw, then sorting them into bins on the coordinator and dropping them. To a degree it depends on the data, but I believe aggregates can be less resource-consuming.

  * OK, the 100-bin dict can be reduced to contain only bins with a non-zero count.

State terms are effectively histograms, which could be easily combined, either incrementally or all at once at the final stage.
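Combining such states could look like this (an illustrative Python sketch; a sparse histogram is represented as a map from bin index to count, matching the non-zero-bins note above):

```python
from collections import Counter

def merge_histograms(*states):
    # sum counts bin-by-bin; bins absent from a state count as zero
    merged = Counter()
    for state in states:
        merged.update(state)
    return merged
```

The merge is associative and commutative, so per-vnode states can be folded in incrementally as chunks arrive, or all at once at the final stage.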

Thanks for another suggestion, Andy!

@hmmr
Contributor Author

hmmr commented Feb 9, 2017

@andytill There's a problem with PERCENTILE as an aggregate function. It just dawned on me (I could claim I had a suspicion earlier on that it wasn't going to work, but that's 20/20 hindsight, of course).

In order to put a given cell value into one of the 100 percentile slots in the function's running state, we need to know the max and min of the range -- and we have no way of knowing those before we get the full picture, that is, before we receive all the chunks down to the last.
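A minimal illustration of the problem (hypothetical Python, 100 equal-width bins): which bin a value falls into depends entirely on the (min, max) of the whole selection, so binning a chunk against its local range disagrees with binning against the true global range.

```python
def bin_index(value, lo, hi, n_bins=100):
    # equal-width binning; the result depends entirely on (lo, hi)
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

# range observed locally on one vnode vs. the true range of the whole
# selection, which is only known once the last chunk has arrived
local_range = (0.0, 10.0)
global_range = (0.0, 100.0)

# the same cell value lands in different bins
assert bin_index(7.0, *local_range) != bin_index(7.0, *global_range)
```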

So, with this realisation, I have no workable ideas on how to implement PERCENTILE differently from how I did it in the PRs posted.

@andytill
Contributor

ok, I am not suggesting we need a fixed number of bins, like bins used in dict.

We can still keep accuracy by keeping a sorted bag.

InfluxDB claims in its docs to use a sorted set, which would be even easier, but that sounds wrong to me because it skews results when there are duplicates.

https://docs.influxdata.com/influxdb/v0.8/api/aggregate_functions/#percentile

@hmmr
Contributor Author

hmmr commented Feb 10, 2017

@andytill For me the choice is this: to obtain an inverse distribution function, which requires counting rows and picking one by its position in a sorted selection, I need to (a) collect all rows in one place or (b) somehow avoid doing that. In my previous comment I argued that (b) is not feasible.

@andytill
Contributor

I have never suggested option (b) is feasible; I suggested it was unnecessary, three days ago.

Just keeping a sorted bag in some kind of data structure should be sufficient.

My original comment which I am still concerned about is this.

Why put more queries through query buffers when they have known performance issues?

@hmmr
Contributor Author

hmmr commented Feb 10, 2017

Why put more queries through query buffers

Because they provide facilities for collecting all rows in one place?

when they have known performance issues?

Selections fitting into memory (per a configurable limit) will be held in memory; no performance issues there. Selections exceeding the limit will still be usable for the purposes of PERCENTILE, albeit with a performance hit.

Now, the choice is: accept the performance hit and still serve the query, or run into overload-protection issues when the coordinator receives loads of bags.

@andytill
Contributor

I'm still not convinced that putting all the data onto disk when it grows too large is a good idea, because of the work that must be done before the query can continue, the pressure it might put on other requests, and the fact that, as we have discovered, storage can be much, much slower than computation on AWS.

Are there any overload tests that prove that this strategy is effective, and is any other DB doing this?

@hmmr
Contributor Author

hmmr commented Feb 13, 2017

putting all the data onto disk when it grows too large is a good idea

But what is the alternative? Refuse to serve the query?

Network-mounted, lowest-tier AWS storage is indeed suboptimal for query buffers, but certainly there exist other, faster options.

Are there any overload tests

In `ts_simple_query_buffers:query_orderby_inmem2ldb` (https://github.com/basho/riak_test/blob/develop/tests/ts_simple_query_buffers_SUITE.erl#L271-L278) I am testing the code paths for all-in-memory and all-leveldb-backed operation as well as mixed operation (that is, some data is first accumulated in memory, then dumped to disk). Similar tests exist for PERCENTILE (https://github.com/basho/riak_test/pull/1270/files#diff-ff1938dc629fdc5a892006e05f96d733R93, ..R96 and ..R99).

pressure it might put onto other requests

Other queries not involving query buffers will continue to be served by other workers (up to riak_kv.query.concurrent_queries). Yes, those which do rely on query buffers will have to wait (or be retried on some other node).

is any other DB doing this?

That may be investigated in its own right, separately. The fundamental problem remains: what to do with queries with WHERE $all_time, be they a regular SELECT x .. LIMIT 1 or a SELECT PERCENTILE(x), both of which require allocation of temp tables/buffers. It either has to be a disk-based solution, or no solution at all.

@andytill
Contributor

But what is the alternative? Refuse to serve the query?

Yes, this is called load shedding. We already use it when the user issues queries and there are not enough coordinators available. This technique is described here by Fred Hebert:

http://ferd.ca/queues-don-t-fix-overload.html

He is describing a queue as a buffer, whereas we have an in-process buffer. This paragraph stood out for me:

When applying a queue as a buffer, all they're doing is creating a bigger buffer to accumulate data that is in-flight, only to lose it sooner or later. You're making failures more rare, but you're making their magnitude worse.

As discussed in the mumble, riak_test cannot verify that storing data on disk when there is too much of it can save Riak TS from an overload/outage scenario.

The pressure on other requests I was referring to was mainly about disk and CPU, which are shared. I hadn't thought about exhaustion of coordinators, but you're right, that could lead to queries being refused as well.

I know that influxdb and cassandra do not use temporary tables in this way, probably for these reasons.

@hmmr
Contributor Author

hmmr commented Feb 13, 2017

I can't help agreeing that, whatever we do, disk-based query buffers/temp tables will remain a source of increased latency, stalls, dropped queries and other headaches.

At the same time, refusing to serve queries which could be served given increased timeouts and faster storage for temp tables is probably not a satisfactory answer either.

As a possible resolution, we can implement a configuration switch to allow/disable the fallback to leveldb in the query buffer manager. Users who only occasionally issue requests that would require accumulating data in temp tables in excess of a certain limit -- and who can tolerate a 10-sec latency -- will have the ability to use it. Conversely, users who find it easier to retry a query with smaller WHERE ranges than to stall the query buffer manager will be happy keeping the switch turned off.

I would also like to note that the discussion has veered into the larger scope of temp tables in general, away from the core question of the proposed implementation of PERCENTILE involving riak_kv_qry_compiler.


## Abstract

Riak TS needs to support *inverse distribution functions*, at least `PERCENTILE` and `MEDIAN`. This RFC details how these can be implemented using *query buffers*.
Contributor


Can we get an explanation of the naming, please? Other SQL implementations have PERCENTILE_DISC and PERCENTILE_CONT; which of these does our implementation correspond to?

The documentation team will need to have these details to write up the documentation correctly.


- `LIMIT` is assigned a same-length list of `1`s.

4. Finally, multiple columns in `'SELECT'` will be collapsed into a single column. Thus, for functions in `riak_kv_qry_worker`, `SELECT PERCENTILE(x, 0.33), MEDIAN(x)` will become `SELECT x`.
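A hypothetical sketch of this collapsing step (Python, with an invented representation of the SELECT clause; the real rewrite lives in the Erlang query compiler): all inverse-distribution calls are folded onto the single column they refer to.

```python
INVDIST_FNS = {"PERCENTILE", "MEDIAN"}

def collapse_select(select_clause):
    # select_clause is a list of calls, e.g.
    # [("PERCENTILE", "x", 0.33), ("MEDIAN", "x")]
    columns = {call[1] for call in select_clause if call[0] in INVDIST_FNS}
    if len(columns) != 1:
        raise ValueError("inverse distribution calls must all refer to the same column")
    # the rewritten clause fetches the bare column only
    return [("column", columns.pop())]
```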
Contributor

@gordonguthrie gordonguthrie Feb 14, 2017


I would like a diagram of how the query is being rewritten and rolled up and down as per the riak_ql documentation

I mean this kind of diagram...


                               <---Network--->

+ FROM     <-----------------------+        + FROM     mytable on vnode X
|                                  |        |
| SELECT   SUM(STemp)/SUM(NoTemp)  |        | SELECT   SUM(temp) AS STemp, COUNT(temp) AS NoTemp
|                                  | Chunk1 |
| GROUP BY []                      +--------+ GROUP BY []
|                                  |        |
| ORDER BY []                      |        | ORDER BY []
|                                  |        |
+ WHERE    []                      |        + WHERE + start_key = {myfamily, myseries, 1233}
                                   |                | end_key   = {myfamily, myseries, 4000}
                                   |                + temp      > 18
                                   |
                                   |
                                   |        + FROM     mytable on vnode Y
                                   |        |
                                   |        | SELECT   SUM(temp) AS STemp, COUNT(temp) AS NoTemp
                                   | Chunk2 |
                                   +--------+ GROUP BY []
                                            |
                                            | ORDER BY []
                                            |
                                            + WHERE + start_key = {myfamily, myseries, 4001}
                                                    | end_key   = {myfamily, myseries, 6789}
                                                    + temp      > 18

@hmmr
Contributor Author

hmmr commented Feb 14, 2017

Diagram added in 054590f.

@hmmr hmmr changed the title [WIP] RFC for inverse distribution functions (PERCENTILE, MEDIAN) RFC for inverse distribution functions (PERCENTILE, MEDIAN) Feb 18, 2017
The presence of `ORDER BY` will direct `riak_kv_qry_worker` to use query buffers for the query with inverse distribution functions.

By way of illustration:

Contributor


This diagram is incorrect: the ORDER BY, LIMIT and OFFSET clauses are not evaluated on the vnode.

@gordonguthrie
Contributor

+1

@hmmr
Contributor Author

hmmr commented Feb 27, 2017

@gordonguthrie per our discussions and hand-holding, the diagram has been updated in bf499b6.

@hmmr hmmr merged commit 4d42172 into master Mar 13, 2017
@hmmr hmmr deleted the az-percentile branch March 13, 2017 14:59