Conversation

@hmmr
Contributor

@hmmr hmmr commented Feb 7, 2017

RTS-1736, RTS-1740 (documentation), RTS-545, RTS-1553 (code)

The proposed RFC describes the approach and implementation details of the two inverse distribution functions, PERCENTILE and MEDIAN, as submitted in basho/riak_ql#167 and basho/riak_kv#1624.

MODE to follow separately.

@andytill
Contributor

andytill commented Feb 7, 2017

Why put more queries through query buffers when they have known performance issues?

@andytill
Contributor

andytill commented Feb 7, 2017

If the selection is an aggregate, then the ORDER BY clause could be removed, because it does not matter in which order the results are accumulated.

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Why put more queries through query buffers when they have known performance issues?

In order to obtain the percentile, we need to count the rows in a selection sorted by an arbitrary field. The query buffers facility was an easy choice. I didn't quite like this myself, though (https://basho.slack.com/archives/eng_time-series/p1485792968005660).

An alternative would be to create 100 bins and, for each value of the column in the records collected from vnodes, increment the appropriate bin's counter. The records themselves will not be stored, only the count of rows with values fitting in each bin. Memory usage will stay at 100 * external_size({Bin, Count, Value}).

One drawback of the fixed-bins method is that we need to decide upfront on the number of bins. Given a sufficiently large population of randomly distributed floats, we won't be able to distinguish between percentile(x, 0.33333) and percentile(x, 0.33334). But that's something we can live with.
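A minimal Python sketch of the fixed-bins idea (illustrative only, not the proposed Erlang implementation; it assumes the value range is known up front and keeps nothing but the per-bin counts):

```python
def make_bins(n_bins=100):
    # one counter per bin; no rows are retained
    return [0] * n_bins

def bin_index(value, lo, hi, n_bins):
    # map a value in [lo, hi] onto one of n_bins equal-width bins
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

def add_value(bins, value, lo, hi):
    bins[bin_index(value, lo, hi, len(bins))] += 1

def percentile(bins, lo, hi, p):
    # walk the cumulative counts up to the p-th fraction and return the
    # midpoint of the bin reached -- hence the bounded precision
    total = sum(bins)
    seen = 0
    width = (hi - lo) / len(bins)
    for i, count in enumerate(bins):
        seen += count
        if seen >= p * total:
            return lo + (i + 0.5) * width
    return hi
```

With 100 bins, percentile(x, 0.33333) and percentile(x, 0.33334) resolve to the same bin, which is exactly the loss of precision noted above.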

Thanks for your suggestion @andytill.

@hmmr hmmr changed the title RFC for inverse distribution functions (PERCENTILE, MEDIAN) [WIP] RFC for inverse distribution functions (PERCENTILE, MEDIAN) Feb 7, 2017
@hmmr
Contributor Author

hmmr commented Feb 7, 2017

If the selection is an aggregate

There was no intention to make inverse distribution functions work with window aggregation functions or grouping.

@andytill
Contributor

andytill commented Feb 7, 2017

In order to obtain the percentile, we need to count the rows in a selection sorted by an arbitrary field. The query buffers facility was an easy choice. I didn't quite like this myself, though (https://basho.slack.com/archives/eng_time-series/p1485792968005660).

An alternative would be to create 100 bins and, for each value of the column in the records collected from vnodes, increment the appropriate bin's counter. The records themselves will not be stored, only the count of rows with values fitting in each bin. Memory usage will stay at 100 * external_size({Bin, Count, Value}).

One drawback of the fixed-bins method is that we need to decide upfront on the number of bins. Given a sufficiently large population of randomly distributed floats, we won't be able to distinguish between percentile(x, 0.33333) and percentile(x, 0.33334). But that's something we can live with.

Is this the new proposal for percentile now?

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Is this the new proposal for percentile now?

Sort of. I am in the process of collecting feedback.

@andytill
Contributor

andytill commented Feb 7, 2017

Just keeping a sorted bag in some kind of data structure should be sufficient.

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

Just keeping a sorted bag in some kind of data structure should be sufficient.

Exactly. It would be a new calc_type = invdist, with its own run_select_on_invdist_chunk. No query buffer will be harmed.

Let's hear from @ph07 on this matter, specifically on the loss of precision inherent in the fixed-bins method.

@andytill
Contributor

andytill commented Feb 7, 2017

ok, I am not suggesting we need a fixed number of bins, like bins used in dict.

We can still keep accuracy by keeping a sorted bag.
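An exact version along these lines might be sketched as follows (illustrative Python, assuming the discrete, nearest-rank definition of percentile; duplicates are preserved, so repeated values carry their full weight):

```python
import bisect
import math

class SortedBag:
    """A sorted multiset of values; duplicates are kept."""

    def __init__(self):
        self._values = []

    def add(self, v):
        bisect.insort(self._values, v)  # insertion keeps the bag sorted

    def percentile(self, p):
        # nearest-rank: the value at 1-based position ceil(p * n),
        # clamped to the valid index range
        n = len(self._values)
        idx = min(n - 1, max(0, math.ceil(p * n) - 1))
        return self._values[idx]
```

Unlike the fixed-bins sketch, this keeps every row in memory, which is precisely the trade-off under discussion.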

@andytill
Contributor

andytill commented Feb 7, 2017

Two more Qs.

  1. Why not use the aggregates we already have, so it can be used in GROUP BY?
  2. Does this support multiple percentiles, e.g. SELECT PERCENTILE(latency, 98), PERCENTILE(latency, 99.9)?

@hmmr
Contributor Author

hmmr commented Feb 7, 2017

  1. Yes, in the (quickly becoming obsolete) qbuf-based implementation I do support multiple calls to PERCENTILE, but only as long as they all refer to the same column.

  2. It's getting more interesting by the minute. Going with aggregates, we will be receiving 100-element state terms* from each of the vnodes covered, for every distinct-column PERCENTILE entry in the query. With a separate, dedicated calc_type, we will be receiving all chunks raw, then sorting them into bins on the coordinator and dropping them. To a degree it depends on the data, but I believe aggregates can be less resource-consuming.

  * OK, the 100-bin dict can be reduced to contain only bins with a non-zero count.

State terms are effectively histograms, which could be easily combined, either incrementally or all at once at the final stage.
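Combining such states could look like this (an illustrative Python sketch; a sparse histogram is represented as a map from bin index to count, matching the non-zero-bins note above):

```python
from collections import Counter

def merge_histograms(*states):
    # sum counts bin-by-bin; bins absent from a state count as zero
    merged = Counter()
    for state in states:
        merged.update(state)
    return merged
```

The merge is associative and commutative, so per-vnode states can be folded in incrementally as chunks arrive, or all at once at the final stage.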

Thanks for another suggestion, Andy!

@hmmr
Contributor Author

hmmr commented Feb 9, 2017

@andytill There's a problem with PERCENTILE as an aggregate function. It just dawned on me (I could claim I had a suspicion earlier on that it wasn't going to work, but that's 20/20 hindsight, of course).

In order to put a given cell value into one of the 100 percentile slots in the function's running state, we need to know the max and min of the range -- and we have no way of knowing those before we get the full picture, that is, before we receive all the chunks down to the last.
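A minimal illustration of the problem (hypothetical Python, 100 equal-width bins): which bin a value falls into depends entirely on the (min, max) of the whole selection, so binning a chunk against its local range disagrees with binning against the true global range.

```python
def bin_index(value, lo, hi, n_bins=100):
    # equal-width binning; the result depends entirely on (lo, hi)
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)

# range observed locally on one vnode vs. the true range of the whole
# selection, which is only known once the last chunk has arrived
local_range = (0.0, 10.0)
global_range = (0.0, 100.0)

# the same cell value lands in different bins
assert bin_index(7.0, *local_range) != bin_index(7.0, *global_range)
```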

So, with this realisation, I have no workable ideas on how to implement PERCENTILE differently from how I did it in the PRs posted.

@andytill
Contributor

ok, I am not suggesting we need a fixed number of bins, like bins used in dict.

We can still keep accuracy by keeping a sorted bag.

InfluxDB claims in its docs to use a sorted set, which would be even easier, but that sounds wrong to me because it skews results when there are duplicates.

https://docs.influxdata.com/influxdb/v0.8/api/aggregate_functions/#percentile

@hmmr
Contributor Author

hmmr commented Feb 10, 2017

@andytill For me the choice is this: to obtain an inverse distribution function, which requires counting rows and picking one by its position in a sorted selection, I need to (a) collect all rows in one place or (b) somehow avoid doing that. In my previous comment I argued that (b) is not feasible.

@andytill
Contributor

I have never suggested option (b) is feasible; I suggested it was unnecessary, three days ago.

Just keeping a sorted bag in some kind of data structure should be sufficient.

My original comment which I am still concerned about is this.

Why put more queries through query buffers when they have known performance issues?

@hmmr
Contributor Author

hmmr commented Feb 10, 2017

Why put more queries through query buffers

Because they provide facilities for collecting all rows in one place?

when they have known performance issues?

Selections fitting into memory (per a configurable limit) will be held in memory; no performance issues there. Selections exceeding the limit will still be usable for the purposes of PERCENTILE, albeit with a performance hit.

Now, the choice is: accept the performance hit and still serve the query, or run into overload-protection issues when the coordinator receives loads of bags.

@andytill
Contributor

I'm still not convinced that putting all the data onto disk when it grows too large is a good idea, because of the work that must be done before the query can continue, the pressure it might put on other requests, and the fact that, as we have discovered, storage can be much, much slower than computation on AWS.

Are there any overload tests that prove that this strategy is effective, and is any other DB doing this?

@hmmr
Contributor Author

hmmr commented Feb 13, 2017

putting all the data onto disk when it grows too large is a good idea

But what is the alternative? Refuse to serve the query?

Network-mounted, lowest-tier AWS storage is indeed suboptimal for query buffers, but certainly there exist other, faster options.

Are there any overload tests

In `ts_simple_query_buffers:query_orderby_inmem2ldb` (https://github.com/basho/riak_test/blob/develop/tests/ts_simple_query_buffers_SUITE.erl#L271-L278) I am testing the code paths for all-in-memory and all-leveldb-backed operation as well as mixed operation (that is, some data is first accumulated in memory, then dumped to disk). Similar tests exist for PERCENTILE (https://github.com/basho/riak_test/pull/1270/files#diff-ff1938dc629fdc5a892006e05f96d733R93, ..R96 and ..R99).

pressure it might put onto other requests

Other queries not involving query buffers will continue to be served by other workers (up to riak_kv.query.concurrent_queries). Yes, those which do rely on query buffers will have to wait (or be retried on some other node).

is any other DB doing this?

That may be investigated in its own right, separately. The fundamental problem remains: what to do with queries with WHERE $all_time, be they a regular SELECT x .. LIMIT 1 or a SELECT PERCENTILE(x), both of which require allocation of temp tables/buffers. It either has to be a disk-based solution, or no solution at all.

@andytill
Contributor

But what is the alternative? Refuse to serve the query?

Yes, this is called load shedding. We already use it when the user issues queries and there are not enough coordinators available. This technique is described here by Fred Hebert:

http://ferd.ca/queues-don-t-fix-overload.html

He is describing a queue as a buffer, whereas we have an in-process buffer. This paragraph stood out for me:

When applying a queue as a buffer, all they're doing is creating a bigger buffer to accumulate data that is in-flight, only to lose it sooner or later. You're making failures more rare, but you're making their magnitude worse.

As discussed in the mumble, riak_test cannot verify that storing data on disk when there is too much of it can save Riak TS from an overload/outage scenario.

The pressure on other requests I was referring to was mainly about disk and CPU, which are shared. I hadn't thought about exhaustion of coordinators, but you're right, that could lead to queries being refused as well.

I know that influxdb and cassandra do not use temporary tables in this way, probably for these reasons.

@hmmr
Contributor Author

hmmr commented Feb 13, 2017

I can't help agreeing that, whatever we do, disk-based query buffers/temp tables will remain a source of increased latency, stalls, dropped queries and other headaches.

At the same time, refusing to serve queries which could be served given increased timeouts and faster storage for temp tables is probably not a satisfactory answer either.

As a possible resolution, we can implement a configuration switch to allow/disable the fallback to leveldb in the query buffer manager. Users who only occasionally issue requests that would require accumulating data in temp tables in excess of a certain limit -- and who can tolerate a 10-sec latency -- will have the ability to use it. Conversely, users who find it easier to retry a query with smaller WHERE ranges than to stall the query buffer manager will be happy keeping the switch turned off.

I would also like to note that the discussion has veered into the larger scope of temp tables in general, away from the core question of the proposed implementation of PERCENTILE involving riak_kv_qry_compiler.


## Abstract

Riak TS needs to support *inverse distribution functions*, at least `PERCENTILE` and `MEDIAN`. This RFC details how these can be implemented using *query buffers*.
Contributor


Can we get an explanation of the naming, please? Other SQL implementations have PERCENTILE_DISC and PERCENTILE_CONT; which of these does our implementation correspond to?

The documentation team will need to have these details to write up the documentation correctly.


- `LIMIT` is assigned a same-length list of `1`s.

4. Finally, multiple columns in `'SELECT'` will be collapsed into a single column. Thus, for functions in `riak_kv_qry_worker`, `SELECT PERCENTILE(x, 0.33), MEDIAN(x)` will become `SELECT x`.
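A hypothetical sketch of this collapsing step (Python, with an invented representation of the SELECT clause; the real rewrite lives in the Erlang query compiler): all inverse-distribution calls are folded onto the single column they refer to.

```python
INVDIST_FNS = {"PERCENTILE", "MEDIAN"}

def collapse_select(select_clause):
    # select_clause is a list of calls, e.g.
    # [("PERCENTILE", "x", 0.33), ("MEDIAN", "x")]
    columns = {call[1] for call in select_clause if call[0] in INVDIST_FNS}
    if len(columns) != 1:
        raise ValueError("inverse distribution calls must all refer to the same column")
    # the rewritten clause fetches the bare column only
    return [("column", columns.pop())]
```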
Contributor

@gordonguthrie gordonguthrie Feb 14, 2017


I would like a diagram of how the query is being rewritten and rolled up and down as per the riak_ql documentation

I mean this kind of diagram...


                               <---Network--->

+ FROM     <-----------------------+        + FROM     mytable on vnode X
|                                  |        |
| SELECT   SUM(STemp)/SUM(NoTemp)  |        | SELECT   SUM(temp) AS STemp, COUNT(temp) AS NoTemp
|                                  | Chunk1 |
| GROUP BY []                      +--------+ GROUP BY []
|                                  |        |
| ORDER BY []                      |        | ORDER BY []
|                                  |        |
+ WHERE    []                      |        + WHERE + start_key = {myfamily, myseries, 1233}
                                   |                | end_key   = {myfamily, myseries, 4000}
                                   |                + temp      > 18
                                   |
                                   |
                                   |        + FROM     mytable on vnode Y
                                   |        |
                                   |        | SELECT   SUM(temp) AS STemp, COUNT(temp) AS NoTemp
                                   | Chunk2 |
                                   +--------+ GROUP BY []
                                            |
                                            | ORDER BY []
                                            |
                                            + WHERE + start_key = {myfamily, myseries, 4001}
                                                    | end_key   = {myfamily, myseries, 6789}
                                                    + temp      > 18

@hmmr
Contributor Author

hmmr commented Feb 14, 2017

Diagram added in 054590f.

@hmmr hmmr changed the title [WIP] RFC for inverse distribution functions (PERCENTILE, MEDIAN) RFC for inverse distribution functions (PERCENTILE, MEDIAN) Feb 18, 2017
The presence of `ORDER BY` will direct `riak_kv_qry_worker` to use query buffers for the query with inverse distribution functions.

By way of illustration:

Contributor


This diagram is incorrect: the ORDER BY, LIMIT and OFFSET clauses are not evaluated on the vnode.

@gordonguthrie
Contributor

+1

@hmmr
Contributor Author

hmmr commented Feb 27, 2017

@gordonguthrie per our discussions and hand-holding, the diagram has been updated in bf499b6.

@hmmr hmmr merged commit 4d42172 into master Mar 13, 2017
@hmmr hmmr deleted the az-percentile branch March 13, 2017 14:59