apply max-series-per-req to non-tagged queries #1926

Dieterbe · 2020-10-16T17:26:27Z

max-series-per-req has been implemented for tagged queries for quite a while, but wasn't implemented yet for non-tagged queries. We have seen customers run excessively large queries that lead to lots of allocated memory; we want to reject such queries instead. fix #1916

This PR does the following:

implement a fractional limit. e.g. when a query node receives a query with a limit of 100, and the cluster has 10 shards and 5 shardgroups (meaning every read node owns 2 shards out of the 10), then each read node will see a limit of 100 * 2 / 10 = 20.
the read nodes will try to detect a limit breach as early as possible, without first allocating a bunch of stuff. also, limit breaches are cached in the findCache.
some minor refactoring to make this possible and some small doc tweaks. In particular, fetchFunc now gets the full peersGroup map, so it has the awareness of the cluster it needs to give each peer its correct fractional limit (though there's a caveat here: if shardgroups are completely down - degraded cluster - then the fractional limit will be higher than it should be. suboptimal but should be acceptable)
add a "global" limit: on a query node, if all individual fan-out requests somehow succeeded but the aggregate still breached the limit (unlikely), return an error
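
To make the arithmetic above concrete, here is a minimal sketch of the fractional limit and the "global" check on the query node. It is not the code from this PR: the function names, the truncating integer division, and the "strictly greater than" breach condition are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errLimitExhausted reuses the "limit exhausted" message seen in the logs;
// the variable name itself is hypothetical.
var errLimitExhausted = errors.New("limit exhausted")

// fractionalLimit scales a request-wide series limit down to the share of
// shards owned by a single read node. Truncating division is an assumption.
func fractionalLimit(limit, ownedShards, totalShards int) int {
	if totalShards <= 0 {
		return limit // no cluster information: fall back to the full limit
	}
	return limit * ownedShards / totalShards
}

// checkGlobalLimit is the query-node safety net: it sums the series returned
// by every fan-out request and errors once the sum exceeds the limit.
func checkGlobalLimit(limit int, seriesPerPeer []int) error {
	total := 0
	for _, n := range seriesPerPeer {
		total += n
		if total > limit {
			return errLimitExhausted
		}
	}
	return nil
}

func main() {
	fmt.Println(fractionalLimit(100, 2, 10))                      // 20
	fmt.Println(checkGlobalLimit(100, []int{20, 20, 20, 20, 20})) // <nil>
	fmt.Println(checkGlobalLimit(100, []int{25, 25, 25, 25, 25})) // limit exhausted
}
```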

see all individual commits for more details.

fetchFunc can use this to determine the ratio of how much data the target peer owns compared to the cluster as a whole. Caveat: this is based on live cluster state. If shardgroups go completely down it'll look like the target peer owns more of the cluster than it actually does. if query limits are set based on this, the limits would loosen up as shards leave the cluster.

we do this by adjusting queryAllShards such that it doesn't assume a simple peer.Post but allows passing in the fetchFunc, such that we can define a custom fetchFunc that can figure out the fraction of the data that our target peer is responsible for
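
The live-state caveat can be illustrated with a small sketch. The `peer` type, the shape of the peer groups map, and `limitForPeer` are hypothetical stand-ins, not metrictank's actual types; the point is only that a fetchFunc which sees the whole map can derive the target's share, and that the share grows when a shardgroup disappears from the map.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for a cluster node; only the partitions
// (shards) it owns matter here.
type peer struct {
	name       string
	partitions []int
}

// limitForPeer derives the target's share of the limit from the live view of
// the cluster. peerGroups maps a shardgroup ID to its currently live members.
func limitForPeer(limit int, target peer, peerGroups map[int][]peer) int {
	seen := make(map[int]struct{}) // distinct partitions visible right now
	for _, members := range peerGroups {
		for _, p := range members {
			for _, part := range p.partitions {
				seen[part] = struct{}{}
			}
		}
	}
	if len(seen) == 0 {
		return limit
	}
	return limit * len(target.partitions) / len(seen)
}

func main() {
	// 5 shardgroups, each owning 2 of the 10 partitions.
	groups := make(map[int][]peer)
	for g := 0; g < 5; g++ {
		groups[g] = []peer{{name: fmt.Sprintf("mt%d", g), partitions: []int{2 * g, 2*g + 1}}}
	}
	target := groups[0][0]
	fmt.Println(limitForPeer(100, target, groups)) // healthy: 100 * 2 / 10 = 20

	delete(groups, 4) // one shardgroup is completely down
	fmt.Println(limitForPeer(100, target, groups)) // degraded: 100 * 2 / 8 = 25
}
```

In the degraded case each remaining peer is handed a looser limit than intended, which is the situation the "global" check on the query node is meant to catch.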

Dieterbe · 2020-10-16T21:03:11Z

tested in docker-cluster-query with these tweaks:

-max-series-per-req = 250000
+max-series-per-req = 15
 # require x-org-id authentication to auth as a specific org. otherwise orgId 1 is assumed
 multi-tenant = true
 # in case our /render endpoint does not support the requested processing, proxy the request to this graphite
 fallback-graphite-addr = http://graphite
 # proxy to graphite when metrictank considers the request bad
-proxy-bad-requests = true //otherwise the http 400's will cause a proxy to graphite. it would still error the same, but the messages are harder to read
+proxy-bad-requests = false

and

-      MT_LOG_LEVEL: info
+      MT_LOG_LEVEL: debug

on all metrictank processes

mt-fakemetrics feed --kafka-mdm-addr localhost:9092 --period 10s --mpo 100

docker-compose logs -f metrictank0 metrictank1 metrictank2 metrictank3 metrictank-q0 | grep -Ev 'memberlist|CLU manager|Sarama|AM:|kafkamdm|already in index|stats flushing.*to graphite|kafka-cluster|cassandra-store: (save complete|starting to save)|updating.*in.*index'  # easy logging view

demonstration single target

wget -q 'http://localhost:6061/render?target=some.id.of.a.metric.1*&from=-60s' -O - | jsonpp | grep target
        "target": "some.id.of.a.metric.1",
        "target": "some.id.of.a.metric.10",
        "target": "some.id.of.a.metric.100",
        "target": "some.id.of.a.metric.11",
        "target": "some.id.of.a.metric.12",
        "target": "some.id.of.a.metric.13",
        "target": "some.id.of.a.metric.14",
        "target": "some.id.of.a.metric.15",
        "target": "some.id.of.a.metric.16",
        "target": "some.id.of.a.metric.17",
        "target": "some.id.of.a.metric.18",
        "target": "some.id.of.a.metric.19",

wget --server-response 'http://localhost:6061/render?target=some.id.of.a.metric.*&from=-60s' -O -
--2020-10-16 22:43:16--  http://localhost:6061/render?target=some.id.of.a.metric.*&from=-60s
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:6061... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 400 Bad Request
  Content-Length: 15
  Content-Type: text/plain
  Date: Fri, 16 Oct 2020 20:43:16 GMT
  Server: Caddy
  Trace-Id: 28522102b04971f2
  Vary: Origin
2020-10-16 22:43:16 ERROR 400: Bad Request.

demonstration of breaching limit across multiple targets

wget -q 'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.1&from=-60s' -O - | jsonpp | grep target
        "target": "some.id.of.a.metric.1",
        "target": "some.id.of.a.metric.10",
        "target": "some.id.of.a.metric.100",
        "target": "some.id.of.a.metric.11",
        "target": "some.id.of.a.metric.12",
        "target": "some.id.of.a.metric.13",
        "target": "some.id.of.a.metric.14",
        "target": "some.id.of.a.metric.15",
        "target": "some.id.of.a.metric.16",
        "target": "some.id.of.a.metric.17",
        "target": "some.id.of.a.metric.18",
        "target": "some.id.of.a.metric.19",
        "target": "some.id.of.a.metric.1",

wget --server-response 'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.{1,2}&from=-60s' -O -
--2020-10-16 22:39:35--  http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.%7B1,2%7D&from=-60s
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:6061... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 400 Bad Request
  Content-Length: 15
  Content-Type: text/plain
  Date: Fri, 16 Oct 2020 20:39:35 GMT
  Server: Caddy
  Trace-Id: 2153aa0824baf179
  Vary: Origin
2020-10-16 22:39:35 ERROR 400: Bad Request.

note: this one works because we dedup identical targets. bit of an (unlikely) loophole. it returns all targets twice.

wget --server-response  'http://localhost:6061/render?target=some.id.of.a.metric.1*&target=some.id.of.a.metric.1*&from=-60s'

I also verified that the check for http 400 works: it aborts the spec-exec, which leads to a cancellation.
(unfortunately we don't actually implement the cancellation on the read nodes yet)

metrictank-q0_1  | ABORT SPECEXEC
metrictank-q0_1  | 2020-10-16 20:46:43.103 [ERROR] Peer metrictank3 responded with error = "400 Bad Request"
metrictank-q0_1  | 2020-10-16 20:46:43.103 [INFO] ts=2020-10-16T20:46:43.103780874Z traceID=79c160bd75d20d0b, sampled=true msg="GET /render?from=-60s&target=some.id.of.a.metric.1%2A&target=some.id.of.a.metric.%2A (400) 1.055897ms" orgID=1 sourceIP="172.25.0.1" error="400%20Bad%20Request"
metrictank-q0_1  | 2020-10-16 20:46:43.103 [INFO] CLU HTTPNode: context canceled on request to peer metrictank0

I also added a couple of prints for the findCache and tested querying 1* (ok) and * (not ok).
the cache seemed to handle this fine.

metrictank2_1    | DIETER ADDING TO CACHE some.id.of.a.metric.1*
metrictank2_1    | ([]*memory.Node) (len=6 cap=8) {
metrictank2_1    |  (*memory.Node)(0xc001a72500)(leaf - some.id.of.a.metric.16),
metrictank2_1    |  (*memory.Node)(0xc001a72680)(leaf - some.id.of.a.metric.14),
metrictank2_1    |  (*memory.Node)(0xc001a728c0)(leaf - some.id.of.a.metric.15),
metrictank2_1    |  (*memory.Node)(0xc001a72900)(leaf - some.id.of.a.metric.1),
metrictank2_1    |  (*memory.Node)(0xc001a729c0)(leaf - some.id.of.a.metric.13),
metrictank2_1    |  (*memory.Node)(0xc0017bfc80)(leaf - some.id.of.a.metric.100)
metrictank2_1    | }
metrictank2_1    | (interface {}) <nil>
...
metrictank2_1    | DIETER FINDCACHE SAID some.id.of.a.metric.1*
metrictank2_1    | (memory.CacheResult) {
metrictank2_1    |  nodes: ([]*memory.Node) (len=6 cap=8) {
metrictank2_1    |   (*memory.Node)(0xc001a72500)(leaf - some.id.of.a.metric.16),
metrictank2_1    |   (*memory.Node)(0xc001a72680)(leaf - some.id.of.a.metric.14),
metrictank2_1    |   (*memory.Node)(0xc001a728c0)(leaf - some.id.of.a.metric.15),
metrictank2_1    |   (*memory.Node)(0xc001a72900)(leaf - some.id.of.a.metric.1),
metrictank2_1    |   (*memory.Node)(0xc001a729c0)(leaf - some.id.of.a.metric.13),
metrictank2_1    |   (*memory.Node)(0xc0017bfc80)(leaf - some.id.of.a.metric.100)
metrictank2_1    |  },
metrictank2_1    |  err: (error) <nil>
metrictank2_1    | }
...
metrictank2_1    | DIETER ADDING TO CACHE some.id.of.a.metric.*
metrictank2_1    | ([]*memory.Node) <nil>
metrictank2_1    | (errors.BadRequest) (len=15) limit exhausted
metrictank2_1    | 2020-10-16 20:59:49.429 [INFO] ts=2020-10-16T20:59:49.428962139Z traceID=650ec2db82e2365d, sampled=true msg="POST /index/find (400) 413.898µs" orgID=0 sourceIP="172.25.0.14" error="limit%20exhausted"
metrictank2_1    | 2020/10/16 21:00:08 [DEBUG] memberlist: Initiating push/pull sync with: 172.25.0.13:7946
...
metrictank2_1    | DIETER FINDCACHE SAID some.id.of.a.metric.*
metrictank2_1    | (memory.CacheResult) {
metrictank2_1    |  nodes: ([]*memory.Node) <nil>,
metrictank2_1    |  err: (errors.BadRequest) (len=15) limit exhausted
metrictank2_1    | }
metrictank2_1    | 2020-10-16 21:01:38.173 [INFO] ts=2020-10-16T21:01:38.173685973Z traceID=34079b8664f88809, sampled=true msg="POST /index/find (400) 365.66µs" orgID=0 sourceIP="172.25.0.14" error="limit%20exhausted"

note that we also now cache limit breaches in the findcache

if the request is bad, there is no point retrying on a different replica of the same shard. breaking the limit on one, will also break it on the 2nd. See also #985
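
A minimal sketch of that retry decision, assuming the speculative-execution path inspects the peer's HTTP status code. The function name is hypothetical, and treating only 5xx responses as retryable is an assumption; the only behaviour taken from this PR is that a 400 is not retried.

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRetryOnReplica reports whether a failed peer response is worth
// retrying against another replica of the same shard.
func shouldRetryOnReplica(statusCode int) bool {
	switch {
	case statusCode == http.StatusBadRequest:
		// The request itself is bad (e.g. it breaches the series limit),
		// so another replica holding the same data would reject it too.
		return false
	case statusCode >= http.StatusInternalServerError:
		// Assumption: server-side failures may be specific to one replica.
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRetryOnReplica(http.StatusBadRequest))         // false
	fmt.Println(shouldRetryOnReplica(http.StatusServiceUnavailable)) // true
}
```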

Dieterbe · 2020-10-16T22:27:30Z

caveat:
if find is called with the same patterns but different limits, this may lead to a proliferation of entries in the findCache.
this would only happen if the shard topology changes (or shardgroups go down) on a live cluster (rare), or if the same patterns are queried but as part of different targets (e.g. target=foo, target=bar&target=foo, target=foo&target=consolidateBy(foo,'sum')). these may all lead to different limits being applied (the limit takes already resolved series into account), and thus to the size of the findCache increasing.

this all sounds reasonable though. in practice this should typically not have an adverse effect.

cache results depend on the limit.
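
To illustrate why the cache has to be limit-aware, and why the same pattern can occupy several entries, here is a simplified sketch. It is not the real findCache (an unbounded map is used here, and all type and function names are hypothetical); it only shows a key made of pattern plus limit, with limit breaches cached alongside normal results.

```go
package main

import (
	"errors"
	"fmt"
)

var errLimitExhausted = errors.New("limit exhausted")

// cacheKey includes the limit: the same pattern may succeed under one limit
// and breach under another, so the results cannot share an entry.
type cacheKey struct {
	pattern string
	limit   int
}

// cacheResult holds either the matched nodes or the cached error.
type cacheResult struct {
	nodes []string
	err   error
}

type findCache struct {
	entries map[cacheKey]cacheResult
}

func newFindCache() *findCache {
	return &findCache{entries: make(map[cacheKey]cacheResult)}
}

// find consults the cache first; on a miss it runs lookup and stores the
// outcome, whether that is a node list or a limit breach.
func (c *findCache) find(pattern string, limit int, lookup func(string) []string) ([]string, error) {
	key := cacheKey{pattern: pattern, limit: limit}
	if res, ok := c.entries[key]; ok {
		return res.nodes, res.err
	}
	res := cacheResult{nodes: lookup(pattern)}
	if len(res.nodes) > limit {
		res = cacheResult{err: errLimitExhausted}
	}
	c.entries[key] = res
	return res.nodes, res.err
}

func main() {
	calls := 0
	lookup := func(string) []string {
		calls++
		return []string{"a.1", "a.10", "a.11"}
	}
	c := newFindCache()
	_, err := c.find("a.1*", 5, lookup)
	fmt.Println(err, calls) // <nil> 1
	_, err = c.find("a.1*", 2, lookup)
	fmt.Println(err, calls) // limit exhausted 2
	_, err = c.find("a.1*", 2, lookup)
	fmt.Println(err, calls)     // limit exhausted 2 (breach served from cache)
	fmt.Println(len(c.entries)) // 2 entries for one pattern
}
```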

robert-milan

I've reviewed it, tested it locally. Overall I am pretty sure it is fine, but would like a little more time to look into 1 or 2 things.

robert-milan

I think this should work fine.

Dieterbe force-pushed the maxSeries-non-tagged branch from 1faf7d3 to b9c3b64 October 16, 2020 17:27

Dieterbe mentioned this pull request Oct 16, 2020

graphite query editor: find request failure invisible in UI grafana/grafana#28336

Closed

Dieterbe added 6 commits October 16, 2020 22:53

implement maxSeriesPerReq for non-tagged queries

56cd023

use type alias

9248690

docs fix

39dee72

/index/find -> pass through the limit to the index Find()

3eaf5ad

Dieterbe force-pushed the maxSeries-non-tagged branch from b9c3b64 to df76f3e October 16, 2020 21:06

Dieterbe marked this pull request as ready for review October 16, 2020 21:06

Dieterbe requested a review from robert-milan October 16, 2020 21:06

Dieterbe added 3 commits October 17, 2020 00:18

implement a limit on index.Find

3e2ca8e

note that we also now cache limit breaches in the findcache

spec-exec: don't retry requests that fail http 400

795dcb3

if the request is bad, there is no point retrying on a different replica of the same shard. breaking the limit on one, will also break it on the 2nd. See also #985

fix tests

8a7ee29

Dieterbe force-pushed the maxSeries-non-tagged branch from df76f3e to 68637ef October 16, 2020 22:19

Dieterbe added 2 commits October 17, 2020 00:29

unit tests for Find limit

df44ebd

bugfix: findCache must be limit-aware

596df84

cache results depend on the limit.

Dieterbe force-pushed the maxSeries-non-tagged branch from 68637ef to 596df84 October 16, 2020 22:30

robert-milan reviewed Oct 22, 2020

robert-milan approved these changes Oct 22, 2020

robert-milan merged commit 5b78a26 into master Oct 22, 2020

robert-milan deleted the maxSeries-non-tagged branch October 22, 2020 20:30

Dieterbe mentioned this pull request Oct 28, 2020

"limit exhausted" message due to max-series-per-req limit (for untagged requests) #1930

Open
