ES|QL: add track for LOOKUP JOIN scale tests #719
luigidellaquila merged 17 commits into elastic:master
Conversation
joins/README.md (outdated)
This assumes the file was generated with at least 1k documents.
This is likely true for small-cardinality fields, but might not be the case for 100m.
I wonder if it is worth mentioning explicitly?
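A quick sanity check for this could be (a sketch; it assumes the newline-delimited JSON corpora produced by the generation commands quoted further down in this thread):

# hypothetical check: count the generated documents, assuming one JSON doc per line
bzcat join_base_idx-10M.json.bz2 | wc -l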
👍 I added a small note and some clarifications in the README
{
  "name": "esql_lookup_join_1k_keys_where_no_match",
  "operation-type": "esql",
  "query": "FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where concat(lookup_keyword_0, \"foo\") == \"bar\""
}
For now these queries behave the same as the *keys_where_limit* ones (see above), but in the future we may be able to push one of the two down to Lucene, though most likely not the other one.
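To illustrate the distinction (a sketch reusing the track's index and field names; which of the two shapes could eventually be pushed down is speculation on my side):

// a plain predicate on the looked-up field: a conceivable future candidate for Lucene pushdown
FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where lookup_keyword_0 == "bar"

// the function-wrapped predicate used by this operation: must be evaluated row by row after the join
FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where concat(lookup_keyword_0, "foo") == "bar"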
Thanks @gbanasiak!
alex-spies left a comment
Thanks a lot Luigi! Looks good, although I do have suggestions for further iterations on the track.
I think in a follow-up, we should deliberately force the lookups onto the coordinator or onto the data nodes, respectively. See my comment below.
Additionally, I'm curious about how the following queries would perform:
- lookup join against a lookup index where the non-lookup fields contain large values (long strings or many multi-values, for instance) - but maybe that's less relevant as the default dataset has 10 additional text fields already.
- What happens when the lookup indices have many rows matching the same keys? Maybe the default dataset should have some repetitions?
Especially the last one will be important IMHO: with the planned SQL-like semantics, the cardinality of the whole result set will change with every LOOKUP JOIN. But it's already interesting in the current state, as the current implementation may pool a lot of multi-values.
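A tiny worked example of the repetition point (hypothetical rows; field names borrowed from the track): suppose the lookup index holds two rows with the same key,

{"key_1000": 42, "lookup_keyword_0": "a"}
{"key_1000": 42, "lookup_keyword_0": "b"}

Under the planned SQL-like semantics, a single base row with key_1000 == 42 would then expand into two result rows, so each LOOKUP JOIN can multiply the result set rather than merely widen it.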
{
  "name": "esql_lookup_join_1k_100k_200k_500k",
  "operation-type": "esql",
  "query": "FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | rename lookup_keyword_0 as lk_1k | lookup join lookup_idx_100000_f10 on key_100000 | rename lookup_keyword_0 as lk_100k | lookup join lookup_idx_200000_f10 on key_200000 | rename lookup_keyword_0 as lk_200k | lookup join lookup_idx_500000_f10 on key_500000 | rename lookup_keyword_0 as lk_500k | keep id, key_1000, key_100000, key_200000, key_500000, lk_1k, lk_100k, lk_200k, lk_500k | limit 1000"
}
The limit 1000 at the end is pushed down, and therefore all lookups will happen on the coordinator node, I think. This means we also perform the lookups against at most 1000 rows.
The same is essentially true for all other queries in this file, because we add an implicit LIMIT 1000 by default.
I think it'd be good to have versions of these queries where we add a SORT ... at the end - this should force the lookups onto the data nodes (best to confirm, though, by looking at the plans).
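For example (a sketch; id is used as the sort key purely for illustration, and the resulting plan should be checked as suggested above):

FROM join_base_idx
| lookup join lookup_idx_1000_f10 on key_1000
| sort id
| limit 1000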
Good point, I think it makes sense to have it as a next iteration.
I'll add some queries right away
Actually, I realized I was being imprecise. The queries with a WHERE clause will not have the limit pushed down past the LOOKUP JOINs, because the WHERE clause should prevent the pushdown. So there are already queries that force the execution of LOOKUP JOINs onto the data nodes.
./lookup_idx.sh 1000 10 1 | shuf | bzip2 -c > lookup_idx_1000_f10.json.bz2
./lookup_idx.sh 100000 10 1 | shuf | bzip2 -c > lookup_idx_100000_f10.json.bz2
./lookup_idx.sh 200000 10 1 | shuf | bzip2 -c > lookup_idx_200000_f10.json.bz2
./lookup_idx.sh 500000 10 1 | shuf | bzip2 -c > lookup_idx_500000_f10.json.bz2
./lookup_idx.sh 1000000 10 1 | shuf | bzip2 -c > lookup_idx_1000000_f10.json.bz2
./lookup_idx.sh 5000000 10 1 | shuf | bzip2 -c > lookup_idx_5000000_f10.json.bz2
./joins_main_idx.sh 10000000 | bzip2 -c > join_base_idx-10M.json.bz2
nit: maybe a smaller number of lookup indices would be sufficient? Like 1k, 50k, 1M, 10M?
I think it's more interesting to have different numbers of repetitions IMHO (see the sketch below).
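If the third argument of lookup_idx.sh controls the number of rows per key (an assumption on my side; all the commands above pass 1), a repetition variant could be generated along these lines:

# hypothetical: 100k distinct keys, 10 extra fields, 5 rows per key; the _r5 suffix is made up
./lookup_idx.sh 100000 10 5 | shuf | bzip2 -c > lookup_idx_100000_f10_r5.json.bz2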
{#
  "operation": "esql_lookup_join_100k_keys_where_no_match",
  "tags": ["lookup", "join"],
  "clients": 1,
  "warmup-iterations": 10,
  "iterations": 50
#}
Are the ..._where_no_match challenges commented out on purpose?
Yes, it's on purpose. These are super expensive and just time out.
My plan is to run some iterations with a higher timeout (tens of minutes) and test the limits here
Adding a track to measure the performance of ES|QL LOOKUP JOIN at scale.
The PR contains