ES|QL: add track for LOOKUP JOIN scale tests #719

Merged
luigidellaquila merged 17 commits into elastic:master from luigidellaquila:esql_join_scale
Jan 9, 2025

Conversation

@luigidellaquila (Contributor) commented Dec 31, 2024

Adding a track to measure the performance of ES|QL LOOKUP JOIN at scale.

The PR contains:

  • a set of scripts to generate the corpus
  • queries with the corresponding challenges
  • track configuration to load the corpus and run the queries
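
For orientation, a challenge in the track pairs each query (registered as an operation) with a schedule entry. A minimal sketch, mirroring the shape that appears later in this PR (the operation name is illustrative):

{
"operation": "esql_lookup_join_1k_keys",
"tags": ["lookup", "join"],
"clients": 1,
"warmup-iterations": 10,
"iterations": 50
}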

joins/README.md Outdated
Comment on lines 30 to 31
Contributor

This assumes the file was generated with at least 1k documents.
That's likely true for the small-cardinality fields, but might not be the case for 100m.
I wonder if it's worth mentioning explicitly?

Contributor Author

👍 I added a small note

{
"name": "esql_lookup_join_1k_keys_where_no_match",
"operation-type": "esql",
"query": "FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where concat(lookup_keyword_0, \"foo\") == \"bar\""
}
Contributor Author

For now these queries behave the same as the *keys_where_limit* ones (see above), but in the future we may be able to push one of the two down to Lucene, though most likely not the other.
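
For illustration, a hypothetical variant whose predicate sits directly on the looked-up field:

FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where lookup_keyword_0 == "bar"

A plain equality like this is the kind of filter that could in principle be pushed down to Lucene one day, while the concat(...) form above most likely cannot, since it has to compute a value per row first.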

luigidellaquila marked this pull request as ready for review January 7, 2025 09:23
alex-spies self-requested a review January 7, 2025 09:51
@gbanasiak (Contributor) left a comment

Many thanks for iterating. I forgot to mention earlier: if the track is serverless-ready, please add joins in here. This can be done in a separate PR.

@luigidellaquila (Contributor Author)

Thanks @gbanasiak!
We are still fine-tuning the Serverless tests; I'll add it in a follow-up PR.

luigidellaquila merged commit b3c872f into elastic:master Jan 9, 2025
13 checks passed
@alex-spies left a comment

Thanks a lot Luigi! Looks good, although I do have suggestions for further iterations on the track.

I think in a follow-up, we should deliberately force the lookups onto the coordinator or the data nodes, respectively. See my comment below.

Additionally, I'm curious about how the following queries would perform:

  • lookup join against a lookup index where the non-lookup fields contain large values (long strings or many multi-values, for instance) - but maybe that's less relevant as the default dataset has 10 additional text fields already.
  • What happens when the lookup indices have many rows matching the same keys? Maybe the default dataset should have some repetitions?

The last one especially will be important IMHO: with the planned SQL-like semantics, the cardinality of the whole result set will change with every LOOKUP JOIN. But it's already interesting in the current state, as the current implementation may pool a lot of multi-values.
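
(A quick illustration, not from the PR: with SQL-like semantics, if each of n base rows finds k matching rows in the lookup index, a single LOOKUP JOIN yields n * k result rows, and chained joins multiply these factors. For example, 10M base rows with 3 repetitions per key would already produce 30M rows after one join.)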

{
"name": "esql_lookup_join_1k_100k_200k_500k",
"operation-type": "esql",
"query": "FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | rename lookup_keyword_0 as lk_1k | lookup join lookup_idx_100000_f10 on key_100000 | rename lookup_keyword_0 as lk_100k | lookup join lookup_idx_200000_f10 on key_200000 | rename lookup_keyword_0 as lk_200k | lookup join lookup_idx_500000_f10 on key_500000 | rename lookup_keyword_0 as lk_500k | keep id, key_1000, key_100000, key_200000, key_500000, lk_1k, lk_100k, lk_200k, lk_500k | limit 1000"
}


The limit 1000 at the end is pushed down, and therefore all lookups will happen on the coordinator node, I think. This means we also perform the lookups against at most 1000 rows.

The same is essentially true for all the other queries in this file, because we add an implicit LIMIT 1000 by default.

I think it'd be good to have versions of these queries where we add a SORT ... at the end - this should force the lookups onto the data nodes (best to confirm by looking at the plans, though).
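
A hypothetical sketch of such a variant (not in the track; the sorted field is arbitrary):

FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | sort lookup_keyword_0 | limit 1000

The SORT/LIMIT pair becomes a top-N that has to see every joined row, so the join can no longer run against just the first 1000 rows on the coordinator - assuming the optimizer behaves as described above; the plan should confirm it.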

Contributor Author

Good point, I think it makes sense to have this as a next iteration.
I'll add some queries right away.

Contributor

Actually, I realized I was being imprecise. The queries with a WHERE clause will not have the limit pushed down past the LOOKUP JOINs, because the WHERE clause should prevent the pushdown. So there are already queries that force the execution of LOOKUP JOINs onto the data nodes.
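
A minimal sketch of the contrast (hypothetical queries mirroring the ones in the track):

// the limit can be pushed below the join: at most ~1000 rows are joined
FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | limit 1000

// the WHERE reads a looked-up field, so the limit cannot move past the join
FROM join_base_idx | lookup join lookup_idx_1000_f10 on key_1000 | where lookup_keyword_0 == "x" | limit 1000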

Comment on lines +82 to +88
./lookup_idx.sh 1000 10 1 | shuf | bzip2 -c > lookup_idx_1000_f10.json.bz2
./lookup_idx.sh 100000 10 1 | shuf | bzip2 -c > lookup_idx_100000_f10.json.bz2
./lookup_idx.sh 200000 10 1 | shuf | bzip2 -c > lookup_idx_200000_f10.json.bz2
./lookup_idx.sh 500000 10 1 | shuf | bzip2 -c > lookup_idx_500000_f10.json.bz2
./lookup_idx.sh 1000000 10 1 | shuf | bzip2 -c > lookup_idx_1000000_f10.json.bz2
./lookup_idx.sh 5000000 10 1 | shuf | bzip2 -c > lookup_idx_5000000_f10.json.bz2
./joins_main_idx.sh 10000000 | bzip2 -c > join_base_idx-10M.json.bz2


nit: maybe a smaller number of lookup indices would be sufficient? Like 1k, 50k, 1M, 10M?

I think it's more interesting to have different numbers of key repetitions.
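
A rough sketch of how repetitions could be injected, assuming lookup_idx.sh emits one JSON document per line (the factor of 5 and the output file name are hypothetical):

# duplicate every generated document 5 times so each key matches 5 rows
./lookup_idx.sh 1000 10 1 | awk '{for (i = 0; i < 5; i++) print}' | shuf | bzip2 -c > lookup_idx_1000_f10_x5.json.bz2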

Comment on lines +137 to +143
{#
"operation": "esql_lookup_join_100k_keys_where_no_match",
"tags": ["lookup", "join"],
"clients": 1,
"warmup-iterations": 10,
"iterations": 50
#}


Are the ..._where_no_match challenges commented out on purpose?

Contributor Author

Yes, it's on purpose. These are super expensive and just time out.
My plan is to run some iterations with a higher timeout (tens of minutes) and test the limits here.
