# ES|QL: add track for LOOKUP JOIN scale tests #719
**Merged**: luigidellaquila merged 17 commits into `elastic:master` from `luigidellaquila:esql_join_scale` on Jan 9, 2025.
## Commits (17)
All commits are by luigidellaquila:

- `eeb1eb5` Clone nyc_taxis into joins track and create scripts to generate the c…
- `5101637` Fix corpus scripts
- `d4cde8a` Make it work
- `57a95ad` First join tracks
- `40476b7` more queries
- `7e6692e` Split ingestion and parameterize ingest percentage
- `2720b29` More queries
- `9dd9e01` README and cleanup
- `c8311ea` 100M lookup, scripts, README
- `6f342b8` Multiple joins
- `f68c12d` cleanup
- `29642c1` Add configuration for 1B docs base index
- `ea3145c` Add challenges for join of all dataset
- `8d4bf02` Fix test mode
- `899f013` Default shards/replicas count
- `8e8ecd9` Disable too expensive challenges
- `b051f04` implement review suggestions
## JOINS track

This track contains an artificial dataset intended to test JOIN operations with different key cardinalities.
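The queries in this track exercise ES|QL `LOOKUP JOIN` against lookup indexes of varying key cardinality. A representative query might look like the following sketch (the index names are illustrative, not taken from the track):

```esql
FROM join_idx
| LOOKUP JOIN lookup_idx_1000 ON key_1000
| STATS doc_count = COUNT(*) BY lookup_keyword_0
```

The higher the cardinality of the join key, the larger the lookup index that must be probed, which is exactly the dimension this track scales.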
The dataset can be generated using the scripts in the `_tools` directory.
### Example Documents

Main index:

```json
{
  "id": 56,
  "@timestamp": 946728056,
  "key_1000": "56",
  "key_100000": "56",
  "key_200000": "56",
  "key_500000": "56",
  "key_1000000": "56",
  "key_5000000": "56",
  "key_100000000": "56",
  "field_0": "text with value 0_56",
  "field_1": "text with value 1_56",
  "field_2": "text with value 2_56",
  ...
  "field_99": "text with value 99_56"
}
```
The cardinality of each key matches its name: e.g. `key_1000` has 1000 distinct values in the dataset,
from `0` to `999`. If the dataset is too small to contain all the values of a given cardinality, the key is
capped at the dataset size: e.g. with a dataset of 1000 documents, `key_100000000` will contain only
1000 distinct values, one per document. The IDs and the timestamps are sequential.
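In other words, the number of distinct values a key actually takes is `min(ndocs, cardinality)`, because values are generated as `id % cardinality`. A quick sketch:

```shell
# Distinct values a key can take in a corpus of ndocs documents:
# values are id % cardinality, so the count is min(ndocs, cardinality).
ndocs=1000
cardinality=100000000
echo $(( ndocs < cardinality ? ndocs : cardinality ))   # 1000
```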
### Parameters

This track allows overriding the following parameters using `--track-params`:

* `bulk_size` (default: 10000)
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested. It is applied to the main index and to the large join indexes (i.e. not to join indexes with up to 500K documents).
* `number_of_replicas` (default: 1): Only applies to the main index (not to lookup indexes).
* `number_of_shards` (default: 5): Only applies to the main index (not to lookup indexes).
* `source_mode` (default: stored): Should the `_source` be `stored` to disk exactly as sent (the default), thrown away (`disabled`), or reconstructed on the fly (`synthetic`).
* `index_settings`: A list of index settings. Index settings defined elsewhere need to be overridden explicitly.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `include_non_serverless_index_settings` (default: true for non-serverless clusters, false for serverless clusters): Whether to include non-serverless index settings.
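For example, a run with a smaller bulk size and a 10% ingest could be launched like this (the track name `joins` and the parameter values are assumptions; adapt them to your setup):

```shell
# Hypothetical Rally invocation overriding some of the parameters above.
esrally race --track=joins \
  --track-params="bulk_size:5000,ingest_percentage:10,number_of_replicas:0"
```

Parameters not listed in `--track-params` keep their defaults.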
### License

According to the [Open Data Law](https://opendata.cityofnewyork.us/open-data-law/) this data is available in the public domain.
## About JOINS datasets
### Contents

This directory contains two scripts:

- `joins_main_idx.sh` - generates the main index, i.e. the one intended to be used in the FROM clause of the query.
- `lookup_idx.sh` - generates the lookup (join) indexes, i.e. those intended to be used in the JOIN command.
### Generating the main index

`joins_main_idx.sh` generates JSON documents with the following fields:

- `id`: numeric (incremental)
- `@timestamp`: numeric, with value `id + 946728000`
- `key_1000`: string, with value `id % 1000`, intended to be a foreign key to a lookup index
- `key_100000`: string, with value `id % 100000`, intended to be a foreign key to a lookup index
- `key_200000`: string, with value `id % 200000`, intended to be a foreign key to a lookup index
- `key_500000`: string, with value `id % 500000`, intended to be a foreign key to a lookup index
- `key_1000000`: string, with value `id % 1000000`, intended to be a foreign key to a lookup index
- `key_5000000`: string, with value `id % 5000000`, intended to be a foreign key to a lookup index
- `key_100000000`: string, with value `id % 100000000`, intended to be a foreign key to a lookup index
- 100 additional text fields (`field_0` to `field_99`)
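The derived fields can be reproduced with a couple of lines of shell (showing only `id`, `@timestamp` and one key; the full generator script is included further down in this PR):

```shell
# Reproduce the derived fields for id = 56; this matches the example
# document in the track README (@timestamp 946728056, key_1000 "56").
id=56
printf '{"id": %d, "@timestamp": %d, "key_1000": "%s"}\n' \
  "$id" "$((id + 946728000))" "$((id % 1000))"
```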
|
|
||
| By default it produces 1000 documents (one per row), but the number can be changed passing a command line argument. | ||
|
|
||
#### Example usage

Generate a file with 50,000 documents and bzip it:

```shell
./joins_main_idx.sh 50000 | bzip2 -c > join_base_idx.json.bz2
```
### Generating the lookup indexes

`lookup_idx.sh` produces a lookup index.

It accepts three parameters as input:
- `cardinality` (default 1000): the number of keys to be generated
- `fields` (default 10): the number of additional fields per document
- `repetitions` (default 1): the number of repetitions per key
The result will be a file with the following fields:

- `key_<cardinality>`: a text field containing the lookup key (practically, it's just a sequential number).
  Since the default cardinality is 1000, the name of this field will be `key_1000` by default.
  Passing a different cardinality as input will also result in a different field name.
- `M` additional fields (`M` is defined by the `fields` input param), called `lookup_keyword_0`, `lookup_keyword_1`...`lookup_keyword_M-1`,
  each containing the string `val <id> rep <repetition>`, where the id is the value of the key and the repetition is the repetition counter (example below).
#### Example usage

Generate a lookup dataset with 20,000 keys, repeated 3 times each (i.e. 60,000 documents in total), with 5 additional text fields,
then shuffle the rows and bzip the result:

```shell
./lookup_idx.sh 20000 5 3 | shuf | bzip2 -c > my_lookup_idx.json.bz2
```
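The document count of a lookup corpus is always `cardinality × repetitions`, which a stripped-down version of the generator's nested loop confirms:

```shell
# Minimal re-creation of the generator's nesting: one line per
# (key, repetition) pair, so 100 keys x 3 repetitions = 300 documents.
cardinality=100
repetitions=3
for ((id = 0; id < cardinality; id++)); do
  for ((r = 0; r < repetitions; r++)); do
    echo "{\"key_$cardinality\": \"$id\"}"
  done
done | wc -l   # 300
```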
The generated file will look like:

```json
...
{"key_20000": "15", "lookup_keyword_0": "val 15 rep 0", ... "lookup_keyword_4": "val 15 rep 0"}
...
```
## The default dataset

The dataset for this benchmark was generated with the following commands:
```shell
./lookup_idx.sh 1000 10 1 | shuf | bzip2 -c > lookup_idx_1000_f10.json.bz2
./lookup_idx.sh 100000 10 1 | shuf | bzip2 -c > lookup_idx_100000_f10.json.bz2
./lookup_idx.sh 200000 10 1 | shuf | bzip2 -c > lookup_idx_200000_f10.json.bz2
./lookup_idx.sh 500000 10 1 | shuf | bzip2 -c > lookup_idx_500000_f10.json.bz2
./lookup_idx.sh 1000000 10 1 | shuf | bzip2 -c > lookup_idx_1000000_f10.json.bz2
./lookup_idx.sh 5000000 10 1 | shuf | bzip2 -c > lookup_idx_5000000_f10.json.bz2
./joins_main_idx.sh 10000000 | bzip2 -c > join_base_idx-10M.json.bz2
```
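Since all the lookup commands above use `repetitions = 1`, each lookup corpus contains exactly `cardinality` documents; a quick sanity check on the totals (not part of the track itself):

```shell
# Total documents across the six default lookup corpora.
echo $(( 1000 + 100000 + 200000 + 500000 + 1000000 + 5000000 ))   # 6801000
```

The main corpus generated by `joins_main_idx.sh 10000000` adds another 10,000,000 documents on top of that.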
`_tools/joins_main_idx.sh`:

```shell
#!/bin/zsh

# Generate the corpus of the main index with the following fields:
#   id: sequential
#   @timestamp: sequential, starting Jan 1st 2000
#   key_1000: a keyword field with cardinality 1,000
#   key_100000: a keyword field with cardinality 100,000
#   key_200000: a keyword field with cardinality 200,000
#   key_500000: a keyword field with cardinality 500,000
#   key_1000000: a keyword field with cardinality 1,000,000
#   key_5000000: a keyword field with cardinality 5,000,000
#   key_100000000: a keyword field with cardinality 100,000,000
#   field_0..99: 100 text fields
#
# By default the script produces 1000 documents with 100 additional keyword
# fields, but these defaults can be overridden by passing two command line
# arguments, e.g.
#
#   ./joins_main_idx.sh 100000 3
#
# produces 100,000 documents with 3 additional fields each.

if [ "$#" -eq 0 ]; then
    ndocs=1000
    fields=100
elif [ "$#" -eq 2 ]; then
    ndocs=$1
    fields=$2
else
    echo "This script accepts zero or two arguments: number of docs, number of additional fields"
    echo "eg."
    echo
    echo "./joins_main_idx.sh 100 20"
    echo
    echo "will produce 100 documents, each document with 20 additional keyword fields"
    echo
    echo "With no arguments:"
    echo
    echo "./joins_main_idx.sh"
    echo
    echo "will produce 1000 documents, each document with 100 additional keyword fields"
    exit 1
fi

for ((id = 0; id < ndocs; id++)); do
    echo -n '{'
    echo -n '"id": '$id
    echo -n ', "@timestamp": '$((id+946728000))
    echo -n ', "key_1000": "'$((id%1000))'"'
    echo -n ', "key_100000": "'$((id%100000))'"'
    echo -n ', "key_200000": "'$((id%200000))'"'
    echo -n ', "key_500000": "'$((id%500000))'"'
    echo -n ', "key_1000000": "'$((id%1000000))'"'
    echo -n ', "key_5000000": "'$((id%5000000))'"'
    echo -n ', "key_100000000": "'$((id%100000000))'"'
    for ((i = 0; i < fields; i++)); do
        echo -n ', "field_'$i'": "value '$i'_'$id'"'
    done
    echo '}'
done
```
`_tools/lookup_idx.sh`:

```shell
#!/bin/zsh

# Generate the corpus of a lookup index with the following fields:
#   key_<cardinality>: a keyword field with cardinality <cardinality>
#   lookup_keyword_0..n-1: n keyword fields
#
# By default the script produces 1000 documents with keys 0...999 and
# 10 keyword fields per doc.
#
# These defaults can be overridden by passing the following command line arguments:
#   - cardinality
#   - n of fields
#   - n of repetitions
#
# The final number of documents will be cardinality x n of repetitions, e.g.
#
#   ./lookup_idx.sh 1000 20 3
#
# produces 3000 documents; each key will be repeated three times and each
# document will have 20 keyword fields.

if [ "$#" -eq 0 ]; then
    cardinality=1000
    fields=10
    repetitions=1
elif [ "$#" -eq 3 ]; then
    cardinality=$1
    fields=$2
    repetitions=$3
else
    echo "This script accepts zero or three arguments: cardinality, number of fields and number of repetitions"
    echo "eg."
    echo
    echo "./lookup_idx.sh 100 20 3"
    echo
    echo "will produce 300 documents, 100 keys (repeated three times), each document with 20 keyword fields"
    exit 1
fi

for ((id = 0; id < cardinality; id++)); do
    for ((repetition = 0; repetition < repetitions; repetition++)); do
        echo -n '{'
        echo -n '"key_'$cardinality'": "'$id'"'
        for ((i = 0; i < fields; i++)); do
            echo -n ', "lookup_keyword_'$i'": "val '$id' rep '$repetition'"'
        done
        echo '}'
    done
done
```
Review comment:

> nit: maybe a lower number of lookup indices would be sufficient? Like, 1k, 50k, 1M, 10M?
> I think it's more interesting to have different numbers of repetitions, IMHO.