# ES|QL: add track for LOOKUP JOIN scale tests #719
**Merged**: luigidellaquila merged 17 commits into `elastic:master` from `luigidellaquila:esql_join_scale` on Jan 9, 2025.
## Commits (17)
All commits are by luigidellaquila:

- `eeb1eb5` Clone nyc_taxis into joins track and create scripts to generate the c…
- `5101637` Fix corpus scripts
- `d4cde8a` Make it work
- `57a95ad` First join tracks
- `40476b7` more queries
- `7e6692e` Split ingestion and parameterize ingest percentage
- `2720b29` More queries
- `9dd9e01` README and cleanup
- `c8311ea` 100M lookup, scripts, README
- `6f342b8` Multiple joins
- `f68c12d` cleanup
- `29642c1` Add configuration for 1B docs base index
- `ea3145c` Add challenges for join of all dataset
- `8d4bf02` Fix test mode
- `899f013` Default shards/replicas count
- `8e8ecd9` Disable too expensive challenges
- `b051f04` implement review suggestions
## JOINS track

This track contains an artificial dataset intended to test JOIN operations with different key cardinalities.
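The queries in this track exercise ES|QL `LOOKUP JOIN` against lookup indexes of varying key cardinality. A representative query might look like the following sketch (the index names are illustrative, not taken from the track):

```esql
FROM join_idx
| LOOKUP JOIN lookup_idx_1000 ON key_1000
| STATS doc_count = COUNT(*) BY lookup_keyword_0
```

The higher the cardinality of the join key, the larger the lookup index that must be probed, which is exactly the dimension this track scales.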
The dataset can be generated using the scripts in the `_tools` directory.
### Example Documents

Main index:

```json
{
  "id": 56,
  "@timestamp": 946728056,
  "key_1000": "56",
  "key_100000": "56",
  "key_200000": "56",
  "key_500000": "56",
  "key_1000000": "56",
  "key_5000000": "56",
  "key_100000000": "56",
  "field_0": "text with value 0_56",
  "field_1": "text with value 1_56",
  "field_2": "text with value 2_56",
  ...
  "field_99": "text with value 99_56"
}
```
The cardinality of each key matches its name: e.g. `key_1000` has 1000 distinct values in the dataset,
from `0` to `999`. If the dataset is too small to contain all the values of a given cardinality, the key is
capped at the dataset size: e.g. with a dataset of 1000 documents, `key_100000000` will contain only
1000 distinct values, one per document. The IDs and the timestamps are sequential.
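In other words, the number of distinct values a key actually takes is `min(ndocs, cardinality)`, because values are generated as `id % cardinality`. A quick sketch:

```shell
# Distinct values a key can take in a corpus of ndocs documents:
# values are id % cardinality, so the count is min(ndocs, cardinality).
ndocs=1000
cardinality=100000000
echo $(( ndocs < cardinality ? ndocs : cardinality ))   # 1000
```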
### Parameters

This track allows overriding the following parameters using `--track-params`:

* `bulk_size` (default: 10000)
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `ingest_percentage` (default: 100): A number between 0 and 100 that defines how much of the document corpus should be ingested. It is applied to the main index and to the large join indexes (i.e. not to join indexes with up to 500K documents).
* `number_of_replicas` (default: 1): Only applies to the main index (not to lookup indexes).
* `number_of_shards` (default: 5): Only applies to the main index (not to lookup indexes).
* `source_mode` (default: stored): Should the `_source` be `stored` to disk exactly as sent (the default), thrown away (`disabled`), or reconstructed on the fly (`synthetic`).
* `index_settings`: A list of index settings. Index settings defined elsewhere need to be overridden explicitly.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `include_non_serverless_index_settings` (default: true for non-serverless clusters, false for serverless clusters): Whether to include non-serverless index settings.
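For example, a run with a smaller bulk size and a 10% ingest could be launched like this (the track name `joins` and the parameter values are assumptions; adapt them to your setup):

```shell
# Hypothetical Rally invocation overriding some of the parameters above.
esrally race --track=joins \
  --track-params="bulk_size:5000,ingest_percentage:10,number_of_replicas:0"
```

Parameters not listed in `--track-params` keep their defaults.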
### License

According to the [Open Data Law](https://opendata.cityofnewyork.us/open-data-law/) this data is available in the public domain.
## About JOINS datasets
### Contents

This directory contains two scripts:

- `joins_main_idx.sh` - generates the main index, i.e. the one intended to be used in the FROM clause of the query.
- `lookup_idx.sh` - generates the lookup (join) indexes, i.e. those intended to be used in the JOIN command.
### Generating the main index

`joins_main_idx.sh` generates JSON documents with the following fields:

- `id`: numeric (incremental)
- `@timestamp`: numeric, with value `id + 946728000`
- `key_1000`: string, with value `id % 1000`, intended to be a foreign key to a lookup index
- `key_100000`: string, with value `id % 100000`, intended to be a foreign key to a lookup index
- `key_200000`: string, with value `id % 200000`, intended to be a foreign key to a lookup index
- `key_500000`: string, with value `id % 500000`, intended to be a foreign key to a lookup index
- `key_1000000`: string, with value `id % 1000000`, intended to be a foreign key to a lookup index
- `key_5000000`: string, with value `id % 5000000`, intended to be a foreign key to a lookup index
- `key_100000000`: string, with value `id % 100000000`, intended to be a foreign key to a lookup index
- 100 additional text fields (`field_0` to `field_99`)
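The derived fields can be reproduced with a couple of lines of shell (showing only `id`, `@timestamp` and one key; the full generator script is included further down in this PR):

```shell
# Reproduce the derived fields for id = 56; this matches the example
# document in the track README (@timestamp 946728056, key_1000 "56").
id=56
printf '{"id": %d, "@timestamp": %d, "key_1000": "%s"}\n' \
  "$id" "$((id + 946728000))" "$((id % 1000))"
```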
|
|
||
| By default it produces 1000 documents (one per row), but the number can be changed passing a command line argument. | ||
|
|
||
#### Example usage

Generate a file with 50,000 documents and bzip it:

```shell
./joins_main_idx.sh 50000 | bzip2 -c > join_base_idx.json.bz2
```
### Generating the lookup indexes

`lookup_idx.sh` produces a lookup index.

It accepts three parameters as input:
- `cardinality` (default 1000): the number of keys to be generated
- `fields` (default 10): the number of additional fields per document
- `repetitions` (default 1): the number of repetitions per key
The result will be a file with the following fields:

- `key_<cardinality>`: a text field containing the lookup key (practically, it's just a sequential number).
  Since the default cardinality is 1000, the name of this field will be `key_1000` by default.
  Passing a different cardinality as input will also result in a different field name.
- `M` additional fields (`M` is defined by the `fields` input param), called `lookup_keyword_0`, `lookup_keyword_1`...`lookup_keyword_M-1`,
  each containing the string `val <id> rep <repetition>`, where the id is the value of the key and the repetition is the repetition counter (example below).
#### Example usage

Generate a lookup dataset with 20,000 keys, repeated 3 times each (i.e. 60,000 documents in total), with 5 additional text fields,
then shuffle the rows and bzip the result:

```shell
./lookup_idx.sh 20000 5 3 | shuf | bzip2 -c > my_lookup_idx.json.bz2
```
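The document count of a lookup corpus is always `cardinality × repetitions`, which a stripped-down version of the generator's nested loop confirms:

```shell
# Minimal re-creation of the generator's nesting: one line per
# (key, repetition) pair, so 100 keys x 3 repetitions = 300 documents.
cardinality=100
repetitions=3
for ((id = 0; id < cardinality; id++)); do
  for ((r = 0; r < repetitions; r++)); do
    echo "{\"key_$cardinality\": \"$id\"}"
  done
done | wc -l   # 300
```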
The generated file will look like:

```json
...
{"key_20000": "15", "lookup_keyword_0": "val 15 rep 0", ... "lookup_keyword_4": "val 15 rep 0"}
...
```
## The default dataset

The dataset for this benchmark was generated with the following commands:
```shell
./lookup_idx.sh 1000 10 1 | shuf | bzip2 -c > lookup_idx_1000_f10.json.bz2
./lookup_idx.sh 100000 10 1 | shuf | bzip2 -c > lookup_idx_100000_f10.json.bz2
./lookup_idx.sh 200000 10 1 | shuf | bzip2 -c > lookup_idx_200000_f10.json.bz2
./lookup_idx.sh 500000 10 1 | shuf | bzip2 -c > lookup_idx_500000_f10.json.bz2
./lookup_idx.sh 1000000 10 1 | shuf | bzip2 -c > lookup_idx_1000000_f10.json.bz2
./lookup_idx.sh 5000000 10 1 | shuf | bzip2 -c > lookup_idx_5000000_f10.json.bz2
./joins_main_idx.sh 10000000 | bzip2 -c > join_base_idx-10M.json.bz2
```
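Since all the lookup commands above use `repetitions = 1`, each lookup corpus contains exactly `cardinality` documents; a quick sanity check on the totals (not part of the track itself):

```shell
# Total documents across the six default lookup corpora.
echo $(( 1000 + 100000 + 200000 + 500000 + 1000000 + 5000000 ))   # 6801000
```

The main corpus generated by `joins_main_idx.sh 10000000` adds another 10,000,000 documents on top of that.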
`_tools/joins_main_idx.sh`:

```shell
#!/bin/zsh

# Generate the corpus of the main index with the following fields:
#   id: sequential
#   @timestamp: sequential, starting Jan 1st 2000
#   key_1000: a keyword field with cardinality 1,000
#   key_100000: a keyword field with cardinality 100,000
#   key_200000: a keyword field with cardinality 200,000
#   key_500000: a keyword field with cardinality 500,000
#   key_1000000: a keyword field with cardinality 1,000,000
#   key_5000000: a keyword field with cardinality 5,000,000
#   key_100000000: a keyword field with cardinality 100,000,000
#   field_0..99: 100 text fields
#
# By default the script produces 1000 documents with 100 additional keyword
# fields, but these defaults can be overridden by passing two command line
# arguments, e.g.
#
#   ./joins_main_idx.sh 100000 3
#
# produces 100,000 documents with 3 additional fields each.

if [ "$#" -eq 0 ]; then
    ndocs=1000
    fields=100
elif [ "$#" -eq 2 ]; then
    ndocs=$1
    fields=$2
else
    echo "This script accepts zero or two arguments: number of docs, number of additional fields"
    echo "eg."
    echo
    echo "./joins_main_idx.sh 100 20"
    echo
    echo "will produce 100 documents, each document with 20 additional keyword fields"
    echo
    echo "With no arguments:"
    echo
    echo "./joins_main_idx.sh"
    echo
    echo "will produce 1000 documents, each document with 100 additional keyword fields"
    exit 1
fi

for ((id = 0; id < ndocs; id++)); do
    echo -n '{'
    echo -n '"id": '$id
    echo -n ', "@timestamp": '$((id+946728000))
    echo -n ', "key_1000": "'$((id%1000))'"'
    echo -n ', "key_100000": "'$((id%100000))'"'
    echo -n ', "key_200000": "'$((id%200000))'"'
    echo -n ', "key_500000": "'$((id%500000))'"'
    echo -n ', "key_1000000": "'$((id%1000000))'"'
    echo -n ', "key_5000000": "'$((id%5000000))'"'
    echo -n ', "key_100000000": "'$((id%100000000))'"'
    for ((i = 0; i < fields; i++)); do
        echo -n ', "field_'$i'": "value '$i'_'$id'"'
    done
    echo '}'
done
```
`_tools/lookup_idx.sh`:

```shell
#!/bin/zsh

# Generate the corpus of a lookup index with the following fields:
#   key_<cardinality>: a keyword field with cardinality <cardinality>
#   lookup_keyword_0..n-1: n keyword fields
#
# By default the script produces 1000 documents with keys 0...999 and
# 10 keyword fields per doc.
#
# These defaults can be overridden by passing the following command line arguments:
#   - cardinality
#   - n of fields
#   - n of repetitions
#
# The final number of documents will be cardinality x n of repetitions, e.g.
#
#   ./lookup_idx.sh 1000 20 3
#
# produces 3000 documents; each key will be repeated three times and each
# document will have 20 keyword fields.

if [ "$#" -eq 0 ]; then
    cardinality=1000
    fields=10
    repetitions=1
elif [ "$#" -eq 3 ]; then
    cardinality=$1
    fields=$2
    repetitions=$3
else
    echo "This script accepts zero or three arguments: cardinality, number of fields and number of repetitions"
    echo "eg."
    echo
    echo "./lookup_idx.sh 100 20 3"
    echo
    echo "will produce 300 documents, 100 keys (repeated three times), each document with 20 keyword fields"
    exit 1
fi

for ((id = 0; id < cardinality; id++)); do
    for ((repetition = 0; repetition < repetitions; repetition++)); do
        echo -n '{'
        echo -n '"key_'$cardinality'": "'$id'"'
        for ((i = 0; i < fields; i++)); do
            echo -n ', "lookup_keyword_'$i'": "val '$id' rep '$repetition'"'
        done
        echo '}'
    done
done
```
Review comment:

> nit: maybe a lower number of lookup indices would be sufficient? Like, 1k, 50k, 1M, 10M?
> I think it's more interesting to have different numbers of repetitions, IMHO.