diff --git a/_benchmark/features/index.md b/_benchmark/features/index.md new file mode 100644 index 00000000000..35a5f65d987 --- /dev/null +++ b/_benchmark/features/index.md @@ -0,0 +1,19 @@ +--- +layout: default +title: Additional features +nav_order: 30 +has_children: true +has_toc: false +redirect_from: + - /benchmark/features/ +more_cards: + - heading: "Synthetic data generation" + description: "Create synthetic datasets using index mappings or custom Python logic for comprehensive benchmarking and testing." + link: "/benchmark/features/synthetic-data-generation/" +--- + +# Additional features + +In addition to general benchmarking, OpenSearch Benchmark provides several specialized features. + +{% include cards.html cards=page.more_cards %} \ No newline at end of file diff --git a/_benchmark/features/synthetic-data-generation/custom-logic-sdg.md b/_benchmark/features/synthetic-data-generation/custom-logic-sdg.md new file mode 100644 index 00000000000..96292aa758e --- /dev/null +++ b/_benchmark/features/synthetic-data-generation/custom-logic-sdg.md @@ -0,0 +1,257 @@ +--- +layout: default +title: Generating data using custom logic +nav_order: 35 +parent: Synthetic data generation +grand_parent: Additional features +--- + +# Generating data using custom logic + +You can generate synthetic data using custom logic defined in a Python module. This approach offers you the most granular control over how synthetic data is produced in OpenSearch Benchmark. This is especially useful if you understand the distribution of your data and the relationship between different fields. + +## The generate_synthetic_document function + +Every custom module provided to OpenSearch Benchmark must define the `generate_synthetic_document(providers, **custom_lists)` function. This function defines how OpenSearch Benchmark generates each synthetic document. + +### Function parameters + +| Parameter | Required/Optional | Description | +|---|---|---| +| `providers` | Required | A dictionary containing data generation tools. Available providers are `generic` (Mimesis [Generic provider](https://mimesis.name/master/api.html#generic-providers)) and `random` (Mimesis [Random class](https://mimesis.name/master/random_and_seed.html)). To add custom providers, see [Advanced configuration](#advanced-configuration). | +| `custom_lists` | Optional | Keyword arguments containing predefined lists of values that you can use in your data generation logic. These are defined in your YAML configuration file under `custom_lists` and allow you to separate data values from your Python code. For example, if you define `dog_names: [Buddy, Max, Luna]` in YAML, you can access it as `custom_lists['dog_names']` in your function. This makes it easy to modify data values without changing your Python code. | + +### Basic function template + +```python +def generate_synthetic_document(providers, **custom_lists): + # Access the available providers + generic = providers['generic'] + random_provider = providers['random'] + + # Generate a document using the providers + document = { + 'name': generic.person.full_name(), + 'age': random_provider.randint(18, 80), + 'email': generic.person.email(), + 'timestamp': generic.datetime.datetime() + } + + # Optionally, use custom lists if provided + if 'categories' in custom_lists: + document['category'] = random_provider.choice(custom_lists['categories']) + + return document +``` +{% include copy.html %} + +For more information, see the [Mimesis documentation](https://mimesis.name/master/api.html). 
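+
+To sanity-check a custom module before running a full generation, you can call the function directly with Mimesis providers. The following sketch is illustrative only: it assumes that the basic template above is saved as `my_module.py` (a hypothetical file name) and that the `mimesis` package is installed. During an actual run, OpenSearch Benchmark builds the `providers` dictionary and passes any `custom_lists` values from your configuration for you.
+
+```python
+# Local sanity check (run outside OpenSearch Benchmark): build a providers
+# dictionary with Mimesis and print one sample document from the template above.
+from mimesis import Generic
+from mimesis.random import Random
+
+from my_module import generate_synthetic_document  # hypothetical module name
+
+providers = {'generic': Generic(), 'random': Random()}
+
+# 'categories' mirrors a custom_lists entry that would normally come from a YAML configuration file.
+sample = generate_synthetic_document(providers, categories=['books', 'games', 'toys'])
+print(sample)
+```
+{% include copy.html %}
+
+Alternatively, you can pass the `--test-document` flag to the `generate-data` command to print a single generated document to the console instead of generating the full corpus.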
+ +## Python module example + +The following example Python module demonstrates custom logic for generating documents about dog drivers for a fictional ride-sharing company, *Pawber*, which uses OpenSearch to store and search large volumes of ride-sharing data. + +This example showcases several advanced concepts: +- **[Custom provider classes](#advanced-configuration)** (`NumericString`, `MultipleChoices`) that extend Mimesis functionality +- **[Custom lists](#advanced-configuration)** for data values like dog names, breeds, and treats (referenced as `custom_lists['dog_names']`) +- **Geographic clustering** logic for realistic location data +- **Complex document structures** with nested objects and relationships + +Save this code to a file called `pawber.py` in your desired directory (for example, `~/pawber.py`): + +```python +from mimesis.providers.base import BaseProvider +from mimesis.enums import TimestampFormat + +import random + +GEOGRAPHIC_CLUSTERS = { + 'Manhattan': { + 'center': {'lat': 40.7831, 'lon': -73.9712}, + 'radius': 0.05 # degrees + }, + 'Brooklyn': { + 'center': {'lat': 40.6782, 'lon': -73.9442}, + 'radius': 0.05 + }, + 'Austin': { + 'center': {'lat': 30.2672, 'lon': -97.7431}, + 'radius': 0.1 # Increased radius to cover more of Austin + } +} + +def generate_location(cluster): + """Generate a random location within a cluster""" + center = GEOGRAPHIC_CLUSTERS[cluster]['center'] + radius = GEOGRAPHIC_CLUSTERS[cluster]['radius'] + lat = center['lat'] + random.uniform(-radius, radius) + lon = center['lon'] + random.uniform(-radius, radius) + return {'lat': lat, 'lon': lon} + +class NumericString(BaseProvider): + class Meta: + name = "numeric_string" + + @staticmethod + def generate(length=5) -> str: + return ''.join([str(random.randint(0, 9)) for _ in range(length)]) + +class MultipleChoices(BaseProvider): + class Meta: + name = "multiple_choices" + + @staticmethod + def generate(choices, num_of_choices=5) -> str: + import logging + logger = logging.getLogger(__name__) + logger.info("Choices: %s", choices) + logger.info("Length: %s", num_of_choices) + total_choices_available = len(choices) - 1 + + return [choices[random.randint(0, total_choices_available)] for _ in range(num_of_choices)] + +def generate_synthetic_document(providers, **custom_lists): + generic = providers['generic'] + random_mimesis = providers['random'] + + first_name = generic.person.first_name() + last_name = generic.person.last_name() + city = random.choice(list(GEOGRAPHIC_CLUSTERS.keys())) + + # Driver Document + document = { + "dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}", + "dog_name": random_mimesis.choice(custom_lists['dog_names']), + "dog_breed": random_mimesis.choice(custom_lists['dog_breeds']), + "license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}", + "favorite_treats": random_mimesis.choice(custom_lists['treats']), + "preferred_tip": random_mimesis.choice(custom_lists['tips']), + "vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']), + "vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']), + "vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']), + "vehicle_year": random_mimesis.choice(custom_lists['vehicle_years']), + "vehicle_color": random_mimesis.choice(custom_lists['vehicle_colors']), + "license_plate": random_mimesis.choice(custom_lists['license_plates']), + "current_location": generate_location(city), + "status": random.choice(['available', 'busy', 
'offline']), + "current_ride": f"R{generic.numeric_string.generate(length=6)}", + "account_status": random_mimesis.choice(custom_lists['account_status']), + "join_date": generic.datetime.formatted_date(), + "total_rides": generic.numeric.integer_number(start=1, end=200), + "rating": generic.numeric.float_number(start=1.0, end=5.0, precision=2), + "earnings": { + "today": { + "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), + "currency": "USD" + }, + "this_week": { + "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), + "currency": "USD" + }, + "this_month": { + "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), + "currency": "USD" + } + }, + "last_grooming_check": "2023-12-01", + "owner": { + "first_name": first_name, + "last_name": last_name, + "email": f"{first_name}{last_name}@gmail.com" + }, + "special_skills": generic.multiple_choices.generate(custom_lists['skills'], num_of_choices=3), + "bark_volume": generic.numeric.float_number(start=1.0, end=10.0, precision=2), + "tail_wag_speed": generic.numeric.float_number(start=1.0, end=10.0, precision=1) + } + + return document +``` +{% include copy.html %} + +## Generating data + +To generate synthetic data using custom logic, use the `generate-data` subcommand and provide the required custom Python module, index name, output path, and total amount of data to generate: + +```shell +osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2 +``` +{% include copy.html %} + +For a complete list of available parameters and their descriptions, see the [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/). + +## Example output + +The following is an example output of generating 100 GB of data: + +``` + ____ _____ __ ____ __ __ + / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ + / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ +/ /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< +\____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| + /_/ + + +[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status] +[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard. +[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance: +ssh -i -N -L localhost:8787:localhost:8787 ec2-user@ + +Total GB to generate: [1] +Average document size in bytes: [412] +Max file size in GB: [40] + +100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s] + +Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB. +✅ Visit the following path to view synthetically generated data: /home/ec2-user/ + +----------------------------------- +[INFO] ✅ SUCCESS (took 272 seconds) +----------------------------------- +``` + +## Advanced configuration + +You can optionally create a YAML configuration file to store custom data and providers. The configuration file must define a `CustomGenerationValues` parameter. + +The following parameters are available in `CustomGenerationValues`. Both parameters are optional. 
+ +| Parameter | Required/Optional | Description | +|---|---|---| +| `custom_lists` | Optional | Predefined arrays of values that you can reference in your Python module using `custom_lists['list_name']`. This allows you to separate data values from your code logic, making it easy to modify data values without changing your Python file. For example, `dog_names: [Buddy, Max, Luna]` becomes accessible as `custom_lists['dog_names']`. | +| `custom_providers` | Optional | Custom data generation classes that extend Mimesis functionality. These should be defined as classes in your Python module (like `NumericString` or `MultipleChoices` in the [example](#python-module-example)) and then listed in this parameter by name. This allows you to create specialized data generators beyond what Mimesis provides by default. | + +### Example configuration file + +Save your configuration in a YAML file: + +```yml +CustomGenerationValues: + # Generate data using a custom Python module + custom_lists: + # Custom lists to consolidate all values in this YAML file + dog_names: [Hana, Youpie, Charlie, Lucy, Cooper, Luna, Rocky, Daisy, Buddy, Molly] + dog_breeds: [Jindo, Labrador, German Shepherd, Golden Retriever, Bulldog, Poodle, Beagle, Rottweiler, Boxer, Dachshund, Chihuahua] + treats: [cookies, pup_cup, jerky] + custom_providers: + # OSB's synthetic data generator uses Mimesis; custom providers are essentially custom Python classes that adds more functionality to Mimesis + - NumericString + - MultipleChoices +``` +{% include copy.html %} + + +### Using the configuration + +To use your configuration file, add the `--custom-config` parameter to the `generate-data` command: + +```shell +osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2 --custom-config ~/Desktop/sdg-config.yml +``` +{% include copy.html %} + +## Related documentation + +- [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/) +- [Generating data using index mappings]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/mapping-sdg/) diff --git a/_benchmark/features/synthetic-data-generation/generating-vectors.md b/_benchmark/features/synthetic-data-generation/generating-vectors.md new file mode 100644 index 00000000000..0165e99d5e4 --- /dev/null +++ b/_benchmark/features/synthetic-data-generation/generating-vectors.md @@ -0,0 +1,543 @@ +--- +layout: default +title: Generating vectors +nav_order: 40 +parent: Synthetic data generation +grand_parent: Additional features +--- + +# Generating vectors + +You can generate synthetic dense and sparse vectors from mappings using OpenSearch Benchmark's synthetic data generator. + +## Dense vectors + +Dense vectors (represented by the [`knn_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/knn-vector/) field type in OpenSearch) are numerical representations of data, such as text or images, in which most or all dimensions have non-zero values. These vectors typically contain floating-point numbers between -1.0 and 1.0, with each dimension contributing to the overall meaning. 
+ +Example embedding for the word "dog": + +```json +{ + "embedding": [0.234, -0.567, 0.123, 0.891, -0.234, 0.456, ..., 0.789] +} +``` + +## Sparse vectors + +Sparse vectors (represented by the [`sparse_vector`]({{site.url}}{{site.baseurl}}/field-types/supported-field-types/sparse-vector/) field type in OpenSearch) are vectors in which most dimensions are zero, represented as key-value pairs of non-zero token IDs and their weights. Think of sparse vectors as a dictionary of important words with their importance scores, in which only significant terms are stored. + +Example text: "Korean jindos are hunting dogs that have a reputation for being loyal, independent, and confident." + +Sparse vector representation of example text: + +```json +{ + "5432": 0.85, // "korean" - very important (specific descriptor) + "7821": 0.78, // "jindos" - very important (breed name) + "2": 0.45, // "dog" - moderately important (general category) + "9999": 0.32, // "loyal" - somewhat important (characteristic) + "1111": 0.12 // "things" - less important (common word) +} +``` +--- + +## Basic usage + +The following examples show how to generate vectors with minimal configuration using only OpenSearch index mappings. + +### Generating dense vectors + +Generate random 128-dimensional vectors with minimal configuration. + +**1. Create a mapping file** (`simple-knn-mapping.json`): + +```json +{ + "settings": { + "index.knn": true + }, + "mappings": { + "properties": { + "title": {"type": "text"}, + "my_embedding": { + "type": "knn_vector", + "dimension": 128 + } + } + } +} +``` +{% include copy.html %} + +**2. Generate data**: + +```bash +opensearch-benchmark generate-data \ + --index-name my-vectors \ + --index-mappings simple-knn-mapping.json \ + --output-path ./output \ + --total-size 1 +``` +{% include copy.html %} + +#### Generated output + +In each of the generated documents, the `my_embedding` field might appear as follows: + +```json +{ + "title": "Sample text 42", + "my_embedding": [0.234, -0.567, 0.123, ..., 0.891] // 128 random floats [-1.0, 1.0] +} +``` + +### Generating sparse vectors + +Generate sparse vectors with default configuration (10 tokens). + +**1. Create a mapping file** (`simple-sparse-mapping.json`): + +```json +{ + "mappings": { + "properties": { + "content": {"type": "text"}, + "sparse_embedding": { + "type": "sparse_vector" + } + } + } +} +``` +{% include copy.html %} + +**2. Generate data** (same command pattern): + +```bash +opensearch-benchmark generate-data \ + --index-name my-sparse \ + --index-mappings simple-sparse-mapping.json \ + --output-path ./output \ + --total-size 1 +``` +{% include copy.html %} + +#### Generated output + +In each of the generated documents, the `sparse_embedding` field might appear as follows: + +```json +{ + "content": "Sample text content", + "sparse_embedding": { + "1000": 0.3421, + "1100": 0.5234, + "1200": 0.7821, + "1300": 0.1523, + "1400": 0.9102, + "1500": 0.4567, + "1600": 0.2341, + "1700": 0.6789, + "1800": 0.8123, + "1900": 0.3456 + } +} +``` + +Using only an OpenSearch index mapping, OSB can generate synthetic dense and sparse vectors. However, this produces basic synthetic vectors. For more realistic distributions and clusterings, we recommend configuring the parameters described in the following section. + +--- + +## Dense vector (k-NN vector) parameters + +The following are parameters that you can add to your synthetic data generation configuration file (YAML Config) to fine-tune generation of dense vectors. 
These parameters are used in the `field_overrides` section with the `generate_knn_vector` generator. For complete configuration details, see [Advanced configuration](/benchmark/features/synthetic-data-generation/mapping-sdg/#advanced-configuration). + +#### dimension + +This parameter specifies the number of dimensions in the vector. Optional. + +**How to specify**: The `dimension` must be defined in your OpenSearch index mapping file. You can optionally override this value in your YAML configuration using the `dimension` parameter in `field_overrides`. + +**Impact**: +- **Memory**: Higher dimensions = more storage + - 128D ≈ 0.5 KB per vector + - 768D ≈ 3 KB per vector + - 1536D ≈ 6 KB per vector +- **Performance**: More dimensions = slower indexing and search +- **Quality**: Must match your actual embedding model's output + +The following table shows common dimension values and their typical use cases. + +| Dimension | Use Case | Example Models | +|-----------|----------|----------------| +| 128 | Lightweight, custom models | Custom embeddings, fast search | +| 384 | General purpose | sentence-transformers/all-MiniLM-L6-v2 | +| 768 | Standard NLP | BERT-Base, DistilBERT, MPNet | +| 1024 | High quality NLP | BERT-Large | +| 1536 | OpenAI standard | text-embedding-ada-002, text-embedding-3-small | +| 3072 | OpenAI premium | text-embedding-3-large | + +**Example**: + +```yaml +field_overrides: + my_embedding: + generator: generate_knn_vector + params: + dimension: 768 # Override mapping dimension if needed +``` +{% include copy.html %} + +**Best practice**: This parameter must match your embedding model's dimension. + +--- + +#### sample_vectors + +This parameter provides base vectors to which the generator adds noise, creating realistic variations and clusters. Optional, but highly recommended. + +Without sample vectors, OSB's synthetic data generator generates random uniform vectors across the entire space, which is unrealistic and offers poor search quality. Providing sample vectors allows OSB's synthetic data generator to create more realistic and natural clusters. + +After you prepare a list of sample vectors, insert them as a **list of lists**, in which each inner list is a complete vector. The following example provides sample vectors in the synthetic data generation configuration file: + +```yaml +field_overrides: + product_embedding: + generator: generate_knn_vector + params: + dimension: 768 + sample_vectors: + - [0.12, -0.34, 0.56, ..., 0.23] # Vector 1 (768 values) + - [-0.23, 0.45, -0.12, ..., -0.15] # Vector 2 (768 values) + - [0.34, 0.21, -0.45, ..., 0.42] # Vector 3 (768 values) +``` +{% include copy.html %} + +Use the following guidelines to determine the number of vectors that you provide: + +- **Minimum**: 3--5 for basic clustering +- **Recommended**: 5--10 for realistic distribution +- **Maximum**: 20+ for complex multi-cluster scenarios + +**How to obtain sample vectors**: + +**Option 1: Using actual embeddings from your domain (Recommended)**: Use actual embeddings from your domain, representing different semantic clusters. Random generation without sample vectors produces unrealistic data unsuitable for search quality testing. 
+ +**Option 2: Using sentence-transformers** in Python: + +```python +from sentence_transformers import SentenceTransformer + +model = SentenceTransformer('all-MiniLM-L6-v2') + +# Create representative texts from different categories +texts = [ + "Electronics and gadgets", + "Clothing and fashion", + "Home and kitchen appliances", + "Books and literature", + "Sports and outdoor equipment" +] + +embeddings = model.encode(texts) +print(embeddings.tolist()) # Copy to your synthetic data generation configuration file (YAML config) +``` +{% include copy.html %} + +--- + +#### distribution_type + +This parameter specifies the type of noise distribution. Optional. Default is `gaussian`. + +**Valid values**: +- `gaussian`: Normal distribution N(0, `noise_factor`) + - Most realistic (natural variation with occasional outliers) + - Produces smooth clusters + - Some values can extend beyond expected range + +- `uniform`: Uniform distribution [-`noise_factor`, +`noise_factor`] + - Bounded variation (no extreme outliers) + - More predictable results + - Flat probability across range + +**Configuration**: +```yaml +field_overrides: + realistic_embedding: + generator: generate_knn_vector + params: + sample_vectors: [...] + noise_factor: 0.1 + distribution_type: gaussian # More realistic + + controlled_embedding: + generator: generate_knn_vector + params: + sample_vectors: [...] + noise_factor: 0.1 + distribution_type: uniform # More predictable +``` +{% include copy.html %} + +**Best practice**: Use `gaussian` for production-like benchmarks. + +--- + +#### noise_factor + +This parameter controls the amount of noise added to base vectors: +- For `gaussian`: Standard deviation of normal distribution +- For `uniform`: Range of uniform distribution (±`noise_factor`) + +Optional. Default is `0.1`. + +The following table shows how different `noise_factor` values impact the generated data. + +| `noise_factor` | Effect | Use Case | +|--------------|--------|----------| +| 0.01--0.05 | Tight clustering, minimal variation | Duplicate detection, near-exact matches | +| 0.1--0.2 | Natural variation within topic | General semantic search, recommendations | +| 0.3--0.5 | Wide dispersion, diverse concepts | Broad topic matching, discovery | +| > 0.5 | Very scattered, overlapping clusters | Testing edge cases, stress testing | + +**Configuration**: + +```yaml +field_overrides: + tight_clustering: + generator: generate_knn_vector + params: + sample_vectors: [...] + noise_factor: 0.05 # Tight clusters + + diverse_results: + generator: generate_knn_vector + params: + sample_vectors: [...] + noise_factor: 0.2 # More variation +``` +{% include copy.html %} + +**Best practice**: Start with `0.1`, then adjust based on search recall or precision requirements. + +--- + +#### normalize + +This parameter normalizes vectors after noise addition, making their magnitude (length) exactly `1.0`. Optional. Default is `false`. + +The following table shows when to set `normalize` to `true` based on your index configuration. + +| `space_type` in the index mapping | `normalize` value | Explanation | +| --------------------------------- | ----------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `cosinesimil` | `true` | Cosine similarity depends only on vector direction. Pre-normalizing improves performance because the dot product directly represents cosine similarity. 
| +| `l2` | `false` | L2 distance relies on vector magnitude. Normalizing removes magnitude information and reduces accuracy. | +| `innerproduct` | `false` | Inner product incorporates vector magnitude into the similarity score, so normalization would change the intended scoring behavior. | + +**Real-world model guidance**: + +* **OpenAI embeddings**: These vectors are pre-normalized, so set `normalize` to `true`. +* **sentence-transformers**: Many models output normalized vectors. Review the model documentation; in most cases, `normalize` should be set to `true`. +* **BERT (raw output)**: Raw BERT embeddings are not normalized. Set `normalize` to `false` and rely on the index configuration to perform normalization if needed. + + +**Configuration**: + +```yaml +field_overrides: + # For cosine similarity search + cosine_embedding: + generator: generate_knn_vector + params: + dimension: 384 + sample_vectors: [...] + normalize: true # Required for accurate cosine similarity + + # For L2 distance search + l2_embedding: + generator: generate_knn_vector + params: + dimension: 768 + sample_vectors: [...] + normalize: false # Keep original magnitudes +``` +{% include copy.html %} + +**Best practice**: Match your OpenSearch index's `space_type` setting. + +--- + +## Sparse vectors parameters + +The following are parameters that you can add to your synthetic data generation configuration file to finetune how sparse vectors are generated. These parameters are used in the `field_overrides` section with the `generate_sparse_vector` generator. For complete configuration details, see [Advanced configuration](/benchmark/features/synthetic-data-generation/mapping-sdg/#advanced-configuration). + +#### num_tokens + +This parameter specifies the number of token-weight pairs to generate per vector. Optional. Default is `10`. + +**Impact**: +- **Low (5--10)**: Very sparse, fast search, may miss some relevant documents +- **Medium (10--25)**: Balanced performance and recall +- **High (50--100)**: Dense sparse representation, comprehensive but slower + +The following table shows typical `num_tokens` values for different models and approaches. + +| Model/Approach | Typical `num_tokens` | Use Case | +|----------------|-------------------|----------| +| SPLADE v1 | 10--15 | Standard sparse neural search | +| SPLADE v2 | 15--25 | Improved recall | +| DeepImpact | 8--12 | Efficient sparse search | +| Custom/Hybrid | 20--50 | Rich representations | + +**Configuration**: + +```yaml +field_overrides: + sparse_standard: + generator: generate_sparse_vector + params: + num_tokens: 15 # Standard SPLADE-like + + sparse_rich: + generator: generate_sparse_vector + params: + num_tokens: 30 # Richer representation +``` +{% include copy.html %} + +**Best practice**: Start with `10--15`; increase if recall is insufficient. + +--- + +#### min_weight and max_weight + +These parameters define the range of token importance weights. Optional. Default `min_weight` is `0.01`; default `max_weight` is `1.0`. + +**Impact**: +- `min_weight`: Excludes low-importance tokens from generation. Tokens with weights below this value are not included. +- `max_weight`: Limits the upper bound of token influence to prevent any single token from dominating the vector. + +The following table shows common weight range configurations and their use cases. 
+ +| Configuration | `min_weight` | `max_weight` | Use case | +|---------------|-----|-----|----------| +| Standard SPLADE | `0.01` | `1.0` | Default, balanced importance | +| Narrow range | `0.1` | `0.9` | More uniform importance | +| Wide range | `0.01` | `2.0` | Strong importance signals | +| High threshold | `0.05` | `1.0` | Filters low-confidence tokens | + +**Configuration**: + +```yaml +field_overrides: + sparse_balanced: + generator: generate_sparse_vector + params: + num_tokens: 15 + min_weight: 0.01 + max_weight: 1.0 + + sparse_uniform: + generator: generate_sparse_vector + params: + num_tokens: 20 + min_weight: 0.2 # Higher minimum + max_weight: 0.8 # Lower maximum +``` +{% include copy.html %} + +**Constraints**: +- `min_weight` must be > `0.0` (OpenSearch requires positive weights) +- `max_weight` must be > `min_weight` +- Weights are rounded to `4` decimal places + +**Best practice**: Keep `min_weight` small (`0.01--0.05`) to allow nuanced weighting. + +--- + +#### token_id_start and token_id_step + +These parameters define how token IDs are assigned during vector generation: + +- `token_id_start`: Sets the starting token ID in the generated sequence. Default is `1000`. + +- `token_id_step`: Specifies the increment applied between each consecutive token ID. Default is `100`. + +**Generated sequence**: `start, start+step, start+2*step, ...` + +**Example** with `start=1000`, `step=100`, `num_tokens=5`: + +```json +{ + "1000": 0.3421, // token_id_start + "1100": 0.5234, // start + 1*step + "1200": 0.7821, // start + 2*step + "1300": 0.1523, // start + 3*step + "1400": 0.9102 // start + 4*step +} +``` +{% include copy.html %} + +The following table shows different token ID configurations and their use cases. + +| Configuration | `token_id_start` | `token_id_step` | Use case | +| --------------------------- | ----------------- | --------------- | ------------------------------------------------------------------ | +| Default testing | `1000` | `100` | Helps visually distinguish generated token ranges. | +| Realistic vocabulary | `0` | `1` | Aligns token IDs with a real model's vocabulary indices. | +| Multi-field generation | `1000`, `5000`, `10000` | `1` | Keeps token ID ranges separate across different fields. | +| Large vocabulary simulation | `0` | `1` | Supports generation scenarios with vocabularies of `50,000`+ tokens. | + +**Configuration**: + +```yaml +field_overrides: + # Default: easy debugging + sparse_debug: + generator: generate_sparse_vector + params: + num_tokens: 10 + token_id_start: 1000 + token_id_step: 100 + + # Realistic: actual vocab indices + sparse_realistic: + generator: generate_sparse_vector + params: + num_tokens: 15 + token_id_start: 0 + token_id_step: 1 + + # Multiple fields: separate ranges + sparse_field1: + generator: generate_sparse_vector + params: + token_id_start: 1000 + + sparse_field2: + generator: generate_sparse_vector + params: + token_id_start: 5000 +``` +{% include copy.html %} + +**Note**: Token IDs in the generated data are sequential. In real sparse vectors, IDs may be non-sequential based on the actual vocabulary. This difference does not impact OpenSearch indexing or search functionality. + +**Best practice**: Use a larger `token_id_step` (for example, `100`) for debugging, and set `token_id_step` to `1` for production-like data. + +--- + +## Choosing simple or complex generation approaches + +The following table outlines when to use simple generation versus a more complex, configurable approach based on your testing goals. 
+ +| Scenario | Recommended approach | Rationale | +| ------------------------- | ----------------------------------------------- | --------------------------------------------------------------------------------------------- | +| Learning or quick testing | Simple generation (no additional configuration) | Provides the fastest setup and is sufficient for basic validation. | +| Load testing | Simple generation | Prioritizes data volume and throughput over vector realism. | +| Realistic benchmarks | Complex generation (with configuration) | Requires realistic vector clustering and distributions to reflect real-world behavior. | +| Production simulation | Complex generation | Needs vector characteristics that closely match those produced by the actual embedding model. | +| Search quality testing | Complex generation | Requires meaningful vector clusters to evaluate recall and precision accurately. | + + +**Recommendation**: For search quality testing or algorithm comparisons, use complex configuration with sample vectors to ensure realistic data distributions. diff --git a/_benchmark/features/synthetic-data-generation/index.md b/_benchmark/features/synthetic-data-generation/index.md new file mode 100644 index 00000000000..326c3fdcb2f --- /dev/null +++ b/_benchmark/features/synthetic-data-generation/index.md @@ -0,0 +1,48 @@ +--- +layout: default +title: Synthetic data generation +nav_order: 5 +has_children: true +parent: Additional features +has_toc: false +redirect_from: + - /benchmark/features/synthetic-data-generation/ +cards: + - heading: "Generate data using index mappings" + description: "Create synthetic data based on your OpenSearch index mappings." + link: "/benchmark/features/synthetic-data-generation/mapping-sdg/" + - heading: "Generate data using custom logic" + description: "Build synthetic data using your own scripts or domain-specific rules." + link: "/benchmark/features/synthetic-data-generation/custom-logic-sdg/" +more_cards: + - heading: "Generating vectors" + description: "Generate synthetic dense and sparse vectors with configurable parameters for realistic AI/ML benchmarking scenarios." + link: "/benchmark/features/synthetic-data-generation/generating-vectors/" +tip_cards: + - heading: "Tips and best practices" + description: "Learn practical guidance and best practices to optimize your synthetic data generation workflows." + link: "/benchmark/features/synthetic-data-generation/tips/" +--- + +# Synthetic data generation +**Introduced 2.0** +{: .label .label-purple } + +OpenSearch Benchmark provides a built-in synthetic data generator that can create datasets for any use case at any scale. It currently supports two generation methods: + +* **Random data generation** produces fields with randomized values. This is useful for stress testing and evaluating system performance under load. +* **Rule-based data generation** creates data according to user-defined rules. This is helpful for testing specific scenarios, benchmarking query behavior, or simulating domain-specific patterns. + +## Data generation methods + +OpenSearch Benchmark currently supports the following data generation methods. + +{% include cards.html cards=page.cards %} + +For advanced synthetic data generation capabilities, explore vector generation. 
+ +{% include cards.html cards=page.more_cards %} + +## Tips and best practices + +{% include cards.html cards=page.tip_cards %} diff --git a/_benchmark/features/synthetic-data-generation/mapping-sdg.md b/_benchmark/features/synthetic-data-generation/mapping-sdg.md new file mode 100644 index 00000000000..cea19e0f2cf --- /dev/null +++ b/_benchmark/features/synthetic-data-generation/mapping-sdg.md @@ -0,0 +1,431 @@ +--- +layout: default +title: Generating data using index mappings +nav_order: 15 +parent: Synthetic data generation +grand_parent: Additional features +--- + +# Generating data using index mappings + +You can use OpenSearch index mappings to generate synthetic data. This approach offers a balance between automation and customization. + +To use this method, save your OpenSearch index mappings to a JSON file: + +```json +{ + "mappings": { + "properties": { + "title": { + "type": "text", + "analyzer": "standard", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + } + } + }, + "description": { + "type": "text" + }, + "price": { + "type": "float" + }, + "created_at": { + "type": "date", + "format": "strict_date_optional_time||epoch_millis" + }, + "is_available": { + "type": "boolean" + }, + "category_id": { + "type": "integer" + }, + "tags": { + "type": "keyword" + } + } + }, + "settings": { + "number_of_shards": 1, + "number_of_replicas": 1 + } +} +``` + +OpenSearch Benchmark works with any valid index mappings, regardless of complexity. You can provide more complex mappings similar to the following: + +
+ + Mappings + + {: .text-delta} + +```json +{ + "mappings": { + "dynamic": "strict", + "properties": { + "user": { + "type": "object", + "properties": { + "id": { + "type": "keyword" + }, + "email": { + "type": "keyword" + }, + "name": { + "type": "text", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + }, + "completion": { + "type": "completion" + } + }, + "analyzer": "standard" + }, + "address": { + "type": "object", + "properties": { + "street": { + "type": "text" + }, + "city": { + "type": "keyword" + }, + "state": { + "type": "keyword" + }, + "zip": { + "type": "keyword" + }, + "location": { + "type": "geo_point" + } + } + }, + "preferences": { + "type": "object", + "dynamic": true + } + } + }, + "orders": { + "type": "nested", + "properties": { + "id": { + "type": "keyword" + }, + "date": { + "type": "date", + "format": "strict_date_optional_time||epoch_millis" + }, + "amount": { + "type": "scaled_float", + "scaling_factor": 100 + }, + "status": { + "type": "keyword" + }, + "items": { + "type": "nested", + "properties": { + "product_id": { + "type": "keyword" + }, + "name": { + "type": "text", + "fields": { + "keyword": { + "type": "keyword" + } + } + }, + "quantity": { + "type": "short" + }, + "price": { + "type": "float" + }, + "categories": { + "type": "keyword" + } + } + }, + "shipping_address": { + "type": "object", + "properties": { + "street": { + "type": "text" + }, + "city": { + "type": "keyword" + }, + "state": { + "type": "keyword" + }, + "zip": { + "type": "keyword" + }, + "location": { + "type": "geo_point" + } + } + } + } + }, + "activity_log": { + "type": "nested", + "properties": { + "timestamp": { + "type": "date" + }, + "action": { + "type": "keyword" + }, + "ip_address": { + "type": "ip" + }, + "details": { + "type": "object", + "enabled": false + } + } + }, + "metadata": { + "type": "object", + "properties": { + "created_at": { + "type": "date" + }, + "updated_at": { + "type": "date" + }, + "tags": { + "type": "keyword" + }, + "source": { + "type": "keyword" + }, + "version": { + "type": "integer" + } + } + }, + "description": { + "type": "text", + "analyzer": "english", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + }, + "standard": { + "type": "text", + "analyzer": "standard" + } + } + }, + "ranking_scores": { + "type": "object", + "properties": { + "popularity": { + "type": "float" + }, + "relevance": { + "type": "float" + }, + "quality": { + "type": "float" + } + } + }, + "permissions": { + "type": "nested", + "properties": { + "user_id": { + "type": "keyword" + }, + "role": { + "type": "keyword" + }, + "granted_at": { + "type": "date" + } + } + } + } + }, + "settings": { + "number_of_shards": 3, + "number_of_replicas": 2, + "analysis": { + "analyzer": { + "email_analyzer": { + "type": "custom", + "tokenizer": "uax_url_email", + "filter": ["lowercase", "stop"] + } + } + } + } +} +``` + +
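+
+Regardless of mapping complexity, each generated field follows the type defined in the mapping. For example, a document generated from the simple mapping at the beginning of this page might look similar to the following. The values shown are illustrative only; actual values are randomized, and their exact format depends on the generator defaults and any overrides that you configure.
+
+```json
+{
+  "title": "lorem ipsum dolor sit",
+  "description": "consectetur adipiscing elit sed do eiusmod",
+  "price": 412.87,
+  "created_at": "2022-06-14T09:21:43Z",
+  "is_available": true,
+  "category_id": 7,
+  "tags": "alpha"
+}
+```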
+ +## Generating data + +To generate synthetic data using index mappings, use the `generate-data` subcommand and provide the required index mappings file, index name, output path, and total amount of data to generate: + +```shell +osb generate-data --index-name --index-mappings --output-path --total-size +``` +{% include copy.html %} + +For a complete list of available parameters and their descriptions, see the [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/). + +## Example output + +The following is an example output of generating 100 GB of data: + +``` + ____ _____ __ ____ __ __ + / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ + / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ +/ /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< +\____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| + /_/ + + +[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status] +[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard. +[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance: +ssh -i -N -L localhost:8787:localhost:8787 ec2-user@ + +Total GB to generate: [1] +Average document size in bytes: [412] +Max file size in GB: [40] + +100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s] + +Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB. +✅ Visit the following path to view synthetically generated data: /home/ec2-user/ + +----------------------------------- +[INFO] ✅ SUCCESS (took 272 seconds) +----------------------------------- +``` + +## Advanced configuration + +You can control how synthetic data is generated by creating a YAML configuration file. The following is an example configuration file that defines custom rules in the `MappingGenerationValues` parameter: + +```yml +MappingGenerationValues: + # For users who want more granular control over how data is generated when providing an OpenSearch mapping + generator_overrides: + # Overrides all instances of generators with these settings. Specify type and params + integer: + min: 0 + max: 20 + long: + min: 0 + max: 1000 + float: + min: 0.0 + max: 1.0 + double: + min: 0.0 + max: 2000.0 + date: + start_date: "2020-01-01" + end_date: "2023-01-01" + format: "yyyy-mm-dd" + text: + must_include: ["lorem", "ipsum"] + keyword: + choices: ["alpha", "beta", "gamma"] + + field_overrides: + # Specify field name as key of dict. For its values, specify generator and its params. Params must adhere to existing params for each generator + # For nested fields, use dot notation: Example preferences.allergies if allergies is a subfield of preferences object + title: + generator: generate_keyword + params: + choices: ["Helly R", "Mark S", "Irving B"] + + promo_codes: + generator: generate_keyword + params: + choices: ["HOT_SUMMER", "TREATSYUM!"] + + # Nested fields, use dot notation + orders.items.product_id: + generator: generate_keyword + params: + choices: ["Python", "English"] +``` +{% include copy.html %} + +`MappingGenerationValues` supports the following parameters. 
+ +| Parameter | Description | +|---|---| +| `generator_overrides` | Defines custom generator rules for specific OpenSearch field types. Any field that uses the corresponding type will follow these rules. See [Generator overrides parameters](#generator-overrides-parameters). | +| `field_overrides` | Defines generator rules for individual fields by field name. These apply only to the fields explicitly listed. For nested fields, use dot notation (for example, `orders.items.product_id`). See [Field overrides parameters](#field-overrides-parameters). | + +If both `generator_overrides` and `field_overrides` are present, `field_overrides` take precedence. +{: .important} + +#### Generator overrides parameters + +The following parameters are available for each OpenSearch field type in `generator_overrides`. + +| Field type | Parameters | +|---|---| +| `integer`, `long`, `short`, `byte` | `min`, `max` | +| `float`, `double` | `min`, `max`, `round` (the number of decimal places to round to) | +| `date` | `start_date`, `end_date`, `format` | +| `text` | `must_include` (array of terms to include in generated text) | +| `keyword` | `choices` (array of keywords to randomly select from) | + +#### Field overrides parameters + +The following generators and their parameters are available for use in `field_overrides`. + +| Generator | Parameters | +|---|---| +| `generate_text` | `must_include` (array of terms to include in generated text) | +| `generate_keyword` | `choices` (array of keywords to randomly select from) | +| `generate_integer` | `min`, `max` | +| `generate_long` | `min`, `max` | +| `generate_short` | `min`, `max` | +| `generate_byte` | `min`, `max` | +| `generate_float` | `min`, `max`, `round` (the number of decimal places to round to) | +| `generate_double` | `min`, `max` | +| `generate_boolean` | N/A | +| `generate_date` | `format`, `start_date`, `end_date` | +| `generate_ip` | N/A | +| `generate_geo_point` | N/A | +| `generate_knn_vector` | `dimension`, `sample_vectors`, `noise_factor`, `distribution_type`, `normalize`. See [Generating vectors]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/generating-vectors/). | +| `generate_sparse_vector` | `num_tokens`, `min_weight`, `max_weight`, `token_id_start`, `token_id_step`. See [Generating vectors]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/generating-vectors/). | + +### Using the configuration + +To use your configuration file, provide its full path in the `--custom-config` parameter: + +```shell +osb generate-data --index-name <index-name> --index-mappings <path-to-mappings-file> --output-path <output-path> --total-size <total-size-in-GB> --custom-config ~/Desktop/sdg-config.yml +``` +{% include copy.html %} + +## Related documentation + +- [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/) +- [Generating data using custom logic]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/custom-logic-sdg/) \ No newline at end of file diff --git a/_benchmark/features/synthetic-data-generation/tips.md b/_benchmark/features/synthetic-data-generation/tips.md new file mode 100644 index 00000000000..e93e621e4f0 --- /dev/null +++ b/_benchmark/features/synthetic-data-generation/tips.md @@ -0,0 +1,23 @@ +--- +layout: default +title: Tips and best practices +nav_order: 45 +parent: Synthetic data generation +grand_parent: Additional features +--- + +# Tips and best practices + +The following tips help you efficiently generate synthetic data and monitor performance during the process.
+ +## Visualizing generation + +The dashboard link included in the `generate-data` command output opens a [Dask dashboard](https://docs.dask.org/en/latest/dashboard.html) that visualizes the data generation process. You can monitor CPU and memory usage for each worker and view a CPU flamegraph of the generation workflow. This helps you track resource usage and optimize performance, especially when using a [custom Python module]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/custom-logic-sdg/). + +## Use default settings + +We recommend starting with the default synthetic data generation settings. If you need to adjust them, the following guidelines help you choose settings for efficient and reliable generation: + +* Set the number of workers to **no more than the CPU count** on the load generation host. +* Use a **chunk size of 10,000 documents**. +* Adjust the `max_file_size_gb` setting as needed to control how much data is written to each generated file. diff --git a/_benchmark/quickstart.md b/_benchmark/quickstart.md index 928aae59805..5b0505a1a10 100644 --- a/_benchmark/quickstart.md +++ b/_benchmark/quickstart.md @@ -114,9 +114,9 @@ You can now run your first benchmark. The following benchmark uses the [percolat ### Understanding workload command flags -Benchmarks are run using the [`run`]({{site.url}}{{site.baseurl}}/benchmark/commands/run/) command with the following command flags: +Benchmarks are run using the [`run`]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/run/) command with the following command flags: -For additional `run` command flags, see the [run]({{site.url}}{{site.baseurl}}/benchmark/commands/run/) reference. Some commonly used options are `--workload-params`, `--exclude-tasks`, and `--include-tasks`. +For additional `run` command flags, see the [run]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/run/) reference. Some commonly used options are `--workload-params`, `--exclude-tasks`, and `--include-tasks`. {: .tip} * `--pipeline=benchmark-only` : Informs OSB that users wants to provide their own OpenSearch cluster.
diff --git a/_benchmark/reference/commands/aggregate.md b/_benchmark/reference/commands/aggregate.md index 908c595a224..624d6e1c1aa 100644 --- a/_benchmark/reference/commands/aggregate.md +++ b/_benchmark/reference/commands/aggregate.md @@ -1,7 +1,7 @@ --- layout: default title: aggregate -nav_order: 85 +nav_order: 10 parent: Command reference grand_parent: OpenSearch Benchmark Reference redirect_from: diff --git a/_benchmark/reference/commands/command-flags.md b/_benchmark/reference/commands/command-flags.md index bc8b609923f..0aac72ccac1 100644 --- a/_benchmark/reference/commands/command-flags.md +++ b/_benchmark/reference/commands/command-flags.md @@ -1,7 +1,7 @@ --- layout: default title: Command flags -nav_order: 51 +nav_order: 150 parent: Command reference redirect_from: - /benchmark/commands/command-flags/ diff --git a/_benchmark/reference/commands/compare.md b/_benchmark/reference/commands/compare.md index a7d03c47951..8653e9bc079 100644 --- a/_benchmark/reference/commands/compare.md +++ b/_benchmark/reference/commands/compare.md @@ -1,7 +1,7 @@ --- layout: default title: compare -nav_order: 55 +nav_order: 20 parent: Command reference grand_parent: OpenSearch Benchmark Reference redirect_from: diff --git a/_benchmark/reference/commands/download.md b/_benchmark/reference/commands/download.md index c6d31bbd5df..6003d0e5271 100644 --- a/_benchmark/reference/commands/download.md +++ b/_benchmark/reference/commands/download.md @@ -1,7 +1,7 @@ --- layout: default title: download -nav_order: 60 +nav_order: 30 parent: Command reference grand_parent: OpenSearch Benchmark Reference redirect_from: diff --git a/_benchmark/reference/commands/generate-data.md b/_benchmark/reference/commands/generate-data.md new file mode 100644 index 00000000000..1ee74252b69 --- /dev/null +++ b/_benchmark/reference/commands/generate-data.md @@ -0,0 +1,91 @@ +--- +layout: default +title: generate-data +nav_order: 50 +parent: Command reference +grand_parent: OpenSearch Benchmark Reference +redirect_from: + - /benchmark/commands/generate-data/ +--- + +# generate-data + +The `generate-data` command creates synthetic datasets for benchmarking and testing. OpenSearch Benchmark supports two methods for data generation: using OpenSearch index mappings or custom Python modules with user-defined logic. For more information, see [Synthetic data generation]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/). + +## Usage + +```shell +osb generate-data --index-name <index-name> --output-path <output-path> --total-size <total-size-in-GB> [OPTIONS] +``` + +**Requirements**: + +- Either `--index-mappings` or `--custom-module` must be specified, but not both. +- When using `--custom-module`, your Python module must include the `generate_synthetic_document(providers, **custom_lists)` function. + +## Data generation methods + +Choose one of the following approaches: + +**Method 1: Using index mappings**: + +```shell +osb generate-data --index-name my-index --index-mappings mapping.json --output-path ./data --total-size 1 +``` + +**Method 2: Using a custom Python module**: + +```shell +osb generate-data --index-name my-index --custom-module custom.py --output-path ./data --total-size 1 +``` + +## Options + +Use the following options with the `generate-data` command. + +| Option | Required/Optional | Description | +|---|---|---| +| `--index-name` or `-n` | Required | The name of the data corpus that you want to generate. | +| `--output-path` or `-p` | Required | The path where you want the data to be generated.
| +| `--total-size` or `-s` | Required | The total amount of data you want to generate, in GB. | +| `--index-mappings` or `-i` | Conditional (Either `--index-mappings` or `--custom-module` must be specified)| The path to the OpenSearch index mappings you want to use. Required when using mapping-based generation. Cannot be used with `--custom-module`. | +| `--custom-module` or `-m` | Conditional (Either `--index-mappings` or `--custom-module` must be specified)| The path to the Python module that includes your custom logic. Required when using custom logic generation. Cannot be used with `--index-mappings`. The Python module must include the `generate_synthetic_document(providers, **custom_lists)` function. | +| `--custom-config` or `-c` | Optional | The path to a YAML configuration file defining rules for how you want data to be generated. | +| `--test-document` or `-t` | Optional | When this flag is present, OSB generates a single synthetic document and outputs it to the console. This provides you with a way to verify that the example document generated aligns with your expectations. When the flag is not present, the entire data corpora will be generated. | + +## Example output + +The following is an example output when generating synthetic data: + +``` + ____ _____ __ ____ __ __ + / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ + / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ +/ /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< +\____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| + /_/ + + +[NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status] +[NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard. +[NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance: +ssh -i -N -L localhost:8787:localhost:8787 ec2-user@ + +Total GB to generate: [1] +Average document size in bytes: [412] +Max file size in GB: [40] + +100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s] + +Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB. 
+✅ Visit the following path to view synthetically generated data: /home/ec2-user/ + +----------------------------------- +[INFO] ✅ SUCCESS (took 272 seconds) +----------------------------------- +``` + +## Related documentation + +- [Generating data using index mappings]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/mapping-sdg/) +- [Generating data using custom logic]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/custom-logic-sdg/) \ No newline at end of file diff --git a/_benchmark/reference/commands/index.md b/_benchmark/reference/commands/index.md index 323f0246a93..b2e7914ca3c 100644 --- a/_benchmark/reference/commands/index.md +++ b/_benchmark/reference/commands/index.md @@ -3,6 +3,7 @@ layout: default title: Command reference nav_order: 50 has_children: true +has_toc: false parent: OpenSearch Benchmark Reference redirect_from: - /benchmark/commands/index/ @@ -12,13 +13,16 @@ redirect_from: # OpenSearch Benchmark command reference -This section provides a list of commands supported by OpenSearch Benchmark, including commonly used commands such as `run` and `list`. +OpenSearch Benchmark supports the following commands: -- [compare]({{site.url}}{{site.baseurl}}/benchmark/commands/compare/) -- [download]({{site.url}}{{site.baseurl}}/benchmark/commands/download/) -- [run]({{site.url}}{{site.baseurl}}/benchmark/commands/run/) -- [info]({{site.url}}{{site.baseurl}}/benchmark/commands/info/) -- [list]({{site.url}}{{site.baseurl}}/benchmark/commands/list/) +- [aggregate]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/aggregate/) +- [compare]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/compare/) +- [download]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/download/) +- [generate-data]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/) +- [info]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/info/) +- [list]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/list/) +- [redline-test]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/redline-test/) +- [run]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/run/) ## List of common options @@ -28,3 +32,4 @@ All OpenSearch Benchmark commands support the following options: - `--quiet`: Hides as much of the results output as possible. Default is `false`. - `--offline`: Indicates whether OpenSearch Benchmark has a connection to the internet. Default is `false`. +For more information about command options, see [Command flags]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/command-flags/). \ No newline at end of file diff --git a/_benchmark/reference/commands/info.md b/_benchmark/reference/commands/info.md index a9dc509b4eb..6d31ed67e87 100644 --- a/_benchmark/reference/commands/info.md +++ b/_benchmark/reference/commands/info.md @@ -1,7 +1,7 @@ --- layout: default title: info -nav_order: 75 +nav_order: 70 parent: Command reference grand_parent: OpenSearch Benchmark Reference redirect_from: diff --git a/_benchmark/reference/commands/run.md b/_benchmark/reference/commands/run.md index da460d614d0..65632a88c19 100644 --- a/_benchmark/reference/commands/run.md +++ b/_benchmark/reference/commands/run.md @@ -1,11 +1,11 @@ --- layout: default title: run -nav_order: 65 +nav_order: 90 parent: Command reference grand_parent: OpenSearch Benchmark Reference redirect_from: - - /benchmark/commands/run/ + - /benchmark/commands/execute-test/ ---