generated from amazon-archives/__template_Apache-2.0
Add Synthetic Data Generation & Features to OSB Documentation #11480
Open
IanHoang wants to merge 8 commits into opensearch-project:main from IanHoang:add-features
+1,432
−15
fbb34cf Add Synthetic Data Generation & Features to OSB Documentation (IanHoang)
9a7617d Address format (IanHoang)
d689b3a Add generating vectors advanced section (IanHoang)
606db9d Add feedback from Michael (IanHoang)
8609c28 Doc review (kolchfa-aws)
285476f Resolve merge conflicts (kolchfa-aws)
6dc2208 Update _benchmark/features/synthetic-data-generation/custom-logic-sdg.md (kolchfa-aws)
4409481 Fix links (kolchfa-aws)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,19 @@ | ||
| --- | ||
| layout: default | ||
| title: Additional features | ||
| nav_order: 30 | ||
| has_children: true | ||
| has_toc: false | ||
| redirect_from: | ||
| - /benchmark/features/ | ||
| more_cards: | ||
| - heading: "Synthetic data generation" | ||
| description: "Create synthetic datasets using index mappings or custom Python logic for comprehensive benchmarking and testing." | ||
| link: "/benchmark/features/synthetic-data-generation/" | ||
| --- | ||
|
|
||
| # Additional features | ||
|
|
||
| In addition to general benchmarking, OpenSearch Benchmark provides several specialized features. | ||
|
|
||
| {% include cards.html cards=page.more_cards %} |
257 changes: 257 additions & 0 deletions
_benchmark/features/synthetic-data-generation/custom-logic-sdg.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,257 @@ | ||
| --- | ||
| layout: default | ||
| title: Generating data using custom logic | ||
| nav_order: 35 | ||
| parent: Synthetic data generation | ||
| grand_parent: Additional features | ||
| --- | ||
|
|
||
| # Generating data using custom logic | ||
|
|
||
| You can generate synthetic data using custom logic defined in a Python module. This approach gives you the most granular control over how OpenSearch Benchmark produces synthetic data and is especially useful if you understand the distribution of your data and the relationships between fields. | ||
|
|
||
| ## The generate_synthetic_document function | ||
|
|
||
| Every custom module provided to OpenSearch Benchmark must define the `generate_synthetic_document(providers, **custom_lists)` function. This function defines how OpenSearch Benchmark generates each synthetic document. | ||
|
|
||
| ### Function parameters | ||
|
|
||
| | Parameter | Required/Optional | Description | | ||
| |---|---|---| | ||
| | `providers` | Required | A dictionary containing data generation tools. Available providers are `generic` (Mimesis [Generic provider](https://mimesis.name/master/api.html#generic-providers)) and `random` (Mimesis [Random class](https://mimesis.name/master/random_and_seed.html)). To add custom providers, see [Advanced configuration](#advanced-configuration). | | ||
| `**custom_lists` | Optional | Keyword arguments containing predefined lists of values that you can use in your data generation logic. The lists are defined in your YAML configuration file under `custom_lists` and are collected into the `custom_lists` dictionary inside your function. For example, if you define `dog_names: [Buddy, Max, Luna]` in YAML, you can access the list as `custom_lists['dog_names']`. Keeping data values in YAML lets you modify them without changing your Python code. | | ||
|
|
||
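The way `**custom_lists` reaches the function can be tried outside OpenSearch Benchmark with a small sketch. It is not how OpenSearch Benchmark itself invokes the module: the `providers` dictionary and `lists_from_config` below are hypothetical stand-ins, and the standard library `random.Random` is assumed in place of the Mimesis `Random` class.

```python
import random


def generate_synthetic_document(providers, **custom_lists):
    # Inside the function, custom_lists is a plain dict built from keyword arguments.
    rng = providers['random']
    return {'dog_name': rng.choice(custom_lists['dog_names'])}


# Hypothetical stand-in for the provider dictionary passed to the function.
# Assumption: random.Random replaces the Mimesis Random class for this sketch.
providers = {'random': random.Random(42)}

# Hypothetical stand-in for lists that would come from the YAML configuration.
lists_from_config = {'dog_names': ['Buddy', 'Max', 'Luna']}

# Unpacking the dict with ** turns each key into a keyword argument.
document = generate_synthetic_document(providers, **lists_from_config)
print(document)
```

Each key of the unpacked dictionary becomes one entry in `custom_lists`, which is why a list defined as `dog_names` is read as `custom_lists['dog_names']`.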
| ### Basic function template | ||
|
|
||
| ```python | ||
| def generate_synthetic_document(providers, **custom_lists): | ||
|     # Access the available providers | ||
|     generic = providers['generic'] | ||
|     random_provider = providers['random'] | ||
|
|
||
|     # Generate a document using the providers | ||
|     document = { | ||
|         'name': generic.person.full_name(), | ||
|         'age': random_provider.randint(18, 80), | ||
|         'email': generic.person.email(), | ||
|         'timestamp': generic.datetime.datetime() | ||
|     } | ||
|
|
||
|     # Optionally, use custom lists if provided | ||
|     if 'categories' in custom_lists: | ||
|         document['category'] = random_provider.choice(custom_lists['categories']) | ||
|
|
||
|     return document | ||
| ``` | ||
| {% include copy.html %} | ||
|
|
||
| For more information, see the [Mimesis documentation](https://mimesis.name/master/api.html). | ||
|
|
||
| ## Python module example | ||
|
|
||
| The following example Python module demonstrates custom logic for generating documents about dog drivers for a fictional ride-sharing company, *Pawber*, which uses OpenSearch to store and search large volumes of ride-sharing data. | ||
|
Check failure on line 52 in _benchmark/features/synthetic-data-generation/custom-logic-sdg.md
|
||
|
|
||
| This example showcases several advanced concepts: | ||
| - **[Custom provider classes](#advanced-configuration)** (`NumericString`, `MultipleChoices`) that extend Mimesis functionality | ||
| - **[Custom lists](#advanced-configuration)** for data values like dog names, breeds, and treats (referenced as `custom_lists['dog_names']`) | ||
| - **Geographic clustering** logic for realistic location data | ||
| - **Complex document structures** with nested objects and relationships | ||
|
|
||
| Save this code to a file called `pawber.py` in your desired directory (for example, `~/pawber.py`): | ||
|
|
||
| ```python | ||
| from mimesis.providers.base import BaseProvider | ||
| from mimesis.enums import TimestampFormat | ||
|
|
||
| import random | ||
|
|
||
| GEOGRAPHIC_CLUSTERS = { | ||
|     'Manhattan': { | ||
|         'center': {'lat': 40.7831, 'lon': -73.9712}, | ||
|         'radius': 0.05  # degrees | ||
|     }, | ||
|     'Brooklyn': { | ||
|         'center': {'lat': 40.6782, 'lon': -73.9442}, | ||
|         'radius': 0.05 | ||
|     }, | ||
|     'Austin': { | ||
|         'center': {'lat': 30.2672, 'lon': -97.7431}, | ||
|         'radius': 0.1  # Larger radius to cover more of Austin | ||
|     } | ||
| } | ||
|
|
||
| def generate_location(cluster): | ||
|     """Generate a random location within a cluster""" | ||
|     center = GEOGRAPHIC_CLUSTERS[cluster]['center'] | ||
|     radius = GEOGRAPHIC_CLUSTERS[cluster]['radius'] | ||
|     lat = center['lat'] + random.uniform(-radius, radius) | ||
|     lon = center['lon'] + random.uniform(-radius, radius) | ||
|     return {'lat': lat, 'lon': lon} | ||
|
|
||
| class NumericString(BaseProvider): | ||
|     class Meta: | ||
|         name = "numeric_string" | ||
|
|
||
|     @staticmethod | ||
|     def generate(length=5) -> str: | ||
|         return ''.join([str(random.randint(0, 9)) for _ in range(length)]) | ||
|
|
||
| class MultipleChoices(BaseProvider): | ||
|     class Meta: | ||
|         name = "multiple_choices" | ||
|
|
||
|     @staticmethod | ||
|     def generate(choices, num_of_choices=5) -> list: | ||
|         # Returns a list of values sampled with replacement, so duplicates are possible | ||
|         import logging | ||
|         logger = logging.getLogger(__name__) | ||
|         logger.info("Choices: %s", choices) | ||
|         logger.info("Length: %s", num_of_choices) | ||
|         total_choices_available = len(choices) - 1 | ||
|
|
||
|         return [choices[random.randint(0, total_choices_available)] for _ in range(num_of_choices)] | ||
|
|
||
| def generate_synthetic_document(providers, **custom_lists): | ||
|     generic = providers['generic'] | ||
|     random_mimesis = providers['random'] | ||
|
|
||
|     first_name = generic.person.first_name() | ||
|     last_name = generic.person.last_name() | ||
|     city = random.choice(list(GEOGRAPHIC_CLUSTERS.keys())) | ||
|
|
||
|     # Driver Document | ||
|     document = { | ||
|         "dog_driver_id": f"DD{generic.numeric_string.generate(length=4)}", | ||
|         "dog_name": random_mimesis.choice(custom_lists['dog_names']), | ||
|         "dog_breed": random_mimesis.choice(custom_lists['dog_breeds']), | ||
|         "license_number": f"{random_mimesis.choice(custom_lists['license_plates'])}{generic.numeric_string.generate(length=4)}", | ||
|         "favorite_treats": random_mimesis.choice(custom_lists['treats']), | ||
|         "preferred_tip": random_mimesis.choice(custom_lists['tips']), | ||
|         "vehicle_type": random_mimesis.choice(custom_lists['vehicle_types']), | ||
|         "vehicle_make": random_mimesis.choice(custom_lists['vehicle_makes']), | ||
|         "vehicle_model": random_mimesis.choice(custom_lists['vehicle_models']), | ||
|         "vehicle_year": random_mimesis.choice(custom_lists['vehicle_years']), | ||
|         "vehicle_color": random_mimesis.choice(custom_lists['vehicle_colors']), | ||
|         "license_plate": random_mimesis.choice(custom_lists['license_plates']), | ||
|         "current_location": generate_location(city), | ||
|         "status": random.choice(['available', 'busy', 'offline']), | ||
|         "current_ride": f"R{generic.numeric_string.generate(length=6)}", | ||
|         "account_status": random_mimesis.choice(custom_lists['account_status']), | ||
|         "join_date": generic.datetime.formatted_date(), | ||
|         "total_rides": generic.numeric.integer_number(start=1, end=200), | ||
|         "rating": generic.numeric.float_number(start=1.0, end=5.0, precision=2), | ||
|         "earnings": { | ||
|             "today": { | ||
|                 "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), | ||
|                 "currency": "USD" | ||
|             }, | ||
|             "this_week": { | ||
|                 "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), | ||
|                 "currency": "USD" | ||
|             }, | ||
|             "this_month": { | ||
|                 "amount": generic.numeric.float_number(start=1.0, end=5.0, precision=2), | ||
|                 "currency": "USD" | ||
|             } | ||
|         }, | ||
|         "last_grooming_check": "2023-12-01", | ||
|         "owner": { | ||
|             "first_name": first_name, | ||
|             "last_name": last_name, | ||
|             "email": f"{first_name}{last_name}@gmail.com" | ||
|         }, | ||
|         "special_skills": generic.multiple_choices.generate(custom_lists['skills'], num_of_choices=3), | ||
|         "bark_volume": generic.numeric.float_number(start=1.0, end=10.0, precision=2), | ||
|         "tail_wag_speed": generic.numeric.float_number(start=1.0, end=10.0, precision=1) | ||
|     } | ||
|
|
||
|     return document | ||
| ``` | ||
| {% include copy.html %} | ||
|
|
||
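The geographic clustering logic in `pawber.py` can be checked in isolation with a rough sketch like the following. It is a variant, not the module's own code: `CLUSTER` is a hypothetical single-cluster copy of one `GEOGRAPHIC_CLUSTERS` entry, and the function takes a seeded `random.Random` instance (an assumption added here) so that runs are repeatable.

```python
import random

# Hypothetical single-cluster copy of one GEOGRAPHIC_CLUSTERS entry.
CLUSTER = {'center': {'lat': 30.2672, 'lon': -97.7431}, 'radius': 0.1}


def generate_location(cluster, rng):
    """Pick a point by offsetting latitude and longitude independently."""
    center, radius = cluster['center'], cluster['radius']
    return {
        'lat': center['lat'] + rng.uniform(-radius, radius),
        'lon': center['lon'] + rng.uniform(-radius, radius),
    }


# Seeding the generator makes the sampled points reproducible between runs.
rng = random.Random(7)
points = [generate_location(CLUSTER, rng) for _ in range(1000)]

# Largest observed offset from the center along each axis.
max_lat_offset = max(abs(p['lat'] - CLUSTER['center']['lat']) for p in points)
max_lon_offset = max(abs(p['lon'] - CLUSTER['center']['lon']) for p in points)
print(max_lat_offset, max_lon_offset)
```

Because latitude and longitude are offset independently, the points fill a square of side `2 * radius` degrees around the center rather than a circle, which is usually close enough for benchmark data.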
| ## Generating data | ||
|
|
||
| To generate synthetic data using custom logic, use the `generate-data` subcommand and provide the required custom Python module, index name, output path, and total amount of data to generate: | ||
|
|
||
| ```shell | ||
| osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2 | ||
| ``` | ||
| {% include copy.html %} | ||
|
|
||
| For a complete list of available parameters and their descriptions, see the [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/). | ||
|
|
||
| ## Example output | ||
|
|
||
| The following is an example output of generating 100 GB of data: | ||
|
|
||
| ``` | ||
| ____ _____ __ ____ __ __ | ||
| / __ \____ ___ ____ / ___/___ ____ ___________/ /_ / __ )___ ____ _____/ /_ ____ ___ ____ ______/ /__ | ||
| / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \ / __ / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/ | ||
| / /_/ / /_/ / __/ / / /__/ / __/ /_/ / / / /__/ / / / / /_/ / __/ / / / /__/ / / / / / / / / /_/ / / / ,< | ||
| \____/ .___/\___/_/ /_/____/\___/\__,_/_/ \___/_/ /_/ /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/ /_/|_| | ||
| /_/ | ||
|
|
||
|
|
||
| [NOTE] ✨ Dashboard link to monitor processes and task streams: [http://127.0.0.1:8787/status] | ||
| [NOTE] ✨ For users who are running generation on a virtual machine, consider SSH port forwarding (tunneling) to localhost to view dashboard. | ||
|
||
| [NOTE] Example of localhost command for SSH port forwarding (tunneling) from an AWS EC2 instance: | ||
| ssh -i <PEM_FILEPATH> -N -L localhost:8787:localhost:8787 ec2-user@<DNS> | ||
|
|
||
| Total GB to generate: [1] | ||
| Average document size in bytes: [412] | ||
| Max file size in GB: [40] | ||
|
|
||
| 100%|███████████████████████████████████████████████████████████████████| 100.07G/100.07G [3:35:29<00:00, 3.98MB/s] | ||
|
|
||
| Generated 24271844660 docs in 12000 seconds. Total dataset size is 100.21GB. | ||
| ✅ Visit the following path to view synthetically generated data: /home/ec2-user/ | ||
|
|
||
| ----------------------------------- | ||
| [INFO] ✅ SUCCESS (took 272 seconds) | ||
| ----------------------------------- | ||
| ``` | ||
|
|
||
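When planning a run, the expected document count can be estimated from the total size and the average document size that the tool reports. The following is only a back-of-the-envelope sketch: `estimate_doc_count` is a hypothetical helper, and it assumes decimal gigabytes (1 GB = 1000³ bytes), which may differ from how OpenSearch Benchmark counts bytes internally.

```python
def estimate_doc_count(total_gb, avg_doc_bytes, bytes_per_gb=1000**3):
    """Estimate how many documents fit in total_gb at a given average size."""
    # Integer division: a partial document does not count.
    return (total_gb * bytes_per_gb) // avg_doc_bytes


# 100 GB of documents averaging 412 bytes each.
estimated_docs = estimate_doc_count(100, 412)
print(estimated_docs)  # 242718446
```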
| ## Advanced configuration | ||
|
|
||
| You can optionally create a YAML configuration file to store custom data and providers. The configuration file must define a `CustomGenerationValues` parameter. | ||
|
|
||
| The following parameters are available in `CustomGenerationValues`. Both parameters are optional. | ||
|
|
||
| | Parameter | Required/Optional | Description | | ||
| |---|---|---| | ||
| | `custom_lists` | Optional | Predefined arrays of values that you can reference in your Python module using `custom_lists['list_name']`. This allows you to separate data values from your code logic, making it easy to modify data values without changing your Python file. For example, `dog_names: [Buddy, Max, Luna]` becomes accessible as `custom_lists['dog_names']`. | | ||
| | `custom_providers` | Optional | Custom data generation classes that extend Mimesis functionality. These should be defined as classes in your Python module (like `NumericString` or `MultipleChoices` in the [example](#python-module-example)) and then listed in this parameter by name. This allows you to create specialized data generators beyond what Mimesis provides by default. | | ||
|
|
||
| ### Example configuration file | ||
|
|
||
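| Save your configuration in a YAML file. The following example is abbreviated: `pawber.py` also reads lists such as `license_plates`, `tips`, and `skills`, and every list that your module accesses must be defined under `custom_lists`: | ||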
|
|
||
| ```yml | ||
| CustomGenerationValues: | ||
|   # Generate data using a custom Python module | ||
|   custom_lists: | ||
|     # Custom lists to consolidate all values in this YAML file | ||
|     dog_names: [Hana, Youpie, Charlie, Lucy, Cooper, Luna, Rocky, Daisy, Buddy, Molly] | ||
|     dog_breeds: [Jindo, Labrador, German Shepherd, Golden Retriever, Bulldog, Poodle, Beagle, Rottweiler, Boxer, Dachshund, Chihuahua] | ||
|     treats: [cookies, pup_cup, jerky] | ||
|   custom_providers: | ||
|     # OSB's synthetic data generator uses Mimesis; custom providers are custom Python classes that add functionality to Mimesis | ||
|     - NumericString | ||
|     - MultipleChoices | ||
| ``` | ||
| {% include copy.html %} | ||
|
|
||
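Because the module indexes `custom_lists` directly, a list that is missing from the configuration only surfaces as a `KeyError` during generation. A small check along the following lines can catch that earlier. It is a sketch under stated assumptions: the `config` dictionary mirrors what a YAML parser would return for a file shaped like the example above, `missing_lists` is a hypothetical helper, and OpenSearch Benchmark's own configuration loading is not shown.

```python
# Assumption: this dict mirrors the parsed form of a configuration file
# shaped like the example above (values shortened for brevity).
config = {
    'CustomGenerationValues': {
        'custom_lists': {
            'dog_names': ['Hana', 'Youpie', 'Charlie'],
            'dog_breeds': ['Jindo', 'Labrador'],
            'treats': ['cookies', 'pup_cup', 'jerky'],
        },
        'custom_providers': ['NumericString', 'MultipleChoices'],
    }
}


def missing_lists(config, required_names):
    """Hypothetical helper: list names a module needs but the config lacks."""
    defined = config.get('CustomGenerationValues', {}).get('custom_lists', {})
    return sorted(name for name in required_names if name not in defined)


# Names that a module reads through custom_lists['...'].
required = ['dog_names', 'dog_breeds', 'treats', 'skills', 'tips']
missing = missing_lists(config, required)
print(missing)  # ['skills', 'tips']
```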
|
|
||
| ### Using the configuration | ||
|
|
||
| To use your configuration file, add the `--custom-config` parameter to the `generate-data` command: | ||
|
|
||
| ```shell | ||
| osb generate-data --custom-module ~/pawber.py --index-name pawber-data --output-path ~/Desktop/sdg_outputs/ --total-size 2 --custom-config ~/Desktop/sdg-config.yml | ||
| ``` | ||
| {% include copy.html %} | ||
|
|
||
| ## Related documentation | ||
|
|
||
| - [`generate-data` command reference]({{site.url}}{{site.baseurl}}/benchmark/reference/commands/generate-data/) | ||
| - [Generating data using index mappings]({{site.url}}{{site.baseurl}}/benchmark/features/synthetic-data-generation/mapping-sdg/) | ||