Closed
1 change: 1 addition & 0 deletions docs/tutorials.md
Original file line number Diff line number Diff line change
@@ -27,3 +27,4 @@ Explore synthetic data tutorials with the option to run them **either in Google
| Enrich Sensitive Data with LLMs using Synthetic Replicas | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-enrich/synthetic-enrich.ipynb) | [View Notebook](./tutorials/synthetic-enrich/synthetic-enrich.ipynb) |
| MOSTLY AI vs. SDV comparison: single-table scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb) |
| MOSTLY AI vs. SDV comparison: sequential scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) | [View Notebook](./tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb) |
| Creating and Using Datasets | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/using-datasets/creating-and-using-datasets.ipynb) | [View Notebook](./tutorials/using-datasets/creating-and-using-datasets.ipynb) |
288 changes: 288 additions & 0 deletions docs/tutorials/using-datasets/creating-and-using-datasets.ipynb
@@ -0,0 +1,288 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "2f1c82f1",
"metadata": {},
"source": [
"# Creating and Using Datasets <a href=\"https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/using-datasets/creating-and-using-datasets.ipynb\" target=\"_blank\"><img src=\"https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab\" alt=\"Run on Colab\"></a>\n",
"\n",
"In this notebook, we demonstrate the creation and usage of Datasets using the Synthetic Data SDK.\n",
"\n",
"Full Datasets endpoints documentation is available in the [API documentation](https://api-docs.mostly.ai/#ee628a4d-afb6-4d44-96dd-a86e1b345f95)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "002d715f",
"metadata": {},
"outputs": [],
"source": [
"# Install SDK in CLIENT mode\n",
"!uv pip install -U mostlyai\n",
"# Or install in LOCAL mode\n",
"!uv pip install -U 'mostlyai[local]'\n",
"# Note: Restart kernel session after installation!"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e06ff1a1",
"metadata": {},
"outputs": [],
"source": [
"from mostlyai.sdk import MostlyAI\n",
"\n",
"# Get your API key from https://app.mostly.ai/settings/api-keys\n",
"api_key = \"my-api-key\"\n",
"\n",
"# Initialize the SDK in CLIENT mode (for LOCAL mode, use MostlyAI(local=True) instead)\n",
"mostly = MostlyAI(api_key=api_key)"
]
},
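{
"cell_type": "markdown",
"id": "7a3e5b10",
"metadata": {},
"source": [
"To avoid hardcoding the key in the notebook, one option is to read it from an environment variable. The cell below is a minimal sketch: the variable name `MOSTLY_API_KEY` is an assumption, so adjust it to match however you store your credentials."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e5b11",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"from mostlyai.sdk import MostlyAI\n",
"\n",
"# Assumption: the key was exported as MOSTLY_API_KEY before the kernel started;\n",
"# otherwise fall back to the placeholder used above\n",
"api_key = os.environ.get(\"MOSTLY_API_KEY\", \"my-api-key\")\n",
"\n",
"mostly = MostlyAI(api_key=api_key)"
]
},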
{
"cell_type": "markdown",
"id": "ebbaa63d",
"metadata": {},
"source": [
"## Creating a Dataset\n",
"\n",
"Every Dataset must have at least a `name` and a `description`. The `name` is how other users identify your Dataset, while the `description` tells users or the Assistant how to handle the Dataset (it can, for example, point to a remote location from which the Dataset files can be downloaded). See [Example Descriptions](#example-descriptions).\n",
"\n",
"A Dataset can also optionally include either of the following:\n",
"- Files: remotely stored files containing the underlying dataset as well as any supporting artifacts. See [Creating a Dataset from a file](#creating-a-dataset-from-a-file)\n",
"- Connector: the [Connector](https://mostly-ai.github.io/mostlyai/#data-connectors) asset that provides access to the target dataset. See [Creating a Dataset from a Connector](#creating-a-dataset-from-a-connector)\n"
]
},
{
"cell_type": "markdown",
"id": "05ebdb7e",
"metadata": {},
"source": [
"### Creating a Dataset from a description\n",
"\n",
"A dataset created with just a `description` can contain instructions for users or the MOSTLY AI Assistant. Consider the following example, whose description instructs the Assistant to download data from a remote location on the public internet."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a33bf0c9",
"metadata": {},
"outputs": [],
"source": [
"from mostlyai.sdk.domain import DatasetConfig\n",
"\n",
"# Dataset config\n",
"config = DatasetConfig(\n",
" name=\"airlines_example_dataset\",\n",
" description=\"navigate to https://github.com/mostly-ai/public-demo-data/tree/dev/airlines and download the flight.csv file\",\n",
")\n",
"\n",
"dataset = mostly.datasets.create(config=config)"
]
},
{
"cell_type": "markdown",
"id": "ce46c6a3",
"metadata": {},
"source": [
"You can now access this Dataset on the [MOSTLY AI Platform](https://app.mostly.ai) or via the SDK. For the latter, see [Retrieving a Dataset](#retrieving-a-dataset).\n",
"\n",
"On the MOSTLY AI Platform, click Explore to use the Dataset with [the Assistant](https://docs.mostly.ai/assistant). The Assistant can help you train a generator and generate synthetic data, or create artifacts like visualizations that you can share with anyone.\n",
"\n",
"![](./images/datasets-01.png)"
]
},
{
"cell_type": "markdown",
"id": "cd3edb19",
"metadata": {},
"source": [
"### Creating a Dataset from a file\n",
"\n",
"You can also create a Dataset from a file on your local file system. This tutorial uses fictitious user data created with [MOSTLY Mock](https://github.com/mostly-ai/mostlyai-mock) and available at `./data/mock_user_data.csv`.\n",
"\n",
"A Dataset created from a file still has a `name` and a `description`, but the `description` can now explain the data structure to other users or tell the Assistant how to handle the file or files.\n",
"\n",
"The `description` parameter accepts [Markdown](https://daringfireball.net/projects/markdown/) formatting."
]
},
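{
"cell_type": "markdown",
"id": "7a3e5b12",
"metadata": {},
"source": [
"If the column descriptions already live in code, the data dictionary table can be assembled programmatically instead of typed by hand. The helper below is a hypothetical illustration and not part of the SDK; it only builds a Markdown string that could be embedded in a `description`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e5b13",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical helper (not part of the SDK): render a data dictionary as a Markdown table\n",
"def data_dictionary_to_markdown(columns: dict[str, str]) -> str:\n",
"    \"\"\"Render a {column: description} mapping as a Markdown table.\"\"\"\n",
"    # Pad the first column to the longest column name (or the header, if longer)\n",
"    width = max([len(\"Column\"), *map(len, columns)])\n",
"    lines = [\n",
"        f\"| {'Column'.ljust(width)} | Description |\",\n",
"        f\"|{'-' * (width + 2)}|-------------|\",\n",
"    ]\n",
"    lines += [f\"| {name.ljust(width)} | {text} |\" for name, text in columns.items()]\n",
"    return \"\\n\".join(lines)\n",
"\n",
"\n",
"print(data_dictionary_to_markdown({\"user_id\": \"Unique user identifier\", \"email\": \"Realistic email address\"}))"
]
},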
{
"cell_type": "code",
"execution_count": null,
"id": "4e061f7b",
"metadata": {},
"outputs": [],
"source": [
"# Define the description with Markdown formatting applied\n",
"description = \"\"\"\n",
"## Mock User Data Dataset\n",
"\n",
"This dataset contains 1,000 rows of realistic mock user profiles for a consumer web application. \n",
"It includes demographic information, contact details, and account metadata, suitable for analytics, testing, or prototyping.\n",
"\n",
"### Data Dictionary\n",
"\n",
"| Column | Description |\n",
"|---------------------|-----------------------------------------------------------|\n",
"| user_id | Unique user identifier |\n",
"| first_name | User's first name |\n",
"| last_name | User's last name |\n",
"| email | Realistic email address |\n",
"| gender | Gender (male, female, non-binary) |\n",
"| date_of_birth | Date of birth (1950-01-01 to 2005-12-31) |\n",
"| country | Country of residence |\n",
"| city | City of residence |\n",
"| signup_date | Date the user signed up (2018-01-01 to 2023-12-31) |\n",
"| last_login | Last login date (2023-01-01 to 2023-12-31) |\n",
"| account_status | Account status (active, inactive, suspended) |\n",
"| subscription_type | Subscription type (free, basic, premium) |\n",
"| referral_source | How the user was referred (organic, ad, friend, other) |\n",
"| num_logins | Number of logins (0 to 500) |\n",
"| avg_session_minutes | Average session length in minutes (1.0 to 120.0) |\n",
"\"\"\"\n",
"\n",
"# Dataset config\n",
"config = DatasetConfig(\n",
" name=\"Mock User Data Dataset\",\n",
" description=description,\n",
")\n",
"\n",
"dataset = mostly.datasets.create(config=config)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a5941e87",
"metadata": {},
"outputs": [],
"source": [
"# Upload a locally stored file to the Dataset created in the previous step\n",
"dataset.upload_file(\"./data/mock_user_data.csv\")"
]
},
{
"cell_type": "markdown",
"id": "4c54a4bd",
"metadata": {},
"source": [
"As in the previous example, the Dataset has been created, and the Markdown formatting is rendered in its `description`.\n",
"\n",
"The Dataset can now be explored on the MOSTLY AI Platform!\n",
"\n",
"![](./images/datasets-02.png)"
]
},
{
"cell_type": "markdown",
"id": "0bcfc78d",
"metadata": {},
"source": [
"### Creating a Dataset from a Connector\n",
"\n",
"A [Connector](https://docs.mostly.ai/connectors) lets you connect to your own remote data sources from the MOSTLY AI ecosystem.\n",
"\n",
"A Dataset can reference a specific Connector that exists on your MOSTLY AI Platform instance, or one to which you have access (for example, a public Connector).\n",
"\n",
"For more information about creating a connector, refer to the [MOSTLY AI documentation](https://docs.mostly.ai/connectors/create)."
]
},
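{
"cell_type": "markdown",
"id": "7a3e5b14",
"metadata": {},
"source": [
"To look up the `id` of a Connector you have access to, one option is to list Connectors with the SDK. This is a sketch that assumes the `mostly` client initialized earlier and that each listed item exposes `id` and `name` attributes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e5b15",
"metadata": {},
"outputs": [],
"source": [
"# List accessible Connectors to find the id to reference in DatasetConfig\n",
"# (assumes `mostly` was created in the initialization cell above)\n",
"for connector in mostly.connectors.list():\n",
"    print(connector.id, connector.name)"
]
},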
{
"cell_type": "code",
"execution_count": null,
"id": "fa13aeee",
"metadata": {},
"outputs": [],
"source": [
"# Dataset config\n",
"config = DatasetConfig(\n",
" name=\"Baseball Folder Dataset\",\n",
" description=\"Dataset referencing all data in the 'baseball' folder of the specified connector.\",\n",
" connectors=[{\"id\": \"e43aa845-8d77-4cda-bc9e-10da9a1496a9\", \"locations\": [\"baseball\"]}],\n",
")\n",
"\n",
"dataset = mostly.datasets.create(config=config)"
]
},
{
"cell_type": "markdown",
"id": "bc5e7e8b",
"metadata": {},
"source": [
"We can explore the Dataset using the MOSTLY AI Assistant or use it further with the Synthetic Data SDK.\n",
"\n",
"![](./images/datasets-03.png)"
]
},
{
"cell_type": "markdown",
"id": "d1c139e5",
"metadata": {},
"source": [
"## Retrieving a Dataset\n",
"\n",
"Once a Dataset has been created, you can retrieve it with the SDK in order to work with it locally.\n",
"\n",
"In this section of the tutorial, we'll retrieve a public Dataset from MOSTLY AI and explore it locally."
]
},
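{
"cell_type": "markdown",
"id": "7a3e5b16",
"metadata": {},
"source": [
"The download in this section writes to `./data/`. Whether `download_file` creates missing parent directories is not stated here, so a cautious sketch is to create the directory up front using only the standard library."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e5b17",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"\n",
"# Ensure the target directory exists before downloading into it;\n",
"# exist_ok=True makes this safe to re-run\n",
"Path(\"./data\").mkdir(parents=True, exist_ok=True)"
]
},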
{
"cell_type": "code",
"execution_count": null,
"id": "c9e4dd29",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Define dataset ID and file to download\n",
"dataset_id = \"17da17c9-3606-423f-996b-4a458f5997b5\"\n",
"file_name = \"stations.parquet\"\n",
"local_path = \"./data/stations.parquet\"\n",
"\n",
"# Download the file from the dataset\n",
"dataset = mostly.datasets.get(dataset_id)\n",
"dataset.download_file(file_name, local_path)\n",
"\n",
"# Load and display the data\n",
"df = pd.read_parquet(local_path)\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"id": "73bbe756",
"metadata": {},
"source": [
"You can now access the locally stored copy of the [Meteostat weather station data](https://app.mostly.ai/d/datasets/17da17c9-3606-423f-996b-4a458f5997b5).\n",
"\n",
"This data can be used to train a [Generator](https://github.com/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb) with the Synthetic Data SDK or perform any other kind of analysis you wish."
]
}
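,
{
"cell_type": "markdown",
"id": "7a3e5b18",
"metadata": {},
"source": [
"As a rough sketch of that next step, the downloaded DataFrame can be passed to the SDK to train a Generator and draw a small synthetic sample. The cell assumes the `mostly` client and the `df` DataFrame from the cells above; the sample size of 100 is an arbitrary illustration, and training may take a while depending on your setup."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7a3e5b19",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: relies on `mostly` and `df` defined in earlier cells\n",
"g = mostly.train(data=df)\n",
"\n",
"# Generate a small synthetic sample (size chosen arbitrarily for illustration)\n",
"sd = mostly.generate(g, size=100)\n",
"synthetic_df = sd.data()\n",
"print(synthetic_df.head())"
]
}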
],
"metadata": {
"kernelspec": {
"display_name": "venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.18"
}
},
"nbformat": 4,
"nbformat_minor": 5
}