mostly-ai · mostlyken · Sep 17, 2025 · Sep 17, 2025 · Sep 18, 2025 · Sep 18, 2025
@@ -27,3 +27,4 @@ Explore synthetic data tutorials with the option to run them **either in Google
 | Enrich Sensitive Data with LLMs using Synthetic Replicas      | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/synthetic-enrich/synthetic-enrich.ipynb)           | [View Notebook](./tutorials/synthetic-enrich/synthetic-enrich.ipynb)           |
 | MOSTLY AI vs. SDV comparison: single-table scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb)           | [View Notebook](./tutorials/sdv-comparison/single-table-scenario/single-table-scenario.ipynb)           |
 | MOSTLY AI vs. SDV comparison: sequential scenario | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb)           | [View Notebook](./tutorials/sdv-comparison/sequential-scenario/sequential-scenario.ipynb)           |
+| Creating and Using Datasets | [![Run on Colab](https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab)](https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/using-datasets/creating-and-using-datasets.ipynb)           | [View Notebook](./tutorials/using-datasets/creating-and-using-datasets.ipynb)           |
@@ -0,0 +1,288 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2f1c82f1",
+   "metadata": {},
+   "source": [
+    "# Creating and Using Datasets <a href=\"https://colab.research.google.com/github/mostly-ai/mostlyai/blob/main/docs/tutorials/using-datasets/creating-and-using-datasets.ipynb\" target=\"_blank\"><img src=\"https://img.shields.io/badge/Open%20in-Colab-blue?logo=google-colab\" alt=\"Run on Colab\"></a>\n",
+    "\n",
+    "In this notebook, we demonstrate the creation and usage of Datasets using the Synthetic Data SDK.\n",
+    "\n",
+    "Full Datasets endpoints documentation is available in the [API documentation](https://api-docs.mostly.ai/#ee628a4d-afb6-4d44-96dd-a86e1b345f95)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "002d715f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install SDK in CLIENT mode\n",
+    "!uv pip install -U mostlyai\n",
+    "# Or install in LOCAL mode\n",
+    "!uv pip install -U 'mostlyai[local]'\n",
+    "# Note: Restart kernel session after installation!"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e06ff1a1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from mostlyai.sdk import MostlyAI\n",
+    "\n",
+    "# Get your api key from https://app.mostly.ai/settings/api-keys\n",
+    "api_key = \"my-api-key\"\n",
+    "\n",
+    "# Initialize the SDK in CLIENT mode (an API key is supplied)\n",
+    "mostly = MostlyAI(api_key=api_key)"
+   ]
+  },
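+  {
+   "cell_type": "markdown",
+   "id": "7a3e1c90",
+   "metadata": {},
+   "source": [
+    "Hard-coding an API key in a notebook makes it easy to leak by accident. As a minimal sketch, the key could instead be read from an environment variable and then passed to `MostlyAI(api_key=...)` as in the cell above. The variable name `MOSTLY_API_KEY` and the helper `resolve_api_key` are assumptions made for this illustration; adjust them to your own setup."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5d2b8f46",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "\n",
+    "def resolve_api_key(env_var=\"MOSTLY_API_KEY\", fallback=None):\n",
+    "    \"\"\"Return the key stored in env_var, or the fallback if it is unset (hypothetical helper).\"\"\"\n",
+    "    key = os.environ.get(env_var) or fallback\n",
+    "    if not key:\n",
+    "        raise ValueError(f\"No API key found: set {env_var} or pass a fallback\")\n",
+    "    return key\n",
+    "\n",
+    "\n",
+    "# Falls back to the placeholder key used above when the variable is not set\n",
+    "api_key = resolve_api_key(fallback=\"my-api-key\")"
+   ]
+  },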
+  {
+   "cell_type": "markdown",
+   "id": "ebbaa63d",
+   "metadata": {},
+   "source": [
+    "## Creating a Dataset\n",
+    "\n",
+    "Every Dataset requires at least a `name` and a `description`. The `name` identifies the Dataset to other users, while the `description` tells users or the Assistant how to handle the Dataset (for example, by naming a remote location from which the Dataset files can be downloaded). See [Example Descriptions](#example-descriptions).\n",
+    "\n",
+    "A Dataset can optionally include either of the following:\n",
+    "- Files: remotely stored files containing the underlying dataset as well as any supporting artifacts. See [Creating a Dataset from a file](#creating-a-dataset-from-a-file).\n",
+    "- Connector: the [Connector](https://mostly-ai.github.io/mostlyai/#data-connectors) asset that provides access to the target dataset. See [Creating a Dataset from a Connector](#creating-a-dataset-from-a-connector).\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05ebdb7e",
+   "metadata": {},
+   "source": [
+    "### Creating a Dataset from a description\n",
+    "\n",
+    "A dataset created with just a `description` can contain instructions for users or the MOSTLY AI Assistant. Consider the following example, whose description instructs the Assistant to download data from a remote location on the public internet."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a33bf0c9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from mostlyai.sdk.domain import DatasetConfig\n",
+    "\n",
+    "# Dataset config\n",
+    "config = DatasetConfig(\n",
+    "    name=\"airlines_example_dataset\",\n",
+    "    description=\"navigate to https://github.com/mostly-ai/public-demo-data/tree/dev/airlines and download the flight.csv file\",\n",
+    ")\n",
+    "\n",
+    "dataset = mostly.datasets.create(config=config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce46c6a3",
+   "metadata": {},
+   "source": [
+    "You can now access this dataset on the [MOSTLY AI Platform](https://app.mostly.ai) or via the SDK. For the latter, see [Retrieving a Dataset](#retrieving-a-dataset).\n",
+    "\n",
+    "On the MOSTLY AI Platform, click Explore to use the Dataset with [the Assistant](https://docs.mostly.ai/assistant). The Assistant can help you train a generator and generate synthetic data, or create artifacts like visualizations that you can share with anyone.\n",
+    "\n",
+    "![](./images/datasets-01.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cd3edb19",
+   "metadata": {},
+   "source": [
+    "### Creating a Dataset from a file\n",
+    "\n",
+    "You can also create a dataset from a file on your local file system. This tutorial uses fictitious user data created with [MOSTLY Mock](https://github.com/mostly-ai/mostlyai-mock) and available at `./data/mock_user_data.csv`.\n",
+    "\n",
+    "A dataset created from a file still has a `name` and a `description`, but the `description` can now explain the data structure to other users or tell the Assistant how to handle the file or files.\n",
+    "\n",
+    "The `description` parameter accepts [Markdown](https://daringfireball.net/projects/markdown/) syntax for styling and formatting."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4e061f7b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the description with Markdown formatting applied\n",
+    "description = \"\"\"\n",
+    "## Mock User Data Dataset\n",
+    "\n",
+    "This dataset contains 1,000 rows of realistic mock user profiles for a consumer web application.  \n",
+    "It includes demographic information, contact details, and account metadata, suitable for analytics, testing, or prototyping.\n",
+    "\n",
+    "### Data Dictionary\n",
+    "\n",
+    "| Column              | Description                                               |\n",
+    "|---------------------|-----------------------------------------------------------|\n",
+    "| user_id             | Unique user identifier                                    |\n",
+    "| first_name          | User's first name                                         |\n",
+    "| last_name           | User's last name                                          |\n",
+    "| email               | Realistic email address                                   |\n",
+    "| gender              | Gender (male, female, non-binary)                         |\n",
+    "| date_of_birth       | Date of birth (1950-01-01 to 2005-12-31)                  |\n",
+    "| country             | Country of residence                                      |\n",
+    "| city                | City of residence                                         |\n",
+    "| signup_date         | Date the user signed up (2018-01-01 to 2023-12-31)        |\n",
+    "| last_login          | Last login date (2023-01-01 to 2023-12-31)                |\n",
+    "| account_status      | Account status (active, inactive, suspended)              |\n",
+    "| subscription_type   | Subscription type (free, basic, premium)                  |\n",
+    "| referral_source     | How the user was referred (organic, ad, friend, other)    |\n",
+    "| num_logins          | Number of logins (0 to 500)                               |\n",
+    "| avg_session_minutes | Average session length in minutes (1.0 to 120.0)          |\n",
+    "\"\"\"\n",
+    "\n",
+    "# Dataset config\n",
+    "config = DatasetConfig(\n",
+    "    name=\"Mock User Data Dataset\",\n",
+    "    description=description,\n",
+    ")\n",
+    "\n",
+    "dataset = mostly.datasets.create(config=config)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a5941e87",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Upload a locally stored file to the Dataset created in the previous step\n",
+    "dataset.upload_file(\"./data/mock_user_data.csv\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4c54a4bd",
+   "metadata": {},
+   "source": [
+    "As in the previous example, the Dataset has been created, and the Markdown formatting is visible in its `description`.\n",
+    "\n",
+    "The Dataset can now be explored on the MOSTLY AI Platform!\n",
+    "\n",
+    "![](./images/datasets-02.png)"
+   ]
+  },
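+  {
+   "cell_type": "markdown",
+   "id": "c4f09b17",
+   "metadata": {},
+   "source": [
+    "The data dictionary in the description above was written by hand. As a rough sketch, a Markdown table of the same shape could also be generated from a column-to-description mapping. The helper `data_dictionary_markdown` is hypothetical and not part of the SDK; it only builds a string that could then be embedded in a `description`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9e61d3a2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def data_dictionary_markdown(columns):\n",
+    "    \"\"\"Render a {column: description} mapping as a two-column Markdown table (hypothetical helper).\"\"\"\n",
+    "    lines = [\"| Column | Description |\", \"|---|---|\"]\n",
+    "    for name, text in columns.items():\n",
+    "        lines.append(f\"| {name} | {text} |\")\n",
+    "    return \"\\n\".join(lines)\n",
+    "\n",
+    "\n",
+    "# Two columns taken from the data dictionary above\n",
+    "table = data_dictionary_markdown(\n",
+    "    {\n",
+    "        \"user_id\": \"Unique user identifier\",\n",
+    "        \"email\": \"Realistic email address\",\n",
+    "    }\n",
+    ")\n",
+    "print(table)"
+   ]
+  },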
+  {
+   "cell_type": "markdown",
+   "id": "0bcfc78d",
+   "metadata": {},
+   "source": [
+    "### Creating a Dataset from a Connector\n",
+    "\n",
+    "A [Connector](https://docs.mostly.ai/connectors) lets you connect to your own remote data sources from the MOSTLY AI ecosystem.\n",
+    "\n",
+    "Datasets can be created with reference to a specific connector that exists on your MOSTLY AI Platform instance or one to which you have access (for example, a public connector).\n",
+    "\n",
+    "For more information about creating a connector, refer to the [MOSTLY AI documentation](https://docs.mostly.ai/connectors/create)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fa13aeee",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Dataset config\n",
+    "config = DatasetConfig(\n",
+    "    name=\"Baseball Folder Dataset\",\n",
+    "    description=\"Dataset referencing all data in the 'baseball' folder of the specified connector.\",\n",
+    "    connectors=[{\"id\": \"e43aa845-8d77-4cda-bc9e-10da9a1496a9\", \"locations\": [\"baseball\"]}],\n",
+    ")\n",
+    "\n",
+    "dataset = mostly.datasets.create(config=config)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bc5e7e8b",
+   "metadata": {},
+   "source": [
+    "We can explore the Dataset using the MOSTLY AI Assistant or use it further with the Synthetic Data SDK.\n",
+    "\n",
+    "![](./images/datasets-03.png)"
+   ]
+  },
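+  {
+   "cell_type": "markdown",
+   "id": "3b8a6e5f",
+   "metadata": {},
+   "source": [
+    "The connector entry passed to `DatasetConfig` above is a plain dictionary with an `id` and a list of `locations`. As a small sketch, the entry could be built by a helper that checks the ID is a well-formed UUID before any request is sent. The helper `connector_reference` is hypothetical; only the dictionary shape is taken from the example above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e2740c1d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import uuid\n",
+    "\n",
+    "\n",
+    "def connector_reference(connector_id, locations):\n",
+    "    \"\"\"Build a connector entry shaped like the one used above (hypothetical helper).\"\"\"\n",
+    "    uuid.UUID(connector_id)  # raises ValueError if the ID is not a valid UUID\n",
+    "    return {\"id\": connector_id, \"locations\": list(locations)}\n",
+    "\n",
+    "\n",
+    "# Same connector ID and location as in the example above\n",
+    "ref = connector_reference(\"e43aa845-8d77-4cda-bc9e-10da9a1496a9\", [\"baseball\"])\n",
+    "print(ref)"
+   ]
+  },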
+  {
+   "cell_type": "markdown",
+   "id": "d1c139e5",
+   "metadata": {},
+   "source": [
+    "## Retrieving a Dataset\n",
+    "\n",
+    "Once a dataset has been created, you can retrieve it with the SDK in order to work with it locally. \n",
+    "\n",
+    "In this section of the tutorial, we'll retrieve a public dataset from MOSTLY AI and explore it locally."
+   ]
+  },
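+  {
+   "cell_type": "markdown",
+   "id": "f18d42b6",
+   "metadata": {},
+   "source": [
+    "The download below writes to a file under `./data/`. Whether the SDK creates missing directories is not stated in this tutorial, so a cautious sketch is to create the target directory first using only the Python standard library."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0c95ab73",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "# Create the parent directory of the download target if it does not exist yet\n",
+    "target = Path(\"./data/stations.parquet\")\n",
+    "target.parent.mkdir(parents=True, exist_ok=True)\n",
+    "print(target.parent.is_dir())"
+   ]
+  },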
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c9e4dd29",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "# Define dataset ID and file to download\n",
+    "dataset_id = \"17da17c9-3606-423f-996b-4a458f5997b5\"\n",
+    "file_name = \"stations.parquet\"\n",
+    "local_path = \"./data/stations.parquet\"\n",
+    "\n",
+    "# Download the file from the dataset\n",
+    "dataset = mostly.datasets.get(dataset_id)\n",
+    "dataset.download_file(file_name, local_path)\n",
+    "\n",
+    "# Load and display the data\n",
+    "df = pd.read_parquet(local_path)\n",
+    "print(df.head())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "73bbe756",
+   "metadata": {},
+   "source": [
+    "You can now access the locally stored copy of the [Meteostat weather station data](https://app.mostly.ai/d/datasets/17da17c9-3606-423f-996b-4a458f5997b5).\n",
+    "\n",
+    "This data can be used to train a [Generator](https://github.com/mostly-ai/mostlyai/blob/main/docs/tutorials/getting-started/getting-started.ipynb) with the Synthetic Data SDK or perform any other kind of analysis you wish."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.18"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}