singlestore-labs · kesmit13 · Apr 17, 2024
diff --git a/notebooks/generative-ai-with-vertex/meta.toml b/notebooks/generative-ai-with-vertex/meta.toml
@@ -0,0 +1,9 @@
+[meta]
+title="Building a Generative AI Application with Vertex AI and SingleStoreDB"
+description="""\
+    Learn to build an AI application using Google Cloud's Vertex AI
+    and SingleStoreDB.
+    """
+icon="crystal-ball"
+tags=["ai"]
+destinations=["spaces"]
diff --git a/notebooks/generative-ai-with-vertex/notebook.ipynb b/notebooks/generative-ai-with-vertex/notebook.ipynb
@@ -0,0 +1,240 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "id": "53da8269-44c3-4299-9a8e-f911beb5661e",
+      "metadata": {},
+      "source": [
+        "<div id=\"singlestore-header\" style=\"display: flex; background-color: rgba(255, 167, 103, 0.25); padding: 5px;\">\n",
+        "    <div id=\"icon-image\" style=\"width: 90px; height: 90px;\">\n",
+        "        <img width=\"100%\" height=\"100%\" src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/header-icons/crystal-ball.png\" />\n",
+        "    </div>\n",
+        "    <div id=\"text\" style=\"padding: 5px; margin-left: 10px;\">\n",
+        "        <div id=\"badge\" style=\"display: inline-block; background-color: rgba(0, 0, 0, 0.15); border-radius: 4px; padding: 4px 8px; align-items: center; margin-top: 6px; margin-bottom: -2px; font-size: 80%\">SingleStore Notebooks</div>\n",
+        "        <h1 style=\"font-weight: 500; margin: 8px 0 0 4px;\">Building a Generative AI Application with Vertex AI and SingleStoreDB</h1>\n",
+        "    </div>\n",
+        "</div>"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Document Ingestion\n",
+        "\n",
+        "Welcome to this guide on building a generative AI application using Google Cloud's Vertex AI and SingleStoreDB. It walks through each step with instructions, code explanations, and best practices.\n",
+        "\n",
+        "## Overview\n",
+        "\n",
+        "Vertex AI, a Google Cloud product, offers an integrated suite of machine learning tools for building, deploying, and scaling AI models. SingleStoreDB is a fast, scalable, SQL-compliant relational database. By combining Vertex AI's machine learning capabilities with SingleStoreDB's storage and retrieval, we can build AI applications that respond to user queries in real time.\n",
+        "\n",
+        "### What You'll Learn\n",
+        "\n",
+        "- Setting up your environment with the necessary packages and credentials.\n",
+        "- Fetching and processing data to be used in our AI models.\n",
+        "- Storing and managing data efficiently using SingleStoreDB.\n",
+        "- Leveraging the power of Vertex AI for real-time data processing and insights.\n",
+        "- Building a retrieval-based QA system to answer user queries.\n",
+        "\n",
+        "### Prerequisites\n",
+        "\n",
+        "- Basic knowledge of Python programming.\n",
+        "- Familiarity with Google Cloud services and SQL databases.\n",
+        "- An active Google Cloud account.\n",
+        "- A hosted or self-managed SingleStoreDB instance.\n",
+        "\n",
+        "**Let's dive in and start building!**"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 1,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "%pip install --quiet google-cloud-aiplatform langchain github-clone\n",
+        "%pip install --quiet unstructured unstructured[pdf] pytesseract\n",
+        "%pip install --quiet singlestoredb"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Authentication\n",
+        "\n",
+        "The next step authenticates our session with Google Cloud. The cell below uses the `google.colab` helper, so it assumes a Colab runtime. Running it prompts you to log in with your Google Cloud credentials; follow the instructions to complete the login."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from google.colab import auth as google_auth\n",
+        "\n",
+        "google_auth.authenticate_user()"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Import modules"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Vertex AI\n",
+        "import vertexai\n",
+        "from google.cloud import aiplatform\n",
+        "from vertexai.language_models import TextEmbeddingModel, TextGenerationModel\n",
+        "\n",
+        "# Langchain\n",
+        "from langchain.llms import VertexAI\n",
+        "from langchain.vectorstores import SingleStoreDB"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Obtaining a dataset\n",
+        "\n",
+        "The following dataset is composed of public data provided by the IRS regarding the 2023 tax season.\n",
+        "\n",
+        "You can download the dataset to your computer and explore it by following [this link](https://drive.google.com/file/d/1mdDHBnSWwDbMo2xyRk9gxUAswhyb9uKw/view?usp=drive_link).\n",
+        "\n",
+        "After the dataset is downloaded, the contents will be ingested into SingleStore.\n",
+        "\n",
+        "Document processing includes chunking the documents with LangChain's text splitters and generating embeddings with the Vertex AI `textembedding-gecko@003` model."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "# Download the sample tax documents and unpack them into ./dataset.\n",
+        "# -p and -o keep the cell re-runnable: mkdir and unzip won't fail or prompt.\n",
+        "\n",
+        "FILE_URL = \"https://github.com/datagabe/hollywood/raw/main/sample_tax_information.zip\"\n",
+        "\n",
+        "!wget {FILE_URL} -O dataset.zip\n",
+        "!mkdir -p dataset\n",
+        "!unzip -o dataset.zip -d dataset"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Loading Data from a Directory\n",
+        "\n",
+        "Once the dataset has been downloaded and unzipped, use LangChain's `DirectoryLoader` to load the documents so they can be chunked and ingested into SingleStoreDB."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "import unstructured\n",
+        "from langchain.document_loaders import DirectoryLoader\n",
+        "\n",
+        "loader = DirectoryLoader('dataset')\n",
+        "\n",
+        "docs = loader.load()"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Splitting the Data\n",
+        "\n",
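+        "To process the data more efficiently, we'll split the loaded content into smaller chunks. The `RecursiveCharacterTextSplitter` class does this by splitting on progressively finer separators until each chunk fits the specified character limit (here 2,000 characters, with a 50-character overlap between neighboring chunks)."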
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+        "\n",
+        "text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=50)\n",
+        "all_splits = text_splitter.split_documents(docs)"
+      ]
+    },
+    {
+      "attachments": {},
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "## Setting Up SingleStoreDB with Vertex AI Embeddings\n",
+        "\n",
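+        "For efficient storage and retrieval of our data, we use SingleStoreDB in conjunction with Vertex AI embeddings. The following cell initializes Vertex AI, generates an embedding for each chunk, and ingests the chunks into SingleStoreDB. It assumes the database connection is already configured; LangChain's `SingleStoreDB` vector store can read it from the `SINGLESTOREDB_URL` environment variable."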
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 7,
+      "metadata": {},
+      "outputs": [],
+      "source": [
+        "from langchain.embeddings import VertexAIEmbeddings\n",
+        "\n",
+        "# Init Vertex AI Platform (set project to your Google Cloud project ID)\n",
+        "aiplatform.init(project=\"\", location=\"us-central1\")\n",
+        "\n",
+        "# Generate embeddings and ingest documents\n",
+        "vectorstore = SingleStoreDB.from_documents(documents=all_splits, embedding=VertexAIEmbeddings(model_name=\"textembedding-gecko@003\"))"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "id": "2061618b-db57-4f41-a856-2d7ce69f5025",
+      "metadata": {},
+      "source": [
+        "<div id=\"singlestore-footer\" style=\"background-color: rgba(194, 193, 199, 0.25); height:2px; margin-bottom:10px\"></div>\n",
+        "<div><img src=\"https://raw.githubusercontent.com/singlestore-labs/spaces-notebooks/master/common/images/singlestore-logo-grey.png\" style=\"padding: 0px; margin: 0px; height: 24px\"/></div>"
+      ]
+    }
+  ],
+  "metadata": {
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.6"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 4
+}
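
To make the notebook's splitting step concrete, here is a minimal pure-Python sketch of what `chunk_size` and `chunk_overlap` mean. It is a simplified fixed-window splitter written for illustration, not LangChain's `RecursiveCharacterTextSplitter` (which also tries to break on separators such as paragraphs and lines). The function name `split_with_overlap` is hypothetical; the defaults mirror the values used in the notebook.

```python
def split_with_overlap(text: str, chunk_size: int = 2000, chunk_overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary appears in both neighboring chunks.
    Illustrative sketch only; not LangChain's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "abcdefghij" * 3  # 30 characters
    for chunk in split_with_overlap(sample, chunk_size=12, chunk_overlap=2):
        print(len(chunk), chunk)
```

A larger overlap lowers the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing and embedding more duplicated text.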
diff --git a/notebooks/rag-example/meta.toml b/notebooks/rag-example/meta.toml
@@ -0,0 +1,9 @@
+[meta]
+title="Using RAG with SingleStoreDB"
+description="""\
+    Leverage the RAG pattern in the context of the generative AI
+    lifecycle patterns.
+    """
+icon="crystal-ball"
+tags=["ai"]
+destinations=["spaces"]