diff --git a/Getting Started with Snowflake Cortex ML-Based Functions/Getting Started with Snowflake Cortex ML-Based Functions.ipynb b/Getting Started with Snowflake Cortex ML-Based Functions/Getting Started with Snowflake Cortex ML-Based Functions.ipynb index e260e61..57225d7 100644 --- a/Getting Started with Snowflake Cortex ML-Based Functions/Getting Started with Snowflake Cortex ML-Based Functions.ipynb +++ b/Getting Started with Snowflake Cortex ML-Based Functions/Getting Started with Snowflake Cortex ML-Based Functions.ipynb @@ -1,770 +1,793 @@ { - "metadata": { - "kernelspec": { - "display_name": "Streamlit Notebook", - "name": "streamlit" - } - }, - "nbformat_minor": 5, - "nbformat": 4, - "cells": [ - { - "cell_type": "markdown", - "id": "3aac5b2e-9939-4b2d-a088-5472570707c4", - "metadata": { - "name": "cell1", - "collapsed": false - }, - "source": "# Getting Started with Snowflake Cortex ML-Based Functions\n\n## Overview \n\nOne of the most critical activities that a Data/Business Analyst has to perform is to produce recommendations to their business stakeholders based upon the insights they have gleaned from their data. In practice, this means that they are often required to build models to make forecasts, identify long-running trends, and identify abnormalities within their data. However, Analysts are often impeded from creating the best models possible due to the depth of statistical and machine learning knowledge required to implement them in practice. Further, Python or other programming frameworks may be unfamiliar to Analysts who write SQL, and the nuances of fine-tuning a model may require expert knowledge that is out of reach. \n\nFor these use cases, Snowflake has developed a set of SQL-based ML Functions that implement machine learning models on the user's behalf. As of December 2023, three ML Functions are available for time-series data:\n\n1. Forecasting: which enables users to forecast a metric based on past values. 
Common use cases for forecasting include predicting future sales, demand for particular SKUs, or the volume of traffic to a website over a period of time.\n2. Anomaly Detection: which flags anomalous values using both unsupervised and supervised learning methods. This may be useful when you want to identify spikes in your cloud spend, abnormal data points in logs, and more.\n3. Contribution Explorer: which enables users to perform root cause analysis to determine the most significant drivers of a particular metric of interest. \n\nFor further details on ML Functions, please refer to the [Snowflake documentation](https://docs.snowflake.com/guides-overview-analysis). \n\n### Prerequisites\n- Working knowledge of SQL\n- A Snowflake account login with an ACCOUNTADMIN role. If not, you will need to use a different role that can create databases, schemas, tables, stages, tasks, email integrations, and stored procedures. \n\n### What You\u2019ll Learn \n- How to use the Anomaly Detection & Forecasting ML Functions to create models and produce predictions\n- How to use Tasks to retrain models on a regular cadence\n- How to use the [email notification integration](https://docs.snowflake.com/en/user-guide/email-stored-procedures) to send email reports of the model results after completion \n\n### What You\u2019ll Build \nThis Quickstart is designed to help you get up to speed with both the Forecasting and Anomaly Detection ML Functions. \nWe will work through an example using data from a fictitious food truck company, Tasty Bytes, to first create a forecasting model to predict the demand for each menu item that Tasty Bytes sells in Vancouver. Predicting this demand is important to Tasty Bytes, as it allows them to plan ahead and get enough of the raw ingredients to fulfill customer demand. 
\n\nWe will start with a single food item, then scale this up to all the items in Vancouver and add additional data points, like holidays, to see if they can improve the model's performance. Then, to surface trending food items, we will build an anomaly detection model to understand whether certain items have been selling anomalously. We will wrap up this Quickstart by showcasing how you can use Tasks to schedule your model training process, and use the email notification integration to send out a report on trending food items. \n\nLet's get started!" - }, - { - "cell_type": "markdown", - "id": "29090d0b-7020-4cc1-b1b4-adc556d77348", - "metadata": { - "name": "cell2", - "collapsed": false - }, - "source": "## Setting Up Data in Snowflake\n\n### Overview:\nYou will use a Snowflake Notebook to: \n- Create Snowflake objects (i.e., warehouse, database, schema, etc.)\n- Ingest sales data from S3 and load it into a Snowflake table\n- Access holiday data from the Snowflake Marketplace (or load it from S3). " - }, - { - "cell_type": "markdown", - "id": "f0e98da4-358f-45d6-94d0-be434f62ebf4", - "metadata": { - "name": "cell3", - "collapsed": false - }, - "source": "\n### Step 1: Loading Holiday Data from an S3 Bucket\n\nNote that you can perform this step by following [the instructions here](https://quickstarts.snowflake.com/guide/ml_forecasting_ad/index.html?index=..%2F..index#1) to access the dataset on the Snowflake Marketplace. For simplicity, this demo loads the dataset from an S3 bucket." - }, - { - "cell_type": "code", - "id": "8d50cbf4-0c8d-4950-86cb-114990437ac9", - "metadata": { - "language": "sql", - "name": "cell4", - "collapsed": false, - "codeCollapsed": false - }, - "source": "-- Load data for use in this demo. 
\n-- Create a csv file format: \nCREATE OR REPLACE FILE FORMAT csv_ff\n type = 'csv'\n SKIP_HEADER = 1,\n COMPRESSION = AUTO;", - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "code", - "id": "5e0e32db-3b00-4071-be00-4bc0e9f5a344", - "metadata": { - "language": "sql", - "name": "cell5", - "collapsed": false - }, - "outputs": [], - "source": "-- Create an external stage pointing to s3, to load your data. \nCREATE OR REPLACE STAGE s3load \n COMMENT = 'Quickstart S3 Stage Connection'\n url = 's3://sfquickstarts/notebook_demos/frostbyte_tastybytes/'\n file_format = csv_ff;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "00095f04-38ec-479d-83a3-2ac6b82662df", - "metadata": { - "language": "sql", - "name": "cell6", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "LS @s3load;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "7e5ae191-2af7-49b1-b79f-b18ff1a8e99c", - "metadata": { - "language": "sql", - "name": "cell7", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "-- Define your table.\nCREATE OR REPLACE TABLE PUBLIC_HOLIDAYS(\n \tDATE DATE,\n\tHOLIDAY_NAME VARCHAR(16777216),\n\tIS_FINANCIAL BOOLEAN\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "e03e845b-300f-4a94-8ce7-b729ed4d316e", - "metadata": { - "language": "sql", - "name": "cell8", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "-- Ingest data from s3 into your table.\nCOPY INTO PUBLIC_HOLIDAYS FROM @s3load/holidays.csv;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "e71c170c-7bca-40e2-a60a-b7df07e01293", - "metadata": { - "language": "sql", - "name": "cell9", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "SELECT * from PUBLIC_HOLIDAYS;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "9d3a5d8a-fff8-4033-9ade-a0995fdecbe4", - "metadata": { - "name": 
"cell10", - "collapsed": false - }, - "source": "### Step 2: Creating Objects, Loading Data, & Setting Up Tables\n\nRun the following SQL commands to create the required Snowflake objects and ingest sales data from S3. " - }, - { - "cell_type": "code", - "id": "9994c336-01e2-466f-b34f-fbf66525e2d6", - "metadata": { - "language": "sql", - "name": "cell11", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create an external stage pointing to s3, to load your data. \nCREATE OR REPLACE STAGE s3load \n COMMENT = 'Quickstart S3 Stage Connection'\n url = 's3://sfquickstarts/frostbyte_tastybytes/mlpf_quickstart/'\n file_format = csv_ff;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "91774fde-c76d-4b1e-8d1a-021746b54830", - "metadata": { - "language": "sql", - "name": "cell12", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Define your table.\nCREATE OR REPLACE TABLE tasty_byte_sales(\n \tDATE DATE,\n\tPRIMARY_CITY VARCHAR(16777216),\n\tMENU_ITEM_NAME VARCHAR(16777216),\n\tTOTAL_SOLD NUMBER(17,0)\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "21c3eb38-6a62-4c42-af34-9b060d1f0821", - "metadata": { - "language": "sql", - "name": "cell13", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Ingest data from s3 into your table.\nCOPY INTO tasty_byte_sales FROM @s3load/ml_functions_quickstart.csv;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "3fbcb3fe-47a9-4315-b72b-b45ac41f7ab5", - "metadata": { - "language": "sql", - "name": "cell14", - "codeCollapsed": false - }, - "outputs": [], - "source": "-- View a sample of the ingested data: \nSELECT * FROM tasty_byte_sales LIMIT 100;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "d580ae45-c6f7-4f36-970a-e5b170ac8eef", - "metadata": { - "name": "cell15", 
- "collapsed": false - }, - "source": "At this point, we have all the data we need to start building models. Let's get started with our first forecasting model. \n\n## Forecasting Demand for Lobster Mac & Cheese\n\nWe will start off by first building a forecasting model to predict the demand for Lobster Mac & Cheese in Vancouver.\n\n\n### Step 1: Visualize Daily Sales on Snowsight\n\nBefore building our model, let's first visualize our data to get a feel for what daily sales look like. Run the following SQL command, and then visualize the results in the next cell.\n" - }, - { - "cell_type": "code", - "id": "a5689582-eec1-46d9-908e-ef88ca3c6d2a", - "metadata": { - "language": "sql", - "name": "cell16", - "collapsed": false - }, - "outputs": [], - "source": "-- Query a sample of the ingested data\nSELECT *\n FROM tasty_byte_sales\n WHERE menu_item_name LIKE 'Lobster Mac & Cheese';", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "2ca817f0-77e6-47f9-8e98-397a6badadd6", - "metadata": { - "name": "cell17", - "collapsed": false - }, - "source": "We can plot the daily sales for the item Lobster Mac & Cheese going back all the way to 2014." - }, - { - "cell_type": "code", - "id": "b4d3e0c1-7941-423c-982a-39201eb3d92a", - "metadata": { - "language": "python", - "name": "cell18", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "# Reference the SQL cell above by its cell name, and convert the result to a pandas DataFrame\nimport altair as alt\n\ndf = cell16.to_pandas()\nalt.Chart(df).mark_line().encode(\n x = \"DATE\",\n y = \"TOTAL_SOLD\"\n)", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "fb69d629-eb18-4cf5-ad4d-026e26a701c3", - "metadata": { - "name": "cell19", - "collapsed": false - }, - "source": "Observing the chart, we can see that sales appear to follow a yearly seasonal pattern. 
This is an important consideration when building robust forecasting models: we want to make sure we feed in enough training data to represent at least one full cycle of the time series we are modeling. The forecasting ML function can automatically identify and handle multiple seasonality patterns, so we will go ahead and use the latest year's worth of data as input to our model. In the query below, we will also convert the date column using the `to_timestamp_ntz` function, so that it can be used in the forecasting function. " - }, - { - "cell_type": "code", - "id": "46a61a60-0f32-4875-a6cb-79f52fcc47cb", - "metadata": { - "language": "sql", - "name": "cell20", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create a table containing the latest year's worth of sales data: \nCREATE OR REPLACE TABLE vancouver_sales AS (\n SELECT\n to_timestamp_ntz(date) as timestamp,\n primary_city,\n menu_item_name,\n total_sold\n FROM\n tasty_byte_sales\n WHERE\n date > (SELECT max(date) - interval '1 year' FROM tasty_byte_sales)\n GROUP BY\n all\n);", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "08184365-5247-424a-ae58-7cfe54acc448", - "metadata": { - "name": "cell21", - "collapsed": false - }, - "source": "\n### Step 2: Creating our First Forecasting Model: Lobster Mac & Cheese\n\nWe can use SQL to directly call the forecasting ML function. Under the hood, the forecasting ML function automatically takes care of many of the data science best practices that are required to build good models. This includes performing hyper-parameter tuning, adjusting for missing data, and creating new features. We will build our first forecasting model below, for only the Lobster Mac & Cheese menu item. 
\n" - }, - { - "cell_type": "code", - "id": "7074d117-4b8c-4ed7-825d-4e50a40570ab", - "metadata": { - "language": "sql", - "name": "cell22", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create view for lobster sales\nCREATE OR REPLACE VIEW lobster_sales AS (\n SELECT\n timestamp,\n total_sold\n FROM\n vancouver_sales\n WHERE\n menu_item_name LIKE 'Lobster Mac & Cheese'\n);\n", - "execution_count": null - }, - { - "cell_type": "code", - "id": "1e8c21b1-6279-435b-ae23-7010f9a471eb", - "metadata": { - "language": "sql", - "name": "cell23", - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Build Forecasting model; this could take ~15-25 secs; please be patient\nCREATE OR REPLACE forecast lobstermac_forecast (\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'lobster_sales'),\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD'\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "1c3a97a5-dcbb-41f8-b471-aa19f73264a4", - "metadata": { - "language": "sql", - "name": "cell24", - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Show models to confirm training has completed\nSHOW forecast;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "4617ee0c-041e-4389-97c2-d8b4b055d62d", - "metadata": { - "name": "cell25", - "collapsed": false - }, - "source": "In the steps above, we create a view containing the relevant daily sales for our Lobster Mac & Cheese item, which we pass to the forecast function. The last step should confirm that the model has been created and is ready to produce predictions. \n" - }, - { - "cell_type": "markdown", - "id": "c5e40a4b-3b7c-4f1a-a267-0b5b41c62c6a", - "metadata": { - "name": "cell26", - "collapsed": false - }, - "source": "### Step 3: Creating and Visualizing Predictions\n\nLet's now use our trained `lobstermac_forecast` model to create predictions for the demand for the next 10 days. 
\n" - }, - { - "cell_type": "code", - "id": "e6505815-b48a-4be1-aaf9-653b4e6e36ca", - "metadata": { - "language": "sql", - "name": "cell27", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "-- Create predictions, and save results to a table: \nCALL lobstermac_forecast!FORECAST(FORECASTING_PERIODS => 10);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "cdf65508-5b09-4ec4-8bc3-156a17714d53", - "metadata": { - "language": "sql", - "name": "cell28", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "-- Store the results of the cell above as a table\nCREATE OR REPLACE TABLE macncheese_predictions AS (\n SELECT * FROM {{cell27}}\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "89b4caa3-9b8f-48a9-bfaa-6c65825ad3df", - "metadata": { - "language": "sql", - "name": "cell29", - "codeCollapsed": false, - "collapsed": false - }, - "outputs": [], - "source": "-- Visualize the results, overlaid on top of one another: \nSELECT\n timestamp,\n total_sold,\n NULL AS forecast\nFROM\n lobster_sales\nWHERE\n timestamp > '2023-03-01'\nUNION\nSELECT\n TS AS timestamp,\n NULL AS total_sold,\n forecast\nFROM\n macncheese_predictions\nORDER BY\n timestamp asc;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "36e67d30-4f29-4fac-8855-24225ef6ce94", - "metadata": { - "language": "python", - "name": "cell30", - "codeCollapsed": false - }, - "outputs": [], - "source": "import pandas as pd\ndf = cells.cell29.to_pandas()\ndf = pd.melt(df,id_vars=[\"TIMESTAMP\"],value_vars=[\"TOTAL_SOLD\",\"FORECAST\"])\ndf = df.replace({\"TOTAL_SOLD\":\"ACTUAL\"})\ndf.columns = [\"TIMESTAMP\",\"TYPE\", \"AMOUNT SOLD\"]\n\nimport altair as alt\nalt.Chart(df).mark_line().encode(\n x = \"TIMESTAMP\",\n y = \"AMOUNT SOLD\",\n color = \"TYPE\"\n)", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "7a0c80e5-9a3e-454d-a41a-bc7d9e66cbf1", - "metadata": { - "name": "cell31", - 
"collapsed": false - }, - "source": "There we have it! We just created our first set of predictions for the next 10 days worth of demand, which can be used to inform how much inventory of raw ingredients we may need. As shown in the visualization above, there also appears to be a weekly trend in the items sold, which the model was able to pick up on. \n\n**Note:** You may notice that your chart represents null values as 0s. Make sure to select the 'none' aggregation for each of the columns to reproduce the chart. Additionally, your visualization may look different based on the version of the ML forecast function you call. The above chart was created with **version 7.0**.\n" - }, - { - "cell_type": "markdown", - "id": "abc163cd-f544-4aa2-bceb-18b7fa7ba3f8", - "metadata": { - "name": "cell32", - "collapsed": false - }, - "source": "### Step 4: Understanding Forecasting Output & Configuration Options\n\nIf we have a look at the prediction results, we can see that the following columns are output: \n\n1. TS: The timestamp for the forecast prediction\n2. FORECAST: The output/prediction made by the model\n3. LOWER_BOUND/UPPER_BOUND: Separate columns that specify the [prediction interval](https://en.wikipedia.org/wiki/Prediction_interval)\n\n\nThe forecast function exposes a `config_object` that allows you to control the output prediction interval. This value ranges from 0 to 1, with a larger value providing a wider range between the lower and upper bound. 
See below for an example of how to change this when producing inferences: \n" - }, - { - "cell_type": "code", - "id": "0ccc768a-aaf4-4323-8409-77bf941aee10", - "metadata": { - "language": "sql", - "name": "cell33", - "codeCollapsed": false - }, - "outputs": [], - "source": "CALL lobstermac_forecast!FORECAST(FORECASTING_PERIODS => 10, CONFIG_OBJECT => {'prediction_interval': .9});", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "7c1d28db-7b6a-42ee-958f-eeeab8f9f658", - "metadata": { - "name": "cell34", - "collapsed": false - }, - "source": "## Building Multiple Forecasts & Adding Holiday Information\n\nIn the previous section, we built a forecast model to predict the demand for only the Lobster Mac & Cheese item our food trucks were selling. However, this is not the only item sold in the city of Vancouver - what if we wanted to build out a separate forecast model for each of the individual items? We can use the `series_colname` argument in the forecasting ML function, which lets a user specify a column that contains the different series that need to be forecast individually. \n\nFurther, there may be additional data points we want to include in our model to produce better results. In the previous section, we saw that for the Lobster Mac & Cheese item, there were some days that had major spikes in the number of items sold. One hypothesis that could explain these jumps is holidays, when people are perhaps more likely to go out and buy from Tasty Bytes. We can also include these additional [exogenous variables](https://en.wikipedia.org/wiki/Exogenous_and_endogenous_variables) in our model. \n\n\n### Step 1: Build Multi-Series Forecast for Vancouver\n\nFollow the SQL commands below to create a multi-series forecasting model for the city of Vancouver, with holiday data also included. 
\n\n" - }, - { - "cell_type": "code", - "id": "fdae6e2a-d5d7-4df5-bb3c-e15d554a481a", - "metadata": { - "language": "sql", - "name": "cell35", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create a view for our training data, including the holidays for all items sold: \nCREATE OR REPLACE VIEW allitems_vancouver as (\n SELECT\n vs.timestamp,\n vs.menu_item_name,\n vs.total_sold,\n coalesce(ch.holiday_name, '') as holiday_name\n FROM \n vancouver_sales vs\n left join public_holidays ch on vs.timestamp = ch.date\n WHERE MENU_ITEM_NAME in ('Mothers Favorite', 'Bottled Soda', 'Ice Tea')\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "f77bcac4-6c31-45e0-90c2-23765ee6520f", - "metadata": { - "language": "sql", - "name": "cell36" - }, - "outputs": [], - "source": "-- Train Model; this could take ~15-25 secs; please be patient\nCREATE OR REPLACE forecast vancouver_forecast (\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'allitems_vancouver'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD'\n);\n", - "execution_count": null - }, - { - "cell_type": "code", - "id": "251406e3-8892-4d51-b3f4-f3d7326a9142", - "metadata": { - "language": "sql", - "name": "cell37" - }, - "outputs": [], - "source": "-- Confirm the model was created\nSHOW forecast;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "2610541f-3965-427e-b551-b6ec7530006b", - "metadata": { - "name": "cell38", - "collapsed": false - }, - "source": "\nYou may notice as you do the left join that there are a lot of null values for the column `holiday_name`. Not to worry! ML Functions are able to automatically handle and adjust for missing values such as these. 
\n" - }, - { - "cell_type": "markdown", - "id": "75f77058-3853-4f50-9a0b-07b33564c120", - "metadata": { - "name": "cell39", - "collapsed": false - }, - "source": "\n### Step 2: Create Predictions\n\nUnlike the single series model we built in the previous section, we can not simply use the `vancouver_forecast!forecast` method to generate predictions for our current model. Since we have added holidays as an exogenous variable, we need to prepare an inference dataset and pass it into our trained model.\n" - }, - { - "cell_type": "code", - "id": "5d970fdf-9237-48c6-a97e-6a61ad0bb326", - "metadata": { - "language": "sql", - "name": "cell40", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Retrieve the latest date from our input dataset, which is 05/28/2023: \nSELECT MAX(timestamp) FROM vancouver_sales;", - "execution_count": null - }, - { - "cell_type": "code", - "id": "83f41480-7b4a-4fc7-a92b-5290c69f7219", - "metadata": { - "language": "sql", - "name": "cell41", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create view for inference data\nCREATE OR REPLACE VIEW vancouver_forecast_data AS (\n WITH future_dates AS (\n SELECT\n '2023-05-28' ::DATE + row_number() over (\n ORDER BY\n 0\n ) AS timestamp\n FROM\n TABLE(generator(rowcount => 10))\n ),\n food_items AS (\n SELECT\n DISTINCT menu_item_name\n FROM\n allitems_vancouver\n ),\n joined_menu_items AS (\n SELECT\n *\n FROM\n food_items\n CROSS JOIN future_dates\n ORDER BY\n menu_item_name ASC,\n timestamp ASC\n )\n SELECT\n jmi.menu_item_name,\n to_timestamp_ntz(jmi.timestamp) AS timestamp,\n ch.holiday_name\n FROM\n joined_menu_items AS jmi\n LEFT JOIN public_holidays ch ON jmi.timestamp = ch.date\n ORDER BY\n jmi.menu_item_name ASC,\n jmi.timestamp ASC\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "713c19fb-fdfd-46a5-9242-33e7d29e6dfb", - "metadata": { - "language": "sql", - "name": "cell42", - "collapsed": false, - 
"codeCollapsed": false - }, - "outputs": [], - "source": "-- Call the model on the forecast data to produce predictions: \nCALL vancouver_forecast!forecast(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_forecast_data'),\n SERIES_COLNAME => 'menu_item_name',\n TIMESTAMP_COLNAME => 'timestamp'\n );", - "execution_count": null - }, - { - "cell_type": "code", - "id": "6f902d24-7b77-43fc-97fc-242732acb9ae", - "metadata": { - "language": "sql", - "name": "cell43", - "collapsed": false - }, - "outputs": [], - "source": "-- Store results into a table: \nCREATE OR REPLACE TABLE vancouver_predictions AS (\n SELECT *\n FROM {{cell42}}\n);", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "1590d2f3-d282-40d2-bcc9-623c8ac58b6f", - "metadata": { - "name": "cell44", - "collapsed": false - }, - "source": "Above, we used the generator function to generate the next 10 days from 05/28/2023, which was the latest date in our training dataset. We then performed a cross join against all the distinct food items we sell within Vancouver, and lastly joined it against our holiday table so that the model is able to make use of it. \n" - }, - { - "cell_type": "markdown", - "id": "f12725e3-3a47-42b8-8fa2-8ce256ead96b", - "metadata": { - "name": "cell45", - "collapsed": false - }, - "source": "### Step 3: Feature Importance & Evaluation Metrics\n\nAn important part of the model building process is understanding how the individual columns or features that you put into the model weigh in on the final predictions made. This can help provide intuition into what the most significant drivers are, and allow us to iterate by either including other columns that may be predictive or removing those that don't provide much value. The forecasting ML Function gives you the ability to calculate [feature importance](https://docs.snowflake.com/en/user-guide/analysis-forecasting#understanding-feature-importance), using the `explain_feature_importance` method as shown below. 
\n" - }, - { - "cell_type": "code", - "id": "51dab86e-e15c-473d-90cc-8df2942c52cb", - "metadata": { - "language": "sql", - "name": "cell46", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Get feature importance\nCALL VANCOUVER_FORECAST!explain_feature_importance();", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "a8add16e-3268-4590-a153-f30dfeaa92d7", - "metadata": { - "name": "cell47", - "collapsed": false - }, - "source": "\nThe output of this call for our multi-series forecast model is shown above, which you can explore further. One thing to notice here is that, for this particular dataset, including holidays as an exogenous variable didn't dramatically impact our predictions. We may consider dropping this altogether, and relying only on the daily sales themselves. **Note:** based on the version of the ML Function, the output feature importances may differ from what is shown above due to how features are generated by the model. \n\n\nIn addition to feature importances, evaluating model accuracy is important for knowing whether the model can make accurate future predictions. Using the SQL command below, you can get a variety of model metrics that describe how well it performed on a holdout set. 
For more details, please see [understanding evaluation metrics](https://docs.snowflake.com/en/user-guide/ml-powered-forecasting#understanding-evaluation-metrics).\n" - }, - { - "cell_type": "code", - "id": "1014390b-42e4-4250-b000-c484cd91d8c1", - "metadata": { - "language": "sql", - "name": "cell48", - "collapsed": false - }, - "outputs": [], - "source": "-- Evaluate model performance:\nCALL VANCOUVER_FORECAST!show_evaluation_metrics();", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "bbca5839-9221-438d-ae3a-1a84a27138db", - "metadata": { - "name": "cell49" - }, - "source": "## Identifying Anomalous Sales with the Anomaly Detection ML Function\n\nIn the past couple of sections we have built forecasting models for the items sold in Vancouver to plan ahead to meet demand. As an analyst, another question we might be interested in understanding further is anomalous sales. If a particular food item sells anomalously in a consistent direction, this may indicate a recent trend, and we can use this information to better understand and optimize the customer experience. \n\n### Step 1: Building the Anomaly Detection Model\n\nIn this section, we will make use of the [anomaly detection ML Function](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection) to build a model for anomalous sales for all items sold in Vancouver. Since we found that holidays were not impacting the model, we have dropped that column for our anomaly model. 
\n" - }, - { - "cell_type": "code", - "id": "44836532-8276-4d7f-a488-b8049fcfcb4a", - "metadata": { - "language": "sql", - "name": "cell50", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create a view containing our training data\nCREATE OR REPLACE VIEW vancouver_anomaly_training_set AS (\n SELECT *\n FROM vancouver_sales\n WHERE timestamp < (SELECT MAX(timestamp) FROM vancouver_sales) - interval '1 Month'\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "fd2a7cc8-c3e1-47dc-8513-b6fbf60aeaf3", - "metadata": { - "language": "sql", - "name": "cell51", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create a view containing the data we want to make inferences on\nCREATE OR REPLACE VIEW vancouver_anomaly_analysis_set AS (\n SELECT *\n FROM vancouver_sales\n WHERE timestamp > (SELECT MAX(timestamp) FROM vancouver_anomaly_training_set)\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "9c5239ab-470f-4c66-b293-7ff013d945f0", - "metadata": { - "language": "sql", - "name": "cell52", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Create the model: UNSUPERVISED method, however can pass labels as well; this could take ~15-25 secs; please be patient \nCREATE OR REPLACE snowflake.ml.anomaly_detection vancouver_anomaly_model(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_training_set'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD',\n LABEL_COLNAME => ''\n); ", - "execution_count": null - }, - { - "cell_type": "code", - "id": "e2b437aa-9595-44ae-8975-414ce974748a", - "metadata": { - "language": "sql", - "name": "cell53", - "collapsed": false, - "codeCollapsed": false - }, - "outputs": [], - "source": "-- Call the model and store the results into table; this could take ~10-20 secs; please be patient\nCALL vancouver_anomaly_model!DETECT_ANOMALIES(\n 
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),\n    SERIES_COLNAME => 'MENU_ITEM_NAME',\n    TIMESTAMP_COLNAME => 'TIMESTAMP',\n    TARGET_COLNAME => 'TOTAL_SOLD',\n    CONFIG_OBJECT => {'prediction_interval': 0.95}\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "46d17b4b-c965-4f52-b9f2-875f1c69b79c", - "metadata": { - "language": "sql", - "name": "cell54", - "collapsed": false - }, - "outputs": [], - "source": "-- Create a table from the results\nCREATE OR REPLACE TABLE vancouver_anomalies AS (\n SELECT *\n FROM {{cell53}}\n);", - "execution_count": null - }, - { - "cell_type": "code", - "id": "3565b1c7-124b-483c-a556-d7c7896892c2", - "metadata": { - "language": "sql", - "name": "cell55", - "collapsed": false - }, - "outputs": [], - "source": "-- Review the results\nSELECT * FROM vancouver_anomalies;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "4988f71d-b04a-4276-9a86-e31256e8e866", - "metadata": { - "name": "cell56", - "collapsed": false - }, - "source": "\nA few comments on the code above: \n1. Anomaly detection is able to work in both a supervised and unsupervised manner. In this case, we trained it in the unsupervised fashion. If you have a column that specifies labels for whether something was anomalous, you can use the `LABEL_COLNAME` argument to specify that column. \n2. Similar to the forecasting ML Function, you also have the option to specify the `prediction_interval`. In this context, this is used to control how 'aggressive' the model is in identifying an anomaly. A value closer to 1 means that fewer observations will be marked anomalous, whereas a lower value would mark more instances as anomalous. See the [documentation](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection#specifying-the-prediction-interval-for-anomaly-detection) for further details. \n\nThe output of the model should look similar to that found in the image below. 
Refer to the [output documentation](https://docs.snowflake.com/sql-reference/classes/anomaly_detection#id7) for further details on what all the columns specify. \n" -   }, -   { -    "cell_type": "code", -    "id": "f338d097-d86f-4f60-8cd6-56da9a6f9fde", -    "metadata": { -     "language": "python", -     "name": "cell57" -    }, -    "outputs": [], -    "source": "import streamlit as st\nst.image(\"https://quickstarts.snowflake.com/guide/ml_forecasting_ad/img/3f01053690feeebb.png\",width=1000)", -    "execution_count": null -   }, -   { -    "cell_type": "markdown", -    "id": "6d6c4e7a-b275-4c74-be44-3dd9b26657cc", -    "metadata": { -     "name": "cell58" -    }, -    "source": "### Step 2: Identifying Trends\n\nWith our model output, we can now see how many times an anomalous sale occurred for each item in our most recent month's worth of sales data, using the SQL below:\n" -   }, -   { -    "cell_type": "code", -    "id": "756ad1cd-2c7c-4636-9340-56f14db6e2a2", -    "metadata": { -     "language": "sql", -     "name": "cell59", -     "collapsed": false, -     "codeCollapsed": false -    }, -    "outputs": [], -    "source": "-- Query to identify trends\nSELECT series, is_anomaly, count(is_anomaly) AS num_records\nFROM vancouver_anomalies\nWHERE is_anomaly = 1\nGROUP BY ALL\nORDER BY num_records DESC\nLIMIT 5;", -    "execution_count": null -   }, -   { -    "cell_type": "markdown", -    "id": "128d59a7-f1e8-4a19-8a6f-4d712dd0d9f8", -    "metadata": { -     "name": "cell60" -    }, -    "source": "From the results above, it seems as if Hot Ham & Cheese, Pastrami, and Italian had the highest number of anomalous sales in the month of May!" -   }, -   { -    "cell_type": "markdown", -    "id": "7b48df83-2536-4543-b935-a2c22da84b23", -    "metadata": { -     "name": "cell61", -     "collapsed": false -    }, -    "source": "## Productionizing Your Workflow Using Tasks & Stored Procedures\n\nIn this last section, we will walk through how we can use the models created previously and build them into a pipeline to send email reports for the most trending items in the past 30 days. 
This involves a few components: \n\n1. Using [Tasks](https://docs.snowflake.com/en/user-guide/tasks-intro) to retrain the model every month so that it stays fresh\n2. Setting up an [email notification integration](https://docs.snowflake.com/en/user-guide/email-stored-procedures) to send emails to our stakeholders\n3. A [Snowpark Python Stored Procedure](https://docs.snowflake.com/en/sql-reference/stored-procedures-python) to extract the anomalies and send formatted emails containing the most trending items. \n" -   }, -   { -    "cell_type": "code", -    "id": "878677a3-7c8f-47bc-af85-c458d143e6ff", -    "metadata": { -     "language": "sql", -     "name": "cell62", -     "collapsed": false, -     "codeCollapsed": false -    }, -    "outputs": [], -    "source": "-- Note: It's important to update the recipient email twice in the code below\n-- Create a task to run every month to retrain the anomaly detection model: \nCREATE OR REPLACE TASK ad_vancouver_training_task\n    WAREHOUSE = quickstart_wh\n    SCHEDULE = 'USING CRON 0 0 1 * * America/Los_Angeles' -- Runs once a month\nAS\nCREATE OR REPLACE snowflake.ml.anomaly_detection vancouver_anomaly_model(\n    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_training_set'),\n    SERIES_COLNAME => 'MENU_ITEM_NAME',\n    TIMESTAMP_COLNAME => 'TIMESTAMP',\n    TARGET_COLNAME => 'TOTAL_SOLD',\n    LABEL_COLNAME => ''\n); ", -    "execution_count": null -   }, -   { -    "cell_type": "code", -    "id": "b824e165-f947-431e-a13c-17d568e8ae10", -    "metadata": { -     "language": "sql", -     "name": "cell63", -     "codeCollapsed": false, -     "collapsed": false -    }, -    "outputs": [], -    "source": "-- Creates a Stored Procedure to extract the anomalies from our freshly trained model: \nCREATE OR REPLACE PROCEDURE extract_anomalies()\nRETURNS TABLE ()\nLANGUAGE sql \nAS\nBEGIN\n    CALL vancouver_anomaly_model!DETECT_ANOMALIES(\n        INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),\n        SERIES_COLNAME => 'MENU_ITEM_NAME',\n        TIMESTAMP_COLNAME => 'TIMESTAMP',\n        TARGET_COLNAME 
=> 'TOTAL_SOLD',\n CONFIG_OBJECT => {'prediction_interval': 0.95});\nDECLARE res RESULTSET DEFAULT (\n SELECT series, is_anomaly, count(is_anomaly) as num_records \n FROM TABLE(result_scan(-1)) \n WHERE is_anomaly = 1 \n GROUP BY ALL\n HAVING num_records > 5\n ORDER BY num_records DESC);\nBEGIN \n RETURN table(res);\nEND;\nEND;", - "execution_count": null - }, - { - "cell_type": "markdown", - "id": "0e48da86-bbf6-491a-9973-d03845377982", - "metadata": { - "name": "cell64", - "collapsed": false - }, - "source": "This is an example of how you can create an email notification. Note that you need to replace the recipients field with a valid email address: \n\n```sql\n-- Create an email integration: \nCREATE OR REPLACE NOTIFICATION INTEGRATION my_email_int\nTYPE = EMAIL\nENABLED = TRUE\nALLOWED_RECIPIENTS = (''); -- update the recipient's email here\n```" - }, - { - "cell_type": "markdown", - "id": "d840f067-99ea-4e65-9082-1f41b20a499a", - "metadata": { - "name": "cell65", - "collapsed": false - }, - "source": "Create Snowpark Python Stored Procedure to format email and send it\n```sql\nCREATE OR REPLACE PROCEDURE send_anomaly_report()\nRETURNS string\nLANGUAGE python\nruntime_version = 3.9\npackages = ('snowflake-snowpark-python')\nhandler = 'send_email'\n-- update the recipient's email below\nAS\n$$\ndef send_email(session):\n session.call('extract_anomalies').collect()\n printed = session.sql(\n \"select * from table(result_scan(last_query_id(-1)))\"\n ).to_pandas().to_html()\n session.call('system$send_email',\n 'my_email_int',\n '',\n 'Email Alert: Anomaly Report Has Been created',\n printed,\n 'text/html')\n$$;\n```" - }, - { - "cell_type": "markdown", - "id": "bde7204e-5ac2-4d4a-b00e-e8ba13f56917", - "metadata": { - "name": "cell66" - }, - "source": "Orchestrating the Tasks: \n```sql\nCREATE OR REPLACE TASK send_anomaly_report_task\n warehouse = quickstart_wh\n AFTER AD_VANCOUVER_TRAINING_TASK\n AS CALL send_anomaly_report();\n```" - }, - { - "cell_type": 
"markdown", -    "id": "3f0970c1-2340-4777-961a-c52b1555ace7", -    "metadata": { -     "name": "cell67", -     "collapsed": false -    }, -    "source": "Steps to resume and then immediately execute the task DAG: \n```sql\nALTER TASK SEND_ANOMALY_REPORT_TASK RESUME;\nALTER TASK AD_VANCOUVER_TRAINING_TASK RESUME;\nEXECUTE TASK AD_VANCOUVER_TRAINING_TASK;\n```" -   }, -   { -    "cell_type": "markdown", -    "id": "1e74a68b-b5c3-45f8-b412-17f5cfe3d414", -    "metadata": { -     "name": "cell68" -    }, -    "source": "Some considerations to keep in mind from the above code: \n1. **Use the freshest data available**: In the code above, we used `vancouver_anomaly_training_set` to retrain the model, which, because the data is static, would contain the same data as the original model. In a production setting, you would adjust the input table/view so that the most up-to-date dataset is used to retrain the model.\n2. **Sending emails**: This requires you to set up an integration, and specify who the recipients of the email should be. When completed appropriately, you'll receive an email from `no-reply@snowflake.net`, as seen below. \n3. **Formatting results**: We've made use of a Snowpark Python stored procedure to take advantage of pandas' formatting functions, which neatly present the result set in an email. For further details and options, refer to this [medium post](https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3) by Felipe Hoffa.\n4. **Executing the Tasks**: We have set this task to run on the first of every month - if you would like to run it immediately, you'll have to change the state of the task to `RESUME` as shown in the last three lines of code above, before executing the parent task `AD_VANCOUVER_TRAINING_TASK`. Note that we have orchestrated the task to send the email to the user *after* the model has been retrained. 
After executing, you may expect to see an email similar to the one below within a few minutes.\n" -   }, -   { -    "cell_type": "markdown", -    "id": "c8112e22-b651-4e23-bcba-30fe2f3f9818", -    "metadata": { -     "name": "cell69" -    }, -    "source": "## Conclusion\n\n**You did it!** Congrats on building your first set of models using Snowflake Cortex ML-Based Functions. \n\nAs a review, in this guide we: \n\n- Acquired holiday data from the Snowflake Marketplace\n- Visualized sales data from our fictitious company, Tasty Bytes\n- Built a forecasting model for a single item (Lobster Mac & Cheese), before moving on to a multi-series forecast for all the food items sold in Vancouver\n- Used the Anomaly Detection ML Function to identify anomalous sales, and used it to understand recent trends in sales data\n- Productionized pipelines using Tasks & Stored Procedures, so you can get the latest results from your models on a regular cadence\n\n### Resources: \nThis guide contained code patterns that you can leverage to quickly get started with Snowflake Cortex ML-Based Functions. 
For further details, here are some useful resources: \n\n- [Anomaly Detection](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection) Product Docs, alongside the [anomaly syntax](https://docs.snowflake.com/en/sql-reference/classes/anomaly_detection)\n- [Forecasting](https://docs.snowflake.com/en/user-guide/analysis-forecasting) Product Docs, alongside the [forecasting syntax](https://docs.snowflake.com/sql-reference/classes/forecast)" -   } -  ] +  "metadata": { +   "kernelspec": { +    "display_name": "Streamlit Notebook", +    "name": "streamlit" +   } +  }, +  "nbformat_minor": 5, +  "nbformat": 4, +  "cells": [ +   { +    "cell_type": "markdown", +    "id": "3aac5b2e-9939-4b2d-a088-5472570707c4", +    "metadata": { +     "name": "cell1", +     "collapsed": false +    }, +    "source": "# Getting Started with Snowflake Cortex ML-Based Functions\n\n## Overview \n\nOne of the most critical activities that a Data/Business Analyst has to perform is to produce recommendations to their business stakeholders based upon the insights they have gleaned from their data. In practice, this means that they are often required to build models to make forecasts, identify long-running trends, and identify abnormalities within their data. However, Analysts are often impeded from creating the best models possible due to the depth of statistical and machine learning knowledge required to implement them in practice. Further, Python or other programming frameworks may be unfamiliar to Analysts who write SQL, and the nuances of fine-tuning a model may require expert knowledge that may be out of reach. \n\nFor these use cases, Snowflake has developed a set of SQL-based ML Functions that implement machine learning models on the user's behalf. As of December 2023, three ML Functions are available for time-series data:\n\n1. Forecasting: which enables users to forecast a metric based on past values. 
Common use cases for forecasting include predicting future sales, demand for particular SKUs of an item, or volume of traffic into a website over a period of time.\n2. Anomaly Detection: which flags anomalous values using both unsupervised and supervised learning methods. This may be useful in use cases such as identifying spikes in your cloud spend, flagging abnormal data points in logs, and more.\n3. Contribution Explorer: which enables users to perform root cause analysis to determine the most significant drivers of a particular metric of interest. \n\nFor further details on ML Functions, please refer to the [Snowflake documentation](https://docs.snowflake.com/guides-overview-analysis). \n\n### Prerequisites\n- Working knowledge of SQL\n- A Snowflake account login with an ACCOUNTADMIN role. Otherwise, you will need to use a different role that has the ability to create databases, schemas, tables, stages, tasks, email integrations, and stored procedures. \n\n### What You’ll Learn \n- How to make use of Anomaly Detection & Forecasting ML Functions to create models and produce predictions\n- Use Tasks to retrain models on a regular cadence\n- Use the [email notification integration](https://docs.snowflake.com/en/user-guide/email-stored-procedures) to send email reports of the model results after completion \n\n### What You’ll Build \nThis Quickstart is designed to help you get up to speed with both the Forecasting and Anomaly Detection ML Functions. \nWe will work through an example using data from a fictitious food truck company, Tasty Bytes, to first create a forecasting model to predict the demand for each menu item that Tasty Bytes sells in Vancouver. Predicting this demand is important to Tasty Bytes, as it allows them to plan ahead and get enough of the raw ingredients to fulfill customer demand. 
\n\nWe will start with a single food item, then scale up to all the items in Vancouver and add additional data points like holidays to see if they improve the model's performance. Then, to see if there have been any trending food items, we will build an anomaly detection model to understand if certain food items have been selling anomalously. We will wrap up this Quickstart by showcasing how you can use Tasks to schedule your model training process, and use the email notification integration to send out a report on trending food items. \n\nLet's get started!" +   }, +   { +    "cell_type": "markdown", +    "id": "29090d0b-7020-4cc1-b1b4-adc556d77348", +    "metadata": { +     "name": "cell2", +     "collapsed": false +    }, +    "source": "## Setting Up Data in Snowflake\n\n### Overview:\nYou will use Snowflake Notebook to: \n- Create Snowflake objects (i.e., warehouse, database, schema, etc.)\n- Ingest sales data from S3 and load it into a Snowflake table\n- Access holiday data from the Snowflake Marketplace (or load from S3). " +   }, +   { +    "cell_type": "markdown", +    "id": "f0e98da4-358f-45d6-94d0-be434f62ebf4", +    "metadata": { +     "name": "cell3", +     "collapsed": false +    }, +    "source": "\n### Step 1: Loading Holiday Data from an S3 Bucket\n\nNote that you can perform this step by following [the instructions here](https://quickstarts.snowflake.com/guide/ml_forecasting_ad/index.html?index=..%2F..index#1) to access the dataset on the Snowflake Marketplace. For simplicity, in this demo we will load this dataset from an S3 bucket." +   }, +   { +    "cell_type": "code", +    "id": "8d50cbf4-0c8d-4950-86cb-114990437ac9", +    "metadata": { +     "language": "sql", +     "name": "cell4", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "source": "-- Load data for use in this demo. 
\n-- Create a csv file format: \nCREATE OR REPLACE FILE FORMAT csv_ff\n type = 'csv'\n SKIP_HEADER = 1,\n COMPRESSION = AUTO;", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "id": "5e0e32db-3b00-4071-be00-4bc0e9f5a344", + "metadata": { + "language": "sql", + "name": "cell5", + "collapsed": false + }, + "outputs": [], + "source": "-- Create an external stage pointing to s3, to load your data. \nCREATE OR REPLACE STAGE s3load \n COMMENT = 'Quickstart S3 Stage Connection'\n url = 's3://sfquickstarts/notebook_demos/frostbyte_tastybytes/'\n file_format = csv_ff;", + "execution_count": null + }, + { + "cell_type": "code", + "id": "00095f04-38ec-479d-83a3-2ac6b82662df", + "metadata": { + "language": "sql", + "name": "cell6", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "LS @s3load;", + "execution_count": null + }, + { + "cell_type": "code", + "id": "7e5ae191-2af7-49b1-b79f-b18ff1a8e99c", + "metadata": { + "language": "sql", + "name": "cell7", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Define your table.\nCREATE OR REPLACE TABLE PUBLIC_HOLIDAYS(\n \tDATE DATE,\n\tHOLIDAY_NAME VARCHAR(16777216),\n\tIS_FINANCIAL BOOLEAN\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "e03e845b-300f-4a94-8ce7-b729ed4d316e", + "metadata": { + "language": "sql", + "name": "cell8", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Ingest data from s3 into your table.\nCOPY INTO PUBLIC_HOLIDAYS FROM @s3load/holidays.csv;", + "execution_count": null + }, + { + "cell_type": "code", + "id": "e71c170c-7bca-40e2-a60a-b7df07e01293", + "metadata": { + "language": "sql", + "name": "cell9", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "SELECT * from PUBLIC_HOLIDAYS;", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "9d3a5d8a-fff8-4033-9ade-a0995fdecbe4", + "metadata": { + "name": 
"cell10", +     "collapsed": false +    }, +    "source": "### Step 2: Create Objects, Load Data, & Set Up Tables\n\nRun the following SQL commands to create the required Snowflake objects and ingest sales data from S3. " +   }, +   { +    "cell_type": "code", +    "id": "9994c336-01e2-466f-b34f-fbf66525e2d6", +    "metadata": { +     "language": "sql", +     "name": "cell11", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Create an external stage pointing to s3, to load your data. \nCREATE OR REPLACE STAGE s3load \n    COMMENT = 'Quickstart S3 Stage Connection'\n    url = 's3://sfquickstarts/frostbyte_tastybytes/mlpf_quickstart/'\n    file_format = csv_ff;", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "91774fde-c76d-4b1e-8d1a-021746b54830", +    "metadata": { +     "language": "sql", +     "name": "cell12", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Define your table.\nCREATE OR REPLACE TABLE tasty_byte_sales(\n  \tDATE DATE,\n\tPRIMARY_CITY VARCHAR(16777216),\n\tMENU_ITEM_NAME VARCHAR(16777216),\n\tTOTAL_SOLD NUMBER(17,0)\n);", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "21c3eb38-6a62-4c42-af34-9b060d1f0821", +    "metadata": { +     "language": "sql", +     "name": "cell13", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Ingest data from s3 into your table.\nCOPY INTO tasty_byte_sales FROM @s3load/ml_functions_quickstart.csv;", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "3fbcb3fe-47a9-4315-b72b-b45ac41f7ab5", +    "metadata": { +     "language": "sql", +     "name": "cell14", +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- View a sample of the ingested data: \nSELECT * FROM tasty_byte_sales LIMIT 100;", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "d580ae45-c6f7-4f36-970a-e5b170ac8eef", +    "metadata": { +     "name": "cell15", 
+     "collapsed": false +    }, +    "source": "At this point, we have all the data we need to start building models. We will get started by building our first forecasting model. \n\n## Forecasting Demand for Lobster Mac & Cheese\n\nWe will start by building a forecasting model to predict the demand for Lobster Mac & Cheese in Vancouver.\n\n\n### Step 1: Visualize Daily Sales on Snowsight\n\nBefore building our model, let's first visualize our data to get a feel for what daily sales look like. Run the following SQL command in your Snowsight UI, and toggle to the chart at the bottom.\n" +   }, +   { +    "cell_type": "code", +    "id": "a5689582-eec1-46d9-908e-ef88ca3c6d2a", +    "metadata": { +     "language": "sql", +     "name": "cell16", +     "collapsed": false +    }, +    "outputs": [], +    "source": "-- query a sample of the ingested data\nSELECT *\n    FROM tasty_byte_sales\n    WHERE menu_item_name LIKE 'Lobster Mac & Cheese';", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "2ca817f0-77e6-47f9-8e98-397a6badadd6", +    "metadata": { +     "name": "cell17", +     "collapsed": false +    }, +    "source": "We can plot the daily sales for the item Lobster Mac & Cheese going back all the way to 2014." +   }, +   { +    "cell_type": "code", +    "id": "b4d3e0c1-7941-423c-982a-39201eb3d92a", +    "metadata": { +     "language": "python", +     "name": "cell18", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "df = cells.cell16.to_pandas()\nimport altair as alt\nalt.Chart(df).mark_line().encode(\n    x = \"DATE\",\n    y = \"TOTAL_SOLD\"\n)", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "fb69d629-eb18-4cf5-ad4d-026e26a701c3", +    "metadata": { +     "name": "cell19", +     "collapsed": false +    }, +    "source": "Observing the chart, one thing we can notice is that there appears to be a seasonal trend present for sales, on a yearly basis. 
This is an important consideration for building robust forecasting models, and we want to make sure that we feed in enough training data to represent one full cycle of the time series data we are modeling. The forecasting ML function is smart enough to automatically identify and handle multiple seasonality patterns, so we will go ahead and use the latest year's worth of data as input to our model. In the query below, we will also convert the date column using the `to_timestamp_ntz` function, so that it can be used in the forecasting function. " +   }, +   { +    "cell_type": "code", +    "id": "46a61a60-0f32-4875-a6cb-79f52fcc47cb", +    "metadata": { +     "language": "sql", +     "name": "cell20", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Create a table containing the latest year's worth of sales data: \nCREATE OR REPLACE TABLE vancouver_sales AS (\n    SELECT\n        to_timestamp_ntz(date) as timestamp,\n        primary_city,\n        menu_item_name,\n        total_sold\n    FROM\n        tasty_byte_sales\n    WHERE\n        date > (SELECT max(date) - interval '1 year' FROM tasty_byte_sales)\n    GROUP BY\n        all\n);", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "08184365-5247-424a-ae58-7cfe54acc448", +    "metadata": { +     "name": "cell21", +     "collapsed": false +    }, +    "source": "\n### Step 2: Creating our First Forecasting Model: Lobster Mac & Cheese\n\nWe can use SQL to directly call the forecasting ML function. Under the hood, the forecasting ML function automatically takes care of many of the data science best practices that are required to build good models. This includes performing hyper-parameter tuning, adjusting for missing data, and creating new features. We will build our first forecasting model below, for only the Lobster Mac & Cheese menu item. 
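To see why the training window should cover at least one full seasonal cycle, consider a seasonal-naive baseline. This is a deliberately simple forecaster (not the algorithm the ML Function uses) that replays the last observed cycle, and it has nothing to replay without a full cycle of history:

```python
def seasonal_naive_forecast(history, season_length, horizon):
    # Forecast each future step with the value from one season earlier.
    # With fewer than `season_length` observations there is no complete
    # cycle to repeat, so the model cannot capture the seasonality at all.
    if len(history) < season_length:
        raise ValueError("need at least one full seasonal cycle of history")
    return [history[-season_length + (h % season_length)] for h in range(horizon)]

# Hypothetical daily sales with a weekly pattern (weekend spike), one year long.
daily_sold = [40, 42, 41, 39, 43, 70, 75] * 52
forecast = seasonal_naive_forecast(daily_sold, season_length=7, horizon=10)
print(forecast)  # replays last week's shape, then wraps around
```

A production model learns far richer structure than this, but the data requirement is the same: the window you feed it must contain the cycles you expect it to reproduce.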
\n" +   }, +   { +    "cell_type": "code", +    "id": "7074d117-4b8c-4ed7-825d-4e50a40570ab", +    "metadata": { +     "language": "sql", +     "name": "cell22", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Create view for lobster sales\nCREATE OR REPLACE VIEW lobster_sales AS (\n    SELECT\n        timestamp,\n        total_sold\n    FROM\n        vancouver_sales\n    WHERE\n        menu_item_name LIKE 'Lobster Mac & Cheese'\n);\n", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "1e8c21b1-6279-435b-ae23-7010f9a471eb", +    "metadata": { +     "language": "sql", +     "name": "cell23", +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Build Forecasting model; this could take ~15-25 secs; please be patient\nCREATE OR REPLACE forecast lobstermac_forecast (\n    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'lobster_sales'),\n    TIMESTAMP_COLNAME => 'TIMESTAMP',\n    TARGET_COLNAME => 'TOTAL_SOLD'\n);", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "1c3a97a5-dcbb-41f8-b471-aa19f73264a4", +    "metadata": { +     "language": "sql", +     "name": "cell24", +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Show models to confirm training has completed\nSHOW forecast;", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "4617ee0c-041e-4389-97c2-d8b4b055d62d", +    "metadata": { +     "name": "cell25", +     "collapsed": false +    }, +    "source": "In the steps above, we create a view containing the relevant daily sales for our Lobster Mac & Cheese item, which we then pass to the forecast function. The last step should confirm that the model has been created and is ready to produce predictions. \n" +   }, +   { +    "cell_type": "markdown", +    "id": "c5e40a4b-3b7c-4f1a-a267-0b5b41c62c6a", +    "metadata": { +     "name": "cell26", +     "collapsed": false +    }, +    "source": "### Step 3: Creating and Visualizing Predictions\n\nLet's now use our trained `lobstermac_forecast` model to create predictions for demand over the next 10 days. 
\n" + }, + { + "cell_type": "code", + "id": "e6505815-b48a-4be1-aaf9-653b4e6e36ca", + "metadata": { + "language": "sql", + "name": "cell27", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Create predictions, and save results to a table: \nCALL lobstermac_forecast!FORECAST(FORECASTING_PERIODS => 10);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "cdf65508-5b09-4ec4-8bc3-156a17714d53", + "metadata": { + "language": "sql", + "name": "cell28", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Store the results of the cell above as a table\nCREATE OR REPLACE TABLE macncheese_predictions AS (\n SELECT * FROM {{cell27}}\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "89b4caa3-9b8f-48a9-bfaa-6c65825ad3df", + "metadata": { + "language": "sql", + "name": "cell29", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Visualize the results, overlaid on top of one another: \nSELECT\n timestamp,\n total_sold,\n NULL AS forecast\nFROM\n lobster_sales\nWHERE\n timestamp > '2023-03-01'\nUNION\nSELECT\n TS AS timestamp,\n NULL AS total_sold,\n forecast\nFROM\n macncheese_predictions\nORDER BY\n timestamp asc;", + "execution_count": null + }, + { + "cell_type": "code", + "id": "36e67d30-4f29-4fac-8855-24225ef6ce94", + "metadata": { + "language": "python", + "name": "cell30", + "codeCollapsed": false + }, + "outputs": [], + "source": "import pandas as pd\ndf = cells.cell29.to_pandas()\ndf = pd.melt(df,id_vars=[\"TIMESTAMP\"],value_vars=[\"TOTAL_SOLD\",\"FORECAST\"])\ndf = df.replace({\"TOTAL_SOLD\":\"ACTUAL\"})\ndf.columns = [\"TIMESTAMP\",\"TYPE\", \"AMOUNT SOLD\"]\n\nimport altair as alt\nalt.Chart(df).mark_line().encode(\n x = \"TIMESTAMP\",\n y = \"AMOUNT SOLD\",\n color = \"TYPE\"\n)", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "7a0c80e5-9a3e-454d-a41a-bc7d9e66cbf1", + "metadata": { + "name": "cell31", + 
"collapsed": false +    }, +    "source": "There we have it! We just created our first set of predictions for the next 10 days' worth of demand, which can be used to inform how much inventory of raw ingredients we may need. As shown in the above visualization, there also seems to be a weekly trend for the items sold, which the model was able to pick up on. \n\n**Note:** You may notice that your chart shows the null values represented as 0s. Make sure to select the 'none' aggregation for each of the columns, as shown on the right-hand side of the image above, to reproduce the image. Additionally, your visualization may look different based on what version of the ML forecast function you call. The above image was created with **version 7.0**.\n" +   }, +   { +    "cell_type": "markdown", +    "id": "abc163cd-f544-4aa2-bceb-18b7fa7ba3f8", +    "metadata": { +     "name": "cell32", +     "collapsed": false +    }, +    "source": "### Step 4: Understanding Forecasting Output & Configuration Options\n\nIf we have a look at the prediction results, we can see that the following columns are output, as shown below. \n\n1. TS: The timestamp for the forecast prediction\n2. Forecast: The output/prediction made by the model\n3. Lower/Upper_Bound: Separate columns that specify the [prediction interval](https://en.wikipedia.org/wiki/Prediction_interval)\n\n\nThe forecast function exposes a `config_object` that allows you to control the prediction interval of the output. This value ranges from 0 to 1, with a larger value providing a wider range between the lower and upper bound. 
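As a rough mental model, the bounds behave like a two-sided interval around the point forecast. The FORECAST function derives its `LOWER_BOUND`/`UPPER_BOUND` internally, so this Gaussian sketch is only illustrative, but the knob works the same way:

```python
from statistics import NormalDist

def prediction_bounds(point_forecast, residual_std, prediction_interval):
    # Illustrative Gaussian bounds around a point forecast: take the
    # two-sided z-quantile for the requested coverage and scale it by the
    # spread of the residuals. A larger prediction_interval widens the band.
    z = NormalDist().inv_cdf(0.5 + prediction_interval / 2)
    return point_forecast - z * residual_std, point_forecast + z * residual_std

narrow = prediction_bounds(100.0, 8.0, prediction_interval=0.80)
wide = prediction_bounds(100.0, 8.0, prediction_interval=0.99)
print(narrow, wide)  # the 0.99 band strictly contains the 0.80 band
```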
See below for an example of how to change this when producing inferences: \n" +   }, +   { +    "cell_type": "code", +    "id": "0ccc768a-aaf4-4323-8409-77bf941aee10", +    "metadata": { +     "language": "sql", +     "name": "cell33", +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "CALL lobstermac_forecast!FORECAST(FORECASTING_PERIODS => 10, CONFIG_OBJECT => {'prediction_interval': .9});", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "7c1d28db-7b6a-42ee-958f-eeeab8f9f658", +    "metadata": { +     "name": "cell34", +     "collapsed": false +    }, +    "source": "## Building Multiple Forecasts & Adding Holiday Information\n\nIn the previous section, we built a forecast model to predict the demand for only the Lobster Mac & Cheese item our food trucks were selling. However, this is not the only item sold in the city of Vancouver - what if we wanted to build out a separate forecast model for each of the individual items? We can use the `series_colname` argument in the forecasting ML function, which lets a user specify a column that contains the different series that need to be forecast individually. \n\nFurther, there may be additional data points we want to include in our model to produce better results. In the previous section, we saw that for the Lobster Mac & Cheese item, there were some days that had major spikes in the number of items sold. One hypothesis that could explain these jumps is holidays, where people are perhaps more likely to go out and buy from Tasty Bytes. We can also include these additional [exogenous variables](https://en.wikipedia.org/wiki/Exogenous_and_endogenous_variables) in our model. \n\n\n### Step 1: Build Multi-Series Forecast for Vancouver\n\nFollow the SQL commands below to create a multi-series forecasting model for the city of Vancouver, with holiday data also included. 
\n\n" +   }, +   { +    "cell_type": "code", +    "id": "fdae6e2a-d5d7-4df5-bb3c-e15d554a481a", +    "metadata": { +     "language": "sql", +     "name": "cell35", +     "collapsed": false, +     "codeCollapsed": false +    }, +    "outputs": [], +    "source": "-- Create a view for our training data, including the holidays for all items sold: \nCREATE OR REPLACE VIEW allitems_vancouver as (\n    SELECT\n        vs.timestamp,\n        vs.menu_item_name,\n        vs.total_sold,\n        coalesce(ch.holiday_name, '') as holiday_name\n    FROM \n        vancouver_sales vs\n        left join public_holidays ch on vs.timestamp = ch.date\n    WHERE MENU_ITEM_NAME in ('Mothers Favorite', 'Bottled Soda', 'Ice Tea')\n);", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "f77bcac4-6c31-45e0-90c2-23765ee6520f", +    "metadata": { +     "language": "sql", +     "name": "cell36" +    }, +    "outputs": [], +    "source": "-- Train Model; this could take ~15-25 secs; please be patient\nCREATE OR REPLACE forecast vancouver_forecast (\n    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'allitems_vancouver'),\n    SERIES_COLNAME => 'MENU_ITEM_NAME',\n    TIMESTAMP_COLNAME => 'TIMESTAMP',\n    TARGET_COLNAME => 'TOTAL_SOLD'\n);\n", +    "execution_count": null +   }, +   { +    "cell_type": "code", +    "id": "251406e3-8892-4d51-b3f4-f3d7326a9142", +    "metadata": { +     "language": "sql", +     "name": "cell37" +    }, +    "outputs": [], +    "source": "-- show it\nSHOW forecast;", +    "execution_count": null +   }, +   { +    "cell_type": "markdown", +    "id": "2610541f-3965-427e-b551-b6ec7530006b", +    "metadata": { +     "name": "cell38", +     "collapsed": false +    }, +    "source": "\nYou may notice when doing the left join that there are a lot of null values for the column `holiday_name`. Not to worry! ML Functions are able to automatically handle and adjust for missing values such as these. 
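To make the point concrete, here is a small Python analogue of that left join, using a hypothetical one-row holiday table: unmatched dates come back as `None` (SQL `NULL`), and the `COALESCE(holiday_name, '')` in the view maps them to empty strings:

```python
from datetime import date, timedelta

# Hypothetical single-row holiday "table"; most dates will not match it.
holidays = {date(2023, 7, 1): "Canada Day"}
timestamps = [date(2023, 6, 30) + timedelta(days=d) for d in range(3)]

joined = [(ts, holidays.get(ts)) for ts in timestamps]   # None when no match (LEFT JOIN)
cleaned = [(ts, name or "") for ts, name in joined]      # COALESCE(holiday_name, '')
print(cleaned)
```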
\n" + }, + { + "cell_type": "markdown", + "id": "75f77058-3853-4f50-9a0b-07b33564c120", + "metadata": { + "name": "cell39", + "collapsed": false + }, + "source": "\n### Step 2: Create Predictions\n\nUnlike the single series model we built in the previous section, we can not simply use the `vancouver_forecast!forecast` method to generate predictions for our current model. Since we have added holidays as an exogenous variable, we need to prepare an inference dataset and pass it into our trained model.\n" + }, + { + "cell_type": "code", + "id": "5d970fdf-9237-48c6-a97e-6a61ad0bb326", + "metadata": { + "language": "sql", + "name": "cell40", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Retrieve the latest date from our input dataset, which is 05/28/2023: \nSELECT MAX(timestamp) FROM vancouver_sales;", + "execution_count": null + }, + { + "cell_type": "code", + "id": "83f41480-7b4a-4fc7-a92b-5290c69f7219", + "metadata": { + "language": "sql", + "name": "cell41", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Create view for inference data\nCREATE OR REPLACE VIEW vancouver_forecast_data AS (\n WITH future_dates AS (\n SELECT\n '2023-05-28' ::DATE + row_number() over (\n ORDER BY\n 0\n ) AS timestamp\n FROM\n TABLE(generator(rowcount => 10))\n ),\n food_items AS (\n SELECT\n DISTINCT menu_item_name\n FROM\n allitems_vancouver\n ),\n joined_menu_items AS (\n SELECT\n *\n FROM\n food_items\n CROSS JOIN future_dates\n ORDER BY\n menu_item_name ASC,\n timestamp ASC\n )\n SELECT\n jmi.menu_item_name,\n to_timestamp_ntz(jmi.timestamp) AS timestamp,\n ch.holiday_name\n FROM\n joined_menu_items AS jmi\n LEFT JOIN public_holidays ch ON jmi.timestamp = ch.date\n ORDER BY\n jmi.menu_item_name ASC,\n jmi.timestamp ASC\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "713c19fb-fdfd-46a5-9242-33e7d29e6dfb", + "metadata": { + "language": "sql", + "name": "cell42", + "collapsed": false, + 
"codeCollapsed": false + }, + "outputs": [], + "source": "-- Call the model on the forecast data to produce predictions: \nCALL vancouver_forecast!forecast(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_forecast_data'),\n SERIES_COLNAME => 'menu_item_name',\n TIMESTAMP_COLNAME => 'timestamp'\n );", + "execution_count": null + }, + { + "cell_type": "code", + "id": "6f902d24-7b77-43fc-97fc-242732acb9ae", + "metadata": { + "language": "sql", + "name": "cell43", + "collapsed": false + }, + "outputs": [], + "source": "-- Store results into a table: \nCREATE OR REPLACE TABLE vancouver_predictions AS (\n SELECT *\n FROM {{cell42}}\n);", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "1590d2f3-d282-40d2-bcc9-623c8ac58b6f", + "metadata": { + "name": "cell44", + "collapsed": false + }, + "source": "Above, we used the generator function to generate the next 10 days from 05/28/2023, which was the latest date in our training dataset. We then performed a cross join against all the distinct food items we sell within Vancouver, and lastly joined it against our holiday table so that the model is able to make use of it. \n" + }, + { + "cell_type": "markdown", + "id": "f12725e3-3a47-42b8-8fa2-8ce256ead96b", + "metadata": { + "name": "cell45", + "collapsed": false + }, + "source": "### Step 3: Feature Importance & Evaluation Metrics\n\nAn important part of the model building process is understanding how the individual columns or features that you put into the model weigh in on the final predictions made. This can help provide intuition into what the most significant drivers are, and allow us to iterate by either including other columns that may be predictive or removing those that don't provide much value. The forecasting ML Function gives you the ability to calculate [feature importance](https://docs.snowflake.com/en/user-guide/analysis-forecasting#understanding-feature-importance), using the `explain_feature_importance` method as shown below. 
\n" + }, + { + "cell_type": "code", + "id": "51dab86e-e15c-473d-90cc-8df2942c52cb", + "metadata": { + "language": "sql", + "name": "cell46", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- get Feature Importance\nCALL VANCOUVER_FORECAST!explain_feature_importance();", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "a8add16e-3268-4590-a153-f30dfeaa92d7", + "metadata": { + "name": "cell47", + "collapsed": false + }, + "source": "\nThe output of this call for our multi-series forecast model is shown above, which you can explore further. One thing to notice here is that, for this particular dataset, including holidays as an exogenous variable didn't dramatically impact our predictions. We may consider dropping this altogether, and only rely on the daily sales themselves. **Note**, based on the version of the ML Function, the outputted feature importances may be different compared to what is shown below due how features are generated by the model. \n\n\nIn addition to feature importances, evaluating model accuracy is important in knowing if the model is able to accurately make future predictions. Using the sql command below, you can get a variety of model metrics that describe how well it performed on a holdout set. 
For more details, please see [understanding evaluation metrics](https://docs.snowflake.com/en/user-guide/ml-powered-forecasting#understanding-evaluation-metrics).\n" + }, + { + "cell_type": "code", + "id": "1014390b-42e4-4250-b000-c484cd91d8c1", + "metadata": { + "language": "sql", + "name": "cell48", + "collapsed": false + }, + "outputs": [], + "source": "-- Evaluate model performance:\nCALL VANCOUVER_FORECAST!show_evaluation_metrics();", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "bbca5839-9221-438d-ae3a-1a84a27138db", + "metadata": { + "name": "cell49" + }, + "source": "## Identifying Anomalous Sales with the Anomaly Detection ML Function\n\nIn the past couple of sections, we built forecasting models for the items sold in Vancouver so we can plan ahead to meet demand. As an analyst, another question we might be interested in exploring is anomalous sales. If anomalies recur consistently for a particular food item, this may indicate a recent trend, and we can use this information to better understand the customer experience and optimize it. \n\n### Step 1: Building the Anomaly Detection Model\n\nIn this section, we will make use of the [anomaly detection ML Function](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection) to build a model of anomalous sales for all items sold in Vancouver. Since we found that holidays were not meaningfully impacting the model, we have dropped that column from our anomaly model. 
\n" + }, + { + "cell_type": "code", + "id": "44836532-8276-4d7f-a488-b8049fcfcb4a", + "metadata": { + "language": "sql", + "name": "cell50", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Create a view containing our training data\nCREATE OR REPLACE VIEW vancouver_anomaly_training_set AS (\n SELECT *\n FROM vancouver_sales\n WHERE timestamp < (SELECT MAX(timestamp) FROM vancouver_sales) - interval '1 Month'\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "fd2a7cc8-c3e1-47dc-8513-b6fbf60aeaf3", + "metadata": { + "language": "sql", + "name": "cell51", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Create a view containing the data we want to make inferences on\nCREATE OR REPLACE VIEW vancouver_anomaly_analysis_set AS (\n SELECT *\n FROM vancouver_sales\n WHERE timestamp > (SELECT MAX(timestamp) FROM vancouver_anomaly_training_set)\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "9c5239ab-470f-4c66-b293-7ff013d945f0", + "metadata": { + "language": "sql", + "name": "cell52", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Create the model: UNSUPERVISED method, however can pass labels as well; this could take ~15-25 secs; please be patient \nCREATE OR REPLACE snowflake.ml.anomaly_detection vancouver_anomaly_model(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_training_set'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD',\n LABEL_COLNAME => ''\n); ", + "execution_count": null + }, + { + "cell_type": "code", + "id": "e2b437aa-9595-44ae-8975-414ce974748a", + "metadata": { + "language": "sql", + "name": "cell53", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Call the model and store the results into table; this could take ~10-20 secs; please be patient\nCALL vancouver_anomaly_model!DETECT_ANOMALIES(\n 
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD',\n CONFIG_OBJECT => {'prediction_interval': 0.95}\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "46d17b4b-c965-4f52-b9f2-875f1c69b79c", + "metadata": { + "language": "sql", + "name": "cell54", + "collapsed": false + }, + "outputs": [], + "source": "-- Create a table from the results\nCREATE OR REPLACE TABLE vancouver_anomalies AS (\n SELECT *\n FROM {{cell53}}\n);", + "execution_count": null + }, + { + "cell_type": "code", + "id": "3565b1c7-124b-483c-a556-d7c7896892c2", + "metadata": { + "language": "sql", + "name": "cell55", + "collapsed": false + }, + "outputs": [], + "source": "-- Review the results\nSELECT * FROM vancouver_anomalies;", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "4988f71d-b04a-4276-9a86-e31256e8e866", + "metadata": { + "name": "cell56", + "collapsed": false + }, + "source": "\nA few comments on the code above: \n1. Anomaly detection is able to work in both a supervised and an unsupervised manner. In this case, we trained it in an unsupervised fashion. If you have a column that specifies labels for whether something was anomalous, you can use the `LABEL_COLNAME` argument to specify that column. \n2. Similar to the forecasting ML Function, you also have the option to specify the `prediction_interval`. In this context, this is used to control how 'aggressive' the model is in identifying an anomaly. A value closer to 1 means that fewer observations will be marked anomalous, whereas a lower value will mark more instances as anomalous. See the [documentation](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection#specifying-the-prediction-interval-for-anomaly-detection) for further details. \n\nThe output of the model should look similar to that found in the image below. 
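\n\nFor example, as a sketch of tuning that sensitivity (0.80 is an arbitrary illustrative value, not a recommendation), the same call could be re-run with a lower interval to flag more observations: \n\n```sql\n-- Lowering the prediction interval marks more observations as anomalous\nCALL vancouver_anomaly_model!DETECT_ANOMALIES(\n    INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),\n    SERIES_COLNAME => 'MENU_ITEM_NAME',\n    TIMESTAMP_COLNAME => 'TIMESTAMP',\n    TARGET_COLNAME => 'TOTAL_SOLD',\n    CONFIG_OBJECT => {'prediction_interval': 0.80}\n);\n```\n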
Refer to the [output documentation](https://docs.snowflake.com/sql-reference/classes/anomaly_detection#id7) for further details on what all the columns specify. \n" + }, + { + "cell_type": "code", + "id": "f338d097-d86f-4f60-8cd6-56da9a6f9fde", + "metadata": { + "language": "python", + "name": "cell57" + }, + "outputs": [], + "source": "import streamlit as st\nst.image(\"https://quickstarts.snowflake.com/guide/ml_forecasting_ad/img/3f01053690feeebb.png\",width=1000)", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "6d6c4e7a-b275-4c74-be44-3dd9b26657cc", + "metadata": { + "name": "cell58" + }, + "source": "### Step 2: Identifying Trends\n\nWith our model output, we are now in a position to see how many times an anomalous sale occurred for each of the items in our most recent month's worth of sales data, using the SQL below:\n" + }, + { + "cell_type": "code", + "id": "756ad1cd-2c7c-4636-9340-56f14db6e2a2", + "metadata": { + "language": "sql", + "name": "cell59", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Query to identify trends\nSELECT series, is_anomaly, count(is_anomaly) AS num_records\nFROM vancouver_anomalies\nWHERE is_anomaly = 1\nGROUP BY ALL\nORDER BY num_records DESC\nLIMIT 5;", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "128d59a7-f1e8-4a19-8a6f-4d712dd0d9f8", + "metadata": { + "name": "cell60" + }, + "source": "From the results above, it seems as if Hot Ham & Cheese, Pastrami, and Italian have had the highest number of anomalous sales in the month of May!" + }, + { + "cell_type": "markdown", + "id": "7b48df83-2536-4543-b935-a2c22da84b23", + "metadata": { + "name": "cell61", + "collapsed": false + }, + "source": "## Productionizing Your Workflow Using Tasks & Stored Procedures\n\nIn this last section, we will walk through how we can take the models created previously and build them into a pipeline that sends email reports for the most trending items in the past 30 days. 
This involves a few components, including: \n\n1. Using [Tasks](https://docs.snowflake.com/en/user-guide/tasks-intro) to retrain the model every month, to make sure it stays fresh\n2. Setting up an [email notification integration](https://docs.snowflake.com/en/user-guide/email-stored-procedures) to send emails to our stakeholders\n3. A [Snowpark Python Stored Procedure](https://docs.snowflake.com/en/sql-reference/stored-procedures-python) to extract the anomalies and send formatted emails containing the most trending items. \n" + }, + { + "cell_type": "code", + "id": "878677a3-7c8f-47bc-af85-c458d143e6ff", + "metadata": { + "language": "sql", + "name": "cell62", + "collapsed": false, + "codeCollapsed": false + }, + "outputs": [], + "source": "-- Note: It's important to update the recipient email twice in the code below\n-- Create a task that runs every month to retrain the anomaly detection model: \nCREATE OR REPLACE TASK ad_vancouver_training_task\n WAREHOUSE = quickstart_wh\n SCHEDULE = 'USING CRON 0 0 1 * * America/Los_Angeles' -- Runs once a month\nAS\nCREATE OR REPLACE snowflake.ml.anomaly_detection vancouver_anomaly_model(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_training_set'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME => 'TOTAL_SOLD',\n LABEL_COLNAME => ''\n); ", + "execution_count": null + }, + { + "cell_type": "code", + "id": "b824e165-f947-431e-a13c-17d568e8ae10", + "metadata": { + "language": "sql", + "name": "cell63", + "codeCollapsed": false, + "collapsed": false + }, + "outputs": [], + "source": "-- Create a stored procedure to extract the anomalies from our freshly trained model: \nCREATE OR REPLACE PROCEDURE extract_anomalies()\nRETURNS TABLE ()\nLANGUAGE sql \nAS\nBEGIN\n CALL vancouver_anomaly_model!DETECT_ANOMALIES(\n INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'vancouver_anomaly_analysis_set'),\n SERIES_COLNAME => 'MENU_ITEM_NAME',\n TIMESTAMP_COLNAME => 'TIMESTAMP',\n TARGET_COLNAME 
=> 'TOTAL_SOLD',\n CONFIG_OBJECT => {'prediction_interval': 0.95});\nDECLARE res RESULTSET DEFAULT (\n SELECT series, is_anomaly, count(is_anomaly) as num_records \n FROM TABLE(result_scan(-1)) \n WHERE is_anomaly = 1 \n GROUP BY ALL\n HAVING num_records > 5\n ORDER BY num_records DESC);\nBEGIN \n RETURN table(res);\nEND;\nEND;", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "0e48da86-bbf6-491a-9973-d03845377982", + "metadata": { + "name": "cell64", + "collapsed": false + }, + "source": "This is an example of how you can create an email notification integration. Note that you need to replace the `ALLOWED_RECIPIENTS` field with valid email address(es): \n\n```sql\n-- Create an email integration: \nCREATE OR REPLACE NOTIFICATION INTEGRATION my_email_int\nTYPE = EMAIL\nENABLED = TRUE\nALLOWED_RECIPIENTS = (''); -- update the recipient's email here\n```" + }, + { + "cell_type": "markdown", + "id": "d840f067-99ea-4e65-9082-1f41b20a499a", + "metadata": { + "name": "cell65", + "collapsed": false + }, + "source": "Create a Snowpark Python stored procedure to format the email and send it. 
Ensure that the `EMAIL RECIPIENT HERE!` placeholder is updated with the email address(es) given in the previous step.\n\n```sql\nCREATE OR REPLACE PROCEDURE send_anomaly_report()\nRETURNS string\nLANGUAGE python\nruntime_version = 3.9\npackages = ('snowflake-snowpark-python')\nhandler = 'send_email'\n-- update the recipient's email below\nAS\n$$\ndef send_email(session):\n session.call('extract_anomalies').collect()\n printed = session.sql(\n \"select * from table(result_scan(last_query_id(-1)))\"\n ).to_pandas().to_html()\n session.call('system$send_email',\n 'my_email_int',\n '',\n 'Email Alert: Anomaly Report Has Been created',\n printed,\n 'text/html')\n$$;\n```" + }, + { + "cell_type": "markdown", + "id": "bde7204e-5ac2-4d4a-b00e-e8ba13f56917", + "metadata": { + "name": "cell66", + "collapsed": false + }, + "source": "### Orchestrating the Tasks\n" + }, + { + "cell_type": "code", + "id": "6af12e20-3aca-4dec-a2cc-a1109ca97169", + "metadata": { + "language": "sql", + "name": "cell70" + }, + "outputs": [], + "source": "CREATE OR REPLACE TASK send_anomaly_report_task\n warehouse = quickstart_wh\n AFTER AD_VANCOUVER_TRAINING_TASK\n AS CALL send_anomaly_report();", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "3f0970c1-2340-4777-961a-c52b1555ace7", + "metadata": { + "name": "cell67", + "collapsed": false + }, + "source": "Steps to resume and then immediately execute the task DAG: \n" + }, + { + "cell_type": "code", + "id": "10e36e81-b6ab-4ddc-a959-a03baabe6bd2", + "metadata": { + "language": "sql", + "name": "cell71" + }, + "outputs": [], + "source": "ALTER TASK SEND_ANOMALY_REPORT_TASK RESUME;\nALTER TASK AD_VANCOUVER_TRAINING_TASK RESUME;\nEXECUTE TASK AD_VANCOUVER_TRAINING_TASK;", + "execution_count": null + }, + { + "cell_type": "markdown", + "id": "1e74a68b-b5c3-45f8-b412-17f5cfe3d414", + "metadata": { + "name": "cell68" + }, + "source": "Some considerations to keep in mind from the above code: \n1. 
**Use the freshest data available**: In the code above, we used `vancouver_anomaly_training_set` to retrain the model, which, because the data is static, would contain the same data as the original model. In a production setting, you would adjust the input table/view so that the most up-to-date dataset is used to retrain the model.\n2. **Sending emails**: This requires you to set up an integration and specify who the recipients of the email should be. When completed appropriately, you'll receive an email from `no-reply@snowflake.net`, as seen below. \n3. **Formatting results**: We've made use of a Snowpark stored procedure to take advantage of pandas functionality for neatly presenting the result set in an email. For further details and options, refer to this [medium post](https://medium.com/snowflake/hey-snowflake-send-me-an-email-243741a0fe3) by Felipe Hoffa.\n4. **Executing the Tasks**: We have set this task to run on the first of every month. If you would like to run it immediately, you'll have to change the state of the task to `RESUME`, as shown in the last three lines of code above, before executing the parent task `AD_VANCOUVER_TRAINING_TASK`. Note that we have orchestrated the task to send the email to the user *after* the model has been retrained. After executing, you may expect to see an email similar to the one below within a few minutes.\n" + }, + { + "cell_type": "markdown", + "id": "c8112e22-b651-4e23-bcba-30fe2f3f9818", + "metadata": { + "name": "cell69" + }, + "source": "## Conclusion\n\n**You did it!** Congrats on building your first set of models using Snowflake Cortex ML-Based Functions. 
\n\nAs a review, in this guide we covered how you are able to: \n\n- Acquire holiday data from the snowflake marketplace\n- Visualized sales data from our fitictious company Tasty Bytes\n- Built out forecasting model for only a single item (Lobster Mac & Cheese), before moving onto a multi-series forecast for all the food items sold in Vancouver\n- Used Anomaly detection ML Function to identify anomalous sales, and used it to understand recent trends in sales data\n- Productionize pipelines using Tasks & Stored Procedures, so you can get the latest results from your model on a regular cadence\n\n### Resources: \nThis guide contained code patterns that you can leverage to get quickly started with Snowflake Cortex ML-Based Functions. For further details, here are some useful resources: \n\n- [Anomaly Detection](https://docs.snowflake.com/en/user-guide/analysis-anomaly-detection) Product Docs, alongside the [anomaly syntax](https://docs.snowflake.com/en/sql-reference/classes/anomaly_detection)\n- [Forecasting](https://docs.snowflake.com/en/user-guide/analysis-forecasting) Product Docs, alongside the [forecasting syntax](https://docs.snowflake.com/sql-reference/classes/forecast)" + } + ] } \ No newline at end of file