XGBoost tutorial #820
Merged: Yancey0623 merged 4 commits into sql-machine-learning:develop from Yancey0623:xgboost_tutorial on Sep 16, 2019
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,357 @@ | ||
| { | ||
| "cells": [ | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# XGBoost on SQLFlow Tutorial\n", | ||
| "\n", | ||
| "This tutorial shows how to train an XGBoost model and run predictions with it in SQLFlow. The design doc is available [here](https://github.com/sql-machine-learning/sqlflow/blob/develop/doc/xgboost_on_sqlflow_design.md). In this tutorial you will learn how to:\n", | ||
| "- Train an XGBoost model to fit the Boston housing prices; and\n", | ||
| "- Predict housing prices using the trained model.\n", | ||
| "\n", | ||
| "\n", | ||
| "## The Dataset\n", | ||
| "\n", | ||
| "This tutorial uses the [Boston Housing](https://www.kaggle.com/c/boston-housing) dataset for demonstration.\n", | ||
| "The dataset contains 506 rows and 14 columns. The meaning of each column is as follows:\n", | ||
| "\n", | ||
| "Column | Explanation \n", | ||
| "-- | -- \n", | ||
| "crim|per capita crime rate by town.\n", | ||
| "zn|proportion of residential land zoned for lots over 25,000 sq.ft.\n", | ||
| "indus|proportion of non-retail business acres per town.\n", | ||
| "chas|Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).\n", | ||
| "nox|nitrogen oxides concentration (parts per 10 million).\n", | ||
| "rm|average number of rooms per dwelling.\n", | ||
| "age|proportion of owner-occupied units built prior to 1940.\n", | ||
| "dis|weighted mean of distances to five Boston employment centres.\n", | ||
| "rad|index of accessibility to radial highways.\n", | ||
| "tax|full-value property-tax rate per \\$10,000.\n", | ||
| "ptratio|pupil-teacher ratio by town.\n", | ||
| "b|1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town (named `black` in the original dataset).\n", | ||
| "lstat|lower status of the population (percent).\n", | ||
| "medv|median value of owner-occupied homes in \\$1000s.\n", | ||
| "\n", | ||
| "We separated the dataset into train and test sets, which are used to train the model and to run predictions, respectively." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 1, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/plain": [ | ||
| "+---------+---------+------+-----+---------+-------+\n", | ||
| "| Field | Type | Null | Key | Default | Extra |\n", | ||
| "+---------+---------+------+-----+---------+-------+\n", | ||
| "| crim | float | YES | | None | |\n", | ||
| "| zn | float | YES | | None | |\n", | ||
| "| indus | float | YES | | None | |\n", | ||
| "| chas | int(11) | YES | | None | |\n", | ||
| "| nox | float | YES | | None | |\n", | ||
| "| rm | float | YES | | None | |\n", | ||
| "| age | float | YES | | None | |\n", | ||
| "| dis | float | YES | | None | |\n", | ||
| "| rad | int(11) | YES | | None | |\n", | ||
| "| tax | int(11) | YES | | None | |\n", | ||
| "| ptratio | float | YES | | None | |\n", | ||
| "| b | float | YES | | None | |\n", | ||
| "| lstat | float | YES | | None | |\n", | ||
| "| medv | float | YES | | None | |\n", | ||
| "+---------+---------+------+-----+---------+-------+" | ||
| ] | ||
| }, | ||
| "execution_count": 1, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "%%sqlflow\n", | ||
| "describe boston.train;" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 2, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/plain": [ | ||
| "+---------+---------+------+-----+---------+-------+\n", | ||
| "| Field | Type | Null | Key | Default | Extra |\n", | ||
| "+---------+---------+------+-----+---------+-------+\n", | ||
| "| crim | float | YES | | None | |\n", | ||
| "| zn | float | YES | | None | |\n", | ||
| "| indus | float | YES | | None | |\n", | ||
| "| chas | int(11) | YES | | None | |\n", | ||
| "| nox | float | YES | | None | |\n", | ||
| "| rm | float | YES | | None | |\n", | ||
| "| age | float | YES | | None | |\n", | ||
| "| dis | float | YES | | None | |\n", | ||
| "| rad | int(11) | YES | | None | |\n", | ||
| "| tax | int(11) | YES | | None | |\n", | ||
| "| ptratio | float | YES | | None | |\n", | ||
| "| b | float | YES | | None | |\n", | ||
| "| lstat | float | YES | | None | |\n", | ||
| "| medv | float | YES | | None | |\n", | ||
| "+---------+---------+------+-----+---------+-------+" | ||
| ] | ||
| }, | ||
| "execution_count": 2, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "%%sqlflow\n", | ||
| "describe boston.test;" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Fit Boston Housing Price\n", | ||
| "\n", | ||
| "First, let's train an XGBoost regression model to fit the housing prices. We train the model for 30 rounds (`train.num_boost_round=30`)\n", | ||
| "with the squared-error loss (`objective=\"reg:squarederror\"`), so the SQLFlow extended SQL looks like:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "TRAIN xgboost.gbtree\n", | ||
| "WITH\n", | ||
| " train.num_boost_round=30,\n", | ||
| " objective=\"reg:squarederror\"\n", | ||
| "```\n", | ||
| "\n", | ||
| "`xgboost.gbtree` is the estimator name, where `gbtree` is one of the XGBoost boosters. You can find more information [here](https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters).\n", | ||
| "\n", | ||
| "We specify the training feature columns in the `COLUMN` clause and the label column with the `LABEL` keyword:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n", | ||
| "LABEL medv\n", | ||
| "```\n", | ||
| "\n", | ||
| "To save the trained model, we use the `INTO` clause to specify a model name:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "INTO sqlflow_models.my_xgb_regression_model\n", | ||
| "```\n", | ||
| "\n", | ||
| "Second, let's use a standard SQL statement to fetch the training data from the table `boston.train`:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "SELECT * FROM boston.train\n", | ||
| "```\n", | ||
| "\n", | ||
| "Finally, the following is the complete SQLFlow training statement for this regression task. You can run it in the cell below:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 5, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "[03:44:56] 387x13 matrix with 5031 entries loaded from train.txt\n", | ||
| "\n", | ||
| "[03:44:56] 109x13 matrix with 1417 entries loaded from test.txt\n", | ||
| "\n", | ||
| "[0]\ttrain-rmse:17.0286\tvalidation-rmse:17.8089\n", | ||
| "\n", | ||
| "[1]\ttrain-rmse:12.285\tvalidation-rmse:13.2787\n", | ||
| "\n", | ||
| "[2]\ttrain-rmse:8.93071\tvalidation-rmse:9.87677\n", | ||
| "\n", | ||
| "[3]\ttrain-rmse:6.60757\tvalidation-rmse:7.64013\n", | ||
| "\n", | ||
| "[4]\ttrain-rmse:4.96022\tvalidation-rmse:6.0181\n", | ||
| "\n", | ||
| "[5]\ttrain-rmse:3.80725\tvalidation-rmse:4.95013\n", | ||
| "\n", | ||
| "[6]\ttrain-rmse:2.94382\tvalidation-rmse:4.2357\n", | ||
| "\n", | ||
| "[7]\ttrain-rmse:2.36361\tvalidation-rmse:3.74683\n", | ||
| "\n", | ||
| "[8]\ttrain-rmse:1.95236\tvalidation-rmse:3.43284\n", | ||
| "\n", | ||
| "[9]\ttrain-rmse:1.66604\tvalidation-rmse:3.20455\n", | ||
| "\n", | ||
| "[10]\ttrain-rmse:1.4738\tvalidation-rmse:3.08947\n", | ||
| "\n", | ||
| "[11]\ttrain-rmse:1.35336\tvalidation-rmse:3.0492\n", | ||
| "\n", | ||
| "[12]\ttrain-rmse:1.22835\tvalidation-rmse:2.99508\n", | ||
| "\n", | ||
| "[13]\ttrain-rmse:1.15615\tvalidation-rmse:2.98604\n", | ||
| "\n", | ||
| "[14]\ttrain-rmse:1.11082\tvalidation-rmse:2.96433\n", | ||
| "\n", | ||
| "[15]\ttrain-rmse:1.01666\tvalidation-rmse:2.96584\n", | ||
| "\n", | ||
| "[16]\ttrain-rmse:0.953761\tvalidation-rmse:2.94013\n", | ||
| "\n", | ||
| "[17]\ttrain-rmse:0.905753\tvalidation-rmse:2.91569\n", | ||
| "\n", | ||
| "[18]\ttrain-rmse:0.870137\tvalidation-rmse:2.89735\n", | ||
| "\n", | ||
| "[19]\ttrain-rmse:0.800778\tvalidation-rmse:2.87206\n", | ||
| "\n", | ||
| "[20]\ttrain-rmse:0.757704\tvalidation-rmse:2.86564\n", | ||
| "\n", | ||
| "[21]\ttrain-rmse:0.74058\tvalidation-rmse:2.86587\n", | ||
| "\n", | ||
| "[22]\ttrain-rmse:0.66901\tvalidation-rmse:2.86224\n", | ||
| "\n", | ||
| "[23]\ttrain-rmse:0.647195\tvalidation-rmse:2.87395\n", | ||
| "\n", | ||
| "[24]\ttrain-rmse:0.609025\tvalidation-rmse:2.86069\n", | ||
| "\n", | ||
| "[25]\ttrain-rmse:0.562925\tvalidation-rmse:2.87205\n", | ||
| "\n", | ||
| "[26]\ttrain-rmse:0.541676\tvalidation-rmse:2.86275\n", | ||
| "\n", | ||
| "[27]\ttrain-rmse:0.524815\tvalidation-rmse:2.87106\n", | ||
| "\n", | ||
| "[28]\ttrain-rmse:0.483566\tvalidation-rmse:2.86129\n", | ||
| "\n", | ||
| "[29]\ttrain-rmse:0.460363\tvalidation-rmse:2.85877\n", | ||
| "\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "%%sqlflow\n", | ||
| "SELECT * FROM boston.train\n", | ||
| "TRAIN xgboost.gbtree\n", | ||
| "WITH\n", | ||
| " objective=\"reg:squarederror\",\n", | ||
| " train.num_boost_round = 30\n", | ||
| "COLUMN crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, b, lstat\n", | ||
| "LABEL medv\n", | ||
| "INTO sqlflow_models.my_xgb_regression_model;" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "## Predict the Housing Price\n", | ||
| "After training the regression model, let's predict housing prices with it.\n", | ||
| "\n", | ||
| "First, we specify the trained model with the `USING` clause:\n", | ||
| "\n", | ||
| "```sql\n", | ||
| "USING sqlflow_models.my_xgb_regression_model\n", | ||
| "```\n", | ||
| "\n", | ||
| "Then, we specify where to write the predictions with the `PREDICT` clause; here the predicted values go into the `medv` column of the table `boston.predict`:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "PREDICT boston.predict.medv\n", | ||
| "```\n", | ||
| "\n", | ||
| "Next, we use a standard SQL statement to fetch the prediction data:\n", | ||
| "\n", | ||
| "``` sql\n", | ||
| "SELECT * FROM boston.test\n", | ||
| "```\n", | ||
| "\n", | ||
| "Finally, the following is the complete SQLFlow prediction statement:" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 8, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "[03:45:18] 10x13 matrix with 130 entries loaded from predict.txt\n", | ||
| "\n", | ||
| "Done predicting. Predict table : boston.predict\n", | ||
| "\n" | ||
| ] | ||
| } | ||
| ], | ||
| "source": [ | ||
| "%%sqlflow\n", | ||
| "SELECT * FROM boston.test\n", | ||
| "PREDICT boston.predict.medv\n", | ||
| "USING sqlflow_models.my_xgb_regression_model;" | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "markdown", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "Let's take a look at the prediction results." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": 10, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "data": { | ||
| "text/plain": [ | ||
| "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n", | ||
| "| crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv |\n", | ||
| "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+\n", | ||
| "| 0.2896 | 0.0 | 9.69 | 0 | 0.585 | 5.39 | 72.9 | 2.7986 | 6 | 391 | 19.2 | 396.9 | 21.14 | 21.9436 |\n", | ||
| "| 0.26838 | 0.0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 396.9 | 14.1 | 21.9667 |\n", | ||
| "| 0.23912 | 0.0 | 9.69 | 0 | 0.585 | 6.019 | 65.3 | 2.4091 | 6 | 391 | 19.2 | 396.9 | 12.92 | 22.9708 |\n", | ||
| "| 0.17783 | 0.0 | 9.69 | 0 | 0.585 | 5.569 | 73.5 | 2.3999 | 6 | 391 | 19.2 | 395.77 | 15.1 | 22.6373 |\n", | ||
| "| 0.22438 | 0.0 | 9.69 | 0 | 0.585 | 6.027 | 79.7 | 2.4982 | 6 | 391 | 19.2 | 396.9 | 14.33 | 21.9439 |\n", | ||
| "| 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 24.0095 |\n", | ||
| "| 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.9 | 9.08 | 25.0 |\n", | ||
| "| 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.9 | 5.64 | 31.6326 |\n", | ||
| "| 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 26.8375 |\n", | ||
| "| 0.04741 | 0.0 | 11.93 | 0 | 0.573 | 6.03 | 80.8 | 2.505 | 1 | 273 | 21.0 | 396.9 | 7.88 | 22.5877 |\n", | ||
| "+---------+-----+-------+------+-------+-------+------+--------+-----+-----+---------+--------+-------+---------+" | ||
| ] | ||
| }, | ||
| "execution_count": 10, | ||
| "metadata": {}, | ||
| "output_type": "execute_result" | ||
| } | ||
| ], | ||
| "source": [ | ||
| "%%sqlflow\n", | ||
| "SELECT * FROM boston.predict;" | ||
| ] | ||
| } | ||
| ], | ||
| "metadata": { | ||
| "kernelspec": { | ||
| "display_name": "Python 3", | ||
| "language": "python", | ||
| "name": "python3" | ||
| }, | ||
| "language_info": { | ||
| "codemirror_mode": { | ||
| "name": "ipython", | ||
| "version": 3 | ||
| }, | ||
| "file_extension": ".py", | ||
| "mimetype": "text/x-python", | ||
| "name": "python", | ||
| "nbconvert_exporter": "python", | ||
| "pygments_lexer": "ipython3", | ||
| "version": "3.6.9" | ||
| } | ||
| }, | ||
| "nbformat": 4, | ||
| "nbformat_minor": 2 | ||
| } | ||
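
The "Fit Boston Housing Price" cell in the notebook introduces the training clauses one at a time (`TRAIN ... WITH`, `COLUMN`, `LABEL`, `INTO`, and the standard `SELECT`), and the final statement places them in a fixed order. As a rough illustration of how the pieces compose, the Python sketch below assembles the same statement from its parts. `build_train_statement` is a hypothetical helper written for this note; it is not part of SQLFlow and only does string formatting.

```python
def build_train_statement(select, estimator, attrs, columns, label, model):
    """Assemble a SQLFlow-style TRAIN statement from its clauses.

    Hypothetical helper for illustration only; it is not a SQLFlow API.
    """
    def fmt(value):
        # Quote strings and leave numbers bare, as in the notebook's statement.
        return f'"{value}"' if isinstance(value, str) else str(value)

    with_clause = ",\n".join(f"    {k}={fmt(v)}" for k, v in attrs.items())
    return "\n".join([
        select,                           # standard SQL that fetches the data
        f"TRAIN {estimator}",             # estimator name
        "WITH",
        with_clause,                      # model and training attributes
        f"COLUMN {', '.join(columns)}",   # feature columns
        f"LABEL {label}",                 # label column
        f"INTO {model};",                 # where the trained model is saved
    ])


statement = build_train_statement(
    select="SELECT * FROM boston.train",
    estimator="xgboost.gbtree",
    attrs={"objective": "reg:squarederror", "train.num_boost_round": 30},
    columns=["crim", "zn", "indus", "chas", "nox", "rm", "age",
             "dis", "rad", "tax", "ptratio", "b", "lstat"],
    label="medv",
    model="sqlflow_models.my_xgb_regression_model",
)
print(statement)
```

The printed text can be pasted into a `%%sqlflow` cell. Keeping the attributes in a dictionary makes it easy to try a different `train.num_boost_round` without retyping the rest of the statement.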
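
The training cell in the notebook prints one line per boosting round with `train-rmse` and `validation-rmse`, the root-mean-square error on the training and validation splits. The sketch below shows how that metric is computed and how such a log could be scanned for the round with the lowest validation error. It assumes the tab-separated line shape seen in the notebook output; `rmse` and `parse_eval_log` are illustrative names, and the three sample lines are copied from that output.

```python
import math
import re

# Matches lines such as "[24]\ttrain-rmse:0.609025\tvalidation-rmse:2.86069".
LINE = re.compile(r"\[(\d+)\]\s+train-rmse:([\d.]+)\s+validation-rmse:([\d.]+)")


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def parse_eval_log(text):
    """Return (round, train_rmse, validation_rmse) tuples found in the log."""
    return [(int(r), float(tr), float(va)) for r, tr, va in LINE.findall(text)]


# Three rounds copied from the notebook's training output.
log = (
    "[0]\ttrain-rmse:17.0286\tvalidation-rmse:17.8089\n"
    "[24]\ttrain-rmse:0.609025\tvalidation-rmse:2.86069\n"
    "[29]\ttrain-rmse:0.460363\tvalidation-rmse:2.85877\n"
)
rounds = parse_eval_log(log)
best = min(rounds, key=lambda row: row[2])  # lowest validation RMSE
print(best)  # (29, 0.460363, 2.85877)
print(rmse([3.0, 5.0], [1.0, 5.0]))  # sqrt((4 + 0) / 2)
```

In the full log, the training RMSE keeps falling to about 0.46 while the validation RMSE flattens near 2.86 from round 20 onward, which suggests that additional rounds mostly fit the training split.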