replace dask-xgboost with xgboost #50

Merged · 8 commits · Dec 18, 2020
6 changes: 6 additions & 0 deletions examples/examples-cpu/.saturn/start
@@ -0,0 +1,6 @@
pip uninstall -y dask-xgboost xgboost || true

rm -f /opt/conda/envs/saturn/lib/libxgboost.so
rm -f /opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/lib/libxgboost.so

pip install --upgrade 'xgboost>=1.3.0'
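The start script above uninstalls `dask-xgboost`, removes stale `libxgboost.so` files, and installs `xgboost>=1.3.0`, the first release with the native Dask integration. As a hypothetical illustration (not part of this PR), the `>=1.3.0` constraint amounts to a numeric, part-by-part version comparison:

```python
def meets_minimum(installed: str, minimum: str = "1.3.0") -> bool:
    """Compare dotted version strings numerically, part by part.

    A simplified stand-in for real specifier handling (e.g. pip's);
    it ignores pre-release tags and build metadata.
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

print(meets_minimum("1.3.0"))   # True: exactly the minimum
print(meets_minimum("1.2.1"))   # False: predates xgboost.dask
print(meets_minimum("1.10.0"))  # True: numeric, not lexicographic, comparison
```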
108 changes: 94 additions & 14 deletions examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb
Original file line number Diff line number Diff line change
@@ -18,6 +18,13 @@
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -50,7 +57,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Initialize Dask cluster"
"# Initialize Dask cluster\n",
"\n",
"The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running."
]
},
{
@@ -249,7 +258,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train a model"
"# Train a model\n",
"\n",
"This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)."
]
},
{
@@ -258,25 +269,73 @@
"metadata": {},
"outputs": [],
"source": [
"import dask_xgboost\n",
"\n",
"xgb_reg = dask_xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dtrain = xgb.dask.DaskDMatrix(\n",
" client=client,\n",
" data=taxi_train[features],\n",
" label=taxi_train[y_col]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])"
"result = xgb.dask.train(\n",
" client=client,\n",
" params={\n",
" \"objective\": \"reg:squarederror\",\n",
" \"tree_method\": \"hist\",\n",
" \"learning_rate\": 0.1,\n",
" \"max_depth\": 5,\n",
" },\n",
" dtrain=dtrain,\n",
" num_boost_round=50\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgb.dask.train()` returns a dictionary whose `booster` entry is a regular `xgb.core.Booster`, the same model object produced by non-Dask training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"booster = result[\"booster\"]\n",
"type(booster)\n",
"\n",
"# xgboost.core.Booster"
]
},
{
@@ -295,7 +354,7 @@
"import cloudpickle\n",
"\n",
"with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n",
" cloudpickle.dump(xgb_reg, f)"
" cloudpickle.dump(booster, f)"
]
},
{
@@ -319,6 +378,28 @@
"_ = wait(taxi_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Note that this model object is just a regular XGBoost booster, not a special Dask-specific model object.\n",
"\n",
"This function returns a Dask Array or Dask Series of predictions, depending on the input type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds = xgb.dask.predict(\n",
" client=client,\n",
" model=booster,\n",
" data=taxi_test[features]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -334,8 +415,7 @@
"source": [
"from dask_ml.metrics import mean_squared_error\n",
"\n",
"preds = xgb_reg.predict(taxi_test[features])\n",
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)"
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)"
]
}
],
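The evaluation cell above computes RMSE via `mean_squared_error(..., squared=False)` from `dask_ml`. For reference, the metric itself reduces to the following dependency-free sketch over plain Python sequences:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: what mean_squared_error(..., squared=False)
    computes, written out over plain Python lists."""
    assert len(y_true) == len(y_pred) and y_true
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))

print(rmse([3.0, 5.0, 2.0], [1.0, 5.0, 2.0]))  # sqrt(4/3) ≈ 1.1547
```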
2 changes: 1 addition & 1 deletion examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb
@@ -146,7 +146,7 @@
"\n",
"xgb_reg = xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" tree_method='hist',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
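Across these diffs, the scikit-learn-style `XGBRegressor(...)` call is replaced by `xgb.dask.train(params=..., num_boost_round=...)`. The translation between the two forms is mechanical: keyword arguments become entries in `params`, except `n_estimators`, which becomes `num_boost_round`. A hypothetical helper (not part of the PR) makes that mapping explicit:

```python
def to_train_args(**sklearn_kwargs):
    """Map sklearn-style XGBRegressor kwargs to the params/num_boost_round
    split used by xgb.dask.train(). Illustrative only; real code would
    also need to handle parameter aliases and estimator-specific keys."""
    num_boost_round = sklearn_kwargs.pop("n_estimators", 10)
    return sklearn_kwargs, num_boost_round

params, num_boost_round = to_train_args(
    objective="reg:squarederror",
    tree_method="hist",
    learning_rate=0.1,
    max_depth=5,
    n_estimators=50,
)
print(num_boost_round)  # 50
print(sorted(params))   # ['learning_rate', 'max_depth', 'objective', 'tree_method']
```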
108 changes: 94 additions & 14 deletions examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb
Original file line number Diff line number Diff line change
@@ -18,6 +18,13 @@
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -50,7 +57,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Initialize Dask cluster"
"# Initialize Dask cluster\n",
"\n",
"The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running."
]
},
{
@@ -183,7 +192,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train a model"
"# Train a model\n",
"\n",
"This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)."
]
},
{
@@ -192,25 +203,73 @@
"metadata": {},
"outputs": [],
"source": [
"import dask_xgboost\n",
"\n",
"xgb_reg = dask_xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dtrain = xgb.dask.DaskDMatrix(\n",
" client=client,\n",
" data=taxi_train[features],\n",
" label=taxi_train[y_col]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])"
"result = xgb.dask.train(\n",
" client=client,\n",
" params={\n",
" \"objective\": \"reg:squarederror\",\n",
" \"tree_method\": \"hist\",\n",
" \"learning_rate\": 0.1,\n",
" \"max_depth\": 5,\n",
" },\n",
" dtrain=dtrain,\n",
" num_boost_round=50\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgb.dask.train()` returns a dictionary whose `booster` entry is a regular `xgb.core.Booster`, the same model object produced by non-Dask training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"booster = result[\"booster\"]\n",
"type(booster)\n",
"\n",
"# xgboost.core.Booster"
]
},
{
@@ -229,7 +288,7 @@
"import cloudpickle\n",
"\n",
"with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n",
" cloudpickle.dump(xgb_reg, f)"
" cloudpickle.dump(booster, f)"
]
},
{
@@ -257,6 +316,28 @@
"taxi_test = prep_df(taxi_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Note that this model object is just a regular XGBoost booster, not a special Dask-specific model object.\n",
"\n",
"This function returns a Dask Array or Dask Series of predictions, depending on the input type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds = xgb.dask.predict(\n",
" client=client,\n",
" model=booster,\n",
" data=taxi_test[features]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -272,8 +353,7 @@
"source": [
"from dask_ml.metrics import mean_squared_error\n",
"\n",
"preds = xgb_reg.predict(taxi_test[features])\n",
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)"
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)"
]
}
],
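Both notebooks persist the trained model with `cloudpickle`. The save/load round trip looks like the following sketch, which uses the standard-library `pickle` and a stand-in dictionary in place of the real `xgboost` Booster:

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; the notebooks pickle the actual
# xgboost.core.Booster object extracted from the xgb.dask.train() result.
model = {"params": {"max_depth": 5, "tree_method": "hist"}, "rounds": 50}

model_path = os.path.join(tempfile.mkdtemp(), "xgboost_dask.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True: the round trip preserves the object
```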
2 changes: 1 addition & 1 deletion examples/examples-cpu/nyc-taxi/xgboost.ipynb
@@ -124,7 +124,7 @@
"\n",
"xgb_reg = xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" tree_method='hist',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",