diff --git a/examples/examples-cpu/.saturn/saturn.json b/examples/examples-cpu/.saturn/saturn.json index 299886be..10c3f7b0 100644 --- a/examples/examples-cpu/.saturn/saturn.json +++ b/examples/examples-cpu/.saturn/saturn.json @@ -1,5 +1,5 @@ { - "image": "saturncloud/saturn:2020.11.30", + "image": "saturncloud/saturn:2020.12.16-dev", "jupyter": { "size": "large", "disk_space": "10Gi", diff --git a/examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb b/examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb index 82c8d19c..b776b77a 100644 --- a/examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb +++ b/examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb @@ -18,6 +18,13 @@ "" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation." + ] + }, { "cell_type": "code", "execution_count": null, @@ -50,7 +57,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Initialize Dask cluster" + "# Initialize Dask cluster\n", + "\n", + "The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running." ] }, { @@ -249,7 +258,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Train a model" + "# Train a model\n", + "\n", + "This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)." ] }, { @@ -258,17 +269,36 @@ "metadata": {}, "outputs": [], "source": [ - "import dask_xgboost\n", - "\n", - "xgb_reg = dask_xgboost.XGBRegressor(\n", - " objective=\"reg:squarederror\",\n", - " tree_method='approx',\n", - " learning_rate=0.1,\n", - " max_depth=5,\n", - " n_estimators=50,\n", + "import xgboost as xgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dtrain = xgb.dask.DaskDMatrix(\n", + " client=client,\n", + " data=taxi_train[features],\n", + " label=taxi_train[y_col]\n", ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`." + ] + }, { "cell_type": "code", "execution_count": null, @@ -276,7 +306,36 @@ "outputs": [], "source": [ "%%time\n", - "_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])" + "result = xgb.dask.train(\n", + " client=client,\n", + " params={\n", + " \"objective\": \"reg:squarederror\",\n", + " \"tree_method\": \"hist\",\n", + " \"learning_rate\": 0.1,\n", + " \"max_depth\": 5,\n", + " },\n", + " dtrain=dtrain,\n", + " num_boost_round=50\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`xgb.dask.train()` produces a regular `xgb.core.Booster` object, the same model object produced by non-Dask training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "booster = result[\"booster\"]\n", + "type(booster)\n", + "\n", + "# xgboost.core.Booster" ] }, { @@ -295,7 +354,7 @@ "import cloudpickle\n", "\n", "with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n", - " cloudpickle.dump(xgb_reg, f)" + " cloudpickle.dump(booster, f)" ] }, { @@ -319,6 +378,28 @@ "_ = wait(taxi_test)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Because the model object here is just a regular XGBoost model, using `dask-xgboost` for batch scoring doesn't require that you also perform training on Dask.\n", + "\n", + "This function returns a Dask Array or Dask Series of predictions, depending on the input type." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preds = xgb.dask.predict(\n", + " client=client,\n", + " model=booster,\n", + " data=taxi_test[features]\n", + ")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -334,8 +415,7 @@ "source": [ "from dask_ml.metrics import mean_squared_error\n", "\n", - "preds = xgb_reg.predict(taxi_test[features])\n", - "mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)" + "mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)" ] } ], diff --git a/examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb b/examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb index d922e148..bcaf098f 100644 --- a/examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb +++ b/examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb @@ -147,7 +147,7 @@ "\n", "xgb_reg = xgboost.XGBRegressor(\n", " objective=\"reg:squarederror\",\n", - " tree_method='approx',\n", + " tree_method='hist',\n", " learning_rate=0.1,\n", " max_depth=5,\n", " n_estimators=50,\n", diff --git a/examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb b/examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb index aa761142..8e469213 100644 --- a/examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb +++ b/examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb @@ -18,6 +18,13 @@ "" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation." + ] + }, { "cell_type": "code", "execution_count": null, @@ -50,7 +57,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Initialize Dask cluster" + "# Initialize Dask cluster\n", + "\n", + "The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running." ] }, { @@ -183,7 +192,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Train a model" + "# Train a model\n", + "\n", + "This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)." ] }, { @@ -192,17 +203,36 @@ "metadata": {}, "outputs": [], "source": [ - "import dask_xgboost\n", - "\n", - "xgb_reg = dask_xgboost.XGBRegressor(\n", - " objective=\"reg:squarederror\",\n", - " tree_method='approx',\n", - " learning_rate=0.1,\n", - " max_depth=5,\n", - " n_estimators=50,\n", + "import xgboost as xgb" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dtrain = xgb.dask.DaskDMatrix(\n", + " client=client,\n", + " data=taxi_train[features],\n", + " label=taxi_train[y_col]\n", ")" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`." + ] + }, { "cell_type": "code", "execution_count": null, @@ -210,7 +240,36 @@ "outputs": [], "source": [ "%%time\n", - "_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])" + "result = xgb.dask.train(\n", + " client=client,\n", + " params={\n", + " \"objective\": \"reg:squarederror\",\n", + " \"tree_method\": \"hist\",\n", + " \"learning_rate\": 0.1,\n", + " \"max_depth\": 5,\n", + " },\n", + " dtrain=dtrain,\n", + " num_boost_round=50\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`xgb.dask.train()` produces a regular `xgb.core.Booster` object, the same model object produced by non-Dask training." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "booster = result[\"booster\"]\n", + "type(booster)\n", + "\n", + "# xgboost.core.Booster" ] }, { @@ -229,7 +288,7 @@ "import cloudpickle\n", "\n", "with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n", - " cloudpickle.dump(xgb_reg, f)" + " cloudpickle.dump(booster, f)" ] }, { @@ -257,6 +316,28 @@ "taxi_test = prep_df(taxi_test)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Because the model object here is just a regular XGBoost model, using `xgboost.dask` for batch scoring doesn't require that you also perform training on Dask.\n", + "\n", + "This function returns a Dask Array or Dask Series of predictions, depending on the input type." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "preds = xgb.dask.predict(\n", + " client=client,\n", + " model=booster,\n", + " data=taxi_test[features]\n", + ")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -272,8 +353,7 @@ "source": [ "from dask_ml.metrics import mean_squared_error\n", "\n", - "preds = xgb_reg.predict(taxi_test[features])\n", - "mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)" + "mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)" ] } ], diff --git a/examples/examples-cpu/nyc-taxi/xgboost.ipynb b/examples/examples-cpu/nyc-taxi/xgboost.ipynb index 604688c3..bf10f411 100644 --- a/examples/examples-cpu/nyc-taxi/xgboost.ipynb +++ b/examples/examples-cpu/nyc-taxi/xgboost.ipynb @@ -124,7 +124,7 @@ "\n", "xgb_reg = xgboost.XGBRegressor(\n", " objective=\"reg:squarederror\",\n", - " tree_method='approx',\n", + " tree_method='hist',\n", " learning_rate=0.1,\n", " max_depth=5,\n", " n_estimators=50,\n",