replace dask-xgboost with xgboost #50

Merged · 8 commits · Dec 18, 2020
6 changes: 6 additions & 0 deletions examples/examples-cpu/.saturn/start
@@ -0,0 +1,6 @@
pip uninstall -y dask-xgboost xgboost || true

rm -f /opt/conda/envs/saturn/lib/libxgboost.so
rm -f /opt/conda/envs/saturn/lib/python3.7/site-packages/xgboost/lib/libxgboost.so

pip install --upgrade 'xgboost>=1.3.0'
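The start script above uninstalls `dask-xgboost`, removes stale `libxgboost.so` files, and installs `xgboost>=1.3.0`, the first release with the native Dask integration. As a hypothetical illustration (not part of this PR), the `>=1.3.0` constraint amounts to a numeric, part-by-part version comparison:

```python
def meets_minimum(installed: str, minimum: str = "1.3.0") -> bool:
    """Compare dotted version strings numerically, part by part.

    A simplified stand-in for real specifier handling (e.g. pip's);
    it ignores pre-release tags and build metadata.
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

print(meets_minimum("1.3.0"))   # True: exactly the minimum
print(meets_minimum("1.2.1"))   # False: predates xgboost.dask
print(meets_minimum("1.10.0"))  # True: numeric, not lexicographic, comparison
```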
108 changes: 94 additions & 14 deletions examples/examples-cpu/nyc-taxi-snowflake/xgboost-dask.ipynb
Original file line number Diff line number Diff line change
@@ -18,6 +18,13 @@
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -50,7 +57,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Initialize Dask cluster"
"# Initialize Dask cluster\n",
"\n",
"The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running."
]
},
{
@@ -249,7 +258,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train a model"
"# Train a model\n",
"\n",
"This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)."
]
},
{
@@ -258,25 +269,73 @@
"metadata": {},
"outputs": [],
"source": [
"import dask_xgboost\n",
"\n",
"xgb_reg = dask_xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dtrain = xgb.dask.DaskDMatrix(\n",
" client=client,\n",
" data=taxi_train[features],\n",
" label=taxi_train[y_col]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])"
"result = xgb.dask.train(\n",
" client=client,\n",
" params={\n",
" \"objective\": \"reg:squarederror\",\n",
" \"tree_method\": \"hist\",\n",
" \"learning_rate\": 0.1,\n",
" \"max_depth\": 5,\n",
" },\n",
" dtrain=dtrain,\n",
" num_boost_round=50\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgb.dask.train()` returns a dictionary whose `booster` entry is a regular `xgb.core.Booster`, the same model object produced by non-Dask training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"booster = result[\"booster\"]\n",
"type(booster)\n",
"\n",
"# xgboost.core.Booster"
]
},
{
@@ -295,7 +354,7 @@
"import cloudpickle\n",
"\n",
"with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n",
" cloudpickle.dump(xgb_reg, f)"
" cloudpickle.dump(booster, f)"
]
},
{
@@ -319,6 +378,28 @@
"_ = wait(taxi_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Note that this model object is just a regular XGBoost booster, not a special Dask-specific model object.\n",
"\n",
"This function returns a Dask Array or Dask Series of predictions, depending on the input type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds = xgb.dask.predict(\n",
" client=client,\n",
" model=booster,\n",
" data=taxi_test[features]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -334,8 +415,7 @@
"source": [
"from dask_ml.metrics import mean_squared_error\n",
"\n",
"preds = xgb_reg.predict(taxi_test[features])\n",
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)"
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)"
]
}
],
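The evaluation cell above computes RMSE via `mean_squared_error(..., squared=False)` from `dask_ml`. For reference, the metric itself reduces to the following dependency-free sketch over plain Python sequences:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: what mean_squared_error(..., squared=False)
    computes, written out over plain Python lists."""
    assert len(y_true) == len(y_pred) and y_true
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(total / len(y_true))

print(rmse([3.0, 5.0, 2.0], [1.0, 5.0, 2.0]))  # sqrt(4/3) ≈ 1.1547
```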
2 changes: 1 addition & 1 deletion examples/examples-cpu/nyc-taxi-snowflake/xgboost.ipynb
@@ -146,7 +146,7 @@
"\n",
"xgb_reg = xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" tree_method='hist',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
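Across these diffs, the scikit-learn-style `XGBRegressor(...)` call is replaced by `xgb.dask.train(params=..., num_boost_round=...)`. The translation between the two forms is mechanical: keyword arguments become entries in `params`, except `n_estimators`, which becomes `num_boost_round`. A hypothetical helper (not part of the PR) makes that mapping explicit:

```python
def to_train_args(**sklearn_kwargs):
    """Map sklearn-style XGBRegressor kwargs to the params/num_boost_round
    split used by xgb.dask.train(). Illustrative only; real code would
    also need to handle parameter aliases and estimator-specific keys."""
    num_boost_round = sklearn_kwargs.pop("n_estimators", 10)
    return sklearn_kwargs, num_boost_round

params, num_boost_round = to_train_args(
    objective="reg:squarederror",
    tree_method="hist",
    learning_rate=0.1,
    max_depth=5,
    n_estimators=50,
)
print(num_boost_round)  # 50
print(sorted(params))   # ['learning_rate', 'max_depth', 'objective', 'tree_method']
```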
108 changes: 94 additions & 14 deletions examples/examples-cpu/nyc-taxi/xgboost-dask.ipynb
Original file line number Diff line number Diff line change
@@ -18,6 +18,13 @@
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This notebook describes how to use Dask to scale training of XGBoost models. For more detailed information, see [\"Distributed XGBoost with Dask\"](https://xgboost.readthedocs.io/en/latest/tutorials/dask.html) in the XGBoost documentation and [\"XGBoost Training with Dask\"](https://www.saturncloud.io/docs/tutorials/xgboost/) in Saturn Cloud's documentation."
]
},
{
"cell_type": "code",
"execution_count": null,
@@ -50,7 +57,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Initialize Dask cluster"
"# Initialize Dask cluster\n",
"\n",
"The code below uses [`dask-saturn`](https://github.com/saturncloud/dask-saturn) to create a Dask cluster or connect to one that is already running."
]
},
{
@@ -183,7 +192,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train a model"
"# Train a model\n",
"\n",
"This example uses the native Dask integration built into XGBoost. That integration was added in `xgboost` 1.3.0, and should be preferred to [`dask-xgboost`](https://github.com/dask/dask-xgboost)."
]
},
{
@@ -192,25 +203,73 @@
"metadata": {},
"outputs": [],
"source": [
"import dask_xgboost\n",
"\n",
"xgb_reg = dask_xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",
"import xgboost as xgb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Training data for `xgboost.dask` needs to be prepared in a special object called `DaskDMatrix`. This is like the XGBoost `DMatrix` that you might be familiar with, but is backed by Dask's distributed collections (Dask DataFrame and Dask Array)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dtrain = xgb.dask.DaskDMatrix(\n",
" client=client,\n",
" data=taxi_train[features],\n",
" label=taxi_train[y_col]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pass any [xgboost parameters](https://xgboost.readthedocs.io/en/latest/parameter.html) to `xgb.dask.train()`. The training process will then start up on all workers that have some of the data in `dtrain`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"_ = xgb_reg.fit(taxi_train[features], y=taxi_train[y_col])"
"result = xgb.dask.train(\n",
" client=client,\n",
" params={\n",
" \"objective\": \"reg:squarederror\",\n",
" \"tree_method\": \"hist\",\n",
" \"learning_rate\": 0.1,\n",
" \"max_depth\": 5,\n",
" },\n",
" dtrain=dtrain,\n",
" num_boost_round=50\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgb.dask.train()` returns a dictionary whose `booster` entry is a regular `xgb.core.Booster`, the same model object produced by non-Dask training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"booster = result[\"booster\"]\n",
"type(booster)\n",
"\n",
"# xgboost.core.Booster"
]
},
{
@@ -229,7 +288,7 @@
"import cloudpickle\n",
"\n",
"with open(f'{MODEL_PATH}/xgboost_dask.pkl', 'wb') as f:\n",
" cloudpickle.dump(xgb_reg, f)"
" cloudpickle.dump(booster, f)"
]
},
{
@@ -257,6 +316,28 @@
"taxi_test = prep_df(taxi_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`xgboost.dask.predict()` can be used to create predictions on a Dask collection using an XGBoost model object. Note that this model object is just a regular XGBoost booster, not a special Dask-specific model object.\n",
"\n",
"This function returns a Dask Array or Dask Series of predictions, depending on the input type."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"preds = xgb.dask.predict(\n",
" client=client,\n",
" model=booster,\n",
" data=taxi_test[features]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
@@ -272,8 +353,7 @@
"source": [
"from dask_ml.metrics import mean_squared_error\n",
"\n",
"preds = xgb_reg.predict(taxi_test[features])\n",
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds, squared=False)"
"mean_squared_error(taxi_test[y_col].to_dask_array(), preds.to_dask_array(), squared=False)"
]
}
],
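Both notebooks persist the trained model with `cloudpickle`. The save/load round trip looks like the following sketch, which uses the standard-library `pickle` and a stand-in dictionary in place of the real `xgboost` Booster:

```python
import os
import pickle
import tempfile

# Stand-in for the trained model; the notebooks pickle the actual
# xgboost.core.Booster object extracted from the xgb.dask.train() result.
model = {"params": {"max_depth": 5, "tree_method": "hist"}, "rounds": 50}

model_path = os.path.join(tempfile.mkdtemp(), "xgboost_dask.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True: the round trip preserves the object
```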
2 changes: 1 addition & 1 deletion examples/examples-cpu/nyc-taxi/xgboost.ipynb
@@ -124,7 +124,7 @@
"\n",
"xgb_reg = xgboost.XGBRegressor(\n",
" objective=\"reg:squarederror\",\n",
" tree_method='approx',\n",
" tree_method='hist',\n",
" learning_rate=0.1,\n",
" max_depth=5,\n",
" n_estimators=50,\n",