20 changes: 10 additions & 10 deletions 03_generalization_and_cv/01-cross_validation_and_metrics.ipynb
@@ -10,7 +10,7 @@
"\n",
"## Table of contents\n",
"\n",
"* [1 The benefits of cross-calidation](#benefitscv)\n",
"* [1 The benefits of cross-validation](#benefitscv)\n",
" * [1.1 Load our dataset](#benefitscv_load)\n",
" * [1.2 Empirical error vs generalization error](#benefitscv_empirical)\n",
" * [1.3 A single error is not enough... what about the variance?](#benefitscv_single)\n",
@@ -23,7 +23,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"A core question in machine learning is how to evaluate the performance of a model once it's parameters are estimated (i.e. the model has been trained). In this notebook, we aim at presenting how you should answer this question in a statistically sound way. First, we will present the benefits of using cross-validation for this task and then have a quick look at different strategies and metrics that one should use in supervised learning."
"A core question in machine learning is how to evaluate the performance of a model once its parameters are estimated (i.e. the model has been trained). In this notebook, we aim at presenting how you should answer this question in a statistically sound way. First, we will present the benefits of using cross-validation for this task and then have a quick look at different strategies and metrics that one should use in supervised learning."
]
},
{
@@ -465,11 +465,11 @@
" <h3>Generalization error:</h3>\n",
" The aim of model training is to select the model $f$ out of a class of models $\\mathcal F$ that minimizes a measure of the risk. The risk is measured with a loss $l$ between the true value $y$ associated to $x$ and the prediction $f(x)$ and thus we want to find: \n",
" $$\n",
" f^\\star = \\arg\\min_{f \\in \\mathcal F}\\mathbb E_{(x, y) \\sim \\pi}[l(f(x), y]\n",
" f^\\star = \\arg\\min_{f \\in \\mathcal F}\\mathbb E_{(x, y) \\sim \\pi}[l(f(x), y)]\n",
" $$ \n",
" The issue is that we cannot compute the expectation $\\mathbb E_{(x, y) \\sim \\pi}$ because we don't know the input distribution $\\pi$. Therefore, we approximate it with a set of examples $\\{(x_1, y_1), \\dots (x_N, y_N)\\}$ drawn <i>i.i.d.</i> from $\\pi$ and use the <b>Empirical Risk Minimization</b> (ERM):\n",
" $$\n",
" \\widehat{f} = \\arg\\min_{f \\in \\mathcal F}\\frac1N\\sum_{i=1}^Nl(f(x_i), y_i]\n",
" \\widehat{f} = \\arg\\min_{f \\in \\mathcal F}\\frac1N\\sum_{i=1}^Nl(f(x_i), y_i)\n",
" $$\n",
" If the samples are drawn independently, we know that the error has a variance of $\\mathcal O\\left(\\frac{1}{\\sqrt{N}}\\right)$. Thus there is a gap between the minimizer of the risk and the minimizer of the empirical risk. If we optimize too much for the ERM, the gap might be big and the selected model will have bad performance on unseen data. This is what is called <b>overfitting</b>. To control, this, one need to have a measure of the risk independent from the measure of the risk which is used to select the model: the Empirical Risk on the test set!\n",
"</div>\n"
@@ -486,14 +486,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"While we were able to estimate the generalization error, we are indeed unable to know anything about the variance of our model and thus if it is robust or not. This is where the framework of cross-validation is used. Indeed, we can repeat our experiment and compute several time our generalization error and get intuition about the stability of our model."
"While we were able to estimate the generalization error, we are indeed unable to know anything about the variance of our model and thus if it is robust or not. This is where the framework of cross-validation is used. Indeed, we can repeat our experiment and compute several times our generalization error and get intuition about the stability of our model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The simplest way that we can think of is to shuffle our data and split it into two sets as we previously did and repeat several time our experiment. In scikit-learn, using the function `cross_validate` with the cross-validation `ShuffleSplit` allows us to make such evaluation."
"The simplest way that we can think of is to shuffle our data and split it into two sets as we previously did and repeat several times our experiment. In scikit-learn, using the function `cross_validate` with the cross-validation `ShuffleSplit` allows us to make such evaluation."
]
},
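As a hedged illustration of the cell above: `cross_validate` and `ShuffleSplit` are the scikit-learn tools the text names, while the California housing dataset and the decision tree regressor are assumptions on my part. The repeated shuffle-and-split evaluation could look like:

```python
# Sketch of repeated shuffled train/test splits with cross_validate +
# ShuffleSplit; the dataset and regressor are illustrative assumptions.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)

cv = ShuffleSplit(n_splits=30, test_size=0.2, random_state=0)
cv_results = cross_validate(
    DecisionTreeRegressor(random_state=0), X, y,
    cv=cv, scoring="neg_mean_absolute_error",
)
errors = -cv_results["test_score"]  # back to positive MAE; target is in 100 k$ units
print(f"MAE: {errors.mean():.3f} +/- {errors.std():.3f} (x 100 k$)")
```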
{
@@ -902,7 +902,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the median value range from 50 k\\\\$ up to 500 k\\\\$. Thus an error range of 3 k\\\\$ means that our cross-validation results can be trusted and do not suffer from an execessive variance. Regarding the performance of our model itself, we can see that making an error of 45 k\\\\$ would be problematic even more if this happen for housing with low value. However, we also see some limitation regarding the metric that we are using. Making an error of 45 k\\\\$ for a target at 50 k\\\\$ and at 500 k\\\\$ should not have the same impact. We should instead use the mean absolute percentage error which will give a relative error."
"We see that the median value range from 50 k\\\\$ up to 500 k\\\\$. Thus, an error range of 3 k\\\\$ means that our cross-validation results can be trusted and do not suffer from an excessive variance. Regarding the performance of our model itself, we can see that making an error of 45 k\\\\$ would be problematic even more if this happens for housing with low value. However, we also see some limitation regarding the metric that we are using. Making an error of 45 k\\\\$ for a target at 50 k\\\\$ and at 500 k\\\\$ should not have the same impact. We should instead use the mean absolute percentage error which will give a relative error."
]
},
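A tiny worked example of the point made above, with made-up numbers: the same 45 k$ absolute error is a 90% relative error on a 50 k$ target but only a 9% relative error on a 500 k$ target. With a recent scikit-learn (0.24+), `mean_absolute_percentage_error` computes this directly:

```python
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

y_true = [50_000, 500_000]
y_pred = [95_000, 545_000]  # both predictions are off by 45 k$

print(mean_absolute_error(y_true, y_pred))                    # 45000.0 in both cases
print(mean_absolute_percentage_error([50_000], [95_000]))     # 0.90 -> 90% error
print(mean_absolute_percentage_error([500_000], [545_000]))   # 0.09 ->  9% error
```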
{
@@ -1077,7 +1077,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that with a low number of samples, the variance is much larger. Indeed, for low number of sample, we cannot even trust our cross-validation and therefore cannot conclude anything about our regressor. Therefore, it is really important to make experiment with a large enough sample size to be sure about the conclusions which would be drawn."
"We see that with a low number of samples, the variance is much larger. Indeed, for low number of samples, we cannot even trust our cross-validation and therefore cannot conclude anything about our regressor. Therefore, it is really important to make experiment with a large enough sample size to be sure about the conclusions which would be drawn."
]
},
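One way to reproduce the effect described above is to subsample the same dataset at several sizes and compare the spread of the cross-validated errors; this is a sketch under that assumption, with the dataset and model again placeholders:

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
rng = np.random.RandomState(0)

for n_samples in [100, 1_000, 10_000]:
    subset = rng.choice(len(X), size=n_samples, replace=False)
    cv_results = cross_validate(
        DecisionTreeRegressor(random_state=0), X[subset], y[subset],
        cv=ShuffleSplit(n_splits=30, random_state=0),
        scoring="neg_mean_absolute_error",
    )
    errors = -cv_results["test_score"]
    # The standard deviation across splits shrinks as n_samples grows.
    print(f"n={n_samples:>6}: MAE = {errors.mean():.3f} +/- {errors.std():.3f}")
```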
{
@@ -1127,7 +1127,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"We plot the generalization errors for each of the experiment. We see that even our regressor does not perform well, it is far above chances our a regressor that would predict the mean target."
"We plot the generalization errors for each of the experiment. We see that even if our regressor does not perform well, it is far above chances our a regressor that would predict the mean target."
]
},
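The "regressor that would predict the mean target" mentioned above corresponds, in scikit-learn, to `DummyRegressor(strategy="mean")`. A hedged sketch of that baseline comparison, with the dataset and main regressor assumed, could be:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True)
cv = ShuffleSplit(n_splits=30, random_state=0)

for name, model in [("tree", DecisionTreeRegressor(random_state=0)),
                    ("mean baseline", DummyRegressor(strategy="mean"))]:
    res = cross_validate(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-res['test_score'].mean():.3f}")
```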
{
@@ -1193,7 +1193,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take an example of some financial quotes. These are the value of compagny stocks with the time."
"Let's take an example of some financial quotes. These are the value of company's stocks with the time."
]
},
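The notebook's follow-up is not shown here, but for time-ordered data such as stock quotes, shuffling the samples would leak future information into the training set. Scikit-learn's `TimeSeriesSplit` keeps the temporal order; this hedged sketch uses synthetic quotes, not the notebook's data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_validate

rng = np.random.RandomState(0)
# Synthetic "quote" series: a random walk, with 5 lagged values as features.
quotes = rng.normal(scale=1.0, size=1_000).cumsum()
X = np.column_stack([quotes[i:-5 + i] for i in range(5)])
y = quotes[5:]

cv = TimeSeriesSplit(n_splits=5)  # training data always precedes test data in time
cv_results = cross_validate(Ridge(), X, y, cv=cv, scoring="neg_mean_absolute_error")
print(-cv_results["test_score"])
```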
{
174 changes: 101 additions & 73 deletions 04_metrics/01-evaluation_metrics_regression.ipynb

Large diffs are not rendered by default.
