Exercise: Missing Values

{"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.10.13","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":10211,"databundleVersionId":111096,"sourceType":"competition"}],"isInternetEnabled":false,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"**This notebook is an exercise in the [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/missing-values).**\n\n---\n","metadata":{}},{"cell_type":"markdown","source":"Now it's your turn to test your new knowledge of handling **missing values**. You'll probably find that handling them well makes a big difference.\n\n# Setup\n\nThe questions will give you feedback on your work. 
Run the following cell to set up the feedback system.","metadata":{}},{"cell_type":"code","source":"# Set up code checking\nimport os\nif not os.path.exists(\"../input/train.csv\"):\n    os.symlink(\"../input/home-data-for-ml-course/train.csv\", \"../input/train.csv\")  \n    os.symlink(\"../input/home-data-for-ml-course/test.csv\", \"../input/test.csv\") \nfrom learntools.core import binder\nbinder.bind(globals())\nfrom learntools.ml_intermediate.ex2 import *\nprint(\"Setup Complete\")","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:31:45.376724Z","iopub.execute_input":"2024-06-16T15:31:45.377257Z","iopub.status.idle":"2024-06-16T15:31:46.597476Z","shell.execute_reply.started":"2024-06-16T15:31:45.377208Z","shell.execute_reply":"2024-06-16T15:31:46.596295Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stdout","text":"Setup Complete\n","output_type":"stream"}]},{"cell_type":"markdown","source":"In this exercise, you will work with data from the [Housing Prices Competition for Kaggle Learn Users](https://www.kaggle.com/c/home-data-for-ml-course). \n\n![Ames Housing dataset image](https://storage.googleapis.com/kaggle-media/learn/images/lTJVG4e.png)\n\nRun the next code cell without changes to load the training and validation sets in `X_train`, `X_valid`, `y_train`, and `y_valid`.  
The test set is loaded in `X_test`.","metadata":{}},{"cell_type":"code","source":"import pandas as pd\nfrom sklearn.model_selection import train_test_split\n\n# Read the data\nX_full = pd.read_csv('../input/train.csv', index_col='Id')\nX_test_full = pd.read_csv('../input/test.csv', index_col='Id')\n\n# Remove rows with missing target, separate target from predictors\nX_full.dropna(axis=0, subset=['SalePrice'], inplace=True)\ny = X_full.SalePrice\nX_full.drop(['SalePrice'], axis=1, inplace=True)\n\n# To keep things simple, we'll use only numerical predictors\nX = X_full.select_dtypes(exclude=['object'])\nX_test = X_test_full.select_dtypes(exclude=['object'])\n\n# Break off validation set from training data\nX_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,\n                                                      random_state=0)","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:31:56.283243Z","iopub.execute_input":"2024-06-16T15:31:56.283651Z","iopub.status.idle":"2024-06-16T15:31:57.093720Z","shell.execute_reply.started":"2024-06-16T15:31:56.283616Z","shell.execute_reply":"2024-06-16T15:31:57.092425Z"},"trusted":true},"execution_count":2,"outputs":[]},{"cell_type":"markdown","source":"Use the next code cell to print the first five rows of the data.","metadata":{}},{"cell_type":"code","source":"X_train.head()","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:32:02.413244Z","iopub.execute_input":"2024-06-16T15:32:02.413651Z","iopub.status.idle":"2024-06-16T15:32:02.447200Z","shell.execute_reply.started":"2024-06-16T15:32:02.413618Z","shell.execute_reply":"2024-06-16T15:32:02.445729Z"},"trusted":true},"execution_count":3,"outputs":[{"execution_count":3,"output_type":"execute_result","data":{"text/plain":"     MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \\\nId                                                                           \n619          20         90.0    11694            
9            5       2007   \n871          20         60.0     6600            5            5       1962   \n93           30         80.0    13360            5            7       1921   \n818          20          NaN    13265            8            5       2002   \n303          20        118.0    13704            7            5       2001   \n\n     YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  ...  GarageArea  \\\nId                                                     ...               \n619          2007       452.0          48           0  ...         774   \n871          1962         0.0           0           0  ...         308   \n93           2006         0.0         713           0  ...         432   \n818          2002       148.0        1218           0  ...         857   \n303          2002       150.0           0           0  ...         843   \n\n     WoodDeckSF  OpenPorchSF  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  \\\nId                                                                              \n619           0          108              0          0          260         0   \n871           0            0              0          0            0         0   \n93            0            0             44          0            0         0   \n818         150           59              0          0            0         0   \n303         468           81              0          0            0         0   \n\n     MiscVal  MoSold  YrSold  \nId                            \n619        0       7    2007  \n871        0       8    2009  \n93         0       8    2009  \n818        0       7    2008  \n303        0       1    2006  \n\n[5 rows x 36 columns]","text/html":"<div>\n<style scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n</style>\n<table border=\"1\" 
class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>MSSubClass</th>\n      <th>LotFrontage</th>\n      <th>LotArea</th>\n      <th>OverallQual</th>\n      <th>OverallCond</th>\n      <th>YearBuilt</th>\n      <th>YearRemodAdd</th>\n      <th>MasVnrArea</th>\n      <th>BsmtFinSF1</th>\n      <th>BsmtFinSF2</th>\n      <th>...</th>\n      <th>GarageArea</th>\n      <th>WoodDeckSF</th>\n      <th>OpenPorchSF</th>\n      <th>EnclosedPorch</th>\n      <th>3SsnPorch</th>\n      <th>ScreenPorch</th>\n      <th>PoolArea</th>\n      <th>MiscVal</th>\n      <th>MoSold</th>\n      <th>YrSold</th>\n    </tr>\n    <tr>\n      <th>Id</th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>619</th>\n      <td>20</td>\n      <td>90.0</td>\n      <td>11694</td>\n      <td>9</td>\n      <td>5</td>\n      <td>2007</td>\n      <td>2007</td>\n      <td>452.0</td>\n      <td>48</td>\n      <td>0</td>\n      <td>...</td>\n      <td>774</td>\n      <td>0</td>\n      <td>108</td>\n      <td>0</td>\n      <td>0</td>\n      <td>260</td>\n      <td>0</td>\n      <td>0</td>\n      <td>7</td>\n      <td>2007</td>\n    </tr>\n    <tr>\n      <th>871</th>\n      <td>20</td>\n      <td>60.0</td>\n      <td>6600</td>\n      <td>5</td>\n      <td>5</td>\n      <td>1962</td>\n      <td>1962</td>\n      <td>0.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>...</td>\n      <td>308</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>8</td>\n      <td>2009</td>\n    </tr>\n    <tr>\n      <th>93</th>\n      
<td>30</td>\n      <td>80.0</td>\n      <td>13360</td>\n      <td>5</td>\n      <td>7</td>\n      <td>1921</td>\n      <td>2006</td>\n      <td>0.0</td>\n      <td>713</td>\n      <td>0</td>\n      <td>...</td>\n      <td>432</td>\n      <td>0</td>\n      <td>0</td>\n      <td>44</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>8</td>\n      <td>2009</td>\n    </tr>\n    <tr>\n      <th>818</th>\n      <td>20</td>\n      <td>NaN</td>\n      <td>13265</td>\n      <td>8</td>\n      <td>5</td>\n      <td>2002</td>\n      <td>2002</td>\n      <td>148.0</td>\n      <td>1218</td>\n      <td>0</td>\n      <td>...</td>\n      <td>857</td>\n      <td>150</td>\n      <td>59</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>7</td>\n      <td>2008</td>\n    </tr>\n    <tr>\n      <th>303</th>\n      <td>20</td>\n      <td>118.0</td>\n      <td>13704</td>\n      <td>7</td>\n      <td>5</td>\n      <td>2001</td>\n      <td>2002</td>\n      <td>150.0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>...</td>\n      <td>843</td>\n      <td>468</td>\n      <td>81</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>0</td>\n      <td>1</td>\n      <td>2006</td>\n    </tr>\n  </tbody>\n</table>\n<p>5 rows × 36 columns</p>\n</div>"},"metadata":{}}]},{"cell_type":"markdown","source":"You can already see a few missing values in the first several rows.  
In the next step, you'll obtain a more comprehensive understanding of the missing values in the dataset.\n\n# Step 1: Preliminary investigation\n\nRun the code cell below without changes.","metadata":{}},{"cell_type":"code","source":"# Shape of training data (num_rows, num_columns)\nprint(X_train.shape)\n\n# Number of missing values in each column of training data\nmissing_val_count_by_column = (X_train.isnull().sum())\nprint(missing_val_count_by_column[missing_val_count_by_column > 0])","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:32:05.059533Z","iopub.execute_input":"2024-06-16T15:32:05.059960Z","iopub.status.idle":"2024-06-16T15:32:05.069203Z","shell.execute_reply.started":"2024-06-16T15:32:05.059916Z","shell.execute_reply":"2024-06-16T15:32:05.067906Z"},"trusted":true},"execution_count":4,"outputs":[{"name":"stdout","text":"(1168, 36)\nLotFrontage    212\nMasVnrArea       6\nGarageYrBlt     58\ndtype: int64\n","output_type":"stream"}]},{"cell_type":"markdown","source":"### Part A\n\nUse the above output to answer the questions below.","metadata":{}},{"cell_type":"code","source":"# Fill in the line below: How many rows are in the training data?\nnum_rows = 1168\n\n# Fill in the line below: How many columns in the training data\n# have missing values?\nnum_cols_with_missing = 3\n\n# Fill in the line below: How many missing entries are contained in \n# all of the training data?\ntot_missing = 212 + 6 + 58\n\n# Check your answers\nstep_1.a.check()","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:35:06.943516Z","iopub.execute_input":"2024-06-16T15:35:06.943998Z","iopub.status.idle":"2024-06-16T15:35:06.955661Z","shell.execute_reply.started":"2024-06-16T15:35:06.943960Z","shell.execute_reply":"2024-06-16T15:35:06.954274Z"},"trusted":true},"execution_count":7,"outputs":[{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.Javascript object>","application/javascript":"parent.postMessage({\"jupyterEvent\": 
\"custom.exercise_interaction\", \"data\": {\"outcomeType\": 1, \"valueTowardsCompletion\": 1.0, \"interactionType\": 1, \"questionType\": 1, \"questionId\": \"1.1_InvestigateEquality\", \"learnToolsVersion\": \"0.3.4\", \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\"}}, \"*\")"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"Correct","text/markdown":"<span style=\"color:#33cc33\">Correct</span>"},"metadata":{}}]},{"cell_type":"markdown","source":"### Lines below will give you a hint or solution code\nstep_1.a.hint()\nstep_1.a.solution()","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:33:38.725822Z","iopub.execute_input":"2024-06-16T15:33:38.726326Z","iopub.status.idle":"2024-06-16T15:33:38.742287Z","shell.execute_reply.started":"2024-06-16T15:33:38.726287Z","shell.execute_reply":"2024-06-16T15:33:38.740899Z"}}},{"cell_type":"code","source":"# step_1.a.hint()\n# step_1.a.solution()","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:34:46.065255Z","iopub.execute_input":"2024-06-16T15:34:46.065674Z","iopub.status.idle":"2024-06-16T15:34:46.081305Z","shell.execute_reply.started":"2024-06-16T15:34:46.065639Z","shell.execute_reply":"2024-06-16T15:34:46.080147Z"},"trusted":true},"execution_count":6,"outputs":[{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.Javascript object>","application/javascript":"parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 2, \"questionType\": 1, \"questionId\": \"1.1_InvestigateEquality\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"Hint: Use the output of `X_train.shape` to get the number of rows and columns in the training data.  
The `missing_val_count_by_column` Series has an entry for each column in the data, and the output above prints the number of missing entries for each column with at least one missing entry.","text/markdown":"<span style=\"color:#3366cc\">Hint:</span> Use the output of `X_train.shape` to get the number of rows and columns in the training data.  The `missing_val_count_by_column` Series has an entry for each column in the data, and the output above prints the number of missing entries for each column with at least one missing entry."},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.Javascript object>","application/javascript":"parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 3, \"questionType\": 1, \"questionId\": \"1.1_InvestigateEquality\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"Solution: \n```python\n# How many rows are in the training data?\nnum_rows = 1168\n\n# How many columns in the training data have missing values?\nnum_cols_with_missing = 3\n\n# How many missing entries are contained in all of the training data?\ntot_missing = 212 + 6 + 58\n\n```","text/markdown":"<span style=\"color:#33cc99\">Solution:</span> \n```python\n# How many rows are in the training data?\nnum_rows = 1168\n\n# How many columns in the training data have missing values?\nnum_cols_with_missing = 3\n\n# How many missing entries are contained in all of the training data?\ntot_missing = 212 + 6 + 58\n\n```"},"metadata":{}}]},{"cell_type":"markdown","source":"### Part B\n\nConsidering your answers above, what do you think is likely the best approach to dealing with the missing values?","metadata":{}},{"cell_type":"code","source":"# Check your answer (Run this code cell to receive credit!)\n
step_1.b.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"step_1.b.hint()","metadata":{"execution":{"iopub.status.busy":"2024-06-16T15:36:28.787037Z","iopub.execute_input":"2024-06-16T15:36:28.787483Z","iopub.status.idle":"2024-06-16T15:36:28.797851Z","shell.execute_reply.started":"2024-06-16T15:36:28.787448Z","shell.execute_reply":"2024-06-16T15:36:28.796511Z"},"trusted":true},"execution_count":8,"outputs":[{"output_type":"display_data","data":{"text/plain":"<IPython.core.display.Javascript object>","application/javascript":"parent.postMessage({\"jupyterEvent\": \"custom.exercise_interaction\", \"data\": {\"interactionType\": 2, \"questionType\": 4, \"questionId\": \"1.2_InvestigateThought\", \"learnToolsVersion\": \"0.3.4\", \"valueTowardsCompletion\": 0.0, \"failureMessage\": \"\", \"exceptionClass\": \"\", \"trace\": \"\", \"outcomeType\": 4}}, \"*\")"},"metadata":{}},{"output_type":"display_data","data":{"text/plain":"Hint: Does the dataset have a lot of missing values, or just a few?  Would we lose much information if we completely ignored the columns with missing entries?","text/markdown":"<span style=\"color:#3366cc\">Hint:</span> Does the dataset have a lot of missing values, or just a few?  Would we lose much information if we completely ignored the columns with missing entries?"},"metadata":{}}]},{"cell_type":"markdown","source":"To compare different approaches to dealing with missing values, you'll use the same `score_dataset()` function from the tutorial.  
This function reports the [mean absolute error](https://en.wikipedia.org/wiki/Mean_absolute_error) (MAE) from a random forest model.","metadata":{}},{"cell_type":"code","source":"from sklearn.ensemble import RandomForestRegressor\nfrom sklearn.metrics import mean_absolute_error\n\n# Function for comparing different approaches\ndef score_dataset(X_train, X_valid, y_train, y_valid):\n    model = RandomForestRegressor(n_estimators=100, random_state=0)\n    model.fit(X_train, y_train)\n    preds = model.predict(X_valid)\n    return mean_absolute_error(y_valid, preds)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Step 2: Drop columns with missing values\n\nIn this step, you'll preprocess the data in `X_train` and `X_valid` to remove columns with missing values.  Set the preprocessed DataFrames to `reduced_X_train` and `reduced_X_valid`, respectively.  ","metadata":{}},{"cell_type":"code","source":"# Fill in the line below: get names of columns with missing values\n____ # Your code here\n\n# Fill in the lines below: drop columns in training and validation data\nreduced_X_train = ____\nreduced_X_valid = ____\n\n# Check your answers\nstep_2.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Lines below will give you a hint or solution code\n#step_2.hint()\n#step_2.solution()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Run the next code cell without changes to obtain the MAE for this approach.","metadata":{}},{"cell_type":"code","source":"print(\"MAE (Drop columns with missing values):\")\nprint(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Step 3: Imputation\n\n### Part A\n\nUse the next code cell to impute missing values with the mean value along each column.  Set the preprocessed DataFrames to `imputed_X_train` and `imputed_X_valid`.  
Make sure that the column names match those in `X_train` and `X_valid`.","metadata":{}},{"cell_type":"code","source":"from sklearn.impute import SimpleImputer\n\n# Fill in the lines below: imputation\n____ # Your code here\nimputed_X_train = ____\nimputed_X_valid = ____\n\n# Fill in the lines below: imputation removed column names; put them back\nimputed_X_train.columns = ____\nimputed_X_valid.columns = ____\n\n# Check your answers\nstep_3.a.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Lines below will give you a hint or solution code\n#step_3.a.hint()\n#step_3.a.solution()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Run the next code cell without changes to obtain the MAE for this approach.","metadata":{}},{"cell_type":"code","source":"print(\"MAE (Imputation):\")\nprint(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Part B\n\nCompare the MAE from each approach.  Does anything surprise you about the results?  Why do you think one approach performed better than the other?","metadata":{}},{"cell_type":"code","source":"# Check your answer (Run this code cell to receive credit!)\nstep_3.b.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#step_3.b.hint()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Step 4: Generate test predictions\n\nIn this final step, you'll use any approach of your choosing to deal with missing values.  Once you've preprocessed the training and validation features, you'll train and evaluate a random forest model.  Then, you'll preprocess the test data before generating predictions that can be submitted to the competition!\n\n### Part A\n\nUse the next code cell to preprocess the training and validation data.  
Set the preprocessed DataFrames to `final_X_train` and `final_X_valid`.  **You can use any approach of your choosing here!**  In order for this step to be marked as correct, you need only ensure:\n- the preprocessed DataFrames have the same number of columns,\n- the preprocessed DataFrames have no missing values, \n- `final_X_train` and `y_train` have the same number of rows, and\n- `final_X_valid` and `y_valid` have the same number of rows.","metadata":{}},{"cell_type":"code","source":"# Preprocessed training and validation features\nfinal_X_train = ____\nfinal_X_valid = ____\n\n# Check your answers\nstep_4.a.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Lines below will give you a hint or solution code\n#step_4.a.hint()\n#step_4.a.solution()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Run the next code cell to train and evaluate a random forest model.  (*Note that we don't use the `score_dataset()` function above, because we will soon use the trained model to generate test predictions!*)","metadata":{}},{"cell_type":"code","source":"# Define and fit model\nmodel = RandomForestRegressor(n_estimators=100, random_state=0)\nmodel.fit(final_X_train, y_train)\n\n# Get validation predictions and MAE\npreds_valid = model.predict(final_X_valid)\nprint(\"MAE (Your approach):\")\nprint(mean_absolute_error(y_valid, preds_valid))","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"### Part B\n\nUse the next code cell to preprocess your test data.  
Make sure that you use a method that agrees with how you preprocessed the training and validation data, and set the preprocessed test features to `final_X_test`.\n\nThen, use the preprocessed test features and the trained model to generate test predictions in `preds_test`.\n\nIn order for this step to be marked correct, you need only ensure:\n- the preprocessed test DataFrame has no missing values, and\n- `final_X_test` has the same number of rows as `X_test`.","metadata":{}},{"cell_type":"code","source":"# Fill in the line below: preprocess test data\nfinal_X_test = ____\n\n# Fill in the line below: get test predictions\npreds_test = ____\n\n# Check your answers\nstep_4.b.check()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Lines below will give you a hint or solution code\n#step_4.b.hint()\n#step_4.b.solution()","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.","metadata":{}},{"cell_type":"code","source":"# Save test predictions to file\noutput = pd.DataFrame({'Id': X_test.index,\n                       'SalePrice': preds_test})\noutput.to_csv('submission.csv', index=False)","metadata":{},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Submit your results\n\nOnce you have successfully completed Step 4, you're ready to submit your results to the leaderboard!  (_You also learned how to do this in the previous exercise.  If you need a reminder of how to do this, please use the instructions below._)  \n\nFirst, you'll need to join the competition if you haven't already.  So open a new window by clicking on [this link](https://www.kaggle.com/c/home-data-for-ml-course).  
Then click on the **Join Competition** button.\n\n![join competition image](https://storage.googleapis.com/kaggle-media/learn/images/wLmFtH3.png)\n\nNext, follow the instructions below:\n1. Begin by clicking on the **Save Version** button in the top right corner of the window.  This will generate a pop-up window.  \n2. Ensure that the **Save and Run All** option is selected, and then click on the **Save** button.\n3. This generates a window in the bottom left corner of the notebook.  After it has finished running, click on the number to the right of the **Save Version** button.  This pulls up a list of versions on the right of the screen.  Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**.  This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.\n4. Click on the **Data** tab near the top of the screen.  Then, click on the file you would like to submit, and click on the **Submit** button to submit your results to the leaderboard.\n\nYou have now successfully submitted to the competition!\n\nIf you want to keep working to improve your performance, select the **Edit** button in the top right of the screen. Then you can change your code and repeat the process. There's a lot of room to improve, and you will climb up the leaderboard as you work.\n\n\n# Keep going\n\nMove on to learn what **[categorical variables](https://www.kaggle.com/alexisbcook/categorical-variables)** are, along with how to incorporate them into your machine learning models.  Categorical variables are very common in real-world data, but you'll get an error if you try to plug them into your models without processing them first!","metadata":{}},{"cell_type":"markdown","source":"---\n\n\n\n\n*Have questions or comments? Visit the [course discussion forum](https://www.kaggle.com/learn/intermediate-machine-learning/discussion) to chat with other learners.*","metadata":{}}]}
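
The two approaches compared in this exercise (dropping columns with missing values, and mean imputation) can be sketched on a toy example. This is only an illustration under stated assumptions: the two small DataFrames are made up and stand in for `X_train` and `X_valid`, and every name ending in `_demo` is hypothetical rather than part of the exercise. It is not the exercise's official solution code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for X_train / X_valid (NOT the competition data)
X_train_demo = pd.DataFrame(
    {
        "LotArea": [8450, 9600, 11250, 9550],
        "LotFrontage": [65.0, np.nan, 68.0, 60.0],
        "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
    },
    index=[1, 2, 3, 4],
)
X_valid_demo = pd.DataFrame(
    {
        "LotArea": [14260, 14115],
        "LotFrontage": [np.nan, 85.0],
        "MasVnrArea": [350.0, 0.0],
    },
    index=[5, 6],
)

# Approach 1: drop every column that has a missing value in the training data.
# The same column list is applied to both frames so their columns stay aligned.
cols_with_missing = [
    col for col in X_train_demo.columns if X_train_demo[col].isnull().any()
]
reduced_train = X_train_demo.drop(columns=cols_with_missing)
reduced_valid = X_valid_demo.drop(columns=cols_with_missing)

# Approach 2: mean imputation. The imputer learns column means from the
# training data only and reuses them on the validation data.
imputer = SimpleImputer(strategy="mean")
imputed_train = pd.DataFrame(
    imputer.fit_transform(X_train_demo),
    columns=X_train_demo.columns,  # imputation returns a bare array; restore names
    index=X_train_demo.index,
)
imputed_valid = pd.DataFrame(
    imputer.transform(X_valid_demo),
    columns=X_valid_demo.columns,
    index=X_valid_demo.index,
)

print(reduced_train.columns.tolist())
print(imputed_valid)
```

Fitting the imputer on the training frame and only calling `transform` on the validation frame keeps validation information out of the preprocessing step. The same fitted imputer could later be applied to test features, which is one way to keep the test preprocessing consistent with the training preprocessing.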