MIT-LCP
diff --git a/‎01_explore_patients.ipynb
+382 b/‎01_explore_patients.ipynb
+382
@@ -0,0 +1,382 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "view-in-github"
+   },
+   "source": [
+    "<a href=\"https://colab.research.google.com/github/MIT-LCP/bidmc-datathon/blob/master/01_explore_patients.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "NCI19_Ix7xuI"
+   },
+   "source": [
+    "# eICU Collaborative Research Database\n",
+    "\n",
+    "# Notebook 1: Exploring the patient table\n",
+    "\n",
+    "The aim of this notebook is to get set up with access to a demo version of the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The demo is a subset of the full database, limited to ~1000 patients.\n",
+    "\n",
+    "We begin by exploring the `patient` table, which contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Prerequisites\n",
+    "\n",
+    "- If you do not have a Gmail account, please create one at http://www.gmail.com. \n",
+    "- If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so now to get access to the dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "l_CmlcBu8Wei"
+   },
+   "source": [
+    "## Load libraries and connect to the data\n",
+    "\n",
+    "Run the following cells to import some libraries and then connect to the database."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "3WQsJiAj8B5L"
+   },
+   "outputs": [],
+   "source": [
+    "# Import libraries\n",
+    "import numpy as np\n",
+    "import os\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "import matplotlib.patches as patches\n",
+    "import matplotlib.path as path\n",
+    "\n",
+    "# Make pandas dataframes prettier\n",
+    "from IPython.display import display, HTML\n",
+    "\n",
+    "# Access data using Google BigQuery.\n",
+    "from google.colab import auth\n",
+    "from google.cloud import bigquery"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Ld59KZ0W9E4v"
+   },
+   "source": [
+    "Before running any queries, you need to first authenticate yourself by running the following cell. If you are running it for the first time, it will ask you to follow a link to log in using your Gmail account, and accept the data access requests to your profile. Once this is done, it will generate a string of verification code, which you should paste back to the cell below and press enter."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "ABh4hMt288yg"
+   },
+   "outputs": [],
+   "source": [
+    "auth.authenticate_user()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "BPoHP2a8_eni"
+   },
+   "source": [
+    "We'll also set the project details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "P0fdtVMa_di9"
+   },
+   "outputs": [],
+   "source": [
+    "project_id='bidmc-datathon'\n",
+    "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "5bHZALFP9VN1"
+   },
+   "source": [
+    "# \"Querying\" our database with SQL\n",
+    "\n",
+    "Now we can start exploring the data. We'll begin by running a simple query to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:\n",
+    "\n",
+    "```sql\n",
+    "SELECT <columns>\n",
+    "FROM <table>\n",
+    "WHERE <criteria, optional>\n",
+    "```\n",
+    "\n",
+    "`*` is a wildcard that indicates all columns"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# BigQuery\n",
+    "\n",
+    "Our dataset is stored on BigQuery, Google's database engine. We can run our query on the database using some special (\"magic\") [BigQuery syntax](https://googleapis.dev/python/bigquery/latest/magics.html)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "RE-UZAPG_rHq"
+   },
+   "outputs": [],
+   "source": [
+    "%%bigquery patient\n",
+    "\n",
+    "SELECT *\n",
+    "FROM `physionet-data.eicu_crd_demo.patient`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "YbnkcCZxBkdK"
+   },
+   "source": [
+    "We have now assigned the output to our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "GZph0FPDASEs"
+   },
+   "outputs": [],
+   "source": [
+    "# view the top few rows of the patient data\n",
+    "patient.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "TlxaXLevC_Rz"
+   },
+   "source": [
+    "## Questions\n",
+    "\n",
+    "- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n",
+    "- What does `patienthealthsystemstayid` represent?\n",
+    "- What does `uniquepid` represent?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "2rLY0WyCBzp9"
+   },
+   "outputs": [],
+   "source": [
+    "# select a limited number of columns to view\n",
+    "columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n",
+    "patient[columns].head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "FSdS2hS4EWtb"
+   },
+   "source": [
+    "- Try running the following query, which lists unique values in the age column. What do you notice?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "0Aom69ftDxBN"
+   },
+   "outputs": [],
+   "source": [
+    "# what are the unique values for age?\n",
+    "age_col = 'age'\n",
+    "patient[age_col].sort_values().unique()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Y_qJL94jE0k8"
+   },
+   "source": [
+    "- Try plotting a histogram of ages using the command in the cell below. What happens? Why?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "1zad3Gr4D4LE"
+   },
+   "outputs": [],
+   "source": [
+    "# try plotting a histogram of ages\n",
+    "patient[age_col].plot(kind='hist', bins=15)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "xIdwVEEPF25H"
+   },
+   "source": [
+    "Let's create a new column named `age_num`, then try again."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "-rwc-28oFF6R"
+   },
+   "outputs": [],
+   "source": [
+    "# create a column containing numerical ages\n",
+    "# If ‘coerce’, then invalid parsing will be set as NaN\n",
+    "agenum_col = 'age_num'\n",
+    "patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n",
+    "patient[agenum_col].sort_values().unique()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "uTFMqqWqFMjG"
+   },
+   "outputs": [],
+   "source": [
+    "patient[agenum_col].plot(kind='hist', bins=15)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "FrbR8rV3GlR1"
+   },
+   "source": [
+    "## Questions\n",
+    "\n",
+    "- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n",
+    "- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "TPps13DZG6Ac"
+   },
+   "outputs": [],
+   "source": [
+    "adheight_col = 'admissionheight'\n",
+    "patient[adheight_col].describe()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "colab": {},
+    "colab_type": "code",
+    "id": "9jhV9xQoGRJq"
+   },
+   "outputs": [],
+   "source": [
+    "# set threshold\n",
+    "adheight_col = 'admissionheight'\n",
+    "patient[patient[adheight_col] < 10] = None"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "collapsed_sections": [],
+   "include_colab_link": true,
+   "name": "01-explore-patient-table",
+   "provenance": [],
+   "version": "0.3.2"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.7.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}