Commit 1d5e363

committed
make a start
1 parent 0ca17a8 commit 1d5e363

10 files changed

+7441
-2
lines changed

01_explore_patients.ipynb

+382
@@ -0,0 +1,382 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"colab_type": "text",
7+
"id": "view-in-github"
8+
},
9+
"source": [
10+
"<a href=\"https://colab.research.google.com/github/MIT-LCP/bidmc-datathon/blob/master/01_explore_patients.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
11+
]
12+
},
13+
{
14+
"cell_type": "markdown",
15+
"metadata": {
16+
"colab_type": "text",
17+
"id": "NCI19_Ix7xuI"
18+
},
19+
"source": [
20+
"# eICU Collaborative Research Database\n",
21+
"\n",
22+
"# Notebook 1: Exploring the patient table\n",
23+
"\n",
24+
"The aim of this notebook is to get set up with access to a demo version of the [eICU Collaborative Research Database](http://eicu-crd.mit.edu/). The demo is a subset of the full database, limited to ~1000 patients.\n",
25+
"\n",
26+
"We begin by exploring the `patient` table, which contains patient demographics and admission and discharge details for hospital and ICU stays. For more detail, see: http://eicu-crd.mit.edu/eicutables/patient/"
27+
]
28+
},
29+
{
30+
"cell_type": "markdown",
31+
"metadata": {},
32+
"source": [
33+
"## Prerequisites\n",
34+
"\n",
35+
"- If you do not have a Gmail account, please create one at http://www.gmail.com. \n",
36+
"- If you have not yet signed the data use agreement (DUA) sent by the organizers, please do so now to get access to the dataset."
37+
]
38+
},
39+
{
40+
"cell_type": "markdown",
41+
"metadata": {
42+
"colab_type": "text",
43+
"id": "l_CmlcBu8Wei"
44+
},
45+
"source": [
46+
"## Load libraries and connect to the data\n",
47+
"\n",
48+
"Run the following cells to import some libraries and then connect to the database."
49+
]
50+
},
51+
{
52+
"cell_type": "code",
53+
"execution_count": null,
54+
"metadata": {
55+
"colab": {},
56+
"colab_type": "code",
57+
"id": "3WQsJiAj8B5L"
58+
},
59+
"outputs": [],
60+
"source": [
61+
"# Import libraries\n",
62+
"import numpy as np\n",
63+
"import os\n",
64+
"import pandas as pd\n",
65+
"import matplotlib.pyplot as plt\n",
66+
"import matplotlib.patches as patches\n",
67+
"import matplotlib.path as path\n",
68+
"\n",
69+
"# Make pandas dataframes prettier\n",
70+
"from IPython.display import display, HTML\n",
71+
"\n",
72+
"# Access data using Google BigQuery.\n",
73+
"from google.colab import auth\n",
74+
"from google.cloud import bigquery"
75+
]
76+
},
77+
{
78+
"cell_type": "markdown",
79+
"metadata": {
80+
"colab_type": "text",
81+
"id": "Ld59KZ0W9E4v"
82+
},
83+
"source": [
84+
"Before running any queries, you need to authenticate yourself by running the following cell. The first time you run it, you will be asked to follow a link, log in with your Gmail account, and accept the data access requests. This generates a verification code, which you should paste into the cell below before pressing Enter."
85+
]
86+
},
87+
{
88+
"cell_type": "code",
89+
"execution_count": null,
90+
"metadata": {
91+
"colab": {},
92+
"colab_type": "code",
93+
"id": "ABh4hMt288yg"
94+
},
95+
"outputs": [],
96+
"source": [
97+
"auth.authenticate_user()"
98+
]
99+
},
100+
{
101+
"cell_type": "markdown",
102+
"metadata": {
103+
"colab_type": "text",
104+
"id": "BPoHP2a8_eni"
105+
},
106+
"source": [
107+
"We'll also set the project details."
108+
]
109+
},
110+
{
111+
"cell_type": "code",
112+
"execution_count": null,
113+
"metadata": {
114+
"colab": {},
115+
"colab_type": "code",
116+
"id": "P0fdtVMa_di9"
117+
},
118+
"outputs": [],
119+
"source": [
120+
"project_id='bidmc-datathon'\n",
121+
"os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id"
122+
]
123+
},
124+
{
125+
"cell_type": "markdown",
126+
"metadata": {
127+
"colab_type": "text",
128+
"id": "5bHZALFP9VN1"
129+
},
130+
"source": [
131+
"# \"Querying\" our database with SQL\n",
132+
"\n",
133+
"Now we can start exploring the data. We'll begin by running a simple query to load all columns of the `patient` table to a Pandas DataFrame. The query is written in SQL, a common language for extracting data from databases. The structure of an SQL query is:\n",
134+
"\n",
135+
"```sql\n",
136+
"SELECT <columns>\n",
137+
"FROM <table>\n",
138+
"WHERE <criteria, optional>\n",
139+
"```\n",
140+
"\n",
141+
"`*` is a wildcard that selects all columns."
142+
]
143+
},
144+
{
145+
"cell_type": "markdown",
146+
"metadata": {},
147+
"source": [
148+
"# BigQuery\n",
149+
"\n",
150+
"Our dataset is stored on BigQuery, Google's database engine. We can run our query on the database using some special (\"magic\") [BigQuery syntax](https://googleapis.dev/python/bigquery/latest/magics.html)."
151+
]
152+
},
153+
{
154+
"cell_type": "code",
155+
"execution_count": null,
156+
"metadata": {
157+
"colab": {},
158+
"colab_type": "code",
159+
"id": "RE-UZAPG_rHq"
160+
},
161+
"outputs": [],
162+
"source": [
163+
"%%bigquery patient\n",
164+
"\n",
165+
"SELECT *\n",
166+
"FROM `physionet-data.eicu_crd_demo.patient`"
167+
]
168+
},
169+
{
170+
"cell_type": "markdown",
171+
"metadata": {
172+
"colab_type": "text",
173+
"id": "YbnkcCZxBkdK"
174+
},
175+
"source": [
176+
"We have now assigned the output of our query to a variable called `patient`. Let's use the `head` method to view the first few rows of our data."
177+
]
178+
},
179+
{
180+
"cell_type": "code",
181+
"execution_count": null,
182+
"metadata": {
183+
"colab": {},
184+
"colab_type": "code",
185+
"id": "GZph0FPDASEs"
186+
},
187+
"outputs": [],
188+
"source": [
189+
"# view the top few rows of the patient data\n",
190+
"patient.head()"
191+
]
192+
},
193+
{
194+
"cell_type": "markdown",
195+
"metadata": {
196+
"colab_type": "text",
197+
"id": "TlxaXLevC_Rz"
198+
},
199+
"source": [
200+
"## Questions\n",
201+
"\n",
202+
"- What does `patientunitstayid` represent? (hint, see: http://eicu-crd.mit.edu/eicutables/patient/)\n",
203+
"- What does `patienthealthsystemstayid` represent?\n",
204+
"- What does `uniquepid` represent?"
205+
]
206+
},
207+
{
208+
"cell_type": "code",
209+
"execution_count": null,
210+
"metadata": {
211+
"colab": {},
212+
"colab_type": "code",
213+
"id": "2rLY0WyCBzp9"
214+
},
215+
"outputs": [],
216+
"source": [
217+
"# select a limited number of columns to view\n",
218+
"columns = ['uniquepid', 'patientunitstayid','gender','age','unitdischargestatus']\n",
219+
"patient[columns].head()"
220+
]
221+
},
222+
{
223+
"cell_type": "markdown",
224+
"metadata": {
225+
"colab_type": "text",
226+
"id": "FSdS2hS4EWtb"
227+
},
228+
"source": [
229+
"- Try running the following query, which lists unique values in the age column. What do you notice?"
230+
]
231+
},
232+
{
233+
"cell_type": "code",
234+
"execution_count": null,
235+
"metadata": {
236+
"colab": {},
237+
"colab_type": "code",
238+
"id": "0Aom69ftDxBN"
239+
},
240+
"outputs": [],
241+
"source": [
242+
"# what are the unique values for age?\n",
243+
"age_col = 'age'\n",
244+
"patient[age_col].sort_values().unique()"
245+
]
246+
},
247+
{
248+
"cell_type": "markdown",
249+
"metadata": {
250+
"colab_type": "text",
251+
"id": "Y_qJL94jE0k8"
252+
},
253+
"source": [
254+
"- Try plotting a histogram of ages using the command in the cell below. What happens? Why?"
255+
]
256+
},
257+
{
258+
"cell_type": "code",
259+
"execution_count": null,
260+
"metadata": {
261+
"colab": {},
262+
"colab_type": "code",
263+
"id": "1zad3Gr4D4LE"
264+
},
265+
"outputs": [],
266+
"source": [
267+
"# try plotting a histogram of ages\n",
268+
"patient[age_col].plot(kind='hist', bins=15)"
269+
]
270+
},
271+
{
272+
"cell_type": "markdown",
273+
"metadata": {
274+
"colab_type": "text",
275+
"id": "xIdwVEEPF25H"
276+
},
277+
"source": [
278+
"Let's create a new column named `age_num`, then try again."
279+
]
280+
},
281+
{
282+
"cell_type": "code",
283+
"execution_count": null,
284+
"metadata": {
285+
"colab": {},
286+
"colab_type": "code",
287+
"id": "-rwc-28oFF6R"
288+
},
289+
"outputs": [],
290+
"source": [
291+
"# create a column containing numerical ages\n",
292+
"# errors='coerce' sets values that cannot be parsed as numbers to NaN\n",
293+
"agenum_col = 'age_num'\n",
294+
"patient[agenum_col] = pd.to_numeric(patient[age_col], errors='coerce')\n",
295+
"patient[agenum_col].sort_values().unique()"
296+
]
297+
},
298+
{
299+
"cell_type": "code",
300+
"execution_count": null,
301+
"metadata": {
302+
"colab": {},
303+
"colab_type": "code",
304+
"id": "uTFMqqWqFMjG"
305+
},
306+
"outputs": [],
307+
"source": [
308+
"patient[agenum_col].plot(kind='hist', bins=15)"
309+
]
310+
},
311+
{
312+
"cell_type": "markdown",
313+
"metadata": {
314+
"colab_type": "text",
315+
"id": "FrbR8rV3GlR1"
316+
},
317+
"source": [
318+
"## Questions\n",
319+
"\n",
320+
"- Use the `mean()` method to find the average age. Why do we expect this to be lower than the true mean?\n",
321+
"- In the same way that you use `mean()`, you can use `describe()`, `max()`, and `min()`. Look at the admission heights (`admissionheight`) of patients in cm. What issue do you see? How can you deal with this issue?"
322+
]
323+
},
324+
{
325+
"cell_type": "code",
326+
"execution_count": null,
327+
"metadata": {
328+
"colab": {},
329+
"colab_type": "code",
330+
"id": "TPps13DZG6Ac"
331+
},
332+
"outputs": [],
333+
"source": [
334+
"adheight_col = 'admissionheight'\n",
335+
"patient[adheight_col].describe()"
336+
]
337+
},
338+
{
339+
"cell_type": "code",
340+
"execution_count": null,
341+
"metadata": {
342+
"colab": {},
343+
"colab_type": "code",
344+
"id": "9jhV9xQoGRJq"
345+
},
346+
"outputs": [],
347+
"source": [
348+
"# treat implausibly small heights (below 10 cm) as missing\n",
349+
"adheight_col = 'admissionheight'\n",
350+
"patient.loc[patient[adheight_col] < 10, adheight_col] = np.nan"
351+
]
352+
}
353+
],
354+
"metadata": {
355+
"colab": {
356+
"collapsed_sections": [],
357+
"include_colab_link": true,
358+
"name": "01-explore-patient-table",
359+
"provenance": [],
360+
"version": "0.3.2"
361+
},
362+
"kernelspec": {
363+
"display_name": "Python 3",
364+
"language": "python",
365+
"name": "python3"
366+
},
367+
"language_info": {
368+
"codemirror_mode": {
369+
"name": "ipython",
370+
"version": 3
371+
},
372+
"file_extension": ".py",
373+
"mimetype": "text/x-python",
374+
"name": "python",
375+
"nbconvert_exporter": "python",
376+
"pygments_lexer": "ipython3",
377+
"version": "3.7.4"
378+
}
379+
},
380+
"nbformat": 4,
381+
"nbformat_minor": 1
382+
}
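
The `SELECT` / `FROM` / `WHERE` structure described in the notebook's SQL cell can be tried without BigQuery access. The following is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for BigQuery. The table contents are hypothetical, only a handful of the real `patient` columns are included, and BigQuery's SQL dialect differs in some details (for example, backtick-quoted table names).

```python
import sqlite3

# Hypothetical rows for illustration only; the real `patient` table
# has many more columns and these values are made up.
rows = [
    (1, "Female", "45", "Alive"),
    (2, "Male", "> 89", "Expired"),
    (3, "Male", "67", "Alive"),
]

# Build a small in-memory table that mimics a few `patient` columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient ("
    "patientunitstayid INTEGER, gender TEXT, age TEXT, unitdischargestatus TEXT)"
)
conn.executemany("INSERT INTO patient VALUES (?, ?, ?, ?)", rows)

# SELECT <columns> FROM <table> WHERE <criteria>
query = """
SELECT patientunitstayid, age
FROM patient
WHERE gender = 'Male'
ORDER BY patientunitstayid
"""
male_patients = conn.execute(query).fetchall()

# `*` selects every column; with no WHERE clause, every row is returned.
all_patients = conn.execute("SELECT * FROM patient").fetchall()

print(male_patients)
print(len(all_patients))
conn.close()
```

The `WHERE` clause narrows the result to matching rows, while the column list after `SELECT` narrows it to the named columns. Dropping both restrictions (`SELECT *` with no `WHERE`) returns the whole table, as in the notebook's BigQuery cell.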
