first commit

bricaud · bricaud · commit 3a37a3695874 · 2021-07-09T09:25:40.000+02:00
diff --git a/README.md b/README.md
@@ -0,0 +1,20 @@
+# BXD project
+
+The BXD project aims to reveal and explore the complex relationships between mouse genes, phenotypes, protein expression in tissues, and more. The main goal of this repository is to provide a tutorial on the data and on the machine learning approaches used to investigate it. The code is written in Python as Jupyter notebooks. Each notebook focuses on a particular aspect of the data or a particular ML method.
+
+## List of notebooks
+
+Here is the list of notebooks with a short description.
+
+* `data_exploration.ipynb`: an introduction to the dataset, with a description of the different files and what they contain.
+* `random_forest-genotype-phenotype.ipynb`: a simple implementation of Random Forest to find complex combinations of genes that could influence phenotypes.
+
+## How to run the tutorials
+
+There are several ways to run the notebooks.
+* Clone the repository on a local machine with Python and Jupyter installed, along with several data mining and machine learning modules. The simplest option is to install Anaconda.
+* Run the notebooks online, for example with Binder. Everything runs remotely, with nothing to install on your own machine. However, it may be slower, and you may be disconnected after long periods of inactivity.
+
+## BXD data
+
+The dataset for the experiments is open and stored on a server from EPFL. It is an S3 storage with `'endpoint_url':'https://os.unil.cloud.switch.ch'` that can be accessed using the URL `'s3://lts2-graphnex/BXDmice/'`. See the notebooks for access examples.
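The access pattern described in the README's "BXD data" section can be sketched as below. This is a minimal sketch, not the repository's own helper: the endpoint, bucket path, file names, and `storage_options` values are taken from the notebook in this commit, while the names `bxd_path` and `load_table` are hypothetical and introduced only for illustration. Reading the data additionally assumes that `pandas` and `s3fs` are installed.

```python
import posixpath

# Values copied from the notebook in this commit.
ENDPOINT_URL = 'https://os.unil.cloud.switch.ch'
S3_PATH = 's3://lts2-graphnex/BXDmice/'
STORAGE_OPTIONS = {'anon': True, 'client_kwargs': {'endpoint_url': ENDPOINT_URL}}


def bxd_path(*parts):
    """Build a URL under the BXD bucket (hypothetical helper).

    posixpath.join always uses '/', so the URL is the same on every OS.
    """
    return posixpath.join(S3_PATH, *parts)


def load_table(*parts):
    """Read one tab-separated, gzip-compressed table (hypothetical helper)."""
    import pandas as pd  # imported lazily; needs pandas and s3fs installed
    return pd.read_csv(bxd_path(*parts), sep='\t', storage_options=STORAGE_OPTIONS)


print(bxd_path('genotype_BXD.txt.gz'))
print(bxd_path('expression data', 'Muscle_CD.txt.gz'))
```

With network access, `load_table('genotype_BXD.txt.gz')` would be expected to return the genotype table as a DataFrame.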
diff --git a/data_exploration.ipynb b/data_exploration.ipynb
@@ -0,0 +1,271 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "a7c22dcf-5a2e-48e0-9c84-a11651e9b025",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "5cd30231-1d5b-4ca3-ae3d-728372852aef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Config for accessing the data on the s3 storage\n",
+    "storage_options = {'anon':True, 'client_kwargs':{'endpoint_url':'https://os.unil.cloud.switch.ch'}}\n",
+    "s3_path = 's3://lts2-graphnex/BXDmice/'"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2ab1e6b6-c958-4461-8dc1-f9e173e558e3",
+   "metadata": {
+    "tags": []
+   },
+   "source": [
+    "## Genotype\n",
+    "The genotype file lists the differences between the genomes of the different mice. These differences are at the scale of a single nucleotide. In the data table, each row is an `SNP` ([single-nucleotide polymorphism](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)). Each mouse inherits a given SNP from one of the two initial ancestors or the other, which is encoded as a binary value, -1 or 1. The initial ancestors have a zero value."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "589de0ca-5aae-4025-85d0-836d12de47fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the data\n",
+    "# Genotype\n",
+    "genotype_path = os.path.join(s3_path, 'genotype_BXD.txt.gz')\n",
+    "genotype = pd.read_csv(genotype_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(genotype_path))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7ad624ce-63fe-452e-a612-cc3ea685a3dd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "genotype.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f12c0d44-a09a-451a-a727-c8f7ba7f160d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Gene position in the genome\n",
+    "geno_map_path = os.path.join(s3_path, 'map_BXD.txt.gz')\n",
+    "geno_map = pd.read_csv(geno_map_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(geno_map_path))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9655b2f5-6604-4e38-a15d-f17c79847b80",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "geno_map.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "03dfacf9-da9a-4dbe-a320-58b766316c30",
+   "metadata": {},
+   "source": [
+    "## Tissues\n",
+    "During or after experiments, the expression of proteins in different tissues of the mice has been measured.\n",
+    "The measurements have been recorded in a file per tissue. The data are in a large table with proteins as rows and mice as columns. The expression is a float number.\n",
+    "\n",
+    "For each mouse, only a subset of the tissues has been measured. Therefore, not all mice are present in each tissue file, and different groups of mice are found in the different tissue files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "724eca66-ded1-4cee-a872-db1692a4bdf5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Tissue\n",
+    "tissue_name = 'Muscle_CD'\n",
+    "#tissue_name = 'Lung'\n",
+    "#tissue_name = 'Hippocampus'\n",
+    "#tissue_name = 'Gastrointestinal'\n",
+    "tissue_path = os.path.join(s3_path,  'expression data', tissue_name + '.txt.gz')\n",
+    "tissue = pd.read_csv(tissue_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(tissue_path))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c04eff0b-b998-4d0b-af7f-53b7ca3160f0",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "tissue.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70abc242-b56a-4c5f-b83a-f503ae156445",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Add chromosome and position to each SNP (assumes geno_map and genotype rows share the same order)\n",
+    "df_genotype = genotype.copy()\n",
+    "df_genotype.insert(0, 'Chr', geno_map['Chr'].values)\n",
+    "df_genotype.insert(2, 'Pos', geno_map['Pos'].values)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a976b942-5433-4725-b9a3-b8593baca597",
+   "metadata": {},
+   "source": [
+    "## Phenotype\n",
+    "The phenotype data corresponds to the results of different experiments. It is made of two files: one contains the results and the other contains the description of each experiment (experiment type, authors, ...).\n",
+    "In the result table, rows correspond to phenotypes and columns to mouse strains. The entries are float numbers. The table contains a large number of missing values as not all the mouse strains have been involved in all the experiments."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "70697a87-b2bf-4107-a3bf-441ed7cf6cef",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the data\n",
+    "# Phenotype\n",
+    "phenotype_path = os.path.join(s3_path, 'Phenotype.txt.gz')\n",
+    "phenotype = pd.read_csv(phenotype_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(phenotype_path))\n",
+    "# Phenotype description\n",
+    "phenotypeinfo_path = os.path.join(s3_path, 'phenotypes_id_aligner.txt.gz')\n",
+    "phenotypeinfo = pd.read_csv(phenotypeinfo_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(phenotypeinfo_path))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0fcea224-df5e-4255-9850-766eaf7f6445",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "phenotype.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ae45cbb7-4cc2-4571-9db7-ec852d80444d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "phenotypeinfo.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "411d02c2-fac9-4351-b321-0d6952946ea6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "phenotypeinfo[phenotypeinfo['RecordID']==12894]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3eef0f3f-3550-4b39-91cc-7b0692a1f642",
+   "metadata": {},
+   "source": [
+    "## Data cleaning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fb3c6e29-4ab2-4e9e-8b40-c41ce665c85c",
+   "metadata": {},
+   "source": [
+    "### Drop duplicate SNPs in the dataset\n",
+    "Some rows in the genotype DataFrame are identical, so we drop them to reduce the number of features and the computation time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ce653493-b42d-4952-8256-d5dd91e77371",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Drop duplicate SNPs in the dataset\n",
+    "geno_merge = pd.merge(geno_map, genotype, on='SNP')\n",
+    "print('Size of the data before dropping duplicates',geno_merge.shape)\n",
+    "# define a duplicate SNP as: \n",
+    "# 1) an SNP where all the entries corresponding to BXD mice are identical to another SNP and\n",
+    "# 2) both SNPs are on the same chromosome.\n",
+    "col_to_search_duplicates = ['Chr'] + list(genotype.columns.values[5:])\n",
+    "geno_reduced = geno_merge.drop_duplicates(subset=col_to_search_duplicates)\n",
+    "print('Size of the data after dropping duplicates',geno_reduced.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "60080f9d-022f-4ae1-b330-97478efa39f5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optionally, save the result as a compressed csv file, to be used by other notebooks\n",
+    "geno_reduced.to_csv('geno_reduced.csv.gz')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3b25fbbe-55dc-48e5-a33d-bf613a342b9e",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "venv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
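The duplicate rule used in the notebook's data cleaning cell (same chromosome, identical values across all mice) can be illustrated on a toy table. This is only a sketch: the SNP identifiers, strain column names (`BXD1`, `BXD2`), and values below are made up for illustration and are not taken from the BXD dataset.

```python
import pandas as pd

# Toy genotype table (made-up values, not BXD data):
# one row per SNP, one column per strain, entries -1 or 1.
toy = pd.DataFrame({
    'SNP':  ['rs1', 'rs2', 'rs3', 'rs4'],
    'Chr':  ['1', '1', '2', '2'],
    'BXD1': [-1, -1, -1, 1],
    'BXD2': [1, 1, 1, 1],
})

strain_cols = ['BXD1', 'BXD2']  # hypothetical strain columns

# Two SNPs are duplicates when they are on the same chromosome
# and have identical values for every strain.
reduced = toy.drop_duplicates(subset=['Chr'] + strain_cols)

print('Size before:', toy.shape)
print('Size after: ', reduced.shape)
print(reduced['SNP'].tolist())
```

Here `rs2` is dropped because it matches `rs1` on chromosome 1, while `rs3` is kept: it has the same strain values as `rs1` but lies on a different chromosome.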
diff --git a/random_forest-genotype-phenotype.ipynb b/random_forest-genotype-phenotype.ipynb