add more info on the datasets

bricaud · bricaud · commit 91fba86f5fcb · 2021-07-13T22:06:05.000+02:00
diff --git a/README.md b/README.md
@@ -17,4 +17,8 @@ There are several possible ways to run the notebooks.
 
 ## BXD data
 
-The dataset for the experiments is open and stored on an server from EPFL. It is a s3 storage at `'endpoint_url':'https://os.unil.cloud.switch.ch'` that can be accessed using the url `'s3://lts2-graphnex/BXDmice/`. See the notebooks for access examples.
+The dataset for the experiments is open and stored on a server from EPFL. It is an S3 storage at `'endpoint_url':'https://os.unil.cloud.switch.ch'` that can be accessed using the URL `'s3://lts2-graphnex/BXDmice/'`. See the notebooks for access examples.
+
+## Dataset description
+
+The dataset contains genomic data, protein expression in different body tissues of the mouse, and the phenotypes of mice over more than 5000 different experiments. The experiments cover a wide range of tests, such as tests related to obesity, insulin, or the expression of a variety of health markers in the mouse.
diff --git a/data_exploration.ipynb b/data_exploration.ipynb
@@ -32,7 +32,7 @@
    },
    "source": [
     "## Genotype\n",
-    "The genotype file contains a list of differences in the genome of the different mice. These differences are at the scale of a nucleotide. In the data table, each row is an `SNP` [Single-nucleotide polymorphism](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism). It can be inherited from one of the initial ancestors or the other. This is encoded as a binary value -1 or 1. The initial ancestors have a zero value."
+    "The genotype file contains a list of differences in the genome of the different mice. These differences are at the scale of a nucleotide. In the data table, each row is an `SNP` ([Single-nucleotide polymorphism](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism)). It can be inherited from one of the initial ancestors or the other. This is encoded as a binary value: -1 if the `SNP` comes from `C57BL.6J` and 1 if it comes from `DBA.2J`. The first descendants of the initial parents, `B6D2F1` and `D2B6F1`, have a zero value since their genome is a perfect mix with one chromosome from each parent (except for a few SNPs). The strains of the second generation of mice are called `BXD*` (B 'black 6' `C57BL.6J` crossed with D `DBA.2J`)."
    ]
   },
   {
@@ -89,9 +89,9 @@
    "source": [
     "## Tissues\n",
     "During or after experiments, the expression of proteins in different tissues of the mice has been measured.\n",
-    "The measurements have been recorded in a file per tissue. The data are in a large table with proteins as rows and mice as columns. The expression is a float number.\n",
+    "The measurements have been recorded in different files and one file corresponds to one tissue. The dataset inside a file is a large table with proteins as rows and mice as columns. The measurements are float numbers.\n",
     "\n",
-    "For each mouse, only a subset of the tissues have been measured. Therefore, not all mice are present in each tissue data and different group of mice are found in the different tissue files."
+    "For each mouse, only a subset of the tissues has been measured. Therefore, not all mice are present in each tissue measurement dataset, and different strains of mice are found in the different tissue files."
    ]
   },
   {
@@ -102,10 +102,13 @@
    "outputs": [],
    "source": [
     "# Tissue\n",
-    "tissue_name = 'Muscle_CD'\n",
-    "#organ = 'Lung'\n",
-    "#organ = 'Hippocampus'\n",
-    "#organ = 'Gastrointestinal'\n",
+    "tissue_name = 'LiverProt_CD'\n",
+    "# Other examples:\n",
+    "#tissue_name = 'Eye'\n",
+    "#tissue_name = 'Muscle_CD'\n",
+    "#tissue_name = 'Hippocampus'\n",
+    "#tissue_name = 'Gastrointestinal'\n",
+    "#tissue_name = 'Lung'\n",
     "tissue_path = os.path.join(s3_path,  'expression data', tissue_name + '.txt.gz')\n",
     "tissue = pd.read_csv(tissue_path, sep='\\t', storage_options=storage_options)\n",
     "print('File {} Opened.'.format(tissue_path))"
@@ -121,13 +124,24 @@
     "tissue.head()"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "366f8b4c-86b6-4510-9c72-e9bc55d8f6c6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Drop the columns (mouse strains) that contain any missing measurement:\n",
+    "tissue.dropna(axis=1).head()"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "a976b942-5433-4725-b9a3-b8593baca597",
    "metadata": {},
    "source": [
     "## Phenotype\n",
-    "The phenotype data corresponds to the results of different experiments. It is made of 2 files, one file contains the results and the other contain the description of the experiment (experiment type, authors,...).\n",
+    "The phenotype data corresponds to the results of different experiments. It is made of 2 files: one file contains the results and the other contains the description of the experiments (experiment type, authors, ...).\n",
     "In the result table, rows correspond to phenotypes and columns to mouse strains. The entries are float numbers. The table contains a large number of missing values as not all the mouse strains have been involved in all the experiments."
    ]
   },
@@ -156,7 +170,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "phenotype.head()"
+    "phenotype.head(10)"
    ]
   },
   {
@@ -176,6 +190,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
+    "# Example of one phenotype:\n",
     "phenotypeinfo[phenotypeinfo['RecordID']==12894]"
    ]
   },