|
32 | 32 | },
|
33 | 33 | "source": [
|
34 | 34 | "## Genotype\n",
|
35 |
| - "The genotype file contains a list of differences in the genome of the different mice. These differences are at the scale of a nucleotide. In the data table, each row is an `SNP` [Single-nucleotide polymorphism](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism). It can be inherited from one of the initial ancestors or the other. This is encoded as a binary value -1 or 1. The initial ancestors have a zero value." |
| 35 | + "The genotype file contains a list of differences in the genome of the different mice. These differences are at the scale of a nucleotide. In the data table, each row is an `SNP` [Single-nucleotide polymorphism](https://en.wikipedia.org/wiki/Single-nucleotide_polymorphism). Each `SNP` is inherited from one of the two initial ancestors. This is encoded as a binary value: -1 if the `SNP` comes from `C57BL.6J` and 1 if it comes from `DBA.2J`. The first descendants of the initial parents, `B6D2F1` and `D2B6F1`, have a zero value since their genome is an even mix with one chromosome from each parent (except for a few SNPs). The strains of the second generation of mice are called `BXD*` (B 'black 6' `C57BL.6J` crossed with D `DBA.2J`)." |
36 | 36 | ]
|
37 | 37 | },
|
38 | 38 | {
|
|
89 | 89 | "source": [
|
90 | 90 | "## Tissues\n",
|
91 | 91 | "During or after experiments, the expression of proteins in different tissues of the mice has been measured.\n",
|
92 |
| - "The measurements have been recorded in a file per tissue. The data are in a large table with proteins as rows and mice as columns. The expression is a float number.\n", |
| 92 | + "The measurements have been recorded in different files and one file corresponds to one tissue. The dataset inside a file is a large table with proteins as rows and mice as columns. The measurements are float numbers.\n", |
93 | 93 | "\n",
|
94 |
| - "For each mouse, only a subset of the tissues have been measured. Therefore, not all mice are present in each tissue data and different group of mice are found in the different tissue files." |
| 94 | + "For each mouse, only a subset of the tissues has been measured. Therefore, not all mice are present in each tissue measurement dataset, and different strains of mice are found in the different tissue files." |
95 | 95 | ]
|
96 | 96 | },
|
97 | 97 | {
|
|
102 | 102 | "outputs": [],
|
103 | 103 | "source": [
|
104 | 104 | "# Tissue\n",
|
105 |
| - "tissue_name = 'Muscle_CD'\n", |
106 |
| - "#organ = 'Lung'\n", |
107 |
| - "#organ = 'Hippocampus'\n", |
108 |
| - "#organ = 'Gastrointestinal'\n", |
| 105 | + "tissue_name = 'LiverProt_CD'\n", |
| 106 | + "# Other examples:\n", |
| 107 | + "#tissue_name = 'Eye'\n", |
| 108 | + "#tissue_name = 'Muscle_CD'\n", |
| 109 | + "#tissue_name = 'Hippocampus'\n", |
| 110 | + "#tissue_name = 'Gastrointestinal'\n", |
| 111 | + "#tissue_name = 'Lung'\n", |
109 | 112 | "tissue_path = os.path.join(s3_path, 'expression data', tissue_name + '.txt.gz')\n",
|
110 | 113 | "tissue = pd.read_csv(tissue_path, sep='\\t', storage_options=storage_options)\n",
|
111 | 114 | "print('File {} Opened.'.format(tissue_path))"
|
|
121 | 124 | "tissue.head()"
|
122 | 125 | ]
|
123 | 126 | },
|
| 127 | + { |
| 128 | + "cell_type": "code", |
| 129 | + "execution_count": null, |
| 130 | + "id": "366f8b4c-86b6-4510-9c72-e9bc55d8f6c6", |
| 131 | + "metadata": {}, |
| 132 | + "outputs": [], |
| 133 | + "source": [ |
| 134 | + "# Remove the columns (mouse strains) with missing measurements:\n", |
| 135 | + "tissue.dropna(axis=1).head()" |
| 136 | + ] |
| 137 | + }, |
124 | 138 | {
|
125 | 139 | "cell_type": "markdown",
|
126 | 140 | "id": "a976b942-5433-4725-b9a3-b8593baca597",
|
127 | 141 | "metadata": {},
|
128 | 142 | "source": [
|
129 | 143 | "## Phenotype\n",
|
130 |
| - "The phenotype data corresponds to the results of different experiments. It is made of 2 files, one file contains the results and the other contain the description of the experiment (experiment type, authors,...).\n", |
| 144 | + "The phenotype data corresponds to the results of different experiments. It is made of 2 files, one file contains the results and the other contains the description of the experiments (experiment type, authors,...).\n", |
131 | 145 | "In the result table, rows correspond to phenotypes and columns to mouse strains. The entries are float numbers. The table contains a large number of missing values as not all the mouse strains have been involved in all the experiments."
|
132 | 146 | ]
|
133 | 147 | },
|
|
156 | 170 | "metadata": {},
|
157 | 171 | "outputs": [],
|
158 | 172 | "source": [
|
159 |
| - "phenotype.head()" |
| 173 | + "phenotype.head(10)" |
160 | 174 | ]
|
161 | 175 | },
|
162 | 176 | {
|
|
176 | 190 | "metadata": {},
|
177 | 191 | "outputs": [],
|
178 | 192 | "source": [
|
| 193 | + "# Example of one phenotype:\n", |
179 | 194 | "phenotypeinfo[phenotypeinfo['RecordID']==12894]"
|
180 | 195 | ]
|
181 | 196 | },
|
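As a rough illustration of the genotype encoding and the `dropna` cell touched by this diff, here is a minimal sketch on toy data. The tables, SNP and protein names, and values below are hypothetical and are not taken from the real files; only the -1 / 0 / 1 encoding and the strain names `C57BL.6J`, `DBA.2J`, `B6D2F1` and `BXD*` come from the notebook text.

```python
import numpy as np
import pandas as pd

# Hypothetical toy genotype table: rows are SNPs, columns are strains.
# -1 = inherited from C57BL.6J, 1 = inherited from DBA.2J, 0 = first-generation mix.
genotype = pd.DataFrame(
    {
        "C57BL.6J": [-1, -1, -1],
        "DBA.2J": [1, 1, 1],
        "B6D2F1": [0, 0, 0],
        "BXD1": [-1, 1, 1],
    },
    index=["snp_a", "snp_b", "snp_c"],
)

# Fraction of SNPs inherited from DBA.2J (value 1) for each strain.
dba_fraction = (genotype == 1).mean(axis=0)

# Hypothetical toy tissue table: rows are proteins, columns are strains.
# NaN marks a missing measurement.
tissue = pd.DataFrame(
    {
        "BXD1": [0.5, 1.2],
        "BXD2": [np.nan, np.nan],
        "BXD3": [0.7, np.nan],
    },
    index=["prot_a", "prot_b"],
)

# Default how="any": a column is dropped as soon as it holds one NaN.
any_missing_dropped = tissue.dropna(axis=1)

# how="all": a column is dropped only if every value is NaN.
all_missing_dropped = tissue.dropna(axis=1, how="all")

print(dba_fraction)
print(list(any_missing_dropped.columns))
print(list(all_missing_dropped.columns))
```

Note that `tissue.dropna(axis=1)` uses `how="any"` by default, so it also removes strains that are only partially measured. If the intent is to drop only the strains with no measurement at all, `how="all"` may be the better fit.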
|