Commit 0cec506

correct small bug and adapt to tissue files
1 parent b5469ae commit 0cec506

2 files changed: +257 -9 lines changed

genotype-graph.ipynb

+2 -9
Original file line number / Diff line number / Diff line change
@@ -62,14 +62,7 @@
6262
"genotype_path = os.path.join(s3_path, 'geno_reduced.csv.gz')\n",
6363
"#genotype_path = os.path.join(s3_path, 'genotype_BXD.csv.gz')\n",
6464
"genotype = pd.read_csv(genotype_path, storage_options=storage_options)\n",
65-
"print('File {} Opened.'.format(genotype_path))\n",
66-
"phenotype_path = os.path.join(s3_path, 'Phenotype.txt.gz')\n",
67-
"phenotype = pd.read_csv(phenotype_path, sep='\\t', storage_options=storage_options)\n",
68-
"print('File {} Opened.'.format(phenotype_path))\n",
69-
"# Phenotype description\n",
70-
"phenotypeinfo_path = os.path.join(s3_path, 'phenotypes_id_aligner.txt.gz')\n",
71-
"phenotypeinfo = pd.read_csv(phenotypeinfo_path, sep='\\t', storage_options=storage_options)\n",
72-
"print('File {} Opened.'.format(phenotypeinfo_path))"
65+
"print('File {} Opened.'.format(genotype_path))"
7366
]
7467
},
7568
{
@@ -128,7 +121,7 @@
128121
" sigma = 1\n",
129122
" return np.exp(- sigma * d)\n",
130123
" \n",
131-
"M = csr_matrix(geno_knn.shape)\n",
124+
"M = geno_knn.copy()\n",
132125
"M.data = distance2weight(geno_knn.data)\n",
133126
"\n",
134127
"print('A distance of 1 becomes a weight of {}.'.format(str(distance2weight(1))))"

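The one-line change above (`csr_matrix(geno_knn.shape)` replaced by `geno_knn.copy()`) is presumably the "small bug" from the commit message. A minimal sketch of why the copy matters, assuming a toy 3x3 distance matrix as a stand-in for `geno_knn` (the values are made up for illustration): an empty `csr_matrix` of the right shape has no stored entries, so it has no sparsity pattern to attach the weights to, whereas a copy keeps the neighbor structure and only its values are overwritten.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for geno_knn: a sparse matrix of kNN distances (made-up values).
knn = csr_matrix(np.array([[0.0, 1.0, 0.0],
                           [1.0, 0.0, 2.0],
                           [0.0, 2.0, 0.0]]))

def distance2weight(d, sigma=1):
    # Same kernel as in the notebook: weight decays exponentially with distance.
    return np.exp(-sigma * d)

# Old approach: a matrix of the right shape but with no stored entries.
empty = csr_matrix(knn.shape)

# New approach: copy the sparsity pattern, then overwrite only the stored values.
M = knn.copy()
M.data = distance2weight(knn.data)

print('stored entries in empty matrix:', empty.nnz)
print('stored entries in copied matrix:', M.nnz)
```

Because `copy()` returns an independent matrix, `knn` still holds the original distances afterwards.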
tissue-graph.ipynb

+255
Original file line number / Diff line number / Diff line change
@@ -0,0 +1,255 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Graph of gene expression similarity inside a tissue\n",
8+
"The goal here is to build a similarity graph between gene expression profiles. The approach is the same as for the genotype graph; see the \"genotype graph\" notebook for more info.\n",
9+
"\n",
10+
"In this notebook, proteins (or gene expression levels) are the nodes of the graph. Each node is connected to its k nearest neighbors. The connections are weighted by the similarity between two protein expression profiles according to a chosen distance. Each protein is associated with a vector encoding its variations over the BXD mouse dataset. Two proteins are similar if their vectors are close in terms of Euclidean distance."
11+
]
12+
},
13+
{
14+
"cell_type": "code",
15+
"execution_count": null,
16+
"metadata": {},
17+
"outputs": [],
18+
"source": [
19+
"import pandas as pd\n",
20+
"import numpy as np\n",
21+
"from scipy.sparse import csr_matrix\n",
22+
"import os"
23+
]
24+
},
25+
{
26+
"cell_type": "code",
27+
"execution_count": null,
28+
"metadata": {},
29+
"outputs": [],
30+
"source": [
31+
"import networkx as nx\n",
32+
"import sklearn.metrics\n",
33+
"import sklearn.neighbors\n",
34+
"import matplotlib.pyplot as plt"
35+
]
36+
},
37+
{
38+
"cell_type": "markdown",
39+
"metadata": {},
40+
"source": [
41+
"## Importing the data"
42+
]
43+
},
44+
{
45+
"cell_type": "code",
46+
"execution_count": null,
47+
"metadata": {},
48+
"outputs": [],
49+
"source": [
50+
"# Config for accessing the data on the s3 storage\n",
51+
"storage_options = {'anon':True, 'client_kwargs':{'endpoint_url':'https://os.unil.cloud.switch.ch'}}\n",
52+
"s3_path = 's3://lts2-graphnex/BXDmice/'"
53+
]
54+
},
55+
{
56+
"cell_type": "code",
57+
"execution_count": null,
58+
"metadata": {},
59+
"outputs": [],
60+
"source": [
61+
"# Load the data\n",
62+
"# Tissue\n",
63+
"tissue_name = 'LiverProt_CD'\n",
64+
"# Other examples:\n",
65+
"#tissue_name = 'Eye'\n",
66+
"#tissue_name = 'Muscle_CD'\n",
67+
"#tissue_name = 'Hippocampus'\n",
68+
"#tissue_name = 'Gastrointestinal'\n",
69+
"#tissue_name = 'Lung'\n",
70+
"tissue_path = os.path.join(s3_path, 'expression data', tissue_name + '.txt.gz')\n",
71+
"tissue = pd.read_csv(tissue_path, sep='\\t', storage_options=storage_options)\n",
72+
"print('File {} Opened.'.format(tissue_path))"
73+
]
74+
},
75+
{
76+
"cell_type": "markdown",
77+
"metadata": {},
78+
"source": [
79+
"## Computing the distances"
80+
]
81+
},
82+
{
83+
"cell_type": "code",
84+
"execution_count": null,
85+
"metadata": {},
86+
"outputs": [],
87+
"source": [
88+
"# Remove the columns (mouse strains) that contain missing measurements:\n",
89+
"tissue = tissue.dropna(axis=1)\n",
90+
"# Extract the data as a numpy array (drop the first two columns)\n",
91+
"tissue_values = tissue.iloc[:,2:].values"
92+
]
93+
},
94+
{
95+
"cell_type": "markdown",
96+
"metadata": {},
97+
"source": [
98+
"### Normalizing\n",
99+
"If the data are not normalized, the graph of gene expression may capture only similar concentration levels rather than correlated expression."
100+
]
101+
},
102+
{
103+
"cell_type": "code",
104+
"execution_count": null,
105+
"metadata": {},
106+
"outputs": [],
107+
"source": [
108+
"from sklearn.preprocessing import normalize\n",
109+
"tissue_values = normalize(tissue_values, norm='l2', axis=1)\n",
110+
"tissue_values.shape"
111+
]
112+
},
113+
{
114+
"cell_type": "code",
115+
"execution_count": null,
116+
"metadata": {},
117+
"outputs": [],
118+
"source": [
119+
"# Default distance is Euclidean\n",
120+
"num_neighbors = 4\n",
121+
"tissue_knn = sklearn.neighbors.kneighbors_graph(tissue_values, num_neighbors, mode='distance')\n",
122+
"# Optionally, one can use the following function to compute all the distances:\n",
123+
"#tissue_distances = sklearn.metrics.pairwise_distances(tissue_values)"
124+
]
125+
},
126+
{
127+
"cell_type": "code",
128+
"execution_count": null,
129+
"metadata": {},
130+
"outputs": [],
131+
"source": [
132+
"# Distribution of distances\n",
133+
"plt.hist(tissue_knn.data, bins=50)\n",
134+
"plt.title('Distribution of distances')\n",
135+
"plt.xlabel('Distance')\n",
136+
"plt.ylabel('Nb of edges')\n",
137+
"plt.show()"
138+
]
139+
},
140+
{
141+
"cell_type": "code",
142+
"execution_count": null,
143+
"metadata": {},
144+
"outputs": [],
145+
"source": [
146+
"# Distance to weight\n",
147+
"# Modify the non-zero values to turn them into weights instead of distances\n",
148+
"def distance2weight(d):\n",
149+
" sigma = 1\n",
150+
" return np.exp(- sigma * d)\n",
151+
" \n",
152+
"M = tissue_knn.copy()  # copy keeps the sparsity pattern; only the values are replaced\n",
153+
"M.data = distance2weight(tissue_knn.data)\n",
154+
"\n",
155+
"print('A distance of 1 becomes a weight of {}.'.format(str(distance2weight(1))))"
156+
]
157+
},
158+
{
159+
"cell_type": "code",
160+
"execution_count": null,
161+
"metadata": {},
162+
"outputs": [],
163+
"source": [
164+
"# Distribution of weights\n",
165+
"plt.hist(M.data, bins=20)\n",
166+
"plt.title('Distribution of weights')\n",
167+
"plt.xlabel('Weight value')\n",
168+
"plt.ylabel('Nb of edges')\n",
169+
"plt.show()"
170+
]
171+
},
172+
{
173+
"cell_type": "markdown",
174+
"metadata": {},
175+
"source": [
176+
"## Building the graph"
177+
]
178+
},
179+
{
180+
"cell_type": "code",
181+
"execution_count": null,
182+
"metadata": {},
183+
"outputs": [],
184+
"source": [
185+
"G = nx.from_scipy_sparse_matrix(M)"
186+
]
187+
},
188+
{
189+
"cell_type": "code",
190+
"execution_count": null,
191+
"metadata": {},
192+
"outputs": [],
193+
"source": [
194+
"# Adding info on the nodes of the graph\n",
195+
"tissueinfo_dic = tissue[['gene']].to_dict()\n",
196+
"nx.set_node_attributes(G, tissueinfo_dic['gene'], name='Gene') # gene name"
197+
]
198+
},
199+
{
200+
"cell_type": "code",
201+
"execution_count": null,
202+
"metadata": {},
203+
"outputs": [],
204+
"source": [
205+
"# Saving the graph as a gexf file readable with Gephi.\n",
206+
"nx.write_gexf(G,tissue_name + 'graph.gexf')"
207+
]
208+
},
209+
{
210+
"cell_type": "markdown",
211+
"metadata": {},
212+
"source": [
213+
"Graph plotted using Gephi, colored by community (communities found automatically with Gephi). The gene expression profiles form two distinct clusters.\n",
214+
"\n",
215+
"![gene expression graph](liver_gene_expression.png)"
216+
]
217+
},
218+
{
219+
"cell_type": "markdown",
220+
"metadata": {},
221+
"source": [
222+
"## Applications of the graph\n",
223+
"There are several possible applications of this graph; see the \"genotype graph\" notebook for examples."
224+
]
225+
},
226+
{
227+
"cell_type": "code",
228+
"execution_count": null,
229+
"metadata": {},
230+
"outputs": [],
231+
"source": []
232+
}
233+
],
234+
"metadata": {
235+
"kernelspec": {
236+
"display_name": "venv",
237+
"language": "python",
238+
"name": "venv"
239+
},
240+
"language_info": {
241+
"codemirror_mode": {
242+
"name": "ipython",
243+
"version": 3
244+
},
245+
"file_extension": ".py",
246+
"mimetype": "text/x-python",
247+
"name": "python",
248+
"nbconvert_exporter": "python",
249+
"pygments_lexer": "ipython3",
250+
"version": "3.8.0"
251+
}
252+
},
253+
"nbformat": 4,
254+
"nbformat_minor": 4
255+
}
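
For readers who want to try the notebook's pipeline without access to the S3 bucket, the following is a minimal sketch on synthetic data. The random matrix, the `gene_{i}` names, and the `to_graph` helper are assumptions made for illustration and are not part of the commit. The helper is needed because `nx.from_scipy_sparse_matrix`, used in the notebook, was removed in NetworkX 3.0 in favor of `nx.from_scipy_sparse_array`.

```python
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import normalize

# Synthetic stand-in for tissue_values: 20 "genes" measured over 8 "strains".
rng = np.random.default_rng(0)
values = rng.normal(size=(20, 8))

# L2-normalize each row so that distances reflect the profile shape,
# not the overall expression level.
values = normalize(values, norm='l2', axis=1)

# Sparse kNN matrix whose stored entries are Euclidean distances.
num_neighbors = 4
knn = kneighbors_graph(values, num_neighbors, mode='distance')

# Turn distances into weights while keeping the same sparsity pattern.
weights = knn.copy()
weights.data = np.exp(-knn.data)

# Hypothetical helper: pick whichever conversion function this NetworkX version has.
to_graph = getattr(nx, 'from_scipy_sparse_array', None) or nx.from_scipy_sparse_matrix
G = to_graph(weights)

# Attach a made-up gene name to every node, as the notebook does with tissue['gene'].
names = {i: 'gene_{}'.format(i) for i in range(values.shape[0])}
nx.set_node_attributes(G, names, name='Gene')

print(G.number_of_nodes(), 'nodes,', G.number_of_edges(), 'edges')
```

Since the rows are unit vectors, every distance lies between 0 and 2, so with `sigma = 1` the weights fall between `exp(-2)` and 1. The kNN relation is not symmetric, but the default conversion builds an undirected graph, so a node can end up with more than `num_neighbors` edges.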
