correct small bug and adapt to tissue files

bricaud · bricaud · commit 0cec50604166 · 2021-07-19T15:45:29.000+02:00
diff --git a/genotype-graph.ipynb b/genotype-graph.ipynb
@@ -62,14 +62,7 @@
     "genotype_path = os.path.join(s3_path, 'geno_reduced.csv.gz')\n",
     "#genotype_path = os.path.join(s3_path, 'genotype_BXD.csv.gz')\n",
     "genotype = pd.read_csv(genotype_path, storage_options=storage_options)\n",
-    "print('File {} Opened.'.format(genotype_path))\n",
-    "phenotype_path = os.path.join(s3_path, 'Phenotype.txt.gz')\n",
-    "phenotype = pd.read_csv(phenotype_path, sep='\\t', storage_options=storage_options)\n",
-    "print('File {} Opened.'.format(phenotype_path))\n",
-    "# Phenotype description\n",
-    "phenotypeinfo_path = os.path.join(s3_path, 'phenotypes_id_aligner.txt.gz')\n",
-    "phenotypeinfo = pd.read_csv(phenotypeinfo_path, sep='\\t', storage_options=storage_options)\n",
-    "print('File {} Opened.'.format(phenotypeinfo_path))"
+    "print('File {} Opened.'.format(genotype_path))"
    ]
   },
   {
@@ -128,7 +121,7 @@
     "    sigma = 1\n",
     "    return np.exp(- sigma * d)\n",
     "    \n",
-    "M = csr_matrix(geno_knn.shape)\n",
+    "M = geno_knn.copy()\n",
     "M.data = distance2weight(geno_knn.data)\n",
     "\n",
     "print('A distance of 1 becomes a weight of {}.'.format(str(distance2weight(1))))"
diff --git a/tissue-graph.ipynb b/tissue-graph.ipynb
@@ -0,0 +1,255 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Graph of gene expression similarity inside a tissue\n",
+    "The goal here is to build a similarity graph between gene expression profiles. The approach is the same as for the genotype graph; see the \"genotype graph\" notebook for more info.\n",
+    "\n",
+    "In this notebook, the nodes of the graph are proteins (or gene expression levels). Each node is connected to its k nearest neighbors. The connections are weighted by the similarity between two protein expression profiles according to a chosen distance. Each protein is associated with a vector encoding its variations over the BXD mouse dataset. Two proteins are similar if their vectors are close in terms of Euclidean distance."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "import numpy as np\n",
+    "from scipy.sparse import csr_matrix\n",
+    "import os"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import networkx as nx\n",
+    "import sklearn.metrics\n",
+    "import sklearn.neighbors\n",
+    "import matplotlib.pyplot as plt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Importing the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Config for accessing the data on the s3 storage\n",
+    "storage_options = {'anon':True, 'client_kwargs':{'endpoint_url':'https://os.unil.cloud.switch.ch'}}\n",
+    "s3_path = 's3://lts2-graphnex/BXDmice/'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Load the data\n",
+    "# Tissue\n",
+    "tissue_name = 'LiverProt_CD'\n",
+    "# Other examples:\n",
+    "#tissue_name = 'Eye'\n",
+    "#tissue_name = 'Muscle_CD'\n",
+    "#tissue_name = 'Hippocampus'\n",
+    "#tissue_name = 'Gastrointestinal'\n",
+    "#tissue_name = 'Lung'\n",
+    "tissue_path = os.path.join(s3_path,  'expression data', tissue_name + '.txt.gz')\n",
+    "tissue = pd.read_csv(tissue_path, sep='\\t', storage_options=storage_options)\n",
+    "print('File {} Opened.'.format(tissue_path))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Computing the distances"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Remove the columns (mouse strains) that have at least one missing measurement:\n",
+    "tissue = tissue.dropna(axis=1)\n",
+    "# Extract the data as a numpy array (drop the first two columns)\n",
+    "tissue_values = tissue.iloc[:,2:].values"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Normalizing\n",
+    "If the data is not normalized, the graph of gene expression may capture only similar concentrations rather than correlated expression."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from sklearn.preprocessing import normalize\n",
+    "tissue_values = normalize(tissue_values, norm='l2', axis=1)\n",
+    "tissue_values.shape"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Default distance is Euclidean\n",
+    "num_neighbors = 4\n",
+    "tissue_knn = sklearn.neighbors.kneighbors_graph(tissue_values, num_neighbors, mode='distance')\n",
+    "# Optionally, one can use the following function to compute all the distances:\n",
+    "#tissue_distances = sklearn.metrics.pairwise_distances(tissue_values)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Distribution of distances\n",
+    "plt.hist(tissue_knn.data, bins=50)\n",
+    "plt.title('Distribution of distances')\n",
+    "plt.xlabel('Distance')\n",
+    "plt.ylabel('Nb of edges')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Distance to weight\n",
+    "# Modify the non-zero values to turn them into weights instead of distances\n",
+    "def distance2weight(d):\n",
+    "    sigma = 1\n",
+    "    return np.exp(- sigma * d)\n",
+    "    \n",
+    "M = tissue_knn.copy() #csr_matrix(tissue_knn.shape)\n",
+    "M.data = distance2weight(tissue_knn.data)\n",
+    "\n",
+    "print('A distance of 1 becomes a weight of {}.'.format(str(distance2weight(1))))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Distribution of weights\n",
+    "plt.hist(M.data, bins=20)\n",
+    "plt.title('Distribution of weights')\n",
+    "plt.xlabel('Weight value')\n",
+    "plt.ylabel('Nb of edges')\n",
+    "plt.show()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Building the graph"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "G = nx.from_scipy_sparse_matrix(M)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Adding info on the nodes of the graph\n",
+    "tissueinfo_dic = tissue[['gene']].to_dict()\n",
+    "nx.set_node_attributes(G, tissueinfo_dic['gene'], name='Gene') # gene name"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Saving the graph as a gexf file readable with Gephi.\n",
+    "nx.write_gexf(G,tissue_name + 'graph.gexf')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Graph plotted using Gephi, colored by community (communities found automatically with Gephi). The gene expression forms two distinct clusters. \n",
+    "\n",
+    "![gene expression graph](liver_gene_expression.png)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Applications of the graph\n",
+    "This graph has several possible applications; see the \"genotype graph\" notebook for examples."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "venv",
+   "language": "python",
+   "name": "venv"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.0"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}