Commit 1e60aae

scNiche v1.1.0
1 parent 719fc18 commit 1e60aae

22 files changed: +2293 −494 lines changed

README.md

+58 −12

@@ -1,4 +1,4 @@
-# scNiche v1.0.0
+# scNiche v1.1.0
 
 ## Identification and characterization of cell niches in tissue from spatial omics data at single-cell resolution
 
@@ -11,10 +11,12 @@ scNiche is a computational framework to identify and characterize cell niches fr
 ## Requirements and Installation
 [![anndata 0.10.1](https://img.shields.io/badge/anndata-0.10.1-success)](https://pypi.org/project/anndata/) [![pandas 1.5.0](https://img.shields.io/badge/pandas-1.5.0-important)](https://pypi.org/project/pandas/) [![squidpy 1.2.3](https://img.shields.io/badge/squidpy-1.2.3-critical)](https://pypi.org/project/squidpy/) [![scanpy 1.9.1](https://img.shields.io/badge/scanpy-1.9.1-informational)](https://github.com/scverse/scanpy) [![dgl 1.1.0+cu113](https://img.shields.io/badge/dgl-1.1.0%2Bcu113-blueviolet)](https://www.dgl.ai/) [![torch 1.12.1+cu113](https://img.shields.io/badge/torch-1.12.1%2Bcu113-%23808080)](https://pytorch.org/get-started/locally/) [![matplotlib 3.6.2](https://img.shields.io/badge/matplotlib-3.6.2-ff69b4)](https://pypi.org/project/matplotlib/) [![seaborn 0.13.0](https://img.shields.io/badge/seaborn-0.13.0-9cf)](https://pypi.org/project/seaborn/)
 
-### Create and activate Python environment
+### Create and activate a conda environment with the requirements installed
 For scNiche, the Python version need is over 3.9. If you have already installed a lower version of Python, consider installing Anaconda, and then you can create a new environment.
 ```
-conda create -n scniche python=3.9
+cd scNiche-main
+
+conda env create -f scniche_dev.yaml -n scniche
 conda activate scniche
 ```
 
@@ -29,24 +31,65 @@ pip install dgl==1.1.0+cu113 -f https://data.dgl.ai/wheels/cu113/repo.html
 ```
 The version of PyTorch and DGL should be suitable to the CUDA version of your machine. You can find the appropriate version on the [PyTorch](https://pytorch.org/get-started/locally/) and [DGL](https://www.dgl.ai/) website.
 
-### Install other requirements
-```
-cd scNiche-main
-pip install -r requirements.txt
-```
+
 ### Install scNiche
 ```
 python setup.py build
 python setup.py install
 ```
 
 ## Tutorials (identify cell niches)
-scNiche requires the single-cell spatial omics data (stored as `.h5ad` format) as input, where cell population label of each cell needs to be provided.
+#### - Spatial proteomics data or single-cell spatial transcriptomics data
+
+By default, scNiche requires single-cell spatial omics data (stored in `.h5ad` format) as input, where the cell population label of each cell needs to be provided.
 
 Here are examples of scNiche on simulated and biological datasets:
 * [Demonstration of scNiche on the simulated data](tutorial/tutorial_simulated.ipynb)
-* [Demonstration of scNiche on the mouse spleen CODEX data](tutorial/tutorial_spleen.ipynb)
-* [Demonstration of scNiche on the human upper tract urothelial carcinoma (UTUC) IMC data](tutorial/tutorial_utuc.ipynb)
+* [Demonstration of scNiche on the mouse V1 neocortex STARmap data](tutorial/tutorial_STARmap.ipynb)
+
+
+scNiche also provides a subgraph-based batch training strategy to scale to large datasets and multiple slices:
+
+1. Batch training strategy of scNiche for a single slice:
+* [Demonstration of scNiche on the mouse spleen CODEX data](tutorial/tutorial_spleen.ipynb) (over 80,000 cells per slice)
+
+2. Batch training strategy of scNiche for multiple slices:
+* [Demonstration of scNiche on the human upper tract urothelial carcinoma (UTUC) IMC data](tutorial/tutorial_utuc.ipynb) (containing 115,060 cells from 16 slices)
+* [Demonstration of scNiche on the mouse frontal cortex and striatum MERFISH data](tutorial/tutorial_MERFISH.ipynb) (containing 376,107 cells from 31 slices)
+
+
+#### - Low-resolution spatial transcriptomics data
+Here we take 4 slices from the same donor of the [human DLPFC 10X Visium data](http://spatial.libd.org/spatialLIBD/) as an example.
+
+In contrast to spatial proteomics data, which usually contain only a few dozen proteins, these spatial transcriptomics data can often measure tens of thousands of genes,
+with potential batch effects commonly present across tissue slices from different samples.
+Therefore, dimensionality reduction and batch effect removal need to be performed on the molecular profiles of the cells and their neighborhoods before running scNiche.
+We used [scVI](https://github.com/scverse/scvi-tools) by default; however, simple PCA dimensionality reduction or other deep learning-based integration methods like [scArches](https://github.com/theislab/scarches) are also applicable.
+
+Furthermore, cell type labels are usually unavailable for these spatial transcriptomics data. As alternatives,
+we can:
+1. Use the `deconvolution results of spots` as a substitute view to replace the `cellular compositions of neighborhoods`.
+We used the human middle temporal gyrus (MTG) scRNA-seq data by [Hodge et al.](https://doi.org/10.1038/s41586-019-1506-7) as the single-cell reference, and deconvoluted the spots using [Cell2location](https://github.com/BayraktarLab/cell2location):
+
+* [Demonstration of scNiche on Slice 151673 (with deconvolution results)](tutorial/tutorial_dlpfc151673.ipynb)
+
+2. Only use the molecular profiles of cells and neighborhoods as input:
+
+* [Demonstration of scNiche on Slice 151673 (without deconvolution results)](tutorial/tutorial_dlpfc151673-2view.ipynb)
+
+
+Multi-slice analysis of 4 slices based on the batch training strategy of scNiche:
+
+* [Demonstration of scNiche on 4 slices from the same donor (with deconvolution results)](tutorial/tutorial_DLPFC.ipynb)
+
+#### - Spatial multi-omics data
+The strategy of scNiche for modeling features from different views of the cell offers more possible avenues for expansion,
+such as application to spatial multi-omics data. Here we ran scNiche on a postnatal day (P)22 mouse brain coronal section
+dataset generated by [Zhang et al.](https://doi.org/10.1038/s41586-023-05795-1), which includes RNA-seq and CUT&Tag (acetylated histone H3 Lys27 (H3K27ac) histone modification) modalities.
+The dataset can be downloaded [here](https://zenodo.org/records/10362607).
+
+* [Demonstration of scNiche on the mouse brain spatial CUT&Tag–RNA-seq data](tutorial/tutorial_multi-omics.ipynb)
+
 
 ## Tutorials (characterize cell niches)
 scNiche also offers a downstream analytical framework for characterizing cell niches more comprehensively.
@@ -56,6 +99,9 @@ Here are examples of scNiche on two biological datasets:
 * [Demonstration of scNiche on the mouse liver Seq-Scope data](tutorial/tutorial_liver.ipynb)
 
 
+## Acknowledgements
+The scNiche model is developed based on the [multi-view clustering framework (CMGEC)](https://github.com/wangemm/CMGEC-TMM-2021). We thank the authors for releasing the code.
+
 ## About
-scNiche was developed by Jingyang Qian. Should you have any questions, please contact Jingyang Qian at [email protected].
+scNiche is developed by Jingyang Qian. Should you have any questions, please contact Jingyang Qian at [email protected].
 
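Note: the tutorials above expect an AnnData `.h5ad` file that already carries a per-cell population label and spatial coordinates. Below is a minimal, hypothetical sketch of such an input; the `cell_type` column name and the toy values are placeholders (the actual label column is whatever is later passed as `celltype_key`), not something defined by this commit.

```python
# Minimal sketch (not repo code): build a toy .h5ad input of the shape scNiche expects.
import numpy as np
import pandas as pd
import anndata as ad

n_cells, n_markers = 500, 30
rng = np.random.default_rng(0)

adata = ad.AnnData(X=rng.random((n_cells, n_markers), dtype=np.float32))  # expression / protein matrix
adata.obs["cell_type"] = pd.Categorical(                                  # per-cell population label (placeholder key)
    rng.choice(["T cell", "B cell", "Tumor"], n_cells)
)
adata.obsm["spatial"] = rng.random((n_cells, 2)) * 1000.0                 # x/y coordinates used for neighborhood construction

adata.write_h5ad("my_slice.h5ad")                                         # scNiche reads .h5ad input
```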
images/workflow.jpg

-59.3 KB

requirements.txt

-7
This file was deleted.

scniche/datasets/_dataset.py

+43
@@ -42,6 +42,49 @@ def human_utuc_imc():
     return adata
 
 
+def mouse_v1_starmap():
+    """
+    Raw mouse V1 neocortex dataset from Wang et al. (Science, 2018), containing 1 slice replicate with the layer labels.
+
+    This downloads 9.6 MB of data upon the first call of the function and stores it in `./scniche_data/STARmap.h5ad`.
+    :return: AnnData
+    """
+    url = "https://figshare.com/ndownloader/files/50249244"
+    datasetdir = './scniche_data/STARmap.h5ad'
+    adata = sc.read(datasetdir, backup_url=url)
+    return adata
+
+
+def human_dlpfc_visium():
+    """
+    Raw human DLPFC dataset from Maynard et al. (Nat Neurosci., 2021),
+    containing 4 slices (Slice 151673, 151674, 151675, and 151676) from the same donor with the layer labels.
+    The scVI (Nat Methods., 2018) embedding as well as the Cell2location (Nat Biotechnol., 2022) deconvolution results
+    are also provided.
+
+    This downloads 71.93 MB of data upon the first call of the function and stores it in `./scniche_data/DLPFC.h5ad`.
+    :return: AnnData
+    """
+    url = "https://figshare.com/ndownloader/files/50249673"
+    datasetdir = './scniche_data/DLPFC.h5ad'
+    adata = sc.read(datasetdir, backup_url=url)
+    return adata
+
+
+def mouse_aging_merfish():
+    """
+    Processed mouse aging brain dataset from Allen et al. (Cell, 2023), containing 31 slices with the tissue labels.
+    The data has been normalized and scaled by the original authors, and the PCA results are also provided.
+
+    This downloads 281.56 MB of data upon the first call of the function and stores it in `./scniche_data/MERFISH_Aging.h5ad`.
+    :return: AnnData
+    """
+    url = "https://figshare.com/ndownloader/files/50251680"
+    datasetdir = './scniche_data/MERFISH_Aging.h5ad'
+    adata = sc.read(datasetdir, backup_url=url)
+    return adata
+
+
 def human_tnbc_mibi_tof():
     """
     Processed human triple-negative breast cancer (TNBC) MIBI-TOF dataset from Keren et al. (Cell, 2018),
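Note: a hedged usage sketch of the three loaders added above. It assumes they are re-exported as `scniche.datasets.<name>` (the re-export is not shown in this diff); each call downloads the `.h5ad` from the figshare `backup_url` on first use and caches it under `./scniche_data/`.

```python
# Hedged sketch: assumes these loaders are exposed as scniche.datasets.<name>.
import scniche as scn

adata_starmap = scn.datasets.mouse_v1_starmap()      # ~9.6 MB   -> ./scniche_data/STARmap.h5ad
adata_dlpfc   = scn.datasets.human_dlpfc_visium()    # ~71.93 MB -> ./scniche_data/DLPFC.h5ad
adata_merfish = scn.datasets.mouse_aging_merfish()   # ~281.56 MB -> ./scniche_data/MERFISH_Aging.h5ad

print(adata_starmap)                                 # AnnData summary with layer labels in .obs
```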

scniche/preprocess/__init__.py

+1
@@ -9,6 +9,7 @@
     "process_multi_slices",
     "construct_graph",
     "random_split",
+    "random_split2",
     "myDataset",
     "prepare_data",
     "prepare_data_batch",

scniche/preprocess/_build.py

+37 −28

@@ -5,20 +5,25 @@
 
 def prepare_data(
     adata: AnnData,
+    choose_views: Optional[list] = None,
     k_cutoff_graph: int = 20,
     mik_graph: int = 5,
     verbose: bool = True
 ):
-
-    feat1 = adata.obsm['X_cn_norm']
-    feat2 = adata.obsm['X_data']
-    feat3 = adata.obsm['X_data_nbr']
-
     if verbose:
         print("-------Constructing graph for each view...")
-    for view, feat in zip(['g1', 'g2', 'g3'], [feat1, feat2, feat3]):
+    if choose_views is None:
+        choose_views = ['X_cn_norm', 'X_data', 'X_data_nbr']
+    else:
+        missing_views = [view for view in choose_views if view not in adata.obsm.keys()]
+        if missing_views:
+            raise ValueError(f"The following views are missing in adata.obsm: {', '.join(missing_views)}")
+
+    for view in choose_views:
+        feat = adata.obsm[view]
         g = construct_graph(np.array(feat), k_cutoff_graph, mik_graph)
-        adata.uns[view] = g
+        graph_name = 'g_' + view
+        adata.uns[graph_name] = g
     if verbose:
         print("Constructing done.")
 
@@ -27,23 +32,25 @@ def prepare_data(
 
 def prepare_data_batch(
     adata: AnnData,
+    choose_views: Optional[list] = None,
     batch_num: int = 4,
     k_cutoff_graph: int = 20,
     mik_graph: int = 5,
     verbose: bool = True
 ):
-    feat1 = adata.obsm['X_cn_norm']
-    feat2 = adata.obsm['X_data']
-    feat3 = adata.obsm['X_data_nbr']
-
-    # TODO: batch idx
+    # create batch idx
     random.seed(123)
     batch_size = adata.shape[0] // batch_num
     left_cell_num = adata.shape[0] % batch_num
     add_cell_num = batch_num - left_cell_num
     add_cell = random.choices(range(adata.shape[0]), k=add_cell_num)
 
-    batch_idx = random_split(adata.shape[0], batch_size)
+    # bug fixed
+    if left_cell_num < batch_size:
+        batch_idx = random_split(adata.shape[0], batch_size)
+    else:
+        batch_idx = random_split2(adata.shape[0], batch_num)
+
     if left_cell_num > 0:
         for i in range(left_cell_num):
             batch_idx[i].append(batch_idx[len(batch_idx) - 1][i])
@@ -57,28 +64,30 @@ def prepare_data_batch(
 
     adata.uns['batch_idx'] = batch_idx_new
 
-    g1_list = []
-    g2_list = []
-    g3_list = []
+    # check
+    if choose_views is None:
+        choose_views = ['X_cn_norm', 'X_data', 'X_data_nbr']
+    else:
+        missing_views = [view for view in choose_views if view not in adata.obsm.keys()]
+        if missing_views:
+            raise ValueError(f"The following views are missing in adata.obsm: {', '.join(missing_views)}")
+
+    feat = [adata.obsm[view] for view in choose_views]
+    g_list = [[] for _ in range(len(feat))]
+
     if verbose:
         print("-------Constructing batch-graph for each view...")
-    for i in tqdm(range(batch_num)):
-        feat1_tmp = feat1[batch_idx_new[i]]
-        feat2_tmp = feat2[batch_idx_new[i]]
-        feat3_tmp = feat3[batch_idx_new[i]]
-
-        g1_tmp = construct_graph(np.array(feat1_tmp), k_cutoff_graph, mik_graph)
-        g2_tmp = construct_graph(np.array(feat2_tmp), k_cutoff_graph, mik_graph)
-        g3_tmp = construct_graph(np.array(feat3_tmp), k_cutoff_graph, mik_graph)
 
-        g1_list.append(g1_tmp)
-        g2_list.append(g2_tmp)
-        g3_list.append(g3_tmp)
+    for i in tqdm(range(batch_num)):
+        for j in range(len(feat)):
+            feat_tmp = feat[j][batch_idx_new[i]]
+            g_tmp = construct_graph(np.array(feat_tmp), k_cutoff_graph, mik_graph)
+            g_list[j].append(g_tmp)
 
     if verbose:
         print("Constructing done.")
 
-    mydataset = myDataset(g1_list, g2_list, g3_list)
+    mydataset = myDataset(g_list)
     dataloader = GraphDataLoader(mydataset, batch_size=1, shuffle=False, pin_memory=True)
     adata.uns['dataloader'] = dataloader

scniche/preprocess/_utils.py

+24 −13

@@ -7,6 +7,7 @@
 from anndata import AnnData
 from tqdm import tqdm
 from sklearn.decomposition import PCA
+from scipy.sparse import issparse
 from torch.utils.data import Dataset, DataLoader
 from sklearn.neighbors import NearestNeighbors
 from typing import Optional, Union
@@ -38,8 +39,8 @@ def cal_spatial_neighbors(
 
     # CNs
     meta = adata.obs.copy()
-    meta['x_new'] = adata.obsm['spatial'][:, 0]
-    meta['y_new'] = adata.obsm['spatial'][:, 1]
+    meta['x_new'] = list(adata.obsm['spatial'][:, 0])
+    meta['y_new'] = list(adata.obsm['spatial'][:, 1])
 
     if celltype_order is None:
         celltype_order = sorted(meta[celltype_key].unique())
@@ -109,7 +110,10 @@ cal_spatial_exp(
     if layer_key is not None:
         data_raw = adata.obsm[layer_key].copy()
     else:
-        data_raw = adata.X.copy()
+        if issparse(adata.X):
+            data_raw = adata.X.toarray().copy()
+        else:
+            data_raw = adata.X.copy()
     data_nbr = []
     for i in range(indices.shape[0]):
         data_nbr_tmp = data_raw[indices[i]].mean(axis=0)
@@ -192,12 +196,25 @@
     return g
 
 
+# left_cell_num < batch_size
 def random_split(n, m):
     nums = list(range(n))
     random.shuffle(nums)
     return [nums[i:i + m] for i in range(0, n, m)]
 
 
+# left_cell_num > batch_size
+def random_split2(n, batch_num):
+    nums = list(range(n))
+    random.shuffle(nums)
+
+    batch_size = n // (batch_num + 1)
+    result = [nums[i * batch_size: (i + 1) * batch_size] for i in range(batch_num)]
+    result.append(nums[batch_num * batch_size:])
+
+    return result
+
+
 def set_seed():
     # seed
     seed = 123
@@ -210,21 +227,15 @@ def set_seed():
 
 
 class myDataset(Dataset):
-    def __init__(self, g1, g2, g3):
-        self.g1 = g1
-        self.g2 = g2
-        self.g3 = g3
+    def __init__(self, g_list):
+        self.g_list = g_list
 
     def __getitem__(self, idx):
 
-        tmp_g1 = self.g1[idx]
-        tmp_g2 = self.g2[idx]
-        tmp_g3 = self.g3[idx]
-
-        return tmp_g1, tmp_g2, tmp_g3
+        return tuple(g[idx] for g in self.g_list)
 
     def __len__(self):
-        return len(self.g1)
+        return len(self.g_list[0])
 
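Note: a standalone illustration (not repo code) of why `random_split2` was added. With `random_split` alone, whenever `left_cell_num >= batch_size` the shuffled index list is chopped into far more chunks than `batch_num`; the new helper fixes the number of chunks instead. The toy numbers below are arbitrary.

```python
# Standalone demo of the two split helpers defined in scniche/preprocess/_utils.py.
import random

def random_split(n, m):                         # original helper: chunks of fixed size m
    nums = list(range(n))
    random.shuffle(nums)
    return [nums[i:i + m] for i in range(0, n, m)]

def random_split2(n, batch_num):                # new helper: fixed number of chunks + remainder
    nums = list(range(n))
    random.shuffle(nums)
    batch_size = n // (batch_num + 1)
    result = [nums[i * batch_size:(i + 1) * batch_size] for i in range(batch_num)]
    result.append(nums[batch_num * batch_size:])
    return result

random.seed(123)
n, batch_num = 7, 4                             # batch_size = 1, left_cell_num = 3 >= batch_size
print(len(random_split(n, n // batch_num)))     # 7 chunks -- far more than batch_num
print(len(random_split2(n, batch_num)))         # 5 chunks: batch_num pieces plus a remainder chunk
```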
scniche/trainer/__init__.py

+1 −1

@@ -3,7 +3,7 @@
 from ._utils import *
 
 __all__ = [
-    "GAE",
+    "MGAE",
     "FeatureFusion",
     "InnerProductDecoder",
     "GFN",
