Potential inconsistency between R and Python Versions #12
Comments
Hey Michael! Thanks a lot for catching this. I am traveling at the moment, but will have a careful look in a week or so and get back to you.

Vipul
Hi Michael,

You are right that there is an inconsistency between the Python and R versions. The R version behaves as desired, and it looks like the Python version needs a bugfix. We will get on that soon, and update you once the Python version is fixed. In the meantime, if you would like to use the Python version, I think perhaps you can try just using the neighbourhood mean without the AGF (i.e., M = 0) in your analysis. If you do want to use the AGF, I would suggest using the R version for now (or generating the BANKSY matrix using the R version and importing it into Python via a CSV file).

Just for concreteness / our own reference, here is what the two versions are doing currently for the computation of the M = 1 (AGF) component of the BANKSY matrix, with centering turned on. The symbols are defined precisely in the Methods section of the manuscript.

The R version (correct) centers the neighbourhood expression by a plain, unweighted mean before taking the weighted sum against the AGF coefficients `weight * exp(j * M * phi)`. In the Python version, as you point out, the mean-to-center-by is instead computed using the normalized Gaussian envelope term as weights: since the complex exponential has unit magnitude, `|w * exp(j * M * phi)| = w`, so the centering term becomes a Gaussian-weighted neighbourhood mean rather than a plain one. We then compute the weighted harmonic sum over values centered by this weighted mean, which, while probably fine in practice (since the centering term is still a mean), does mean that the two versions don't match exactly.

Thanks a lot for catching this again!

Vipul
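Purely as an illustration (toy data and hypothetical variable names, not the actual Banksy R or banksy_py code), here is a minimal numpy sketch of how the two centering choices diverge for a single cell's M = 1 AGF component:

```python
import numpy as np

# Toy setup: one index cell with k neighbours (hypothetical names).
# expr: (k, n_genes) expression of the k neighbours
# w:    (k,) Gaussian envelope weights over neighbours, normalized to sum to 1
# phi:  (k,) azimuthal angles of the neighbours around the index cell
rng = np.random.default_rng(0)
k, n_genes = 6, 5
expr = rng.random((k, n_genes))
phi = rng.uniform(0, 2 * np.pi, size=k)
w = np.exp(-np.arange(k) / 2.0)
w /= w.sum()
M = 1

# R-style centering (correct per this thread): plain, unweighted mean.
mean_uniform = expr.mean(axis=0)

# Python-style centering (the bug): mean weighted by the Gaussian envelope,
# i.e. by |w * exp(1j * M * phi)| = w, since exp(1j * M * phi) has unit magnitude.
mean_weighted = w @ expr

# The M = 1 AGF component is the magnitude of the weighted harmonic sum
# of the centered neighbour expression.
coef = w * np.exp(1j * M * phi)
agf_r_style = np.abs(coef @ (expr - mean_uniform))
agf_py_style = np.abs(coef @ (expr - mean_weighted))

print(agf_r_style)   # these two generally differ, which is
print(agf_py_style)  # the inconsistency being discussed here
```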
Thank you soooo much for answering my questions! I have tried replacing the weighted average with a uniform average (by just using an unweighted neighbourhood mean as `nbr_avgs`), and the two versions now agree.

I have just one more question: what kind of clustering method do you suggest for general ST data, such as 10X Visium or SlideSeq? I noticed that you have implemented a version of Leiden clustering on the SNN graph, which is consistent with the clustering results in Seurat (very clear and well-structured code, by the way!). The clustering results look fine, but as the dataset scale increases (~50k samples), this clustering procedure takes too much time. I tried moving to KMeans, but the results are not acceptable under any settings I tried.

Much appreciated!
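For reference, the Leiden-on-SNN procedure being described is equivalent to something like the following scanpy-based sketch (using scanpy here is my own assumption; the repo's implementation may differ in graph construction details):

```python
import numpy as np
import anndata as ad
import scanpy as sc

# Hypothetical stand-in for a BANKSY feature matrix (cells x features).
X = np.random.default_rng(0).random((500, 60))
adata = ad.AnnData(X)

# Build a kNN-based neighbour graph on the BANKSY matrix, then run Leiden.
sc.pp.neighbors(adata, n_neighbors=15, use_rep="X")
sc.tl.leiden(adata, resolution=1.0, key_added="banksy_leiden")
print(adata.obs["banksy_leiden"].value_counts())
```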
Cool! Glad the corrected centering works!

On your other question: I think maybe you can try two things (described below, with a sketch of the first after this comment). I have not tried them, but they are on my list. If you try them and want to add a vignette / tutorial to the banksy repo(s), feel free to fork and then do a PR!

Scaling Idea 1: First reduce your expression matrix to ~100 PCs. Then just run BANKSY as usual (so compute neighbours in this 100-feature space). In particular, you will still reduce your (100 PC + 100 nbd mean feature + 100 AGF feature) BANKSY matrix further before clustering.

Scaling Idea 2 (I think you might need to use the R version for this):
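Here is a rough numpy/scikit-learn sketch of one reading of Scaling Idea 1 (the choices of k, sigma, and dimensions are illustrative assumptions, and this bypasses the actual BANKSY packages entirely):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 1000
expr = rng.random((n_cells, n_genes))   # stand-in for a normalized expression matrix
xy = rng.random((n_cells, 2)) * 100     # spatial coordinates

# Step 1: reduce expression to ~100 PCs, so all BANKSY features live in PC space.
pcs = PCA(n_components=100, random_state=0).fit_transform(expr)

# Step 2: spatial nearest neighbours (k = 10 is an illustrative choice).
k = 10
dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(xy).kneighbors(xy)
dist, idx = dist[:, 1:], idx[:, 1:]     # drop self

# Gaussian envelope weights, normalized per cell (sigma is an assumption).
sigma = np.median(dist)
w = np.exp(-(dist ** 2) / (2 * sigma ** 2))
w /= w.sum(axis=1, keepdims=True)

# Neighbourhood mean feature (100-dim), computed in PC space.
nbd_mean = np.einsum("ik,ikf->if", w, pcs[idx])

# M = 1 AGF feature (100-dim): magnitude of the weighted harmonic sum, with
# neighbours centered by the unweighted neighbourhood mean (per the fix above).
vec = xy[idx] - xy[:, None, :]
phi = np.arctan2(vec[..., 1], vec[..., 0])
centered = pcs[idx] - pcs[idx].mean(axis=1, keepdims=True)
agf = np.abs(np.einsum("ik,ikf->if", w * np.exp(1j * phi), centered))

# 100 PC + 100 nbd-mean + 100 AGF = 300-feature BANKSY-style matrix,
# which you would still reduce again (e.g. to 20 PCs) before clustering.
banksy_mat = np.hstack([pcs, nbd_mean, agf])
banksy_pcs = PCA(n_components=20, random_state=0).fit_transform(banksy_mat)
print(banksy_pcs.shape)                 # (2000, 20)
```

Note that in the real BANKSY matrix the feature blocks are scaled by the lambda mixing parameter before concatenation; that scaling is omitted here for brevity.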
Hi,
Very fascinating and exciting work! :)
This issue is mainly about the implementation of the AGF across the different versions of BANKSY. As I understand the R version, where the AGF (also called 1-harmonics) is computed, `(weight * exp(j * M * phi))` is the AGF coefficient and `fscale(gcm[, to, drop = FALSE])` is the scaled neighborhood expression. However, in the Python version, the centered neighborhood expression `zerod` is obtained by subtracting a weighted mean `nbr_avgs`. I think that `zerod` and `fscale(gcm[...` should be consistent, such that `nbr_avgs` should be some mean without weights. May I ask whether this behavior is desirable?

Appreciate your efforts!
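To make the question concrete, here is a minimal numpy illustration of the two candidate definitions of `nbr_avgs` (names and shapes are hypothetical, mirroring the thread rather than the actual banksy_py source):

```python
import numpy as np

# nbr_expr: (k_neighbours, n_genes) neighbourhood expression for one cell;
# w: the normalized Gaussian envelope weights over those neighbours.
rng = np.random.default_rng(1)
nbr_expr = rng.random((8, 4))
w = rng.random(8)
w /= w.sum()

nbr_avgs_weighted = w @ nbr_expr          # what the Python version currently does
nbr_avgs_uniform = nbr_expr.mean(axis=0)  # "some mean without weights", matching R

zerod = nbr_expr - nbr_avgs_uniform       # centering consistent with fscale(gcm[...
print(np.allclose(nbr_avgs_weighted, nbr_avgs_uniform))  # False in general
```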