Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
13b1b21
init internal functions for permutation testing
angelphanth Sep 9, 2025
cdb1705
user facing in init file
angelphanth Sep 9, 2025
a13be15
:construction: some tests
angelphanth Sep 9, 2025
13149df
Merge branch 'main' into anglup-learning
angelphanth Sep 9, 2025
5a5a42a
:bug: degeneracy check for paired permutations
angelphanth Sep 10, 2025
179ce05
use internal functions
angelphanth Sep 10, 2025
a144f7e
:memo: copilot docstrings
angelphanth Sep 10, 2025
e66b6b3
tests for accept reject null
angelphanth Sep 10, 2025
53d5f46
init tutorial
angelphanth Sep 10, 2025
0a7e33a
diff conditions were diff mgys studies
angelphanth Sep 16, 2025
dc648f6
other go term
angelphanth Sep 16, 2025
719b9e3
Merge branch 'main' into anglup-learning
angelphanth Oct 13, 2025
5492203
:construction: instead saving processed dataset to example_data/mgnify
angelphanth Oct 14, 2025
8ebab69
:memo:docs: Update API examples for permutation testing
angelphanth Oct 14, 2025
b6d049c
jupytext of the permutation testing demo notebook
angelphanth Oct 14, 2025
74255e3
:lipstick:style: lint and reformatd code
angelphanth Oct 14, 2025
afc2283
Merge branch 'main' into anglup-learning
angelphanth Oct 24, 2025
23a79ff
:bug:fix: update PR module link in contributing guide
angelphanth Oct 24, 2025
2194b08
:bug: fix: add quotation marks to command
angelphanth Oct 24, 2025
5b0124e
:sparkles: feat: add pydantic model for perm test output
angelphanth Oct 29, 2025
83bbfda
:sparkles: feat: add pydantic model for perm test output
angelphanth Oct 29, 2025
e614049
:memo: docs: jupytexted the notebook
angelphanth Oct 29, 2025
6bea62d
:lipstick: style: lint and format
angelphanth Oct 29, 2025
f27cf2e
Update docs/api_examples/permutation_testing.py
angelphanth Nov 7, 2025
8415f0f
Update docs/api_examples/permutation_testing.py
angelphanth Nov 7, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions CONTRIBUTING.rst
Original file line number Diff line number Diff line change
Expand Up @@ -70,7 +70,7 @@ advanced editor to help you with some of the common steps described below, e.g.
$ cd acore/
$ python -m venv .env
$ source .env/bin/activate
$ pip install -e .[dev]
$ pip install -e ".[dev]"

If you work on a Windows shell, see the docs for instructions:
`How venvs work <https://docs.python.org/3/library/venv.html#how-venvs-work>`_
Expand Down Expand Up @@ -125,7 +125,7 @@ Before you submit a pull request, check that it meets these guidelines:
3. The pull request should pass the workflows on GitHub.

See for example the PR-Template for a module:
`Add module PR template <https://github.com/Multiomics-Analytics-Group/acore/blob/main/.github/PULL_REQUEST_TEMPLATE/module.md>`_.
`Add module PR template <https://github.com/Multiomics-Analytics-Group/acore/blob/main/.github/workflows/PULL_REQUEST_TEMPLATE/module.md>`_.



Expand Down
262 changes: 262 additions & 0 deletions docs/api_examples/permutation_testing.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "e9bdb7c9",
"metadata": {},
"source": [
"# Tutorial: Permutation Testing using `acore`\n",
"\n",
"In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n",
"\n",
"The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".

Suggested change
"The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).\n",
"The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n",

Copilot uses AI. Check for mistakes.
"\n",
"For this demo we look at the GO term abundance tables generated by the Mgnify pipeline. The values in the table are the absolute abundance of selected GO terms for each sample, which we then transform to relative abundances and centred-log ratios. \n"
]
},
{
"cell_type": "markdown",
"id": "dc9f17b2",
"metadata": {},
"source": [
"## Data preparation details\n",
"\n",
"### Downloading\n",
"The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".

Copilot uses AI. Check for mistakes.
"\n",
"### Preprocessing of abundances\n",
"- To account for technical variation due to sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample aka getting relative abundances. \n",
"- The relative abundances are compositional data (CoDa) so we map them to unconstrained vectors using centred log-ratio transformation `acore.microbiome.internal_functions.calc_clr()` to not violate assumptions of any frequentist stats we do\n",
"\n",
"### Preprocessing of the metadata \n",
"- the sample metadata needed for this demo (sampling location) were available in their \"sample-desc\" \n",
"- the sample-desc for each sample in both INF and EFF were parsed and used for pairing off\n",
"\n",
"### Subset of data for demo\n",
"- For this demo we only look at [go term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001)\n",
"- It's expected that antibiotic catabolic processes to be higher in INF vs EFF\n",
"\n",
"### Saving the demo dataset\n",
"This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:\n",
"\n",
"| column | description | dtype |\n",
"|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|\n",
"| eff_id | The run id for the mgnify analysis of the effluent sample. | str |\n",
"| inf_id | The run id for the mgnify analysis of the influent sample. | str |\n",
"| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str |\n",
"| sampling_read | Replicates? | str |\n",
"| eff_abundance | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDA and CLR) | float |\n",
"| inf_abundance | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDA and CLR) | float |\n",
"\n",
"-----\n",
"\n",
"We will now proceed with reading in the prepared dataset. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "00d8f038",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>eff_id</th>\n",
" <th>inf_id</th>\n",
" <th>sampling_location</th>\n",
" <th>sampling_read</th>\n",
" <th>eff_abundance</th>\n",
" <th>inf_abundance</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>ERR2985255</td>\n",
" <td>ERR2814663</td>\n",
" <td>TG</td>\n",
" <td>READ2 Taxonomy ID:256318</td>\n",
" <td>3.257283</td>\n",
" <td>4.226819</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>ERR2985256</td>\n",
" <td>ERR2814664</td>\n",
" <td>MN</td>\n",
" <td>READ2 Taxonomy ID:256318</td>\n",
" <td>2.572841</td>\n",
" <td>3.847191</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>ERR2985257</td>\n",
" <td>ERR2814651</td>\n",
" <td>AH</td>\n",
" <td>READ1 Taxonomy ID:256318</td>\n",
" <td>4.298777</td>\n",
" <td>4.086841</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>ERR2985258</td>\n",
" <td>ERR2814667</td>\n",
" <td>TE</td>\n",
" <td>READ1 Taxonomy ID:256318</td>\n",
" <td>2.758982</td>\n",
" <td>3.436752</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>ERR2985259</td>\n",
" <td>ERR2814660</td>\n",
" <td>FD</td>\n",
" <td>READ1 Taxonomy ID:256318</td>\n",
" <td>3.364675</td>\n",
" <td>3.486673</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" eff_id inf_id sampling_location sampling_read \\\n",
"0 ERR2985255 ERR2814663 TG READ2 Taxonomy ID:256318 \n",
"1 ERR2985256 ERR2814664 MN READ2 Taxonomy ID:256318 \n",
"2 ERR2985257 ERR2814651 AH READ1 Taxonomy ID:256318 \n",
"3 ERR2985258 ERR2814667 TE READ1 Taxonomy ID:256318 \n",
"4 ERR2985259 ERR2814660 FD READ1 Taxonomy ID:256318 \n",
"\n",
" eff_abundance inf_abundance \n",
"0 3.257283 4.226819 \n",
"1 2.572841 3.847191 \n",
"2 4.298777 4.086841 \n",
"3 2.758982 3.436752 \n",
"4 3.364675 3.486673 "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd \n",
"\n",
"df_data = pd.read_csv(\n",
" 'https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv'\n",
")\n",
"# sanity check \n",
"df_data.head()"
]
},
{
"cell_type": "markdown",
"id": "7cf6c3d0",
"metadata": {},
"source": [
"## The permutation test\n",
"\n",
"Since these are paird samples we will proceed with paired sample permutation test using `acore.perumutation_test.paired_permutation()`. \n",
"\n",
"The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset. \n",
"\n",
"If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. "
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "sizze" should be "size".

Suggested change
"If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. "
"If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. "

Copilot uses AI. Check for mistakes.
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "5eefb720",
"metadata": {},
"outputs": [],
"source": [
"from acore.permutation_test import paired_permutation\n",
"\n",
"# optional choice of random number generatorfor repro\n",
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space in comment: "generatorfor" should be "generator for".

Suggested change
"# optional choice of random number generatorfor repro\n",
"# optional choice of random number generator for repro\n",

Copilot uses AI. Check for mistakes.
"import numpy as np\n",
"rng = np.random.default_rng(12345)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9eabb911",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'metric': <function ttest_rel at 0x10fee80e0>, 'observed': (np.float64(6.7389860601792275), np.float64(7.122287781830209e-07)), 'p_value': 0.0}\n",
"{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n",
"{'metric': <function mean at 0x10920a370>, 'observed': np.float64(0.5350826397500547), 'p_value': 0.0}\n"
]
}
],
"source": [
"# trying diff metrics to demo functionality also\n",
"for metric in ['t-statistic', 'mean', np.mean]:\n",
" result = paired_permutation(\n",
" df_data['inf_abundance'].to_numpy(),\n",
" df_data['eff_abundance'].to_numpy(), \n",
" metric=metric, \n",
" n_permutations=10000, \n",
" rng=rng\n",
" )\n",
" # verbosity\n",
" print(result)"
]
},
{
"cell_type": "markdown",
"id": "1ad22530",
"metadata": {},
"source": [
"## Result\n",
"\n",
"Based on the permutation tests by test statistic and mean difference, the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random would be <0.00001."
]
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
101 changes: 101 additions & 0 deletions docs/api_examples/permutation_testing.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# ---
# jupyter:
# jupytext:
# text_representation:
# extension: .py
# format_name: percent
# format_version: '1.3'
# jupytext_version: 1.17.3
# kernelspec:
# display_name: .venv
# language: python
# name: python3
# ---

# %% [markdown]
# # Permutation Tests
#
# In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).
#
# The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".

Suggested change
# The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058).
# The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).

Copilot uses AI. Check for mistakes.
#
# For this demo we look at the GO term abundance tables generated by the Mgnify pipeline. The values in the table are the absolute abundance of selected GO terms for each sample, which we then transform to relative abundances and centred-log ratios.
#

# %% [markdown]
# ## Data preparation details
#
# ### Downloading
# The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The inffluent (INF) and effluent (EFFF) datasets have paired samples and we also needed to download the sample metadata (also available via Mgnify API) to assign the correct pairing.
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".

Copilot uses AI. Check for mistakes.
#
# ### Preprocessing of abundances
# - To account for technical variation due to sequencing technology limitations, we first transform the abundance values so they are relative to the total reads for the sample aka getting relative abundances.
# - The relative abundances are compositional data (CoDa) so we map them to unconstrained vectors using centred log-ratio transformation [`acore.microbiome.internal_functions.calc_clr`](`acore.microbiome.internal_functions.calc_clr`) to not violate assumptions of any frequentist stats we do
#
# ### Preprocessing of the metadata
# - the sample metadata needed for this demo (sampling location) were available in their "sample-desc"
# - the sample-desc for each sample in both INF and EFF were parsed and used for pairing off
#
# ### Subset of data for demo
# - For this demo we only look at [go term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001)
# - It's expected that antibiotic catabolic processes to be higher in INF vs EFF
#
# ### Saving the demo dataset
# This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:
#
# | column | description | dtype |
# |-------------------|-------------------------------------------------------------------------------------------------------------------|-------|
# | eff_id | The run id for the mgnify analysis of the effluent sample. | str |
# | inf_id | The run id for the mgnify analysis of the influent sample. | str |
# | sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str |
# | sampling_read | Replicates? | str |
# | eff_abundance | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDA and CLR) | float |
# | inf_abundance | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDA and CLR) | float |
#
# -----
#
# We will now proceed with reading in the prepared dataset.

# %%
import pandas as pd

df_data = pd.read_csv(
"https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv"
)
# sanity check
df_data.head()

# %% [markdown]
# ## The permutation test
#
# Since these are paird samples we will proceed with paired sample permutation test using `acore.perumutation_test.paired_permutation()`.
#
# The permutation test compares the actual observed chosen metric (e.g., t-statistic, mean difference) with metrics calculated when the dataset values are randomly shuffled permutations of the dataset.
#
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance.
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling error: "sizze" should be "size".

Suggested change
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance.
# If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance.

Copilot uses AI. Check for mistakes.

# %%
from acore.permutation_test import paired_permutation

# optional choice of random number generatorfor repro
Copy link

Copilot AI Nov 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing space in comment: "generatorfor" should be "generator for".

Suggested change
# optional choice of random number generatorfor repro
# optional choice of random number generator for repro

Copilot uses AI. Check for mistakes.
import numpy as np

rng = np.random.default_rng(12345)

# %%
# trying diff metrics to demo functionality also
for metric in ["t-statistic", "mean", np.mean]:
result = paired_permutation(
df_data["inf_abundance"].to_numpy(),
df_data["eff_abundance"].to_numpy(),
metric=metric,
n_permutations=10000,
rng=rng,
)
# verbosity
print(result)

# %% [markdown]
# ## Result
#
# Based on the permutation tests by test statistic and mean difference, the probability of the observed metrics (t=6.739 and mean diff=0.535) occurring at random would be <0.00001.
1 change: 1 addition & 0 deletions docs/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ Acore documentation
api_examples/normalization_analysis
api_examples/ANCOVA_analysis
api_examples/enrichment_analysis
api_examples/permutation_testing

.. toctree::
:maxdepth: 1
Expand Down
Loading