permutation tests #43
base: main
Changes from all commits
13b1b21
cdb1705
a13be15
13149df
5a5a42a
179ce05
a144f7e
e66b6b3
53d5f46
0a7e33a
dc648f6
719b9e3
5492203
8ebab69
b6d049c
74255e3
afc2283
23a79ff
2194b08
5b0124e
83bbfda
e614049
6bea62d
f27cf2e
8415f0f
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,262 @@ | ||||||
| { | ||||||
| "cells": [ | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "id": "e9bdb7c9", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "# Tutorial: Permutation Testing using `acore`\n", | ||||||
| "\n", | ||||||
| "In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8).\n", | ||||||
| "\n", | ||||||
| "The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058).\n", | ||||||
| "\n", | ||||||
| "For this demo we look at the GO term abundance tables generated by the MGnify pipeline. The values in the table are the absolute abundances of selected GO terms for each sample, which we then transform to relative abundances and centred log-ratios.\n" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "id": "dc9f17b2", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## Data preparation details\n", | ||||||
| "\n", | ||||||
| "### Downloading\n", | ||||||
| "The analysed samples were downloaded via the [MGnify API](https://www.ebi.ac.uk/metagenomics/api/docs/). The influent (INF) and effluent (EFF) datasets have paired samples, so we also needed to download the sample metadata (also available via the MGnify API) to assign the correct pairing.\n", | ||||||
| "\n", | ||||||
| "### Preprocessing of abundances\n", | ||||||
| "- To account for technical variation due to sequencing technology limitations, we first divide each abundance value by the total reads for its sample, which gives relative abundances.\n", | ||||||
| "- The relative abundances are compositional data (CoDa), so we map them to unconstrained vectors using the centred log-ratio transformation, `acore.microbiome.internal_functions.calc_clr()`, so that the assumptions of any downstream frequentist statistics are not violated.\n", | ||||||
| "\n", | ||||||
| "### Preprocessing of the metadata \n", | ||||||
| "- The sample metadata needed for this demo (sampling location) were available in each sample's \"sample-desc\".\n", | ||||||
| "- The sample-desc for each sample in both INF and EFF was parsed and used to pair the samples.\n", | ||||||
| "\n", | ||||||
| "### Subset of data for demo\n", | ||||||
| "- For this demo we only look at [GO term GO:0017001](https://www.ebi.ac.uk/QuickGO/term/GO:0017001).\n", | ||||||
| "- Antibiotic catabolic processes are expected to be higher in INF than in EFF.\n", | ||||||
| "\n", | ||||||
| "### Saving the demo dataset\n", | ||||||
| "This example subset of data was saved to a CSV, ./example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv. The data dictionary is below:\n", | ||||||
| "\n", | ||||||
| "| column | description | dtype |\n", | ||||||
| "|-------------------|-------------------------------------------------------------------------------------------------------------------|-------|\n", | ||||||
| "| eff_id | The run id for the MGnify analysis of the effluent sample. | str |\n", | ||||||
| "| inf_id | The run id for the MGnify analysis of the influent sample. | str |\n", | ||||||
| "| sampling_location | [The ISO 3166-1 alpha-2 code](http://iso.org/obp/ui/#iso:pub:PUB500001:en) for the country where the sample was from. | str |\n", | ||||||
| "| sampling_read | Replicates? | str |\n", | ||||||
| "| eff_abundance | The relative abundance of the GO term for a given effluent sample following preprocessing (i.e., CoDA and CLR) | float |\n", | ||||||
| "| inf_abundance | The relative abundance of the GO term for a given influent sample following preprocessing (i.e., CoDA and CLR) | float |\n", | ||||||
| "\n", | ||||||
| "-----\n", | ||||||
| "\n", | ||||||
| "We will now proceed with reading in the prepared dataset. " | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "code", | ||||||
| "execution_count": 1, | ||||||
| "id": "00d8f038", | ||||||
| "metadata": {}, | ||||||
| "outputs": [ | ||||||
| { | ||||||
| "data": { | ||||||
| "text/html": [ | ||||||
| "<div>\n", | ||||||
| "<style scoped>\n", | ||||||
| " .dataframe tbody tr th:only-of-type {\n", | ||||||
| " vertical-align: middle;\n", | ||||||
| " }\n", | ||||||
| "\n", | ||||||
| " .dataframe tbody tr th {\n", | ||||||
| " vertical-align: top;\n", | ||||||
| " }\n", | ||||||
| "\n", | ||||||
| " .dataframe thead th {\n", | ||||||
| " text-align: right;\n", | ||||||
| " }\n", | ||||||
| "</style>\n", | ||||||
| "<table border=\"1\" class=\"dataframe\">\n", | ||||||
| " <thead>\n", | ||||||
| " <tr style=\"text-align: right;\">\n", | ||||||
| " <th></th>\n", | ||||||
| " <th>eff_id</th>\n", | ||||||
| " <th>inf_id</th>\n", | ||||||
| " <th>sampling_location</th>\n", | ||||||
| " <th>sampling_read</th>\n", | ||||||
| " <th>eff_abundance</th>\n", | ||||||
| " <th>inf_abundance</th>\n", | ||||||
| " </tr>\n", | ||||||
| " </thead>\n", | ||||||
| " <tbody>\n", | ||||||
| " <tr>\n", | ||||||
| " <th>0</th>\n", | ||||||
| " <td>ERR2985255</td>\n", | ||||||
| " <td>ERR2814663</td>\n", | ||||||
| " <td>TG</td>\n", | ||||||
| " <td>READ2 Taxonomy ID:256318</td>\n", | ||||||
| " <td>3.257283</td>\n", | ||||||
| " <td>4.226819</td>\n", | ||||||
| " </tr>\n", | ||||||
| " <tr>\n", | ||||||
| " <th>1</th>\n", | ||||||
| " <td>ERR2985256</td>\n", | ||||||
| " <td>ERR2814664</td>\n", | ||||||
| " <td>MN</td>\n", | ||||||
| " <td>READ2 Taxonomy ID:256318</td>\n", | ||||||
| " <td>2.572841</td>\n", | ||||||
| " <td>3.847191</td>\n", | ||||||
| " </tr>\n", | ||||||
| " <tr>\n", | ||||||
| " <th>2</th>\n", | ||||||
| " <td>ERR2985257</td>\n", | ||||||
| " <td>ERR2814651</td>\n", | ||||||
| " <td>AH</td>\n", | ||||||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||||||
| " <td>4.298777</td>\n", | ||||||
| " <td>4.086841</td>\n", | ||||||
| " </tr>\n", | ||||||
| " <tr>\n", | ||||||
| " <th>3</th>\n", | ||||||
| " <td>ERR2985258</td>\n", | ||||||
| " <td>ERR2814667</td>\n", | ||||||
| " <td>TE</td>\n", | ||||||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||||||
| " <td>2.758982</td>\n", | ||||||
| " <td>3.436752</td>\n", | ||||||
| " </tr>\n", | ||||||
| " <tr>\n", | ||||||
| " <th>4</th>\n", | ||||||
| " <td>ERR2985259</td>\n", | ||||||
| " <td>ERR2814660</td>\n", | ||||||
| " <td>FD</td>\n", | ||||||
| " <td>READ1 Taxonomy ID:256318</td>\n", | ||||||
| " <td>3.364675</td>\n", | ||||||
| " <td>3.486673</td>\n", | ||||||
| " </tr>\n", | ||||||
| " </tbody>\n", | ||||||
| "</table>\n", | ||||||
| "</div>" | ||||||
| ], | ||||||
| "text/plain": [ | ||||||
| " eff_id inf_id sampling_location sampling_read \\\n", | ||||||
| "0 ERR2985255 ERR2814663 TG READ2 Taxonomy ID:256318 \n", | ||||||
| "1 ERR2985256 ERR2814664 MN READ2 Taxonomy ID:256318 \n", | ||||||
| "2 ERR2985257 ERR2814651 AH READ1 Taxonomy ID:256318 \n", | ||||||
| "3 ERR2985258 ERR2814667 TE READ1 Taxonomy ID:256318 \n", | ||||||
| "4 ERR2985259 ERR2814660 FD READ1 Taxonomy ID:256318 \n", | ||||||
| "\n", | ||||||
| " eff_abundance inf_abundance \n", | ||||||
| "0 3.257283 4.226819 \n", | ||||||
| "1 2.572841 3.847191 \n", | ||||||
| "2 4.298777 4.086841 \n", | ||||||
| "3 2.758982 3.436752 \n", | ||||||
| "4 3.364675 3.486673 " | ||||||
| ] | ||||||
| }, | ||||||
| "execution_count": 1, | ||||||
| "metadata": {}, | ||||||
| "output_type": "execute_result" | ||||||
| } | ||||||
| ], | ||||||
| "source": [ | ||||||
| "import pandas as pd \n", | ||||||
| "\n", | ||||||
| "df_data = pd.read_csv(\n", | ||||||
| " 'https://raw.githubusercontent.com/Multiomics-Analytics-Group/acore/refs/heads/anglup-learning/example_data/mgnify/Ju2018_GO0017001_enf_inf_paired.csv'\n", | ||||||
| ")\n", | ||||||
| "# sanity check \n", | ||||||
| "df_data.head()" | ||||||
| ] | ||||||
| }, | ||||||
| { | ||||||
| "cell_type": "markdown", | ||||||
| "id": "7cf6c3d0", | ||||||
| "metadata": {}, | ||||||
| "source": [ | ||||||
| "## The permutation test\n", | ||||||
| "\n", | ||||||
| "Since these are paired samples, we will proceed with a paired-sample permutation test using `acore.permutation_test.paired_permutation()`.\n", | ||||||
| "\n", | ||||||
| "The permutation test compares the observed value of a chosen metric (e.g., t-statistic, mean difference) with the values of that metric calculated on randomly shuffled permutations of the dataset.\n", | ||||||
| "\n", | ||||||
| "If we do 100 permutations of our data (although in practice we should do many more) and only 1 of those permutations shows a larger effect size than the actual observed effect, then it suggests there is a 1/100 chance (p-value of 0.01) of the observed effect size having occurred by chance. " | ||||||
| "If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. " | |
| "If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. " |
Copilot
AI
Nov 7, 2025
Missing space in comment: "generatorfor" should be "generator for".
| "# optional choice of random number generatorfor repro\n", | |
| "# optional choice of random number generator for repro\n", |
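For readers of this thread, the idea described in the notebook's permutation test section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation of acore's paired permutation function: the function name `paired_sign_flip_test` and its parameters are hypothetical, the metric is the mean paired difference, and the permutations are generated by randomly flipping the sign of each pair's difference (equivalent to swapping the INF and EFF labels within a pair). The input values are the five pairs shown in the notebook's `df_data.head()` output, and the seed mirrors the "optional choice of random number generator" mentioned above.

```python
import numpy as np


def paired_sign_flip_test(x, y, n_permutations=10_000, rng=None):
    """Two-sided paired permutation test on the mean difference (illustrative sketch)."""
    # Hypothetical helper: accepts a seed, a Generator, or None for reproducibility.
    rng = np.random.default_rng(rng)
    diffs = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    observed = diffs.mean()

    # Under the null hypothesis the labels within each pair are exchangeable,
    # so each difference is equally likely to have either sign.
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    permuted = (signs * diffs).mean(axis=1)

    # Count permutations at least as extreme as the observed value.
    # Adding 1 to numerator and denominator keeps the p-value above zero.
    n_extreme = np.sum(np.abs(permuted) >= np.abs(observed))
    p_value = (n_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# The five pairs visible in the notebook output (CLR-transformed abundances).
eff = [3.257283, 2.572841, 4.298777, 2.758982, 3.364675]
inf = [4.226819, 3.847191, 4.086841, 3.436752, 3.486673]

observed, p_value = paired_sign_flip_test(inf, eff, n_permutations=10_000, rng=42)
print(f"mean difference (INF - EFF): {observed:.3f}, p-value: {p_value:.3f}")
```

With only five pairs there are just 32 distinct sign patterns, so the p-value cannot get very small regardless of how many random permutations are drawn; the full dataset would be needed for a meaningful result.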
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,101 @@ | ||||||
| # --- | ||||||
| # jupyter: | ||||||
| # jupytext: | ||||||
| # text_representation: | ||||||
| # extension: .py | ||||||
| # format_name: percent | ||||||
| # format_version: '1.3' | ||||||
| # jupytext_version: 1.17.3 | ||||||
| # kernelspec: | ||||||
| # display_name: .venv | ||||||
| # language: python | ||||||
| # name: python3 | ||||||
| # --- | ||||||
|
|
||||||
| # %% [markdown] | ||||||
| # # Permutation Tests | ||||||
| # | ||||||
| # In this notebook we will demonstrate how to use acore's permutation testing functions on metagenomics data collected by [Ju and colleagues (2018)](https://doi.org/10.1038/s41396-018-0277-8). | ||||||
| # | ||||||
| # The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058). | ||||||
| # The samples in this demo were collected from wastewaster treatment plant inffluent (MGYS00005056) and effluent (MGYS00005058). | |
| # The samples in this demo were collected from wastewater treatment plant influent (MGYS00005056) and effluent (MGYS00005058). |
Copilot
AI
Nov 7, 2025
Multiple spelling errors: "inffluent" should be "influent" and "EFFF" should be "EFF".
Copilot
AI
Nov 7, 2025
Spelling error: "sizze" should be "size".
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect sizze having occurred by chance. | |
| # If we do 100 permutations of our data (although we should do a bunch more) and only 1 of those permutations falsely showed a larger effect size than the actual observed effect than it suggests there is a 1/100 chance (p value of 0.01) of the observed effect size having occurred by chance. |
Copilot
AI
Nov 7, 2025
Missing space in comment: "generatorfor" should be "generator for".
| # optional choice of random number generatorfor repro | |
| # optional choice of random number generator for repro |
Multiple spelling errors in this line: "wastewaster" should be "wastewater" and "inffluent" should be "influent".
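As a side note on the preprocessing described in the notebook (relative abundances followed by a centred log-ratio transform), the two steps could look roughly like the sketch below. It is not the code of `acore.microbiome.internal_functions.calc_clr()`; the helper names, the pseudocount used to avoid taking the log of zero, and the small count table are all made up for illustration.

```python
import numpy as np


def relative_abundance(counts):
    """Divide each sample's counts (one row per sample) by that sample's total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def clr(rel_abund, pseudocount=1e-6):
    """Centred log-ratio: log of each part minus the mean log of its sample.

    The pseudocount is an assumption of this sketch so that zero abundances
    do not produce -inf; other zero-handling strategies exist.
    """
    logged = np.log(np.asarray(rel_abund, dtype=float) + pseudocount)
    return logged - logged.mean(axis=1, keepdims=True)


# Made-up absolute abundances: 2 samples (rows) x 3 GO terms (columns).
counts = np.array([[120, 30, 50],
                   [10, 80, 10]])

rel = relative_abundance(counts)
clr_values = clr(rel)
print(rel.sum(axis=1))         # each row of relative abundances sums to 1
print(clr_values.sum(axis=1))  # each row of CLR values sums to approximately 0
```

The CLR values are centred per sample, which is what lets them be treated as unconstrained vectors in the downstream test.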