Merge pull request bigbio#2 from bigbio/metadata-files

METADATA FORMAT
mobiusklein · Jul 27, 2023 · f5acdb7
2 parents 8eff5d2 + bcdc80f
commit f5acdb7
Showing 8 changed files with 219 additions and 24 deletions.
diff --git a/.gitignore b/.gitignore
@@ -157,4 +157,6 @@ cython_debug/
 #  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+
+.idea/*
+.idea/.gitignore
diff --git a/README.md b/README.md
@@ -1,33 +1,22 @@
 # quantms.io
 
-The proteomics quantification formats, is a Github repository that aims to formalize some existing proteomics quantification formats for large scale quantification experiments. Different from previous efforts like (mzTab, quantML, etc), the present representation aims to reuse existing popular formats and work in the following use cases: 
+[quantms](https://docs.quantms.org) is an nf-core pipeline for the analysis of quantitative proteomics data. The pipeline is based on the [OpenMS](https://www.openms.de/) framework and [DIA-NN](https://github.com/vdemichev/DiaNN), and it is designed to analyze large-scale experiments. The main outputs of quantms are the following:
 
-**Note**: Before starting, for a more generic/extended MS format for quantitative proteomics please check the [mzTab](https://github.com/HUPO-PSI/mzTab) format. mzTab is a generic format for MS data, including quantification data for proteomics and metabolomics experiments. Aim to capture not only the quantitative information but also, the identification information, including the peptide spectrum matches (psms), post-translational modifications, etc.   
+- [mzTab](https://github.com/HUPO-PSI/mzTab) files with the identification and quantification information.
+- [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) input file with the peptide quantification values needed for the MSstats analysis.
+- [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) output file with the differential expression values for each protein. 
+- The input [SDRF](https://github.com/bigbio/proteomics-sample-metadata) of the pipeline if available. 
 
-## Why a new format?
+Here, we aim to formalize and develop a more standardized format that not only represents the identification and quantification results better but also enables new use cases for proteomics data analysis:
 
-Why other efforts like mzTab, quantML, have been developed to represent quantitative proteomics data, we believe those formats are not enough to represent the following information, and also fails to handle the following cases: 
+- Fast and easy visualization of the identification and quantification results.
+- Easy integration with other omics data.
+- Easy integration with sample metadata.
+- AI/ML model development based on identification and quantification results.
 
-- QuantML and mzTab are design for DDA experiments, with lower number of ms_runs.  
-   - In mzTab, when the number of ms_runs increases, the number of columns in the peptide table with null values increases making difficult the reliability of the format.
-   - In mzTab, when the number of ms_runs and samples increases, the metadata section increases making difficult the reliability of the format. for each ms_run, at least 5 sections are needed, in an experiment with 1000 ms_runs, 5000 sections are needed. 
-   - QuantML was never designed to handle large scale experiments, and the format is not flexible enough to handle large scale experiments.
-- mzTab is a large format making difficult to handle and visualize quantitative information as simple as: 
-   - Different expression results tables. 
-   - Raw intensities tables at peptide/protein information.
-- Sample metadata integration with [SDRF-Proteomics](https://github.com/bigbio/proteomics-sample-metadata) format is not possible.
+**Note**: We are not trying to replace the mzTab format, but to provide a new format that enables AI-related use cases. Most of the features of the mzTab format will be included in the new format.  
 
-More important, both formats and previous efforts do not provide enough tooling framework to enable bioinformatic software packages, main reason why the formats are yet popular within the bioinformatics community. 
-
-## Gols and Use cases
-
-The main goals of this repository are:
-
-- Provide a data model to represent quantitative proteomics data for absolute quantification and differential expression experiments.
-- Provide a data model to represent quantitative proteomics data for DIA and DDA experiments.
-- Provide a data model to represent quantitative proteomics data for large scale experiments.
-- Provide a data model that enable integration with the Sample to Data Relationship Format (SDRF-Proteomics) for proteomics experiments.
-- Provide a data model to represent protein and peptide quantification data.
+## Data model
 
 The GitHub repository aims to provide multiple formats for serialization of the data model, including:
 

diff --git a/docs/AE.md b/docs/AE.md
@@ -0,0 +1,56 @@
+# Absolute expression format
+
+The absolute expression format aims to cover the following use cases:
+
+- Fast and easy visualization of absolute expression (AE) results using iBAQ values.
+- Store the AE results of each protein on each sample.
+- Provide information about the condition (factor value) of each sample for easy integration.
+- Store metadata information about the project, the workflow and the columns in the file.
+
+## Format 
+The absolute expression format by quantms is a tab-delimited file format that contains the following fields:
+
+- `Protein` -> Protein accession or semicolon-separated list of accessions for indistinguishable groups
+- `SampleID` -> Sample accession in the SDRF.
+- `Condition` -> Condition name
+- `iBAQ` -> iBAQ value
+- `riBAQ` -> Relative iBAQ value
+
+Example: 
+
+| Protein     | SampleID | Condition | iBAQ   | riBAQ |
+| ----------- | -------- | --------- | ------ | ----- |
+| LV861_HUMAN | Sample-1 | heart     | 1234.1 | 12.34 |
+
+## AE Header 
+
+By default, the MSstats format does not include a metadata header. We suggest adding one to the output to make the file easier to understand. MSstats allows comment lines in the file if they start with `#`, so the quantms output starts with key-value pairs that describe the project, the workflow, and the columns in the file. For example:
+
+`#project_accession=PXD000000`
+
+In addition, for each `Default` column of the matrix the following information should be added: 
+
+```
+#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">
+#INFO=<ID=SampleID, Number=1, Type=String, Description="Sample Accession in the SDRF">
+#INFO=<ID=Condition, Number=1, Type=String, Description="Value of the factor value">
+#INFO=<ID=iBAQ, Number=1, Type=Float, Description="Intensity based absolute quantification">
+#INFO=<ID=riBAQ, Number=1, Type=Float, Description="relative iBAQ">
+```
+
+- The `ID` is the column name in the matrix, the `Number` is the number of values in the column (separated by `;`), the `Type` is the type of the values in the column and the `Description` is a description of the column. The number of values in the column can go from 1 to `inf` (infinity).
+- Protein groups are written as a list of protein accessions separated by `;` (e.g. `P12345;P12346`) 
+
+We suggest including the following properties in the header: 
+
+- project_accession: The project accession in PRIDE Archive
+- project_title: The project title in PRIDE Archive
+- project_description: The project description in PRIDE Archive
+- quanmts_version: The version of the quantms workflow used to generate the file
+- factor_value: The factor values used in the analysis (e.g. `tissue`)
+
+
+For more information, see also the differential expression example: [DE](DE.md).
+
+
+
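As an illustration of the AE layout described above, the following is a minimal Python sketch that reads an AE file with the standard library only. It assumes that `#` header lines come before the column row and that protein groups are separated by `;`, as the document states. The function name `read_ae`, the `AE_TEXT` sample, and the values in its second data row are hypothetical and exist only for this example.

```python
import csv
import io

# Hypothetical AE content following the layout described in AE.md.
# The second data row (and its values) is invented for illustration only.
AE_TEXT = (
    "#project_accession=PXD000000\n"
    "Protein\tSampleID\tCondition\tiBAQ\triBAQ\n"
    "LV861_HUMAN\tSample-1\theart\t1234.1\t12.34\n"
    "P12345;P12346\tSample-1\theart\t10.0\t0.1\n"
)


def read_ae(handle):
    """Yield one dict per AE row, skipping '#' header lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    for row in csv.DictReader(data_lines, delimiter="\t"):
        yield {
            # A protein group is a ';'-separated list of accessions.
            "proteins": row["Protein"].split(";"),
            "sample": row["SampleID"],
            "condition": row["Condition"],
            "ibaq": float(row["iBAQ"]),
            "ribaq": float(row["riBAQ"]),
        }


rows = list(read_ae(io.StringIO(AE_TEXT)))
```

Splitting `Protein` into a list keeps indistinguishable groups intact while still allowing lookups by single accession.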
diff --git a/docs/DE.md b/docs/DE.md
@@ -0,0 +1,64 @@
+# Differential expression format
+
+## Use cases
+
+- Store the differentially expressed proteins between two contrasts, with the corresponding fold changes and p-values.
+- Enable easy visualization using tools like [Volcano Plot](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)).
+- Enable easy integration with other omics data resources. 
+- Store metadata information about the project, the workflow and the columns in the file.
+
+## Format
+
+The differential expression format by quantms is based on the [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) output. The MSstats format is a tab-delimited file that contains the following fields - see example [file](include/PXD004683.csv):
+
+- `Protein` -> Protein Accession
+- `Label` -> Label for the contrast on which the fold changes and p-values are based
+- `log2FC` -> Log2 fold change
+- `SE` -> Standard error of the log2 fold change
+- `DF` -> Degrees of freedom of the Student's t-test
+- `pvalue` -> Raw p-value
+- `adj.pvalue` -> P-value adjusted across all the proteins in the specific comparison using the approach by Benjamini and Hochberg
+- `issue` -> Shows whether there is any issue with inference for the corresponding protein and comparison, for example `OneConditionMissing` or `CompleteMissing`.
+
+Example: 
+
+| Protein     | Label                          | log2FC | SE   | DF | pvalue | adj.pvalue | issue |
+| ----------- | ------------------------------ | ------ | ---- | -- | ------ | ---------- | ----- |
+| LV861_HUMAN | normal-squamous cell carcinoma | -0.61  | 0.87 | 8  | 0.51   | 0.62       | NA    |
+
+## DE Header 
+
+By default, the MSstats format does not include a metadata header. We suggest adding one to the output to make the file easier to understand. MSstats allows comment lines in the file if they start with `#`, so the quantms output starts with key-value pairs that describe the project, the workflow, and the columns in the file. For example:
+
+`#project_accession=PXD000000`
+
+In addition, for each `Default` column of the matrix the following information should be added: 
+
+```
+#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">
+#INFO=<ID=Label, Number=1, Type=String, Description="Label for the Conditions combination">
+#INFO=<ID=log2FC, Number=1, Type=Float, Description="Log2 Fold Change">
+#INFO=<ID=SE, Number=1, Type=Float, Description="Standard error of the log2 fold change">
+#INFO=<ID=DF, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
+#INFO=<ID=pvalue, Number=1, Type=Float, Description="Raw p-values">
+#INFO=<ID=adj.pvalue, Number=1, Type=Float, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
+#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">
+```
+
+- The `ID` is the column name in the matrix, the `Number` is the number of values in the column (separated by `;`), the `Type` is the type of the values in the column and the `Description` is a description of the column. The number of values in the column can go from 1 to `inf` (infinity).
+- Protein groups are written as a list of protein accessions separated by `;` (e.g. `P12345;P12346`) 
+
+We suggest including the following properties in the header: 
+
+- project_accession: The project accession in PRIDE Archive
+- project_title: The project title in PRIDE Archive
+- project_description: The project description in PRIDE Archive
+- quanmts_version: The version of the quantms workflow used to generate the file
+- factor_value: The factor values used in the analysis (e.g. `phenotype`)
+- fdr_threshold: The FDR threshold used to filter the protein lists (e.g. `adj.pvalue < 0.05`)
+
+
+A complete example of a quantms output file can be seen [here](include/PXD004683-quantms.csv).
+
+
+
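The header described above can be read back with a few lines of Python. This is a sketch under stated assumptions, not a reference parser: it assumes every header line starts with `#`, that `#INFO=<...>` entries use comma-separated `key=value` fields with optional double quotes, and that the header ends at the first line without `#`. The names `parse_header`, `INFO_RE`, and `FIELD_RE` are hypothetical.

```python
import re

# '#INFO=<...>' wrapper and the 'key=value' fields inside it.
INFO_RE = re.compile(r'^#INFO=<(.*)>$')
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|[^,]+)')


def parse_header(lines):
    """Split '#' header lines into project properties and column definitions."""
    properties = {}
    columns = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            break  # first non-comment line: the header is over
        info = INFO_RE.match(line)
        if info:
            fields = {
                key: value.strip().strip('"')
                for key, value in FIELD_RE.findall(info.group(1))
            }
            columns[fields["ID"]] = fields
        else:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line[1:].partition("=")
            properties[key] = value
    return properties, columns


HEADER = [
    "#project_accession=PXD004683",
    "#factor_value=phenotype",
    "#fdr_threshold=0.05",
    '#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">',
    '#INFO=<ID=DF, Number=1, Type=Integer, Description="Degree of freedom of the Student test">',
    "Protein\tLabel\tlog2FC\tSE\tDF\tpvalue\tadj.pvalue\tissue",
]

props, cols = parse_header(HEADER)
```

All values are kept as strings here; converting `Number` or `fdr_threshold` to numeric types is left to the caller.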
diff --git a/docs/METADATA.md b/docs/METADATA.md
@@ -0,0 +1,43 @@
+# Metadata Files
+
+
+## Project file
+
+The project file is a JSON file that links the different files of the project and stores the project metadata.
+It contains the following fields:
+
+- `project_accession` -> ProteomeXchange Identifier -> `string`
+- `project_title` -> Project title -> `string` 
+- `project_description` -> Project description -> `string`
+- `project_sample_description` -> Sample description of the project -> `string` 
+- `project_data_description` -> Data description of the project -> `string`
+- `project_pubmed_id` -> PubMed identifier -> `string`
+- `organism` -> List of organism names -> `list[string]`
+- `organism_part` -> List of organism parts -> `list[string]`
+- `disease` -> List of diseases -> `list[string]`
+- `cell line` -> List of cell lines (if available) -> `list[string]`
+- `instrument` -> List of instrument names -> `list[string]`
+- `enzyme` -> List of proteases used for digestion -> `list[string]`
+- `experiment_type` -> List of all keywords in ProteomeXchange or PRIDE around the dataset. -> `list[string]`
+- `acquisition_properties` -> List of key-value pairs for the acquisition properties (see example below) -> `list[Object]`
+- `quantms_files` -> List of all files generated by quantms and collected in the final results folder -> `list[string]`
+- `quantms_version` -> Version of quantms used to generate the files -> `string`
+- `comments` -> List of comments or additional information needed -> `list[string]`
+
+Example of `acquisition_properties`:
+
+```json
+"acquisition_properties": [
+     {"precursor tolerance": "0.05 Da"},
+     {"dissociation method": "HCD"}
+]
+```
+
+The instrument and the enzyme are not part of `acquisition_properties`; they should be written separately in the `instrument` and `enzyme` properties.
+
+## Sample file
+
+Here we only provide the SDRF file used to analyze the data with quantms. The SDRF file is a tab-delimited file that contains the metadata of the samples.
+It is used to link the different files of the project and to store the sample metadata.
+
+Read more about SDRF [here](https://github.com/bigbio/proteomics-sample-metadata/tree/master/sdrf-proteomics).
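To make the project file layout more concrete, here is a small Python sketch that checks the JSON types listed in METADATA.md. The document does not say which fields are mandatory, so the sketch only checks fields that are present. The function `check_project` and the placeholder values in `project` are hypothetical; only the field names and the `acquisition_properties` example come from the text above.

```python
import json

# Expected JSON type per field, following the field list in METADATA.md.
STRING_FIELDS = [
    "project_accession", "project_title", "project_description",
    "project_sample_description", "project_data_description",
    "project_pubmed_id", "quantms_version",
]
LIST_FIELDS = [
    "organism", "organism_part", "disease", "cell line", "instrument",
    "enzyme", "experiment_type", "acquisition_properties",
    "quantms_files", "comments",
]


def check_project(project):
    """Return a list of type problems; an empty list means none were found."""
    problems = []
    for name in STRING_FIELDS:
        if name in project and not isinstance(project[name], str):
            problems.append(f"{name}: expected string")
    for name in LIST_FIELDS:
        if name in project and not isinstance(project[name], list):
            problems.append(f"{name}: expected list")
    return problems


# Placeholder project; the title and instrument values are illustrative only.
project = {
    "project_accession": "PXD000000",
    "project_title": "Example project",
    "instrument": ["Example instrument"],
    "acquisition_properties": [
        {"precursor tolerance": "0.05 Da"},
        {"dissociation method": "HCD"},
    ],
}

# Round-trip through JSON, as the file would be stored on disk.
loaded = json.loads(json.dumps(project))
problems = check_project(loaded)
```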
diff --git a/docs/README.md b/docs/README.md
@@ -0,0 +1,9 @@
+## The quantms output
+
+[quantms](https://github.com/bigbio/quantms) is an nf-core workflow for processing and analyzing mass spectrometry data. At the end of the workflow, it provides multiple output files. This repo defines the outputs that are relevant for AI/ML model development.
+
+The `.qms` folder will contain multiple metadata files that will be used to describe the project, the samples, the data acquisition and the data processing. 
+Each of these files will be described in the following sections: 
+
+- [METADATA.md](METADATA.md): A JSON file with metadata about the analyzed project
+- [AE.md](AE.md) or [DE.md](DE.md): A csv file based on the MSstats (TODO link) format for either absolute expression or differential expression.
diff --git a/docs/include/PXD004683-quantms.csv b/docs/include/PXD004683-quantms.csv
@@ -0,0 +1,23 @@
+#project_accession=PXD004683
+#project_title=Comparison of Lung Cancer Proteome Profiles 2: TMT
+#project_description=The goal of this project is to compare label free quantification, chemical labeling with tandem mass tags, and data independent acquisition discovery proteomics approaches using lung squamous cell carcinomas and adjacent lung tissues.
+#quanmts_version=1.1
+#factor_value=phenotype
+#fdr_threshold=0.05
+#INFO=<ID=Protein, Number=1, Type=String, Description="Protein Accession">
+#INFO=<ID=Label, Number=1, Type=String, Description="Label for the Conditions combination">
+#INFO=<ID=log2FC, Number=1, Type=Float, Description="Log2 Fold Change">
+#INFO=<ID=SE, Number=1, Type=Float, Description="Standard error of the log2 fold change">
+#INFO=<ID=DF, Number=1, Type=Integer, Description="Degree of freedom of the Student test">
+#INFO=<ID=pvalue, Number=1, Type=Float, Description="Raw p-values">
+#INFO=<ID=adj.pvalue, Number=1, Type=Float, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg">
+#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison">
+Protein	Label	log2FC	SE	DF	pvalue	adj.pvalue	issue
+sp|A0A075B6I0|LV861_HUMAN	normal-squamous cell carcinoma	-0.608881401746116	0.874698505540308	7.98511941081737	0.506115855701404	0.617123653183989	NA
+sp|A0A075B6I9|LV746_HUMAN;sp|P04211|LV743_HUMAN	normal-squamous cell carcinoma	0.0829439874291534	0.443186083429313	3	0.863481956217341	0.906074074166224	NA
+sp|A0A075B6K5|LV39_HUMAN;sp|P80748|LV321_HUMAN	normal-squamous cell carcinoma	0.351641705304375	0.627271406995195	3	0.614219027352121	0.715426380410605	NA
+sp|A0A075B6P5|KV228_HUMAN;sp|A0A087WW87|KV240_HUMAN;sp|P01614|KVD40_HUMAN;sp|P01615|KVD28_HUMAN	normal-squamous cell carcinoma	0.623701373790229	0.439899830677411	7.81685717215862	0.194849139992051	0.301042531561804	NA
+sp|A0A075B6R9|KVD24_HUMAN;sp|A0A0C4DH68|KV224_HUMAN	normal-squamous cell carcinoma	0.816912144580217	0.662487816084451	2.99999865855028	0.305351471232629	0.425359338806872	NA
+sp|A0A0A0MS14|HV145_HUMAN	normal-squamous cell carcinoma	0.463815419660831	0.35065539946005	7.00000073730547	0.227505049749684	0.339501300487503	NA
+sp|A0A0B4J1V2|HV226_HUMAN	normal-squamous cell carcinoma	0.330528524034208	0.563852807731553	8.00000000152193	0.573906234550985	0.679287242071378	NA
+sp|A0A0B4J1Y9|HV372_HUMAN	normal-squamous cell carcinoma	0.17575018697919	0.453684944610657	7.94865778157153	0.708638428068355	0.790767482120683	NA
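The file above can feed the volcano-plot use case named in DE.md. The sketch below, in plain Python, turns each row into an (x, y) point with x = `log2FC` and y = -log10(`pvalue`) and flags rows whose `adj.pvalue` is below the threshold. It assumes the threshold applies to `adj.pvalue`, as the `fdr_threshold` description suggests, and it skips rows whose p-value is `NA`. The function `volcano_points` is hypothetical; the two data rows are copied from the file.

```python
import csv
import io
import math

# Header line and two data rows copied from PXD004683-quantms.csv.
DE_TEXT = (
    "#fdr_threshold=0.05\n"
    "Protein\tLabel\tlog2FC\tSE\tDF\tpvalue\tadj.pvalue\tissue\n"
    "sp|A0A075B6I0|LV861_HUMAN\tnormal-squamous cell carcinoma\t"
    "-0.608881401746116\t0.874698505540308\t7.98511941081737\t"
    "0.506115855701404\t0.617123653183989\tNA\n"
    "sp|A0A0A0MS14|HV145_HUMAN\tnormal-squamous cell carcinoma\t"
    "0.463815419660831\t0.35065539946005\t7.00000073730547\t"
    "0.227505049749684\t0.339501300487503\tNA\n"
)


def volcano_points(handle, fdr_threshold=0.05):
    """Return one volcano-plot point per row that has a usable p-value."""
    data_lines = (line for line in handle if not line.startswith("#"))
    points = []
    for row in csv.DictReader(data_lines, delimiter="\t"):
        if row["pvalue"] == "NA" or row["adj.pvalue"] == "NA":
            continue  # no test result for this protein and comparison
        points.append({
            "protein": row["Protein"],
            "x": float(row["log2FC"]),
            "y": -math.log10(float(row["pvalue"])),
            "significant": float(row["adj.pvalue"]) < fdr_threshold,
        })
    return points


points = volcano_points(io.StringIO(DE_TEXT))
```

The resulting list can be passed to any plotting library; none is imported here to keep the sketch dependency-free.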
diff --git a/docs/include/PXD004683.csv b/docs/include/PXD004683.csv
@@ -0,0 +1,9 @@
+Protein	Label	log2FC	SE	DF	pvalue	adj.pvalue	issue
+sp|A0A075B6I0|LV861_HUMAN	normal-squamous cell carcinoma	-0.608881401746116	0.874698505540308	7.98511941081737	0.506115855701404	0.617123653183989	NA
+sp|A0A075B6I9|LV746_HUMAN;sp|P04211|LV743_HUMAN	normal-squamous cell carcinoma	0.0829439874291534	0.443186083429313	3	0.863481956217341	0.906074074166224	NA
+sp|A0A075B6K5|LV39_HUMAN;sp|P80748|LV321_HUMAN	normal-squamous cell carcinoma	0.351641705304375	0.627271406995195	3	0.614219027352121	0.715426380410605	NA
+sp|A0A075B6P5|KV228_HUMAN;sp|A0A087WW87|KV240_HUMAN;sp|P01614|KVD40_HUMAN;sp|P01615|KVD28_HUMAN	normal-squamous cell carcinoma	0.623701373790229	0.439899830677411	7.81685717215862	0.194849139992051	0.301042531561804	NA
+sp|A0A075B6R9|KVD24_HUMAN;sp|A0A0C4DH68|KV224_HUMAN	normal-squamous cell carcinoma	0.816912144580217	0.662487816084451	2.99999865855028	0.305351471232629	0.425359338806872	NA
+sp|A0A0A0MS14|HV145_HUMAN	normal-squamous cell carcinoma	0.463815419660831	0.35065539946005	7.00000073730547	0.227505049749684	0.339501300487503	NA
+sp|A0A0B4J1V2|HV226_HUMAN	normal-squamous cell carcinoma	0.330528524034208	0.563852807731553	8.00000000152193	0.573906234550985	0.679287242071378	NA
+sp|A0A0B4J1Y9|HV372_HUMAN	normal-squamous cell carcinoma	0.17575018697919	0.453684944610657	7.94865778157153	0.708638428068355	0.790767482120683	NA