forked from bigbio/quantms
Merge pull request bigbio#2 from bigbio/metadata-files
METADATA FORMAT
Showing 8 changed files with 219 additions and 24 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
# Absolute expression format | ||
|
||
The absolute expression format aims to cover the following use cases: | ||
|
||
- Fast and easy visualization of absolute expression (AE) results using iBAQ values. | ||
- Store the AE results of each protein on each sample. | ||
- Provide information about the condition (factor value) of each sample for easy integration. | ||
- Store metadata information about the project, the workflow and the columns in the file. | ||
|
||
## Format | ||
The absolute expression format by quantms is a tab-delimited file format that contains the following fields: | ||
|
||
- `Protein` -> Protein accession or semicolon-separated list of accessions for indistinguishable groups | ||
- `SampleID` -> Sample accession in the SDRF. | ||
- `Condition` -> Condition name | ||
- `iBAQ` -> iBAQ value | ||
- `riBAQ` -> Relative iBAQ value | ||
|
||
Example: | ||
|
||
| Protein | SampleID | Condition | iBAQ | riBAQ | | ||
| --------- |--------------|-----------|--------| -------| | ||
|LV861_HUMAN | Sample-1 | heart | 1234.1 | 12.34 | | ||
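
The fields above can be read with any tab-delimited parser. The following is a minimal Python sketch, not part of quantms: the function name `read_ae_table` and the in-memory example string are assumptions made for illustration, and the example row is the one from the table above.

```python
import csv
from collections import defaultdict


def read_ae_table(text):
    """Return {SampleID: {Protein: iBAQ}} from a tab-delimited AE table.

    Hypothetical helper: it only assumes the column names listed above.
    """
    ibaq_by_sample = defaultdict(dict)
    reader = csv.DictReader(text.splitlines(), delimiter="\t")
    for row in reader:
        # Protein may be a single accession or a ';'-separated group;
        # it is kept as-is here and used as the key.
        ibaq_by_sample[row["SampleID"]][row["Protein"]] = float(row["iBAQ"])
    return dict(ibaq_by_sample)


example = (
    "Protein\tSampleID\tCondition\tiBAQ\triBAQ\n"
    "LV861_HUMAN\tSample-1\theart\t1234.1\t12.34\n"
)
table = read_ae_table(example)
print(table)
```

Grouping by sample first keeps the per-sample protein values together, which is usually the shape needed for plotting or for joining with the SDRF.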
|
||
## AE Header | ||
|
||
By default, the MSstats format does not include a metadata header. We suggest adding one to the output to make the file easier to interpret. MSstats allows comment lines that start with `#`, so the quantms output starts with key-value pairs that describe the project, the workflow and the columns in the file. For example: | ||
|
||
`#project_accession=PXD000000` | ||
|
||
In addition, for each `Default` column of the matrix the following information should be added: | ||
|
||
``` | ||
#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession"> | ||
#INFO=<ID=SampleID, Number=1, Type=String, Description="Sample Accession in the SDRF"> | ||
#INFO=<ID=Condition, Number=1, Type=String, Description="Value of the factor value"> | ||
#INFO=<ID=iBAQ, Number=1, Type=Float, Description="Intensity based absolute quantification"> | ||
#INFO=<ID=riBAQ, Number=1, Type=Float, Description="relative iBAQ"> | ||
``` | ||
|
||
- The `ID` is the column name in the matrix, the `Number` is the number of values in the column (separated by `;`), the `Type` is the type of the values in the column and the `Description` is a description of the column. The number of values in the column can go from 1 to `inf` (infinity). | ||
- Protein groups are written as a list of protein accessions separated by `;` (e.g. `P12345;P12346`) | ||
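
A `#INFO` line can be split into its `ID`, `Number`, `Type` and `Description` parts with a small regular expression. This is a sketch under the assumption that the line follows exactly the layout shown above (comma-separated `key=value` pairs, with `Description` in double quotes); `parse_info_line` is a hypothetical name, not a quantms function.

```python
import re

# Matches the whole '#INFO=<...>' wrapper and captures what is inside.
INFO_RE = re.compile(r"^#INFO=<(.*)>$")
# A key followed by either a double-quoted value or a value without commas.
FIELD_RE = re.compile(r'(\w+)=("[^"]*"|[^,]+)')


def parse_info_line(line):
    """Return the fields of one '#INFO=<...>' line as a dict, or None."""
    match = INFO_RE.match(line.strip())
    if match is None:
        return None
    fields = {}
    for key, value in FIELD_RE.findall(match.group(1)):
        fields[key] = value.strip().strip('"')
    return fields


info = parse_info_line(
    '#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">'
)
print(info)
```

Lines that are not `#INFO` lines, such as `#project_accession=PXD000000`, return `None`, so the same loop can route them to a separate key-value parser.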
|
||
We suggest including the following properties in the header: | ||
|
||
- project_accession: The project accession in PRIDE Archive | ||
- project_title: The project title in PRIDE Archive | ||
- project_description: The project description in PRIDE Archive | ||
- quantms_version: The version of the quantms workflow used to generate the file | ||
- factor_value: The factor values used in the analysis (e.g. `tissue`) | ||
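
Putting the two parts of the header together, a writer could emit the key-value properties first and the `#INFO` lines after them. The sketch below assumes that ordering and the exact `#INFO` layout shown earlier; `build_header` and the dictionary keys `id`, `number`, `type` and `description` are hypothetical names chosen for this example.

```python
def build_header(properties, columns):
    """Return the '#'-prefixed header block as a single string.

    properties: dict of header properties, e.g. {"project_accession": "PXD000000"}
    columns: list of dicts with the keys id, number, type, description
    """
    lines = [f"#{key}={value}" for key, value in properties.items()]
    for col in columns:
        lines.append(
            '#INFO=<ID={id}, Number={number}, Type={type}, '
            'Description="{description}">'.format(**col)
        )
    return "\n".join(lines) + "\n"


header = build_header(
    {"project_accession": "PXD000000", "factor_value": "tissue"},
    [
        {"id": "Protein", "number": "inf", "type": "String",
         "description": "Protein Accession"},
        {"id": "iBAQ", "number": 1, "type": "Float",
         "description": "Intensity based absolute quantification"},
    ],
)
print(header)
```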
|
||
|
||
For more information, see also the differential expression format: [DE](DE.md). | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# Differential expression format | ||
|
||
## Use cases | ||
|
||
- Store the differentially expressed proteins for a contrast between two conditions, with the corresponding fold changes and p-values. | ||
- Enable easy visualization using tools like [Volcano Plot](https://en.wikipedia.org/wiki/Volcano_plot_(statistics)). | ||
- Enable easy integration with other omics data resources. | ||
- Store metadata information about the project, the workflow and the columns in the file. | ||
|
||
## Format | ||
|
||
The differential expression format by quantms is based on the [MSstats](https://msstats.org/wp-content/uploads/2017/01/MSstats_v3.7.3_manual.pdf) output. The MSstats format is a tab-delimited file that contains the following fields - see example [file](include/PXD004683.csv): | ||
|
||
- `Protein` -> Protein Accession | ||
- `Label` -> Label for the contrast on which the fold changes and p-values are based | ||
- `log2FC` -> Log2 Fold Change | ||
- `SE` -> Standard error of the log2 fold change | ||
- `DF` -> Degrees of freedom of the Student's t-test | ||
- `pvalue` -> Raw p-values | ||
- `adj.pvalue` -> P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg | ||
- `issue` -> Flags any issue with inference for the corresponding protein and comparison, for example `OneConditionMissing` or `CompleteMissing`. | ||
|
||
Example: | ||
|
||
| Protein | Label | log2FC | SE | DF | pvalue | adj.pvalue | issue | | ||
| --------- |--------------------------------| ------ | -- | -- | ------ | ---------- |-------| | ||
|LV861_HUMAN | normal-squamous cell carcinoma | -0.61 | 0.87 | 7.99 | 0.51 | 0.62 | NA | | ||
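
For the volcano plot use case, each row maps to one point: `log2FC` on the x-axis and `-log10(adj.pvalue)` on the y-axis. The sketch below shows that mapping in Python. The function name `volcano_point` and both cutoffs (`fc_cutoff=1.0`, `p_cutoff=0.05`) are assumptions for illustration; the format itself does not prescribe them.

```python
import math


def volcano_point(log2fc, adj_pvalue, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (x, y, significant) for one row of the DE table.

    The cutoffs are illustrative defaults, not values defined by the format.
    """
    # A p-value of 0 has no finite -log10, so it is mapped to infinity.
    neg_log10_p = math.inf if adj_pvalue <= 0 else -math.log10(adj_pvalue)
    significant = abs(log2fc) >= fc_cutoff and adj_pvalue < p_cutoff
    return log2fc, neg_log10_p, significant


# Rounded values taken from the first row of the example file.
x, y, significant = volcano_point(-0.61, 0.62)
print(x, round(y, 3), significant)
```

Rows whose `adj.pvalue` is `NA` need to be skipped before calling this function, since they cannot be converted to a number.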
|
||
## DE Header | ||
|
||
By default, the MSstats format does not include a metadata header. We suggest adding one to the output to make the file easier to interpret. MSstats allows comment lines that start with `#`, so the quantms output starts with key-value pairs that describe the project, the workflow and the columns in the file. For example: | ||
|
||
`#project_accession=PXD000000` | ||
|
||
In addition, for each `Default` column of the matrix the following information should be added: | ||
|
||
``` | ||
#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession"> | ||
#INFO=<ID=Label, Number=1, Type=String, Description="Label for the Conditions combination"> | ||
#INFO=<ID=log2FC, Number=1, Type=Float, Description="Log2 Fold Change"> | ||
#INFO=<ID=SE, Number=1, Type=Float, Description="Standard error of the log2 fold change"> | ||
#INFO=<ID=DF, Number=1, Type=Float, Description="Degree of freedom of the Student test"> | ||
#INFO=<ID=pvalue, Number=1, Type=Float, Description="Raw p-values"> | ||
#INFO=<ID=adj.pvalue, Number=1, Type=Float, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg"> | ||
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison"> | ||
``` | ||
|
||
- The `ID` is the column name in the matrix, the `Number` is the number of values in the column (separated by `;`), the `Type` is the type of the values in the column and the `Description` is a description of the column. The number of values in the column can go from 1 to `inf` (infinity). | ||
- Protein groups are written as a list of protein accessions separated by `;` (e.g. `P12345;P12346`) | ||
|
||
We suggest including the following properties in the header: | ||
|
||
- project_accession: The project accession in PRIDE Archive | ||
- project_title: The project title in PRIDE Archive | ||
- project_description: The project description in PRIDE Archive | ||
- quantms_version: The version of the quantms workflow used to generate the file | ||
- factor_value: The factor values used in the analysis (e.g. `phenotype`) | ||
- fdr_threshold: The FDR threshold used to filter the protein lists (e.g. `adj.pvalue < 0.05`) | ||
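
A reader for the complete file can separate the `#` header from the table and then use `fdr_threshold` to filter the rows. The following Python sketch assumes that `fdr_threshold` holds a plain number (as in the example file, `#fdr_threshold=0.05`) and that the table is tab-delimited. The function names are hypothetical, and the second data row is invented only to exercise the filter.

```python
import csv


def read_quantms_de(text):
    """Split a quantms DE file into (properties, rows)."""
    properties, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#INFO="):
            continue  # column descriptions, not needed for filtering
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            properties[key] = value
        elif line.strip():
            data_lines.append(line)
    rows = list(csv.DictReader(data_lines, delimiter="\t"))
    return properties, rows


def significant_proteins(properties, rows):
    """Return the Protein values whose adj.pvalue is below fdr_threshold."""
    threshold = float(properties["fdr_threshold"])
    return [
        row["Protein"]
        for row in rows
        if row["adj.pvalue"] != "NA" and float(row["adj.pvalue"]) < threshold
    ]


example = "\n".join([
    "#project_accession=PXD004683",
    "#fdr_threshold=0.05",
    '#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession">',
    "Protein\tLabel\tlog2FC\tSE\tDF\tpvalue\tadj.pvalue\tissue",
    "LV861_HUMAN\tnormal-squamous cell carcinoma\t-0.61\t0.87\t7.99\t0.51\t0.62\tNA",
    # Hypothetical row, not taken from the dataset
    "P12345;P12346\tnormal-squamous cell carcinoma\t2.10\t0.40\t8\t0.001\t0.01\tNA",
])
properties, rows = read_quantms_de(example)
hits = significant_proteins(properties, rows)
print(hits)
```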
|
||
|
||
A complete example of a quantms output file can be seen [here](include/PXD004683-quantms.csv). | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
# Metadata Files | ||
|
||
|
||
## Project file | ||
|
||
The project file is a JSON file that contains the metadata of the project. It is also used to link the different files of the project. | ||
The project file contains the following fields: | ||
|
||
- `project_accession` -> ProteomeXchange Identifier -> `string` | ||
- `project_title` -> Project title -> `string` | ||
- `project_description` -> Project description -> `string` | ||
- `project_sample_description` -> Sample description of the project -> `string` | ||
- `project_data_description` -> Data description of the project -> `string` | ||
- `project_pubmed_id` -> PubMed identifier -> `string` | ||
- `organism` -> List of organism names -> `list[string]` | ||
- `organism_part` -> List of organism parts -> `list[string]` | ||
- `disease` -> List of diseases -> `list[string]` | ||
- `cell line` -> List of cell lines (if available) -> `list[string]` | ||
- `instrument` -> List of instrument names -> `list[string]` | ||
- `enzyme` -> List of proteases used for digestion -> `list[string]` | ||
- `experiment_type` -> List of all keywords in ProteomeXchange or PRIDE around the dataset. -> `list[string]` | ||
- `acquisition_properties` -> List of key-value pairs for the acquisition properties (see example below) -> `list[Object]` | ||
- `quantms_files` -> List of all files generated by quantms and collected in the final results folder -> `list[string]` | ||
- `quantms_version` -> Version of quantms used to generate the files -> `string` | ||
- `comments` -> List of comments or additional information needed -> `list[string]` | ||
|
||
Example of `acquisition_properties`: | ||
|
||
```json | ||
"acquisition_properties": [ | ||
{"precursor tolerance": "0.05 Da"}, | ||
{"dissociation method": "HCD"} | ||
] | ||
``` | ||
|
||
The instrument and the enzyme are not part of `acquisition_properties`; they should be written separately in the `instrument` and `enzyme` properties. | ||
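
Because `acquisition_properties` is a list of single-key objects, a consumer will usually want to merge it into one dictionary. The sketch below does that in Python and also checks that `instrument` and `enzyme` were not placed inside it. The function name `load_project` and the instrument and enzyme values are placeholders chosen for this example; only the field names come from the list above.

```python
import json


def load_project(text):
    """Parse a project file and merge acquisition_properties into one dict."""
    project = json.loads(text)
    acquisition = {}
    for entry in project.get("acquisition_properties", []):
        acquisition.update(entry)
    # instrument and enzyme belong in their own top-level fields.
    misplaced = {"instrument", "enzyme"} & set(acquisition)
    if misplaced:
        raise ValueError(f"move {sorted(misplaced)} out of acquisition_properties")
    return project, acquisition


example = """
{
  "project_accession": "PXD000000",
  "instrument": ["Q Exactive"],
  "enzyme": ["Trypsin"],
  "acquisition_properties": [
    {"precursor tolerance": "0.05 Da"},
    {"dissociation method": "HCD"}
  ]
}
"""
project, acquisition = load_project(example)
print(acquisition)
```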
|
||
## Sample file | ||
|
||
Here we only provide the SDRF format used to analyze the data with quantms. The SDRF file is a tab-delimited file that contains the metadata of the samples. | ||
The SDRF file is used to link the different files of the project and to store the metadata of the samples. | ||
|
||
Read more about SDRF [here](https://github.com/bigbio/proteomics-sample-metadata/tree/master/sdrf-proteomics). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
## The quantms output | ||
|
||
[quantms](https://github.com/bigbio/quantms) is an nf-core workflow for processing and analyzing mass spectrometry data. At the end of the workflow it produces multiple output files. This repository defines the outputs that are relevant for developing AI/ML models. | ||
|
||
The `.qms` folder will contain multiple metadata files that will be used to describe the project, the samples, the data acquisition and the data processing. | ||
Each of these files will be described in the following sections: | ||
|
||
- [METADATA.md](METADATA.md): A json file for metadata about the analyzed project | ||
- [AE.md](AE.md) or [DE.md](DE.md): A csv file based on the MSstats (TODO link) format for either absolute expression or differential expression. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
#project_accession=PXD004683 | ||
#project_title=Comparison of Lung Cancer Proteome Profiles 2: TMT | ||
#project_description=The goal of this project is to compare label free quantification, chemical labeling with tandem mass tags, and data independent acquisition discovery proteomics approaches using lung squamous cell carcinomas and adjacent lung tissues. | ||
#quantms_version=1.1 | ||
#factor_value=phenotype | ||
#fdr_threshold=0.05 | ||
#INFO=<ID=Protein, Number=inf, Type=String, Description="Protein Accession"> | ||
#INFO=<ID=Label, Number=1, Type=String, Description="Label for the Conditions combination"> | ||
#INFO=<ID=log2FC, Number=1, Type=Float, Description="Log2 Fold Change"> | ||
#INFO=<ID=SE, Number=1, Type=Float, Description="Standard error of the log2 fold change"> | ||
#INFO=<ID=DF, Number=1, Type=Float, Description="Degree of freedom of the Student test"> | ||
#INFO=<ID=pvalue, Number=1, Type=Float, Description="Raw p-values"> | ||
#INFO=<ID=adj.pvalue, Number=1, Type=Float, Description="P-values adjusted among all the proteins in the specific comparison using the approach by Benjamini and Hochberg"> | ||
#INFO=<ID=issue, Number=1, Type=String, Description="Issue column shows if there is any issue for inference in corresponding protein and comparison"> | ||
Protein Label log2FC SE DF pvalue adj.pvalue issue | ||
sp|A0A075B6I0|LV861_HUMAN normal-squamous cell carcinoma -0.608881401746116 0.874698505540308 7.98511941081737 0.506115855701404 0.617123653183989 NA | ||
sp|A0A075B6I9|LV746_HUMAN;sp|P04211|LV743_HUMAN normal-squamous cell carcinoma 0.0829439874291534 0.443186083429313 3 0.863481956217341 0.906074074166224 NA | ||
sp|A0A075B6K5|LV39_HUMAN;sp|P80748|LV321_HUMAN normal-squamous cell carcinoma 0.351641705304375 0.627271406995195 3 0.614219027352121 0.715426380410605 NA | ||
sp|A0A075B6P5|KV228_HUMAN;sp|A0A087WW87|KV240_HUMAN;sp|P01614|KVD40_HUMAN;sp|P01615|KVD28_HUMAN normal-squamous cell carcinoma 0.623701373790229 0.439899830677411 7.81685717215862 0.194849139992051 0.301042531561804 NA | ||
sp|A0A075B6R9|KVD24_HUMAN;sp|A0A0C4DH68|KV224_HUMAN normal-squamous cell carcinoma 0.816912144580217 0.662487816084451 2.99999865855028 0.305351471232629 0.425359338806872 NA | ||
sp|A0A0A0MS14|HV145_HUMAN normal-squamous cell carcinoma 0.463815419660831 0.35065539946005 7.00000073730547 0.227505049749684 0.339501300487503 NA | ||
sp|A0A0B4J1V2|HV226_HUMAN normal-squamous cell carcinoma 0.330528524034208 0.563852807731553 8.00000000152193 0.573906234550985 0.679287242071378 NA | ||
sp|A0A0B4J1Y9|HV372_HUMAN normal-squamous cell carcinoma 0.17575018697919 0.453684944610657 7.94865778157153 0.708638428068355 0.790767482120683 NA |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
Protein Label log2FC SE DF pvalue adj.pvalue issue | ||
sp|A0A075B6I0|LV861_HUMAN normal-squamous cell carcinoma -0.608881401746116 0.874698505540308 7.98511941081737 0.506115855701404 0.617123653183989 NA | ||
sp|A0A075B6I9|LV746_HUMAN;sp|P04211|LV743_HUMAN normal-squamous cell carcinoma 0.0829439874291534 0.443186083429313 3 0.863481956217341 0.906074074166224 NA | ||
sp|A0A075B6K5|LV39_HUMAN;sp|P80748|LV321_HUMAN normal-squamous cell carcinoma 0.351641705304375 0.627271406995195 3 0.614219027352121 0.715426380410605 NA | ||
sp|A0A075B6P5|KV228_HUMAN;sp|A0A087WW87|KV240_HUMAN;sp|P01614|KVD40_HUMAN;sp|P01615|KVD28_HUMAN normal-squamous cell carcinoma 0.623701373790229 0.439899830677411 7.81685717215862 0.194849139992051 0.301042531561804 NA | ||
sp|A0A075B6R9|KVD24_HUMAN;sp|A0A0C4DH68|KV224_HUMAN normal-squamous cell carcinoma 0.816912144580217 0.662487816084451 2.99999865855028 0.305351471232629 0.425359338806872 NA | ||
sp|A0A0A0MS14|HV145_HUMAN normal-squamous cell carcinoma 0.463815419660831 0.35065539946005 7.00000073730547 0.227505049749684 0.339501300487503 NA | ||
sp|A0A0B4J1V2|HV226_HUMAN normal-squamous cell carcinoma 0.330528524034208 0.563852807731553 8.00000000152193 0.573906234550985 0.679287242071378 NA | ||
sp|A0A0B4J1Y9|HV372_HUMAN normal-squamous cell carcinoma 0.17575018697919 0.453684944610657 7.94865778157153 0.708638428068355 0.790767482120683 NA |