Merge pull request #21 from PRIDE-Archive/dev

minor changes
PRIDE-Archive · Sep 26, 2024 · b2c9b4b
2 parents 8e27277 + 3d88c01
commit b2c9b4b

Showing 5 changed files with 107 additions and 28 deletions.
diff --git a/.gitignore b/.gitignore
@@ -1,4 +1,13 @@
 .pytest_cache
 dist
 build
-*.egg-info
+*.egg-info
+*.egg
+.idea/*
+.idea/
+.vscode
+__pycache__
+*.pyc
+*.pyo
+*.pyd
+paper/paper.pdf
diff --git a/paper/figure.png b/paper/figure.png
diff --git a/paper/infrastructure.pptx b/paper/infrastructure.pptx
diff --git a/paper/paper.bib b/paper/paper.bib
@@ -113,3 +113,78 @@ @article{Mehta2023-og
                spectrometry; multi-omics; proteomics; reproducibility",
   language  = "en"
 }
+
+@article{Deutsch2023-mu,
+  title     = "The {ProteomeXchange} consortium at 10 years: 2023 update",
+  author    = "Deutsch, Eric W and Bandeira, Nuno and Perez-Riverol, Yasset and
+               Sharma, Vagisha and Carver, Jeremy J and Mendoza, Luis and
+               Kundu, Deepti J and Wang, Shengbo and Bandla, Chakradhar and
+               Kamatchinathan, Selvakumar and Hewapathirana, Suresh and
+               Pullman, Benjamin S and Wertz, Julie and Sun, Zhi and Kawano,
+               Shin and Okuda, Shujiro and Watanabe, Yu and MacLean, Brendan
+               and MacCoss, Michael J and Zhu, Yunping and Ishihama, Yasushi
+               and Vizca{\'\i}no, Juan Antonio",
+  abstract  = "Mass spectrometry (MS) is by far the most used experimental
+               approach in high-throughput proteomics. The ProteomeXchange (PX)
+               consortium of proteomics resources
+               (http://www.proteomexchange.org) was originally set up to
+               standardize data submission and dissemination of public MS
+               proteomics data. It is now 10 years since the initial data
+               workflow was implemented. In this manuscript, we describe the
+               main developments in PX since the previous update manuscript in
+               Nucleic Acids Research was published in 2020. The six members of
+               the Consortium are PRIDE, PeptideAtlas (including PASSEL),
+               MassIVE, jPOST, iProX and Panorama Public. We report the current
+               data submission statistics, showcasing that the number of
+               datasets submitted to PX resources has continued to increase
+               every year. As of June 2022, more than 34 233 datasets had been
+               submitted to PX resources, and from those, 20 062 (58.6\%) just
+               in the last three years. We also report the development of the
+               Universal Spectrum Identifiers and the improvements in capturing
+               the experimental metadata annotations. In parallel, we highlight
+               that data re-use activities of public datasets continue to
+               increase, enabling connections between PX resources and other
+               popular bioinformatics resources, novel research and also new
+               data resources. Finally, we summarise the current
+               state-of-the-art in data management practices for sensitive
+               human (clinical) proteomics data.",
+  journal   = "Nucleic Acids Res.",
+  publisher = "Oxford University Press (OUP)",
+  volume    =  51,
+  number    = "D1",
+  pages     = "D1539--D1548",
+  month     =  jan,
+  year      =  2023,
+  copyright = "https://creativecommons.org/licenses/by/4.0/",
+  language  = "en"
+}
+
+@article{Thakur2024-zu,
+  title    = "{EMBL's} European bioinformatics institute ({EMBL-EBI}) in 2023",
+  author   = "Thakur, Matthew and Buniello, Annalisa and Brooksbank, Catherine
+              and Gurwitz, Kim T and Hall, Matthew and Hartley, Matthew and
+              Hulcoop, David G and Leach, Andrew R and Marques, Diana and
+              Martin, Maria and Mithani, Aziz and McDonagh, Ellen M and
+              Mutasa-Gottgens, Euphemia and Ochoa, David and Perez-Riverol,
+              Yasset and Stephenson, James and Varadi, Mihaly and Velankar,
+              Sameer and Vizca{\'\i}no, Juan Antonio and Witham, Rick and McEntyre,
+              Johanna",
+  abstract = "The European Molecular Biology Laboratory's European
+              Bioinformatics Institute (EMBL-EBI) is one of the world's leading
+              sources of public biomolecular data. Based at the Wellcome Genome
+              Campus in Hinxton, UK, EMBL-EBI is one of six sites of the
+              European Molecular Biology Laboratory (EMBL), Europe's only
+              intergovernmental life sciences organisation. This overview
+              summarises the latest developments in the services provided by
+              EMBL-EBI data resources to scientific communities globally. These
+              developments aim to ensure EMBL-EBI resources meet the current
+              and future needs of these scientific communities, accelerating
+              the impact of open biological data for all.",
+  journal  = "Nucleic Acids Res.",
+  volume   =  52,
+  number   = "D1",
+  pages    = "D10--D17",
+  month    =  jan,
+  year     =  2024,
+  language = "en"
+}
diff --git a/paper/paper.md b/paper/paper.md
@@ -28,50 +28,45 @@ bibliography: paper.bib
 ---
 
 # Summary
-# Summary
-
-`pridepy` is a Python client designed to access the PRIDE Archive `[@Perez-Riverol2022-ow]`, a major public repository for proteomics data. The `pridepy` provides a flexible, programmatic interface to search, retrieve, and download data from the PRIDE Archive via its REST API. This tool simplifies the integration of PRIDE data into bioinformatics pipelines, making it easier for researchers to access large datasets programmatically.
 
-`pridepy` can be easily installed using pip:
+The Proteomics Identification Database (PRIDE) [@Perez-Riverol2022-ow] is the world's largest repository for proteomics data and a founding member of ProteomeXchange [@Deutsch2023-mu]. We introduce pridepy, a Python client designed to access PRIDE Archive data, including project metadata and file downloads. pridepy offers a flexible programmatic interface for searching, retrieving, and downloading data via the PRIDE REST API. This tool simplifies the integration of PRIDE datasets into bioinformatics pipelines, making it easier for researchers to handle large datasets programmatically.
 
 # Statement of Need
+
+The PRIDE Archive stores an extensive collection of proteomics data [@Perez-Riverol2022-ow], but accessing this data manually can be inefficient and time-consuming. With the growing need for cloud-based [@Dai2024-yc] and HPC bioinformatics tools [@Mehta2023-og], command-line utilities that interact seamlessly with the PRIDE API are increasingly important. `pridepy` addresses this need by enabling researchers to access PRIDE programmatically from Python, a widely adopted language. It allows efficient integration of datasets into automated workflows, with support for large-scale data transfers via [Aspera](https://www.ibm.com/products/aspera), [Globus](https://www.globus.org/data-transfer), FTP, and HTTPS, making it well suited to scalable, reproducible pipelines.
+
 # Methods
 
-`pridepy` is built in Python and interacts with the [PRIDE Archive REST API](https://www.ebi.ac.uk/pride/ws/archive/v2/swagger-ui.html). The core functionality includes:
+`pridepy` is built in Python and interacts with the [PRIDE Archive REST API](https://www.ebi.ac.uk/pride/ws/archive/v2/swagger-ui.html). The package provides not only data models for each data structure of the API but also a set of command-line tools that make them easier to use. The main features of `pridepy` include:
+
+- The main use case of `pridepy` is downloading files from the PRIDE Archive (**Figure 1**). The PRIDE Archive stores its data in an S3-like storage system called FIRE [@Thakur2024-zu], which also hosts other major EMBL-EBI archives such as ENA (European Nucleotide Archive) and EGA (European Genome-phenome Archive). FIRE data is accessible via multiple protocols, including FTP, Aspera, S3 and Globus. The `pridepy` client provides a simple command-line interface to download files from the PRIDE Archive using these protocols. Each protocol offers different advantages:
+  - FTP: Widely supported and easy to use
+  - Aspera: High-speed file transfers, especially for large files or over long distances
+  - S3 streaming: Easy to download private datasets and stream small files
+  - Globus: Reliable transfers for very large datasets
+
+For private datasets, only S3 streaming is supported and users need to provide submitter or reviewer credentials to access the data.
 
-- Searching for datasets using accession numbers or keywords.
-- Retrieving and downloading raw files or specific project data using different protocols (e.g., Aspera, Globus, FTP, and HTTPS). This is supported by multiple protocols implemented at the PRIDE Archive (Figure 1).
-- Handling biological data types such as proteins and peptides through a high-level interface.
-
-The API client leverages Python's request library to handle HTTP requests and responses. It provides a structured approach to query the database, filter results, and download associated files, including mass spectrometry data.
+The client is available on [PyPI](https://pypi.org/project/pridepy/) and can be installed using `pip`. The source code is hosted on [GitHub](https://github.com/bigbio/pridepy) and is open-source under the Apache 2.0 license. In addition, a conda recipe is available for easy installation in conda environments. The package is continuously tested using GitHub Actions and has been successfully deployed on the EMBL-EBI HPC cluster. 
 
-## Dependencies
+![Figure 1: Architecture of transfer protocols supported by PRIDE Archive](figure.png){ width=80% }
 
-`pridepy` relies on the following main Python libraries:
-- `requests`: For handling HTTP requests
-- `pandas`: For data manipulation and analysis
-- Additional libraries may be required for specific transfer protocols (e.g., Aspera, Globus)
-- Handling biological data types such as proteins and peptides through a high-level interface.
-
-The API client leverages Python's request library to handle HTTP requests and responses. It provides a structured approach to query the database, filter results, and download associated files, including mass spectrometry data.
-# Usage
+# Downloading files from PRIDE Archive
 
-The main features of `pridepy` include:
-- `download_all_raw_files`: Downloads all raw files for a given project.
-- `search_projects_by_keywords_and_filters`: Searches PRIDE Archive projects by keyword, species, instrument, etc.
-- `search_protein_evidences`: Retrieves protein evidence associated with a project.
+Users can download files from the PRIDE Archive using two commands: `download-all-raw-files` and `download-files-by-name`. The `download-all-raw-files` command downloads all raw files from a dataset, while the `download-files-by-name` command downloads a single file by name. Users can specify the output directory, protocol (FTP, Aspera, S3, or Globus), and other options to customize the download process. `pridepy` implements error handling and retry mechanisms to help downloads complete, especially for large datasets or unstable network connections. An example of downloading all raw files from a dataset using Aspera is shown below:
 
-Here's a simple example of how to use `pridepy` to download raw files:
-- `search_protein_evidences`: Retrieves protein evidence associated with a project.
+```bash
+$ pridepy download-all-raw-files -a PXD012353 -o /Users/yourname/Downloads/foldername/ -p aspera
+```
 
 This makes the client suitable for handling large-scale proteomics data in automated workflows, particularly in environments requiring bulk downloads of proteomics datasets.
 
 # Discussion and Future Directions
 
-`pridepy` successfully simplifies access to the PRIDE Archive, but future development could focus on improving the handling of large downloads by implementing parallel downloads or better error handling mechanisms. Additionally, adding more advanced querying capabilities, such as custom filters for specific peptide or protein properties, would make the tool even more powerful for large-scale proteomics analysis. Expanding user documentation and examples could help broaden its use within the scientific community.
+`pridepy` successfully simplifies access to PRIDE Archive data, but future development could focus on improving the handling of large downloads by implementing parallel downloads. Additionally, we will expand the user documentation and examples to help broaden its use within the scientific community, and at the same time produce a set of benchmarks to evaluate the performance of the client in different scenarios. We plan to continue extending the library to support more features of the PRIDE Archive API, such as dataset metadata streaming and submission of new datasets to the PRIDE Archive.
 
 # Acknowledgments
 
-We would like to thank the PRIDE Archive team and contributors to this project for their invaluable input and feedback.
+We would like to thank the PRIDE Archive team and contributors to this project for their invaluable input and feedback. The work is supported by core funding from the European Molecular Biology Laboratory (EMBL) and the Wellcome Trust [grant numbers 208391/Z/17/Z and 223745/Z/21/Z], and the BBSRC grant ‘DIA-Exchange’ [BB/X001911/1]. 
 
 # References
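
Note on the `download-all-raw-files` example added to paper/paper.md: the same invocation could be scripted from Python inside a pipeline. The block below is a minimal sketch under stated assumptions, not pridepy's own implementation. The `-a`, `-o` and `-p` flags and the `aspera` value come from the command in the diff; the other protocol spellings (`ftp`, `s3`, `globus`) and the helper names `build_download_command` and `run_with_retries` are hypothetical. The retry helper is a generic exponential backoff wrapper and says nothing about how pridepy retries internally.

```python
import shlex
import time

# Only "aspera" appears in the diff; the other spellings are assumptions.
ASSUMED_PROTOCOLS = {"ftp", "aspera", "s3", "globus"}


def build_download_command(accession, output_dir, protocol="aspera"):
    """Return the argument list for the CLI call shown in the paper."""
    if protocol not in ASSUMED_PROTOCOLS:
        raise ValueError(f"unknown protocol: {protocol!r}")
    return [
        "pridepy", "download-all-raw-files",
        "-a", accession,   # project accession, e.g. PXD012353
        "-o", output_dir,  # output directory
        "-p", protocol,    # transfer protocol
    ]


def run_with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the delay after each failure.

    Generic backoff sketch (hypothetical helper, not part of pridepy).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as error:
            last_error = error
            # Do not sleep after the final failed attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


if __name__ == "__main__":
    command = build_download_command(
        "PXD012353", "/Users/yourname/Downloads/foldername/"
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(command))
```

To actually run the download, the argument list could be passed to `subprocess.run(command, check=True)` inside a function handed to `run_with_retries`, assuming the `pridepy` executable is on the `PATH`.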