Merge pull request #21 from PRIDE-Archive/dev
minor changes
ypriverol authored Sep 26, 2024
2 parents 8e27277 + 3d88c01 commit b2c9b4b
Showing 5 changed files with 107 additions and 28 deletions.
11 changes: 10 additions & 1 deletion .gitignore
@@ -1,4 +1,13 @@
.pytest_cache
dist
build
*.egg-info
*.egg-info
*.egg
.idea/*
.idea/
.vscode
__pycache__
*.pyc
*.pyo
*.pyd
paper/paper.pdf
Binary file modified paper/figure.png
Binary file added paper/infrastructure.pptx
Binary file not shown.
75 changes: 75 additions & 0 deletions paper/paper.bib
Expand Up @@ -113,3 +113,78 @@ @article{Mehta2023-og
spectrometry; multi-omics; proteomics; reproducibility",
language = "en"
}

@article{Deutsch2023-mu,
title = "The {ProteomeXchange} consortium at 10 years: 2023 update",
author = "Deutsch, Eric W and Bandeira, Nuno and Perez-Riverol, Yasset and
Sharma, Vagisha and Carver, Jeremy J and Mendoza, Luis and
Kundu, Deepti J and Wang, Shengbo and Bandla, Chakradhar and
Kamatchinathan, Selvakumar and Hewapathirana, Suresh and
Pullman, Benjamin S and Wertz, Julie and Sun, Zhi and Kawano,
Shin and Okuda, Shujiro and Watanabe, Yu and MacLean, Brendan
and MacCoss, Michael J and Zhu, Yunping and Ishihama, Yasushi
and Vizca{\'\i}no, Juan Antonio",
abstract = "Mass spectrometry (MS) is by far the most used experimental
approach in high-throughput proteomics. The ProteomeXchange (PX)
consortium of proteomics resources
(http://www.proteomexchange.org) was originally set up to
standardize data submission and dissemination of public MS
proteomics data. It is now 10 years since the initial data
workflow was implemented. In this manuscript, we describe the
main developments in PX since the previous update manuscript in
Nucleic Acids Research was published in 2020. The six members of
the Consortium are PRIDE, PeptideAtlas (including PASSEL),
MassIVE, jPOST, iProX and Panorama Public. We report the current
data submission statistics, showcasing that the number of
datasets submitted to PX resources has continued to increase
every year. As of June 2022, more than 34 233 datasets had been
submitted to PX resources, and from those, 20 062 (58.6\%) just
in the last three years. We also report the development of the
Universal Spectrum Identifiers and the improvements in capturing
the experimental metadata annotations. In parallel, we highlight
that data re-use activities of public datasets continue to
increase, enabling connections between PX resources and other
popular bioinformatics resources, novel research and also new
data resources. Finally, we summarise the current
state-of-the-art in data management practices for sensitive
human (clinical) proteomics data.",
journal = "Nucleic Acids Res.",
publisher = "Oxford University Press (OUP)",
volume = 51,
number = "D1",
pages = "D1539--D1548",
month = jan,
year = 2023,
copyright = "https://creativecommons.org/licenses/by/4.0/",
language = "en"
}

@article{Thakur2024-zu,
title = "{EMBL's} European bioinformatics institute ({EMBL-EBI}) in 2023",
author = "Thakur, Matthew and Buniello, Annalisa and Brooksbank, Catherine
and Gurwitz, Kim T and Hall, Matthew and Hartley, Matthew and
Hulcoop, David G and Leach, Andrew R and Marques, Diana and
Martin, Maria and Mithani, Aziz and McDonagh, Ellen M and
Mutasa-Gottgens, Euphemia and Ochoa, David and Perez-Riverol,
Yasset and Stephenson, James and Varadi, Mihaly and Velankar,
Sameer and Vizcaino, Juan Antonio and Witham, Rick and McEntyre,
Johanna",
abstract = "The European Molecular Biology Laboratory's European
Bioinformatics Institute (EMBL-EBI) is one of the world's leading
sources of public biomolecular data. Based at the Wellcome Genome
Campus in Hinxton, UK, EMBL-EBI is one of six sites of the
European Molecular Biology Laboratory (EMBL), Europe's only
intergovernmental life sciences organisation. This overview
summarises the latest developments in the services provided by
EMBL-EBI data resources to scientific communities globally. These
developments aim to ensure EMBL-EBI resources meet the current
and future needs of these scientific communities, accelerating
the impact of open biological data for all.",
journal = "Nucleic Acids Res.",
volume = 52,
number = "D1",
pages = "D10--D17",
month = jan,
year = 2024,
language = "en"
}
49 changes: 22 additions & 27 deletions paper/paper.md
Expand Up @@ -28,50 +28,45 @@ bibliography: paper.bib
---

# Summary
# Summary

`pridepy` is a Python client designed to access the PRIDE Archive `[@Perez-Riverol2022-ow]`, a major public repository for proteomics data. The `pridepy` provides a flexible, programmatic interface to search, retrieve, and download data from the PRIDE Archive via its REST API. This tool simplifies the integration of PRIDE data into bioinformatics pipelines, making it easier for researchers to access large datasets programmatically.

`pridepy` can be easily installed using pip:
The Proteomics Identification Database (PRIDE) [@Perez-Riverol2022-ow] is the world's largest repository for proteomics data and a founding member of ProteomeXchange [@Deutsch2023-mu]. We introduce pridepy, a Python client designed to access PRIDE Archive data, including project metadata and file downloads. pridepy offers a flexible programmatic interface for searching, retrieving, and downloading data via the PRIDE REST API. This tool simplifies the integration of PRIDE datasets into bioinformatics pipelines, making it easier for researchers to handle large datasets programmatically.

# Statement of Need

The PRIDE Archive stores an extensive collection of proteomics data [@Perez-Riverol2022-ow], but manually accessing this data can be inefficient and time-consuming. With the growing need for cloud-based [@Dai2024-yc] and HPC bioinformatics tools [@Mehta2023-og], command-line utilities that seamlessly interact with the PRIDE API are increasingly important. `pridepy` addresses this by enabling researchers to programmatically access PRIDE using Python, a widely adopted language. It allows efficient dataset integration into automated workflows, with support for large-scale data transfers via [Aspera](https://www.ibm.com/products/aspera), [Globus](https://www.globus.org/data-transfer), FTP, and HTTPS, making it ideal for scalable, reproducible pipelines.

# Methods

`pridepy` is built in Python and interacts with the [PRIDE Archive REST API](https://www.ebi.ac.uk/pride/ws/archive/v2/swagger-ui.html). The core functionality includes:
`pridepy` is built in Python and interacts with the [PRIDE Archive REST API](https://www.ebi.ac.uk/pride/ws/archive/v2/swagger-ui.html). The package provides not only data models for each data structure of the API but also a set of command-line tools that make them easier to use. The main features of `pridepy` include:

- The main use case of pridepy is downloading files from PRIDE Archive (**Figure 1**). PRIDE Archive stores its data in an S3-like storage system called FIRE [@Thakur2024-zu], which also hosts other major EMBL-EBI archives such as ENA (European Nucleotide Archive) and EGA (European Genome-phenome Archive). FIRE data is accessible via multiple protocols, including FTP, Aspera, S3 and Globus. The pridepy client provides a simple command-line interface to download files from PRIDE Archive using these protocols. Each protocol offers different advantages:
- FTP: Widely supported and easy to use
- Aspera: High-speed file transfers, especially for large files or over long distances
- S3 streaming: Easy to download private datasets and stream small files
- Globus: Reliable transfers for very large datasets.

For private datasets, only S3 streaming is supported and users need to provide submitter or reviewer credentials to access the data.
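The protocol constraints described above can be sketched as a small validation helper. This is only an illustration and not part of `pridepy`: the function name `validate_protocol` and the lowercase protocol labels are assumptions made for this example.

```python
# Hypothetical helper, not part of pridepy: encodes the protocol rules stated above.
# The lowercase labels are assumed names for the four protocols listed in the text.
SUPPORTED_PROTOCOLS = {"ftp", "aspera", "s3", "globus"}


def validate_protocol(protocol: str, private: bool = False) -> str:
    """Return the normalised protocol name, or raise ValueError if it cannot be used."""
    name = protocol.strip().lower()
    if name not in SUPPORTED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    # Private datasets are only reachable through S3 streaming (with credentials).
    if private and name != "s3":
        raise ValueError("private datasets are only available through S3 streaming")
    return name
```

Checking the choice before starting a transfer lets a pipeline fail early instead of partway through a large download.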

- Searching for datasets using accession numbers or keywords.
- Retrieving and downloading raw files or specific project data using different protocols (e.g., Aspera, Globus, FTP, and HTTPS). This is supported by multiple protocols implemented at the PRIDE Archive (Figure 1).
- Handling biological data types such as proteins and peptides through a high-level interface.

The API client leverages Python's request library to handle HTTP requests and responses. It provides a structured approach to query the database, filter results, and download associated files, including mass spectrometry data.
The client is available on [PyPI](https://pypi.org/project/pridepy/) and can be installed using `pip`. The source code is hosted on [GitHub](https://github.com/bigbio/pridepy) and is open-source under the Apache 2.0 license. In addition, a conda recipe is available for easy installation in conda environments. The package is continuously tested using GitHub Actions and has been successfully deployed on the EMBL-EBI HPC cluster.

## Dependencies
![Figure 1: Architecture of transfer protocols supported by PRIDE Archive](figure.png){ width=80% }

`pridepy` relies on the following main Python libraries:
- `requests`: For handling HTTP requests
- `pandas`: For data manipulation and analysis
- Additional libraries may be required for specific transfer protocols (e.g., Aspera, Globus)
- Handling biological data types such as proteins and peptides through a high-level interface.

The API client leverages Python's request library to handle HTTP requests and responses. It provides a structured approach to query the database, filter results, and download associated files, including mass spectrometry data.
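As a rough illustration of the request/response pattern described above, the sketch below fetches the metadata of one project directly from the REST API. It is not how `pridepy` is implemented: it uses the standard library's `urllib` instead of `requests` to stay dependency-free, the base URL is taken from the Swagger link above, and the `/projects/{accession}` path and both function names are assumptions that should be checked against the API documentation.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Base URL taken from the Swagger link in the text.
BASE_URL = "https://www.ebi.ac.uk/pride/ws/archive/v2"


def project_url(accession: str) -> str:
    """Build the URL for one project (the /projects/<accession> path is an assumption)."""
    return f"{BASE_URL}/projects/{quote(accession, safe='')}"


def fetch_project(accession: str, timeout: float = 30.0) -> dict:
    """Download and decode the JSON metadata of a project; requires network access."""
    with urlopen(project_url(accession), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_project("PXD012353")` would return the decoded JSON document if the endpoint behaves as assumed.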
# Usage
# Downloading files from PRIDE Archive

The main features of `pridepy` include:
- `download_all_raw_files`: Downloads all raw files for a given project.
- `search_projects_by_keywords_and_filters`: Searches PRIDE Archive projects by keyword, species, instrument, etc.
- `search_protein_evidences`: Retrieves protein evidence associated with a project.
Users can download files from PRIDE Archive using the following command options: `download-all-raw-files` and `download-files-by-name`. The `download-all-raw-files` command downloads all raw files from a dataset, while the `download-files-by-name` command downloads a single file by name. Users can specify the output directory, protocol (FTP, Aspera, S3, or Globus), and other options to customize the download process. pridepy implements robust error handling and retry mechanisms to ensure successful downloads, especially for large datasets or unstable network connections. One example of downloading all raw files using Aspera from a dataset is shown below:

Here's a simple example of how to use `pridepy` to download raw files:
- `search_protein_evidences`: Retrieves protein evidence associated with a project.
```bash
$ pridepy download-all-raw-files -a PXD012353 -o /Users/yourname/Downloads/foldername/ -p aspera
```
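For automated workflows, the command above can also be launched from Python. The following is a minimal sketch, assuming the `pridepy` executable is on the `PATH`; it uses only the `-a`, `-o` and `-p` flags shown in the example. The helper names and the outer retry loop are illustrative additions and do not describe pridepy's internal retry mechanism.

```python
import subprocess
import time
from typing import Callable, List


def build_download_command(accession: str, output_dir: str, protocol: str = "aspera") -> List[str]:
    """Assemble the CLI call from the example; only `aspera` is a confirmed -p value."""
    return ["pridepy", "download-all-raw-files", "-a", accession, "-o", output_dir, "-p", protocol]


def run_with_retries(command: List[str], attempts: int = 3, delay_seconds: float = 0.0,
                     runner: Callable = subprocess.run) -> int:
    """Run the command until it exits with code 0; return the number of attempts used."""
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay_seconds)  # pause before the next attempt
    raise RuntimeError(f"command failed after {attempts} attempts: {' '.join(command)}")
```

A call such as `run_with_retries(build_download_command("PXD012353", "/Users/yourname/Downloads/foldername/"))` reruns the whole download on a non-zero exit code. The `runner` parameter exists so the loop can be exercised without starting a real transfer.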

This makes the client suitable for handling large-scale proteomics data in automated workflows, particularly in environments requiring bulk downloads of proteomics datasets.

# Discussion and Future Directions

`pridepy` successfully simplifies access to the PRIDE Archive, but future development could focus on improving the handling of large downloads by implementing parallel downloads or better error handling mechanisms. Additionally, adding more advanced querying capabilities, such as custom filters for specific peptide or protein properties, would make the tool even more powerful for large-scale proteomics analysis. Expanding user documentation and examples could help broaden its use within the scientific community.
`pridepy` simplifies access to PRIDE Archive data, but future development could focus on improving the handling of large downloads by implementing parallel downloads. Additionally, we will expand the user documentation and examples to help broaden its use within the scientific community, and at the same time produce a set of benchmarks to evaluate the performance of the client in different scenarios. We plan to continue extending the library to support more features of the PRIDE Archive API, such as dataset metadata streaming and submission of new datasets to the PRIDE Archive.

# Acknowledgments

We would like to thank the PRIDE Archive team and contributors to this project for their invaluable input and feedback.
We would like to thank the PRIDE Archive team and contributors to this project for their invaluable input and feedback. The work is supported by core funding from the European Molecular Biology Laboratory (EMBL) and the Wellcome Trust [grant numbers 208391/Z/17/Z and 223745/Z/21/Z], and the BBSRC grant ‘DIA-Exchange’ [BB/X001911/1].

# References
