The spectra-cluster-hadoop application is the Apache Hadoop implementation of the newly developed PRIDE Cluster algorithm. It is used to cluster the complete public data in the PRIDE repository for MS/MS-based proteomics data.
The spectra-cluster-hadoop application relies on the spectra-cluster clustering API. All implementations of relevant clustering algorithms can be found there.
The following description covers only the clustering pipeline used to create the PRIDE Cluster resource.
The pipeline itself is run through the runClusteringJob.sh script, which sets the order of the launched jobs and the thresholds used. Additionally, detailed configuration options for all jobs can be found in the respective configuration files.
The pipeline used to create the PRIDE Cluster resource consists of four MapReduce jobs:
- The Spectrum job is used to load all spectra from the source (MGF) files. Spectra are normalized and only the 70 highest peaks per spectrum are retained (see the first sketch after this list).
- The Peak job is used to create the initial clustering. The [runClusteringJob.sh](https://github.com/spectra-cluster/spectra-cluster-hadoop/blob/master/script/runClusteringJob.sh) script launches this job with decreasing thresholds (by default starting at an estimated cluster purity of 0.999 and decreasing over 4 rounds to a final purity of 0.99; see the second sketch after this list). In the first round, only spectra that share one of their six highest peaks are compared. In subsequent rounds, only spectra that were among the 30 highest-scoring matches in the previous round are compared.
- The Merge job merges clusters in neighbouring windows, again with decreasing accuracy (using the same settings as the [Peak](https://github.com/spectra-cluster/spectra-cluster-hadoop/tree/master/src/main/java/uk/ac/ebi/pride/spectracluster/hadoop/peak) job).
- The Output job writes the result into the .clustering format (see the clustering-file-reader API for more information).
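For illustration, the following sketch shows how a spectrum could be reduced to its N most intense peaks, as the Spectrum job does with N = 70. It is a minimal, self-contained example; the actual preprocessing (including normalisation) lives in the spectra-cluster API, and the class and method names used here are made up for this README.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper class, not part of the spectra-cluster API.
public class PeakFilterSketch {

    /** peaks[i] = {mz, intensity}; returns the n most intense peaks, sorted by m/z. */
    static double[][] keepHighestPeaks(double[][] peaks, int n) {
        List<double[]> byIntensity = new ArrayList<>();
        for (double[] peak : peaks) byIntensity.add(peak);
        // sort by intensity, highest first
        byIntensity.sort(Comparator.comparingDouble((double[] p) -> p[1]).reversed());
        List<double[]> kept = new ArrayList<>(byIntensity.subList(0, Math.min(n, byIntensity.size())));
        // restore m/z order for the retained peaks
        kept.sort(Comparator.comparingDouble(p -> p[0]));
        return kept.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] spectrum = {{100.1, 5.0}, {200.2, 50.0}, {300.3, 10.0}, {400.4, 1.0}};
        // keeps the two most intense peaks (200.2 and 300.3), in m/z order
        for (double[] p : keepHighestPeaks(spectrum, 2)) {
            System.out.printf("m/z %.1f  intensity %.1f%n", p[0], p[1]);
        }
    }
}
```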
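The decreasing thresholds used by the Peak and Merge jobs are passed to the script as `<highest threshold>:<final threshold>:<number of steps>` (for example 0.999:0.99:4). As a rough sketch, assuming the rounds are evenly spaced between the highest and final values, such a setting could be expanded as shown below; the parsing and spacing here are illustrative only, not the project's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the real threshold handling happens inside the spectra-cluster-hadoop jobs.
public class ThresholdRoundsSketch {

    /** Expands a "highest:final:steps" setting into one threshold per clustering round. */
    static List<Double> thresholdRounds(String setting) {
        String[] parts = setting.split(":");
        double highest = Double.parseDouble(parts[0]);
        double lowest  = Double.parseDouble(parts[1]);
        int steps      = Integer.parseInt(parts[2]);

        List<Double> rounds = new ArrayList<>();
        if (steps < 2) {
            rounds.add(lowest);
            return rounds;
        }
        for (int i = 0; i < steps; i++) {
            // assume evenly spaced thresholds from the highest value down to the final one
            rounds.add(highest - i * (highest - lowest) / (steps - 1));
        }
        return rounds;
    }

    public static void main(String[] args) {
        // approximately [0.999, 0.996, 0.993, 0.99]
        System.out.println(thresholdRounds("0.999:0.99:4"));
    }
}
```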
You will need to have Maven installed in order to build spectra-cluster-hadoop.
```
$ mvn clean package
```
After the build, you should find spectra-cluster-hadoop-X.X.X.zip in the target folder.
Unzip the release of the library into a dedicated folder and execute the following command:
```
Usage: ./runClusteringJob.sh [main directory] [job prefix = ''] [similarity threshold settings = 0.999:0.99:4] [output folder = main directory]

[main directory]        Path on Hadoop to use as a working directory. The
                        sub-directory 'spectra' will be used as input directory.
[job prefix]            (optional) A prefix to add to the Hadoop job names.
[similarity threshold]  (optional) The similarity threshold settings, in the
                        format of <highest threshold>:<final threshold>:<number of steps>
[output folder]         (optional) If this option is set, the results are
                        written to this folder instead of the [main directory].
```
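For example, assuming the spectra have already been uploaded to the 'spectra' sub-directory of a working directory on Hadoop, a run might look as follows (the paths and the job prefix are made-up placeholders, not values required by the script):

```
$ ./runClusteringJob.sh /user/pride/cluster-run pride-run 0.999:0.99:4 /user/pride/cluster-run/results
```

This launches the four jobs in order, prefixes their Hadoop job names with 'pride-run', clusters over four rounds with thresholds decreasing from 0.999 to 0.99, and writes the resulting .clustering files to the given output folder.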
If you have questions or need additional help, please contact the PRIDE help desk at the EBI: pride-support at ebi.ac.uk (replace "at" with "@").
Please give us your feedback, including error reports, suggestions for improvements, and new feature requests. You can do so by opening a new issue in our issues section.
Please cite this library using one of the following publications:
- Griss J, Foster JM, Hermjakob H, Vizcaíno JA. PRIDE Cluster: building the consensus of proteomics data. Nature Methods. 2013;10(2):95-96. doi:10.1038/nmeth.2343.
We welcome all contributions submitted as pull requests.
This project is available under the Apache 2 open source software (OSS) license.