sra-repo

SRA repository management system

Overview

sra-repo is a set of command-line tools for managing a centralized, local repository of published FASTQ files downloaded from SRA databases (NCBI Entrez or EBI ENA).

Some features:

  • perform parallel downloads from SRA databases (NCBI Entrez or EBI ENA)
  • automatically set all fastq files to read-only to avoid accidental modification of the files
  • create symbolic links for the necessary fastq files from the central location to a target directory
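
The read-only and symbolic-link features can be pictured with a short Python sketch. It is only an illustration and is not taken from the sra-repo source: the function name protect_and_link, the temporary directories, and the FASTQ file name are all hypothetical.

```python
import stat
import tempfile
from pathlib import Path


def protect_and_link(fastq: Path, target_dir: Path) -> Path:
    """Make a stored FASTQ file read-only and link it into target_dir."""
    # Drop every write bit (user, group, other) while keeping the other bits.
    mode = fastq.stat().st_mode
    fastq.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    target_dir.mkdir(parents=True, exist_ok=True)
    link = target_dir / fastq.name
    # The link points at the central copy, so no data is duplicated.
    link.symlink_to(fastq.resolve())
    return link


# Hypothetical layout: a central "store" directory and an analysis directory.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "store"
    store.mkdir()
    fastq = store / "ERR175543_1.fastq.gz"  # assumed file name, empty placeholder
    fastq.write_bytes(b"")
    link = protect_and_link(fastq, Path(tmp) / "analysis")
    link_is_symlink = link.is_symlink()
    store_is_writable = bool(fastq.stat().st_mode & stat.S_IWUSR)
```

Because the analysis directory only holds links, removing it never touches the central copy, and the cleared write bits guard against accidental edits made through the link.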

Examples

sra-repo supports bash tab completion, which makes typing faster and less error-prone. Pressing the TAB key once completes the current argument, while pressing it twice lists the available arguments. Note that this feature is only available in the bash shell. Try:

sra-repo.py [TAB][TAB]
sra-repo.py fe[TAB]

Several usage examples:

Fetching FASTQ files

Fetching SRAs from a public database using 3 parallel download workers (tasks). By default, sra-repo.py tries EBI ENA first and then NCBI Entrez:

sra-repo.py fetch --ntasks 3 ERR175543 ERR175544
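
The combination of a fixed number of workers and an ordered fallback between sources can be sketched as follows. This is a minimal illustration, not sra-repo's implementation: fetch_from_ena and fetch_from_ncbi are hypothetical stand-ins that perform no network access, and the stub merely pretends that ERR175544 is missing from ENA so that the fallback path is exercised.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_from_ena(sra_id: str) -> str:
    """Stand-in for an ENA download; pretends ERR175544 is missing there."""
    if sra_id == "ERR175544":
        raise LookupError(f"{sra_id} not found in ENA")
    return "ENA"


def fetch_from_ncbi(sra_id: str) -> str:
    """Stand-in for an NCBI Entrez download."""
    return "NCBI"


def fetch_one(sra_id: str) -> tuple[str, str]:
    """Try each source in order and report which one served the ID."""
    for fetch in (fetch_from_ena, fetch_from_ncbi):
        try:
            return sra_id, fetch(sra_id)
        except LookupError:
            continue
    raise LookupError(f"{sra_id} not found in any source")


def fetch_all(sra_ids: list[str], ntasks: int = 3) -> dict[str, str]:
    """Run at most ntasks downloads at the same time."""
    with ThreadPoolExecutor(max_workers=ntasks) as pool:
        return dict(pool.map(fetch_one, sra_ids))


results = fetch_all(["ERR175543", "ERR175544"], ntasks=3)
```

A thread pool is a reasonable choice for this kind of sketch because downloads are I/O-bound; whether sra-repo uses threads, processes, or something else is not stated here.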

Fetching SRAs with SRA IDs taken from a file containing one ID per line:

sra-repo.py fetch --ntasks 10 --idfile my_sraids.txt
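
For reference, the layout of such an ID file can be shown with a small Python sketch that writes and reads one. The helper read_sra_ids is hypothetical, and skipping blank lines is an assumption made for the example rather than documented sra-repo behaviour.

```python
import tempfile
from pathlib import Path


def read_sra_ids(path: Path) -> list[str]:
    """Return the IDs listed one per line, skipping blank lines."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    idfile = Path(tmp) / "my_sraids.txt"
    # One accession per line, no header.
    idfile.write_text("ERR175543\nERR175544\n")
    ids = read_sra_ids(idfile)
```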

Fetching SRAs with SRA IDs taken from a column named ENA of a tab-delimited file with a proper header line (i.e. a sample file):

sra-repo.py fetch --ntasks 20 --samplefile my_samplefile.tsv:ENA
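
The FILE:COLUMN form of the --samplefile argument can be read as "take this column from that tab-delimited file". The sketch below illustrates that reading with hypothetical helpers (split_spec, read_column) and made-up sample codes; it is not sra-repo's own parser.

```python
import csv
import io


def split_spec(spec: str) -> tuple[str, list[str]]:
    """Split 'file.tsv:COL1,COL2' into the file name and its column names."""
    filename, _, columns = spec.rpartition(":")
    return filename, columns.split(",")


def read_column(tsv_text: str, column: str) -> list[str]:
    """Return the values of one named column of a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row[column] for row in reader]


# Made-up sample file content: a header line, then one row per sample.
sample_tsv = "Sample\tENA\nS1\tERR175543\nS2\tERR175544\n"

filename, columns = split_spec("my_samplefile.tsv:ENA")
ena_ids = read_column(sample_tsv, columns[0])
```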

Checking FASTQ files

To check the existence of certain SRA IDs in the database:

sra-repo.py check ERR175543 ERR175544

or:

sra-repo.py check --idfile my_sraids.txt

or:

sra-repo.py check --samplefile my_samplefile.tsv:ENA

To also validate the FASTQ files, use the --validate argument:

sra-repo.py check --validate ERR175543 ERR175544
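
This README does not say which checks --validate performs. As an assumption-laden illustration of what validating a FASTQ file can involve, the sketch below tests only the basic record structure of a gzip-compressed file (four lines per record, matching sequence and quality lengths). The function name fastq_is_well_formed and the in-memory test data are hypothetical.

```python
import gzip
import io


def fastq_is_well_formed(handle) -> bool:
    """Check four-line records: '@' header, sequence, '+' line, quality."""
    # Reads everything into memory, which is fine for a small example only.
    lines = [line.rstrip("\n") for line in handle]
    if not lines or len(lines) % 4 != 0:
        return False
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if len(seq) != len(qual):
            return False
    return True


# Two tiny gzip-compressed FASTQ payloads: one intact, one with short qualities.
good = gzip.compress(b"@read1\nACGT\n+\nIIII\n")
truncated = gzip.compress(b"@read1\nACGT\n+\nII\n")

with gzip.open(io.BytesIO(good), "rt") as fh:
    good_ok = fastq_is_well_formed(fh)
with gzip.open(io.BytesIO(truncated), "rt") as fh:
    truncated_ok = fastq_is_well_formed(fh)
```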

Finding information about FASTQ files

The other commands available in sra-repo are info and path, which report information about the FASTQ files and the actual path where the FASTQ files are stored, respectively:

sra-repo.py info ERR175543
sra-repo.py path ERR175543

Both commands can also accept an SRA ID file or a sample file, using the --sraidfile or --samplefile argument.

Linking FASTQ files

Linking FASTQ files to a target directory is usually necessary before any analysis is performed, as it simplifies the handling of file paths.

To create links for several SRA IDs directly:

mkdir test
sra-repo.py link --outdir test ERR175543 ERR175544
ls test

A text file or tab-delimited file can also be used:

sra-repo.py link --outdir test --idfile my_sraids.txt

or:

sra-repo.py link --outdir test --samplefile my_samplefile.tsv:Sample,ENA

Please note that when using a sample file, the column names for both the sample identifier and the SRA IDs are required.

When using a sample file, sra-repo can also provide a manifest file: a two-column tab-delimited file with SAMPLE and FASTQ headers. Each row gives a sample code and its associated FASTQ files, with commas separating paired files and semicolons separating different SRA IDs for the same sample:

sra-repo.py link --outdir test --o my-manifest.tsv --samplefile my_samplefile.tsv:Sample,ENA
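
A downstream script could read such a manifest along the lines of the sketch below. The manifest content is made up for illustration: the FASTQ file names are assumed, and ERR000001 is a placeholder accession that only demonstrates a sample with two runs. The helper parse_manifest is likewise hypothetical.

```python
import csv
import io

# Made-up manifest content; file names are illustrative only.
manifest_text = (
    "SAMPLE\tFASTQ\n"
    "S1\tERR175543_1.fastq.gz,ERR175543_2.fastq.gz\n"
    "S2\tERR175544_1.fastq.gz,ERR175544_2.fastq.gz;"
    "ERR000001_1.fastq.gz,ERR000001_2.fastq.gz\n"
)


def parse_manifest(text: str) -> dict[str, list[list[str]]]:
    """Map each sample to a list of runs, each run being a list of FASTQ files."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {
        # Semicolons separate runs; commas separate the paired files of a run.
        row["SAMPLE"]: [run.split(",") for run in row["FASTQ"].split(";")]
        for row in reader
    }


manifest = parse_manifest(manifest_text)
```

With this structure, manifest["S2"] holds two runs of two files each, so an analysis pipeline can decide whether to process the runs separately or merge them per sample.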

Quick Installation

Decide where the main root directory for sra-repo and its storage will be installed. Run the following command to install sra-repo, including all of its dependencies, and provide the directory when prompted:

"${SHELL}" <(curl -L https://raw.githubusercontent.com/vivaxgen/install/main/sra-repo.sh)

Make sure the installation has completed successfully. Take note of the activation script, which needs to be sourced every time sra-repo is going to be used.

Manual Installation

The first step is to decide the main root directory where sra-repo and its repository system will be stored. For example, with main root directory of /shared/SRA, the following directory structure would be recommended:

/shared/SRA
/shared/SRA/bin [for activate.sh script]
/shared/SRA/opt [for manual installation of the requirements if not using Conda]
/shared/SRA/opt/env [for sra-repo installation]
/shared/SRA/store [for the main repository of all FASTQ files]
/shared/SRA/tmp [for temporary space during downloads and format conversion]
/shared/SRA/cache [for samtools-fastq caching system converting CRAM to FASTQ]

To prepare the above directory structure and install sra-repo, the following commands can be used:

export MAIN_ROOT=/shared/SRA
mkdir -p $MAIN_ROOT/bin $MAIN_ROOT/opt $MAIN_ROOT/opt/env $MAIN_ROOT/store $MAIN_ROOT/tmp $MAIN_ROOT/cache
git clone https://github.com/vivaxgen/sra-repo.git $MAIN_ROOT/opt/env/

sra-repo is written in Python (the development is with Python 3.11) with the following additional modules used:

  • pycurl
  • requests
  • rich
  • argcomplete

Python can be installed using Conda, using the operating system's package manager (e.g. dnf for rpm-based Linux systems or apt for deb-based Linux systems), or by downloading it directly from https://python.org. Once Python 3 has been installed, install the required modules by running the following:

pip3 install pycurl rich requests argcomplete
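
After installation, a quick way to confirm that the four modules are visible to the interpreter is a check like the one below. This is a convenience sketch rather than part of sra-repo; the helper name missing_modules is hypothetical.

```python
from importlib.util import find_spec


def missing_modules(names: list[str]) -> list[str]:
    """Return the module names that the current interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


# The modules listed above as requirements of sra-repo.
required = ["pycurl", "requests", "rich", "argcomplete"]
missing = missing_modules(required)
print("missing modules:", ", ".join(missing) or "none")
```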

sra-repo also requires several external software packages to be installed:

If all requirements are going to be installed manually (i.e. not using Conda), they can be installed in $MAIN_ROOT/opt, where MAIN_ROOT is the main root directory of the sra-repo repository (e.g. /shared/SRA in the above example).

[to be continued]
