
Access Data Release

Artem Babaian edited this page May 10, 2023 · 34 revisions

s3://lovelywater2/ : Serratus data-warehouse

Current Version: v230110

Versioned and structured data releases are freely hosted on AWS S3 in our data-warehouse: "lovelywater2".

Unstructured data and intermediate files are in the Working Data Directories.

Structured Data Types

  1. Search sequence references
  2. SRA Run Info Queries
  3. Summary-level data
  4. Alignment-level data (.bam or .pro, see notes below)
  5. Assembly-level data
  6. RdRP barcode sequences (PALMdb)

Folder organization

                                                                               NEW/UPDATED
s3://lovelywater2/     # A Read-Only Archive of Serratus Data Releases
⦿ Common files
├── assembly/         # Viral assembly and annotation data                     
│   └─── cov/         # .fasta  : Assembled/filtered coronaviruses             
│   └─── contigs/     # CoronaSPAdes output, contigs, graphs, stats...         
│   └─── annotation/  # CoV annotation and taxonomic assignments
├── seq/              # Reference sequences used in data-releases      
│   └─── cov3ma/      # Nucleotide viral pangenome
│   └─── protref5/    # Protein viral panproteome
│   └─── rdrp1/       # viral RNA dependent RNA polymerase collection 1
│   └─── rdrp5/       # dark  RNA dependent RNA polymerase collection 5        ***
├── sra/              # sraRunInfo.csv files and queries for data (per query)
│   └─── README.md    # see github.com/ababaian/serratus/wiki/SRA-queries      ***
│   └─── *query*      # (see below)                                            ***
⦿ Nucleotide search files
├── bam/              # .bam    : Aligned files
├── summary/          # .summary: Original alignment summaries (deprecated)  
├── summary2/         # .summary: Alignment summaries
⦿ Translated-nucleotide (protein) search files
├── pro/              # .pro.gz : Translated-nucleotide alignments (diamond)
├── psummary/         # .psummary: Protein alignment summaries
⦿ RdRP 1 translated-nucleotide search files
├── rpro/              # .pro.gz : Aligned files                              ***
├── rsummary/          # .psummary: Alignment summaries for rdrp-search       ***
⦿ Dark RdRP 5 translated-nucleotide search files
├── dpro/              # .pro.gz : Aligned files                              ***
├── dsummary/          # .psummary: Alignment summaries for rdrp-search       ***
⦿ Index Files
├ index.tsv           # Index file of completed SRA accessions
├ pindex.tsv          # Index file of completed protein SRA accessions
├ rindex.tsv          # Index file of completed rdrp SRA accessions           ***
├ dindex.tsv          # Index file of completed dark rdrp SRA accessions      ***
├ LICENSE.md          #
└ README.md           # This README.md                                        **

s3://lovelywater2/sra/
* QUERY SETS *
├ v201210/               # Query sets from major version v210225 and prior
├ v220113/               # Query sets from major version v210225
└ v230116_SraRunInfo.csv # master query CSV for v230116                          ***

See also: SRA Query Sets

Naming Convention

All folders are flat, with files named {sra_accession}.{ext}

For example, the SRA library SRA123456 processed in the 'viro' query will have the files:

  • s3://lovelywater2/bam/SRA123456.bam
  • s3://lovelywater2/summary/SRA123456.summary
  • s3://lovelywater2/assembly/contigs/SRA123456.coronaSPAdes.gene_clusters.fa
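Because every folder is flat, an object path can be derived from the accession alone. The following is a minimal sketch of that idea; `serratus_path` is a hypothetical helper name introduced here for illustration (it is not part of Serratus), and the folder and extension values are taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Serratus): print the S3 path for one
# accession, following the {sra_accession}.{ext} naming convention.
# Usage: serratus_path <folder> <sra_accession> <ext>
serratus_path() {
  local folder="$1" acc="$2" ext="$3"
  printf 's3://lovelywater2/%s/%s.%s\n' "$folder" "$acc" "$ext"
}

# Paths for the example library used above
serratus_path bam SRA123456 bam
serratus_path summary SRA123456 summary
```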

Accessing Data

The S3 bucket has public read-only permissions. All files can be downloaded via aws cli or wget/curl.

  • aws-cli : aws s3 cp s3://lovelywater2/<file_path> .

  • wget/curl : wget https://lovelywater2.s3.amazonaws.com/<file_path>

To find or access a sub-set of data use the index file:

`aws s3 cp s3://lovelywater2/index.tsv ./`

`grep "SRR1234" index.tsv > matches`

`aws s3 cp --recursive --exclude "*" --include "SRR1234*" s3://lovelywater2/summary/ ./SRR1234/`
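When the aws cli is not available, the same subset can be fetched over HTTPS. The sketch below assumes you already have a plain list of accessions, one per line; `serratus_urls` and `accessions.txt` are hypothetical names used only for illustration, and the accessions shown are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Serratus): read accessions from stdin,
# one per line, and print the public HTTPS URL for each file.
# Usage: serratus_urls <folder> <ext> < accessions.txt
serratus_urls() {
  local folder="$1" ext="$2" acc
  while IFS= read -r acc; do
    [ -n "$acc" ] || continue   # skip blank lines
    printf 'https://lovelywater2.s3.amazonaws.com/%s/%s.%s\n' "$folder" "$acc" "$ext"
  done
}

# Print URLs for two placeholder accessions
printf 'SRR1234\nSRR1235\n' | serratus_urls summary summary

# To download instead of print, pipe the URLs to wget:
#   serratus_urls summary summary < accessions.txt | wget -i -
```

Printing the URLs first gives a dry run, so the list can be checked before any download starts.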

Access Alignment Data in IGV

As of version 20200821, all .bam files are sorted and have an associated .bai index file in the ~/bam/ directory. These alignment files can be visualized directly in a genome browser such as IGV, using cov3ma as the reference genome.

IGV Stream Alignment: File --> Load from URL --> https://lovelywater2.s3.amazonaws.com/bam/ERR2756788.bam

You can then navigate to a relevant accession such as "EU769558.1" and directly visualize read alignments.

IGV screenshot

.pro files

Translated-nucleotide alignment data are saved as .pro files, the output of diamond -f 6 with the following ordered fields:

qseqid  qstart qend qlen qstrand sseqid  sstart send slen pident evalue cigar qseq_translated full_qseq full_qseq_mate

(See also: Diamond Wiki)
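The field order above can be used to filter alignments on the command line. This is a minimal sketch, assuming the fields are tab-separated (as in diamond's tabular output) and appear in exactly the order listed, so that sseqid is field 6, pident is field 10, and evalue is field 11. `pro_filter` is a hypothetical helper name and the file name in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Serratus): keep alignments whose percent
# identity is at or above a threshold (default 90) and print
# qseqid, sseqid, pident, evalue as tab-separated columns.
# Assumes tab-separated input with pident in field 10.
pro_filter() {
  local min_pident="${1:-90}"
  awk -F'\t' -v OFS='\t' -v min="$min_pident" \
    '$10 + 0 >= min + 0 { print $1, $6, $10, $11 }'
}

# Usage on a gzip-compressed .pro file (placeholder file name):
#   gzip -dc SRR1234.pro.gz | pro_filter 90
```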

.mfc compressed files

FASTA assemblies are compressed using MFCompress.

# Quick install (linux 64bit)
wget http://sweet.ua.pt/ap/software/mfcompress/MFCompress-linux64-1.01.tgz
tar -xvf MFCompress-linux64-1.01.tgz
cp MFC*/MFC* ./; rm -rf MFCompress-linux64-1.01

# Decompress
MFCompressD SRR01234.fa.mfc

LICENSE

All data in s3://lovelywater2/ are released under the CC0 1.0 license, as defined in s3://lovelywater2/LICENSE.md.

Genomes and Contigs

RdRP barcode sequence database

PALMdb is a database of viral polymerase palmprint (barcode) sequences classified by (1) taxonomy and (2) species-like operational taxonomic units (OTUs) obtained by clustering at 90% sequence identity. PALMdb was created using the palmscan algorithm to mine public sequence databases and Serratus contigs. The 2021-03-14 update includes 250,799 novel Serratus palmprint sequences, representing 132,992 new OTUs.
