squab performs gene expression quantification by counting the number of aligned records that intersects a set of features. Output can be the raw counts or normalized counts in TPM (transcripts per million) or FPKM (fragments per kilobase per million mapped reads).
The original goal of this project is to provide a faster alternative to htseq-count. It uses similar counting rules and outputs a compatible data table.
Install Rust and use cargo
to install squab
.
$ cargo install --locked --git https://github.com/zaeleus/squab.git
squab has two subcommands: quantify
and normalize
.
quantify
performs gene expression quantification by counting the number of
times aligned records intersect known gene annotations.
Gene expression quantification
Usage: squab quantify [OPTIONS] --output <OUTPUT> --annotations <ANNOTATIONS> <BAM>
Arguments:
<BAM> Input alignment file
Options:
--with-secondary-records
Count secondary records (BAM flag 0x100)
--with-supplementary-records
Count supplementary records (BAM flag 0x800)
--with-nonunique-records
Count nonunique records (BAM data tag NH > 1)
--strand-specification <STRAND_SPECIFICATION>
Strand specification [default: auto] [possible values: none, forward, reverse, auto]
-t, --feature-type <FEATURE_TYPE>
Feature type to count [default: exon]
-i, --id <ID>
Feature attribute to use as the feature identity [default: gene_id]
--min-mapping-quality <MIN_MAPPING_QUALITY>
[default: 10]
-o, --output <OUTPUT>
Output destination for feature counts
-a, --annotations <ANNOTATIONS>
Input annotations file (GFF3)
--threads <THREADS>
Force a specific number of threads
-h, --help
Print help information
The default output is a tab-delimited text file with two columns: the feature identifier (string) and the number of reads (integer) from the input alignment that overlap it. This file is compatible as output from htseq-count, meaning it includes statistics in the trailer.
normalize
takes raw counts and normalizes them by gene length, meaning the
annotations used for quantification must be the same given here. Two
normalization methods are available: FPKM for single sample normalization and
TPM for across samples normalization (default).
Typically, this is only used when a sample was previously quanitifed, e.g.,
using squab quantify
or htseq-count
.
Normalize features counts
Usage: squab normalize [OPTIONS] --annotations <ANNOTATIONS> <COUNTS>
Arguments:
<COUNTS>
Input counts file
Options:
-t, --feature-type <FEATURE_TYPE>
Feature type to count
[default: exon]
-i, --id <ID>
Feature attribute to use as the feature identity
[default: gene_id]
-a, --annotations <ANNOTATIONS>
Input annotations file (GFF3)
--method <METHOD>
Quantification normalization method
[default: tpm]
Possible values:
- fpkm: fragments per kilobase per million mapped reads
- tpm: transcripts per million
-h, --help
Print help information (use `-h` for a summary)
The output is a tab-delimited text file with two columns: the feature identifier (string) and the normalized value (double).
$ squab \
quantify \
--annotations annotations.gff3.gz \
--output sample.counts.tsv \
sample.bam
$ squab \
quantify \
--annotations annotations.gff3.gz \
--feature-type gene \
--id gene_name \
--output sample.counts.tsv \
sample.bam
$ squab \
normalize \
--method fpkm \
--annotations annotations.gff3.gz \
sample.counts.tsv \
> sample.fpkm.tsv
- Counts are taken only as the union of matched feature sets, i.e., reads that overlap any part of the feature is considered once.
- For paired end alignments, a read that matches itself before a mate is found replaces the previously known record.
-
S Anders, T P Pyl, W Huber. HTSeq – A Python framework to work with high-throughput sequencing data. bioRxiv 2014. https://doi.org/10.1101/002824
-
Wagner, G.P., Kin, K. & Lynch, V.J. Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples. Theory Biosci. 131, 281–285 (2012). https://doi.org/10.1007/s12064-012-0162-3