Bystro

TL;DR: 1,000x+ faster than VEP, with more complete annotation and online search (https://bystro.io) for datasets of up to 47TB (compressed) online, or petabytes offline.

Bystro Publication

For datasets and scripts used, please visit github.com/bystro-paper

If using Bystro, please cite Kotlar et al, Genome Biology, 2018

Web Tutorial

Start here: TUTORIAL.md

For most users, we recommend https://bystro.io .

The web app gives full access to all of Bystro's capabilities, provides a convenient search/filtering interface, supports large data sets (tested up to 890GB uncompressed/129GB compressed), and has excellent performance.

Installing Bystro

Bystro consists of 2 main components: the Bystro Python package, which includes the Bystro ML library, a CLI tool, and a collection of easy-to-use biology tools such as global ancestry; and the Bystro annotator (Perl).

The Bystro Python package also gives the ability to launch workers to process jobs from the Bystro API server, but this is not necessary for most users.

Installing the Bystro Python libraries and CLI tools

To install the Bystro Python package, run:

pip install --pre bystro

The Bystro ancestry CLI score tool (bystro-api ancestry score) parses VCF files to generate dosage matrices. This requires bystro-vcf, a Go program which can be installed with:

# Requires Go: install from https://golang.org/doc/install
go install github.com/bystrogenomics/bystro-vcf@<version>

Bystro is compatible with Linux and macOS. Windows support is experimental. If you are installing on macOS as a native binary (Arm), you will need to install the following additional dependency:

brew install cmake

Please refer to INSTALL.md for more details.

Installing the Bystro Annotator

Please refer to INSTALL.md for instructions on how to install the Bystro annotator.

File support

Bystro relies on pluggable pre-processors (configured via Bystro's YAML config) to normalize variant inputs (dealing with VCF issues such as padding), determine whether a site is a transition or transversion, calculate the sample MAF, identify heterozygous, homozygous, and missing samples, calculate heterozygosity, homozygosity, and missingness, and more.

VCF format: Bystro-Vcf
SNP format: Bystro-SNP
Create your own to support other formats!
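
To make the per-site quantities above concrete, here is a minimal Python sketch of how a pre-processor might derive them. It is not Bystro's implementation: the function names, the genotype encoding (alt-allele count per sample, `None` for missing), and the choice of denominators are all assumptions made for illustration.

```python
from dataclasses import dataclass

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-nucleotide change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in BASES or alt not in BASES:
        raise ValueError(f"not a SNP: {ref}>{alt}")
    # A transition stays within purines or within pyrimidines.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


@dataclass
class SiteStats:
    heterozygotes: list
    homozygotes: list
    missing: list
    heterozygosity: float
    homozygosity: float
    missingness: float
    sample_maf: float


def site_stats(genotypes: dict) -> SiteStats:
    """Summarize one biallelic site.

    `genotypes` maps sample name -> alt-allele count (0, 1, 2), or None if missing.
    Assumed denominators: called samples for heterozygosity and homozygosity,
    all samples for missingness.
    """
    hets = [s for s, g in genotypes.items() if g == 1]
    homs = [s for s, g in genotypes.items() if g == 2]
    missing = [s for s, g in genotypes.items() if g is None]
    total = len(genotypes)
    called = total - len(missing)
    alt_freq = (len(hets) + 2 * len(homs)) / (2 * called) if called else 0.0
    return SiteStats(
        heterozygotes=hets,
        homozygotes=homs,
        missing=missing,
        heterozygosity=len(hets) / called if called else 0.0,
        homozygosity=len(homs) / called if called else 0.0,
        missingness=len(missing) / total if total else 0.0,
        # For a biallelic site the minor allele frequency is min(f, 1 - f).
        sample_maf=min(alt_freq, 1.0 - alt_freq),
    )
```

For example, `classify_snp("A", "G")` is a transition (purine to purine), while `classify_snp("A", "C")` is a transversion.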

Annotation (Output) Field Descriptions

Please read FIELDS.md

The Bystro configuration file

The config file describes the state of both the database and the annotation. It is required for annotating or building.

It has several keys:

tracks: The highest level organization for database values. Tracks have a name property, which must be unique, and a type, which must be one of:

sparse: A bed file, or any file that can be mapped to chrom, chromStart, and chromEnd columns.
- This is used for dbSNP and ClinVar records, but many files can be fit to this format.
- Mapping fields can be managed by the fieldMap key
score: A wigFix file.
- Used for phastCons, phyloP
cadd:
- A CADD file, or Bystro's custom "bed-like" CADD file, which has 2 header lines, and chrom, chromStart, chromEnd columns, followed by standard CADD fields
- CADD format: http://cadd.gs.washington.edu

gene: A UCSC gene track table (ex: knownGene, refGene, sgdGene) stored as tab-separated output, with the column names as the header row. Conversion from SQL to the expected tab-delimited format is handled by bin/bystro-utils.pl, which automatically fetches the requested SQL and generates the tab-delimited output.

For instance, for a config file that has the following track:

chromosomes:
  - chr1
tracks:
  tracks:
  - name: refSeq
    type: gene
    utils:
    - args:
        connection:
          database: hg19
        sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
          ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
          e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
          kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
          x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
          x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
          refGene r WHERE chrom=%chromosomes%;

Running bin/bystro-utils.pl --config <path/to/this/config> will result in the following config:

chromosomes:
  - chr1
tracks:
  tracks:
  - name: refSeq
    type: gene
    local_files:
      - hg19.kgXref.chr1.gz
    utils:
    - args:
        connection:
          database: hg19
        sql: SELECT r.*, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.kgID, '')) SEPARATOR
          ';') FROM kgXref x WHERE x.refseq=r.name) AS kgID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.description,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS description,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(e.value, '')) SEPARATOR ';') FROM knownToEnsembl
          e JOIN kgXref x ON x.kgID = e.name WHERE x.refseq = r.name) AS ensemblID,
          (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.tRnaName, '')) SEPARATOR ';') FROM
          kgXref x WHERE x.refseq=r.name) AS tRnaName, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.spID,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS spID, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.spDisplayID, '')) SEPARATOR ';') FROM kgXref
          x WHERE x.refseq=r.name) AS spDisplayID, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.protAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS protAcc, (SELECT
          GROUP_CONCAT(DISTINCT(NULLIF(x.mRNA, '')) SEPARATOR ';') FROM kgXref x WHERE
          x.refseq=r.name) AS mRNA, (SELECT GROUP_CONCAT(DISTINCT(NULLIF(x.rfamAcc,
          '')) SEPARATOR ';') FROM kgXref x WHERE x.refseq=r.name) AS rfamAcc FROM
          refGene r WHERE chrom=%chromosomes%;
      completed: <date fetched>
      name: fetch

hg19.kgXref.chr1.gz will contain:

bin	name	chrom	strand	txStart	txEnd	cdsStart	cdsEnd	exonCount	exonStarts	exonEnds	score	name2	cdsStartStat	cdsEndStat	exonFrames	kgID	description	ensemblID	tRnaName	spID	spDisplayID	protAcc	mRNA	rfamAcc

0	NM_001376542	chr1	+	66999275	67216822	67000041	67208778	25	66999275,66999928,67091529,67098752,67105459,67108492,67109226,67126195,67133212,67136677,67137626,67138963,67142686,67145360,67147551,67154830,67155872,67161116,67184976,67194946,67199430,67205017,67206340,67206954,67208755,	66999620,67000051,67091593,67098777,67105516,67108547,67109402,67126207,67133224,67136702,67137678,67139049,67142779,67145435,67148052,67154958,67155999,67161176,67185088,67195102,67199563,67205220,67206405,67207119,67216822,	0	SGIP1	cmpl	cmpl	-1,0,1,2,0,0,1,0,0,0,1,2,1,1,1,1,0,1,1,2,2,0,2,1,1,	NA	NA	NA	NA	NA	NA	NA	NA	NA
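
A file in this layout can be read with Python's standard library. The sketch below is illustrative only: `parse_gene_rows` is a hypothetical helper, the sample row is abbreviated from the SGIP1 record above to a subset of columns and its first two exons (so `exonCount` is set to 2 here), and treating the literal string `NA` as a missing value is an assumption.

```python
import csv
import io

# Abbreviated sample: a subset of the columns shown above, first two exons only.
HEADER = ["name", "chrom", "strand", "txStart", "txEnd", "exonCount",
          "exonStarts", "exonEnds", "name2", "kgID"]
ROW = ["NM_001376542", "chr1", "+", "66999275", "67216822", "2",
       "66999275,66999928,", "66999620,67000051,", "SGIP1", "NA"]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"


def parse_gene_rows(handle):
    """Yield one dict per transcript from a tab-delimited gene track file."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: "NA" marks a field with no value.
        record = {key: (None if value == "NA" else value) for key, value in row.items()}
        # UCSC exon coordinates are comma-terminated lists, e.g. "10,20,".
        starts = [int(x) for x in row["exonStarts"].split(",") if x]
        ends = [int(x) for x in row["exonEnds"].split(",") if x]
        if len(starts) != len(ends):
            raise ValueError(f"{row['name']}: exonStarts and exonEnds differ in length")
        record["exons"] = list(zip(starts, ends))
        record["txStart"] = int(row["txStart"])
        record["txEnd"] = int(row["txEnd"])
        yield record


for transcript in parse_gene_rows(io.StringIO(SAMPLE)):
    print(transcript["name"], transcript["name2"], transcript["exons"])
```

To read the compressed file itself, pass `gzip.open("hg19.kgXref.chr1.gz", "rt")` instead of the `io.StringIO` sample.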

nearest: A pre-calculated gene track that is intersected with a target gene track.

Example:
```
- name: refSeq.gene
  dist: false
  storeNearest: true
  to: txEnd
  type: nearest
  features:
  - name2
  from: txStart
  local_files:
  - hg19.kgXref.chr*.gz
```
Options:
- dist: bool
  - Calculate the distance to the nearest target gene record. If the

vcf: A VCF v4.* file
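
The two constraints stated for tracks (a unique name, and a type from the list above) are easy to check before building. The following is a rough sketch, not part of Bystro: `validate_tracks` is a hypothetical helper, and `ALLOWED_TRACK_TYPES` contains only the types named in this README, so the annotator itself may accept others.

```python
# Track types named in this README; the real annotator may accept more.
ALLOWED_TRACK_TYPES = {"sparse", "score", "cadd", "gene", "nearest", "vcf"}


def validate_tracks(tracks):
    """Return a list of human-readable problems found in a list of track dicts."""
    problems = []
    seen = set()
    for index, track in enumerate(tracks):
        name = track.get("name")
        track_type = track.get("type")
        if not name:
            problems.append(f"track #{index} has no name")
        elif name in seen:
            problems.append(f"duplicate track name: {name}")
        else:
            seen.add(name)
        if track_type not in ALLOWED_TRACK_TYPES:
            problems.append(f"track {name!r} has unsupported type: {track_type!r}")
    return problems


tracks = [
    {"name": "refSeq", "type": "gene"},
    {"name": "refSeq.gene", "type": "nearest"},
]
print(validate_tracks(tracks))
```

An empty list means both constraints hold for every track.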

chromosomes: The allowable chromosomes.
- Each row of every track must be identified by these chromosomes (during building)
- Each row of any input file submitted for annotation must also be identified by these chromosomes (during annotation)
- However, Bystro is flexible about the chr prefix
Ex: For the following config
```
chromosomes:
  - chr1
  - chr2
  - chr3
```
Only chr1, chr2, and chr3 will be accepted. However, Bystro tries to make your life easy:
1. We currently follow UCSC conventions for chromosomes, meaning they should be prefixed with chr
2. Bystro will automatically prepend chr to chromosomes read from an input file during annotation.
3. Bystro allows the transformation of any field during building, configurable in the YAML config file for that assembly, making it easy to prepend chr to the source file chromosome field
Ex: ClinVar doesn't have a chr prefix, so during building we specify:
```
tracks:
  - name: clinvar
    build_field_transformations:
      chrom: chr .
    fieldMap:
      Chromosome: chrom
```
Here fieldMap allows us to rename header fields, and build_field_transformations allows us to define a prepend operation (chr . can be interpreted as the Perl expression "chr" . $chrom).
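
The effect of these two keys on a single source row can be sketched in Python. This is only an illustration of the rename-then-prepend idea: the helper names are hypothetical, the parser handles just the `chr .` form shown above, and applying the rename before the transformation is an assumption based on the transformation being keyed by the renamed field (`chrom`).

```python
def apply_field_map(row, field_map):
    """Rename keys of `row` using `field_map` (source name -> new name)."""
    return {field_map.get(key, key): value for key, value in row.items()}


def parse_prepend(spec):
    """Parse a spec like 'chr .' and return the text to prepend ('chr')."""
    value, _, operator = spec.rpartition(" ")
    if operator != "." or not value:
        raise ValueError(f"only the '<text> .' prepend form is handled here: {spec!r}")
    return value


def apply_transformations(row, transformations):
    """Prepend the configured text to each listed field that is present in `row`."""
    out = dict(row)
    for field, spec in transformations.items():
        if field in out:
            out[field] = parse_prepend(spec) + out[field]
    return out


field_map = {"Chromosome": "chrom"}
transformations = {"chrom": "chr ."}

source_row = {"Chromosome": "1"}
built_row = apply_transformations(apply_field_map(source_row, field_map), transformations)
print(built_row)  # {'chrom': 'chr1'}
```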

So: input files do not need to have their chromosomes prepended by chr. Bystro will normalize the name.

In this example chromosomes 1 and chr1 will be built/annotated, but 1_rand will not.
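
The accept/reject behaviour described here can be sketched as a small Python function. It is a simplified model with a hypothetical name (`normalize_chrom`), assuming that normalization amounts to prepending chr when it is absent and then checking the result against the configured list; Bystro's actual normalization may differ.

```python
ALLOWED_CHROMOSOMES = ["chr1", "chr2", "chr3"]


def normalize_chrom(raw, allowed):
    """Return the chr-prefixed chromosome name, or None if it is not allowed."""
    name = raw if raw.startswith("chr") else "chr" + raw
    return name if name in allowed else None


for raw in ["1", "chr1", "1_rand"]:
    print(raw, "->", normalize_chrom(raw, ALLOWED_CHROMOSOMES))
```

Both 1 and chr1 normalize to chr1 and are accepted, while 1_rand becomes chr1_rand, which is not in the list, so the row is skipped.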

Directories and Files

These describe where the Bystro database and any source files are located.

files_dir : The parent folder within which each track's local_files are located

Bystro automatically checks for local_files at parent/trackName/file

Ex: For the config file containing
```
files_dir: /path/to/files/
tracks:
  - name: refSeq
    local_files:
      - hg19.refGene.chr1.gz
      # and more files
```
Bystro will expect files in /path/to/files/refSeq/hg19.refGene.chr1.gz

database_dir : Each database is held within database_dir, in a folder of the name assembly

Ex: For the config file containing
```
assembly: hg19
database_dir: /path/to/databases/
```
Bystro will look for the database /path/to/databases/hg19
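
Both lookup rules in this section come down to joining path components. The sketch below reproduces them in Python with hypothetical helper names; it only builds the expected paths and does not check that anything exists on disk or expand wildcard entries such as hg19.kgXref.chr*.gz.

```python
from pathlib import PurePosixPath


def track_file_paths(files_dir, tracks):
    """Map each track name to the paths where its local_files are expected."""
    base = PurePosixPath(files_dir)
    return {
        track["name"]: [base / track["name"] / name for name in track.get("local_files", [])]
        for track in tracks
    }


def database_path(database_dir, assembly):
    """Return the folder expected to hold the database for `assembly`."""
    return PurePosixPath(database_dir) / assembly


paths = track_file_paths(
    "/path/to/files/",
    [{"name": "refSeq", "local_files": ["hg19.refGene.chr1.gz"]}],
)
print(paths["refSeq"][0])  # /path/to/files/refSeq/hg19.refGene.chr1.gz
print(database_path("/path/to/databases/", "hg19"))  # /path/to/databases/hg19
```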
