Skip to content

Commit

Permalink
feat: implement "seqvars ingest" command (#199) (#206)
Browse files Browse the repository at this point in the history
  • Loading branch information
holtgrewe authored Oct 6, 2023
1 parent a188c70 commit 21b965c
Show file tree
Hide file tree
Showing 67 changed files with 3,840 additions and 4 deletions.
2 changes: 2 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -1,2 +1,4 @@
tests/db/compile/** filter=lfs diff=lfs merge=lfs -text
tests/db/hpo/** filter=lfs diff=lfs merge=lfs -text
tests/seqvars/ingest/db/grch37/** filter=lfs diff=lfs merge=lfs -text
tests/seqvars/ingest/db/grch38/** filter=lfs diff=lfs merge=lfs -text
14 changes: 12 additions & 2 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

6 changes: 5 additions & 1 deletion Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ hgvs = "0.11"
indexmap = { version = "2.0", features = ["serde"] }
itertools = "0.11"
log = "0.4"
mehari = "0.8"
mehari = "0.10"
multimap = "0.9"
procfs = "0.15"
prost = "0.12"
Expand All @@ -47,6 +47,9 @@ thousands = "0.2"
tracing = "0.1"
tracing-subscriber = "0.3"
uuid = { version = "1.4", features = ["v4", "fast-rng", "serde"] }
noodles-vcf = "0.40.0"
rocksdb = { version = "0.21.0", features = ["multi-threaded-cf"] }
noodles-bgzf = "0.24.0"


[build-dependencies]
Expand All @@ -55,6 +58,7 @@ prost-build = "0.12"
[dev-dependencies]
file_diff = "1.0"
float-cmp = "0.9.0"
hxdmp = "0.2.1"
insta = { version = "1.32", features = ["yaml"] }
pretty_assertions = "1.4"
rstest = "0.18.2"
Expand Down
56 changes: 56 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,62 @@ $ varfish-server-worker db mk-inhouse \
[--path-input-tsvs @IN/path-list2.txt]
```

## The `seqvars ingest` Command

This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying.
The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.

- `FORMAT/GT` -- genotype
- `FORMAT/GQ` -- genotype quality
- `FORMAT/DP` -- total read coverage
- `FORMAT/AD` -- allelic depth, one value per allele (including reference0)
- `FORMAT/PS` -- physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow and Dragen variant caller
- `FORMAT/SQ` -- "somatic quality" for each alternate allele, as written out by Illumina Dragen variant caller
- this field will be written as `FORMAT/GQ`

The `seqvars ingest` command will annotate the variants with the following information:

- gnomAD genomes and exomes allele frequencies
- gnomAD-mtDNA and HelixMtDb allele frequencies
- functional annotation following the [VCF ANN field standard](https://pcingola.github.io/SnpEff/adds/VCFannotationformat_v1.0.pdf)

The command will emit one output line for each variant allele from the input and each affected gene.
That is, if two variant alleles affect two genes, four records will be written to the output file.
The annotation will be written out for one highest impact.

Overall, the command will emit the following header rows in addition to the `##contig=<ID=.,length=.>` lines.

```
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=gnomad_exomes_an,Number=1,Type=Integer,Description="Number of samples in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_genomes_an,Number=1,Type=Integer,Description="Number of samples in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD genomes">
##INFO=<ID=helix_an,Number=1,Type=Integer,Description="Number of samples in HelixMtDb">
##INFO=<ID=helix_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in HelixMtDb">
##INFO=<ID=helix_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in HelixMtDb">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##x-varfish-version=<ID=varfish-server-worker,Version=x.y.z>
##x-varfish-version=<ID=orig-caller,Name=Dragen,Version=SW: 07.021.624.3.10.9, HW: 07.021.624>
##x-varfish-version=<ID=orig-caller,Name=GatkHaplotypeCaller,Version=4.4.0.0>
```

> [!NOTE]
> The gnomad-mtDNA information is written to the `INFO/gnomdad_genome_*` fields.
> [!NOTE]
> Future versions of the worker will annotate the worst effect on a MANE select or MANE Clinical transcript.
# Developer Information

This section is only relevant for developers of `varfish-server-worker`.
Expand Down
8 changes: 8 additions & 0 deletions src/common.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,14 @@ pub struct Args {
pub verbose: Verbosity<InfoLevel>,
}

impl Default for Args {
fn default() -> Self {
Self {
verbose: Verbosity::new(0, 0),
}
}
}

/// Helper to print the current memory resident set size via `tracing`.
pub fn trace_rss_now() {
let me = procfs::process::Process::myself().unwrap();
Expand Down
23 changes: 23 additions & 0 deletions src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@

pub mod common;
pub mod db;
pub mod seqvars;
pub mod sv;

use clap::{Args, Parser, Subcommand};
Expand Down Expand Up @@ -33,6 +34,8 @@ enum Commands {
Db(Db),
/// SV filtration related commands.
Sv(Sv),
/// Sequence variant related commands.
Seqvars(Seqvars),
}

/// Parsing of "db *" sub commands.
Expand Down Expand Up @@ -67,6 +70,21 @@ enum SvCommands {
Query(sv::query::Args),
}

/// Parsing of "seqvars *" sub commands.
#[derive(Debug, Args)]
#[command(args_conflicts_with_subcommands = true)]
struct Seqvars {
/// The sub command to run
#[command(subcommand)]
command: SeqvarsCommands,
}

/// Enum supporting the parsing of "sv *" sub commands.
#[derive(Debug, Subcommand)]
enum SeqvarsCommands {
Ingest(seqvars::ingest::Args),
}

fn main() -> Result<(), anyhow::Error> {
let cli = Cli::parse();

Expand Down Expand Up @@ -98,6 +116,11 @@ fn main() -> Result<(), anyhow::Error> {
db::to_bin::cli::run(&cli.common, args)?;
}
},
Commands::Seqvars(seqvars) => match &seqvars.command {
SeqvarsCommands::Ingest(args) => {
seqvars::ingest::run(&cli.common, args)?;
}
},
Commands::Sv(sv) => match &sv.command {
SvCommands::Query(args) => {
sv::query::run(&cli.common, args)?;
Expand Down
Loading

0 comments on commit 21b965c

Please sign in to comment.