Maximum-coverage algorithm to select samples for validation and resequencing. This is a reimplementation of SVCollector.
Utmos is a tool for solving the maximum-coverage problem for sample selection. Utmos leverages scikit-allel, joblib, numpy, and hdf5 to allow inputs from extremely large cohorts. Subsets of variants can be converted into smaller processing-ready files, which allows the data intake step to become massively parallel. Then, during selection, Utmos concatenates the subsets and allows control over memory consumption for easier analysis.
Download a release and run
python3 -m pip install Utmos-<version>.tar.gz
Alternatively, build from the repository,
git clone https://github.com/ACEnglish/utmos.git
cd utmos/
python3 -m pip install .
utmos select input1.vcf > sample_report.txt
See the --help
of each command for more details.
Tab-delimited file of:
Column | Description |
---|---|
sample | Sample name |
var_count | Number of variants in the sample |
new_count | Number of new variants contributed by the sample |
tot_captured | Running total of variants captured by all samples up to this point |
pct_captured | Percent of all variants captured by samples up to this point |
Pulls genotype presence/absence information from vcf[.gz]
into numpy arrays, calculates allele frequencies
and saves using joblib. This step is optional, but makes it easier to convert multiple VCFs at once
(with separate jobs). The genotype array is shrunk using numpy.packbits
and joblib compression can reduce file size
even further.
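The effect of numpy.packbits can be sketched with a toy presence/absence matrix (the matrix below is synthetic and the layout is illustrative; utmos' actual internals may differ):

```python
import numpy as np

# Synthetic presence/absence genotype matrix: 1000 variants x 64 samples
rng = np.random.default_rng(0)
genotypes = (rng.random((1000, 64)) < 0.05).astype(np.uint8)

# Pack 8 boolean values into each byte along the sample axis
packed = np.packbits(genotypes, axis=1)
print(genotypes.nbytes, packed.nbytes)  # 64000 8000 -- an 8x reduction

# Round-trip: unpacking restores the original matrix exactly
restored = np.unpackbits(packed, axis=1, count=genotypes.shape[1])
assert np.array_equal(genotypes, restored)
```

joblib compression then operates on this already-smaller array, which is why the two savings stack in the table below.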
As a test, the genotype-only chr22 SNPs from the 1kgp (2,504 samples x 1,103,547 variants) were converted:
File | Size (megabytes) | Fold Decrease |
---|---|---|
VCF | 198 | . |
without packbits | 64 | 3.09 |
with packbits | 29 | 6.82 |
--compress 9 | 26 | 7.62 |
both axis pack* | 12 | 16.5 |
both axis pack* -c 9 | 11 | 18 |
* both axis pack is not yet implemented since its overhead slows runtime slightly. I'll implement it if there's demand
usage: convert [-h] [--no-singleton] [--lowmem] [-B BUFFER] [-c COMPRESS] in_file out_file
--no-singleton
exclude variants which are singletons

--lowmem
lowers memory usage by converting directly into an intermediate hdf5 file

--buffer
an integer passed to scikit-allel during conversion; it sets how many variants are parsed at once. The default value is probably fine for many use-cases, but performance can be tested. Generally, VCFs with extremely high sample counts (100k+) should use a smaller buffer (the default). VCFs without many samples, or runs with plenty of memory available, can use a higher buffer to speed up conversion.

--compress
compression level used by joblib, between 0 (lowest compression, fastest IO) and 9 (highest compression, slowest IO)

in_file
an input vcf[.gz]

out_file
an output .jl file
VCFs can be pre-filtered and piped into convert e.g.:
bcftools view -i "FILTER == 'PASS'" input.vcf.gz | utmos convert /dev/stdin output.jl
Select samples for validation and resequencing.
Works by iteratively selecting samples based on a score and reporting how many variants the sample contains as well as
how many unseen variants are contributed by the sample. By default, scores are variant counts, which can optionally be
weighted with --weights
and/or --af
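The iterative selection described above can be sketched as a toy greedy maximum-coverage loop in numpy (an illustration of the idea only; the function name and details here are mine, not utmos internals):

```python
import numpy as np

def greedy_select(matrix, samples, count):
    """Greedy maximum-coverage sketch: repeatedly pick the sample
    covering the most not-yet-seen variants."""
    remaining = matrix.copy()
    n_variants = matrix.shape[0]
    var_counts = matrix.sum(axis=0)          # total variants per sample
    results, total_captured = [], 0
    for _ in range(count):
        best = int(remaining.sum(axis=0).argmax())
        new = remaining[:, best] > 0         # variants not yet captured
        n_new = int(new.sum())
        total_captured += n_new
        results.append((samples[best], int(var_counts[best]), n_new,
                        total_captured,
                        round(100 * total_captured / n_variants, 2)))
        remaining[new, :] = 0                # drop captured variants
        remaining[:, best] = 0               # drop the chosen sample
    return results

# Tiny example: 5 variants x 3 samples
m = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
report = greedy_select(m, ["A", "B", "C"], 2)
print(report)  # [('A', 3, 3, 3, 60.0), ('B', 2, 1, 4, 80.0)]
```

Each tuple mirrors one row of the sample_report.txt columns: sample, var_count, new_count, tot_captured, pct_captured.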
usage: select [-h] [-c COUNT] [-o OUT] [--debug] [--af] [--weights WEIGHTS]
[--subset SUBSET] [--exclude EXCLUDE] [--lowmem LOWMEM]
[--buffer BUFFER] [--maxmem MAXMEM]
[in_files ...]
one or more input files; these can be a mix of vcf[.gz] files or jl files from utmos convert
. If there was a
previous run of utmos select --lowmem example.hdf5
, a single in_file of that hdf5 file can be specified. This
saves time by not re-concatenating results between multiple runs on a dataset.
Sets how many samples are selected. Can be a count (>= 1) or a fraction of samples if in [0, 1). Select all samples (or until all variants have been captured, whichever comes first) with -1
Reduce bias towards rare/private alleles by using the allele-frequency weighted matrix for scoring. So, instead of 0/1
for variant presence in each cell of the matrix, the value will be presence * AF
A tab-delimited file of samples and their weights. Not every sample in the vcf needs to be given a score in the weight file. Any sample without a provided weight is given a 1.
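The default-to-1 behavior can be sketched as follows (the sample names and weights below are hypothetical):

```python
import csv
import io

# Hypothetical weights file content: sample<TAB>weight, one pair per line
weights_text = "HG00096\t2.0\nHG00171\t0.5\n"

weights = {}
for sample, weight in csv.reader(io.StringIO(weights_text), delimiter="\t"):
    weights[sample] = float(weight)

# Any sample absent from the weights file defaults to a weight of 1
print(weights.get("HG00096", 1.0))  # 2.0
print(weights.get("NA12878", 1.0))  # 1.0
```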
A subset of samples to include in the selection. To help organize your metadata, multiple --subset
arguments can be
provided and they will be concatenated. e.g. if you have subsetA.txt
and would like to add sample HG1234
to that
subset for a utmos run, simply specify utmos select --subset subsetA.txt --subset HG1234 input.jl
A list of samples to exclude from the analysis. Similar to --subset
, you can provide multiple --exclude
arguments.
By default, utmos will hold all variants in memory. This consumes approximately
(num_variants * num_samples * dtype_bytes) / 1e9
GB of memory, where dtype_bytes
is the size per datapoint (typically 4 bytes).
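As a sanity check, the estimate can be computed directly; here it is plugged into the 1kgp chr22 benchmark dimensions used later in this README:

```python
# Back-of-envelope estimate of utmos' in-memory footprint
def estimate_gb(num_variants, num_samples, dtype_bytes=4):
    return (num_variants * num_samples * dtype_bytes) / 1e9

# 1kgp chr22 benchmark: 2,504 samples x 1,103,547 variants
print(f"{estimate_gb(1_103_547, 2_504):.1f} GB")  # 11.1 GB
```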
For datasets with too much data to hold in memory, utmos allows a user to take advantage of hdf5 files and use disk storage instead of memory. However, this comes at the expense of needing to create intermediate files and increases IO, thus one should expect an increased runtime.
The --lowmem
mode works by first concatenating all the input vcf/jl together into an hdf5 file. --buffer
rows will be held in-memory before writing/appending the data into the hdf5 file. Then, for each iteration, the dataset
is reduced by removing the used sample's column and all variants the sample contained from the rows before writing the
reduced matrices to a temporary hdf5 in the $TMPDIR
. If at any point through the iterations the dataset is estimated
to have a total size below --maxmem
, the full dataset is pulled into memory before continuing to the end of the
analysis.
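The per-iteration reduction can be sketched with numpy (an illustration of the idea only; in --lowmem mode utmos streams the reduced matrices to a temporary hdf5 file rather than holding them in memory):

```python
import numpy as np

# Toy matrix: 3 variants (rows) x 3 samples (columns)
matrix = np.array([[1, 0, 1],
                   [1, 1, 0],
                   [0, 1, 1]], dtype=np.uint8)

# After a sample is selected, drop its column and every variant it carried
chosen = 0                                      # index of the selected sample
keep_rows = matrix[:, chosen] == 0              # variants the sample lacked
keep_cols = np.arange(matrix.shape[1]) != chosen
reduced = matrix[keep_rows][:, keep_cols]
print(reduced.shape)  # (1, 2) -- one variant and two samples remain
```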
Name of the hdf5 file which is filled with the concatenated in_files.
Number of variants (rows) to buffer in memory during concatenation before writing to the hdf5 file.
Maximum memory (in GB) available to utmos. For debugging, taking the 'shortcut' to pull the data into memory if/when
possible can be skipped by setting --maxmem 0
. However, when actually performing summation, utmos assumes at least 1GB
of memory is available.
If you have a previous run in which you used --lowmem
to make an hdf5 file, you can reuse it and save
the concatenation/conversion work. Simply provide a single in_file
of file.hdf5
or provide no in_files
and the
parameter --lowmem file.hdf5
. Note that if you create an hdf5 file with select --af
, it will hold the AF weighted
matrix and can only be reused with select --af
.
When running select --af
, variant positions are weighted by their allele frequency. For multi-allelic VCFs, the site is
weighted by the largest allele frequency observed. If this is not the desired behavior, split multi-allelics in the VCF
with bcftools norm -N -m-any
.
Running on a 2013 Mac Pro and using chr22 from 1kgp genotype
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502//ALL.chr22.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
2,504 samples x 1,103,547 variants
Utmos runtime:
real 24m26.507s
user 11m37.160s
sys 5m12.713s
SVCollector runtime: (including 30 seconds to uncompress the VCF)
real 61m34.008s
user 49m28.622s
sys 1m24.693s
On newer hardware (Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz) and running the docker through singularity:
#Utmos
real 6m30.870s
user 5m53.305s
sys 0m36.403s
#SVCollector
real 7m44.110s
user 7m8.516s
sys 0m17.368s
A Dockerfile exists to build an image of utmos. To make a Docker image, clone the repository and run
docker build -t utmos .
You can then run utmos through docker using
docker run -v `pwd`:/data -it utmos
Where pwd
can be replaced by whatever directory you'd like to mount in the container at the path /data/
, which is the working
directory for the utmos run. You can provide parameters directly to the entry point.
docker run -v `pwd`:/data -it utmos convert example.vcf.gz example.jl