Skip to content

Commit

Permalink
feat: Significantly decreased runtime of simgenotype (#163)
Browse files Browse the repository at this point in the history
Co-authored-by: Arya Massarat <[email protected]>
  • Loading branch information
mlamkin7 and aryarm authored Jan 13, 2023
1 parent 9f7890a commit de011d0
Show file tree
Hide file tree
Showing 16 changed files with 711 additions and 306 deletions.
7 changes: 4 additions & 3 deletions docs/commands/karyogram.rst
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,10 @@ Additional Options
~~~~~~~~~~~~~~~~~~
You may also specify the following options:

* ``--centromeres <FILE>``: Path to a file describing the locations of chromosome ends and centromeres. An example file is given here: ``tests/data/centromeres_hg19.txt``. The columns are: chromosome, chrom_start, centromere, chrom_end. For acrocentric chromosomes, the centromere field is ommitted. This file format was taken from `here <https://github.com/armartin/ancestry_pipeline>`_.
* ``--colors "pop1:color1,pop2:color2..."``: You can optionally specify which colors should be used for each population. If colors are not given, the script chooses reasonable defaults.
* ``--title <TITLE>``: Title for the resulting karyogram.
* ``--centromeres <FILE>`` - Path to a file describing the locations of chromosome ends and centromeres. An example file is given here: ``tests/data/centromeres_hg19.txt``. The columns are: chromosome, chrom_start, centromere, chrom_end. For acrocentric chromosomes, the centromere field is ommitted. This file format was taken from `here <https://github.com/armartin/ancestry_pipeline>`_.
* ``--colors "pop1:color1,pop2:color2..."`` - You can optionally specify which colors should be used for each population. If colors are not given, the script chooses reasonable defaults.
* ``--title <TITLE>`` - Title for the resulting karyogram.
* ``--verbosity <LEVEL>`` - What level of output the logger should print to stdout. Please see `logging levels <https://docs.python.org/3/library/logging.html>`_ for output levels. Default = INFO [Optional]

Example
~~~~~~~
Expand Down
36 changes: 28 additions & 8 deletions docs/commands/simgenotype.rst
Original file line number Diff line number Diff line change
Expand Up @@ -17,21 +17,27 @@ Basic Usage
--mapdir GENETICMAPDIR \
--chroms LIST,OF,CHROMS \
--region CHR:START-END \
--invcf REFVCF \
--ref_vcf REFVCF \
--sample_info SAMPLEINFOFILE \
--out OUTPREFIX
--pop_field \
--out /PATH/TO/OUTPUT.VCF.GZ
Detailed information about each option, and example commands using publicly available files, are shown below.

Parameter Descriptions
~~~~~~~~~~~~~~~~~~~~~~
* ``--model`` - Parameters for simulating admixture across generations including sample size, population fractions, and number of generations.
* ``--mapdir`` - Directory containing all .map files with this `structure <https://www.cog-genomics.org/plink/1.9/formats#map>`_ where the third position is in centiMorgans
* ``--out`` - Full output path to file of the structure ``/path/to/output.(vcf|bcf|vcf.gz|pgen)`` which if ``vcf.gz`` is chosen outputs ``/path/to/output.vcf.gz`` and breakpoints file ``/path/to/output.bp``
* ``--chroms`` - List of chromosomes to be simulated. The map file directory must contain the "chr<CHR>" where <CHR> is the chromosome identifier eg. 1,2,...,X
* ``--region`` - Limit the simulation to a region within a single chromosome. Overwrites chroms with the chrom listed in this region. eg 1:1-10000 [Optional]
* ``--invcf`` - Input VCF file used to simulate specifiic haplotypes for resulting samples
* ``--seed`` - Seed for randomized calculations during simulation of breakpoints. [Optional]
* ``--popsize`` - Population size for each generaetion that is sampled from to create our simulated samples. Default = max(10000, 10*samples) [Optional]
* ``--ref_vcf`` - Input VCF or PGEN file used to simulate specifiic haplotypes for resulting samples
* ``--sample_info`` - File used to map samples in ``REFVCF`` to populations found in ``MODELFILE``
* ``--out`` - Output prefix of the structure ``/path/to/output`` which results in the vcf file ``output.vcf.gz`` and breakpoints file ``output.bp``
* ``--region`` - Limit the simulation to a region within a single chromosome. Overwrites chroms with the chrom listed in this region. eg 1:1-10000 [Optional]
* ``--pop_field`` - Flag for ouputting population field in VCF output. Note this flag does not work when your output is in PGEN format. [Optional]
* ``--sample_field`` - Flag for ouputting sample field in VCF output. Note this flag does not work when your output is in PGEN format. Should only be used for debugging. [Optional]
* ``--verbosity`` - What level of output the logger should print to stdout. Please see `logging levels <https://docs.python.org/3/library/logging.html>`_ for output levels. Default = INFO [Optional]

File Formats
~~~~~~~~~~~~
Expand All @@ -48,10 +54,24 @@ Examples
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--chroms 1,2 \
--invcf tests/data/outvcf_test.vcf \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.vcf \
--sample_info tests/data/outvcf_info.tab \
--pop_field \
--out tests/data/example_simgenotype.vcf
If speed is important, it's generally faster to use PGEN files than VCFs.

.. code-block:: bash
haptools simgenotype \
--model tests/data/outvcf_gen.dat \
--mapdir tests/data/map/ \
--region 1:1-83000 \
--ref_vcf tests/data/outvcf_test.pgen \
--sample_info tests/data/outvcf_info.tab \
--out tests/data/example_simgenotype
--pop_field \
--out tests/data/example_simgenotype.pgen
Detailed Usage
Expand Down
4 changes: 4 additions & 0 deletions docs/project_info/contributing.rst
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,10 @@ Follow these steps to set up a development environment.
Now, try importing ``haptools`` or running it on the command line.

.. note::
If you run into an attribute error ``module 'distutils' has no attribute 'util'`` see `this workaround <https://github.com/python-poetry/poetry/issues/3336#issuecomment-831789763>`_.


---------------------
Managing Dependencies
---------------------
Expand Down
79 changes: 62 additions & 17 deletions haptools/__main__.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,7 +103,11 @@ def karyogram(bp, sample, out, title, centromeres, colors, verbosity):
"--out",
type=str,
required=True,
help="Prefix to name output files.",
help=(
"Path to desired output file. E.g. /path/to/output.vcf.gz "
"Possible outputs are vcf|bcf|vcf.gz|pgen and there will be an "
"additional breakpoints output with extension bp e.g. /path/to/output.bp."
),
)
@click.option(
"--chroms",
Expand All @@ -128,28 +132,48 @@ def karyogram(bp, sample, out, title, centromeres, colors, verbosity):
help="Number of samples to simulate each generation",
)
@click.option(
"--invcf",
"--ref_vcf",
required=True,
help=(
"VCF file used as reference for creation of simulated samples respective "
"genotypes."
"VCF or PGEN file used as reference for creation of simulated samples"
" respective genotypes."
),
)
@click.option(
"--sample_info",
required=True,
help=(
"File that maps samples from the reference VCF (--invcf) to population codes "
"describing the populations in the header of the model file."
"File that maps samples from the reference VCF (--invcf) to population"
" codes describing the populations in the header of the model file."
),
)
@click.option(
"--region",
required=False,
default=None,
help=(
"Subset the simulation to a specific region in a chromosome using the form"
" chrom:start-end. Example 2:1000-2000"
"Subset the simulation to a specific region in a chromosome using the"
" form chrom:start-end. Example 2:1000-2000"
),
)
@click.option(
"--pop_field",
required=False,
is_flag=True,
default=False,
help=(
"Flag for outputting the population field in your VCF output. NOTE this"
" flag does not work when your output file is in PGEN format."
),
)
@click.option(
"--sample_field",
required=False,
is_flag=True,
default=False,
help=(
"Flag for outputting the sample field in your VCF output. NOTE this"
" flag does not work when your output file is in PGEN format."
),
)
@click.option(
Expand All @@ -158,8 +182,8 @@ def karyogram(bp, sample, out, title, centromeres, colors, verbosity):
is_flag=True,
required=False,
help=(
"Flag used to determine whether to only output breakpoints or continue to "
"simulate a vcf file."
"Flag used to determine whether to only output breakpoints or"
" continue to simulate a vcf file."
),
)
@click.option(
Expand All @@ -171,7 +195,7 @@ def karyogram(bp, sample, out, title, centromeres, colors, verbosity):
help="The level of verbosity desired",
)
def simgenotype(
invcf,
ref_vcf,
sample_info,
model,
mapdir,
Expand All @@ -180,6 +204,8 @@ def simgenotype(
seed,
chroms,
region,
pop_field,
sample_field,
only_breakpoint,
verbosity,
):
Expand All @@ -199,6 +225,11 @@ def simgenotype(
log = getLogger(name="simgenotype", level=verbosity)
start = time.time()

# immediately set pop_filed and sample_field flags to false if pgen file
if out.endswith(".pgen"):
pop_field = False
sample_field = False

# parse region and chroms parameters
if not (chroms or region):
raise Exception("Either chroms or region must be specified.")
Expand All @@ -223,25 +254,39 @@ def simgenotype(
if mapdir[-1] == "/":
mapdir = mapdir[:-1]

# grab prefix from --out for outputting breakpoint
out_prefix = re.split(r"(\.vcf|\.bcf|\.vcf\.gz|\.pgen)$", out)[0]

# simulate breakpoints
popsize = validate_params(
model, mapdir, chroms, popsize, invcf, sample_info, region, only_breakpoint
model, mapdir, chroms, popsize, ref_vcf, sample_info, region, only_breakpoint
)
samples, breakpoints = simulate_gt(
samples, pop_dict, breakpoints = simulate_gt(
model, mapdir, chroms, region, popsize, log, seed
)
breakpoints = write_breakpoints(samples, breakpoints, out, log)
breakpoints = write_breakpoints(samples, pop_dict, breakpoints, out_prefix, log)
bp_end = time.time()

# simulate vcfs
vcf_start = time.time()
if not only_breakpoint:
output_vcf(breakpoints, chroms, model, invcf, sample_info, region, out, log)
output_vcf(
breakpoints,
chroms,
model,
ref_vcf,
sample_info,
region,
pop_field,
sample_field,
out,
log,
)
end = time.time()

log.debug(f"Time elapsed for breakpoint simulation: {bp_end - start}")
log.debug(f"Time elapse for creating vcf: {end - vcf_start}")
log.debug(f"Time elapsed for simgenotype execution: {end - start}")
log.debug(f"Time elapsed for creating vcf: {end - vcf_start}")
log.debug(f"Total time elapsed for simgenotype execution: {end - start}")


@main.command()
Expand Down
20 changes: 20 additions & 0 deletions haptools/data/genotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -569,6 +569,26 @@ def check_maf(
raise ValueError(msg)
return maf

def check_sorted(self):
"""
Check that the variant coordinates are sorted
Raise a ValueError if any of variants in any of the chromosomes are unsorted
Raises
------
ValueError
If any variant position is less than a position that comes before it within
the same chromosome
"""
chroms = set(self.variants["chrom"])
for chrom in chroms:
positions = self.variants["pos"][self.variants["chrom"] == chrom]
if not np.all(positions[:-1] <= positions[1:]):
raise ValueError(
f"The variants in chromosome '{chrom}' are not sorted by position"
)


class GenotypesRefAlt(Genotypes):
"""
Expand Down
Loading

0 comments on commit de011d0

Please sign in to comment.