Configurable approaches to prep alignment splits
- Adds a configuration option to explore alternatives to our current bgzip/grabix preparation for fastqs, to see if we can improve speed.
- Adds an indexing option using rtg SDF files, which improves speed at the cost of a ~3x larger disk footprint.
Showing 8 changed files with 164 additions and 35 deletions.
@@ -0,0 +1,100 @@
"""Provide indexing and retrieval of files using Real Time Genomics SDF format. | ||
Prepares a sdf representation of reads suitable for indexed retrieval, | ||
normalizing many different input types. | ||
https://github.com/RealTimeGenomics/rtg-tools | ||
""" | ||
import os | ||
import subprocess | ||
|
||
from bcbio import bam, utils | ||
from bcbio.distributed.transaction import file_transaction | ||
from bcbio.pipeline import datadict as dd | ||
from bcbio.provenance import do | ||
|
||
def to_sdf(files, data):
    """Convert a fastq or BAM input into a SDF indexed file.
    """
    # BAM
    if len(files) == 1 and files[0].endswith(".bam"):
        qual = []
        format = ["-f", "sam-pe" if bam.is_paired(files[0]) else "sam-se"]
        inputs = [files[0]]
    # fastq
    else:
        qual = ["-q", "illumina" if dd.get_quality_format(data).lower() == "illumina" else "sanger"]
        format = ["-f", "fastq"]
        if len(files) == 2:
            inputs = ["-l", files[0], "-r", files[1]]
        else:
            assert len(files) == 1
            inputs = [files[0]]
    work_dir = utils.safe_makedir(os.path.join(data["dirs"]["work"], "align_prep"))
    out_file = os.path.join(work_dir,
                            "%s.sdf" % utils.splitext_plus(os.path.basename(os.path.commonprefix(files)))[0])
    if not utils.file_exists(out_file):
        with file_transaction(data, out_file) as tx_out_file:
            cmd = _rtg_cmd(["rtg", "format", "-o", tx_out_file] + format + qual + inputs)
            do.run(cmd, "Format inputs to indexed SDF")
    return out_file

def _rtg_cmd(cmd):
    return "export RTG_JAVA_OPTS='-Xms500m' && export RTG_MEM=2g && " + " ".join(cmd)

def calculate_splits(sdf_file, split_size):
    """Retrieve start-end read ranges covering the SDF file in split_size chunks.
    """
    counts = _sdfstats(sdf_file)["counts"]
    splits = []
    cur = 0
    for i in range(counts // split_size + (0 if counts % split_size == 0 else 1)):
        splits.append("%s-%s" % (cur, min(counts, cur + split_size)))
        cur += split_size
    return splits

def _sdfstats(sdf_file):
    cmd = ["rtg", "sdfstats", sdf_file]
    pairs = []
    counts = []
    lengths = []
    for line in subprocess.check_output(_rtg_cmd(cmd), shell=True).split("\n"):
        if line.startswith("Paired arm"):
            pairs.append(line.strip().split()[-1])
        elif line.startswith("Number of sequences"):
            counts.append(int(line.strip().split()[-1]))
        elif line.startswith("Minimum length"):
            lengths.append(int(line.strip().split()[-1]))
    assert len(set(counts)) == 1, counts
    return {"counts": counts[0], "pairs": pairs, "min_size": min(lengths)}

def min_read_size(sdf_file):
    """Retrieve minimum read size from SDF statistics.
    """
    return _sdfstats(sdf_file)["min_size"]

def is_paired(sdf_file):
    """Check if we have paired inputs in a SDF file.
    """
    pairs = _sdfstats(sdf_file)["pairs"]
    return len(set(pairs)) > 1

def to_fastq_apipe_cl(sdf_file, start=None, end=None):
    """Return command lines to provide streaming fastq input.

    For paired end, returns a forward and reverse command line. For
    single end returns a single command line and None for the pair.
    """
    cmd = ["rtg", "sdf2fastq", "--no-gzip", "-o", "-"]
    if start is not None:
        cmd += ["--start-id=%s" % start]
    if end is not None:
        cmd += ["--end-id=%s" % end]
    if is_paired(sdf_file):
        out = []
        for ext in ["left", "right"]:
            out.append("<(%s)" % _rtg_cmd(cmd + ["-i", os.path.join(sdf_file, ext)]))
        return out
    else:
        cmd += ["-i", sdf_file]
        return ["<(%s)" % _rtg_cmd(cmd), None]
One thing I added in my preprocessing is piped quality format conversion during bgzip re-compression using seqtk. Basically I am reading the first 10k fastq entries with seqtk and extracting average quality, and then parametrize streamed conversion to Sanger format with seqtk|bgzip if the quality range indicates that a file is in Illumina format. That simplifies a lot downstream (and during configuration, since I do not have to know the quality format beforehand), and I have to touch the fastqs anyway.
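To make the detection step concrete, a rough sketch of the idea described above (not the actual seqtk pipeline): scan the quality strings of the first 10k fastq records and infer the encoding from the observed ASCII range. The function name, the record count default and the ASCII 59 cutoff are the usual offset-33 versus offset-64 heuristics, added here as assumptions for illustration.

# Sketch of quality format detection from the first 10k records; files that
# come back as "illumina" would then be converted to Sanger (e.g. via
# seqtk, as described above) during bgzip re-compression.
import gzip
import itertools

def guess_quality_format(fastq_file, n_records=10000):
    """Guess 'sanger' or 'illumina' quality encoding from the first records."""
    opener = gzip.open if fastq_file.endswith(".gz") else open
    min_qual = 255
    with opener(fastq_file, "rt") as in_handle:
        for i, line in enumerate(itertools.islice(in_handle, n_records * 4)):
            if i % 4 == 3 and line.strip():  # every 4th line holds qualities
                min_qual = min(min_qual, min(ord(c) for c in line.strip()))
    # Characters below ASCII 59 only occur with the Sanger/Illumina 1.8+
    # offset of 33; otherwise assume the older Illumina offset of 64.
    return "sanger" if min_qual < 59 else "illumina"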
Sven-Eric;
Thanks for this. As normal, you're reading my mind. We already do the quality conversion as part of bgzip and indexing so everything downstream can assume sanger:
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/alignprep.py#L434
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/alignprep.py#L87
The goal here is to try and make the preparation fast enough that we can always do it, which will enable more parallelization and hopefully better runs as well. I was hoping SDF indices from RTG would be a good solution. They are faster to generate but take ~3x more space than bgzipped fastqs. @Lenbok, are there ways to compress RTG fastq file storage to make them more disk friendly? I love the built in read indexing and retrieval.
We do have auto-detection of format, although only as a check instead of automatically setting it:
https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/ngsalign/alignprep.py#L87
Maybe we could add a quality_format: auto option to guess and set quality formats if folks want to trust bcbio to figure it out. Would that be helpful and save upstream processing?
Great; automatic conversion in bcbio would definitely be a plus. I guess it just depends on the number of records you're looking at to infer the format. 10k were always okay in my experience. I wasn't using the bcbio implementation since I also wanted to parallelize the S3 download and bgzip/conversion over the cluster (bcbio was only using one job for this some time ago). Perhaps that has changed in the meantime. Regarding the RTG fastqs, I'd be fine with larger indexes but not with overall larger data files. In fact, I'd even vote for unaligned 8-binned CRAM if that is a reasonable alternative to fastq for all the external tools (but I guess it isn't yet). I'd assume CRAM would also include the ability for random access, but I'm not so sure if samtools/cramtools/scramble support that nicely enough for bcbio to use that everywhere.
@chapmanb short answer, no, not unless you count the use of --no-names or --no-quality to remove some of the data altogether. Longer answer:
The SDF format uses a fixed-bit-packed representation for storage of nucleotides/amino acids/q-scores (and q-scores are always stored in sanger encoding) in order to get all the parsing/normalization out of the way once and allow direct access to read subsets (or to pull out subsequences of longer data such as chromosomes, e.g. rtg sdfsubseq). While it's better than storing the data uncompressed, there's definitely a large gap between it and a compressed storage format. A couple of years ago we did some experiments with compression (there's even an arithmetic coder implementation in the source) -- it was too slow to be practical, but that was in the context of looking for in-memory RAM reduction during mapping.
The margin is larger now that people have started quantizing the q-scores, which compression techniques naturally take advantage of but which SDF still stores at a little over 6 bits per score. Looking at the q-score distribution of a typical fastq we have here, which seems to have been binned to 7 distinct values, a 0-order compression model should be able to get this to about 1.9 bits per score. That alone would take the 3x down to 2x, and more could be obtained from compression of name or base data.
Thanks for the feedback!
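For reference, the 0-order figure quoted above is just the Shannon entropy of the observed q-score distribution. A small sketch of that calculation follows; the bin counts are made-up placeholders standing in for a fastq binned to 7 quality values, not measured data.

# Bits-per-score achievable by a 0-order model = Shannon entropy of the
# q-score distribution.  The counts below are illustrative placeholders.
import math

def zero_order_bits_per_score(counts):
    """Shannon entropy in bits per symbol for a frequency distribution."""
    total = float(sum(counts))
    return -sum((c / total) * math.log(c / total, 2) for c in counts if c > 0)

binned_qual_counts = [50, 150, 400, 900, 1500, 3000, 4000]  # 7 quality bins
print("%.2f bits per q-score" % zero_order_bits_per_score(binned_qual_counts))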
Len -- thanks for the overview. That helps a ton. The SDF indexing and access is really nice and I understand the space tradeoff.
Sven-Eric -- thanks for the feedback. I'll look at adding in auto-detection as an option. We have the RTG SDF option available but currently default to bgzip and grabix indexed fastqs. They are a bit slower to generate, but parallel bgzip helps quite a bit to make it comparable with RTG's preparation process as long as there are multiple cores to use. I don't know of any CRAM approaches for pulling blocks of reads from unaligned inputs, but would be happy to try them out if you know a practical approach. Thanks again.
Well, samtools can read CRAM by now and is using htslib internally (https://github.com/samtools/htslib).
Somehow this sounds as if extracting blocks of reads should work, in principle.
So, further following up the CRAM proposal (since CRAM will be the dominant read format soon anyway, with support for lossy and greedy reference-based compression also for unaligned reads): obviously one would need someone to implement extraction of unaligned reads by blocks (i.e., give me the first ends of all read pairs with index 1001-2000). Perhaps we could approach these guys: https://github.com/jwalabroad/VariantBam/blob/master/README.md That tool builds on htslib and looks useful for other areas of bcbio as well. Given their powerful rule engine it might be comparatively easy for them to implement paired chunking of reads. Shall we ask?
Guys and gals, I should have said - at least two ladies are on the team.
Sven-Eric;
Nice find with VariantBam, that looks really useful. The ability to downsample by coverage is incredible and something I've looked for in a tool for a while. I don't know about the ability to extend the current CRAM indexes to pull out read blocks, but it definitely seems worth asking if that's of interest to the team. Thanks again.
I approached the htslib people first, since they should know most about the block structure of the CRAM format. samtools/htslib#316