
Extracting blocks of unaligned reads from CRAM #316

Open
schelhorn opened this issue Jan 19, 2016 · 6 comments
@schelhorn

I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.

This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first (due to the index). For us, the ability to do so is highly relevant to replace the aged FASTQ with unaligned CRAM for WGS read mapping.

Note that this is different from extracting reads that align to a window of the reference genome; I would instead expect to extract the unaligned reads in the order of the FASTQ file that was used to generate the unaligned CRAM. A nice addition would be an option for filtering by mate ID, so that one could specify which end (or both) should be returned. Might this be possible while also limiting IO, by using the blocks/slices of the CRAM format?

schelhorn referenced this issue in bcbio/bcbio-nextgen Jan 19, 2016
- Adds a configuration option to explore alternatives to our current
  bgzip/grabix preparation for fastqs to see if we can improve speed.
- Add indexing option with rtg SDF files, which improves speed at the
  cost of a 3x larger disk footprint.
@jkbonfield
Contributor

On Tue, Jan 19, 2016 at 08:51:38AM -0800, Sven-Eric Schelhorn wrote:

> I was wondering if htslib/samtools support extraction of blocks of unaligned reads similar to what a bgzipped FASTQ file that has been indexed with grabix currently allows.
>
> This is relevant for processing a single FASTQ across many compute nodes by allowing each process (such as a read mapper) to extract windows of reads (such as "the third block of 1M reads") without having to seek through the whole FASTQ file first (due to the index). For us, the ability to do so is highly relevant to replace the aged FASTQ with unaligned CRAM for WGS read mapping.

It's doable in an around-the-houses sort of way.

If you build an index of the cram file then the .crai file contains
the offsets of each cram container / slice. You can then produce
valid cram streams by simply using dd on the input file to stitch
together the header block with other blocks, taking care to handle the
eof case too. It's messy, but demonstrates the principle.

I've done this manually to produce small cram files demonstrating
bugs, where a huge file has the issue and I want to produce a smaller
test case.

The .crai file is just gzipped text. I wouldn't guarantee it will never change, but you could explore whether this technique is enough to cover your needs. If so, then we can consider getting a more robust and supported implementation into htslib, to avoid external tools parsing the .crai themselves.
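Since the .crai is just gzipped, tab-separated text, it can be read with a few lines of Python. This is a sketch only: the six-column interpretation below (reference id, alignment start, alignment span, container byte offset, slice offset within the container, slice size) follows the CRAM specification's index description, and as noted above the format isn't guaranteed to stay stable.

```python
# Sketch: parse a .crai index directly. Column meanings are an
# assumption taken from the CRAM specification; the format may change.
import gzip
from typing import List, NamedTuple

class CraiEntry(NamedTuple):
    ref_id: int            # reference sequence number (-1 for unmapped)
    aln_start: int         # alignment start position
    aln_span: int          # reference bases spanned
    container_offset: int  # absolute byte offset of the container in the .cram
    slice_offset: int      # byte offset of the slice within the container
    slice_size: int        # slice size in bytes

def read_crai(path: str) -> List[CraiEntry]:
    """Read every entry from a gzipped .crai file."""
    with gzip.open(path, "rt") as fh:
        return [CraiEntry(*map(int, line.split()[:6]))
                for line in fh if line.strip()]
```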

James


@chapmanb

James -- thanks so much. This sounds promising, although the bits about producing valid CRAM streams with dd are beyond my CRAM skill level in terms of producing a proof-of-principle implementation for bcbio. I could do the .crai parsing bits and test it if I had a better interface or guidance on how to do the retrieval. Thanks again for the ideas.

@jkbonfield
Contributor

I admit it's esoteric! But here's a worked example.

$ zcat _.cram.crai
19      18868      136639   7362       210   103725
19      155366     95912    111322     212   93967
19      251145     86228    205527     214   120408
19      337223     88123    326176     214   118523
...[cut for brevity]...
19      62737042   84894    65556969   210   179619
19      62821820   83512    65736827   206   105559
19      62905183   60287    65842621   218   109915

These are reference numbers (counting from 0, so 19 being chr20 of this sample). Staden io_lib has a cram_dump utility (hideously written, but a debugging aid for me primarily) that reveals a bit more about the structure of this file.

$ cram_dump _.cram
File definition structure
    Major vers: 3
    Minor vers: 0
    file_id:    /tmp/_.cram

BAM header:
@HD     VN:1.4  SO:coordinate
...[rest of header]...

Container pos 7362 size 103935
    Ref id:            19
    Ref pos:           18868 + 136639
    Rec counter:       0
    No. recs:          10000
    No. bases          1466522
    No. blocks:        24
    No. landmarks:     1
    {210}
...
... [lots more data ending with last real container] ...
Container pos 65842621 size 110133
    Ref id:            19
    Ref pos:           62905183 + 60287
    Rec counter:       6510000
    No. recs:          6772
    No. bases          995898
    No. blocks:        24
    No. landmarks:     1
    {218}

...[followed by EOF block, actually an empty container itself]...
Container pos 65952783 size 15
    Ref id:            -1
    Ref pos:           4542278 + 0
    Rec counter:       0
    No. recs:          0
    No. bases          0
    No. blocks:        1
    No. landmarks:     0
    {}

This shows the reference locations as well as the file offsets. The reference location "id 19 pos 18868 + 136639" is the same data you're seeing in the first line of the index file, and the file location "pos 7362 size 103935" is in there too. The container sizes disagree slightly, though: the difference in file offset between the 1st and 2nd containers is 111322 - 7362 == 103960, which matches neither reported size because they differ on whether the container header structure itself is included. The easy solution is simply to subtract successive start offsets.
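That offset arithmetic can be captured in a tiny helper (hypothetical name), taking the container start offsets from two successive .crai lines:

```python
def container_byte_range(start_offset: int, next_offset: int):
    """Return (skip, count) for copying whole containers, dd-style.
    Subtracting successive container start offsets avoids caring
    whether any reported size includes the container header."""
    return start_offset, next_offset - start_offset
```

With the offsets from the listings in this thread, `container_byte_range(7362, 111322)` gives a count of 103960 bytes for the first container, and `container_byte_range(38701731, 39459990)` gives the 758259 bytes used in the dd example below.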

To stitch a valid CRAM file together, we need the CRAM file-definition header, the BAM header (i.e. everything before the first container: the first 7362 bytes in my example), a series of containers, and the footer: the EOF block, which for CRAM v3 is 38 bytes long.
So, for 9 containers taken out of the middle of the .crai file:

@ seq3a[/tmp]; zcat _.cram.crai|head -400|tail
19  39697477    89695   38701731    208 80472
19  39787022    90904   38782440    202 79128
19  39877779    93587   38861799    206 94698
19  39971219    85377   38956732    200 80250
19  40056448    90739   39037211    204 79074
19  40147042    92103   39116518    206 88262
19  40239024    91659   39205015    210 85474
19  40330540    92800   39290728    208 94931
19  40423192    93575   39385896    202 73863
19  40516618    93779   39459990    204 78303
@ seq3a[/tmp]; expr 39459990 - 38701731
758259
@ seq3a[/tmp]; ls -l _.cram
-rw-r--r-- 1 jkb team117 65952821 Jan 19 18:01 _.cram
@ seq3a[/tmp]; expr 65952821 - 38
65952783
@ seq3a[/tmp]; (dd if=_.cram bs=1 count=7362; dd if=_.cram bs=1 skip=38701731 count=758259; dd if=_.cram bs=1 skip=65952783) > _sub.cram
7362+0 records in
7362+0 records out
7362 bytes (7.4 kB) copied, 0.0157573 s, 467 kB/s
758259+0 records in
758259+0 records out
758259 bytes (758 kB) copied, 0.858084 s, 884 kB/s
38+0 records in
38+0 records out
38 bytes (38 B) copied, 4.7836e-05 s, 794 kB/s
@ seq3a[/tmp]; samtools view _sub.cram | wc
  90000 1264796 34238109
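The same cut-and-shunt can be done without dd. Here is a sketch of the pipeline above in Python; `header_len`, `start` and `end` come from the .crai, the 38-byte EOF assumes CRAM v3, and (matching the caveats in this thread) no validity checking is done.

```python
EOF_LEN = 38  # CRAM v3 EOF block size, per the worked example above

def extract_containers(cram_path: str, out_path: str,
                       header_len: int, start: int, end: int) -> None:
    """Stitch headers + a byte range of whole containers + EOF block.
    header_len: bytes before the first container (file + SAM headers).
    start/end: container start offsets from two .crai lines."""
    with open(cram_path, "rb") as src, open(out_path, "wb") as dst:
        dst.write(src.read(header_len))    # CRAM definition + BAM header
        src.seek(start)
        dst.write(src.read(end - start))   # the selected containers
        src.seek(-EOF_LEN, 2)              # whence=2: relative to file end
        dst.write(src.read(EOF_LEN))       # trailing EOF container
```

Equivalent to the three dd invocations: `extract_containers("_.cram", "_sub.cram", 7362, 38701731, 39459990)`.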

Yes, it's hideous, but it demonstrates that cut-and-shunt of CRAM can work in principle. However, I'm not sure we'll get time to build a proper API for graceful container splitting in CRAM any time soon.

I'm not promoting this as an official way of parallelising CRAM processing: there's no check here for the CRAM version number or for whether the EOF is valid, and even the .crai file format may change at some stage in the future. Hence the need for an abstracted API (one day). But meanwhile a home-brew hacky solution is possible.

@mcshane
Contributor

mcshane commented May 28, 2019

@jkbonfield Can I resurrect this thread? The 100 Gbp+ yields coming off the PacBio Sequel II and ONT PromethION mean we are getting large unaligned files at the start of the assembly process (BAM for PacBio at the moment, and maybe CRAM via https://github.com/EGA-archive/ont2cram for ONT). Many operations that were fine to parallelise per-run at the old Sequel I and MinION yields would now benefit from being able to shard these much larger files into chunks of N reads. The thread was originally about CRAM, but ideally this would work for BAM via the bai or csi index too...

@jkbonfield
Contributor

jkbonfield commented May 29, 2019

This was implemented, but not in htslib. See io_lib's cram_filter.

It's a bit clunky, as it doesn't have an obvious way of telling you how many containers exist (other than that when you get zero reads back you know you're at the end), but for example:

$ samtools view -c /tmp/pb.cram
9999

$ cram_filter -n 5-10 /tmp/pb.cram - | samtools view -c
3878

Note you can do cram_index on an unsorted file and then zcat pb.cram.crai | wc -l to work out how many containers there are. (They start from zero for cram_filter.)
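Turning that container count into shards for cram_filter is then just arithmetic. A sketch, assuming -n takes zero-based, inclusive first-last ranges as the examples above suggest:

```python
def shard_ranges(n_containers: int, shard_size: int):
    """Yield (first, last) container index pairs, zero-based and
    inclusive, in the shape cram_filter's -n option expects."""
    for first in range(0, n_containers, shard_size):
        yield first, min(first + shard_size, n_containers) - 1

# Hypothetical usage: one cram_filter job per shard, e.g.
#   for first, last in shard_ranges(n, 100):
#       subprocess.run(["cram_filter", "-n", f"{first}-{last}",
#                       "pb.cram", f"shard_{first}.cram"])
```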

Doing it in BAM is harder. We can't index a file that isn't sorted, and there's no way to do random access on unsorted files. In theory we could use the bgzip index (.gzi), but the BAM format has no way to indicate which blocks start on read boundaries and which do not. It's basically a bad format for this type of work.

Edit: I take that back; an unaligned file is "sorted" in as far as it'll permit bai/csi indexing, so maybe it's possible. I know people (not us) wrote tools to do chunking of BAM based on genomic coords and the indices, but I don't think anyone has managed it for arbitrary slices of the Nth read. Technically possible, but a mess.

@jmarshall
Member

Re BAM, this would probably be a use case for the proposed splitting index (see samtools/hts-specs#321) that hasn't been finalised (or implemented in HTSlib / samtools).
