-
Notifications
You must be signed in to change notification settings - Fork 442
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to the depth filtering for mpileup. #1084
base: develop
Are you sure you want to change the base?
Conversation
We can now count the depth anywhere along the read rather than only at the left end. It's not buffering reads up so it doesn't know what's about to come, so for any individual column it'll still favour reads that end near a column than start near a column, but by counting at the right hand end we avoid the boom & bust approach to before where the coverage could plumet to zero immediately following a large spike. With iter->depth_pos = 1.0 the coverage ends up being a minimum of maxcnt (assuming the true coverage is higher), rather than a maximum of maxcnt. So we go over-depth, but avoid the drop outs.
In the synthetic testing, the more complex heuristics definitely works better, but they're really hard to understand and get a grasp on. The only reason I think it may be beneficial is the enrichment of paired data. We may wish to consider something like this for a better |
coord. It's not so vital with the revised formula, so within 10 works good enough and prevents some more excessive depth raises. Also (ifdef-ed out) implemented the "other" more complex method, for comparison. Cull whichever we wish before merging.
dbbd34d
to
46105c9
Compare
I thought of one way the old depth method could collapse while in the shower this morning (yes, I do still have those during covid times; occasionally!). The old method attempts to always ensure we have coverage at every base, even if moving on from a deep region, by not filtering data unless it starts at the same location as earlier deep data. Ie it has the code to only filter reads if read pos == current iterator pos. ie multiple reads starting at the same point. It then checks depth at pos. So if all reads have cigar 101M and we have high depth of reads starting at pos 1000, then when iter->pos increments from 1000 to 1001 the first read is added. All subsequent reads at 1001 however match iter->pos and are removed as the number of reads in flight (overlapping 1001) is too high due to the high depth at pos 1000. That gives a crash down to depth 1 periodically. The mistake is that it's checking depth overlapping, not depth of reads starting at 1001. The intention was good, but it wasn't thought through. Secondly, all it takes is for that initial read to be soft-clipped and not actually extend beyond 1100 and we then hit a zero depth output at 1101. I don't think this is the sole cause of our gaps because we're not incrementing by 1 in our amplicon sequencing, but by ~150ish, yet still seeing crashes in depth. So I think it's likely there are other ways that iter->pos can move on. However it shows at least one way the existing method fails. In both cases, the fix is actually to measure the depth at the last base and not the first base, which is coincidentally precisely what the PR is doing. Further still, by doing that, we can now relax the constraint that we only remove reads if the start pos matches the last one. That is tragic on high depth long reads as we pretty much never remove anything. Being able to specify where the depth measurement is taken (from 0.0 being 1st base to 1.0 being last base) we gain some control over how sensitive we wish to be to potential zero depth regions and how wide they may be. It's not as good as a future buffer of reads as we're still figuring out what to do without knowing what's about to come, but it's a simple fix. A better fix, as discussed outside this PR, is a modified |
Unfortunately it looks like the simple solution here can overshoot the target depth quite badly if you set This happens because most of the reads are exactly the same length and the depth added at each position is below the limit. As the depth is being measured at the far end, it starts off at zero for each new reference position so everything can get through. The overshoot is much less bad when counting closer to the start of the read (I tried 50% and 80%), but then the problem with badly undershooting reappears: The example here was from this sam file, I subtracted 104524607 from the positions when making the depth plots to keep the x axis values small. I'm not sure if this problem can be solved without keeping track of a bigger pool of reads... |
We can now count the depth anywhere along the read rather than only at the left end. It's not buffering reads up so it doesn't know what's about to come, so for any individual column it'll still favour reads that end near a column than start near a column, but by counting at the right hand end we avoid the boom & bust approach to before where the coverage could plumet to zero immediately following a large spike.
With iter->depth_pos = 1.0 the coverage ends up being a minimum of maxcnt (assuming the true coverage is higher), rather than a maximum of maxcnt. So we go over-depth, but avoid the drop outs.
The following plot shows an example of fake amplicon based sequencing (350bp amplicons every 300bp) with full unlimited depth, reducing depth on the current develop branch, with this PR, and with an alternative more complex set of heuristics[1]. Plots are staggered marginally in X for clarity only.
[1] Reading simulated data consisting of qname, pos and len (on ref, not len of seq).
The name hash is used as a random number generator for removal of reads, but it gives a bias towards keeping pairs together. It's not perfect by any stretch, being stochastic, but at 1k depth out of a pool of ~300k depth about 1 in 4 reads also had their mate pair, compared to just 6% if using drand48 as the random number generator instead of h.