Fixed a bug in pileup mode relating to number of haplotypes #8489
Conversation
Github actions tests reported job failures from actions build 5931710172
Github actions tests reported job failures from actions build 5931724901
Github actions tests reported job failures from actions build 5933685416
One minor change and a request to update the unit tests to reflect the new reality a little better.
@davidbenjamin This is a reversion of a fix you made recently for the non-determinism... it turns out that in some edge cases this was producing assembly windows with 50k+ haplotypes (because we were taking everything that tied at the 5th ranked score, without limit...). Now we are arbitrarily breaking ties based on string comparison, which at least prevents us from exploding...
Do you have any ideas for an alternative tie-breaking scheme for haplotypes? It seems these ties were happening much more than expected (probably because most haplotypes end up with (length - k) as their score, since their kmers all appear somewhere in the reads anyway...). This would also seem to bias us toward mostly selecting the longest haplotypes in every case when inserting events...
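To make that concern concrete, here is a minimal, hypothetical sketch of the kind of kmer-support score being discussed; kmerize and supportedKmerCount are illustrative names, not the actual GATK methods:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Minimal sketch of the kmer-support score discussed above; names are illustrative, not GATK code. */
public final class KmerSupportScoreSketch {

    /** All kmers of the given size in a sequence. */
    static Set<String> kmerize(final String sequence, final int k) {
        final Set<String> kmers = new HashSet<>();
        for (int i = 0; i + k <= sequence.length(); i++) {
            kmers.add(sequence.substring(i, i + k));
        }
        return kmers;
    }

    /**
     * Score = number of haplotype kmers that also occur in at least one read.
     * A haplotype of length L has L - k + 1 kmers (roughly its length), so a fully
     * supported haplotype scores roughly in proportion to its length, which is the
     * bias toward long haplotypes (and the frequent ties among them) described above.
     */
    static long supportedKmerCount(final String haplotype, final List<String> reads, final int k) {
        final Set<String> readKmers = new HashSet<>();
        for (final String read : reads) {
            readKmers.addAll(kmerize(read, k));
        }
        return kmerize(haplotype, k).stream().filter(readKmers::contains).count();
    }
}
```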
...va/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java
...roadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtilsUnitTest.java
20 10099055 . T C 47.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.460;DP=19;ExcessHet=0.0000;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=55.52;MQRankSum=-3.419;QD=2.65;ReadPosRankSum=-0.830;SOR=0.330 GT:AD:DP:GQ:PL 0/1:15,3:18:55:55,0,557
20 10099079 . C T 40.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.694;DP=23;ExcessHet=0.0000;FS=2.192;MLEAC=1;MLEAF=0.500;MQ=56.33;MQRankSum=-3.254;QD=1.77;ReadPosRankSum=-0.568;SOR=1.460 GT:AD:DP:GQ:PL 0/1:19,4:23:48:48,0,551
20 10099140 . G T 583.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.176;DP=49;ExcessHet=0.0000;FS=6.794;MLEAC=1;MLEAF=0.500;MQ=60.07;MQRankSum=1.344;QD=12.42;ReadPosRankSum=-0.466;SOR=1.944 GT:AD:DP:GQ:PL 0/1:28,19:47:99:591,0,932
20 10099046 . T C 31.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.842;DP=17;ExcessHet=0.0000;FS=3.701;MLEAC=1;MLEAF=0.500;MQ=56.07;MQRankSum=-3.254;QD=1.86;ReadPosRankSum=-0.597;SOR=0.090 GT:AD:DP:GQ:PL 0/1:15,2:17:39:39,0,564
These differences are mostly noise having to do with exactly which haplotypes come out... it is interesting that ties at the 5th score level seem to be very common... We really need to replace this heuristic with something not quite so terrible... @davidbenjamin
@jamesemery It seems that the intent here is to corral the combinatorial explosion of paths in the de Bruijn graph that don't correspond to actual phasing. However, by using the same kmer size as in assembly, the heuristic is completely toothless! It's basically a tautology that all the kmers of an assembled haplotype show up in some read, since otherwise the kmer wouldn't exist in the graph.
A better approach would be to use the same heuristic but with larger kmers than used in assembly. For example, using k = 30 would get rid of spurious phasing within 30 bases.
Also, I don't think that using a larger k is merely a tie-breaker for the extant approach. Rather, what we have now should be replaced completely.
I don't think I am done pondering this but that's a good start.
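As a toy illustration of that proposal (reusing the hypothetical KmerSupportScoreSketch above; the sequences and kmer sizes are made up for the example), scoring with an evaluation k larger than the assembly k lets unsupported phasings lower the score:

```java
import java.util.List;

/** Toy demo of the "larger evaluation k" idea, reusing the hypothetical sketch above. */
public class LargerEvaluationKmerDemo {
    public static void main(String[] args) {
        // A non-repetitive backbone with two single-base events injected 20 bases apart.
        // Each event is seen in reads on its own, but never phased together on one read.
        final String ref = "ACGTTTGACCTAGGATCCAAGGTTCAGTCAGGATTACAGCTGGCATTAACGGTCAGATCC";
        final String readA = ref.substring(0, 10) + "C" + ref.substring(11);   // event 1 only
        final String readB = ref.substring(0, 30) + "A" + ref.substring(31);   // event 2 only
        final String phasedHap = ref.substring(0, 10) + "C" + ref.substring(11, 30) + "A" + ref.substring(31);
        final List<String> reads = List.of(readA, readB);

        // k = 10: no 10-mer can contain both events (they are 20 bases apart), so every kmer
        // of the phased haplotype occurs in readA or readB and the score is maximal: the
        // heuristic cannot tell that the phasing itself is unsupported.
        System.out.println(KmerSupportScoreSketch.supportedKmerCount(phasedHap, reads, 10));

        // k = 30: the 30-mers spanning both events occur in neither read, so the score of the
        // spuriously phased haplotype drops.
        System.out.println(KmerSupportScoreSketch.supportedKmerCount(phasedHap, reads, 30));
    }
}
```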
One might also ask whether an excess of haplotypes means that instead of filtering we should just redo assembly with a larger k, or dynamically switch to linked de Bruijn graphs. That could well be.
Of course, in pileup mode that's moot.
That seems like a good observation... It would also make it harder for every haplotype to be maximally represented by reads! I think it might also be worth considering normalizing to the length of the haplotype as well, to prevent ourselves from being biased towards longer haplotypes whose incidental (and distant) insertions end up over-represented in the choices here... Alternatively, we could flip the score on its head, where the score is the absolute number of kmers missing from the reads, which should more or less be capped at ~(2k - 1) but not have this gross property of preferring longer haplotypes.
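A hedged sketch of that flipped score, building on the hypothetical KmerSupportScoreSketch above and assuming it sits alongside it (the ~(2k - 1) cap is the estimate from the comment, not something the code enforces):

```java
import java.util.List;

/** Hypothetical sketch of the flipped score: count missing kmers, so lower is better. */
public final class MissingKmerScoreSketch {

    /**
     * Number of the haplotype's distinct kmers that appear in no read. A spurious junction
     * only breaks the kmers that overlap it, so this score stays small regardless of
     * haplotype length and no longer rewards long haplotypes the way counting supported
     * kmers does.
     */
    static long missingKmerCount(final String haplotype, final List<String> reads, final int k) {
        final long totalKmers = KmerSupportScoreSketch.kmerize(haplotype, k).size();
        return totalKmers - KmerSupportScoreSketch.supportedKmerCount(haplotype, reads, k);
    }
}
```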
I agree that counting non-missing kmers and thereby rewarding length is wrong.
Since pileup haplotypes are constructed by injecting pileup events into base haplotypes, why not compute the score at that point as follows: 1) make a single kmer containing the pileup event and any nearby events on the base haplotype; 2) check whether this kmer is supported in the reads (if not, it's a spurious phasing); 3) if too many pileup haplotypes pass, repeat with a larger k.
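A hypothetical sketch of steps 1 and 2, with plain strings standing in for the actual GATK haplotype and event types, and the event-spanning kmer simplified to a window centered on the injected event (the real proposal would also widen it to cover nearby base-haplotype events; step 3 would simply wrap this call with a larger k):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the proposed check; names and types are illustrative, not GATK's. */
public final class PileupEventKmerCheckSketch {

    /** Collect every kmer of the given size present in the reads. */
    static Set<String> readKmers(final List<String> reads, final int k) {
        final Set<String> kmers = new HashSet<>();
        for (final String read : reads) {
            for (int i = 0; i + k <= read.length(); i++) {
                kmers.add(read.substring(i, i + k));
            }
        }
        return kmers;
    }

    /**
     * Take the single kmer of the candidate pileup haplotype centered on the injected
     * event (eventOffset is the event's index within the haplotype, which is assumed to
     * be at least k bases long). If that kmer never occurs in the reads, the phasing it
     * implies is spurious.
     */
    static boolean eventKmerIsSupported(final String pileupHaplotype, final int eventOffset,
                                        final List<String> reads, final int k) {
        final int start = Math.min(Math.max(0, eventOffset - k / 2), pileupHaplotype.length() - k);
        final String eventKmer = pileupHaplotype.substring(start, start + k);
        return readKmers(reads, k).contains(eventKmer);
    }
}
```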
@davidbenjamin I'm going to open a ticket to revisit this heuristic... I think the band-aid here is sufficient and necessary to get into the next release.
Agreed.
One last-minute fix, then I think this can be merged.
// final ordering determined by string representation (for determinism).
return onlyNewHaplotypes.stream()
        .filter(h -> scores.get(h) >= minimumPassingScore)
        .sorted(Comparator.comparing(Haplotype::getBaseString))
Actually I think you need to change this... you need Comparator.comparing(score).thenComparing (need to check the syntax on that), because right now this is shuffling the score > minPassingScore haplotypes, which is probably fine most of the time but could end up not selecting the top-scoring hap.
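For reference, a sketch of the suggested ordering (not the committed code); it assumes the scores map from the snippet above holds integer scores, and uses the Haplotype base string only as a tie-breaker:

```java
import java.util.Comparator;
import java.util.Map;

import org.broadinstitute.hellbender.utils.haplotype.Haplotype;

final class HaplotypeOrderingSketch {
    // Sketch only: sort primarily by score, descending, so that any downstream limit keeps
    // the top-scoring haplotypes, and use the base string purely as a deterministic
    // tie-breaker. The Integer score type is an assumption made for illustration.
    static Comparator<Haplotype> byScoreThenBaseString(final Map<Haplotype, Integer> scores) {
        return Comparator.comparingInt((Haplotype h) -> scores.get(h)).reversed()
                .thenComparing(Haplotype::getBaseString);
    }
}
```

Applying a limit after sorting with such a comparator would keep the best-scoring haplotypes rather than an arbitrary alphabetical subset.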
Once you fix that subtlety in the comparator feel free to merge this branch.
Github actions tests reported job failures from actions build 6000851764
* Limited the number of haplotypes returned in pileup mode to the number requested.
* Modified pileup assembly code to be more deterministic.
Recent refactoring seems to have introduced a bug in pileup mode whereby the limit on the number of haplotypes to be considered was not enforced. With this patch, the limit is enforced again, and HashSets were changed to LinkedHashSets to preserve determinism.
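As a generic illustration of why that last change matters (plain Java, not GATK code): a HashSet iterates in hash-table order rather than insertion order, so any code that walks the set to pick the "first" few haplotypes is easy to perturb, while a LinkedHashSet iterates in insertion order and keeps that behavior deterministic.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SetOrderDemo {
    public static void main(String[] args) {
        final List<String> names = List.of("hapC", "hapA", "hapB");

        // HashSet iteration order follows the hash table layout, not insertion order, so code
        // that iterates it (e.g. to keep the first N elements) is easy to perturb by unrelated changes.
        final Set<String> hashed = new HashSet<>(names);

        // LinkedHashSet iterates in insertion order, so the same code becomes deterministic.
        final Set<String> linked = new LinkedHashSet<>(names);

        System.out.println("HashSet:       " + hashed);
        System.out.println("LinkedHashSet: " + linked); // always [hapC, hapA, hapB]
    }
}
```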