Added indel support to SamLocusIterator #408
Hi @magicDGS, thanks for doing this! I'll try and take a look at this in the next day or so. In the mean time two quick things:
Sorry, I forgot to add some changes in my local repository before PR regarding the test. It should be fixed in commit b48aa0a
Fixed hard-tabs in commit 1572366
@@ -75,12 +77,15 @@ public RecordAndOffset(final SAMRecord record, final int offset) {

 /**
  * The unit of iteration. Holds information about the locus (the SAMSequenceRecord and 1-based position
- * on the reference), plus List of ReadAndOffset objects, one for each read that overlaps the locus
+ * on the reference), plus List of ReadAndOffset objects, one for each read that overlaps the locus;
+ * two more List of ReadAndOffset objects includes reads that overlaps the locus with insertions/deletions
this doesn't read quite right to me. How about:
two more List_s_ of ReadAndOffset objects include reads that overlap the locus with insertions and deletions respectively.
Done in ac5c8b5
Hi @magicDGS there's one case that I think doesn't work if I've understood the code correctly. Either way it would be great to add a test for it to ensure/show that it works. My concern is that I think if you encounter a stream of record with the same alignment start position, and the first read starts with an
I think in this case that when Since you can't rely on any specific ordering of reads with the same alignment start position, you may need to change things further such that you pull off all the reads with the same alignment start position, order them by whether or not they start with an indel, and then accumulate from all of them.
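The batching idea in the comment above could look roughly like the following. This is only a sketch under stated assumptions: `Read`, `nextBatch`, and the `startsWithIndel` flag are hypothetical names, not htsjdk classes or the code in this PR, and placing indel-first reads at the front of the batch is an assumed ordering, since the comment does not say which direction to sort.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;

final class SameStartBatchSketch {
    /** Hypothetical minimal read; it stands in for a SAMRecord and is not an htsjdk type. */
    static final class Read {
        final String name;
        final int alignmentStart;
        final boolean startsWithIndel;

        Read(final String name, final int alignmentStart, final boolean startsWithIndel) {
            this.name = name;
            this.alignmentStart = alignmentStart;
            this.startsWithIndel = startsWithIndel;
        }
    }

    /**
     * Removes from the head of a coordinate-sorted queue every read that shares the
     * first read's alignment start, and returns them with indel-first reads in front.
     */
    static List<Read> nextBatch(final Deque<Read> queue) {
        final List<Read> batch = new ArrayList<>();
        if (queue.isEmpty()) {
            return batch;
        }
        final int start = queue.peekFirst().alignmentStart;
        while (!queue.isEmpty() && queue.peekFirst().alignmentStart == start) {
            batch.add(queue.pollFirst());
        }
        // Boolean false sorts before true, so reads that start with an indel come first.
        // List.sort is stable, so the incoming order is otherwise preserved.
        batch.sort(Comparator.comparing((Read r) -> !r.startsWithIndel));
        return batch;
    }
}
```

With a batch in hand, the accumulation step no longer depends on the arbitrary order in which reads with the same start arrive.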
// iterate over the cigar element
for (int elementIndex = 0; elementIndex < cigar.size(); elementIndex++) {
final CigarElement e = cigar.get(elementIndex);
switch (e.getOperator()) {
This switch statement would, I think, be clearer written as follows:
if (operator == I) // handle insertion
else if (operator == D) // handle deletion
else {
if (operator.consumesReadBases()) readBase += e.getLength();
if (operator.consumesReferenceBases()) refBase += e.getLength();
}
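To make the suggested shape concrete, here is a self-contained sketch of the same if/else walk. The nested `Op` and `Element` types are hypothetical stand-ins that only mimic the read/reference consumption flags of CIGAR operators; they are not the htsjdk classes. Reporting an indel at the reference base just before it is likewise an assumption made for illustration, not necessarily what the PR does.

```java
import java.util.ArrayList;
import java.util.List;

final class CigarWalkSketch {
    /** Hypothetical operator model: (consumes read bases, consumes reference bases). */
    enum Op {
        M(true, true), I(true, false), D(false, true), N(false, true), S(true, false), H(false, false);

        final boolean consumesReadBases;
        final boolean consumesReferenceBases;

        Op(final boolean consumesReadBases, final boolean consumesReferenceBases) {
            this.consumesReadBases = consumesReadBases;
            this.consumesReferenceBases = consumesReferenceBases;
        }
    }

    /** Hypothetical CIGAR element: an operator plus its length. */
    static final class Element {
        final Op op;
        final int length;

        Element(final Op op, final int length) {
            this.op = op;
            this.length = length;
        }
    }

    /** Lists each indel with the reference base before it and the next read base (both 1-based). */
    static List<String> findIndels(final List<Element> cigar, final int alignmentStart) {
        final List<String> events = new ArrayList<>();
        int refBase = alignmentStart; // next reference position to consume
        int readBase = 1;             // next read position to consume
        for (final Element e : cigar) {
            if (e.op == Op.I) {
                // handle insertion: only the read advances
                events.add("I ref=" + (refBase - 1) + " read=" + readBase);
                readBase += e.length;
            } else if (e.op == Op.D) {
                // handle deletion: only the reference advances
                events.add("D ref=" + (refBase - 1) + " read=" + readBase);
                refBase += e.length;
            } else {
                if (e.op.consumesReadBases) readBase += e.length;
                if (e.op.consumesReferenceBases) refBase += e.length;
            }
        }
        return events;
    }
}
```

The final `else` branch covers every remaining operator through the two consumption flags, which is what lets the two indel cases stand out without a long `switch`.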
Done in 5d23bdd
@magicDGS I've made a first pass through the code, but will likely have more comments once you've addressed the first batch. @alecw Would you mind taking a look at this PR, perhaps after @magicDGS has made the first round of revisions? As the original author of
Hi @tfenne, it is true that the iterator is not working on the test case that you mention and I will modify the
Hello @tfenne, for a reason that I do not fully understand, Do you have any clue about what is happening? When changing the test in the original implementation of Thank you very much in advance. EDIT: fixed in #421 and merged in this PR
Truth be told, I'm not the original author. We inherited this code from @kcibul. Anyway, this code is pretty complicated (apart from the changes) and I don't think I have the time to get my head into this enough to comment intelligently. One suggestion I would make, however, is that it would be good for the javadoc to explain how insertions and deletions will be handled differently according to the value of includeIndels. As it stands now I don't know what the existing behavior is and what the new behavior will be. -Alec
Although I'm still working on this, I found that GATK4 already implemented a I do not know whether it is worth implementing this option in htsjdk, or whether anyone interested in that behavior should use the GATK4 framework directly. What do you think, @tfenne?
@tfenne Are you still interested in getting this into htsjdk? It sounds like @magicDGS has found a solution in GATK4 for the original need that prompted this PR. If you are interested, could you comment before the next htsjdk review party on 5/3? Otherwise, we might close this due to lack of interest.
I found the solution in GATK4, but I think that it is still important in htsjdk. But it will probably be better for the new API if it is implemented (#520), using
int start = rec.getAlignmentStart();
// only if we are including indels and the record does not start in the first base of the reference
// the stop locus to populate the queue is not the same if the record starts with an insertion
if(includeIndels && start != 1 && rec.getCigar().getCigarElement(0).getOperator().equals(CigarOperator.I)) {
What if there's a clip and then an insertion, e.g. 5S6I4M? In this case, you won't notice that it is in fact an insertion that comes before the alignment start. (needs a test)
Done (test+fix)
@yfarjoun, I reorganized commits and squashed them for an easier review. I solved the bug in the code when tracking indels, and added some simple tests for them. Could you have a look?
for(final CigarElement element: cigar.getCigarElements()) {
switch(element.getOperator()) {
case I: return true;
case S: continue;
this should be any operator that doesn't consume refbases (except "I"). I think it would be better to implement this as an if statement. Perhaps something like:
<SNIP>
if( element.getOperator()==CigarOperator.I ) return true;
if ( ! element.getOperator().consumesReferenceBases() ) continue;
break;
</SNIP>
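A self-contained version of that check, for illustration only. The nested `Op` enum is a hypothetical model that carries just the reference-consumption flag (it is not htsjdk's `CigarOperator`), and `startsWithInsertion` is an assumed method name.

```java
import java.util.List;

final class LeadingInsertionSketch {
    /** Hypothetical operator model; only the reference-consumption flag is needed here. */
    enum Op {
        M(true), I(false), D(true), N(true), S(false), H(false), P(false);

        final boolean consumesReferenceBases;

        Op(final boolean consumesReferenceBases) {
            this.consumesReferenceBases = consumesReferenceBases;
        }
    }

    /** Returns true if an insertion occurs before any operator that consumes reference bases. */
    static boolean startsWithInsertion(final List<Op> operators) {
        for (final Op op : operators) {
            if (op == Op.I) return true;               // insertion precedes the first aligned base
            if (!op.consumesReferenceBases) continue;  // skip clips and padding
            break;                                     // first reference-consuming operator ends the scan
        }
        return false;
    }
}
```

Under this model, 5S6I4M (operators S, I, M) is reported as starting with an insertion, while 4M2I is not, because M consumes reference bases before the insertion is reached.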
Thanks for your work on this PR @magicDGS. I've commented about your bug fix, and a few of the places where the spaces in your code are missing (sorry for the nit-picking, we are trying to have a consistent code style). There are more that I missed, perhaps you can set your IDE to fix these for you? I think that there are some CIGAR cases that are missing. e.g.
Back to you for (final?) review, @yfarjoun:
Thanks @magicDGS !!
Added indel tracking to SamLocusIterator (as discussed in issue #387) and some tests. The default behaviour is still to not keep reads spanning indels.