Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added indel support to SamLocusIterator #408

Merged
merged 5 commits into from
May 29, 2016

Conversation

magicDGS
Copy link
Member

Added indel tracking to SamLocusIterator (as discussed in issue #387) and some tests. The default behaviour is still do not keep reads spanning indels.

@tfenne
Copy link
Member

tfenne commented Dec 15, 2015

Hi @magicDGS, thanks for doing this! I'll try and take a look at this in the next day or so. In the mean time two quick things:

  1. It looks like the test that is failing is a new test that you added - do you know what's going on?
  2. Your edits introduce a lot of hard-tabs. Some time before merging those'll need to get turned into soft-tabs with tab size=4.

@magicDGS
Copy link
Member Author

Sorry, I forgot to add some changes in my local repository before PR regarding the test. It should be fixed in commit b48aa0a

@magicDGS
Copy link
Member Author

Fixed hard-tabs in commit 1572366

@@ -75,12 +77,15 @@ public RecordAndOffset(final SAMRecord record, final int offset) {

/**
* The unit of iteration. Holds information about the locus (the SAMSequenceRecord and 1-based position
* on the reference), plus List of ReadAndOffset objects, one for each read that overlaps the locus
* on the reference), plus List of ReadAndOffset objects, one for each read that overlaps the locus;
* two more List of ReadAndOffset objects includes reads that overlaps the locus with insertions/deletions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't read quite right to me. How about:
two more List_s_ of ReadAndOffset objects include reads that overlap the locus with insertions and deletions respectively.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in ac5c8b5

@tfenne
Copy link
Member

tfenne commented Dec 18, 2015

Hi @magicDGS there's one case that I think doesn't work if I've understood the code correctly. Either way it would be great to add a test for it to ensure/show that it works. My concern is that I think if you encounter a stream of record with the same alignment start position, and the first read starts with an M operator and some subsequent read starts with an I operator, things will go badly. E.g.

R1  65  chr4    5000    60  101M    ...
R2  65  chr4    5000    60  6I95M   ...
R3  65  chr4    5000    60  101M    ...
R4  65  chr4    5000    60  101M    ...

I think in this case that when R1 is encountered that the accumulator will be flushed up to position 5000, and then all the bases from R1 accumulated. And then when R2 is encountered, it will expect that accumulator[0] is in fact for position 4999, and the exception on line 349 will be thrown.

Since you can't rely on any specific ordering of reads with the same alignment start position, you may need to change things further such that you pull off all the reads with the same alignment start position, order them by whether or not they start with an indel, and then accumulate from all of them.

// iterate over the cigar element
for (int elementIndex = 0; elementIndex < cigar.size(); elementIndex++) {
final CigarElement e = cigar.get(elementIndex);
switch (e.getOperator()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This switch statements would, I think, be clearer written as follows:

if (operator == I) // handle insertion
else if (operator == D) // handle deletion
else {
    if (operator.consumesReadBases()) readBase += e.getLength();
    if (operator.consumesReferenceBases()) refBase += e.getLength();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 5d23bdd

@tfenne
Copy link
Member

tfenne commented Dec 18, 2015

@magicDGS I've made a first pass through the code, but will likely have more comments once you've addressed the first batch.

@alecw Would you mind taking a look at this PR, perhaps after @magicDGS has made the first round of revisions? As the original author of SamLocusIterator I think it would be good for you to take a look at this.

@magicDGS
Copy link
Member Author

Hi @tfenne, it is true that the iterator is not working on the test case that you mention and I will modify the testStartWithInsertion method to develop what you suggest. Thank you very much for your revision, and I will let you know when I fix that issue.

@magicDGS
Copy link
Member Author

Hello @tfenne, for a reason that I do not fully understand, testEmitUncoveredLoci is broken when using the SAMRecordSetBuilder (commit 9d14bb9, see the file here), but the implementation is the same as in the commit where the test pass (5d23bdd, see the file here), although separating the setIncludeIndels.

Do you have any clue about what is happening? When changing the test in the original implementation of SamLocusIterator it is not working either, so I don't think that is an issue of the new implementation.

Thank you very much in advance.

EDIT: fixed in #421 and merged in this PR

@alecw
Copy link
Contributor

alecw commented Dec 21, 2015

Truth be told, I'm not the original author. We inherited this code from @kcibul. Anyway, this code is pretty complicated (apart from the changes) and I don't think I have the time to get my head into this enough to comment intelligently.

One suggestion I would make, however, is that it would be good for the javadoc to explain how insertions and deletions will be handled differently according to the value of includeIndels. As it stands now I don't know what the existing behavior is and what the new behavior will be.

-Alec

@magicDGS
Copy link
Member Author

magicDGS commented Feb 2, 2016

Although I'm still working on this, I found that GATK4 already implemented a LocusIteratorByState (broadinstitute/gatk#1442) that solve the initial issue.

I do not know if it is worth to implement this option in htsjdk or if someone is interested in that behavior use directly the GATK4 framework. What do you think, @tfenne?

@yfarjoun yfarjoun mentioned this pull request Apr 19, 2016
@droazen
Copy link
Contributor

droazen commented Apr 19, 2016

@tfenne Are you still interested in getting this into htsjdk? It sounds like @magicDGS has found a solution in GATK4 for the original need that prompted this PR. If you are interested, could you comment before the next htsjdk review party on 5/3? Otherwise, we might close this due to lack of interest.

@magicDGS
Copy link
Member Author

I found the solution in GATK4, but I think that it is still important in htsjdk. But it will probably be better for the new API if it is implemented (#520), using AlignmentContext as in GATK.

int start = rec.getAlignmentStart();
// only if we are including indels and the record does not start in the first base of the reference
// the stop locus to populate the queue is not the same if the record starts with an insertion
if(includeIndels && start != 1 && rec.getCigar().getCigarElement(0).getOperator().equals(CigarOperator.I)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there's a clip and then an insertion, e.g 5S6I4M . In this case, you'll not notice that it is in-fact an insertion that comes before the alignment start. (needs a test)

Copy link
Member Author

@magicDGS magicDGS May 16, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (test+fix)

@magicDGS
Copy link
Member Author

@yfarjoun, I reorganized commits and squashed them for an easier review. I solved the bug in the code when tracking indels, and added some simple tests for them. Could you have a look?

for(final CigarElement element: cigar.getCigarElements()) {
switch(element.getOperator()) {
case I: return true;
case S: continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be any operator that doesn't consume refbases (except "I"). I think it would be better to implement this as an if statement. Perhaps something like:

<SNIP>
if( element.getOperator()==CigarOperator.I ) return true;
if ( ! element.getOperator().consumesReferenceBases() )  continue;
break;
</SNIP>

@yfarjoun
Copy link
Contributor

Thanks for your work on this PR @magicDGS.

I've commented about your bug fix, and a few of the places where the spaces in your code are missing (sorry for the nit-picking, we are trying to have a consistent code style). There are more that I missed, perhaps you can set your IDE to fix these for you?

I think that there are some CIGAR cases that are missing.

e.g.

2M4N3I4M
2M4N4D5M (is this legal?)

@magicDGS
Copy link
Member Author

magicDGS commented May 18, 2016

Back to you for (final?) review, @yfarjoun:

  • Coding style fixed with IDE
  • startWithInsertion() implemented as you suggested
  • New tests for examples of CIGAR in previous comment

@yfarjoun
Copy link
Contributor

Thanks @magicDGS I'm happy with this. 👍

@tfenne you took a look at this first, and said you might have more comments. Are you happy enough with my review, or would you like another glance?

@yfarjoun yfarjoun merged commit 5196d09 into samtools:master May 29, 2016
@yfarjoun
Copy link
Contributor

Thanks @magicDGS !!

@magicDGS
Copy link
Member Author

Thanks for all your coments and help, @tfenne and @yfarjoun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants