Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support BAMs with >65535 CIGAR operations #1003

Closed
wants to merge 11 commits into from
Closed

Support BAMs with >65535 CIGAR operations #1003

wants to merge 11 commits into from

Conversation

lh3
Copy link
Member

@lh3 lh3 commented Oct 3, 2017

Description

This PR implements the long-cigar solution that @yfarjoun and I have agreed on. In the presence of CIGAR with >65535 operators, we move the real CIGAR to the CG tag and put a fake full clipping CIGAR (i.e. <readLength>S) in place.

./gradlew test reported one test failure due to EnaRefServiceTest. Looking at the call stack, I don't think that is my fault. The rest of tests have passed.

On an example SAM and BAM (from http://lh3lh3.users.sourceforge.net/data), PrintReadsExample gives the desired BAM output when taking both SAM and BAM as input. Both shallow and deep decoding also write correct BAMs.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@codecov-io
Copy link

codecov-io commented Oct 18, 2017

Codecov Report

Merging #1003 into master will increase coverage by 0.173%.
The diff coverage is 80.22%.

@@               Coverage Diff               @@
##              master     #1003       +/-   ##
===============================================
+ Coverage     66.103%   66.277%   +0.173%     
- Complexity      7562      7758      +196     
===============================================
  Files            532       532               
  Lines          32254     33072      +818     
  Branches        5497      5701      +204     
===============================================
+ Hits           21321     21919      +598     
- Misses          8777      8954      +177     
- Partials        2156      2199       +43
Impacted Files Coverage Δ Complexity Δ
...main/java/htsjdk/samtools/SAMRecordSetBuilder.java 88.703% <0%> (-0.418%) 68 <7> (+3)
src/main/java/htsjdk/samtools/BAMRecord.java 74.194% <76.923%> (+0.471%) 51 <1> (+14) ⬆️
src/main/java/htsjdk/samtools/BAMRecordCodec.java 85.271% <89.189%> (+0.735%) 22 <0> (+7) ⬆️
...in/java/htsjdk/samtools/filter/ReadNameFilter.java 53.659% <0%> (-11.047%) 4% <0%> (ø)
...rc/main/java/htsjdk/variant/vcf/VCFFileReader.java 66.379% <0%> (-10.091%) 34% <0%> (+17%)
...va/htsjdk/tribble/TribbleIndexedFeatureReader.java 70.192% <0%> (+2.97%) 24% <0%> (+2%) ⬆️
...ain/java/htsjdk/tribble/AbstractFeatureReader.java 73.846% <0%> (+3.013%) 26% <0%> (+8%) ⬆️
src/main/java/htsjdk/samtools/util/Histogram.java 82.991% <0%> (+3.539%) 107% <0%> (+37%) ⬆️
...rc/main/java/htsjdk/tribble/util/ParsingUtils.java 77.097% <0%> (+3.552%) 87% <0%> (+40%) ⬆️
... and 5 more

@yfarjoun
Copy link
Contributor

yfarjoun commented Dec 15, 2017

@lh3 there are failing tests...could you please rebase (not squash) to see if that helps?

This is the solution that @yfarjoun and I have agreed on. In the presence of
CIGAR with >65535 operators, we move the real CIGAR to the CG tag and put a
fake full clipping CIGAR (i.e. <readLength>S) in place.

`./gradlew` reported one test failure due to `EnaRefServiceTest`. Looking at
the call stack, I don't think that is my fault. The rest of tests have passed.

On an example SAM and BAM (from <http://lh3lh3.users.sourceforge.net/data>),
`PrintReadsExample` gives the desired BAM output when taking both SAM and BAM
as input. Both shallow and deep decoding also write correct BAMs.
For BAMs generated with old tools, the indexingBin field may not be correctly
set. We need to manually update this field to avoid errors during random
retrieval.
The old code to lift tag to CIGAR did not work. This commit fixed several
issues:

- byte order is not set when computing alignmentEnd
- forgot to add *4 to cigarLen in two cases
- getChar() is not what it is supposed to mean

The new code has been more carefully tested and seems to work.
This fixes a test failure.
@lh3
Copy link
Member Author

lh3 commented Dec 16, 2017

The master has a new test which leads to a failure (null is not checked). This has been fixed. All tests passed now.

@yfarjoun
Copy link
Contributor

I'd like to have a test for this...do you mind if I added a test in a commit?

@lh3
Copy link
Member Author

lh3 commented Dec 16, 2017

@yfarjoun Please go ahead and add the test. I really appreciate!

- extracting constants etc.
@yfarjoun
Copy link
Contributor

hi @lh3, I added some tests and some code changes here: https://github.com/samtools/htsjdk/tree/yf_cigar-64k-tag though one seemingly unrelated test is failing, and the second test I wrote isn't failing, even though I designed it to fail...I was hoping that you could include something like the tests I wrote here in your branch. The second test adds wrong CG tags to the records and expects failure (since they should either be replaced by the new tags, or confuse the system with wrong tags)...let me know how I can help.

@yfarjoun
Copy link
Contributor

I found the problem and fixed it...so now my only question is why can I add a CG tag and somehow the bam manages to not get tripped up...

@yfarjoun
Copy link
Contributor

@lh3 could you tell me what you think about the tests and my question?

@lh3
Copy link
Member Author

lh3 commented Dec 21, 2017

On my laptop, there are 3 QualityEncodingDetectorTest and 1 SAMTextReaderTest failures with your branch. I need to check more closely about what are causing them.

@yfarjoun
Copy link
Contributor

scratch that. I did push the latest, and they pass on travis... https://travis-ci.org/samtools/htsjdk/builds/319218545?utm_source=github_status&utm_medium=notification

@yfarjoun
Copy link
Contributor

make sure you got the latest commit from my branch.

@lh3
Copy link
Member Author

lh3 commented Dec 22, 2017

Thanks a lot, @yfarjoun. I have merged your branch into this PR.



//why is this not breaking?
@Test(dataProvider = "longCigarsData")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you understand, @lh3 , why this test isn't failing?!

I"m adding a fake CG tag and I have long cigars, but somehow it manages to encode and decode the read correctly....I was expecting failure!

@yfarjoun
Copy link
Contributor

superseded by #1086

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants