Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix CRAM embed_ref=2 with seqs overlapping ref end. #1848

Merged
merged 2 commits into from
Oct 9, 2024

Conversation

jkbonfield
Copy link
Contributor

If the sequences align off the end of the reference and we are creating consensus on the fly, then the consensus generated also steps beyond the reference length. Although this longer reference is embedded, it is trimmed back by the CRAM decoder which validates against the declared reference length in SQ LN, leading to Ns appearing in the decoder.

Therefore we now validate in the encoder too, which also needed refs_from_header updates to parse the LN tag so the encoder can trim. Note we already overloaded r->length==0 for an indication that we've not parsed the fa/fai file yet, so we can't just naively fill this out from the SQ LN header. We could hold this information elsewhere via a proper flag and modify all the places that utilise that knowledge, but the simplest (and safest) fix is to have a separate variable used for this one specific case.

An example of failure could be seen in:

./test/test_view -C -o embed_ref=2 test/c1#bounds.sam |\
./test/test_view -

If the sequences align off the end of the reference and we are
creating consensus on the fly, then the consensus generated also steps
beyond the reference length.  Although this longer reference is
embedded, it is trimmed back by the CRAM decoder which validates
against the declared reference length in SQ LN, leading to Ns
appearing in the decoder.

Therefore we now validate in the encoder too, which also needed
refs_from_header updates to parse the LN tag so the encoder can trim.
Note we already overloaded r->length==0 for an indication that we've
not parsed the fa/fai file yet, so we can't just naively fill this out
from the SQ LN header.  We could hold this information elsewhere via a
proper flag and modify all the places that utilise that knowledge, but
the simplest (and safest) fix is to have a separate variable used for
this one specific case.

An example of failure could be seen in:

    ./test/test_view -C -o embed_ref=2 test/c1#bounds.sam |\
    ./test/test_view -
test/sam's test_bam_set1_write_and_read_back.tmp.bam file had mate
information (rnext, pnext, tlen) for a record that was not paired.
This triggers a CRAM "fix" where some, but strangley not all, of this
was changed.

Arguably CRAM should correct all 3 or none, but regardless the test
data we are generating should be correct.
@daviesrob daviesrob merged commit ca92061 into samtools:develop Oct 9, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants