-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Work-around for CIGARs with >64k operations in BAM files #560
Closed
Closed
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
71f06db
support CIGARs with >64k operations
lh3 860ebdd
fixed wrong byte-order on big endian
lh3 0227406
changed TAB to 4 spaces
lh3 6cb3dcd
check realloc return; fixed an endianness issue
lh3 9d2d1a8
Long-cigar solution with soft-clipping
lh3 fce8639
Merge branch 'develop' into cigar-64k
lh3 f923193
recompute bin for long-cigar records
lh3 1026c56
convert TAB to 4 SPACEs
lh3 821c2ac
support both fake cigar xS and xSyN
lh3 963091e
replaced TAB with SPACEs
lh3 27bba42
reduce unnecessary cigar parsing; recompute bin
lh3 79466f8
lift the CG tag to CIGAR for CRAM and give warning
lh3 5cae643
Merge branch 'develop' into cigar-64k
lh3 08acf13
Merge branch 'develop' into cigar-64k
lh3 d57d155
check tag2cigar() return code and real cigar len
lh3 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @lh3, I'm adding support for cigars in CG tags into BioJulia/BioAlignments.jl, and am currently implementing the test for the fake cigar, which is defined as
[k<<4|4,m<<4|3]
in the spec, where k is l_seq and m is the length of the reference in the alignment. I don't see a field I can say obviously corresponds to m in BAM records.In your C of bam_tag2cigar, if I understand it correctly, I see you check whether the first op is a softclip, and if it's length == l_seq:
if (bam_cigar_op(cigar0[0]) != BAM_CSOFT_CLIP || bam_cigar_oplen(cigar0[0]) != c->l_qseq) return 0;
But I don't see in the C where the second cigar op of "mN" is checked, is it therefore something that is a given if kS is present then mN will be present?
Thanks,
Ben.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There isn't a field directly in the BAM structure to define 'm' as it is the number of reference bases spanned by the sequence and not query (so not l_qseq). The corresponding code for writing it is also in sam.c:
You'll see the bam_cigar2rlen function is the appropriate code for working out 'm'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jkbonfield Ah I see, so if I'm properly understanding, since
m
is computed withbam_cigar2rlen
, using the actual long cigar, then when reading in a BAM record, inbam_tag2cigar
, only the first cigar op is checked fork
andBAM_CSOFT_CLIP
here:Preciely because you can't yet compute
m
as the real cigar has not been read yet, in order to do so?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the generous interpretation, but it probably boils down to not bothering to check. :-) Note it's not solely entire-soft-clip, but also the presence of the CG tag. The two combined are pretty specific, although it's not a bad idea to also validate that CG and cigar are consistent with each other.
Note the main reason for the existence of the m part of kSmN was so that tools that wanted to do coverage graphs could still use the minimal cigar string to derive the alignment end coordinate. The suggestion started with just kS and so the htslib implementation probably shows some of that history in how it is structured.