Htsjdk mangles unicode when writing sam files (and possibly bam/cram) #1202

Open
lbergelson opened this issue Oct 24, 2018 · 0 comments
@lbergelson
Member

The newest version of the SAM specification states which fields are allowed to contain UTF-8 and which must be standard 7-bit ASCII. I tested it, and it turns out we do not support UTF-8 in SAM files at all. We mangle non-ASCII characters in all cases.

AsciiWriter uses the very simple StringUtil.charsToBytes, which just downcasts each input char to a byte. This is incorrect for any character outside the 7-bit ASCII range.
We don't detect this case and instead silently corrupt the output.
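
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not call htsjdk; `downcast` is a hypothetical stand-in that assumes the behavior described above (each `char` narrowed to a `byte`), and `encodeUtf8` and `isSevenBitAscii` are likewise illustrative names, not existing htsjdk methods. It compares the narrowed bytes with a real UTF-8 encoding and shows one way the non-ASCII case could be detected.

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    // Hypothetical stand-in for the downcast described in this issue:
    // each UTF-16 code unit is narrowed to a byte, keeping only its low 8 bits.
    static byte[] downcast(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            out[i] = (byte) s.charAt(i);
        }
        return out;
    }

    // Encoding through the JDK charset API produces valid UTF-8.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // One possible check for fields that must stay 7-bit ASCII.
    static boolean isSevenBitAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String description = "caf\u00e9 \u03b1"; // contains two non-ASCII characters

        byte[] narrowed = downcast(description);
        byte[] utf8 = encodeUtf8(description);

        System.out.println("downcast bytes: " + narrowed.length);
        System.out.println("UTF-8 bytes:    " + utf8.length);

        // Decoding the narrowed bytes as UTF-8 does not give the original text back.
        String roundTrip = new String(narrowed, StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + roundTrip.equals(description));
        System.out.println("pure ASCII: " + isSevenBitAscii(description));
    }
}
```

For pure ASCII input the two encodings agree byte for byte, which is presumably why the problem goes unnoticed until a non-ASCII character appears in a field.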

We may have similar problems with CRAM/BAM.

lbergelson added the bug label Oct 24, 2018
lbergelson added a commit that referenced this issue Oct 26, 2018
* The @SQ DS header field was added to the 1.6 bam spec; this adds a getter and setter for it.
* We do not correctly support UTF-8 characters in the description due to #1202
lbergelson added a commit that referenced this issue Nov 7, 2018
* Changing htsjdk to produce sam version 1.6

* Htsjdk has technically been producing SAM version 1.6 since support for long cigars was added.
* Updating the list of acceptable versions to include 1.6 and setting the header version of new bams to 1.6.
* There is a known issue with writing UTF-8 characters in the SAM header: this is now allowed for some fields but not handled correctly. See #1202
lbergelson added a commit that referenced this issue Nov 14, 2018
* The @SQ DS header field was added to the 1.6 bam spec; this adds a getter and setter for it.
* We do not correctly support UTF-8 characters in the description due to #1202