Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MergeSamFiles accept SO:UNKNOWN #1069

Conversation

a-filipp0v
Copy link
Contributor

@a-filipp0v a-filipp0v commented Jan 15, 2018

Description

Request is associated with Picard's issue broadinstitute/picard#796

As said in that issue MergeSamFiles can't accept SAM-file with SO:UNKNOWN. That behavior is expected, because tag's value is in improper case(in accordance with SAM-format specification, SO value must be in lowercase). We've found that wrong tag case also provides an exception.

The idea of implementation is that htsjdk must check case of header's tags and values. If tag/value itself is valid, but in improper case(e.g. lowercase instead of uppercase) - it should be converted to proper case before sending the result to output(without changing the source file). Also, there must be a message if that conversion was made.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

alexeyphilippov added 2 commits January 15, 2018 02:16
- checks for case-sensitive parts of header;
- tests for my implementation.
- appropriate conversion if tag/value is valid but in improper case(without changing the source file);
- warning if change was made;
- corresponding test.
@codecov-io
Copy link

codecov-io commented Jan 15, 2018

Codecov Report

Merging #1069 into master will increase coverage by 2.195%.
The diff coverage is 95%.

@@              Coverage Diff               @@
##             master     #1069       +/-   ##
==============================================
+ Coverage     66.16%   68.355%   +2.195%     
- Complexity     7609      8003      +394     
==============================================
  Files           533       541        +8     
  Lines         32373     32691      +318     
  Branches       5509      5535       +26     
==============================================
+ Hits          21418     22346      +928     
+ Misses         8790      8120      -670     
- Partials       2165      2225       +60
Impacted Files Coverage Δ Complexity Δ
.../main/java/htsjdk/samtools/SAMTextHeaderCodec.java 81.496% <100%> (+0.918%) 47 <0> (+1) ⬆️
src/main/java/htsjdk/samtools/SAMFileHeader.java 64.607% <85.714%> (+1.023%) 44 <3> (+1) ⬆️
.../main/java/htsjdk/samtools/util/ftp/FTPStream.java 0% <0%> (-100%) 0% <0%> (-3%)
src/main/java/htsjdk/tribble/util/FTPHelper.java 0% <0%> (-64.286%) 0% <0%> (-3%)
.../main/java/htsjdk/samtools/util/ftp/FTPClient.java 24.096% <0%> (-55.172%) 7% <0%> (-10%)
...mtools/seekablestream/SeekableFTPStreamHelper.java 19.355% <0%> (-37.097%) 1% <0%> (-9%)
...c/main/java/htsjdk/samtools/util/ftp/FTPReply.java 44.444% <0%> (-16.667%) 9% <0%> (-3%)
...c/main/java/htsjdk/samtools/util/ftp/FTPUtils.java 17.544% <0%> (-15.789%) 2% <0%> (-1%)
...main/java/htsjdk/tribble/util/RemoteURLHelper.java 46.154% <0%> (-15.385%) 3% <0%> (-1%)
...jdk/samtools/seekablestream/SeekableFTPStream.java 14.634% <0%> (-14.634%) 3% <0%> (-4%)
... and 127 more

@yfarjoun
Copy link
Contributor

Personally I'm opposed to this change.

I do not see why htsjdk needs to be made more complicated in order to accept non-SAM-spec compliant input. Its hard enough as it is even with SAM-compliant input and this will make the code more complicated. What if someone has a typo in the header and they spelled the sort order "unkwnon"? Or perhaps they use a different "locale" and the used "SO: desconocido" ? should htsjdk also accept that?

That off my chest, I'm not a maintainer...so I'd love to hear from them before even looking at the details of the implementation: @lbergelson @tfenne @jacarey

@yfarjoun
Copy link
Contributor

I think that the correct solution is to fix the offending software so that it produces compliant BAM files and to write a simple script that fixes "broken" headers.

@yfarjoun
Copy link
Contributor

Alternatively a better (in my view) solution would be an ability to ignore (when validation=silent or leniant) an unknown sort order and use "unknown" when that happens (instead of lowercasing)

@magicDGS
Copy link
Member

I also prefer that HTSJDK is SAM-compliant and do not allow non-supported SO...

@jacarey
Copy link
Contributor

jacarey commented Jan 16, 2018

I agree with @yfarjoun

@yfarjoun
Copy link
Contributor

@a-filipp0v could you modify the code as follows:

  1. remove the lower-casing and the logs related to that
  2. if SortOrder.valueOf(unmodified) throws an error, AND VALIDATION_STRINGENCY!=STRICT, use SortOrder.UNKNOWN and either emit a warning or not depending on VALIDATION_STRINGENCY.

I think that this will both solve the issue at hand and not introduce un-needed complexity.

As a bonus, I think that your tests should still be useful!

alexeyphilippov added 2 commits January 24, 2018 15:18
- added implementation that treating non-conforming SO tag values as "unknown";
- corresponding tests reworked, few more tests added.
@a-filipp0v
Copy link
Contributor Author

a-filipp0v commented Feb 2, 2018

Thank you for your comments. Hope we've understood you correctly in case of non-conforming SO tag values.
Our implementation has been reworked according to your requests. Also, we've reworked previous tests in SamFileHeaderTest, added few more and removed one test case from ValidateSamFileTest because it's redundant after our implementation.

this.sortOrder = null;
tempVal = SortOrder.valueOf(value).toString();
} catch (IllegalArgumentException e) {
tempVal = "unknown";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use SortOrder.unknown.toString() here.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this.

mKeyValuePairs.put(keyAndValue[0], keyAndValue[1]);
}
lineValid = true;
}

private void validateSortOrderValue(String[] value) {
if ("SO".equals(value[0])) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use SAMFileHeader.SORT_ORDER_TAG

log.warn("Found non conforming header SO tag: "
+ value[1] + ". Treating as 'unknown'.");
}
value[1] = "unknown";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use SortOrder.unknown.toString()

if (key.equals(SORT_ORDER_TAG)) {
this.sortOrder = null;
try {
this.sortOrder = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why move this line into the try clause?

@@ -69,6 +70,8 @@
private static final Pattern FIELD_SEPARATOR_RE = Pattern.compile(FIELD_SEPARATOR);

public static final String COMMENT_PREFIX = HEADER_LINE_START + HeaderRecordType.CO.name() + FIELD_SEPARATOR;
private static final Log log = Log.getInstance(SAMTextHeaderCodec.class);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra newline

@@ -233,8 +236,8 @@ private void parseHDLine(final ParsedHeaderLine parsedHeaderLine) {
try {
if (soString != null) SAMFileHeader.SortOrder.valueOf(soString);
} catch (IllegalArgumentException e) {
reportErrorParsingLine(HEADER_LINE_START + parsedHeaderLine.getHeaderRecordType() +
" line has non-conforming SO tag value: "+ soString + ".",
reportErrorParsingLine(HEADER_LINE_START + parsedHeaderLine.getHeaderRecordType() +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revert extra tab

@@ -552,12 +552,6 @@ public void duplicateReadsOutOfOrder() throws Exception {
"@RG\tID:0\tSM:Hi,Mom!\n" +
"E\t147\tchr1\t15\t255\t10M\t=\t2\t-30\tCAACAGAAGC\t)'.*.+2,))\tU2:Z:CAA";

final String SOTagCorrectlyProcessTestData =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this removed?

Copy link

@IanaKabakova IanaKabakova Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your reply and sorry for the delay. We've removed this test case since as @a-filipp0v said, it's redundant after our implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure I understand...if it's just "Redundant" but not exactly equals, I'd ask that it be left in.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case does not pass now, since we no longer treat SO:NOTKNOWN as error as test expects.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it

@a-filipp0v
Copy link
Contributor Author

Hello, @yfarjoun , thanks for your reply.

According to your previous comments, we've tried to move an appropriate check to SamFileValidator, but found that it still might make sense to leave our checks in SAMTextHeaderCodec and SAMFileHeader.

First, the check in SamFileValidator will be called only when the ValidateSamFile is used, which does not prevent an error when using any tools separately (for example, MergeSamFile as in this issue).

Secondly, SamFileValidator does not contain separate data about Validation Stringency, because this parameter is parsed from Picard during start.

We suggest leaving checks in SAMTextHeaderCodec and SAMFileHeader AND adding a similar one in SamFileValidator. What do you think about it?

@lbergelson
Copy link
Member

@yfarjoun Since you did the initial review, could you take a look at this again when you get a chance?

@yfarjoun
Copy link
Contributor

@a-filipp0v thanks for the explanation. I've removed my comment relating to the validation.

@yfarjoun
Copy link
Contributor

yfarjoun commented Jul 4, 2018

@a-filipp0v sorry for the delay. please respond to the review comments.

@yfarjoun
Copy link
Contributor

@a-filipp0v just so we know, are you planning to get back to this PR?

@IanaKabakova
Copy link

@yfarjoun sorry for the delay and thank you for your reply. I made some fixes according to your notes. Could you take a look at PR again?

@yfarjoun
Copy link
Contributor

Thanks for the changes. Please leave tests as is, unless they no longer pass.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine now. Maintainers, I'd like to merge this, but please glance at it as it changes the API a bit: a previously failing file will now validate with order=="unknown"

@yfarjoun
Copy link
Contributor

@jacarey @tfenne @lbergelson any opinions about this small API change?

if (key.equals(SORT_ORDER_TAG)) {
this.sortOrder = null;
try {
tempVal = SortOrder.valueOf(value).toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In setSortOrder() the code uses .name() instead of .toString() which I think is clearer; either way the code should be consistent.

Also, it seems odd that setSortOrder() sets sortOrder to non-null, if the argument is a valid SortOrder, but this method sets it to null. Maybe this method should just call setSortOrder() when key is SORT_ORDER_TAG?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pshapiro4broad that's how the old implementation also does it...it let's the super-class deal with it...take a look at htsjdk.samtools.SAMFileHeader#getSortOrder, I think that might clarify things up (and note that this.sortOrder is private)

@yfarjoun
Copy link
Contributor

No response from the maintainers for a while now, I suspect that no-one objects....if I'm wrong, please revert and accept my apologies.

@yfarjoun yfarjoun merged commit c484241 into samtools:master Oct 29, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants