Allow multiple VCFSimpleHeaderLine of same type #1531

Closed
wants to merge 3 commits into from

@mjhipp mjhipp commented Jan 28, 2021

Description

Closes #277
Closes #500

Currently, it is not possible to add multiple VCFSimpleHeaderLines to a VCFHeader if they are of the same type (see the issues linked above). A specific example is adding ALT lines: the VCF 4.2 spec shows a header with multiple ALT lines on page 11.
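
To illustrate the problem, here is a toy sketch, assuming a store that is keyed only by the line type (for example "ALT"). `ToyHeaderLine` and `KeyOnlyStore` are hypothetical stand-ins made up for this example; this is not htsjdk's VCFHeader code, only a picture of why one entry per key cannot hold two ALT lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a simple header line such as ##ALT=<ID=DEL,...>.
final class ToyHeaderLine {
    final String key; // line type, e.g. "ALT"
    final String id;  // value of the ID field, e.g. "DEL"

    ToyHeaderLine(String key, String id) {
        this.key = key;
        this.id = id;
    }
}

// Toy store keyed only by line type: it can hold at most one line per type.
final class KeyOnlyStore {
    private final Map<String, ToyHeaderLine> byKey = new LinkedHashMap<>();

    void add(ToyHeaderLine line) {
        // In this sketch a second line with the same key replaces the first.
        byKey.put(line.key, line);
    }

    ToyHeaderLine get(String key) {
        return byKey.get(key);
    }

    int size() {
        return byKey.size();
    }
}
```

Adding an "ALT" line with ID "DEL" and then one with ID "DUP" leaves a single entry in this store, whichever replacement policy the map uses.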

To fix this, I added a new map and methods for working with VCFSimpleHeaderLines that are not of the default types.

I have seen another fix to this (#835), but it was closed due to large scope. I am hoping this change is small and useful enough to be considered. Thanks!

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@codecov-io
codecov-io commented Jan 28, 2021

Codecov Report

Merging #1531 (569ccf1) into master (2e3da45) will increase coverage by 0.000%.
The diff coverage is 57.895%.

@@             Coverage Diff              @@
##              master     #1531    +/-   ##
============================================
  Coverage     69.407%   69.407%            
- Complexity      8924      8945    +21     
============================================
  Files            602       604     +2     
  Lines          35521     35649   +128     
  Branches        5904      5924    +20     
============================================
+ Hits           24654     24743    +89     
- Misses          8533      8555    +22     
- Partials        2334      2351    +17     
| Impacted Files | Coverage Δ | Complexity Δ |
|---|---|---|
| src/main/java/htsjdk/variant/vcf/VCFHeader.java | 86.842% <57.895%> (-3.274%) | 82.000 <6.000> (+6.000) ⬇️ |
| .../java/htsjdk/samtools/util/BufferedLineReader.java | 48.387% <0.000%> (-20.034%) | 7.000% <0.000%> (+1.000%) ⬇️ |
| src/main/java/htsjdk/samtools/CRAMFileReader.java | 74.762% <0.000%> (-0.238%) | 47.000% <0.000%> (+2.000%) ⬇️ |
| ...jdk/samtools/util/BlockCompressedOutputStream.java | 79.710% <0.000%> (ø) | 34.000% <0.000%> (+2.000%) |
| src/main/java/htsjdk/io/Writer.java | 0.000% <0.000%> (ø) | 0.000% <0.000%> (?%) |
| src/main/java/htsjdk/io/AsyncWriterPool.java | 72.222% <0.000%> (ø) | 5.000% <0.000%> (?%) |
| src/main/java/htsjdk/io/HtsPath.java | 62.195% <0.000%> (+10.879%) | 20.000% <0.000%> (+5.000%) |

@clintval
Contributor

clintval commented Feb 1, 2021

Hey @lbergelson is this a PR your team would be willing to review? Many thanks if so!

@lbergelson
Member

@clintval Yeah, we're definitely going to get to it. Reviews have been really slow lately because working from home in the pandemic is killing my productivity.

@cmnbroad
Collaborator

cmnbroad commented Feb 9, 2021

@mjhipp Thanks for taking a stab at this. While it does fix the duplicate line problem, it changes current behavior in a way that we should at least think about. Currently, any ID header line other than INFO/FORMAT (i.e., ALT, META, SAMPLE, PEDIGREE, or any other custom line) can be retrieved via getOtherHeaderLine(String key). I think with this change that will no longer be the case.

The real issue is that otherMeta probably shouldn't have any ID lines in it. Making the change in this PR and adding the new getOtherHeaderLine(String key, String id) kind of codifies that, though, and it feels like that's going in the wrong direction. An alternative fix would be to stop putting the ID lines in there at all, and store them elsewhere, and expose them through the new method. That would have the same backward compatibility issue as this fix, but it would also eliminate the need for the synthetic key names, which might be confusing since they look a lot like the ALT SV ID type hierarchy.

If we're going to break compatibility to fix this, I'd be inclined to favor doing the more complete fix, though it's likely to be a bit more work and might impact other methods. Or maybe we should just resurrect the old PR and wait for that. @lbergelson any thoughts? Having said all that, we should fix this somehow...

@mjhipp
Author

mjhipp commented Feb 9, 2021

> An alternative fix would be to stop putting the ID lines in there at all, and store them elsewhere, and expose them through the new method.

A nested map is one possibility for this: the outer map goes from line type ("ALT", "PEDIGREE", etc.) to an inner map, and the inner map goes from ID to metadata line. This would scale to any reasonable number of line types without having to hard-code them (for example, by defining an mAltMetaData map).

This would require a similar change to methods, but without synthetic key names.

This would add a new map at the top of the class:

private final Map<String, Map<String, VCFSimpleHeaderLine>> mIdMetaData = new LinkedHashMap<>();

Adding a new line would involve checking for the key (creating a new inner map if necessary) and then adding the line to the inner map under its ID.

This would change the get method I added to:

    /**
     * @param key    the header key or field type
     * @param id     the header id
     * @return the meta data line, or null if there is none
     */
    public VCFSimpleHeaderLine getIdHeaderLine(final String key, final String id) {
      return mIdMetaData.get(key).get(id);
    }

(probably on multiple lines with a null check in there)
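
A minimal sketch of the nested-map idea with the null check in place might look like the following. `IdLine` and `NestedIdStore` are hypothetical stand-ins for VCFSimpleHeaderLine and the relevant part of VCFHeader, so that the example compiles on its own; the field and method names are assumptions, not the PR's actual code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for VCFSimpleHeaderLine, reduced to the two fields used here.
final class IdLine {
    final String key; // line type, e.g. "ALT"
    final String id;  // value of the ID field

    IdLine(String key, String id) {
        this.key = key;
        this.id = id;
    }
}

final class NestedIdStore {
    // Outer map: line type -> inner map. Inner map: ID -> line. Both keep insertion order.
    private final Map<String, Map<String, IdLine>> idMetaData = new LinkedHashMap<>();

    void add(IdLine line) {
        // Create the inner map the first time a line type is seen, then store the line by ID.
        idMetaData.computeIfAbsent(line.key, k -> new LinkedHashMap<>()).put(line.id, line);
    }

    // Returns the matching line, or null if either the line type or the ID is unknown.
    IdLine get(String key, String id) {
        Map<String, IdLine> inner = idMetaData.get(key);
        return inner == null ? null : inner.get(id);
    }
}
```

With this layout two ALT lines with different IDs coexist, and a lookup for an unknown line type returns null instead of throwing a NullPointerException.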

> Currently, any ID header line other than INFO/FORMAT (i.e., ALT, META, SAMPLE, PEDIGREE, or any other custom line) can be retrieved via getOtherHeaderLine(String key). I think with this change that will no longer be the case.

With the above changes, the existing getOtherHeaderLine could also be changed to keep some backwards compatibility. It would first check mOtherMetaData for the key and, if the key is not found there, check mIdMetaData. If the key is found in the outer map of mIdMetaData, it would take the first (or last) value in the inner map and cast it to VCFHeaderLine.

@mjhipp
Author

mjhipp commented Feb 10, 2021

I made the changes above. I think the only remaining backwards compatibility issue is that getOtherHeaderLines() will no longer return VCFSimpleHeaderLines, as it did before. To get those, use getIdHeaderLines().

Comment on lines +611 to +620
if (mOtherMetaData.containsKey(key)) {
    return mOtherMetaData.get(key);
} else if (mIdMetaData.containsKey(key)) {
    // Get the first item in the linked hash map
    Map<String, VCFSimpleHeaderLine> fieldMetaData = mIdMetaData.get(key);
    if (fieldMetaData.keySet().size() > 0) {
        return fieldMetaData.get(fieldMetaData.keySet().iterator().next());
    }
}
return null;
Author

This is for backwards compatibility, where previously a VCFSimpleHeaderLine could be accessed by this method (only one per field type). It may be better to let this return null and force users to use getIdHeaderLine() to find VCFSimpleHeaderLines. This would be more consistent with getOtherHeaderLines/getIdHeaderLines.

Collaborator

While this method makes sense by itself, it introduces a new inconsistency with getOtherHeaderLines(), since it has a different idea of what "other" means (it returns lines that will never be returned by the existing method).
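
The inconsistency can be shown with a small sketch. It assumes header lines are plain strings and uses a hypothetical `MixedLookupHeader` class with made-up method names; it only mirrors the shape of the fallback logic, not htsjdk's types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical header with two stores: non-ID lines and ID lines (line type -> ID -> text).
final class MixedLookupHeader {
    private final Map<String, String> other = new LinkedHashMap<>();
    private final Map<String, Map<String, String>> byId = new LinkedHashMap<>();

    void addOther(String key, String value) {
        other.put(key, value);
    }

    void addIdLine(String key, String id, String text) {
        byId.computeIfAbsent(key, k -> new LinkedHashMap<>()).put(id, text);
    }

    // Plural getter: returns only the non-ID lines.
    List<String> getOtherLines() {
        return new ArrayList<>(other.values());
    }

    // Singular getter: falls back to the first ID line stored for that key.
    String getOtherLine(String key) {
        if (other.containsKey(key)) {
            return other.get(key);
        }
        Map<String, String> inner = byId.get(key);
        return (inner == null || inner.isEmpty()) ? null : inner.values().iterator().next();
    }
}
```

After adding one ID line for "ALT", `getOtherLine("ALT")` returns that line even though it never appears in `getOtherLines()`, so the two methods disagree about what counts as "other".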

Collaborator

@cmnbroad cmnbroad left a comment

@mjhipp Again, thanks for attempting to fix this. I added a couple more inline comments on your most recent commit, but I'm a bit skeptical that we can fix this issue without introducing a different set of new issues (especially given that we don't have good unit tests for the existing methods).

Before we go any further we should get @lbergelson's thoughts on whether we want to trade off some new inconsistency/weirdness in order to fix the duplicate line issue. If so, I think we'd need to add a bunch more unit tests to verify the behavior of whatever tradeoffs we make, but I'll hold off on further comment until he chimes in.

mIdMetaData.forEach((k, v) -> lines.addAll(v.values()));
return lines;
}

Collaborator

This class already has a public method List<VCFIDHeaderLine> getIDHeaderLines() that returns all ID lines. I think it will be confusing to also have this new method, with a slightly different name and signature that returns only the ID header lines that are not INFO, FORMAT, etc.
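
A rough sketch of the naming concern, assuming header lines are plain strings: `SimilarNamesHeader` and its fields are hypothetical, and the return types are simplified for illustration. Java accepts two methods whose names differ only in letter case, so nothing stops them from coexisting while returning different subsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical header whose lines are plain strings, for illustration only.
final class SimilarNamesHeader {
    private final List<String> infoAndFormatLines = new ArrayList<>();
    private final List<String> otherIdLines = new ArrayList<>();

    void addInfoOrFormat(String line) {
        infoAndFormatLines.add(line);
    }

    void addOtherId(String line) {
        otherIdLines.add(line);
    }

    // Existing-style method: every ID line, including INFO and FORMAT.
    List<String> getIDHeaderLines() {
        List<String> all = new ArrayList<>(infoAndFormatLines);
        all.addAll(otherIdLines);
        return all;
    }

    // New-style method: the name differs only in letter case, but the result is a subset.
    List<String> getIdHeaderLines() {
        return new ArrayList<>(otherIdLines);
    }
}
```

A caller who mistypes the capitalization gets a different list with no compiler warning, which is the kind of confusion the comment above is pointing at.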

@cmnbroad
Collaborator

cmnbroad commented Mar 8, 2021

After further discussing this internally and thinking more about it, I'm inclined to wait for a more complete fix to VCFHeader and address the numerous issues all at once rather than continue trying to make incremental fixes. I've resurrected and rebased my old #835 branch and will resubmit a new draft PR so we can try to find a path forward and get that merged.

5 participants