Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tribble/Tabix index path support #810

Merged
merged 9 commits into from
Mar 10, 2017

Conversation

magicDGS
Copy link
Member

@magicDGS magicDGS commented Mar 1, 2017

Description

Motivation

While trying to index a Path (no convertible to File) using the Index interface, the client should:

  • Create the LittleEndianOutputStreamby its own. This may lead to wrong indexes because they aren't block-compressed.
  • Get the index file name by its own if it is based on a feature file. This may lead to incorrect index names.
    In addition, this is part of a bigger change for include support of writing in whatever filesystem (and also reading).

Changes

  • Clean up (first commit): removing mentions to vcf in javadoc for Tribble and unused imports.
  • Path support for Index interface: methods with default values use Path.toFile()
  • Implementation of the new methods in AbstractIndex and TabixIndex

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@lbergelson
Copy link
Member

@magicDGS Something went wrong here, tests are failing all over.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 1, 2017

I figured it out already, I'm trying to find the problem.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 1, 2017

Very awful error from my side, sorry. The writeBasedOnFeatureFile was overwriting the input file instead of using the name for making the index.

Now I guess that it's solved. Could you have a quick review, @lbergelson?

@codecov-io
Copy link

codecov-io commented Mar 1, 2017

Codecov Report

Merging #810 into master will decrease coverage by 0.021%.
The diff coverage is 62.5%.

@@               Coverage Diff               @@
##              master      #810       +/-   ##
===============================================
- Coverage     64.867%   64.846%   -0.021%     
- Complexity      7175      7187       +12     
===============================================
  Files            526       527        +1     
  Lines          31731     31769       +38     
  Branches        5424      5424               
===============================================
+ Hits           20583     20601       +18     
- Misses          8997      9019       +22     
+ Partials        2151      2149        -2
Impacted Files Coverage Δ Complexity Δ
...va/htsjdk/tribble/TribbleIndexedFeatureReader.java 68.333% <ø> (ø) 22 <0> (ø)
...c/main/java/htsjdk/tribble/TabixFeatureReader.java 68.657% <ø> (ø) 8 <0> (ø)
src/main/java/htsjdk/tribble/Tribble.java 77.778% <100%> (+6.349%) 7 <2> (+2)
.../java/htsjdk/tribble/index/linear/LinearIndex.java 78.363% <100%> (+0.518%) 18 <4> (+2)
...sjdk/tribble/index/interval/IntervalTreeIndex.java 75.61% <100%> (-1.89%) 7 <1> (ø)
src/main/java/htsjdk/tribble/index/Index.java 100% <100%> (ø) 2 <2> (?)
.../main/java/htsjdk/tribble/index/AbstractIndex.java 51.579% <42.857%> (-3.223%) 30 <6> (ø)
...tsjdk/tribble/index/linear/LinearIndexCreator.java 83.721% <57.143%> (-3.459%) 15 <1> (+1)
...k/tribble/index/interval/IntervalIndexCreator.java 88.889% <62.5%> (-3.111%) 15 <2> (+1)
...in/java/htsjdk/tribble/index/tabix/TabixIndex.java 77.108% <62.5%> (-0.669%) 38 <3> (+1)
... and 5 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d22658...a6ff57e. Read the comment docs.

@lbergelson lbergelson self-assigned this Mar 1, 2017
@magicDGS magicDGS changed the title Tribble/Tabix index path support WIP: Tribble/Tabix index path support Mar 2, 2017
@magicDGS
Copy link
Member Author

magicDGS commented Mar 2, 2017

This is still WIP, because I found some other paths in the indexing code that should use Path for fully support. So this should wait for review, @lbergelson.

@magicDGS magicDGS changed the title WIP: Tribble/Tabix index path support Tribble/Tabix index path support Mar 3, 2017
@magicDGS
Copy link
Member Author

magicDGS commented Mar 3, 2017

This should be ok now, @lbergelson. Previously failing tests are passing in my computer.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 7, 2017

Can you have a look to this one @lbergelson? This is going to be important for my upcoming PR with tribble writing support with indexing on the fly.

Thanks a lot!

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS This is useful work. Thanks for doing this. I have a a few comments. I think it's probably worth breaking implementations of Index to avoid having weird defaults methods there, but lets see if @droazen thinks otherwise. He's usually more conservative about these things than I am.

Would it be possible for you to add a test that creates an index on a non-file path? There's an example of how to do this using an in memory filesystem called jimfs in AbstractFeatureReaderTest.

The basic idea is that you create a new jimfs filesystem, then you can get paths from that filesystem (ex: "jimfs://filesystemname/path/to/my/in/memory/file"). You may have to copy a file from the local system into jimfs if you want to create an index from it, but that's easy to do using the standard Files operations.

Just be sure that the jimfs filesystem gets closed after your test is done, because it will stay in memory until the jvm shuts down otherwise.

* @return a non-null File representing the index
*/
public static File indexFile(final File file) {
return indexFile(file.getAbsoluteFile(), STANDARD_INDEX_EXTENSION);
}

/**
* Return the name of the tabix index file for the provided vcf {@code filename}
* Return the name of the index file for the provided {@code filename}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

forgot to change filename -> path

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* Does not actually create an index
* @param filename name of the vcf file
* @param path name of the path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"name of the path" sounds like it's something distinct from the path itself, which is strange because "name of the file" is fine. Maybe just "the path"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -201,7 +213,11 @@ public boolean isCurrentVersion() {
}

public File getIndexedFile() {
return indexedFile;
return getIndexedPath().toFile();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should deprecate this method because it's no longer safe to call in all instances. It also needs a comment saying that it can fail with an exception if the path can't be represented as a file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -251,7 +271,7 @@ private void writeHeader(final LittleEndianOutputStream dos) throws IOException
dos.writeInt(MAGIC_NUMBER);
dos.writeInt(getType());
dos.writeInt(version);
dos.writeString(indexedFile.getAbsolutePath());
dos.writeString(indexedPath.toString());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to use indexedPath.toURI().toString() since it's more likely to roundtrip correctly for non-file paths.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -67,7 +70,7 @@
private final static long NO_TS = -1L;

protected int version; // Our version value
protected File indexedFile = null; // The file we've created this index for
protected Path indexedPath = null; // The file we've created this index for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a breaking change. I'm going to say it's fine though. I don't believe there are any subclasses of AbstractIndex in the wild that we're going to be breaking.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I haven't realize that it is breaking compatibility. If we do not want to break compatibility, I can kept the field with the deprecated annotation and set it when the used constructor is a File. Let me know if I should do that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it's tricky. Anything field that's visible to a subclass of a non-final class is technically a breaking change since someone could be relying on it if they implement a subclass. It's why I'm so strongly in favor of making classes final and of making variable private with accessors if they need to be accessed.

@@ -56,13 +57,18 @@
MathUtils.RunningStat stats = new MathUtils.RunningStat();
long basesSeen = 0;
Feature lastFeature = null;
File inputFile;
// TODO: actually this field is not needed
Path inputPath;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a breaking change, but I don't see the harm in it since I really doubt anyone is subclassing this thing. 👍 to delete the field

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, because it is package protected, I do not think that is going to harm at all (unless someone is subclassing in the same package outside HTSJDK).

If we are breaking compatibiity anyway, as it is unused, why don't we just remove the field (no the constructor arg)? It is only used for get the index creators...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the unused field in the commit for breaking compatibility.

* @param indexPath Where to write the index.
* @throws IOException if the index is unable to write to the specified path.
*/
public default void write(final Path indexPath) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better idea to have the default throw OperationUnsupportedException than to have it call toFile() which will work most of the time but sometimes fail.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*
* @param featurePath
*/
public default void writeBasedOnFeaturePath(Path featurePath) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of want to break compatibility and force everyone to implement the Path functions. It would be so much cleaner to have the file versions be default implementations that call the path version. @droazen What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect no on in the wild implements Index. At least it's not done in gatk/picard/igv.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using OperationUnsupportedException for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the interface in the commit for breaking compatibility.

*
* @param tabixPath Where to write the index.
*/
@Override
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the old file version delegate to this implementation

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @param featurePath Path being indexed.
*/
@Override
public void writeBasedOnFeaturePath(final Path featurePath) throws IOException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 9, 2017

I added two commits, @lbergelson:

  1. Addressing most of your comments that does not imply backwards incompatibility.
  2. Breaking compatibility for making simpler the interfaces. This could be removed if @droazen thinks that it is better to keep backwards compatibility for subclassing indexes.

Regarding the test, I don't know if I will have time today. I will try, but if not I will add them by tomorrow. You can review the current changes anyway, in case that something else should be changed.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 9, 2017

I've just realized that something is broken with the changes, @lbergelson. Do you have an idea of what is happening? It looks that is breaking indexing on the fly for gzip VCF result (not the indexing itself). The most likely reson is the change to delegate the file constructor to the path one for files, because BlockCompressedOutputStream does not have Path support (see #811) and the filename is not set correctly in the binary codec.

I can add a simple constructor to BlockCompressedOutputStream with a String filename argument, but the change of that class is out of the scope of this commit.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 9, 2017

Ops, I realized that it was an error not related with the BGZ output stream. That's what happens when you program right after waking up... Fixed in the next commit (and removed the commit with the BGZ changes).

@@ -219,8 +219,8 @@ public void write(final Path tabixPath) throws IOException {
*/
@Override
public void writeBasedOnFeaturePath(final Path featurePath) throws IOException {
if (Files.isRegularFile(featurePath)) return;
write(IOUtil.getPath(Tribble.tabixIndexPath(featurePath)));
if (!Files.isRegularFile(featurePath)) return;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the refactoring problem, I overseen this. @lbergelson, I think that this behavior should be documented and/or at least log a warning to avoid getting crazy about what happened with the index. I will rather prefer if it fails with and IO/Tribble exception, but that will break compatibility with public methods...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, good catch. That's terrible behavior. I wonder why it was done that way? Definitely needs to at least log a warning. I would also prefer your suggestion to fail with a TribbleException, but I think we'd need to investigate if that would break existing downstream code somehow. For now, could you have it log a warning, and open a new ticket to change the behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@magicDGS I've opened the issue here: #821

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added warning log here and in AbstractIndex.

@@ -73,41 +73,42 @@
/**
* Writes the index into a file.
*
* Default implementation delegates in {@link #write(Path)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to nitpick the language, but delegates to instead of delegates in is more idiomatic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this, I really appreciate to have comments to improve my English skills. Done!


/**
* Write an appropriately named and located Index file based on the name and location of the featureFile.
* If featureFile is not a normal file, the index will silently not be written.
*
* Default implementation delegates in {@link #writeBasedOnFeaturePath(Path)}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lbergelson
Copy link
Member

@magicDGS Looks good. 👍 Once we have the tests for the two index types.

@magicDGS
Copy link
Member Author

magicDGS commented Mar 9, 2017

Three more commits:

  1. Minor changes from review.
  2. A couple of missing constructors for Path (to test)
  3. Small test for using path in writing and loading.

Thanks for reviewing, @lbergelson - Back to you!

@magicDGS magicDGS mentioned this pull request Mar 10, 2017
5 tasks
@lbergelson
Copy link
Member

👍

@lbergelson lbergelson merged commit 0c282b8 into samtools:master Mar 10, 2017
@magicDGS
Copy link
Member Author

Thanks!

@magicDGS magicDGS deleted the dgs_tribble_path_support branch March 10, 2017 16:19
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Mar 10, 2017
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Jun 26, 2017
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Jun 27, 2017
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Aug 25, 2017
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Jan 25, 2018
magicDGS added a commit to bioinformagik/htsjdk that referenced this pull request Mar 21, 2018
@magicDGS magicDGS restored the dgs_tribble_path_support branch June 28, 2018 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants