Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tribble writing support #822

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

magicDGS
Copy link
Member

@magicDGS magicDGS commented Mar 10, 2017

Description

As described in #701, it will be nice to have a simple mechanism to write features in tribble not only for VariantContext. This is a port from one of my project that I think that a lot of API users will benefit. It includes the following:

  • FeatureWriter interface to encapsulate adding features and writing headers. There is no support for comments yet, but it could be something added in the future.
  • FeatureEncoder interface to encapsulate feature-specific output formats. Based on a very nice suggestion from @lindenb in one of my previous PRs, the purpose of this interface is to format the feature/header and add it to an Appendable. This allow to write directly the features without storing the String format, and give more flexibility. It also have a method to encapsulate the tabix format, to allow indexing on the fly.
  • Example of FeatureEncoder to convert any Feature into a simple BED format (tab-delimited chromosome, start and end). It will be used also in tests.
  • Package protected implementation of FeatureWriter similar to the VCF writer, which uses an encoder and an output stream to create the feature file. This class can be extended for other purposes as noted bellow.
  • Package protected sub-class of FeatureWriterImpl to index on the fly. This one will require eventually Tribble/Tabix index path support #810 for use the java.nio.Path index method, but it is very useful for writing feature files already indexed.
  • Wrapper for FeatureWriter to use asynchronous writing. This is a very simple wrapper extending the AbstractAsyncWriter. The only flaw is that the header could not be written asynchronously with the current implementation (and thus, it's not safe). I do not really need that, and if it is a problem that it's not safe, this support could be removed.
  • Finally, for construction of feature writers, I decided to use a factory pattern instead of the static methods used in the tribble reader classes. This factory have support: 1) index on the fly, 2) block-gzip or gzip, 3) generate MD5 diggest files, 4) asynchronous wriitng and 5) sequence dictionary addition.

All the writing support use java.nio.Path to allow other filesystems. There are no tests for the writers yet (only for the simple encoder), because I want to know your opinion about this addition beforehand. I know that this is a very big PR, but I think that it may be useful for others and will be nice to included in HTSJDK for easier development of new bioinformatic tools.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@magicDGS
Copy link
Member Author

Because you were interested in the open issue, can you have a look to this one, @lbergelson? It still needs #810 for simplify some code paths, but I think that this could be cool for others here. I'm already using this code in my own tools, and it works amazingly!

@codecov-io
Copy link

codecov-io commented Mar 10, 2017

Codecov Report

Merging #822 into master will decrease coverage by 0.096%.
The diff coverage is 49.565%.

@@               Coverage Diff               @@
##              master      #822       +/-   ##
===============================================
- Coverage     66.198%   66.101%   -0.096%     
+ Complexity      7640      7631        -9     
===============================================
  Files            535       539        +4     
  Lines          32406     32491       +85     
  Branches        5514      5514               
===============================================
+ Hits           21452     21477       +25     
- Misses          8790      8847       +57     
- Partials        2164      2167        +3
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/tribble/writer/FeatureEncoder.java 0% <0%> (ø) 0 <0> (?)
...va/htsjdk/tribble/writer/FeatureWriterFactory.java 0% <0%> (ø) 0 <0> (?)
.../java/htsjdk/tribble/writer/FeatureWriterImpl.java 100% <100%> (ø) 6 <6> (?)
...java/htsjdk/tribble/writer/AsyncFeatureWriter.java 80% <80%> (ø) 7 <7> (?)
...main/java/htsjdk/tribble/bed/BEDSimpleEncoder.java 85.714% <85.714%> (ø) 3 <3> (?)
...a/htsjdk/tribble/writer/IndexingFeatureWriter.java 87.5% <87.5%> (ø) 5 <5> (?)
.../samtools/seekablestream/SeekableMemoryStream.java 95% <0%> (-5%) 11% <0%> (-1%)
.../htsjdk/tribble/index/tabix/TabixIndexCreator.java 84.848% <0%> (-4.545%) 14% <0%> (-1%)
...dk/samtools/seekablestream/SeekableHTTPStream.java 54.545% <0%> (-3.03%) 9% <0%> (-4%)
...ntcontext/writer/IndexingVariantContextWriter.java 83.333% <0%> (-2.381%) 15% <0%> (-1%)
... and 25 more

@magicDGS magicDGS force-pushed the dgs_tribble_writing_support branch from e6db510 to c2721e2 Compare March 10, 2017 16:37
@lbergelson
Copy link
Member

@magicDGS Sorry I haven't gotten back to you on this. I think we're very interested in this but I haven't had time to take a proper look. It's a big update and there's an upcoming deadline for some of my own work that's taking priority.

@magicDGS
Copy link
Member Author

No worries @lbergelson. I'm already using this code in another project, but I though that it would be nice to have support for this in HTSJDK. Whenever you have time I'm up for working in your suggestions...

@magicDGS
Copy link
Member Author

magicDGS commented May 8, 2017

Friendly ping here @lbergelson!

@droazen
Copy link
Contributor

droazen commented Jan 25, 2018

@jonn-smith has volunteered to review this old branch, as he is finding the lack of tribble writing support quite awkward in his current project.

@jonn-smith
Copy link
Contributor

@magicDGS Can you rebase this and I'll start taking a look?

@magicDGS
Copy link
Member Author

@jonn-smith - I rebased, and I hope that it works with the current implementation of bgzip, which changed a lot after I had the last look to this branch.

Thanks for looking into this, will be really, really helpful!

*
* @author Daniel Gomez-Sanchez (magicDGS)
*/
public final class FeatureWriterFactory {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class is completely untested....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was working slowly on the tests until someone takes a look to the design, because I didn't want to invest time on something that might change completely the design after review. This class will have tests after the rest of components are properly tested and reviewed.

*
* @author Daniel Gomez-Sanchez (magicDGS)
*/
final class IndexingFeatureWriter<F extends Feature> extends FeatureWriterImpl<F> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests not running?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll rebase to check if it is a problem of the diff


/**
* Writes as a BED feature. Note: BED coordinates are 0-based, so the start position returned by
* {@link Feature#getStart()} may be different thant the output start position.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thant->than

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@yfarjoun
Copy link
Contributor

yfarjoun commented Feb 4, 2019

@jonn-smith are you going to get around to reviewing this?

@jonn-smith jonn-smith self-requested a review February 4, 2019 19:32
@jonn-smith jonn-smith self-assigned this Feb 4, 2019
@yfarjoun
Copy link
Contributor

Another friendly ping @jonn-smith. This was rebased as you asked a year ago....

* function the line buffer is reset so the contents of the buffer can be reused
*/
private void writeAndResetBuffer() throws IOException {
writer.flush();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really want to flush on every record? seems wasteful

*
* @author Daniel Gomez-Sanchez (magicDGS)
*/
class FeatureWriterImpl<F extends Feature> implements FeatureWriter<F> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename LineFeatureWriter

}

/** Wraps if necessary the output stream. */
private static OutputStream asLocationAwareStream(final OutputStream outputStream) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not return OutputStream

this.indexPath = indexPath;
this.refDict = refDict;
// this is already a LocationAware
this.location = (LocationAware) getOutputStream();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps override getOutputStream() and do the cast there.

@Override
public void close() throws IOException {
if (refDict != null) {
indexer.setIndexSequenceDictionary(refDict);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason for this to be here rather than the constructor?

* @throws IOException if an IO error occurs.
* @throws htsjdk.tribble.TribbleException if an encoding error occurs.
*/
public void writeHeader(final Object header) throws IOException;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be H type rather than Object

public final class FeatureWriterFactory {

/** Default approach for create tribble indexes. */
public static final IndexFactory.IndexBalanceApproach DEFAULT_INDEX_BALANCE_APPROACH = IndexFactory.IndexBalanceApproach.FOR_SEEK_TIME;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woudl be ince to have a way to inject implementation-specific options.

for example: compression level for gzip-compressed files. indexing options, specific version of writer if applicable.


final IndexCreator idxCreator;
final Path indexPath;
if (blockCompressed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to have a factory for the index itself?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants