-
Notifications
You must be signed in to change notification settings - Fork 242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tribble writing support #822
base: master
Are you sure you want to change the base?
Tribble writing support #822
Conversation
Because you were interested in the open issue, can you have a look to this one, @lbergelson? It still needs #810 for simplify some code paths, but I think that this could be cool for others here. I'm already using this code in my own tools, and it works amazingly! |
Codecov Report
@@ Coverage Diff @@
## master #822 +/- ##
===============================================
- Coverage 66.198% 66.101% -0.096%
+ Complexity 7640 7631 -9
===============================================
Files 535 539 +4
Lines 32406 32491 +85
Branches 5514 5514
===============================================
+ Hits 21452 21477 +25
- Misses 8790 8847 +57
- Partials 2164 2167 +3
|
e6db510
to
c2721e2
Compare
@magicDGS Sorry I haven't gotten back to you on this. I think we're very interested in this but I haven't had time to take a proper look. It's a big update and there's an upcoming deadline for some of my own work that's taking priority. |
No worries @lbergelson. I'm already using this code in another project, but I though that it would be nice to have support for this in HTSJDK. Whenever you have time I'm up for working in your suggestions... |
Friendly ping here @lbergelson! |
c2721e2
to
9cd4459
Compare
9cd4459
to
88ac45b
Compare
27f9483
to
8487f19
Compare
@jonn-smith has volunteered to review this old branch, as he is finding the lack of tribble writing support quite awkward in his current project. |
@magicDGS Can you rebase this and I'll start taking a look? |
@jonn-smith - I rebased, and I hope that it works with the current implementation of bgzip, which changed a lot after I had the last look to this branch. Thanks for looking into this, will be really, really helpful! |
7b106b9
to
3277b79
Compare
* | ||
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
public final class FeatureWriterFactory { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this class is completely untested....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was working slowly on the tests until someone takes a look to the design, because I didn't want to invest time on something that might change completely the design after review. This class will have tests after the rest of components are properly tested and reviewed.
* | ||
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
final class IndexingFeatureWriter<F extends Feature> extends FeatureWriterImpl<F> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tests not running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In codecov it appears with 87.5% coverage (only sequence dictionary added to the index is uncovered)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll rebase to check if it is a problem of the diff
|
||
/** | ||
* Writes as a BED feature. Note: BED coordinates are 0-based, so the start position returned by | ||
* {@link Feature#getStart()} may be different thant the output start position. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thant->than
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
- Encoder-based implementation - Implementation for indexing on the fly
852c842
to
bc7e6e6
Compare
@jonn-smith are you going to get around to reviewing this? |
Another friendly ping @jonn-smith. This was rebased as you asked a year ago.... |
* function the line buffer is reset so the contents of the buffer can be reused | ||
*/ | ||
private void writeAndResetBuffer() throws IOException { | ||
writer.flush(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we really want to flush on every record? seems wasteful
* | ||
* @author Daniel Gomez-Sanchez (magicDGS) | ||
*/ | ||
class FeatureWriterImpl<F extends Feature> implements FeatureWriter<F> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename LineFeatureWriter
} | ||
|
||
/** Wraps if necessary the output stream. */ | ||
private static OutputStream asLocationAwareStream(final OutputStream outputStream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not return OutputStream
this.indexPath = indexPath; | ||
this.refDict = refDict; | ||
// this is already a LocationAware | ||
this.location = (LocationAware) getOutputStream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps override getOutputStream() and do the cast there.
@Override | ||
public void close() throws IOException { | ||
if (refDict != null) { | ||
indexer.setIndexSequenceDictionary(refDict); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for this to be here rather than the constructor?
* @throws IOException if an IO error occurs. | ||
* @throws htsjdk.tribble.TribbleException if an encoding error occurs. | ||
*/ | ||
public void writeHeader(final Object header) throws IOException; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be H type rather than Object
public final class FeatureWriterFactory { | ||
|
||
/** Default approach for create tribble indexes. */ | ||
public static final IndexFactory.IndexBalanceApproach DEFAULT_INDEX_BALANCE_APPROACH = IndexFactory.IndexBalanceApproach.FOR_SEEK_TIME; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woudl be ince to have a way to inject implementation-specific options.
for example: compression level for gzip-compressed files. indexing options, specific version of writer if applicable.
|
||
final IndexCreator idxCreator; | ||
final Path indexPath; | ||
if (blockCompressed) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it make sense to have a factory for the index itself?
Description
As described in #701, it will be nice to have a simple mechanism to write features in tribble not only for
VariantContext
. This is a port from one of my project that I think that a lot of API users will benefit. It includes the following:FeatureWriter
interface to encapsulate adding features and writing headers. There is no support for comments yet, but it could be something added in the future.FeatureEncoder
interface to encapsulate feature-specific output formats. Based on a very nice suggestion from @lindenb in one of my previous PRs, the purpose of this interface is to format the feature/header and add it to anAppendable
. This allow to write directly the features without storing theString
format, and give more flexibility. It also have a method to encapsulate the tabix format, to allow indexing on the fly.FeatureEncoder
to convert anyFeature
into a simple BED format (tab-delimited chromosome, start and end). It will be used also in tests.FeatureWriter
similar to the VCF writer, which uses an encoder and an output stream to create the feature file. This class can be extended for other purposes as noted bellow.FeatureWriterImpl
to index on the fly. This one will require eventually Tribble/Tabix index path support #810 for use thejava.nio.Path
index method, but it is very useful for writing feature files already indexed.FeatureWriter
to use asynchronous writing. This is a very simple wrapper extending theAbstractAsyncWriter
. The only flaw is that the header could not be written asynchronously with the current implementation (and thus, it's not safe). I do not really need that, and if it is a problem that it's not safe, this support could be removed.All the writing support use
java.nio.Path
to allow other filesystems. There are no tests for the writers yet (only for the simple encoder), because I want to know your opinion about this addition beforehand. I know that this is a very big PR, but I think that it may be useful for others and will be nice to included in HTSJDK for easier development of new bioinformatic tools.Checklist