-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MergeSVEvidence #7695
MergeSVEvidence #7695
Conversation
@droazen May want someone from engine team to look at this since it touches some core classes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tedsharpe this looks good so far. You've done the hard part of finagling the evidence feature classes into a nice framework for this. I've taken a first pass, excluding tests, and I just had two main comments:
- The tool currently just merges records in dictionary order and subsets samples when necessary. I probably wasn't clear on everything the tool should do, but I have a comment below outlining the cases when Features actually need to be merged as well (e.g. RD records at the same interval with different samples). It would also be nice to have some kind of check for collisions - for example the same sample and interval defined in two of the input files is probably unexpected and should result in an error.
- It seems like there's a lot of overlap with the
MultiVariantWalker
andMultiVariantDataSource
classes. IMO,VariantContext
implementsFeature
and therefore it seems redundant to have completely separate classes forFeature
s as well. Is there a way to consolidate the code such thatMultiVariantWalker
extendsFeatureMergingWalker
and/or defining aMultiFeatureDataSource
extended byMultiVariantDataSource
? @droazen any thoughts?
|
||
import htsjdk.samtools.SAMSequenceDictionary; | ||
import org.broadinstitute.hellbender.utils.Utils; | ||
|
||
import java.util.List; | ||
|
||
public class FeaturesHeader { | ||
public class SVFeaturesHeader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why you needed to move and rename this, although I can see the class isn't used anywhere currently. Maybe it will make more sense as I go along.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
But there's no particularly compelling reason to change it, and it can be reverted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'm fine with it if it's only used for SV classes
public Set<String> getSampleNames() { return samples; } | ||
|
||
private void setDictionaryAndSamples() { | ||
dictionary = getMasterSequenceDictionary(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use getBestAvailableSequenceDictionary() instead? Rather than defining special logic here (although you would need to modify it to look at the features headers)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That method pukes if there are multiple dictionaries -- even if they're equivalent. And trying to scrape an incomplete dictionary from the index compounded the problem.
This method allows multiple dictionaries if they're all subsets (with respect to name and order) of the largest dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I think a comment for this function explaining that logic would be helpful. IMO, it's a little confusing to have multiple ways of handling multi-dictionary inputs within the engine. MultiVariantWalker
seems to just use the first vcf's dictionary which also seems arbitrary and contains some costly code to check the dictionaries for consistency.
Maybe we should provide a way for subclasses to choose what kind of dictionary check is performed? That way you have both options for both MultiVariantWalker
and FeatureMergingWalker
. Perhaps the engine folks should weigh in here though.
src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java
Outdated
Show resolved
Hide resolved
* To use this walker you need only implement the abstract apply method in a class that declares | ||
* a list of FeatureInputs as an argument. | ||
*/ | ||
public abstract class FeatureMergingWalker<F extends Feature> extends WalkerBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about MultiFeatureWalker
? Akin to MultiVariantWalker
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you referring to the name? Seems like a fine suggestion to rename it MultiFeatureWalker, now that I understand MultiVariantWalker better. However, I'd rather not get bogged down in refactoring the MultiVariantWalker for this PR unless you or the engine team think it's important to do so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes here I was just suggesting a name change. I'll also leave it up to the engine team to decide whether they should actually be sharing code.
src/main/java/org/broadinstitute/hellbender/tools/sv/MergeSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/MergeSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/DepthEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/LocusDepth.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/sv/PrintSVEvidence.java
Outdated
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/engine/FeatureMergingWalker.java
Outdated
Show resolved
Hide resolved
I think this is ready for another PR review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tedsharpe I think this is almost ready! For the FeatureOutputCodec
, I think we should err on the side of caution here and do some extra checks to make sure inputs are as expected, even if there is some performance hit. I also have some suggestions on how to refactor a bit to make some of its functionality more re-usable for future tools.
Other than that, there are just some places that would benefit from some documentation, including the tool itself.
Comparator<F> getSameLocusComparator(); | ||
void resolveSameLocusFeatures( PriorityQueue<F> queue, S sink ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since they're public, can you make these methods safer to use in your implementations? I'd worry about bugs where someone (i.e. me) hands it features from different loci. Throw an error if that happens.
I do think that this functionality could be useful in future tools as well - being able to consume lists of feature files and merge them on the fly could come in handy. Can you move these to SVFeature
instead? To be more general-purpose, consume a Collection<SVFeature>
input and have an SVFeature output? Seems like most of the implementations are in the feature classes anyway.
It would be nice to have SVFeature
or the evidence classes themselves implement comparable more generally, and enforce stable ordering in the output, even if there's a small performance hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved resolution of same-locus features into a general post-processing companion class (which would be available for other situations). One could hook these up as part of a streaming operation, I think, which would allow your suggested use case.
I really wish that features could implement Comparable. Alas, they require a dictionary to compare. This could be encapsulated in a Comparable, but then you'd have to pass the comparable around (instead of passing a dictionary around), and so you're no further ahead.
* </pre> | ||
* | ||
* @author Mark Walker <[email protected]> | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going to need tool documentation here - you can base if off the old documentation. Make sure to enumerate the different constraints on each evidence type, and expected behavior for special cases such as filling in "missing values" for depth evidence.
LocusDepth lastEvidence = queue.poll(); | ||
while ( !queue.isEmpty() ) { | ||
final LocusDepth evidence = queue.poll(); | ||
if ( comparator.compare(lastEvidence, evidence) == 0 ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this approach I would also worry about bugs where the order is not what we expect - i.e. queue
is not built with the Feature's comparator
. Can you make this throw an error if this is the case by checking queue.comparator()
(and for the other features as well)?
|
||
@VisibleForTesting | ||
public final class LocusDepth implements SVFeature { | ||
private final String contig; | ||
private final int position; | ||
private final String sample; | ||
private final byte refCall; // index into nucleotideValues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just occurred to me that we probably don't need to know the reference base for SV calling. Can always be looked up anyhow.
*/ | ||
public Set<String> getSampleNames() { return samples; } | ||
|
||
private void setDictionaryAndSamples() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add some documentation with an overview of the logic here?
return largeDict; | ||
} | ||
|
||
public static final class MergingIterator<F extends Feature> implements Iterator<PQEntry<F>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also a quick doc here
|
||
import htsjdk.samtools.SAMSequenceDictionary; | ||
import org.broadinstitute.hellbender.utils.Utils; | ||
|
||
import java.util.List; | ||
|
||
public class FeaturesHeader { | ||
public class SVFeaturesHeader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'm fine with it if it's only used for SV classes
import java.util.ArrayList; | ||
import java.util.List; | ||
|
||
public final class FeatureOutputCodecFinder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a quick comment here explaining why this is needed would be helpful
if ( count != MISSING_DATA ) { | ||
if ( evCounts[idx] == MISSING_DATA ) { | ||
evCounts[idx] = tmpCounts[idx]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of neat that we can "patch" missing regions if needed. Make sure to document this here and in the tool doc.
public final static Comparator<DiscordantPairEvidence> comparator = | ||
Comparator.comparing(DiscordantPairEvidence::getSample); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could always return 0? Unless there's an important reason to sort the output by sample name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I think if you still need these comparators after implementing Comparable, they should be private and renamed to something like sameLocusComparator
Thank you @tedsharpe for addressing my comments, I just have two additional requests:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Looks good to me. @droazen did you want to review as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tedsharpe I've posted a review on the changes to engine-level classes -- a couple of requests but nothing major.
* Operations performed just prior to the start of traversal. | ||
*/ | ||
@Override | ||
public void onTraversalStart() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
onTraversalStart()
should generally be reserved for tool authors to override to perform whatever initialization their tool needs. Initialization for Walker classes themselves should be done by overriding onStartup()
, making it final
, and calling super.onStartup();
as the first line, as seen in, eg., FeatureWalker
.
One of these days we really need to rename onStartup()
to something like initializeTraversal()
or initializeEngine()
to make this distinction clearer.
* To use this walker you need only implement the abstract apply method in a class that declares | ||
* a collection of FeatureInputs as an argument. | ||
*/ | ||
public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add an ExampleMultiFeatureWalker
in org.broadinstitute.hellbender.tools.examples
+ an ExampleMultiFeatureWalkerIntegrationTest
, modeled after the existing ExampleFeatureWalker
and ExampleFeatureWalkerIntegrationTest
? We generally try to do this for each new traversal added to GATK both as examples for tool authors and to catch regressions in the Walker classes.
* multiple sources of Features. The input files for each feature must be sorted by locus. | ||
* | ||
* To use this walker you need only implement the abstract apply method in a class that declares | ||
* a collection of FeatureInputs as an argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take my suggestion below to override onStartup()
instead of onTraversalStart()
, add to the docs here:
and may optionally implement {@link #onTraversalStart()}, {@link #onTraversalSuccess()}, and/or {@link #closeTool()}.
which is the usual pattern for Walker classes
* @param feature Current Feature being processed. | ||
* @param header Header object for the source from which the feature was drawn (may be null) | ||
*/ | ||
public abstract void apply( final F feature, final Object header ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any potential use case for having overlapping reference bases and/or reads as a side input here? If that's outside the scope of this traversal then that's fine, but if it could be potentially useful for future tools consider adding a ReferenceContext
and/or ReadsContext
to apply()
, as seen in FeatureWalker
and most of the other Walker classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add ReadsContext and ReferenceContext to apply method? OK, I've done so.
[Editorializing of the worst sort: I've done so because consistency seems more important right now than design, but it bugs me. It seems crazy to pass dummy objects that might not even be useful (when unbacked) when we don't know whether we even need them. Seems to me that it would've been much smarter to provide a walker callback method that the apply methods could use if they need this info. Please put it on your list of design annoyances that we'll never have time to fix. Or maybe there's a good justification that has simply escaped me.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tedsharpe GATK has always been structured around universal arguments like -R
and -I
that all tools accept, and that (when provided) automatically populate contextual objects passed in to the tools. The main justification for this design is to not require any extra work for the tool authors when overlapping data from other sources is required, and to have the most common kinds of inputs "wired up" in advance and available for the tool to use as needed. Tools can call the hasReference()
, etc., methods from GATKTool
, or the hasBackingDataSource()
methods on the context objects themselves if they need to query whether a particular kind of contextual data is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I get that. IMHO, it's not really extra work to get the objects you need when you actually need them, and that's better than preparing dummy objects that you might not even want or use that get passed to you every time. I know that's how it's always worked, but it seems silly to me. I revised the code to do things the standard way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little more discoverable / self-documenting this way -- a reminder that the engine can provide you with all these side inputs if you need them. Plus there is no overhead to having these contextual objects around: they are all implemented in terms of lazy queries that don't happen until you access them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. Comes from the Ministry of Silly Walkers. That's my story, and I'm stickin' to it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hah, well, it may not convince you, but here's one last defense: the context objects are all initialized with the interval of the primary record -- since they are "tied" to the primary record in that sense, it makes sense that they should be passed in during the traversal along with the primary record.
} | ||
} | ||
|
||
private SAMSequenceDictionary bestDictionary( final SAMSequenceDictionary newDict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a method comment documenting the approach taken for selecting the "best" dictionary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach for "best" dictionary was documented in setDictionaryAndSamples, but I've rejiggered it to make it less easily missed. (The bestDictionary method is now a static betterDictionary method, which is what was actually going on.)
@@ -473,8 +473,8 @@ final void printCacheStats() { | |||
public SAMSequenceDictionary getSequenceDictionary() { | |||
SAMSequenceDictionary dict = null; | |||
final Object header = getHeader(); | |||
if ( header instanceof FeaturesHeader ) { | |||
dict = ((FeaturesHeader)header).getDictionary(); | |||
if ( header instanceof SVFeaturesHeader ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the rationale behind the FeaturesHeader
-> SVFeaturesHeader
migration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why FeaturesHeader --> SVFeaturesHeader? It just seemed to me that the name was too generic -- suggesting that it applied to all Features -- and, in particular, that it could easily be confused with the FeatureCodecHeader. I just wanted to make the name suggest to which Feature sub-types it applied.
import java.util.ArrayList; | ||
import java.util.List; | ||
|
||
public class MultiFeatureWalkerUnitTest extends CommandLineProgramTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I think this can become the ExampleMultiFeatureWalkerIntegrationTest
I requested above if you promote your DummyMultiFeatureWalker
into an official ExampleMultiFeatureWalker
in org.broadinstitute.hellbender.tools.examples
"-" + StandardArgumentDefinitions.INPUT_SHORT_NAME, largeFileTestDir + "NA12878.alignedHg38.duplicateMarked.baseRealigned.bam", | ||
// no dictionary, no sample names, with a single feature | ||
"-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt" | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using ArgumentsBuilder
to build the command lines for your test cases.
// no dictionary, no sample names, with a single feature | ||
"-" + StandardArgumentDefinitions.FEATURE_SHORT_NAME, packageRootTestDir + "engine/tiny_hg38.baf.txt" | ||
}; | ||
dummy.instanceMain(args); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use runCommandLine(args)
after this becomes ExampleMultiFeatureWalkerIntegrationTest
lastStart = feature.getStart(); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need at least one test case that uses -L <intervals>
Remainder of review comments addressed directly with no guff. |
OK. Passes checks. @droazen One final look? I couldn't figure out how to use runCommandLine, because I needed access to my walker instance. Is this OK, or is there a better way to do it? |
@droazen Can I get a thumbs up on this if I've addressed your suggestions adequately, or further input if not? Thanks. |
@tedsharpe I'll take a look now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tedsharpe Engine changes look good now -- a few last easy comments, then go ahead and merge once they're addressed and tests pass (assuming of course that the changes in the sv package, which I didn't look at, have been fully reviewed)
} | ||
if ( newIdx <= lastIdx ) { | ||
throw new UserException("Contig " + rec.getContig() + | ||
" not in same order as in larger dictionary"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these are exceptions that actual users are likely to trigger, it would be helpful to include the contents of both dictionaries in the error messages. You can do this by throwing a UserException.IncompatibleSequenceDictionaries
and passing the two dictionaries in.
public abstract class MultiFeatureWalker<F extends Feature> extends WalkerBase { | ||
|
||
SAMSequenceDictionary dictionary; | ||
final Set<String> samples = new TreeSet<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are accessor methods for these, can they be private
? And relatedly, should the Set of samples be wrapped in a Collections.unmodifiableSet()
after initialization to prevent downstream modification?
* internal state as possible. | ||
* | ||
* @param feature Current Feature being processed. | ||
* @param header Header object for the source from which the feature was drawn (may be null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for the all-important ReadsContext
and ReferenceContext
args
final ReadsContext readsContext, | ||
final ReferenceContext referenceContext ) { | ||
// We'll just keep track of the Features we see, in the order that we see them. | ||
features.add(feature); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little inconsistent with the rest of the example walkers, which all produce some diagnostic output showing the records that are processed by the traversal. Could you have this tool produce some textual output along the lines of ExampleFeatureWalker
in addition to accumulating the Features in memory?
5ae634f
to
0bf2c44
Compare
No description provided.