Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Included support for indel error detection and read filtering in CollectSamErrorMetrics #1370

Merged
merged 4 commits into from
Sep 24, 2019

Conversation

michaelgatzen
Copy link
Contributor

@michaelgatzen michaelgatzen commented Aug 1, 2019

  • Created new IndelErrorCalculator and IndelErrorMetric classes
  • Adjusted CollectSamErrorMetrics to be able to consider insertions and
    deletions, as well as number of inserted and deleted bases
  • Added indel_length stratifier
  • Using base operation information that have been incorporated into htsjdk
  • Tests for indel error detection and filtering

Description

Give your PR a concise yet descriptive title
Please explain the changes you made here.
Explain the motivation for making this change. What existing problem does the pull request solve?
Mention any issues fixed, addressed or otherwise related to this pull request, including issue numbers or hard links for issues in other repos.
You can delete these instructions once you have written your PR description.


Checklist (never delete this)

Never delete this, it is our record that procedure was followed. If you find that for whatever reason one of the checklist points doesn't apply to your PR, you can leave it unchecked but please add an explanation below.

Content

  • Added or modified tests to cover changes and any new functionality
  • Edited the README / documentation (if applicable)
  • All tests passing on Travis

Review

  • Final thumbs-up from reviewer
  • Rebase, squash and reword as applicable

For more detailed guidelines, see https://github.com/broadinstitute/picard/wiki/Guidelines-for-pull-requests

@yfarjoun
Copy link
Contributor

yfarjoun commented Aug 2, 2019

Yikes! that's some PR!

could you split this into 3 separate PRs?

  1. add indel error.
  2. add indel stratifiers.
  3. add filtering ability.

I think that it will make reviewing the code much easier.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though I asked that you split this up, I'm still going to give a few comments here....

@jonn-smith
Copy link
Contributor

jonn-smith commented Aug 5, 2019

@yfarjoun Since these features are built upon each other they'll be somewhat difficult to deconflict at this point. It's a larger PR, but not completely unwieldy so I think it'll be OK to review without splitting it up.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this big PR!

the indel error stuff looks good, though I haven't tried it out yet (I'm excited to do so!)

the bulk of the code is about this new filtering DSL, but there's no real testing of it, nor any documentation as to what it is for...also the way it is implemented seems to store the entire input in memory...which doesn't seem to be a good idea (nor needed!) could you explain in the PR or the javaDoc what the filtering is actually for and how it works? it also seems that there are now lots of new output files, but it's unclear what they are for....

Lots of missing "finals" and parenthesis and brace formatting (ask IntelliJ to reformat your code)

@@ -0,0 +1,28 @@
package picard.sam.SamErrorMetric;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all new files need a license

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, added that. There were also some other files in the SamErrorMetric directory that didn't have licenses (e.g. OverlappingErrorMetric), I added a license to those as well if that's ok

return null;
}
if(operation == CollectSamErrorMetrics.BaseOperation.Insertion && cigarElement.getOperator() != CigarOperator.I) {
log.warn("Wrong CIGAR operator for the given position. This is an error and should be fixed.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems to be more serious than a "warning"....what's going on? shoudln't this throw an exception here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, unclear what "should be fixed" means...did the user do something wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case should never occur (if it does, it would mean an error in the CIGAR string or an error in the code, i.e. an error by me), however, it is seems like a sensible check to make. I now throw an exception instead of a log warning, let me know if you have a better suggestion

src/main/java/picard/sam/SamErrorMetric/ErrorType.java Outdated Show resolved Hide resolved
@coveralls
Copy link

coveralls commented Aug 7, 2019

Coverage Status

Coverage increased (+0.5%) to 80.429% when pulling 13f00e0 on mg_longread_error_reporting into 8d55ae2 on master.

@michaelgatzen
Copy link
Contributor Author

I now removed all the read output "filtering" from this PR and actually tried to split this PR up into smaller ones as well.
Before I added all the indel error handling, @jonn-smith made some optimizations to the locus iteration. I prepared a separate branch for him, and he'll open another PR for that.
Therefore, this branch only contains the new indel functionality and the new stratifiers. I tried to separate the stratifiers from the indel functionality and make a separate PR for that as well, but some of the new stratifiers depend on the indel features and it seemed simpler to have it in one PR rather than having interdependencies between two PRs. Let me know if that's ok or if you would suggest otherwise.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round of mostly minor changes...but here's a bigger request:

could you change AbstractRecordAndOffset to include a Raison D'être (MATCH, INSERTION, DELETION) with default constructor providing MATCH? this way you will not need to change all the addBase signatures to bubble down the operation and instead just ask the RAO itself..... what do you think?


if (operation == CollectSamErrorMetrics.BaseOperation.Insertion) {
nInserts++;
CigarElement cigarElement = ReadBaseStratification.getIndelElement(recordAndOffset, operation);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this line be extracted from the if statement?

/*
* The MIT License
*
* Copyright (c) 2018 The Broad Institute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2019 here and elsewhere


/**
* A calculator that estimates the error rate of the bases it observes for indels only.
* Created by jonn on 6/13/19.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for "created by" line.

lbergelson pushed a commit to samtools/htsjdk that referenced this pull request Aug 21, 2019
* Added alignment type to RecordAndOffset class

- This was suggested as part of broadinstitute/picard#1370
- necessary to determine whether a RecordAndOffset object is an
(alignment) match, insertion, or deletion when queried by a
SamLocusIterator
- default behavior is preserved by overloading the default constructor

* Added tests for including the alignment type in RecordAndOffset
@@ -36,8 +38,16 @@
**/
@Override
public void addBase(final SamLocusIterator.RecordAndOffset recordAndOffset, final SamLocusAndReferenceIterator.SAMLocusAndReference locusAndRef) {
if (!SequenceUtil.isNoCall(recordAndOffset.getReadBase())) {
nBases++;
// TODO mgatzen should deletions be counted towards total bases?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I personally think that only read-bases should count towards the total bases, which will be used as the denominator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree - thank you

nBases++;
}
} else if (recordAndOffset.getAlignmentType() == AbstractRecordAndOffset.AlignmentType.Insertion) {
CigarElement cigarElement = ReadBaseStratification.getIndelElement(recordAndOffset);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

/**
* Stratifies according to the number of indel bases (from CIGAR string) that the read has.
*/
public static class IndelsInReadStratifier extends RecordStratifier<Integer> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be implemented as a sum of two CigarOperatorsInReadStratifier (I and D)? probably not as fast, but less code repetition?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I moved the actual counting to a private static method so that we don't need to instantiate two Stratifier objects to get the result. (That probably made it more code overall, but less repetition.)


@Test(dataProvider = "provideForTestHasDeletionBeenProcessed")
public void testHasDeletionBeenProcessed(final String cigarString, final List<Boolean> expected) {
Cigar cigar = TextCigarCodec.decode(cigarString);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finals

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. thanks for all the test. just a few minor comments and I'm happy for this to be merged.

jonn-smith and others added 3 commits September 24, 2019 13:03
- Created new IndelErrorCalculator and IndelErrorMetric classes
- Adjusted CollectSamErrorMetrics to be able to consider insertions and
deletions, as well as number of inserted and deleted bases
- Tests for indel error detection
- Added indel_length stratifier
- The PairOrientation stratifier now checks if the read is paired and
returns null if it is not
- Using operator information to that has been incorporated into htsjdk
- Merged mg_longread_error_reporting_iteration into mg_longread_error_reporting
@michaelgatzen michaelgatzen merged commit 34d7285 into master Sep 24, 2019
@michaelgatzen michaelgatzen deleted the mg_longread_error_reporting branch September 24, 2019 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants