Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

making changes to reduce size of giant interval lists #1309

Merged
merged 3 commits into from
Mar 4, 2019

Conversation

lbergelson
Copy link
Member

@lbergelson lbergelson commented Feb 27, 2019

Description

Two separate changes which should reduce the in-memory size of IntervalList and OverlapDetector.

1: Cache the last seen contig name when loading IntervalLists and reuse the same String to avoid keeping many duplicate copies of "chr1" in memory. This should have very minor overhead since interval files are sorted it should be about as effective as a more complicated caching scheme.

  1. Use SingletonSets as the leaf objects in OverlapDetector when constructing it. These are substantially smaller than HashSets of the same size. Change to HashSet only when we find 2 or more identically positioned elements. This will have a minor performance cost while constructing the OverlapDetector but should result in dramatically decreased size in memory.

Checklist

  • Code compiles correctly
  • New tests covering changes and new functionality
  • All tests passing
  • Extended the README / documentation, if necessary
  • Is not backward compatible (breaks binary or source compatibility)

@lbergelson
Copy link
Member Author

@samuelklee Simple changes to reduce the in memory size of interval lists.
@nh13 This might benefit you as well since it sounds like you're dealing with very large interval lists as well.

@codecov-io
Copy link

codecov-io commented Feb 27, 2019

Codecov Report

Merging #1309 into master will increase coverage by 0.996%.
The diff coverage is 85.714%.

@@               Coverage Diff               @@
##              master     #1309       +/-   ##
===============================================
+ Coverage     67.728%   68.724%   +0.996%     
- Complexity      8225      8855      +630     
===============================================
  Files            560       562        +2     
  Lines          33493     35503     +2010     
  Branches        5637      6165      +528     
===============================================
+ Hits           22684     24399     +1715     
- Misses          8632      8835      +203     
- Partials        2177      2269       +92
Impacted Files Coverage Δ Complexity Δ
...ain/java/htsjdk/samtools/util/OverlapDetector.java 96.104% <100%> (ø) 27 <3> (+2) ⬆️
...c/main/java/htsjdk/samtools/util/IntervalList.java 74.754% <100%> (+0.335%) 75 <0> (+1) ⬆️
...c/main/java/htsjdk/samtools/util/IntervalTree.java 75.052% <66.667%> (+0.476%) 45 <2> (+3) ⬆️
...tools/cram/encoding/readfeatures/Substitution.java 83.051% <0%> (-9.054%) 19% <0%> (-3%)
...a/htsjdk/samtools/cram/build/ContainerFactory.java 95.238% <0%> (-1.732%) 9% <0%> (+3%)
...va/htsjdk/samtools/cram/build/ContainerParser.java 95.122% <0%> (-1.174%) 19% <0%> (+1%)
...samtools/cram/structure/CramCompressionRecord.java 92.593% <0%> (-0.376%) 161% <0%> (+48%)
...oding/reader/MultiRefSliceAlignmentSpanReader.java 100% <0%> (ø) 9% <0%> (+4%) ⬆️
...ava/htsjdk/samtools/cram/ref/ReferenceContext.java 91.667% <0%> (ø) 22% <0%> (?)
...htsjdk/samtools/cram/ref/ReferenceContextType.java 100% <0%> (ø) 1% <0%> (?)
... and 31 more

@samuelklee
Copy link

Thanks for making these changes! I think the per-contig OverlapDetector you suggested was enough to bring down memory in CollectReadCounts, so I didn't see any change there. However, I do see a substantial improvement in ModelSegments, which still uses global OverlapDetectors. I was able to run with ~10M bins and <8GB memory, while master required ~10GB. Runtime also actually went down slightly from 18 minutes to 17 minutes, but that might be within the noise.

if (alreadyThere != null) {
alreadyThere.add(object);
tree.put(start, end, alreadyThere);
if( alreadyThere.size() == 1){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this should be implemented within the IntervalTree class as a computeIfPresent method...

alreadyThere.add(object);
tree.put(start, end, alreadyThere);
if( alreadyThere.size() == 1){
Set<T> mutableSet = new HashSet<>(2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm. better find out if you got a mutable one and just add the new element if so...no?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean with an instanceof check? As it's written it should be immutable always if it's exactly size 1 and mutable otherwise. It's a bit awkward but I thought maybe relying on the size was better than relying on the class. I can change that though.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some changes requested...thanks for this.

@lbergelson
Copy link
Member Author

@yfarjoun Could you take a look. The implementation in interval tree is slightly more complicated because it has to deal with arbitrary sentinel values instead of just null, and I added some tests because I realized that the conditions I cared about didn't look well covered.

* @param start interval start position
* @param end interval end position
* @param value value to merge into the tree, must not be equal to the sentinel value
* @param remappingFunction a function that merges the new value with the existing value for the same start and end position
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note that if function returns null the region will be left empty

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* @param end interval end position
* @param value value to merge into the tree, must not be equal to the sentinel value
* @param remappingFunction a function that merges the new value with the existing value for the same start and end position
* @return the updated value that is now stored in the tree or sentinel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now is not clear enough...(does it refer to before or after the application of the method?)
perhaps the updated value that is eventually stored in the tree or the sentinel value if nothing ends up being stored

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with a slight modification:

the updated value that is stored in the tree after the completion of this merge operation, this will be the sentinel value if nothing ended up being stored
``

}
}
tree.merge(start, end,
Collections.singleton(object),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wierd spacing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@@ -56,6 +60,8 @@ public void clear()
mRoot = null;
}


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra nls

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

// Sets of size 1 are immutable SingletonSets so we have to make a new
// mutable one to add to
if (oldValue.size() == 1) {
Set<T> newMutableSet = new HashSet<>();
Copy link
Contributor

@yfarjoun yfarjoun Mar 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can replace entire if statement with:

final Set<T> newMutableSet = oldValue.size() == 1 ? new HashSet<>() : oldValue;
newMutableSet.addAll(oldValue);
newMutableSet.addAll(newValue);
return newMutableSet;

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right, clever, it's a set so you can unconditionally add the same thing over again...

final V alreadyPresent = put(start, end, value);
if (!Objects.equals(alreadyPresent, mSentinel)) {
final V newComputedValue = remappingFunction.apply(value, alreadyPresent);
if(Objects.equals(newComputedValue, mSentinel)){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spaces

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. I agree that it's a bit more complicated, but I think that the result is a little nicer...I hope you agree...

@lbergelson
Copy link
Member Author

@yfarjoun It's an improvement. Thanks for the review.

Copy link
Contributor

@yfarjoun yfarjoun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@lbergelson lbergelson merged commit 9f84b7b into master Mar 4, 2019
@lbergelson lbergelson deleted the lb_interval_size_reduction branch March 4, 2019 16:43
@lbergelson
Copy link
Member Author

@yfarjoun Thanks for the reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants