Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When file indices are provided, only read models for the specified im… #210

Merged
merged 11 commits into from
Nov 18, 2020

Conversation

dwpaley
Copy link

@dwpaley dwpaley commented Aug 13, 2020

We were loading models (beam, detector, gonio, scan) for every image in a multiimage file, even when only 1 image was requested. This makes a big difference: FormatMultiImage.get_imageset now takes 0.1s instead of 0.8s when making a 1-image "set" from a 600-image .h5 file.

Leaving this as a draft while waiting to see tests.

@dwpaley dwpaley marked this pull request as ready for review August 13, 2020 20:31
@graeme-winter
Copy link
Collaborator

@dwpaley I suspect a test which exercised this would make it easier for reviewers to appreciate the differences here. single_file_indices is not something I for one am familiar with...

@dwpaley
Copy link
Author

dwpaley commented Sep 23, 2020

I think I undersold the performance impact of this change with the 0.1s/0.8s stat. When loading serial data, a new imageset is created for every shot, so that number is multiplied by the number of images.

I can give an example but unfortunately not using data in one of the regression data repos. I tested on dials.stills_process results from a 615-image .h5 data set collected on the MPCCD at SACLA. The run had 12 indexed images. I called $ dials.image_viewer *_integrated.* and timed until the viewer window opened:

Current master: 27.5, 23.9, 23.9 s
This branch: 6.7, 8.6, 6.0 s

I hope this helps illustrate why I consider this important. Yes it can also be fixed by supplying load_models=False, but this makes the default parameters usable.

@graeme-winter
Copy link
Collaborator

I think I undersold the performance impact of this change with the 0.1s/0.8s stat. When loading serial data, a new imageset is created for every shot, so that number is multiplied by the number of images.

This was what the "load before heat death" patch set was for in #118 - every time you load an image you make an imageset of N images so you end up with a lovely N^2 process - when N ~ 20,000 this bites badly.

This is a design failure in imageset creation 😕

@dwpaley
Copy link
Author

dwpaley commented Oct 1, 2020

@graeme-winter After discussing with @phyy-nx and @nksauter I refactored this so that the loop is over single_file_indices instead of over range(num_images). Thus no need to test if i is in single_file_indices. The arrays beam, detector etc still need to be the same length as num_images, so we fill them with None and then directly use single_file_indices to index into them.

format/FormatMultiImage.py Outdated Show resolved Hide resolved
format/FormatMultiImage.py Outdated Show resolved Hide resolved
tests/test_imageset.py Outdated Show resolved Hide resolved
format/FormatMultiImage.py Outdated Show resolved Hide resolved
tests/test_imageset.py Outdated Show resolved Hide resolved
tests/test_imageset.py Outdated Show resolved Hide resolved
tests/test_imageset.py Outdated Show resolved Hide resolved
format/FormatMultiImage.py Outdated Show resolved Hide resolved
format/FormatMultiImage.py Show resolved Hide resolved
tests/test_imageset.py Outdated Show resolved Hide resolved
- Test get_imageset with single_file_indices not specified
- General refactoring
FormatMultiImage.get_imageset assumes that single_file_indices=[] means
we're not loading any images. However when the empty list is passed to ImageSet,
it is treated the same as None, i.e. load all images in the container. To avoid
inconsistency, this just crashes if passed an empty list.
format/FormatMultiImage.py Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Oct 13, 2020

Codecov Report

Merging #210 into master will decrease coverage by 0.20%.
The diff coverage is 74.19%.

@@            Coverage Diff             @@
##           master     #210      +/-   ##
==========================================
- Coverage   45.38%   45.17%   -0.21%     
==========================================
  Files         228      228              
  Lines       19197    19253      +56     
  Branches     2721     2727       +6     
==========================================
- Hits         8712     8697      -15     
- Misses       9971    10042      +71     
  Partials      514      514              

@dwpaley dwpaley merged commit cb6e61b into master Nov 18, 2020
@dwpaley dwpaley deleted the multiimage_use_indices branch November 18, 2020 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants