Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Spectrum as a read-only class obtainable from an imageset #201

Merged
merged 11 commits into from
Jul 29, 2020

Conversation

phyy-nx
Copy link

@phyy-nx phyy-nx commented Jul 27, 2020

Details:

  • Add Spectrum C++ object that implements methods get_energies (eV), get_weights (unitless), and get_weighted_wavelenth (angstroms). Spectrum can only be set from constructor.
  • Add get_spectrum method to Format and FormatMultiImage
  • Add get_spectrum method to ImageSet. Does not cache it like it does for detector, beam, etc.
  • Implement get_spectrum in NeXus. Specifically implements Per shot wavelength, second try nexusformat/definitions#717, using optional variants to specify a calibrated wavelength and a spectrum.

Details:
- Add Spectrum C++ object that implements methods get_energies (eV), get_weights (unitless), and get_weighted_wavelenth (angstroms). Spectrum can only be set from constructor.
- Add get_spectrum method to Format and FormatMultiImage
- Add get_spectrum method to ImageSet. Does not cache it like it does for detector, beam, etc.
- Implement get_spectrum in NeXus.  Specifically implements nexusformat/definitions#717, using optional variants to specify a calibrated wavelength and a spectrum.
@phyy-nx
Copy link
Author

phyy-nx commented Jul 27, 2020

This is a draft PR that re-implements #158. Note that @dagewa's comment that suggests adding experiment.spectrum doesn't work as then the spectrum would have to be serialized, which defeats the whole point of this new version.

TODO:

  • Add test
  • Resolve load_model race condition. A single function reads both the beam and spectrum models and caches both, but only returns the beam model. The API for that function needs to change to return both models. Is that an ok breaking change? Would only affect FormatNexus.py and derived classes.
  • Resolve flake8 flag: model/__init__.py:28:1: F401 'dxtbx_model_ext.Spectrum' imported but unused. Do I need to put Spectrum in __all__?

@phyy-nx
Copy link
Author

phyy-nx commented Jul 27, 2020

Example of use on a JF16M dataset with per-shot spectrum and a calibrated per-shot wavelength that differs from the weighted average of the specta:

from dxtbx.model.experiment_list import ExperimentListFactory
expt = ExperimentListFactory.from_filenames(['run_000795.JF07T32V01_master.h5'])[0]
print(expt.beam)
print(expt.imageset.get_spectrum(0))

Output

Beam:
    wavelength: 1.30886
    sample to source direction : {0,0,1}
    divergence: 0
    sigma divergence: 0
    polarization normal: {0,1,0}
    polarization fraction: 0.999

Spectrum:
    weighted wavelength: 1.30717

@phyy-nx
Copy link
Author

phyy-nx commented Jul 27, 2020

This list is alphabetically sorted

Fixed

@phyy-nx phyy-nx marked this pull request as ready for review July 27, 2020 22:15
@phyy-nx
Copy link
Author

phyy-nx commented Jul 27, 2020

Ok, ready for comments, including re: the race condition.

@graeme-winter
Copy link
Collaborator

Comments ☝️ are not a review per se more a set of observations as I looked through the code, not tested anything.

Two questions:

  • does this address the polymorphism question? i.e. if I want to have a spectrum which corresponds to a 1% dmm bandpass from an undulator can I have a different model?
  • race condition - propose you add a reader method _read_beam_and_spectrum or similar which reads both, caches and returns both, so get_beam() would just return the former, get_spectrum() the latter from this cache -> no external API change

@graeme-winter
Copy link
Collaborator

BTW, FWIW I am much less worried about this PR than the previous one, since this does not include serialization so if we need to modify things in the future that remains a possibility. The previous iteration - once merged - would have set in stone our model of multiwavelength beam data. With this iteration any subsequent changes need only be fixed up here, in cctbx and in dials to go through.

The polymorphism question is real though, as presumably at some stage you'll want to use these new data? When doing so e.g. in prediction code (if in dials) considering extension to (i) rotation data and (ii) Gaussian pink beam would be welcome.

* @param energies The spectrum energies (eV)
* @param energies The spectrum weights (unitless)
*/
Spectrum(vecd energies, vecd weights)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strikes me while defining this as a type that adding convenience methods to give minimum / maximum energy for prediction purposes could be helpful i.e. where the weights are measurably above 0 (I appreciate that there is some judgement here)

Motivation: for purposes of computing predictions means we have a window to predict through, e.g. if the spectrum in question is very narrow, rather than min(energies) -> max(energies) which could be a lot wider.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I'd need an algorithm laid out before I implemented something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, algorithm -

  • compute CDF from spectrum
  • identify the bin below the 1% point
  • identify the bin above the 99% point

There you go - a pretty plausible min, max window with nothing more than floating adds and comparisons (i.e. super cheap in C++)

If we add a Gaussian constructor can use the same logic to define 3σ window or something, but gives a consistent and reasonable definition.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so maybe this is not so simple - I have just grabbed a random SwissFEL Jungfrau data set I had lying around and it looks like the spectrum is not background subtracted i.e.:

146-00001

Without this will be kinda game over for my suggestions, but I'll keep looking for a bit. Am trying to get to this now.

model/spectrum.h Outdated
return weights_;
}

double get_weighted_wavelength() const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we also had get_weighted_wavelength_variance then one could almost model a Gaussian beam with this 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also please can has get_weighted_energy_eV and it's variance equivalent as part of the API - we may not use this but we should probably have it.

This could be a simple refactor as keeping most of the code in this current method as get_weighted_energy_eV() and then converting the result to a wavelength before returning, for consistency.

model/spectrum.h Outdated
summed_weights += weights_[i];
}
DXTBX_ASSERT(weighted_sum > 0 && summed_weights > 0);
return 12398.4187 / (weighted_sum / summed_weights); // eV per Å conversion factor
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this magic number worth defining somewhere central as a constant value? (see e.g. deg_as_rad etc.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've used this factor hardcoded all over the place. Would love it as a constant somewhere. Maybe a separate issue?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess what? scitbx::constants::factor_ev_angstrom 🙂

Though this evaluates to a slightly different constant value to your one -

    //! Factor for keV <-> Angstrom conversion.
    /*!
      http://physics.nist.gov/PhysRefData/codata86/table2.html

      h = Plank's Constant = 6.6260755e-34 J s
      c = speed of light = 2.99792458e+8 m/s
      1 keV = 1.e+3 * 1.60217733e-19 J
      1 A = Angstrom = 1.e-10 m

      E = (h * c) / lamda;

      Exponents: (-34 + 8) - (3 - 19 - 10) = 0
     */
    static const double
    factor_kev_angstrom = 6.6260755 * 2.99792458 / 1.60217733;
    static const double
    factor_ev_angstrom  = 6626.0755 * 2.99792458 / 1.60217733;

=>

>>> 6626.0755 * 2.99792458 / 1.60217733
12398.424468024265

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scitbx is out of date. The Planck constant has changed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question was asked whether we can expose those numbers in Python. They are readily available (and maintained) in

>>> import scipy.constants
>>> scipy.constants.Planck
6.62607015e-34
>>> scipy.constants.elementary_charge
1.602176634e-19

amongst many others.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the case that scipy is universally a dependency of dxtbx? If so, works for me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, but we can just add it. There is precedence in other packages.

model/spectrum.h Outdated Show resolved Hide resolved
/** A class to represent a 2D spectrum. */
class Spectrum {
public:
/** Default constructor: initialise all to zero */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're talking constructors, one which takes a minimum and maximum energy window, a central energy and a width to compute a Gaussian representation (see comments elsewhere) would probably be enough to stop me nagging on the topic of Polymorphism 😉

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation of Spectrum uses energy channels, not bandpass. I'm interested in this idea, but I think I need more detail and perhaps a more complete API spec. It sounds like an expansion of the Spectrum class, not necessarily a polymorph, which could be done with a new constructor as you say. Could that be separate work from this PR?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More detail and API spec I can do; and as slight expansion of this class is doable => indeed no need for polymorphism. I would like to give it a chance to get the changes in here as I think in the long game it will make it more useful in one go. Min, max energy window at the very least seems like something useful to have as part of this PR, since this is what we will be using to decide what to predict on the images.

BTW, re: min / max - was thinking the narrowest interval which contains say 98% of the total flux so it is well defined for measured and Gaussian input spectra.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, proposed change pushed in 3269bf5 with test

double get_weighted_wavelength() const {
if (energies_.size() == 0)
return 0;
double weighted_sum = 0;
Copy link
Collaborator

@graeme-winter graeme-winter Jul 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a static data table, worth computing these in the constructor and saving the values? That way if (just sayin') someone were to pass in nominal values into some theoretical constructor then they could be recovered without having to create a data array to recompute the values from. Would also mean calling get_weighted_wavelength could be treated compute wise as get_wavelength() without excessive compute overhead.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in our use case we aren't even using get_weighted_wavelength at all, it's just a convenience function. We are using the channels and energies directly. I would rather not compute the weighted sums if they won't be used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gonna keep this discussion open, I think there is a lot of merit in computing some metrics in the constructor which would be real useful down the line when it comes to treating the data in other ways (e.g. pink beam)

@dagewa
Copy link
Member

dagewa commented Jul 28, 2020

My comment might not have been clear as it was not a proposal to serialise the spectrum as experiment.spectrum, but simply to provide an interface to the spectrum separate from experiment.beam. The actual spectrum data would be stored with the raw data by analogy with experiment.imageset

@graeme-winter
Copy link
Collaborator

My comment might not have been clear as it was not a proposal to serialise the spectrum as experiment.spectrum, but simply to provide an interface to the spectrum separate from experiment.beam. The actual spectrum data would be stored with the raw data by analogy with experiment.imageset

I think that is not incompatible with what is happening here - maybe it is just a slightly different perspective. experiment.spectrum could easily be added as a property which is delegated to imageset.get_spectrum() or whatever the api is we select.

@graeme-winter
Copy link
Collaborator

@phyy-nx I've made a lot of comments on here, I acknowledge. I do however feel like this is potentially a lot closer to something which covers the bases. I'd be very happy to have a chat about these topics / comments etc. to understand your feelings on the suggestions.

model/spectrum.h Outdated Show resolved Hide resolved
@nksauter
Copy link

nksauter commented Jul 28, 2020 via email

@nksauter
Copy link

nksauter commented Jul 28, 2020 via email

- API changes: get_energies->get_energies_eV, add get_weighted_energy_eV
- Attempt to eliminate race condition by adding read_models method.
- Tidy
- Test more wavelengths in test_spectrum.py
- Remove unused headers
@phyy-nx
Copy link
Author

phyy-nx commented Jul 28, 2020

Aaron, My mistake, I guess the "manage_new_object" bit takes care of the memory leak? Nick

Right. This code came from beam.cc. To be sure, I ran a quick stress test just now and saw no leak.

Done w/ code review for now. @graeme-winter can you review my responses? I've resolved conversations where I think I addressed your feedback, feel free to unresolve them and respond if needed.

@Anthchirp
Copy link
Member

The get_...() functions did catch my eye. I'm aware that their use is rampant in dxtbx, however I thought it'd be worth pointing out that this naming is entirely unnecessary in python and a

def get_thing(self):

function should really be a

@property
def thing(self):

property.

Naturally, the introduction of get_spectrum in eg. Format.py is consistent with the environment, so I'm not criticizing that. Then the only remaining place where get_* names are introduced is the C++ code, but I don't know how you correctly declare properties in C++ land, which makes this... more of a comment than a recommendation really.

@graeme-winter
Copy link
Collaborator

@phyy-nx your 5pm deadline for me to implement all proposed changes is proving impossible because I have a workshop tomorrow which has written off ~ 5 working hours. Am working on it but I would be very disappointed if me missing the deadline means this gets merged as is.

Computes: emin_ev = 1% on CDF and emax_ev = 99% on CDF so 98% or more of the
spectrum is within the range. Added a test for this, which has a computed
approximation to a real SwissFEL spectrum (so could be useful in other tests)

Also ran clang-format on the C++ files I touched
I'll wager a chocolate bar or so that this will make zero measurable
difference to the run time performance. Doing this as a prelude to computing
variance
Recover the weighted variance around the weighted mean. Add test which makes
a clean Gaussian and recovers the same parameters.
@graeme-winter
Copy link
Collaborator

OK, sitrep - am not there yet but I have enough of the API in here that I wanted that all I am really short of is a different constructor, which I can add later on (i.e. construct with known Gaussian-like beam parameters and be able to recover these rather than deriving them as in the example here)

In my modified version it would not probably have a table for the full spectrum but that would not matter, since it would return cached values anyways.

I have made some changes which pre-compute a couple of values in the constructors however these calculations are:

  • straightforward arithmetic
  • streaming through the arrays cleanly

So I am very confident that in real life you will never see the changes. The search for the bandwidth region could be cached from constructor rather than at first run but I am not sure this is worthwhile.

I have made some assumptions in these calculations that the spectrum is background subtracted. I don't know if that is a good assumption to make.

Anyhow, I think the API this has right now is what I was hoping for, and I can see it being useful.

Cleaning up the correct physical constants would be nice.

@graeme-winter
Copy link
Collaborator

Would welcome a wider code review now to make sure we all have got our sums right and people are happy with the implied API changes in here.

@graeme-winter
Copy link
Collaborator

@phyy-nx ev -> eV 👍

@phyy-nx
Copy link
Author

phyy-nx commented Jul 29, 2020

Changes all look good to me. I'd like to fix the constants but it's not a stopper. They need to be fixed all over. Seems like a good candidate for a separate issue.

Since we are under time pressure here I'm going to merge this this morning. Thanks for the changes.

@graeme-winter
Copy link
Collaborator

OK, since these have all gone through really quick, I consider this to be an alpha API which can and will change - I don't want to see it baked in in places which can never be fixed or revised - @phyy-nx please could you agree to this?

@phyy-nx
Copy link
Author

phyy-nx commented Jul 29, 2020

Yup. Alpha API. Subject to change as needed.

@phyy-nx phyy-nx merged commit bf5c4ae into master Jul 29, 2020
@phyy-nx phyy-nx deleted the spectrum_beam3 branch July 29, 2020 16:54
@ndevenish
Copy link
Collaborator

We have a mechanism now in the dials bootstrap to explicitly specify branches to use for individual repositories. That sounds like it’d be worth you adding to the cctbx bootstrap so changes like this don’t need to be rushed through without proper consideration or discussion.

@phyy-nx
Copy link
Author

phyy-nx commented Jul 29, 2020

We have a mechanism now in the dials bootstrap to explicitly specify branches to use for individual repositories. That sounds like it’d be worth you adding to the cctbx bootstrap so changes like this don’t need to be rushed through without proper consideration or discussion.

While I appreciate the need for stability in the code and having good review cycles, we need to be able to work on master, have bootstrap point to master, and move things through quickly on occasion. Just the way things go sometimes. Hopefully not often though.

@ndevenish
Copy link
Collaborator

I don't understand this response. Branches are a first-class feature of git. Working this way is a deliberate choice, not a technical limitation, as is forcing this through on one of the busiest weeks we've had in over six months.

Even ignoring the ability to just check out a different branch (as far as I know all your deployments use git still), the dials bootstrap I mentioned would work like this, for instance:

python3 bootstrap.py --branch dxtbx@spectrum_beam3

Just the way things go sometimes. Hopefully not often though.

Unfortunately I think this came across more patronising than you intended, but personally I feel that this mode of operation classes as "frequent" rather than "not often".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants