Skip to content

Conversation

@henryiii
Copy link
Member

@henryiii henryiii commented Nov 6, 2019

This allows .view() to work on non-simple storages. Closes #175 and closes #21.

Features:

  • No copy for asarray or view, fixes no-copy views
  • Smart array returned, allowing .view().value to get the value, for example.
  • Accessing a single value from the returned array gives an accumulator
  • Single element access now fixed and tested for unlimited and atomic int histograms. Closes Access single histogram values (unlimited and atomic_int) #195.

Todo:

  • Remove trailing _
  • Look into making view be no copy too (existing problem)
  • Add tests
  • Investigating setting elements of the array from instances

Not going to be ready by the CHEP talk, but should be ready for a release end of week / next week.

@henryiii
Copy link
Member Author

henryiii commented Nov 6, 2019

I have at least fully demonstrated that it can be done before the CHEP talk.

@HDembinski
Copy link
Member

I am actually very interested in this, because I want to use a weighted histogram in my current data analysis :).

@henryiii
Copy link
Member Author

It's progressing fairly well, will update the PR fairly soon with work from the plane.

@henryiii
Copy link
Member Author

henryiii commented Nov 11, 2019

I realized after PyHEP that the accumulators are something special to boost-histogram for Python users; up until now "double" storage is about the limit of what you could do in Python histograms easily. I think this explains why multiple people have run into this being broken - it was the very thing they were excited to have available directly in Python.

@HDembinski, this is ready for review. I'm just adding a few tests. Basic idea: np.asarray(h) simply returns a structured array with an exact, no copy view of the data (it can't return a ndarray subclass, AFAICT). The names exposed here have a naming convention mentioned in the docs, but to repeat it here: if a public property exposes a value, that's the name in the record array, otherwise it uses the private name without the trailing underscore.

.view() is a bit more powerful; we can return ndarray subclasses here, so we add a few useful properties and slightly nicer reprs. So you can even do:

h.view().variance

to get the variance array, even if it is a computed property! If you do not access computed properties, this procedure is still no copy! And, unlike np.recarray, you don't even pay a lookup penalty when using properties and tab-completion works.

  • Also fixed no-copy views from .view() for all arrays now, not just the fancy ones.

No copy example:

>>> h = bh.histogram((10,0,1), storage=bh.storage.Weight())
>>> v = h.view()
>>> v.variance
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> v.variance[3] = 3 # Only works if variance is not computed (WeightedSum)
>>> h
histogram(
  regular(10, 0, 1),
  storage=weight
) # Sum: weighted_sum(value=0, variance=3)
>>> h.view()
WeightedSumView(
      [(0., 0.), (0., 0.), (0., 0.), (0., 3.), (0., 0.), (0., 0.),
       (0., 0.), (0., 0.), (0., 0.), (0., 0.)],
      dtype=[('value', '<f8'), ('variance', '<f8')])
>>> v[3]
weighted_sum(value=0, variance=3)

I don't know about you, but I think this is really cool. :)

@henryiii
Copy link
Member Author

henryiii commented Nov 11, 2019

One design question: Should np.asarray(h) return the whole structured array, or just the most reasonable single value array (still no copy, numpy just sets the strides and offsets)? So .value, generally. Since most places a user would actually trigger "asarray" are really when handing a histogram to a function that expects an array, and none of those functions expect a record array, in general, it might make more sense to make something more standard. We already make the assumption that asarray does not include the flow bins. @jpivarski might have an opinion here too (I think he leaned toward structured array).

Recap of reason one might want just the single value:

Nobody should be doing this:

v = np.asarray(h) # v = h.view() is better

but they may do this:

plt.bar(h.axes[0].centers, h, width=h.axes[0].widths);

The latter expects an array of values.

@henryiii henryiii changed the title WIP: Direct memory access to accumulators Direct memory access to accumulators Nov 11, 2019
@jpivarski
Copy link
Member

I was leaning toward record array because I thought the underlying structure was non-negotiably a struct, and this is the appropriate Python translation. If dtype=[('value', '<f8'), ('variance', '<f8')], it's easy enough to np.asarray(h)['value'] or np.asarray(h)['variance'] to get arrays of values.

"Easy enough" in a program, but not necessarily in quick-and-dirty data analysis. For that, you're going to want something to return just the values or just the variances, either by changing the ABI in C++ (which I think is rather drastic) or by providing methods. I don't think you need to worry about users expecting np.asarray(h) to be some simple array of numbers—they know that the histogram has complex structure—I don't think you have to develop defensively for that. I'd say, just provide methods.

The histogram doesn't have to be a single py::buffer_protocol. You could make it a non-buffer object with two methods, values and variances, each of which returns a py::buffer_protocol with the appropriate stride. This way, you wouldn't have to change the ABI or present a record array; you could present two numerical arrays that just happen to interleave in memory. Or you could make the histogram a record array buffer and provide methods that each present a non-record view of values and variances. In fact, these could be implemented on the pybind11 side by casting a histogram as a py::array (no _t) and calling attr("__getitem__")("values") on it.

@henryiii
Copy link
Member Author

henryiii commented Nov 11, 2019

All histograms provide identical APIs; you would (and can) use h.view()["value"] or h.view().value to get the value array. That's already impemented. The question is, what should the raw histogram act like in Python to something that expects a numpy array? We already set the offsets and strides to give the flow=False view of the histogram, so np.asarray(h) is not a raw view over all the data, but rather tries to return what a user would expect. It's really a shortcut, then. We could even drop the __array__ method and then force .view() for users who want the data.

Choices:

  1. Extend the shortcut to always return the most reasonable value if possible (implemented in an earlier commit). If we introduce a new type later that has no "reasonable" single-value view, then it returns a structured array (backward compatible).
  2. Drop __array__, and force .view(). Passing histograms to anything that expects a numpy array breaks.
  3. Return the entire histogram, flow and all, but otherwise the same as choice 4.
  4. Allow the shortcut to return structured arrays (implemented in the current commit). Then np.asarray(h)["value"] works and np.asarray(h).value doesn't. Keeps the current ability to directly pass histograms to plotting functions and such as long as they are not accumulator histograms.

I'll let @HDembinski pick one. I've ordered them in the order that I rank them. As a user, I think the first one is the one I would expect, and extends the current idea of providing the most likely portion of the data instead of the complete data as the __array__. I don't have a strong opinion, though. Technically, we wrap the pybind object in Python, and now that we have fixed .view(), the Pybind buffer protocol is rather irrelevant, actually. We could implement it in Python with .view() from Pybind object.

@henryiii
Copy link
Member Author

PS: In hist, we may have histogram subclasses for each storage, so .values might be a usable shortcut there. But boost-histogram (and Boost.Histogram) have a single Histogram class.

@henryiii
Copy link
Member Author

henryiii commented Nov 11, 2019

(Added an option and reordered)

Also, Jim, the description of how we could do it is pretty much how I'm currently doing it, other than being slightly further down in the wrapping.

Note for future posterity: Pybind11 is missing an array constructor that takes a buffer info object and a base ownership pointer; this was throwing me off until someone on the pybind11 gitter just suggested using the more verbose constructor that did allow the base pointer, which fixed the problem and allowed the array to be made tied to the reference counting of the histogram and without a copy. We should make a Pybind PR with the missing argument in the constructor.

@jpivarski
Copy link
Member

Agh—that's what you get for asking me for an opinion about something I haven't been following. You invited me to make some stuff up. What you're doing is probably fine.

@henryiii
Copy link
Member Author

henryiii commented Nov 11, 2019

A nice argument toward number #1 is that all behavior directly on the histogram is identical - if plt.bar(h.axes[0].centers, h, width=h.axes[0].widths); works, it works on all storages. Asking for .view() gives you the underlying data in storage, so that may not behave identically.

An argument toward #2 is that this is just a shortcut and can be made explicit by using h.view().value (things after .view() are storage dependent, so it will depend on the storage).

I'm neutral here. Hans can pick any of the 4.

How does one type #1 in github without getting an issue link? I thought \#1 worked, but it doesn't seem to.

@HDembinski
Copy link
Member

HDembinski commented Nov 12, 2019

Since I need weights support for my analysis, I implemented this in a branch, see https://github.com/scikit-hep/boost-histogram/tree/weights. This is basically a subset of what you are doing here. Maybe you could have a quick look. I will not make a PR from this branch, just wanted to submit it to show you.

I think the fields for the internal boost-histogram accumulators can be all public (as was done in my branch). This simplifies the code a bit.

@HDembinski
Copy link
Member

HDembinski commented Nov 12, 2019

For me it is not a question about convenience, but about the expectation what np.asarray does. My expectation is that this thing gives you all the information and it returns a pure numpy array. You are right that we already hide the flow bins by default and that was a tricky design decision. The argument for hiding the flow bins by default was that they are almost an implementation detail. We need them to have proper projections, but people should normally not mess with them.

So from that point of view, I think we made the right decision to hide the flow bins in np.asarray. Nevertheless, when you have an accumulator that is really a struct, that should show up when you do np.asarray.

@HDembinski
Copy link
Member

HDembinski commented Nov 12, 2019

So in summary, I would prefer option 4, it seems the least surprising.

@henryiii
Copy link
Member Author

henryiii commented Nov 12, 2019

I think the fields for the internal boost-histogram accumulators can be all public

That's what was done here - it is required to use the nice Pybind11 tools to make the structured array.

So in summary, I would prefer option 4, it seems the least surprising.

Okay, that is what is currently done here already.

Note this now fixes #195 too, you can now access and set single values from unlimited and atomic storages.

@henryiii henryiii marked this pull request as ready for review November 12, 2019 12:03
@henryiii
Copy link
Member Author

Extra note: Ahh, this would simplify the numpy descriptors, too; it might be a reasonable thing to do to clean up eventually. I didn't rename the internal fields because they would collide with the methods, but you are just binding directly to the field, so the method is not needed. Possible cleanup for a later PR.

This was referenced Nov 12, 2019
Copy link
Member Author

@henryiii henryiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I have no idea why this one comment got stuck in a "review", ask GitHub)

@henryiii
Copy link
Member Author

The accumulator interface is now (not yet pushed) nearly identical to the view interface, which is great. One more thing: should accumulators support a["value"] as well? Then a[x]["value"] would work regardless of whether x is a purely values or if it contains one or more slices. Also, the order becomes interchangeable, that is, arr["value"][3] == arr[3]["value"], regardless of whether arr came from .view() or np.asarray. Currently, without the getitem, it does not work on the .view() version and does work on the np.asarray version.

@henryiii
Copy link
Member Author

henryiii commented Nov 12, 2019

Currently, we have positional or keyword arguments in the accumulator constructor:

bh.accumulators.WeightedMean(wsum=6, wsum2=12, value=3, variance=2)

Should we use the property/field names? This makes the repr very long:

bh.accumulators.WeightedMean(sum_of_weights=6, sum_of_weights_squared=12, value=3, variance=2)

But I like the consistency.

@henryiii
Copy link
Member Author

The final commit adds accum["value"], and allows IPython completion on the name. This keeps the customized view (which returns accumulators instead of structured array "scalars" on single element access) from breaking code that expects the structured array scalars. I can revert the commit if this is a bad idea.

Ready for second review, @HDembinski. This is looking really nice. :)

@henryiii henryiii requested a review from HDembinski November 12, 2019 20:42
@HDembinski
Copy link
Member

HDembinski commented Nov 13, 2019

But I like the consistency.

I also like consistency more than the short names. "Readability counts". Plus it is not something you need to type a lot.

Copy link
Member

@HDembinski HDembinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done!

@HDembinski
Copy link
Member

The accumulator interface is now (not yet pushed) nearly identical to the view interface, which is great. One more thing: should accumulators support a["value"] as well? Then a[x]["value"] would work regardless of whether x is a purely values or if it contains one or more slices. Also, the order becomes interchangeable, that is, arr["value"][3] == arr[3]["value"], regardless of whether arr came from .view() or np.asarray. Currently, without the getitem, it does not work on the .view() version and does work on the np.asarray version.

That sounds excellent, yes.

The final commit adds accum["value"], and allows IPython completion on the name. This keeps the customized view (which returns accumulators instead of structured array "scalars" on single element access) from breaking code that expects the structured array scalars. I can revert the commit if this is a bad idea.

This is very nice!

@henryiii
Copy link
Member Author

Not worried about typing it so much (IPython tab completes it for me anyway), but it is long enough that it can go off the screen when displayed (such as in the GitHub code box above) Still better to be consistent.

@henryiii
Copy link
Member Author

Should be ready to go! I've removed the namespace, fixed the spelling, and replaced the keywords.

@HDembinski
Copy link
Member

Then it is ready to merge! Excellent work.

@henryiii henryiii merged commit 9398302 into develop Nov 13, 2019
@henryiii henryiii deleted the henryiii-view branch November 13, 2019 15:41
HDembinski pushed a commit that referenced this pull request Dec 1, 2019
* Allow direct memory access, no longer throws error/segfault

* Internalize accumulators

* First working version of conversion

* Fix several bugs, and change the meaning of np.asarray() slightly

* Fix view not returning an editable view

* Normalize names

* Nicer repr and str for views

* Cleanup views with a decorator

* Some small polish

* Fix #195, unlimited and atomic single access

* Adding tests for storages

* README update

* CHANGELOG and renable a test

* Fix unneeded change

* Address review comments

* Move ostream, remove some extra headers

* Fix reprs, add missing tests

* Adding accum[key] access

* Addressing second review

* Rename wsum and wsum2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Access single histogram values (unlimited and atomic_int) Retrieving results of non-trivial storages h.view() for all storage types

4 participants