Direct memory access to accumulators #194

henryiii · 2019-11-06T14:40:24Z

This allows .view() to work on non-simple storages. Closes #175 and closes #21.

Features:

No copy for asarray or view, fixes no-copy views
Smart array returned, allowing .view().value to get the value, for example.
Accessing a single value from the returned array gives an accumulator
Single element access now fixed and tested for unlimited and atomic int histograms. Closes Access single histogram values (unlimited and atomic_int) #195.

Todo:

Remove trailing _
Look into making view be no copy too (existing problem)
Add tests
Investigating setting elements of the array from instances

Not going to be ready by the CHEP talk, but should be ready for a release end of week / next week.

henryiii · 2019-11-06T14:41:00Z

I have at least fully demonstrated that it can be done before the CHEP talk.

HDembinski · 2019-11-11T16:29:34Z

I am actually very interested in this, because I want to use a weighted histogram in my current data analysis :).

henryiii · 2019-11-11T16:30:12Z

It's progressing fairly well, will update the PR fairly soon with work from the plane.

henryiii · 2019-11-11T20:59:38Z

I realized after PyHEP that the accumulators are something special to boost-histogram for Python users; up until now "double" storage is about the limit of what you could do in Python histograms easily. I think this explains why multiple people have run into this being broken - it was the very thing they were excited to have available directly in Python.

@HDembinski, this is ready for review. I'm just adding a few tests. Basic idea: np.asarray(h) simply returns a structured array with an exact, no copy view of the data (it can't return a ndarray subclass, AFAICT). The names exposed here have a naming convention mentioned in the docs, but to repeat it here: if a public property exposes a value, that's the name in the record array, otherwise it uses the private name without the trailing underscore.

.view() is a bit more powerful; we can return ndarray subclasses here, so we add a few useful properties and slightly nicer reprs. So you can even do:

h.view().variance

to get the variance array, even if it is a computed property! If you do not access computed properties, this procedure is still no copy! And, unlike np.recarray, you don't even pay a lookup penalty when using properties and tab-completion works.

Also fixed no-copy views from .view() for all arrays now, not just the fancy ones.

No copy example:

>>> h = bh.histogram((10,0,1), storage=bh.storage.Weight())
>>> v = h.view()
>>> v.variance
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> v.variance[3] = 3 # Only works if variance is not computed (WeightedSum)
>>> h
histogram(
  regular(10, 0, 1),
  storage=weight
) # Sum: weighted_sum(value=0, variance=3)
>>> h.view()
WeightedSumView(
      [(0., 0.), (0., 0.), (0., 0.), (0., 3.), (0., 0.), (0., 0.),
       (0., 0.), (0., 0.), (0., 0.), (0., 0.)],
      dtype=[('value', '<f8'), ('variance', '<f8')])
>>> v[3]
weighted_sum(value=0, variance=3)

I don't know about you, but I think this is really cool. :)

henryiii · 2019-11-11T21:09:25Z

One design question: Should np.asarray(h) return the whole structured array, or just the most reasonable single value array (still no copy, numpy just sets the strides and offsets)? So .value, generally. Since most places a user would actually trigger "asarray" are really when handing a histogram to a function that expects an array, and none of those functions expect a record array, in general, it might make more sense to make something more standard. We already make the assumption that asarray does not include the flow bins. @jpivarski might have an opinion here too (I think he leaned toward structured array).

Recap of reason one might want just the single value:

Nobody should be doing this:

v = np.asarray(h) # v = h.view() is better

but they may do this:

plt.bar(h.axes[0].centers, h, width=h.axes[0].widths);

The latter expects an array of values.

jpivarski · 2019-11-11T21:23:46Z

I was leaning toward record array because I thought the underlying structure was non-negotiably a struct, and this is the appropriate Python translation. If dtype=[('value', '<f8'), ('variance', '<f8')], it's easy enough to np.asarray(h)['value'] or np.asarray(h)['variance'] to get arrays of values.

"Easy enough" in a program, but not necessarily in quick-and-dirty data analysis. For that, you're going to want something to return just the values or just the variances, either by changing the ABI in C++ (which I think is rather drastic) or by providing methods. I don't think you need to worry about users expecting np.asarray(h) to be some simple array of numbers—they know that the histogram has complex structure—I don't think you have to develop defensively for that. I'd say, just provide methods.

The histogram doesn't have to be a single py::buffer_protocol. You could make it a non-buffer object with two methods, values and variances, each of which returns a py::buffer_protocol with the appropriate stride. This way, you wouldn't have to change the ABI or present a record array; you could present two numerical arrays that just happen to interleave in memory. Or you could make the histogram a record array buffer and provide methods that each present a non-record view of values and variances. In fact, these could be implemented on the pybind11 side by casting a histogram as a py::array (no _t) and calling attr("__getitem__")("values") on it.

henryiii · 2019-11-11T21:40:32Z

All histograms provide identical APIs; you would (and can) use h.view()["value"] or h.view().value to get the value array. That's already impemented. The question is, what should the raw histogram act like in Python to something that expects a numpy array? We already set the offsets and strides to give the flow=False view of the histogram, so np.asarray(h) is not a raw view over all the data, but rather tries to return what a user would expect. It's really a shortcut, then. We could even drop the __array__ method and then force .view() for users who want the data.

Choices:

Extend the shortcut to always return the most reasonable value if possible (implemented in an earlier commit). If we introduce a new type later that has no "reasonable" single-value view, then it returns a structured array (backward compatible).
Drop __array__, and force .view(). Passing histograms to anything that expects a numpy array breaks.
Return the entire histogram, flow and all, but otherwise the same as choice 4.
Allow the shortcut to return structured arrays (implemented in the current commit). Then np.asarray(h)["value"] works and np.asarray(h).value doesn't. Keeps the current ability to directly pass histograms to plotting functions and such as long as they are not accumulator histograms.

I'll let @HDembinski pick one. I've ordered them in the order that I rank them. As a user, I think the first one is the one I would expect, and extends the current idea of providing the most likely portion of the data instead of the complete data as the __array__. I don't have a strong opinion, though. Technically, we wrap the pybind object in Python, and now that we have fixed .view(), the Pybind buffer protocol is rather irrelevant, actually. We could implement it in Python with .view() from Pybind object.

henryiii · 2019-11-11T21:46:34Z

PS: In hist, we may have histogram subclasses for each storage, so .values might be a usable shortcut there. But boost-histogram (and Boost.Histogram) have a single Histogram class.

henryiii · 2019-11-11T21:56:16Z

(Added an option and reordered)

Also, Jim, the description of how we could do it is pretty much how I'm currently doing it, other than being slightly further down in the wrapping.

Note for future posterity: Pybind11 is missing an array constructor that takes a buffer info object and a base ownership pointer; this was throwing me off until someone on the pybind11 gitter just suggested using the more verbose constructor that did allow the base pointer, which fixed the problem and allowed the array to be made tied to the reference counting of the histogram and without a copy. We should make a Pybind PR with the missing argument in the constructor.

jpivarski · 2019-11-11T21:59:44Z

Agh—that's what you get for asking me for an opinion about something I haven't been following. You invited me to make some stuff up. What you're doing is probably fine.

henryiii · 2019-11-11T22:49:38Z

A nice argument toward number #1 is that all behavior directly on the histogram is identical - if plt.bar(h.axes[0].centers, h, width=h.axes[0].widths); works, it works on all storages. Asking for .view() gives you the underlying data in storage, so that may not behave identically.

An argument toward #2 is that this is just a shortcut and can be made explicit by using h.view().value (things after .view() are storage dependent, so it will depend on the storage).

I'm neutral here. Hans can pick any of the 4.

How does one type #1 in github without getting an issue link? I thought \#1 worked, but it doesn't seem to.

HDembinski · 2019-11-12T09:58:52Z

Since I need weights support for my analysis, I implemented this in a branch, see https://github.com/scikit-hep/boost-histogram/tree/weights. This is basically a subset of what you are doing here. Maybe you could have a quick look. I will not make a PR from this branch, just wanted to submit it to show you.

I think the fields for the internal boost-histogram accumulators can be all public (as was done in my branch). This simplifies the code a bit.

HDembinski · 2019-11-12T10:06:31Z

For me it is not a question about convenience, but about the expectation what np.asarray does. My expectation is that this thing gives you all the information and it returns a pure numpy array. You are right that we already hide the flow bins by default and that was a tricky design decision. The argument for hiding the flow bins by default was that they are almost an implementation detail. We need them to have proper projections, but people should normally not mess with them.

So from that point of view, I think we made the right decision to hide the flow bins in np.asarray. Nevertheless, when you have an accumulator that is really a struct, that should show up when you do np.asarray.

HDembinski · 2019-11-12T10:11:44Z

So in summary, I would prefer option 4, it seems the least surprising.

henryiii · 2019-11-12T11:11:31Z

I think the fields for the internal boost-histogram accumulators can be all public

That's what was done here - it is required to use the nice Pybind11 tools to make the structured array.

So in summary, I would prefer option 4, it seems the least surprising.

Okay, that is what is currently done here already.

Note this now fixes #195 too, you can now access and set single values from unlimited and atomic storages.

tests/test_public_hist.py

henryiii · 2019-11-12T12:25:13Z

Extra note: Ahh, this would simplify the numpy descriptors, too; it might be a reasonable thing to do to clean up eventually. I didn't rename the internal fields because they would collide with the methods, but you are just binding directly to the field, so the method is not needed. Possible cleanup for a later PR.

boost_histogram/_internal/hist.py

CHANGELOG.md

boost_histogram/_internal/hist.py

include/boost/histogram/python/accumulators/mean.hpp

include/boost/histogram/python/accumulators/weighted_sum.hpp

include/boost/histogram/python/accumulators_ostream.hpp

tests/test_storage.py

henryiii

(I have no idea why this one comment got stuck in a "review", ask GitHub)

henryiii · 2019-11-12T19:26:16Z

The accumulator interface is now (not yet pushed) nearly identical to the view interface, which is great. One more thing: should accumulators support a["value"] as well? Then a[x]["value"] would work regardless of whether x is a purely values or if it contains one or more slices. Also, the order becomes interchangeable, that is, arr["value"][3] == arr[3]["value"], regardless of whether arr came from .view() or np.asarray. Currently, without the getitem, it does not work on the .view() version and does work on the np.asarray version.

henryiii · 2019-11-12T20:01:29Z

Currently, we have positional or keyword arguments in the accumulator constructor:

bh.accumulators.WeightedMean(wsum=6, wsum2=12, value=3, variance=2)

Should we use the property/field names? This makes the repr very long:

bh.accumulators.WeightedMean(sum_of_weights=6, sum_of_weights_squared=12, value=3, variance=2)

But I like the consistency.

henryiii · 2019-11-12T20:41:45Z

The final commit adds accum["value"], and allows IPython completion on the name. This keeps the customized view (which returns accumulators instead of structured array "scalars" on single element access) from breaking code that expects the structured array scalars. I can revert the commit if this is a bad idea.

Ready for second review, @HDembinski. This is looking really nice. :)

include/boost/histogram/python/accumulators/mean.hpp

HDembinski · 2019-11-13T08:55:03Z

But I like the consistency.

I also like consistency more than the short names. "Readability counts". Plus it is not something you need to type a lot.

HDembinski

Almost done!

include/boost/histogram/python/accumulators/mean.hpp

include/boost/histogram/python/accumulators/ostream.hpp

HDembinski · 2019-11-13T09:22:54Z

The accumulator interface is now (not yet pushed) nearly identical to the view interface, which is great. One more thing: should accumulators support a["value"] as well? Then a[x]["value"] would work regardless of whether x is a purely values or if it contains one or more slices. Also, the order becomes interchangeable, that is, arr["value"][3] == arr[3]["value"], regardless of whether arr came from .view() or np.asarray. Currently, without the getitem, it does not work on the .view() version and does work on the np.asarray version.

That sounds excellent, yes.

The final commit adds accum["value"], and allows IPython completion on the name. This keeps the customized view (which returns accumulators instead of structured array "scalars" on single element access) from breaking code that expects the structured array scalars. I can revert the commit if this is a bad idea.

This is very nice!

henryiii · 2019-11-13T12:33:15Z

Not worried about typing it so much (IPython tab completes it for me anyway), but it is long enough that it can go off the screen when displayed (such as in the GitHub code box above) Still better to be consistent.

henryiii · 2019-11-13T12:55:31Z

Should be ready to go! I've removed the namespace, fixed the spelling, and replaced the keywords.

HDembinski · 2019-11-13T13:13:38Z

Then it is ready to merge! Excellent work.

* Allow direct memory access, no longer throws error/segfault * Internalize accumulators * First working version of conversion * Fix several bugs, and change the meaning of np.asarray() slightly * Fix view not returning an editable view * Normalize names * Nicer repr and str for views * Cleanup views with a decorator * Some small polish * Fix #195, unlimited and atomic single access * Adding tests for storages * README update * CHANGELOG and renable a test * Fix unneeded change * Address review comments * Move ostream, remove some extra headers * Fix reprs, add missing tests * Adding accum[key] access * Addressing second review * Rename wsum and wsum2

henryiii changed the title ~~WIP: Direct memory access to accumulators~~ Direct memory access to accumulators Nov 11, 2019

henryiii added 12 commits November 12, 2019 07:01

Allow direct memory access, no longer throws error/segfault

ea2c381

Internalize accumulators

c1227ac

First working version of conversion

3c9b2af

Fix several bugs, and change the meaning of np.asarray() slightly

8dc503c

Fix view not returning an editable view

f1e6c89

Normalize names

3e55a80

Nicer repr and str for views

f1dd6a6

Cleanup views with a decorator

b1f154b

Some small polish

d61118a

Fix #195, unlimited and atomic single access

e198e25

Adding tests for storages

29808b0

README update

eab4378

henryiii force-pushed the henryiii-view branch from 506aadc to eab4378 Compare November 12, 2019 12:02

henryiii marked this pull request as ready for review November 12, 2019 12:03

henryiii commented Nov 12, 2019

View reviewed changes

tests/test_public_hist.py Outdated Show resolved Hide resolved

CHANGELOG and renable a test

09c6ec0

henryiii commented Nov 12, 2019

View reviewed changes

boost_histogram/_internal/hist.py Outdated Show resolved Hide resolved

Fix unneeded change

c88b1a2

HDembinski reviewed Nov 12, 2019

View reviewed changes

This was referenced Nov 12, 2019

Full UHI #208

Open

Set with array #189

Closed

henryiii commented Nov 12, 2019

View reviewed changes

henryiii added 2 commits November 12, 2019 14:01

Address review comments

f0ca2e7

Move ostream, remove some extra headers

7fc7f6b

henryiii added 2 commits November 12, 2019 15:13

Fix reprs, add missing tests

164f66e

Adding accum[key] access

91c033d

henryiii requested a review from HDembinski November 12, 2019 20:42

henryiii commented Nov 12, 2019

View reviewed changes

include/boost/histogram/python/accumulators/mean.hpp Show resolved Hide resolved

HDembinski reviewed Nov 13, 2019

View reviewed changes

henryiii added 2 commits November 13, 2019 07:34

Addressing second review

7f41e94

Rename wsum and wsum2

bcd53cb

henryiii merged commit 9398302 into develop Nov 13, 2019

henryiii deleted the henryiii-view branch November 13, 2019 15:41

henryiii mentioned this pull request Mar 17, 2021

fix: Mean/WeightedMean accumulator addition fix #537

Merged

Direct memory access to accumulators #194

Direct memory access to accumulators #194

Uh oh!

Conversation

henryiii commented Nov 6, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henryiii commented Nov 6, 2019

Uh oh!

HDembinski commented Nov 11, 2019

Uh oh!

henryiii commented Nov 11, 2019

Uh oh!

henryiii commented Nov 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henryiii commented Nov 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jpivarski commented Nov 11, 2019

Uh oh!

henryiii commented Nov 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henryiii commented Nov 11, 2019

Uh oh!

henryiii commented Nov 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jpivarski commented Nov 11, 2019

Uh oh!

henryiii commented Nov 11, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HDembinski commented Nov 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HDembinski commented Nov 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HDembinski commented Nov 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henryiii commented Nov 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

henryiii commented Nov 12, 2019

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

henryiii left a comment

Choose a reason for hiding this comment

Uh oh!

henryiii commented Nov 12, 2019

Uh oh!

henryiii commented Nov 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

henryiii commented Nov 12, 2019

Uh oh!

Uh oh!

HDembinski commented Nov 13, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

HDembinski left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

henryiii commented Nov 6, 2019 •

edited

Loading

henryiii commented Nov 11, 2019 •

edited

Loading

henryiii commented Nov 11, 2019 •

edited

Loading

henryiii commented Nov 11, 2019 •

edited

Loading

henryiii commented Nov 11, 2019 •

edited

Loading

henryiii commented Nov 11, 2019 •

edited

Loading

HDembinski commented Nov 12, 2019 •

edited

Loading

HDembinski commented Nov 12, 2019 •

edited

Loading

HDembinski commented Nov 12, 2019 •

edited

Loading

henryiii commented Nov 12, 2019 •

edited

Loading

henryiii commented Nov 12, 2019 •

edited

Loading

HDembinski commented Nov 13, 2019 •

edited

Loading