-
Notifications
You must be signed in to change notification settings - Fork 25
Direct memory access to accumulators #194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I have at least fully demonstrated that it can be done before the CHEP talk. |
|
I am actually very interested in this, because I want to use a weighted histogram in my current data analysis :). |
|
It's progressing fairly well, will update the PR fairly soon with work from the plane. |
|
I realized after PyHEP that the accumulators are something special to boost-histogram for Python users; up until now "double" storage is about the limit of what you could do in Python histograms easily. I think this explains why multiple people have run into this being broken - it was the very thing they were excited to have available directly in Python. @HDembinski, this is ready for review. I'm just adding a few tests. Basic idea:
h.view().varianceto get the variance array, even if it is a computed property! If you do not access computed properties, this procedure is still no copy! And, unlike
No copy example: >>> h = bh.histogram((10,0,1), storage=bh.storage.Weight())
>>> v = h.view()
>>> v.variance
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
>>> v.variance[3] = 3 # Only works if variance is not computed (WeightedSum)
>>> h
histogram(
regular(10, 0, 1),
storage=weight
) # Sum: weighted_sum(value=0, variance=3)
>>> h.view()
WeightedSumView(
[(0., 0.), (0., 0.), (0., 0.), (0., 3.), (0., 0.), (0., 0.),
(0., 0.), (0., 0.), (0., 0.), (0., 0.)],
dtype=[('value', '<f8'), ('variance', '<f8')])
>>> v[3]
weighted_sum(value=0, variance=3)I don't know about you, but I think this is really cool. :) |
|
One design question: Should Recap of reason one might want just the single value: Nobody should be doing this: v = np.asarray(h) # v = h.view() is betterbut they may do this: plt.bar(h.axes[0].centers, h, width=h.axes[0].widths);The latter expects an array of values. |
|
I was leaning toward record array because I thought the underlying structure was non-negotiably a "Easy enough" in a program, but not necessarily in quick-and-dirty data analysis. For that, you're going to want something to return just the values or just the variances, either by changing the ABI in C++ (which I think is rather drastic) or by providing methods. I don't think you need to worry about users expecting The histogram doesn't have to be a single |
|
All histograms provide identical APIs; you would (and can) use Choices:
I'll let @HDembinski pick one. I've ordered them in the order that I rank them. As a user, I think the first one is the one I would expect, and extends the current idea of providing the most likely portion of the data instead of the complete data as the |
|
PS: In hist, we may have histogram subclasses for each storage, so |
|
(Added an option and reordered) Also, Jim, the description of how we could do it is pretty much how I'm currently doing it, other than being slightly further down in the wrapping. Note for future posterity: Pybind11 is missing an |
|
Agh—that's what you get for asking me for an opinion about something I haven't been following. You invited me to make some stuff up. What you're doing is probably fine. |
|
A nice argument toward number An argument toward I'm neutral here. Hans can pick any of the 4. How does one type |
|
Since I need weights support for my analysis, I implemented this in a branch, see https://github.com/scikit-hep/boost-histogram/tree/weights. This is basically a subset of what you are doing here. Maybe you could have a quick look. I will not make a PR from this branch, just wanted to submit it to show you. I think the fields for the internal boost-histogram accumulators can be all public (as was done in my branch). This simplifies the code a bit. |
|
For me it is not a question about convenience, but about the expectation what So from that point of view, I think we made the right decision to hide the flow bins in |
|
So in summary, I would prefer option 4, it seems the least surprising. |
That's what was done here - it is required to use the nice Pybind11 tools to make the structured array.
Okay, that is what is currently done here already. Note this now fixes #195 too, you can now access and set single values from unlimited and atomic storages. |
506aadc to
eab4378
Compare
|
Extra note: Ahh, this would simplify the numpy descriptors, too; it might be a reasonable thing to do to clean up eventually. I didn't rename the internal fields because they would collide with the methods, but you are just binding directly to the field, so the method is not needed. Possible cleanup for a later PR. |
henryiii
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I have no idea why this one comment got stuck in a "review", ask GitHub)
|
The accumulator interface is now (not yet pushed) nearly identical to the view interface, which is great. One more thing: should accumulators support |
|
Currently, we have positional or keyword arguments in the accumulator constructor: bh.accumulators.WeightedMean(wsum=6, wsum2=12, value=3, variance=2)Should we use the property/field names? This makes the repr very long: bh.accumulators.WeightedMean(sum_of_weights=6, sum_of_weights_squared=12, value=3, variance=2)But I like the consistency. |
|
The final commit adds Ready for second review, @HDembinski. This is looking really nice. :) |
I also like consistency more than the short names. "Readability counts". Plus it is not something you need to type a lot. |
HDembinski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done!
That sounds excellent, yes.
This is very nice! |
|
Not worried about typing it so much (IPython tab completes it for me anyway), but it is long enough that it can go off the screen when displayed (such as in the GitHub code box above) Still better to be consistent. |
|
Should be ready to go! I've removed the namespace, fixed the spelling, and replaced the keywords. |
|
Then it is ready to merge! Excellent work. |
* Allow direct memory access, no longer throws error/segfault * Internalize accumulators * First working version of conversion * Fix several bugs, and change the meaning of np.asarray() slightly * Fix view not returning an editable view * Normalize names * Nicer repr and str for views * Cleanup views with a decorator * Some small polish * Fix #195, unlimited and atomic single access * Adding tests for storages * README update * CHANGELOG and renable a test * Fix unneeded change * Address review comments * Move ostream, remove some extra headers * Fix reprs, add missing tests * Adding accum[key] access * Addressing second review * Rename wsum and wsum2
This allows
.view()to work on non-simple storages. Closes #175 and closes #21.Features:
.view().valueto get the value, for example.Todo:
_Not going to be ready by the CHEP talk, but should be ready for a release end of week / next week.