Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement hash for AffineScalarFunc #219

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

NelDav
Copy link

@NelDav NelDav commented Apr 17, 2024

It is more than a week now, that I closed #189 and asked to continue discussion inside the issues of my Fork.
Nobody opened an issue. Because of that, I think that there are no further discrepancies between @wshanks, @MichaelTiemannOSC and me.
So, I want to try again and merge this changes :)

In the current master branch, there is no __hash__ function for AffineScalarFunc but for Variable:

def __hash__(self):
        # All Variable objects are by definition independent
        # variables, so they never compare equal; therefore, their
        # id() are allowed to differ
        # (http://docs.python.org/reference/datamodel.html#object.__hash__):
        return id(self)

Basics about hashes

If we just need a hash for Variable, this would be sufficient, but it is possible to make AffineScalarFunc hashable and I think we should provide a __hash__ function for AffineScalarFunc objects as well.

First, let us take a deeper look into python glossary about the term hashable.
It describes to requirements:

An object is hashable if it has a hash value which never changes during its lifetime

From this requirement, it follows that the hash must be independent from the state of the object, because otherwise the hash could change during lifetime. That means, that we can only use immutable properties of the object, to calculate the hash.

Hashable objects which compare equal must have the same hash value.

If we use the same properties, that are used to determine equality, to calculate the hash, this requirement is fulfilled.

Hashes in the uncertainties package

Let us now check how AffinceScalarFunc behaves in the uncertainties package.
Equality of two AffineScalarFunc objects is relatively complicate and I do not really want to step into details here.
Currently there is another issue discussing that in more detail.
However, if we use the _linear_part property and the _nominal_value property, we are able to calculate a hash, that compares equal, if two AffineScalarFunc objects compare equal.

But there is still a problem, the hash equality must also be guaranteed, for comparisons, where a AffineScalarFunc object and a Variable object are equal.
Because of that, we cannot keep the original implementation of __hash__ inside the Variable object.
Instead, we need to use the same calculation for Variable::__hash_ as we use for AffineScalarFunc;:__hash__.
Normally, removing the __hash__ function from Variable would do the trick, since AffineScalarFunc is the base class of Variable.
However, there one problem.

But before I explain this problem, let us check if the implementation of AffineScalarFunc::__hash__ fulfills the first requirement of hashable.
We note, that the AffienScalarFunc objects are mutable, which might be problem.
But fortunately, the expanded version of _linear_part, as well as the _nominal_value are immutable.
Because of that, also the hash is immutable and cannot change during the entire lifetime.

Now let us step into the reason, why we need to use a slightly different implementation of __hash__ for Variable.
The Variable object has a self reference inside the _linear_part.
This is a problem, when unpickling a Variable object. The reason for that is, that the pickle will first generate the LinearCombination object, before it generate the Variable object. While doing so, it tries to calculate the hash from the self reference. But the Variable object does not exist and the calculation will fail.
For the same reason, the __getstate__ and __setstate__ function must be implemented.

In the end, I want to thank everybody, who contributed to this PR.

@NelDav
Copy link
Author

NelDav commented Apr 17, 2024

Last but not least, the copy function is not needed. It comes from a test, I did.
Probably it would better to implement it inside the __copy__ function.
Or we simply remove it. what do you think about that?

@jagerber48
Copy link
Contributor

@NelDav

It is more than a week now, that I closed #189 and asked to continue discussion inside the issues of my Fork.
Nobody opened an issue.

I would say that discussion is ongoing at #218.

@NelDav
Copy link
Author

NelDav commented Apr 18, 2024

I would say that discussion is ongoing at #218.

I think, that is not the case, it is correct that #218 originates from #189, but it is about a different topic.
#218 is about equality conditions and correlations. But not about hashes. The goal of this PR is just to implement a __hash__ function, that is valid for the current implementation of __eq__.

Sure, if the __eq__ functions are changed in the future, also the __hash__ function need to be change. But that is nothing we want to do in this PR.

Please let me know if I misunderstand something.

@wshanks
Copy link
Collaborator

wshanks commented Apr 18, 2024

Sorry @NelDav for not opening an issue to discuss further. I had wanted to, but I had not partly because discussion continued on in a few different GitHub Discussions in this repo like #218 and also just because I did not have time. I had suggested moving discussion to your fork because I thought the feedback in #189 was that it was too verbose and we should get alignment before reopening a clean PR.

Here is my current understanding:

  • AffineScalarFunc holds a dictionary with Variable keys internally to track its dependence on independent variables.
  • Variable.__hash__ is based on id(). For the allowed range of id() int values, the hash is always unique (just the int value itself). So there are no hash collisions possible with adding Variables to the internal AffineScalarFunc dict. Also, retrieving a value from this internal dict is always a well-defined operation.
  • If we expand our area of concern outside of this internal dict, there is more room for problems. For example float instances could compare equal to Variables with std_dev == 0 but not have the same hash value.
  • The key decision point then is should we avoid these kind of issues by declaring that Variable.__hash__ is implemented only for internal use and should not be used in collections with non-Variable members by user code? Should we go further and change the implementation of AffineScalarFunc and Variable so that Variable is an internal tracker of dependencies but not an AffineScalarFunc needing to support mathematical operations? Or should we double down on immutability and hashability and implement AffineScalarFunc.__hash__ in a way that is consistent with the Python data model?

Personally, I prefer the latter option, but my impression is that the other maintainers have more reservations about treating AffineScalarFunc as immutable. I would say it is still open for discussion though in #193, #217, and #218.

@jagerber48
Copy link
Contributor

jagerber48 commented Apr 18, 2024

Please let me know if I misunderstand something.

Yes, @NelDav, the title of that discussion is about __eq__ but if you read the discussion you'll see we're also discussing hashing and mutability. There are ideas about a "serious overhaul" which would dramatically affect ideas for __hash__. I would say no progress can be made on a __hash__ PR like this one until the discussion at #218 is resolved.

@newville
Copy link
Member

@NelDav @wshanks

Yes, sorry if I have been the hold-up here. I think the various discussions here also get into the issue of what should "ideal behavior" be, and we want to do going forward, not so much about "let's fix a bug right now". I am OK with merging this, and then working to get 3.2.0 ready.

That is, the more philosophical discussion for version 4.0. But, unable to resist, I would respond to @wshanks:

The key decision point then is should we avoid these kind of issues by declaring that Variable.hash is implemented only for internal use and should not be used in collections with non-Variable members by user code? Should we go further and change the implementation of AffineScalarFunc and Variable so that Variable is an internal tracker of dependencies but not an AffineScalarFunc needing to support mathematical operations? Or should we double down on immutability and hashability and implement AffineScalarFunc.hash in a way that is consistent with the Python data model?

Personally, I prefer the latter option, but my impression is that the other maintainers have more reservations about treating AffineScalarFunc as immutable. I would say it is still open for discussion though in #193, #217, and #218.

I think my preference would be:

  1. keep Variable very simple: attributes of nominal_value, std_dev, tag.
  • do not make it a subclass of UFloat / AffineScalerFunc.
  • remove LinearCombination={self: 1} if at all possible. A Variable does not have derivatives.
  • move the __add__, etc methods to Variable.
  1. make UFloat a subclass of Variable, probably overriding the __add__ etc methods. Explore the possibility of using a Context to store the derivatives with respect to the global set of Variables.

But, again, that is for a future version.

@NelDav
Copy link
Author

NelDav commented Apr 18, 2024

I think the various discussions here also get into the issue of what should "ideal behavior" be, and we want to do going forward, not so much about "let's fix a bug right now".

I totally agree with that. That's what I wanted to say before, but I did not find such nice words as you did.

I think my preference would be:

  1. keep Variable very simple: attributes of nominal_value, std_dev, tag.

  2. do not make it a subclass of UFloat / AffineScalerFunc.
    remove LinearCombination={self: 1} if at all possible. A Variable does not have derivatives.
    move the add, etc methods to Variable.

  3. make UFloat a subclass of Variable, probably overriding the add etc methods. Explore the possibility of using a Context to > store the derivatives with respect to the global set of Variables.

Personally, I like this idea for the future version. It sounds like a clean solution.

Copy link

codecov bot commented May 13, 2024

Codecov Report

Attention: Patch coverage is 88.37209% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 95.41%. Comparing base (ea1d664) to head (07eeca5).
Report is 1 commits behind head on master.

Files Patch % Lines
uncertainties/core.py 78.26% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #219      +/-   ##
==========================================
- Coverage   95.59%   95.41%   -0.18%     
==========================================
  Files          12       12              
  Lines        1905     1943      +38     
==========================================
+ Hits         1821     1854      +33     
- Misses         84       89       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jagerber48
Copy link
Contributor

See this PR on my fork where I implement a solution that allows UFloat to be hashable. https://github.com/jagerber48/uncertainties/pull/2/files#diff-95026e41b7b4438fd289ed5a029754020a9a39346ec252ef0cea0080bb258facR88

It begins with the UAtom which is a small frozen dataclass which has a float standard deviation and a uuid (calculated from uuid.uuid4(). UAtoms with distinct uuid are always independent whereas two UAtoms with the same uuid are the same UAtom and therefore perfectly correlated. Because UAtom is a frozen dataclass it has a built in hash function which uses its two data fields.

Next is the UCombo. This replaces the current LinearCombination. The UCombo models a linear combination where each term has a float weight for either a UAtom or another UCombo. The nesting of UCombo allows for deferral of their expansion which may provide a performance improvement over always storing expanded UCombos for deep calculations (this presumed optimization has been used in uncertainties. The UCombo is also a frozen, and thus hashable, dataclass. The UCombo has the expected convenience methods of (1) returning an expanded version of itself and (2) calculated its standard deviation, of course taking into account correlations between terms in the linear combination.

Finally is the UFloat which contains a float value and an uncertainty represented as UCombo. You can extract the std_dev using the UCombo machinery. Because of a quirk of dataclasses* I wasn't able to write the UFloat class as a frozen dataclass. But I was able to write a simple __hash__ function that does the obvious thing of combining the hashes of the value and uncertainty which define it.

*Basically I wanted a few conveniences in the API. I wanted users to be able to pass any kind of number, int, decimal, float, etc. in for the value or uncertainty and then it gets cast to a float and I also want the internal type of uncertainty is UCombo. I wasn't able to figure out how to do this on a frozen dataclass and still get typing etc. to work out correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants