Skip to content

Conversation

@beliefer
Copy link
Contributor

@beliefer beliefer commented Jun 10, 2023

What changes were proposed in this pull request?

The hashCode() of UserDefinedScalarFunc and GeneralScalarExpression is not good enough. Take for example, GeneralScalarExpression uses Objects.hash(name, children), it adopt the hash code of name and children's reference and then combine them together as the GeneralScalarExpression's hash code.
In fact, we should adopt the hash code for each element in children.

Because UserDefinedAggregateFunc and GeneralAggregateFunc missing hashCode(), this PR also want add them.

This PR also improve the toString for UserDefinedAggregateFunc and GeneralAggregateFunc by using bool primitive comparison instead Objects.equals. Because the performance of bool primitive comparison better than Objects.equals.

Why are the changes needed?

Improve the hash code for some DS V2 Expression.

Does this PR introduce any user-facing change?

'Yes'.

How was this patch tested?

N/A

@github-actions github-actions bot added the SQL label Jun 10, 2023
@beliefer beliefer changed the title [SPARK-44018][SQL] Improve the hashCode for Some DS V2 Expression [SPARK-44018][SQL] Improve the hash code for some DS V2 Expression Jun 10, 2023
@beliefer
Copy link
Contributor Author

beliefer commented Jun 11, 2023

ping @cloud-fan cc @asiunov

Copy link

@asiunov asiunov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing this! I'm not familiar well with how these classes work and are used, but I have a few suggestions based on how equals/hashCode are usually implemented.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Objects.hash(name) * 31 + Arrays.hashCode(children);
return Objects.hash(name, isDistinct) * 31 + Arrays.hashCode(children);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Objects.hash(name, canonicalName) * 31 + Arrays.hashCode(children);
return Objects.hash(name, canonicalName, isDistinct) * 31 + Arrays.hashCode(children);

Comment on lines 62 to 63
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(added isDistinct == that.isDistinct)

Suggested change
return Objects.equals(name, that.name) && Objects.equals(canonicalName, that.canonicalName) &&
Arrays.equals(children, that.children);
return isDistinct == that.isDistinct && Objects.equals(name, that.name) &&
Objects.equals(canonicalName, that.canonicalName) && Arrays.equals(children, that.children);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bool primitive comparison works faster than Objects.equals, also it is usually placed at the beginning, because it is faster to check than string/objects.

Suggested change
return Objects.equals(name, that.name) && Objects.equals(isDistinct, that.isDistinct) &&
return isDistinct == that.isDistinct && Objects.equals(name, that.name) &&

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. bool primitive comparison faster than Objects.equals. But put bool primitive comparison at first is not good for readability.

@beliefer beliefer changed the title [SPARK-44018][SQL] Improve the hash code for some DS V2 Expression [SPARK-44018][SQL] Improve the hashCode and toString for some DS V2 Expression Jun 12, 2023
@pan3793
Copy link
Member

pan3793 commented Jun 12, 2023

A question for education: what's the benefit of a good hash in such cases?

@beliefer
Copy link
Contributor Author

A question for education: what's the benefit of a good hash in such cases?

Personally, I think we shouldn't use references of any java object for hash code.

@beliefer
Copy link
Contributor Author

ping @cloud-fan cc @huaxingao

@asiunov
Copy link

asiunov commented Jun 12, 2023

@pan3793, in this case, I built spark using bazel with a linter enabled, which errored on this hashcode implementation saying that Arrays.hashCode must be used.

sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions/UserDefinedScalarFunc.java:61: error: [ArrayHashCode] hashcode method on array does not hash array contents
    return Objects.hash(name, canonicalName, children);
                       ^
    (see https://errorprone.info/bugpattern/ArrayHashCode)
  Did you mean 'return Objects.hash(name, canonicalName, Arrays.hashCode(children));'?

There is a conceptual issue here: hashCode and equals should be always designed in such a way that two non-equal objects may have equal hash codes (meaning a hash collision), but two equal objects must have equal hash codes. If the latter is wrong and such objects are used in a hashtable, it is possible that these two objects will be inserted in different buckets, meaning two equal objects will be present in this hashtable. For two arrays equals returns false and hashCode are different even if their elements are the same. This is because arrays' equals/hashCode are based on reference in memory and nothing else.

I do not remember all the details, but I remember there was a good explanation on designing hashCode/equals in Effective Java by Joshua Bloch.

@pan3793
Copy link
Member

pan3793 commented Jun 13, 2023

... two equal objects must have equal hash codes ... If the latter is wrong and such objects are used in a hashtable ...

@asiunov thanks for the explanation, I know the principle, but I don't find where they are used as hash keys. That's why I put the in such cases in my question.

@asiunov
Copy link

asiunov commented Jun 13, 2023

Oh, I see. Sorry, I do not know the details of how these classes work. I reported this issue mostly because I've got linter errors, and just as a good-to-fix for any future use.

@VindhyaG
Copy link
Contributor

VindhyaG commented Jun 19, 2023

Hi @beliefer One question. If hashCode function is added to UserDefinedAggregateFunc as public that means this is user facing change right?

@beliefer
Copy link
Contributor Author

Hi @beliefer One question. If hashCode function is added to UserDefinedAggregateFunc as public that means this is user facing change right?

Yes. But if it's incorrect, we should fix it.

@VindhyaG
Copy link
Contributor

Hi @beliefer One question. If hashCode function is added to UserDefinedAggregateFunc as public that means this is user facing change right?

Yes. But if it's incorrect, we should fix it.

It says No above in the "Does this PR introduce any user-facing change? " section . Since we are adding two new methods as user facing so wanted to confirm

@VindhyaG
Copy link
Contributor

Hi @beliefer One question. If hashCode function is added to UserDefinedAggregateFunc as public that means this is user facing change right?

Yes. But if it's incorrect, we should fix it.

It says No above in the "Does this PR introduce any user-facing change? " section . Since we are adding two new methods as user facing so wanted to confirm (since that could implicate some changes in doc etc? just a doubt )

@cloud-fan
Copy link
Contributor

... we are adding two new methods ...

It overrides two methods, so it's not an API change but rather a bug fix.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.4!

@cloud-fan cloud-fan closed this in 8c84d2c Jun 19, 2023
cloud-fan pushed a commit that referenced this pull request Jun 19, 2023
…xpression

### What changes were proposed in this pull request?
The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code.
In fact, we should adopt the hash code for each element in `children`.

Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them.

This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`.

### Why are the changes needed?
Improve the hash code for some DS V2 Expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
N/A

Closes #41543 from beliefer/SPARK-44018.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8c84d2c)
Signed-off-by: Wenchen Fan <[email protected]>
czxm pushed a commit to czxm/spark that referenced this pull request Jun 19, 2023
…xpression

### What changes were proposed in this pull request?
The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code.
In fact, we should adopt the hash code for each element in `children`.

Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them.

This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`.

### Why are the changes needed?
Improve the hash code for some DS V2 Expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
N/A

Closes apache#41543 from beliefer/SPARK-44018.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
@beliefer
Copy link
Contributor Author

@cloud-fan @asiunov @pan3793 @VindhyaG Thank you!

snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…xpression

### What changes were proposed in this pull request?
The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code.
In fact, we should adopt the hash code for each element in `children`.

Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them.

This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`.

### Why are the changes needed?
Improve the hash code for some DS V2 Expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
N/A

Closes apache#41543 from beliefer/SPARK-44018.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8c84d2c)
Signed-off-by: Wenchen Fan <[email protected]>
GladwinLee pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…xpression

### What changes were proposed in this pull request?
The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code.
In fact, we should adopt the hash code for each element in `children`.

Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them.

This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`.

### Why are the changes needed?
Improve the hash code for some DS V2 Expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
N/A

Closes apache#41543 from beliefer/SPARK-44018.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8c84d2c)
Signed-off-by: Wenchen Fan <[email protected]>
catalinii pushed a commit to lyft/spark that referenced this pull request Oct 10, 2023
…xpression

### What changes were proposed in this pull request?
The `hashCode() `of `UserDefinedScalarFunc` and `GeneralScalarExpression` is not good enough. Take for example, `GeneralScalarExpression` uses `Objects.hash(name, children)`, it adopt the hash code of `name` and `children`'s reference and then combine them together as the `GeneralScalarExpression`'s hash code.
In fact, we should adopt the hash code for each element in `children`.

Because `UserDefinedAggregateFunc` and `GeneralAggregateFunc` missing `hashCode()`, this PR also want add them.

This PR also improve the toString for `UserDefinedAggregateFunc` and `GeneralAggregateFunc` by using bool primitive comparison instead `Objects.equals`. Because the performance of bool primitive comparison better than `Objects.equals`.

### Why are the changes needed?
Improve the hash code for some DS V2 Expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
N/A

Closes apache#41543 from beliefer/SPARK-44018.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 8c84d2c)
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants