@techaddict
Contributor

What changes were proposed in this pull request?

Implement DataFrame.sameSemantics

Why are the changes needed?

API coverage.

Does this PR introduce any user-facing change?

Yes, it adds the DataFrame.sameSemantics API.

How was this patch tested?

New unit tests.

@techaddict
Contributor Author

cc: @HyukjinKwon @zhengruifeng

@zhengruifeng
Contributor

zhengruifeng commented Jan 7, 2023

@techaddict Thank you for working on it.

We had some discussion on sameSemantics and semanticHash in #38742 (comment).

I think this one and #39427 are controversial, and the two are developer APIs (

/**
 * Returns `true` when the logical query plans inside both [[Dataset]]s are equal and
 * therefore return same results.
 *
 * @note The equality comparison here is simplified by tolerating the cosmetic differences
 *       such as attribute names.
 * @note This API can compare both [[Dataset]]s very fast but can still return `false` on
 *       the [[Dataset]] that return the same results, for instance, from different plans. Such
 *       false negative semantic can be useful when caching as an example.
 * @since 3.1.0
 */
@DeveloperApi
def sameSemantics(other: Dataset[T]): Boolean = {
  queryExecution.analyzed.sameResult(other.queryExecution.analyzed)
}

/**
 * Returns a `hashCode` of the logical query plan against this [[Dataset]].
 *
 * @note Unlike the standard `hashCode`, the hash is calculated against the query plan
 *       simplified by tolerating the cosmetic differences such as attribute names.
 * @since 3.1.0
 */
@DeveloperApi
def semanticHash(): Int = {
  queryExecution.analyzed.semanticHash()
}
).

I think we may not add them.
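
For readers unfamiliar with the two methods, the behavior described in the doc comments above can be illustrated with a toy model. This is only a sketch and is not Spark's implementation: the `Plan` types, `canonical`, and the sample plans are all hypothetical names made up for illustration. It assumes that "cosmetic differences" means output aliases only, and it shows both the tolerant comparison and the false negative the second `@note` mentions.

```scala
// Toy model only. Nothing here is Spark code; every name is hypothetical.
object SemanticsSketch {
  sealed trait Plan
  final case class Scan(table: String) extends Plan
  // Keeps rows where `column` is greater than `greaterThan`.
  final case class Filter(column: String, greaterThan: Int, child: Plan) extends Plan
  // Each pair is (source column, output alias). The alias is treated as cosmetic.
  final case class Project(columns: Seq[(String, String)], child: Plan) extends Plan

  // Rewrite a plan so that cosmetic details (output aliases) no longer matter.
  def canonical(plan: Plan): Plan = plan match {
    case Scan(table) => Scan(table)
    case Filter(column, bound, child) => Filter(column, bound, canonical(child))
    case Project(columns, child) =>
      Project(columns.map { case (source, _) => (source, "_") }, canonical(child))
  }

  // Structural equality of the canonical forms.
  def sameSemantics(left: Plan, right: Plan): Boolean =
    canonical(left) == canonical(right)

  // Hash of the canonical form, so plans that compare equal also hash equal.
  def semanticHash(plan: Plan): Int = canonical(plan).hashCode

  val original: Plan = Project(Seq(("id", "id")), Filter("id", 10, Scan("events")))
  // Same plan, but the output column is renamed.
  val renamed: Plan = Project(Seq(("id", "user_id")), Filter("id", 10, Scan("events")))
  // Returns the same rows as `original` (id > 5 is implied by id > 10),
  // but the plan has a different shape.
  val redundantFilter: Plan =
    Project(Seq(("id", "id")), Filter("id", 10, Filter("id", 5, Scan("events"))))

  def main(args: Array[String]): Unit = {
    println(sameSemantics(original, renamed))
    println(semanticHash(original) == semanticHash(renamed))
    println(sameSemantics(original, redundantFilter))
  }
}
```

In this sketch the renamed plan compares equal to the original and shares its hash, while the plan with the redundant filter compares unequal even though it would return the same rows. That cheap but conservative check is the trade-off the doc comment describes as useful for caching.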
