
Conversation

@beliefer (Contributor)

What changes were proposed in this pull request?

This PR proposes to avoid the risk of long overflow in getAggregatedChecksumValue.
This PR is a follow-up of #50230.

Why are the changes needed?

I guess the rowBasedChecksums array can be very large and each row-based checksum can be a big value, so there is a risk of long overflow when aggregating them.
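
A minimal sketch of the concern (not code from this PR; the checksum values below are made up) showing how folding large values into a long with acc * 31L + value silently wraps around and can turn negative, and how the proposed mask clears the sign bit:

  // Hypothetical per-partition checksum values; JVM long arithmetic wraps on overflow.
  val checksums: Array[Long] = Array.fill(8)(Long.MaxValue / 3)
  val unmasked = checksums.foldLeft(0L)((acc, v) => acc * 31L + v)
  val masked   = checksums.foldLeft(0L)((acc, v) => (acc * 31L + v) & Long.MaxValue)
  println(unmasked) // wrap-around may yield a negative value
  println(masked)   // always non-negative, as proposed in this PR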

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

GA tests.

Was this patch authored or co-authored using generative AI tooling?

'No'.

github-actions bot added the CORE label Oct 29, 2025
@sarutak (Member) commented Oct 29, 2025

@beliefer Aggregated checksum might overflow and become a negative value, but is it really a problem? If a negative checksum value causes a problem, should we have a test for the problematic case?

@peter-toth (Contributor)

I don't think that a negative checksum is a problem; we would just lose a bit from the checksum range with that & 0111...1 (Long.MaxValue) mask.

@beliefer (Contributor, Author)

> I don't think that a negative checksum is a problem; we would just lose a bit from the checksum range with that & 0111...1 (Long.MaxValue) mask.

Does the lost bit cause any unexpected issues?

@peter-toth (Contributor)

> > I don't think that a negative checksum is a problem; we would just lose a bit from the checksum range with that & 0111...1 (Long.MaxValue) mask.
>
> Does the lost bit cause any unexpected issues?

It makes the quality of the checksum worse.

@beliefer (Contributor, Author)

> It makes the quality of the checksum worse.

I'm worried that some bits may be lost, which could actually affect the reliability of the checksum comparison.

@peter-toth (Contributor) commented Oct 30, 2025

> > It makes the quality of the checksum worse.
>
> I'm worried that some bits may be lost, which could actually affect the reliability of the checksum comparison.

Checksum computation is always about losing bits 😄; the less we lose, the better the quality of the checksum we can get.
Here we actually combine the order-independent RowBasedChecksums computed by the partitions into one final checksum that represents the whole data. Losing 1 more bit doesn't seem like a good idea, but if you have a problematic test case then let's investigate it.
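
A minimal sketch (not Spark code) of the bit-loss point: masking with Long.MaxValue clears the sign bit, so two aggregated values that differ only in that bit collapse to the same result, halving the effective checksum range:

  val a = Long.MinValue + 1L // sign bit set, lower 63 bits are ...0001
  val b = 1L                 // sign bit clear, same lower 63 bits
  assert(a != b)
  assert((a & Long.MaxValue) == (b & Long.MaxValue)) // both become 1 after masking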

  def getAggregatedChecksumValue(rowBasedChecksums: Array[RowBasedChecksum]): Long = {
    Option(rowBasedChecksums)
-     .map(_.foldLeft(0L)((acc, c) => acc * 31L + c.getValue))
+     .map(_.foldLeft(0L)((acc, c) => (acc * 31L + c.getValue) & Long.MaxValue))
Contributor (review comment on the change above)

How does & Long.MaxValue help here? The current code is a very common pattern and many hash codes are also calculated like this.
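
For reference, a minimal sketch of the common pattern being referred to; the standard JVM hash-code recipe (e.g. java.util.Arrays.hashCode) also folds with acc * 31 + element and simply lets the arithmetic wrap, treating a negative result as perfectly valid:

  // Same accumulation style as the unmasked Spark code: relies on two's-complement
  // wrap-around and never masks out the sign bit.
  def aggregate(values: Array[Long]): Long =
    values.foldLeft(0L)((acc, v) => acc * 31L + v)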

@beliefer (Contributor, Author)

I'm not sure. Let me close this PR and reopen it if an actual problem shows up in the future.

@beliefer beliefer closed this Oct 31, 2025
