@jinchengchenghh (Collaborator) commented Jul 7, 2025

The Iceberg hash uses the 32-bit Murmur3 hash, matching https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp: it first processes the input in 4-byte chunks, then folds the remaining bytes in with XOR. Spark SQL uses the same algorithm but combines the remaining bytes differently. This PR extracts the common function hashInt64 to functions/lib.

This class will be used for the Iceberg bucket transform and the bucket function.
The Iceberg Murmur3 hash must match the Java implementation exactly, so that data written through the Iceberg connector can be read by Iceberg Java and the function call returns the correct result.
The Iceberg utility library velox_functions_iceberg_hash will be linked by the Iceberg connector writer to perform partition transforms. #13874
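
As a rough illustration of the scheme described above, here is a minimal C++ sketch of 32-bit Murmur3 in the style of the referenced MurmurHash3.cpp: complete 4-byte chunks are mixed first, then the remaining 1 to 3 bytes are XORed into a single word. The namespace and function names (`sketch`, `hashBytes`, `bucket`) are hypothetical and are not the API added by this PR. The `bucket` helper assumes the bucket formula from the Iceberg spec, `(hash & Integer.MAX_VALUE) % N`.

```cpp
#include <cstddef>
#include <cstdint>

namespace sketch {

inline uint32_t rotl32(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

// Scrambles one 32-bit word before it is merged into the hash state.
inline uint32_t mixK1(uint32_t k1) {
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

// Merges a scrambled word into the running hash state.
inline uint32_t mixH1(uint32_t h1, uint32_t k1) {
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

// Final avalanche step; the input length is folded in first.
inline uint32_t fmix32(uint32_t h1, uint32_t length) {
  h1 ^= length;
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}

// Unsigned arithmetic throughout, so overflow wraps instead of being UB.
inline uint32_t hashBytes(const void* input, size_t len, uint32_t seed) {
  const auto* data = static_cast<const uint8_t*>(input);
  uint32_t h1 = seed;
  size_t i = 0;
  // Step 1: every complete 4-byte chunk, read as little-endian.
  for (; i + 4 <= len; i += 4) {
    const uint32_t k1 = static_cast<uint32_t>(data[i]) |
        static_cast<uint32_t>(data[i + 1]) << 8 |
        static_cast<uint32_t>(data[i + 2]) << 16 |
        static_cast<uint32_t>(data[i + 3]) << 24;
    h1 = mixH1(h1, mixK1(k1));
  }
  // Step 2: XOR the remaining 1 to 3 bytes into one word, then mix it once.
  const size_t rem = len - i;
  uint32_t k1 = 0;
  if (rem >= 3) {
    k1 ^= static_cast<uint32_t>(data[i + 2]) << 16;
  }
  if (rem >= 2) {
    k1 ^= static_cast<uint32_t>(data[i + 1]) << 8;
  }
  if (rem >= 1) {
    k1 ^= data[i];
    h1 ^= mixK1(k1);
  }
  return fmix32(h1, static_cast<uint32_t>(len));
}

// Assumed Iceberg bucket formula: clear the sign bit, then take the modulus.
inline int32_t bucket(uint32_t hash, int32_t numBuckets) {
  return static_cast<int32_t>(hash & 0x7fffffffu) % numBuckets;
}

} // namespace sketch
```

With seed 0, `sketch::hashBytes("iceberg", 7, 0)` should reproduce the Iceberg spec's test vector for the string "iceberg" (1210000089), which is the kind of check that confirms compatibility with the Java implementation.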

netlify bot commented Jul 7, 2025

Deploy Preview for meta-velox canceled.

🔨 Latest commit: f355240
🔍 Latest deploy log: https://app.netlify.com/projects/meta-velox/deploys/688c54679b1dc4000892f872

@facebook-github-bot added the CLA Signed label Jul 7, 2025
@jinchengchenghh changed the title from "Iceberg hash" to "feat(iceberg): Support Iceberg hash" Jul 7, 2025
@jinchengchenghh
Collaborator Author

Could you help review this PR? Thanks! @mbasmanova

@mbasmanova (Contributor) left a comment

Some comments.

# See the License for the specific language governing permissions and
# limitations under the License.

velox_add_library(velox_functions_iceberg_util Murmur3Hash32.cpp)
Contributor

velox_functions_iceberg_util

Let's match the file path: velox_functions_iceberg

Collaborator Author

I used the same format as https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/CMakeLists.txt#L19

velox_functions_iceberg_util will also be linked by IcebergConnector for partition transforms. velox_functions_iceberg links too many libraries, so I added a new utility library.

Contributor

@jinchengchenghh The name "util" is too generic and should be avoided if possible. Here we have only one library / binary in the folder, so we can use the folder's path to name the library. No need to append a suffix.

Contributor

@jinchengchenghh Since this library provides a hash function, perhaps we can name it velox_functions_iceberg_hash.

Collaborator Author

It will also provide utility functions such as DateTimeUtil for the years transformer.

Contributor

It will also provide utility functions such as DateTimeUtil for the years transformer.

It is generally not ideal to create utils that contain multiple unrelated functions. It would be better to keep hash-related utilities separate from date-time-related ones.

Collaborator Author

Yes, updated to velox_functions_iceberg_hash

Contributor

Thank you.

/// original) to avoid undefined signed integer overflow and sign extension.
class Murmur3Hash32 {
public:
/// Hash the lower int , is a fast path of hashBytes.
Contributor

What does "lower int" refer to here? "is a fast path of hashBytes" - which hashBytes does this refer to?

Collaborator Author

For an int64 input, hashBytes gives the same result: hashBytes processes 4 bytes as a chunk, and only the processing of the remaining bytes is different.
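
To illustrate why an int64 input needs no special handling, here is a hedged sketch of a `hashInt64`-style fast path. The 8-byte value splits into exactly two 4-byte chunks, so no bytes are left over for the step where the variants differ. The namespace and helper names are assumptions made for this example; only the `hashInt64(uint64_t input, uint32_t seed)` signature mirrors the diff under review.

```cpp
#include <cstdint>

namespace fastpath {

inline uint32_t rotl32(uint32_t x, int r) {
  return (x << r) | (x >> (32 - r));
}

inline uint32_t mixK1(uint32_t k1) {
  k1 *= 0xcc9e2d51u;
  k1 = rotl32(k1, 15);
  k1 *= 0x1b873593u;
  return k1;
}

inline uint32_t mixH1(uint32_t h1, uint32_t k1) {
  h1 ^= k1;
  h1 = rotl32(h1, 13);
  return h1 * 5u + 0xe6546b64u;
}

inline uint32_t fmix32(uint32_t h1, uint32_t length) {
  h1 ^= length;
  h1 ^= h1 >> 16;
  h1 *= 0x85ebca6bu;
  h1 ^= h1 >> 13;
  h1 *= 0xc2b2ae35u;
  h1 ^= h1 >> 16;
  return h1;
}

// Equivalent to hashing the 8 little-endian bytes of `input`: the low word
// is the first chunk, the high word is the second, and there is no tail.
inline uint32_t hashInt64(uint64_t input, uint32_t seed) {
  const auto low = static_cast<uint32_t>(input);
  const auto high = static_cast<uint32_t>(input >> 32);
  uint32_t h1 = mixH1(seed, mixK1(low));
  h1 = mixH1(h1, mixK1(high));
  return fmix32(h1, 8);
}

} // namespace fastpath
```

With seed 0, `fastpath::hashInt64(34, 0)` evaluates to 2017239379, the value the Iceberg spec lists for the long 34.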

Collaborator Author

Updated the comments.

Collaborator Author

The subclass implements hashBytes. Do we need a virtual hashBytes function? If so, the function cannot be static.
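
One possible way to keep `hashBytes` static without a virtual function is sketched below; it is not necessarily what was merged. Each variant defines its own static `hashBytes` on top of shared static helpers, and callers select the variant at compile time. All class and function names here (`Murmur3Base`, `XorTailHash`, `PerByteTailHash`, `hashWith`) are hypothetical, and the per-byte tail handling is an assumption about the Spark-style variant.

```cpp
#include <cstddef>
#include <cstdint>

namespace design {

// Shared Murmur3 building blocks. Everything is static and public, so no
// instance or vtable is involved.
class Murmur3Base {
 public:
  static uint32_t rotl32(uint32_t x, int r) {
    return (x << r) | (x >> (32 - r));
  }
  static uint32_t mixK1(uint32_t k1) {
    k1 *= 0xcc9e2d51u;
    k1 = rotl32(k1, 15);
    return k1 * 0x1b873593u;
  }
  static uint32_t mixH1(uint32_t h1, uint32_t k1) {
    h1 ^= k1;
    h1 = rotl32(h1, 13);
    return h1 * 5u + 0xe6546b64u;
  }
  static uint32_t fmix32(uint32_t h1, uint32_t length) {
    h1 ^= length;
    h1 ^= h1 >> 16;
    h1 *= 0x85ebca6bu;
    h1 ^= h1 >> 13;
    h1 *= 0xc2b2ae35u;
    return h1 ^ (h1 >> 16);
  }
  // Mixes every complete little-endian 4-byte chunk into h1.
  static uint32_t mixChunks(const uint8_t* data, size_t len, uint32_t h1) {
    for (size_t i = 0; i + 4 <= len; i += 4) {
      const uint32_t k1 = static_cast<uint32_t>(data[i]) |
          static_cast<uint32_t>(data[i + 1]) << 8 |
          static_cast<uint32_t>(data[i + 2]) << 16 |
          static_cast<uint32_t>(data[i + 3]) << 24;
      h1 = mixH1(h1, mixK1(k1));
    }
    return h1;
  }
};

// Tail handling in the style of the reference MurmurHash3.cpp.
class XorTailHash : public Murmur3Base {
 public:
  static uint32_t hashBytes(const void* input, size_t len, uint32_t seed) {
    const auto* data = static_cast<const uint8_t*>(input);
    uint32_t h1 = mixChunks(data, len, seed);
    const size_t rem = len % 4;
    const size_t i = len - rem;
    uint32_t k1 = 0;
    if (rem >= 3) {
      k1 ^= static_cast<uint32_t>(data[i + 2]) << 16;
    }
    if (rem >= 2) {
      k1 ^= static_cast<uint32_t>(data[i + 1]) << 8;
    }
    if (rem >= 1) {
      k1 ^= data[i];
      h1 ^= mixK1(k1);
    }
    return fmix32(h1, static_cast<uint32_t>(len));
  }
};

// Assumed Spark-style tail handling: each remaining byte is sign-extended
// and mixed as if it were a full chunk.
class PerByteTailHash : public Murmur3Base {
 public:
  static uint32_t hashBytes(const void* input, size_t len, uint32_t seed) {
    const auto* data = static_cast<const uint8_t*>(input);
    uint32_t h1 = mixChunks(data, len, seed);
    for (size_t i = len - len % 4; i < len; ++i) {
      const auto k1 = static_cast<uint32_t>(
          static_cast<int32_t>(static_cast<int8_t>(data[i])));
      h1 = mixH1(h1, mixK1(k1));
    }
    return fmix32(h1, static_cast<uint32_t>(len));
  }
};

// Callers pick the variant at compile time instead of through a virtual call.
template <typename Hash>
uint32_t hashWith(const void* input, size_t len, uint32_t seed) {
  return Hash::hashBytes(input, len, seed);
}

} // namespace design
```

The trade-off is that the variant must be known at compile time. For inputs whose length is a multiple of 4, both variants in this sketch return the same value because no tail bytes remain.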

@jinchengchenghh
Collaborator Author

Resolved all the comments, could you help review again? Thanks! @mbasmanova


#include <gtest/gtest.h>

using namespace facebook::velox::functions::iceberg;
Contributor

The test should go into an anonymous namespace inside the namespace that contains the code being tested.

namespace facebook::velox::functions::iceberg {
namespace {

TEST(Murmur3Hash32Test, bigint) {
...

}
}

/// hashBytes.
static uint32_t hashInt64(uint64_t input, uint32_t seed);

protected:
Contributor

I think all methods here can be public. If you want to have some protected, then hashInt64 also needs to be protected for consistency.

#include <gtest/gtest.h>
#include "velox/common/base/tests/GTestUtils.h"

namespace {
Contributor

As above, the test should go into an anonymous namespace inside the namespace that contains the code being tested.

@mbasmanova (Contributor) left a comment

@jinchengchenghh Thank you for iterating. Looks good % a few nits.

@jinchengchenghh
Collaborator Author

Do you have further comments? Thanks! @mbasmanova

@mbasmanova (Contributor) left a comment

Thanks.

@mbasmanova added the ready-to-merge label Jul 29, 2025
@kgpai (Contributor) commented Jul 31, 2025

@jinchengchenghh Can you check whether the failing tests are related to your changes? Can you also rebase onto the latest main? Thanks.

@jinchengchenghh
Collaborator Author

The failure is not related to this PR. I rebased just now and am waiting to see whether the error persists. @kgpai

@jinchengchenghh
Collaborator Author

The failure is tracked by #14308, and is not related to this PR

@facebook-github-bot
Contributor

@kgpai has imported this pull request. If you are a Meta employee, you can view this in D79732785.

@facebook-github-bot
Contributor

@kgpai merged this pull request in a417b2b.

wypb pushed a commit to wypb/velox that referenced this pull request Sep 3, 2025
Summary:
The Iceberg hash uses the 32-bit Murmur3 hash, matching https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp: it first processes the input in 4-byte chunks, then folds the remaining bytes in with XOR. Spark SQL uses the same algorithm but combines the remaining bytes differently. This PR extracts the common function hashInt64 to functions/lib.

This class will be used for the Iceberg bucket transform and the bucket function.
The Iceberg Murmur3 hash must match the Java implementation exactly, so that data written through the Iceberg connector can be read by Iceberg Java and the function call returns the correct result.
The Iceberg utility library `velox_functions_iceberg_hash` will be linked by the Iceberg connector writer to perform partition transforms. facebookincubator#13874

Pull Request resolved: facebookincubator#14025

Reviewed By: pedroerp

Differential Revision: D79732785

Pulled By: kgpai

fbshipit-source-id: 6122b94673f015dca5c8484722926709a30fe65e