-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat(iceberg): Support Iceberg hash #14025
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(iceberg): Support Iceberg hash #14025
Conversation
✅ Deploy Preview for meta-velox canceled.
|
282f1be to
cc51f43
Compare
|
Could you help review this PR? Thanks! @mbasmanova |
mbasmanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments.
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| velox_add_library(velox_functions_iceberg_util Murmur3Hash32.cpp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
velox_functions_iceberg_util
Let's match the file path: velox_functions_iceberg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I use the same format as https://github.com/facebookincubator/velox/blob/main/velox/functions/lib/CMakeLists.txt#L19
The velox_functions_iceberg_util will also be linked by IcebergConnector for partition transform, velox_functions_iceberg links too much library, so I add a new utility library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh "util" name it too generic and should be avoided if possible. Here, we have only one library / binary in a folder, hence, we can use the folder's path to name the library. No need to append any suffix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After support the Iceberg function like bucket, https://github.com/facebookincubator/velox/pull/13174/files#diff-588a1233022efcdaf5de5fe7eeb67b699f62e7ece9c3e64130be130091b27423, https://github.com/facebookincubator/velox/pull/13174/files#diff-cee1c2962987fd80218c2475980e18d021179d632f31f9fcdd325f35fb13a733R17, we will have two libraries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh since this library provides a hash function, perhaps, we can name it velox_functions_iceberg_hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will also provide utility function like DateTimeUtil for years transformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will also provide utility function like DateTimeUtil for years transformer.
It is generally not ideal to create utils that contain multiple unrelated functions. It would be better to keep hash-related utilities separate from date-time-related ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, updated to velox_functions_iceberg_hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you.
velox/functions/lib/Hash.h
Outdated
| /// original) to avoid undefined signed integer overflow and sign extension. | ||
| class Murmur3Hash32 { | ||
| public: | ||
| /// Hash the lower int , is a fast path of hashBytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does "lower int" refer to here? "is a fast path of hashBytes" - which hashBytes does this refer to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hashBytes for int64 is same, hashBytes process 4 bytes as a chunk, only processing the remaining bytes is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update the comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The subclass implements hashBytes, do we need to have a virtual function hashBytes? But then the function cannot be static.
ba51c1a to
9246ce9
Compare
|
Resolved all the comments, could you help review again? Thanks! @mbasmanova |
|
|
||
| #include <gtest/gtest.h> | ||
|
|
||
| using namespace facebook::velox::functions::iceberg; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test goes into anonymous namespace inside the namespace that contains code being tested.
namespace facebook::velox::functions::iceberg {
namespace {
TEST(Murmur3Hash32Test, bigint) {
...
}
}
| /// hashBytes. | ||
| static uint32_t hashInt64(uint64_t input, uint32_t seed); | ||
|
|
||
| protected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think all methods here can be public. If you want to have some protected, then hashInt64 also needs to be protected for consistency.
| #include <gtest/gtest.h> | ||
| #include "velox/common/base/tests/GTestUtils.h" | ||
|
|
||
| namespace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As above, the test goes into anonymous namespace inside namespace that contains code being tested.
mbasmanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jinchengchenghh Thank you for iterating. Looks good % a few nits.
|
Do you have further comments? Thanks! @mbasmanova |
mbasmanova
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
|
@jinchengchenghh Can you look if the failing tests related to your changes ? Can you also rebase to latest main ? Thanks. |
2e2b5a3 to
f355240
Compare
|
The failure is not related to this PR, rebased just now, waiting to see if the error exists. @kgpai |
|
The failure is tracked by #14308, and is not related to this PR |
Summary: The iceberg hash use mumur3 hash, which aligns with https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp, firstly, process every 4 bytes as a chunk, then process remaining bytes by XOR, sparksql also uses this hash algorithm but is different with processing remaining bytes, which combine the remaining bytes. Extract the common function hashInt64 to functions/lib. This class will be used for iceberg bucket transform and bucket function. The iceberg mumur3 hash should be strictly with java implementation, then write by iceberg could read with iceberg Java, and the function call can also get the correct result. The iceberg utility lib `velox_functions_iceberg_hash` will be linked by iceberg connector write to do partition transform. facebookincubator#13874 Pull Request resolved: facebookincubator#14025 Reviewed By: pedroerp Differential Revision: D79732785 Pulled By: kgpai fbshipit-source-id: 6122b94673f015dca5c8484722926709a30fe65e
The iceberg hash use mumur3 hash, which aligns with https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp, firstly, process every 4 bytes as a chunk, then process remaining bytes by XOR, sparksql also uses this hash algorithm but is different with processing remaining bytes, which combine the remaining bytes. Extract the common function hashInt64 to functions/lib.
This class will be used for iceberg bucket transform and bucket function.
The iceberg mumur3 hash should be strictly with java implementation, then write by iceberg could read with iceberg Java, and the function call can also get the correct result.
The iceberg utility lib
velox_functions_iceberg_hashwill be linked by iceberg connector write to do partition transform. #13874