Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use thrust::identity as hash functions for byte pair encoding #13665

Merged
merged 10 commits into from
Jul 10, 2023
3 changes: 2 additions & 1 deletion cpp/src/text/subword/bpe_tokenizer.cu
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@
#include <thrust/execution_policy.h>
#include <thrust/find.h>
#include <thrust/for_each.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/merge.h>
Expand Down Expand Up @@ -234,7 +235,7 @@ struct byte_pair_encoding_fn {
if (rhs.empty()) break; // no more adjacent pairs

auto const hash = compute_hash(lhs, rhs);
auto const map_itr = d_map.find(hash);
auto const map_itr = d_map.find(hash, thrust::identity<cudf::hash_value_type>{});
if (map_itr != d_map.end()) {
// found a match; record the rank (and other min_ vars)
auto const rank = static_cast<cudf::size_type>(map_itr->second);
Expand Down
2 changes: 1 addition & 1 deletion cpp/src/text/subword/load_merges_file.cu
Original file line number Diff line number Diff line change
Expand Up @@ -119,7 +119,7 @@ std::unique_ptr<detail::merge_pairs_map_type> initialize_merge_pairs_map(

merge_pairs_map->insert(iter,
iter + input.size(),
cuco::murmurhash3_32<cudf::hash_value_type>{},
thrust::identity<cudf::hash_value_type>{},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the nature of this change? Is it just that it's pointless to be initializing the map with an actual hashed value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. Previously, this was hashing a hashed value.
Also, so that the same hasher is used for both insert and find -- a bug that this fixes too.

thrust::equal_to<cudf::hash_value_type>{},
stream.value());

Expand Down