Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add fixed length str2int / asin str2int memory-mappable hashmap #295

Merged
merged 4 commits into from
Aug 29, 2024

Conversation

vpung
Copy link
Contributor

@vpung vpung commented Aug 6, 2024

Issue #, if available:
N/A

Description of changes:
Added fixed length str2int and fixed length 10 str2int memory-mapped utility for ankerl hashmap.

Usage (Python-binding): Please refer to the added unittests.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@vpung vpung self-assigned this Aug 12, 2024
@vpung vpung force-pushed the ankerl-mmap branch 2 times, most recently from 39135ac to cc1fd3f Compare August 14, 2024 18:00
@vpung vpung requested a review from weiliw-amz August 14, 2024 18:34
from pecos.utils.mmap_hashmap_util import MmapHashmap, MmapHashmapBatchGetter

map_dir = tmpdir.join("fixed_len_10_str2int").realpath().strpath
kv_dict = {"aaaaaaaaaa".encode("utf-8"): 2, "bbbbbbbbbb".encode("utf-8"): 3}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create the strings like "a" * 10 so that it is more clear that these keys are length-10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update for clarity.

max_batch_size - len(kv_dict)
) # Non-exist key
vs = list(kv_dict.values()) + [10] * (max_batch_size - len(kv_dict))
assert r_map_batch_getter.get(ks, 10).tolist() == vs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you testing increasing max_batch_size inside class? Then you should check it indeed got increased.


static constexpr std::size_t fixed_str_len = 10;

struct StrView {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should rename this struct into something like FixLen10Str. It's not a "view" anymore, but actually holds data.

throw std::runtime_error("ASIN string length is not 10.");
}

char key_arr[fixed_str_len];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not necessary to create this array, passing the pointer is sufficient.

}

char key_arr[fixed_str_len];
std::memcpy(key_arr, key_string.data(), key_string.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also remove this line.

// Emplace back std::pair<StrView, uint64_t>
auto eb_val = store_.emplace_back(
std::piecewise_construct,
std::forward_as_tuple(key_arr),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change this line into:

std::forward_as_tuple(key_string.data()),

StrView(const char* input_str = nullptr) {
if (input_str) {
std::memcpy(str, input_str, fixed_str_len);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add an else here to guard against illegal init which we definitely should realize for a bug in the code:

else {
    throw std::runtime_error("Illegal initialization of FixLen10Str with nullptr.");
}

Ideally, should limit scope of this constructor to only take input param std::string_view. However our template of mmap vector limits data to be plain and string_view as param doesn't satisfy. Using raw ptr here as a compromise.

@vpung vpung force-pushed the ankerl-mmap branch 2 times, most recently from 256eaa6 to 1aedb47 Compare August 27, 2024 18:06
@vpung vpung marked this pull request as ready for review August 27, 2024 18:44
@vpung vpung merged commit 37028ca into amzn:mainline Aug 29, 2024
25 checks passed
@vpung vpung deleted the ankerl-mmap branch August 29, 2024 15:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants