Fix code causing spurious Wstringop-overflow warning #3333
@@ -2204,20 +2204,32 @@ constexpr auto to_ascii(Char c) -> char {
   return c <= 0xff ? static_cast<char>(c) : '\0';
 }

 // Returns the length of a codepoint. Returns 0 for invalid codepoints.
 FMT_CONSTEXPR inline auto code_point_length_impl(char c) -> int {
   return "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4"
       [static_cast<unsigned char>(c) >> 3];
 }
+// Returns the length of a codepoint. Returns 1 for invalid codepoints.
+// This is equivalent to
+//
+//    int len = code_point_length_impl(c);
+//    return len + !len;
+//
+// This is useful because it allows the compiler to check that the
+// length is within the range [1, 4]
+FMT_CONSTEXPR inline auto code_point_length_impl_2(char c) -> int {
+  return static_cast<int>((0x3a55000000000000ull >> (2 * (static_cast<unsigned char>(c) >> 3))) & 0x3) + 1;
+}
Comment on lines +2212 to +2222
I don't think we need both
On a second thought, this logic can be moved directly into
The return value of
First case (first and last lines): Lines 654 to 664 in 3daf338
Second case: Lines 2215 to 2223 in 3daf338
Ah, right. Then I think we should merge

 template <typename Char>
 FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
   if (const_check(sizeof(Char) != 1)) return 1;
-  int len = code_point_length_impl(static_cast<char>(*begin));
+  int len = code_point_length_impl_2(static_cast<char>(*begin));

   // Compute the pointer to the next character early so that the next
   // iteration can start working on the next character. Neither Clang
   // nor GCC figure out this reordering on their own.
-  return len + !len;
+  return len;
 }

 // Return the result via the out param to workaround gcc bug 77539.

I think this logic deserves some explanation.
Hi! Sorry for the delay.
code_point_length_impl was initially implemented like this; it gets the length of a code point from the first character of that code point.
For valid code points it returns the length of the code point, and for invalid code points it returns a length of 0.
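A minimal standalone sketch of that original lookup, taken from the diff with FMT_CONSTEXPR swapped for plain constexpr so it compiles without the library. The helper length_or_one is a hypothetical name, used here only to show the len + !len step that code_point_length applied to the result.

```cpp
// Sketch only: mirrors code_point_length_impl from the diff, with
// FMT_CONSTEXPR replaced by constexpr so the snippet builds on its own.
constexpr int code_point_length_impl(char c) {
  // Indexed by the top 5 bits of the lead byte. The literal has 31 explicit
  // entries; index 31 reads the string's terminating '\0', meaning "invalid".
  return "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4"
      [static_cast<unsigned char>(c) >> 3];
}

// Hypothetical helper: the fix-up applied by the caller (0 becomes 1).
constexpr int length_or_one(char c) {
  return code_point_length_impl(c) + !code_point_length_impl(c);
}

static_assert(code_point_length_impl('a') == 1, "ASCII lead byte");
static_assert(code_point_length_impl('\xC3') == 2, "2-byte lead byte");
static_assert(code_point_length_impl('\x80') == 0, "continuation byte");
static_assert(length_or_one('\x80') == 1, "invalid is treated as length 1");
```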
When used in code_point_length, the length of invalid codepoints gets converted to 1, so we don't actually need to distinguish between valid and invalid codepoints for code_point_length.
As a result, we could write it like this:
Now we have 4 possible return values: [1, 2, 3, 4]. If we instead use [0, 1, 2, 3] and just add one, it looks like this now:
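The two intermediate steps described here can be sketched as follows. Both function names are hypothetical and do not appear in the PR; the tables are a reconstruction that assumes invalid lead bytes are simply given length 1.

```cpp
// Step 1 (hypothetical name): same lookup as before, but invalid lead
// bytes map to 1 instead of 0, so every entry is in [1, 4].
constexpr int length_one_based(char c) {
  return ("\1\1\1\1\1\1\1\1"   // 0x00-0x3F: ASCII
          "\1\1\1\1\1\1\1\1"   // 0x40-0x7F: ASCII
          "\1\1\1\1\1\1\1\1"   // 0x80-0xBF: invalid as a lead byte, now 1
          "\2\2\2\2\3\3\4\1")  // 0xC0-0xFF: 2, 3, 4 bytes; 0xF8 and up invalid
      [static_cast<unsigned char>(c) >> 3];
}

// Step 2 (hypothetical name): store length - 1, so every entry is in
// [0, 3], and add one after the lookup.
constexpr int length_zero_based_plus_one(char c) {
  return ("\0\0\0\0\0\0\0\0"
          "\0\0\0\0\0\0\0\0"
          "\0\0\0\0\0\0\0\0"
          "\1\1\1\1\2\2\3\0")
             [static_cast<unsigned char>(c) >> 3] + 1;
}

static_assert(length_one_based('\xE2') == 3, "3-byte lead byte");
static_assert(length_zero_based_plus_one('\xE2') == 3, "same result");
```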
Because this array only contains values 0..3, each value can be represented in 2 bits. With 32 entries, the entire array takes 32 × 2 = 64 bits, so it fits in a 64-bit integer.
We can get that integer like this:
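One way to compute the packed integer, as a sketch: the array and function names below are hypothetical, and the loop in a constexpr function assumes C++14 or later.

```cpp
#include <cstdint>

// Assumed table of (length - 1) for each of the 32 possible values of
// (lead byte >> 3). Invalid lead bytes are stored as 0, i.e. length 1.
constexpr unsigned char length_minus_one[32] = {
    0, 0, 0, 0, 0, 0, 0, 0,   // 0x00-0x3F
    0, 0, 0, 0, 0, 0, 0, 0,   // 0x40-0x7F
    0, 0, 0, 0, 0, 0, 0, 0,   // 0x80-0xBF
    1, 1, 1, 1, 2, 2, 3, 0};  // 0xC0-0xFF

// Entry i occupies bits [2*i, 2*i + 1] of the result.
constexpr std::uint64_t pack_two_bit_table() {
  std::uint64_t packed = 0;
  for (int i = 0; i < 32; ++i)
    packed |= static_cast<std::uint64_t>(length_minus_one[i]) << (2 * i);
  return packed;
}

static_assert(pack_two_bit_table() == 0x3a55000000000000ull,
              "packed table matches the constant used in the PR");
```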
This gives us 0x3a55000000000000.
That logic basically indexes into this integer, treating it as an array of 2-bit values, resulting in the final implementation:
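For reference, a standalone sketch of that final form next to the original lookup, both taken from the diff with FMT_CONSTEXPR swapped for constexpr. The checker agrees_for_all_bytes is a hypothetical helper (C++14 or later) that confirms the claimed equivalence with len + !len for every possible byte.

```cpp
// Original table lookup from the diff: returns 0 for invalid lead bytes.
constexpr int code_point_length_impl(char c) {
  return "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4"
      [static_cast<unsigned char>(c) >> 3];
}

// Packed form from the diff: shift the 64-bit constant by 2 bits per index,
// mask out one 2-bit entry, and add 1. The result is always in [1, 4].
constexpr int code_point_length_impl_2(char c) {
  return static_cast<int>(
             (0x3a55000000000000ull >>
              (2 * (static_cast<unsigned char>(c) >> 3))) &
             0x3) +
         1;
}

// Hypothetical helper: compares both forms over all 256 byte values.
constexpr bool agrees_for_all_bytes() {
  for (int b = 0; b < 256; ++b) {
    char c = static_cast<char>(b);
    int len = code_point_length_impl(c);
    if (code_point_length_impl_2(c) != len + !len) return false;
  }
  return true;
}

static_assert(agrees_for_all_bytes(), "packed form equals len + !len");
```

Because the mask limits the looked-up value to 0..3 before the + 1, the [1, 4] bound is visible from the expression itself rather than from the contents of a table.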
This has two benefits:
That being said, the code could definitely be less ugly.
I'll apply your recommendations with regard to merging code_point_length_impl into utf8_decode, and merging code_point_length_impl_2 into code_point_length. I haven't noticed any similar false positives in utf8_decode.
Makes sense, thanks for the explanation.