How about adding these 3 lines (or a better rewriting of them) to function code_point_length()?
if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;
Based on 9.0 source code, this becomes:
template <typename Char>
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
  if (const_check(sizeof(Char) != 1)) return 1;
  auto lengths =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;
  // Compute the pointer to the next character early so that the next
  // iteration can start working on the next character. Neither Clang
  // nor GCC figure out this reordering on their own.
  return len + !len;
}
This simply means that a lead byte which should introduce a 2-, 3-, or 4-byte UTF-8 sequence is only counted as such if the right number of following bytes are indeed UTF-8 continuation bytes; otherwise it is counted as a single byte.
If the library is used with char strings in a single-byte encoding such as ISO 8859, it won't start miscounting lengths when padding.
And it still works properly for correct UTF-8 strings.
Of course, it is possible to construct byte sequences in a single-byte character set that "look like" valid UTF-8, but such combinations are generally unusual in real text.
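To make the behaviour described above concrete, here is a minimal standalone sketch of the same idea. It is not fmt's code: the names is_continuation, checked_code_point_length and count_code_points are hypothetical, and unlike the snippet above it takes a std::string so it can also stop at the end of the buffer instead of reading past it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of fmt): true for bytes of the form 10xxxxxx.
inline bool is_continuation(char c) {
  return (static_cast<unsigned char>(c) >> 6) == 0x2;
}

// Hypothetical helper: byte length of the code point starting at s[i].
// Falls back to 1 when the lead byte is invalid, when the string ends too
// early, or when an expected continuation byte is missing.
inline int checked_code_point_length(const std::string& s, std::size_t i) {
  // Same lookup table as in the snippet above, indexed by the top 5 bits.
  static const char lengths[] =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(s[i]) >> 3];
  if (len == 0) return 1;  // stray continuation byte or invalid lead byte
  if (i + static_cast<std::size_t>(len) > s.size()) return 1;  // truncated
  for (int k = 1; k < len; ++k) {
    if (!is_continuation(s[i + static_cast<std::size_t>(k)])) return 1;
  }
  return len;
}

// Hypothetical helper: number of "characters" a padding routine would see.
inline std::size_t count_code_points(const std::string& s) {
  std::size_t n = 0;
  std::size_t i = 0;
  while (i < s.size()) {
    i += static_cast<std::size_t>(checked_code_point_length(s, i));
    ++n;
  }
  return n;
}
```

With this sketch, the UTF-8 string "caf\xC3\xA9" and the ISO 8859-1 string "caf\xE9" both count as 4 characters. The caveat also shows up: the two bytes "\xC3\xA9" count as 1 whether they were meant as UTF-8 "é" or as the two ISO 8859-1 characters "Ã©".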
I was working with "latest" from https://fmt.dev, which was 9.0.0.
I downloaded 9.1.0, released less than an hour ago (obviously not yet on https://fmt.dev), and I can confirm that, with the way the code has been restructured since 9.0.0, I no longer face the issue for which I posted the above fix.