Handling strings which might not be proper UTF-8 in a better way... #3059

omascia · 2022-08-26T00:04:45Z

How about adding these 3 lines (or a better rewrite of them) to the function code_point_length()?

  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

Based on 9.0 source code, this becomes:

template <typename Char>
FMT_CONSTEXPR auto code_point_length(const Char* begin) -> int {
  if (const_check(sizeof(Char) != 1)) return 1;
  auto lengths =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len >= 2 && (static_cast<unsigned char>(*(begin + 1)) >> 6) != 0x2) len = 1;
  if (len >= 3 && (static_cast<unsigned char>(*(begin + 2)) >> 6) != 0x2) len = 1;
  if (len == 4 && (static_cast<unsigned char>(*(begin + 3)) >> 6) != 0x2) len = 1;

  // Compute the pointer to the next character early so that the next
  // iteration can start working on the next character. Neither Clang
  // nor GCC figure out this reordering on their own.
  return len + !len;
}

This simply means that a byte which should introduce a 2-, 3-, or 4-byte UTF-8 sequence is only counted as such if the right number of following bytes are indeed UTF-8 trailing (continuation) bytes; otherwise it is counted as a single byte.
If the library is used with char strings in another encoding, say ISO8859, it won't start miscounting lengths when padding.
And it still works properly for correct UTF-8 strings:

		string iso{ -23, 99, 111, 108, 101 };  // "école" (ISO8859)
		string utf{ -61, -87, 99, 111, 108, 101 }; // "école" (UTF-8)
		string asc{ 101, 99, 111, 108, 101 };  // "ecole" (ASCII)

		string out_iso = fmt::format("{:<10}", iso);  // size() == 10 (correct)
		string out_utf = fmt::format("{:<10}", utf);  // size() == 11 (correct)
		string out_asc = fmt::format("{:<10}", asc);  // size() == 10 (correct)

Of course, it is possible to construct single-byte character set sequences which would "look like" valid UTF-8, but such combinations are generally unusual in real text.
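One caveat with the three added lines is that they read begin + 1 to begin + 3 without knowing where the buffer ends. The following is only a sketch of a bounds-aware variant, assuming the caller can pass an end pointer. The names is_continuation_byte, checked_code_point_length and count_code_points are hypothetical and are not part of {fmt}; the lookup table is the one quoted above from 9.0.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of {fmt}): true if c has the form 10xxxxxx,
// i.e. it is a UTF-8 continuation byte.
inline bool is_continuation_byte(char c) {
  return (static_cast<unsigned char>(c) >> 6) == 0x2;
}

// Hypothetical bounds-aware variant of the proposed check.
// Returns how many bytes to consume for the code point starting at begin.
// Falls back to 1 when the lead byte is invalid, the sequence is truncated,
// or an expected continuation byte is missing.
inline int checked_code_point_length(const char* begin, const char* end) {
  assert(begin < end);  // precondition: at least one byte is available
  // Same table as in the 9.0 source: indexed by the top 5 bits of the lead
  // byte. The string's null terminator supplies the entry for 0xF8..0xFF.
  static const char lengths[] =
      "\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\1\0\0\0\0\0\0\0\0\2\2\2\2\3\3\4";
  int len = lengths[static_cast<unsigned char>(*begin) >> 3];
  if (len == 0) return 1;           // stray continuation or invalid lead byte
  if (end - begin < len) return 1;  // truncated sequence: never read past end
  for (int i = 1; i < len; ++i)
    if (!is_continuation_byte(begin[i])) return 1;
  return len;
}

// Counts the units the function above would step over in s. This counts
// code points (or single stray bytes), not display width.
inline std::size_t count_code_points(const std::string& s) {
  std::size_t n = 0;
  const char* p = s.data();
  const char* end = p + s.size();
  while (p != end) {
    p += checked_code_point_length(p, end);
    ++n;
  }
  return n;
}
```

With this sketch, the ISO8859, UTF-8 and ASCII spellings of the word above all count as 5 units, and a lone lead byte at the end of a string counts as 1 instead of triggering a read past the buffer. As with the original three lines, it only checks the shape of the bytes, so a single-byte text that happens to look like UTF-8 is still counted as UTF-8.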

vitaut · 2022-08-27T15:42:52Z

The code you are referring to no longer exists in the current master, but what you are looking for might have already been addressed by #3056.

omascia · 2022-08-27T16:38:15Z

I was working with "latest" from https://fmt.dev, which was 9.0.0.
I downloaded 9.1.0, released about an hour ago or less (not yet on https://fmt.dev, obviously), and I can confirm that with the way the code has been restructured since 9.0.0, I no longer face the issue for which I posted the above fix.

vitaut · 2022-08-27T16:40:55Z

Great, thanks for checking.

vitaut closed this as completed Aug 27, 2022

vitaut added the question label Aug 27, 2022
