-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix decoder on broken utf8 sequences. #3044
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
*e |= ((uchar(s[0]) & prefix_masks[len]) != | ||
uchar((prefix_masks[len] << 1) & 0xFF)); // first byte correct? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is needed because utf8_decode
already gives an error for "\xf0\x28...".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without this check failed at sequence \xf4\x8f\xbf\xc0
.
/Users/phprus/Devel/fmt/test/ranges-test.cc:393: Failure
Expected equality of these values:
fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"})
Which is: "[\"\\xf4\\x0f\xBF\\xc0\"]"
"[\"\\xf4\\x8f\\xbf\\xc0\"]"
utf8_decode
return zero in e
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test case then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nevermind, I missed the code snippet which shows it's already there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test already present:
Lines 383 to 384 in 541cd21
EXPECT_EQ(fmt::format("{}", vec{"\xf4\x8f\xbf\xc0"}), | |
"[\"\\xf4\\x8f\\xbf\\xc0\"]"); |
This patch fix two error.
First error fixed by add condition error ? 1 : to_unsigned(end - buf_ptr)
Second - by add first byte check.
The first error masked the second on the sequence \xf4\x8f\xbf\xc0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS the current version of utf8_decode
already returns an error for "\xf4\x8f\xbf\xc0" too: https://godbolt.org/z/E194qvshK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Step 1:
"\xf4\x8f\xbf\xc0"
- is not valid utf-8, print \xf4
, remove \xf4
from sequence.
Step 2:
"\x8f\xbf\xc0\0"
- is not valid utf-8, but utf8_decode return no error and invalid codepoint (15, not 143).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks.
@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) { | |||
auto error = 0; | |||
auto end = utf8_decode(buf_ptr, &cp, &error); | |||
bool result = f(error ? invalid_code_point : cp, | |||
string_view(ptr, to_unsigned(end - buf_ptr))); | |||
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we saw error here "end - buf_ptr" is < 0, but I dont follow how that could happen with invalid code point.
include/fmt/format.h
Outdated
@@ -643,7 +646,7 @@ FMT_CONSTEXPR void for_each_codepoint(string_view s, F f) { | |||
auto error = 0; | |||
auto end = utf8_decode(buf_ptr, &cp, &error); | |||
bool result = f(error ? invalid_code_point : cp, | |||
string_view(ptr, to_unsigned(end - buf_ptr))); | |||
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr))); | |||
return result ? end : nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you help me to understand in the example of "\xF0\x28", is this end still buf_ptr+4 even tho there is an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzzzz
No. On invalid utf-8 sequence, processing is one byte at a time.
Please describe your testing process.
My environment:
macOS 12.5.1
Apple clang version 13.1.6 (clang-1316.0.21.2.5)
Test steps:
CXXFLAGS="-fsanitize=address" cmake -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_STANDARD=20 -DFMT_PEDANTIC=ON -DFMT_WERROR=ON /path/to/patched/fmt
make
./bin/ranges-test
The test shows no errors.
String "\xF0\x28" present in updated test.
Without this PR:
ERROR: AddressSanitizer: negative-size-param: (size=-2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might have overlooked something, but for "\xF0\x28\0\0anything" the utf8_decode_new/decode would return 4 (as expected), when the error is set to 42. Is that "4" gonna skip the "\x28\0\0"?
same test: https://godbolt.org/z/EbeEex4Gf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzzzz
If error != 0
then return value of the utf8_decode
function is not used and one byte is skipped (pushed to output in "\xVV" format).
I added a test for the string "\xf0\x28\0\0anything"
and it finished without ASAN error and with expected results: \"\\xf0(\\x00\\x00anything\"
.
Please provide full test source for ASAN errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for working on this!
We got the ASAN error by printing a protobuf debugstring() which is arbitrary binary bytes, unfortunately I can't show the protobuf to you.
I am a little bit lost on how the printing works, could you help me to understand where is the "skip one byte on error" pls? per this piece of code:
for (auto end = p + s.size() - block_size + 1; p < end;) {
p = decode(p, p);
if (!p) return;
}
it seems p would advance 4 bytes when encode(p, p) is called on "\xof\x28".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzzzz
Please replace
auto decode = [f](const char* buf_ptr, const char* ptr) {
auto cp = uint32_t();
auto error = 0;
auto end = utf8_decode(buf_ptr, &cp, &error);
bool result = f(error ? invalid_code_point : cp,
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
return result ? end : nullptr;
};
with:
auto decode = [f](const char* buf_ptr, const char* ptr) {
auto cp = uint32_t();
auto error = 0;
auto end = utf8_decode(buf_ptr, &cp, &error);
bool result = f(error ? invalid_code_point : cp,
string_view(ptr, error ? 1 : to_unsigned(end - buf_ptr)));
return result ? (error ? buf_ptr + 1 : end) : nullptr;
};
and run test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why you didn't replace that in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevenzzzz
Did this replacement fix the bug? If yes, I will update the PR.
I didn't see this bug before.
Signed-off-by: Vladislav Shchapov <[email protected]>
Merged, thanks! @phprus, could you report this to https://github.com/skeeto/branchless-utf8 where this implementation originates from? |
- Update from version 9.0.0 to 9.1.0 - Update of rootfile - Changelog 9.1.0 - 2022-08-27 * ``fmt::formatted_size`` now works at compile time `#3026 <https://github.com/fmtlib/fmt/pull/3026>`_ For example (`godbolt <https://godbolt.org/z/1MW5rMdf8>`__): .. code:: c++ #include <fmt/compile.h> int main() { using namespace fmt::literals; constexpr size_t n = fmt::formatted_size("{}"_cf, 42); fmt::print("{}\n", n); // prints 2 } * Fixed handling of invalid UTF-8 `#3038 <https://github.com/fmtlib/fmt/pull/3038>`_, `#3044 <https://github.com/fmtlib/fmt/pull/3044>`_, `#3056 <https://github.com/fmtlib/fmt/pull/3056>`_ * Improved Unicode support in ``ostream`` overloads of ``print`` `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_, `#3001 <https://github.com/fmtlib/fmt/pull/3001>`_, `#3025 <https://github.com/fmtlib/fmt/pull/3025>`_ * Fixed handling of the sign specifier in localized formatting on systems with 32-bit ``wchar_t`` `#3041 <https://github.com/fmtlib/fmt/issues/3041>`_). * Added support for wide streams to ``fmt::streamed`` `#2994 <https://github.com/fmtlib/fmt/pull/2994>`_ * Added the ``n`` specifier that disables the output of delimiters when formatting ranges `#2981 <https://github.com/fmtlib/fmt/pull/2981>`_, `#2983 <https://github.com/fmtlib/fmt/pull/2983>`_ For example (`godbolt <https://godbolt.org/z/roKqGdj8c>`__): .. code:: c++ #include <fmt/ranges.h> #include <vector> int main() { auto v = std::vector{1, 2, 3}; fmt::print("{:n}\n", v); // prints 1, 2, 3 } * Worked around problematic ``std::string_view`` constructors introduced in C++23 `#3030 <https://github.com/fmtlib/fmt/issues/3030>`_, `#3050 <https://github.com/fmtlib/fmt/issues/3050>`_ * Improve handling (exclusion) of recursive ranges `#2968 <https://github.com/fmtlib/fmt/issues/2968>`_, `#2974 <https://github.com/fmtlib/fmt/pull/2974>`_ * Improved error reporting in format string compilation `#3055 <https://github.com/fmtlib/fmt/issues/3055>`_ * Improved the implementation of `Dragonbox <https://github.com/jk-jeon/dragonbox>`_, the algorithm used for the default floating-point formatting `#2984 <https://github.com/fmtlib/fmt/pull/2984>`_ * Fixed issues with floating-point formatting on exotic platforms. * Improved the implementation of chrono formatting `#3010 <https://github.com/fmtlib/fmt/pull/3010>`_ * Improved documentation `#2966 <https://github.com/fmtlib/fmt/pull/2966>`_, `#3009 <https://github.com/fmtlib/fmt/pull/3009>`_, `#3020 <https://github.com/fmtlib/fmt/issues/3020>`_, `#3037 <https://github.com/fmtlib/fmt/pull/3037>`_ * Improved build configuration `#2991 <https://github.com/fmtlib/fmt/pull/2991>`_, `#2995 <https://github.com/fmtlib/fmt/pull/2995>`_, `#3004 <https://github.com/fmtlib/fmt/issues/3004>`_, `#3007 <https://github.com/fmtlib/fmt/pull/3007>`_, `#3040 <https://github.com/fmtlib/fmt/pull/3040>`_ * Fixed various warnings and compilation issues `#2969 <https://github.com/fmtlib/fmt/issues/2969>`_, `#2971 <https://github.com/fmtlib/fmt/pull/2971>`_, `#2975 <https://github.com/fmtlib/fmt/issues/2975>`_, `#2982 <https://github.com/fmtlib/fmt/pull/2982>`_, `#2985 <https://github.com/fmtlib/fmt/pull/2985>`_, `#2988 <https://github.com/fmtlib/fmt/issues/2988>`_, `#3000 <https://github.com/fmtlib/fmt/issues/3000>`_, `#3006 <https://github.com/fmtlib/fmt/issues/3006>`_, `#3014 <https://github.com/fmtlib/fmt/issues/3014>`_, `#3015 <https://github.com/fmtlib/fmt/issues/3015>`_, `#3021 <https://github.com/fmtlib/fmt/pull/3021>`_, `#3023 <https://github.com/fmtlib/fmt/issues/3023>`_, `#3024 <https://github.com/fmtlib/fmt/pull/3024>`_, `#3029 <https://github.com/fmtlib/fmt/pull/3029>`_, `#3043 <https://github.com/fmtlib/fmt/pull/3043>`_, `#3052 <https://github.com/fmtlib/fmt/issues/3052>`_, `#3053 <https://github.com/fmtlib/fmt/pull/3053>`_, `#3054 <https://github.com/fmtlib/fmt/pull/3054>`_ Signed-off-by: Adolf Belka <[email protected]> Reviewed-by: Michael Tremer <[email protected]>
Issue: #3038
utf8_decode
did not check if the first byte is correct.for_each_codepoint
did not check for overflow on broken codepoint.cc @stevenzzzz @dyfrgi