Invalid first byte bug #8

phprus · 2022-08-24T14:41:27Z

If first byte of the sequence is invalid then function can return zero as error code (valid utf-8).

Look fmt issue fmtlib/fmt#3038 , PR fmtlib/fmt#3044 and https://godbolt.org/z/EbeEex4Gf for details.

skeeto · 2022-08-24T16:50:52Z

Thanks for the Godbolt link since that was very helpful in figuring this out. It looks like the two issues observed in fmt are due to invalid use and a defective local modification. I'm unable to reproduce issues on my original:

#include "utf8.h"
#include <stdio.h>

int main(void)
{
    char tests[][4] = {
        "\x8f\xbf\xc0\x00",
        "\xbf\xc0\x00\x00",
        "\xf0\x28\x00\x00",
        "\xf4\x8f\xbf\xc0",
    };
    int ntests = sizeof(tests) / sizeof(*tests);
    for (int i = 0; i < ntests; i++) {
        int err;
        uint32_t cp;
        char *p = utf8_decode(tests[i], &cp, &err);
        printf("U+%06lx len=%td err=%d\n", (long)cp, (p-tests[i]), err);
    }
}

My output:

U+03f000 len=1 err=70
U+000000 len=1 err=90
U+028000 len=4 err=42
U+10ffc0 len=4 err=1

The ASan issue was caused by the invalid use. The input must be no less than 4 bytes, and it's the caller's responsibility to pad the input if necessary. It looks like fmt handles this part correctly. The returned pointer may point into or one past this padding. This is a subtle detail, and it's why they were observing a size of -2. Yes, this is an awkward interface, but it's paying for the performance gains if used appropriately (more on that below).

The zero error problem is caused by merging the len computation with advancing the pointer, via fmt::detail::code_point_length. It's important that the pointer always advances, so it adds len + !len. However, the domain of len must include zero for table lookups into shiftc, mins, and shifte. By forcing it non-zero, the error checks were broken. I strongly suggest the fmt folks run their modified version against my quite-thorough test suite to see if any other error checks are still broken.

Regarding the overall use of utf8_decode in fmt: It makes little sense to use a branchless decoder and then put branches around the call site. The decoder is written for the following use case:

Large buffers (at least thousands of code points)
Probably containing non-ASCII code points (length varies unpredictably)
Probably valid UTF-8 (optimistic)
Invalid input will be completely rejected (no replacement characters)

The caller runs the decoder across the buffer, extracting code points, using the returned pointer for the next iteration — all while making no decisions about errors. Instead they're accumulated and checked later. After all, we're optimistic and expect no errors, so we march along as though nothing has gone wrong. Following the 4-byte rule means the input must simply have 3 bytes of zero padding.

An example that decodes standard input:

#define N (1 << 20)  // 1 MiB

// input buffer
char buf[N+3];
char *end = buf + fread(buf, 1, N, stdin);
end[0] = end[1] = end[2] = 0;

// parsed code points
int len = 0;
uint32_t cp[N];

int errors = 0;
for (char *p = buf; p < end;) {
    int e;
    p = utf8_decode(p, cp+len++, &e);
    errors |= e;
}

The input buffer has 3 extra bytes which do not accept input. The loop stops at the end of the input, not the end of the padding (i.e. no "-2 size"). When the loop is done, errors will indicate if anything went wrong during decoding. It will not say where or what since invalid input will be rejected outright (4), not massaged or recovered, and it didn't waste time in the loop inspecting the error indicator.

Note: Rejection could mean switching to a slower decoder that permits replacement characters — a slow path for the unexpected case.

In retrospect I probably should have required callers to zero e, then only |= in the function instead of the initial assignment. That simplifies the loop nicely:

int errors = 0;
for (char *p = buf; p < end;) {
    p = utf8_decode(p, cp+len++, &errors);
}

If fmt wants to use my decoder, they ought to have a fast path that looks like this when the string is above some length threshold, so that they actually get the performance benefits. Since there's no efficient way to append padding to the input they could stop short and decode the last few bytes slowly.

phprus · 2022-08-24T18:08:26Z

@skeeto
Thank you for your comment!
I fixed bugs in fmt's utf8_decode implementation and added tests.

cc @vitaut

vitaut · 2022-08-26T20:05:10Z

Thanks a lot, @skeeto, for the detailed reply. Looks like I introduced an error during some refactor.

If fmt wants to use my decoder, they ought to have a fast path that looks like this when the string is above some length threshold, so that they actually get the performance benefits.

This is the plan but we haven't done it yet. The current usage is indeed suboptimal.

phprus mentioned this issue Aug 24, 2022

Fix decoder on broken utf8 sequences. fmtlib/fmt#3044

Merged

This was referenced Aug 24, 2022

Fix bugs in utf8 decoder fmtlib/fmt#3056

Merged

Result length is wrong when some arguments use negative char values fmtlib/fmt#3058

Closed

phprus closed this as completed Sep 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid first byte bug #8

Invalid first byte bug #8

phprus commented Aug 24, 2022

skeeto commented Aug 24, 2022

phprus commented Aug 24, 2022

vitaut commented Aug 26, 2022

Invalid first byte bug #8

Invalid first byte bug #8

Comments

phprus commented Aug 24, 2022

skeeto commented Aug 24, 2022

phprus commented Aug 24, 2022

vitaut commented Aug 26, 2022