Invalid first byte bug #8
Comments
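Thanks for the Godbolt link since that was very helpful in figuring this out. It looks like the two issues observed in

    #include "utf8.h"
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Each test is exactly 4 bytes, the minimum the decoder reads. */
        char tests[][4] = {
            "\x8f\xbf\xc0\x00",  /* starts with a continuation byte */
            "\xbf\xc0\x00\x00",  /* starts with a continuation byte */
            "\xf0\x28\x00\x00",  /* 4-byte lead, then a non-continuation byte */
            "\xf4\x8f\xbf\xc0",  /* 4-byte lead, bad final continuation byte */
        };
        int ntests = sizeof(tests) / sizeof(*tests);
        for (int i = 0; i < ntests; i++) {
            int err;
            uint32_t cp;
            char *p = utf8_decode(tests[i], &cp, &err);
            /* %lx expects unsigned long, so cast accordingly */
            printf("U+%06lx len=%td err=%d\n", (unsigned long)cp, (p-tests[i]), err);
        }
    }

My output: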
The ASan issue was caused by the invalid use. The input must be no less than 4 bytes, and it's the caller's responsibility to pad the input if necessary. It looks like

The zero error problem is caused by merging the

Regarding the overall use of
The caller runs the decoder across the buffer, extracting code points, using the returned pointer for the next iteration, all while making no decisions about errors. Instead they're accumulated and checked later. After all, we're optimistic and expect no errors, so we march along as though nothing has gone wrong. Following the 4-byte rule means the input must simply have 3 bytes of zero padding. An example that decodes standard input:

    #define N (1 << 20)  // 1 MiB

    // input buffer
    char buf[N+3];
    char *end = buf + fread(buf, 1, N, stdin);
    end[0] = end[1] = end[2] = 0;

    // parsed code points
    int len = 0;
    uint32_t cp[N];

    int errors = 0;
    for (char *p = buf; p < end;) {
        int e;
        p = utf8_decode(p, cp+len++, &e);
        errors |= e;
    }

The input buffer has 3 extra bytes which do not accept input. The loop stops at the end of the input, not the end of the padding (i.e. no "-2 size"). When the loop is done,

Note: Rejection could mean switching to a slower decoder that permits replacement characters, a slow path for the unexpected case.

In retrospect I probably should have required callers to zero

    int errors = 0;
    for (char *p = buf; p < end;) {
        p = utf8_decode(p, cp+len++, &errors);
    }

If
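For readers who want to try the accumulate-then-check pattern without utf8.h at hand, here is a minimal sketch. `toy_decode` is a hypothetical stand-in written for this example: a plain branching decoder with a similar call shape, not the branchless implementation discussed in this thread. It reports errors as 0 or 1 rather than as a bit mask. `decode_all` is likewise an illustrative name, and it assumes the caller has placed 3 zero bytes after the input.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in decoder, NOT the utf8.h implementation. It may read up to
 * 4 bytes from s, always advances at least 1 byte, and sets *e to 0
 * (valid) or 1 (invalid). */
static const char *toy_decode(const char *s, uint32_t *c, int *e)
{
    static const int masks[] = {0, 0x7f, 0x1f, 0x0f, 0x07};
    static const uint32_t mins[] = {0, 0, 0x80, 0x800, 0x10000};
    const unsigned char *u = (const unsigned char *)s;
    int len = u[0] < 0x80 ? 1
            : (u[0] & 0xe0) == 0xc0 ? 2
            : (u[0] & 0xf0) == 0xe0 ? 3
            : (u[0] & 0xf8) == 0xf0 ? 4 : 0;
    if (len == 0) {              /* invalid first byte: always an error */
        *c = u[0];
        *e = 1;
        return s + 1;
    }
    uint32_t cp = u[0] & masks[len];
    int bad = 0;
    for (int i = 1; i < len; i++) {
        bad |= (u[i] & 0xc0) != 0x80;   /* not a continuation byte */
        cp = cp << 6 | (u[i] & 0x3f);
    }
    bad |= cp < mins[len];              /* overlong encoding */
    bad |= (cp >> 11) == 0x1b;          /* surrogate half */
    bad |= cp > 0x10ffff;               /* out of range */
    *c = cp;
    *e = bad;
    return s + len;
}

/* Decode n bytes of buf into out, which must hold at least n code points
 * (every code point consumes at least one byte). buf must have 3 readable
 * zero bytes after its first n bytes. Returns the number of code points;
 * *errors receives the OR of all per-character error values. */
static size_t decode_all(const char *buf, size_t n, uint32_t *out, int *errors)
{
    size_t len = 0;
    const char *end = buf + n;
    *errors = 0;
    for (const char *p = buf; p < end;) {
        int e;
        p = toy_decode(p, out + len++, &e);
        *errors |= e;   /* no decision here; checked once after the loop */
    }
    return len;
}
```

The caller inspects `*errors` once after `decode_all` returns and, if it is nonzero, rejects the whole buffer or falls back to a slower path. A sequence truncated at the end of the input runs into the zero padding, fails the continuation-byte check, and is reported like any other error.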
Thanks a lot, @skeeto, for the detailed reply. Looks like I introduced an error during some refactor.
This is the plan but we haven't done it yet. The current usage is indeed suboptimal.
If the first byte of a sequence is invalid, the function can return zero as the error code (which indicates valid UTF-8).
See fmt issue fmtlib/fmt#3038, PR fmtlib/fmt#3044, and https://godbolt.org/z/EbeEex4Gf for details.
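To make "invalid first byte" concrete, the sketch below classifies a byte by its bit pattern alone. `lead_len` and `count_invalid_first_bytes` are hypothetical helpers written for this illustration and are not taken from utf8.h or fmt. A regression test could use such a classifier to assert that the decoder's error code is nonzero for every byte it rejects.

```c
/* Sequence length implied by a first byte's bit pattern, or 0 when the
 * byte cannot start a sequence: a continuation byte (10xxxxxx) or a
 * byte of the form 11111xxx. Illustrative helper only. */
static int lead_len(unsigned char b)
{
    if (b < 0x80)           return 1;  /* 0xxxxxxx */
    if ((b & 0xe0) == 0xc0) return 2;  /* 110xxxxx */
    if ((b & 0xf0) == 0xe0) return 3;  /* 1110xxxx */
    if ((b & 0xf8) == 0xf0) return 4;  /* 11110xxx */
    return 0;                          /* invalid first byte */
}

/* Count how many of the 256 byte values are rejected by lead_len. */
static int count_invalid_first_bytes(void)
{
    int count = 0;
    for (int b = 0; b < 256; b++)
        count += lead_len((unsigned char)b) == 0;
    return count;
}
```

Bytes 0xc0, 0xc1, and 0xf5 through 0xf7 pass this pattern test even though they never occur in valid UTF-8. Rejecting them takes the separate overlong and out-of-range checks on the decoded value.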