Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rtext] GetCodepointNext() does not return ? as intended on invalid input #2996

Closed
3 tasks done
chocolate42 opened this issue Mar 31, 2023 · 2 comments · Fixed by #2997
Closed
3 tasks done

[rtext] GetCodepointNext() does not return ? as intended on invalid input #2996

chocolate42 opened this issue Mar 31, 2023 · 2 comments · Fixed by #2997

Comments

@chocolate42
Copy link
Contributor

Please, before submitting a new issue verify and check:

  • I tested it on latest raylib version from master branch
  • I checked there is no similar issue already reported
  • My code has no errors or misuse of raylib

Issue description

GetCodepointNext says it returns '?' when an invalid codepoint is read, however at no point is this true. '?' is only returned when '?' is the input.

  • When the first byte is 128..191 or 248..255 it's invalid UTF8, but the code returns a 1 byte result verbatim instead of '?'
  • The 10xxxxxx of multibyte encodings are also not checked
  • If these were fixed and the default returned, the default return size of *codepointSize = 0; which appears to be the intention would break existing code, for example LoadCodepoints would endlessly loop on any input with an invalid codepoint. If *codepointSize = 1; were the default then no existing code should break, potentially multiple ? chars would be output per invalid codepoint which seems fine.
  • Overlong encodings are not checked, which strictly speaking isn't allowed per the UTF-8 spec. However that's not a big deal IMO

Environment

Untested other than in below example.

Code Example

#include <stdio.h>
//function as it currently is in raylib verbatim
int GetCodepointNext(const char *text, int *codepointSize)
{
    const char *ptr = text;
    int codepoint = 0x3f;       // Codepoint (defaults to '?')
    *codepointSize = 0;

    // Get current codepoint and bytes processed
    if (0xf0 == (0xf8 & ptr[0]))
    {
        // 4 byte UTF-8 codepoint
        codepoint = ((0x07 & ptr[0]) << 18) | ((0x3f & ptr[1]) << 12) | ((0x3f & ptr[2]) << 6) | (0x3f & ptr[3]);
        *codepointSize = 4;
    }
    else if (0xe0 == (0xf0 & ptr[0]))
    {
        // 3 byte UTF-8 codepoint */
        codepoint = ((0x0f & ptr[0]) << 12) | ((0x3f & ptr[1]) << 6) | (0x3f & ptr[2]);
        *codepointSize = 3;
    }
    else if (0xc0 == (0xe0 & ptr[0]))
    {
        // 2 byte UTF-8 codepoint
        codepoint = ((0x1f & ptr[0]) << 6) | (0x3f & ptr[1]);
        *codepointSize = 2;
    }
    else
    {
        // 1 byte UTF-8 codepoint
        codepoint = ptr[0];
        *codepointSize = 1;
    }

    return codepoint;
}
int main(int argc, char *argv) {
	char t=0xFF, t2[]={0xF0, 0x01, 0x01, 0x01}, t3[]={0xF0, 0x81, 0x81, 0x81};
	int len, codepoint;
	codepoint=GetCodepointNext(&t, &len);
	printf("Invalid codepoint (first byte) not replaced with '?': codepoint %d length %d\n", codepoint, len);
	codepoint=GetCodepointNext(t2, &len);
	printf("Invalid codepoint (multibyte) not replaced with '?': character %d length %d\n", codepoint, len);
	codepoint=GetCodepointNext(t3, &len);
	printf("Overlong codepoint not replaced with '?': character %d length %d\n", codepoint, len);
}

Which outputs this

Invalid codepoint (first byte) not replaced with '?': codepoint -1 length 1
Invalid codepoint (multibyte) not replaced with '?': character 4161 length 4
Overlong codepoint not replaced with '?': character 4161 length 4
@raysan5 raysan5 changed the title [rtext] GetCodepointNext does not return '?' as intended on invalid input [rtext] GetCodepointNext() does not return ? as intended on invalid input Mar 31, 2023
@raysan5
Copy link
Owner

raysan5 commented Mar 31, 2023

@chocolate42 Thanks for reporting, that function was a simplified version of GetCodepoint() and I did it wrong... it needs to be reviewed.

@chocolate42
Copy link
Contributor Author

If you don't want to validate fully then it looks like it's as simple as else if (!(0x80 & ptr[0])) or else if (0x00 == (0x80 & ptr[0])) for the last condition then fixing things like LoadCodepoints which don't check for a zero length result. I can do a PR to that effect if you like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants