[rtext] `GetCodepointNext()` does not return `?` as intended on invalid input #2996

chocolate42 · 2023-03-31T16:23:52Z

Please, before submitting a new issue verify and check:

I tested it on latest raylib version from master branch
I checked there is no similar issue already reported
My code has no errors or misuse of raylib

Issue description

GetCodepointNext says it returns '?' when an invalid codepoint is read, however at no point is this true. '?' is only returned when '?' is the input.

When the first byte is 128..191 or 248..255 it's invalid UTF8, but the code returns a 1 byte result verbatim instead of '?'
The 10xxxxxx of multibyte encodings are also not checked
If these were fixed and the default returned, the default return size of *codepointSize = 0; which appears to be the intention would break existing code, for example LoadCodepoints would endlessly loop on any input with an invalid codepoint. If *codepointSize = 1; were the default then no existing code should break, potentially multiple ? chars would be output per invalid codepoint which seems fine.
Overlong encodings are not checked, which strictly speaking isn't allowed per the UTF-8 spec. However that's not a big deal IMO

Environment

Untested other than in below example.

Code Example

#include <stdio.h>
//function as it currently is in raylib verbatim
int GetCodepointNext(const char *text, int *codepointSize)
{
    const char *ptr = text;
    int codepoint = 0x3f;       // Codepoint (defaults to '?')
    *codepointSize = 0;

    // Get current codepoint and bytes processed
    if (0xf0 == (0xf8 & ptr[0]))
    {
        // 4 byte UTF-8 codepoint
        codepoint = ((0x07 & ptr[0]) << 18) | ((0x3f & ptr[1]) << 12) | ((0x3f & ptr[2]) << 6) | (0x3f & ptr[3]);
        *codepointSize = 4;
    }
    else if (0xe0 == (0xf0 & ptr[0]))
    {
        // 3 byte UTF-8 codepoint */
        codepoint = ((0x0f & ptr[0]) << 12) | ((0x3f & ptr[1]) << 6) | (0x3f & ptr[2]);
        *codepointSize = 3;
    }
    else if (0xc0 == (0xe0 & ptr[0]))
    {
        // 2 byte UTF-8 codepoint
        codepoint = ((0x1f & ptr[0]) << 6) | (0x3f & ptr[1]);
        *codepointSize = 2;
    }
    else
    {
        // 1 byte UTF-8 codepoint
        codepoint = ptr[0];
        *codepointSize = 1;
    }

    return codepoint;
}
int main(int argc, char *argv) {
	char t=0xFF, t2[]={0xF0, 0x01, 0x01, 0x01}, t3[]={0xF0, 0x81, 0x81, 0x81};
	int len, codepoint;
	codepoint=GetCodepointNext(&t, &len);
	printf("Invalid codepoint (first byte) not replaced with '?': codepoint %d length %d\n", codepoint, len);
	codepoint=GetCodepointNext(t2, &len);
	printf("Invalid codepoint (multibyte) not replaced with '?': character %d length %d\n", codepoint, len);
	codepoint=GetCodepointNext(t3, &len);
	printf("Overlong codepoint not replaced with '?': character %d length %d\n", codepoint, len);
}

Which outputs this

Invalid codepoint (first byte) not replaced with '?': codepoint -1 length 1
Invalid codepoint (multibyte) not replaced with '?': character 4161 length 4
Overlong codepoint not replaced with '?': character 4161 length 4

The text was updated successfully, but these errors were encountered:

raysan5 · 2023-03-31T16:42:55Z

@chocolate42 Thanks for reporting, that function was a simplified version of GetCodepoint() and I did it wrong... it needs to be reviewed.

chocolate42 · 2023-03-31T16:55:38Z

If you don't want to validate fully then it looks like it's as simple as else if (!(0x80 & ptr[0])) or else if (0x00 == (0x80 & ptr[0])) for the last condition then fixing things like LoadCodepoints which don't check for a zero length result. I can do a PR to that effect if you like.

raysan5 changed the title ~~[rtext] GetCodepointNext does not return '?' as intended on invalid input~~ [rtext] GetCodepointNext() does not return ? as intended on invalid input Mar 31, 2023

chocolate42 mentioned this issue Apr 1, 2023

[rtext] Fix GetCodepointNext() to return default value on invalid input with size=0 #2997

Merged

raysan5 closed this as completed in #2997 Apr 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rtext] `GetCodepointNext()` does not return `?` as intended on invalid input #2996

[rtext] `GetCodepointNext()` does not return `?` as intended on invalid input #2996

chocolate42 commented Mar 31, 2023

raysan5 commented Mar 31, 2023

chocolate42 commented Mar 31, 2023

[rtext] GetCodepointNext() does not return ? as intended on invalid input #2996

[rtext] GetCodepointNext() does not return ? as intended on invalid input #2996

Comments

chocolate42 commented Mar 31, 2023

Issue description

Environment

Code Example

raysan5 commented Mar 31, 2023

chocolate42 commented Mar 31, 2023

[rtext] `GetCodepointNext()` does not return `?` as intended on invalid input #2996

[rtext] `GetCodepointNext()` does not return `?` as intended on invalid input #2996