Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify utf8.c #102424

Closed
wants to merge 2 commits into from
Closed

Simplify utf8.c #102424

wants to merge 2 commits into from

Conversation

am11
Copy link
Member

@am11 am11 commented May 19, 2024

diet version 🤸

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label May 19, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label May 19, 2024
@am11 am11 added area-PAL-coreclr and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels May 19, 2024
@am11 am11 force-pushed the feature/minipal/utf8-v2 branch from 9fc22ee to 020c8ed Compare May 21, 2024 05:46
@am11 am11 marked this pull request as ready for review May 21, 2024 05:47
@am11
Copy link
Member Author

am11 commented May 21, 2024

cc @AaronRobinsonMSFT, re: #100451 (comment) - this mainly reduces the assembly code with cl.exe -O2 (before 1091 lines, now 369 for minipal_convert_utf8_to_utf16). Could you please run your benchmark to see how it compares against the MultiByteToWideChar/WideCharToMultiChar win32 APIs? Hopefully it will perform better now (at least better than before 😅).

}

// May have a problem if we have to flush
if (ch != 0)
if (sourceIndex < sourceLength && (destinationIndex > destinationLength || (destinationIndex == destinationLength && sourceIndex + 1 < sourceLength)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lambdageek, just an FYI, I had to change this condition from

if (sourceIndex < sourceLength && (destinationIndex >= destinationLength)

for mono because this case was failing:

IMHO, the reflection stack of mono should use the same expectation from giconv about the expected lengths as coreclr's reflection stack makes from coreclr PAL's unicode.cpp, so this standalone component doesn't need to handle more edge cases than necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@fanyang-mono fanyang-mono May 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test that @am11 mentioned was testing the custom attribute value string "\uDFFF". For mono, it went through minipal_convert_utf8_to_utf16 to decode this string "\xed\xbf\xbf". It seems to me that CoreCLR went through a different mechanism to get the value of the custom attribute value.

Additionally, what Mono is doing in g_utf8_to_utf16_impl of giconv.c isn't a lot different than what is done in MultiByteToWideChar of unicode.cpp. If I understand correctly, the only difference is that in MultiByteToWideChar, it makes source length and destination length one byte longer than it is. Doing this in mono won't stop this check ((sourceIndex < sourceLength) && (destinationIndex >= destinationLength)) either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fanyang-mono, coreclr ultimately uses the same implementation (couple of wrappers later), but in reflection stack (metasig.h etc.), there it compensates for those off-by-one, zero-length kind of scenarios to match the expectation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@am11 Do you mind pointing me to the CoreCLR source code where they handles off-by-one, zero-length kind of scenarios?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fanyang-mono, good question. 😅 It is a bit tricky to discern the exact place without actually attaching a conditional debugger (or some tricky way), but I was debugging the similar tests for coreclr last year #85558 (comment) which points to

RegMeta::GetTypeDefProps(

One thing I do remember; when all PAL tests were passing and 99% of managed tests were passingm those edge cases covered in reflection tests were still failing, which is why I had to add a similar special condition in unicode.cpp -> utf8.c conversion to balance varied expectations of mono and coreclr + reflection simultaneously.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably fine having this a little bit longer check, because to make mono align with CoreCLR on this is non-trivial work.

Copy link
Member

@AaronRobinsonMSFT AaronRobinsonMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will take a look.

src/native/minipal/utf8.c Show resolved Hide resolved
src/native/minipal/utf8.c Outdated Show resolved Hide resolved
src/native/minipal/utf8.c Outdated Show resolved Hide resolved
@jkotas
Copy link
Member

jkotas commented May 22, 2024

Could you please run your benchmark to see how it compares against the MultiByteToWideChar/WideCharToMultiChar win32 APIs?

int main()
{
    char* p = (char*)malloc(200);
    memset(p, 'a', 200);

    for (;;)
    {
        int start = GetTickCount();
        for (int i = 0; i < 10000000; i++)
#if 0
            minipal_get_length_utf8_to_utf16(p, 200, 0);
#else
            MultiByteToWideChar(CP_UTF8, 0, p, 200, NULL, 0);
#endif
        int end = GetTickCount();
        printf("%d\n", end - start);
    }
}

MultiByteToWideChar: 220ms per iteration
minipal_get_length_utf8_to_utf16: 1370ms per iteration

@am11
Copy link
Member Author

am11 commented May 22, 2024

Ran some google benchmarks: https://gist.github.com/am11/8d949f6c112aaab1575e85f2f350c59e

Long length: 140
Short length: 20

main:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_minipal_get_length_utf16_to_utf8/ASCII_Short          46.8 ns         46.3 ns     15200142 items_per_second=388.583M/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Short       49.2 ns         48.4 ns     14081759 items_per_second=288.974M/s
BM_minipal_get_length_utf16_to_utf8/ASCII_Long           48.5 ns         47.9 ns     14615732 items_per_second=2.92203G/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Long         167 ns          164 ns      4267226 items_per_second=854.207M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Short             66.7 ns         65.7 ns     10535346 items_per_second=273.86M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Short          98.3 ns         97.0 ns      7215230 items_per_second=144.395M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Long               135 ns          133 ns      5226338 items_per_second=1.05422G/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Long            565 ns          556 ns      1261443 items_per_second=251.616M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Short          46.5 ns         45.7 ns     15311624 items_per_second=393.657M/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Short       56.6 ns         55.8 ns     12379302 items_per_second=573.563M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Long           47.7 ns         47.2 ns     15069416 items_per_second=2.96665G/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Long         118 ns          116 ns      5989100 items_per_second=1.20323G/s
BM_minipal_convert_utf8_to_utf16/ASCII_Short             61.5 ns         60.9 ns     11373792 items_per_second=295.365M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Short           126 ns          124 ns      5599462 items_per_second=257.461M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Long               109 ns          108 ns      6452981 items_per_second=1.30072G/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Long            260 ns          258 ns      2722687 items_per_second=542.822M/s

PR:

-------------------------------------------------------------------------------------------------------------
Benchmark                                                   Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
BM_minipal_get_length_utf16_to_utf8/ASCII_Short          46.3 ns         46.2 ns     15294729 items_per_second=389.687M/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Short       46.7 ns         46.7 ns     15014757 items_per_second=300.068M/s
BM_minipal_get_length_utf16_to_utf8/ASCII_Long           89.1 ns         88.2 ns      7965951 items_per_second=1.58799G/s
BM_minipal_get_length_utf16_to_utf8/NonASCII_Long         167 ns          167 ns      4209817 items_per_second=840.11M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Short             57.9 ns         57.8 ns     12086261 items_per_second=311.416M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Short          80.0 ns         79.9 ns      8745736 items_per_second=175.196M/s
BM_minipal_convert_utf16_to_utf8/ASCII_Long               155 ns          154 ns      4546517 items_per_second=907.604M/s
BM_minipal_convert_utf16_to_utf8/NonASCII_Long            489 ns          488 ns      1427791 items_per_second=286.914M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Short          47.9 ns         47.1 ns     14990417 items_per_second=382.482M/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Short       65.7 ns         65.6 ns     10645254 items_per_second=487.782M/s
BM_minipal_get_length_utf8_to_utf16/ASCII_Long            131 ns          130 ns      5443066 items_per_second=1.07377G/s
BM_minipal_get_length_utf8_to_utf16/NonASCII_Long         180 ns          179 ns      3938780 items_per_second=781.547M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Short             73.2 ns         72.9 ns      9594561 items_per_second=246.863M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Short           109 ns          108 ns      6488872 items_per_second=296.073M/s
BM_minipal_convert_utf8_to_utf16/ASCII_Long               282 ns          281 ns      2457149 items_per_second=497.751M/s
BM_minipal_convert_utf8_to_utf16/NonASCII_Long            298 ns          296 ns      2353052 items_per_second=472.879M/s

BM_minipal_get_length_utf16_to_utf8/ASCII_Long has gotten the worse performance regression. I tried fast tracking ASCII with:

    // Use vectorized operations for ASCII characters

    typedef uint16_t v4ui __attribute__ ((vector_size (8)));

    for (; sourceIndex < sourceLength && destinationIndex < destinationLength; sourceIndex += 4, destinationIndex += 4)
    {
        v4ui data = *(v4ui*)&source[sourceIndex];
        v4ui mask4 = (v4ui){0x007F, 0x007F, 0x007F, 0x007F};
        v4ui comparison = data <= mask4;
        size_t remainingLength = sourceLength - sourceIndex;

        if (remainingLength > 4 && __builtin_expect(comparison[0] & comparison[1] & comparison[2] & comparison[3], 1))
        {
            destination[destinationIndex] = (char)data[0];
            destination[destinationIndex + 1] = (char)data[1];
            destination[destinationIndex + 2] = (char)data[2];
            destination[destinationIndex + 3] = (char)data[3];
            continue;
        }
        else if (remainingLength <= 4)
        {
            // Handle the remaining 4 or less elements if ASCII
            if (remainingLength == 4) {
                if (__builtin_expect(comparison[0] & comparison[1] & comparison[2] & comparison[3], 1))
                {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    destination[destinationIndex++] = (char)data[2];
                    destination[destinationIndex++] = (char)data[3];
                    return destinationIndex;
                }
                break;
            }

            v4ui mask3 = (v4ui){0x007F, 0x007F, 0x007F};
            v4ui mask2 = (v4ui){0x007F, 0x007F};

            comparison = data <= mask3;
            if (remainingLength == 3) {
                if( __builtin_expect(comparison[0] & comparison[1] & comparison[2], 1)) {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    destination[destinationIndex++] = (char)data[2];
                    return destinationIndex;
                }
                break;
            }

            comparison = data <= mask2;
            if (remainingLength == 2) {
                if(__builtin_expect(comparison[0] & comparison[1], 1)) {
                    destination[destinationIndex++] = (char)data[0];
                    destination[destinationIndex++] = (char)data[1];
                    return destinationIndex;
                }
                break;
            }

            if (source[sourceIndex] < 0x80){
                destination[destinationIndex++] = (char)source[sourceIndex];
                return destinationIndex;
            }
        }

        // Break out of the outer loop here because we know it's non-ASCII
        break;
    }

    // Handle non-ASCII and mixed cases
    while (sourceIndex < sourceLength && destinationIndex < destinationLength)
    {
...

but that did not move the needle (+/- 2ns diff).

Haven't given up on fine-tuning yet, but overall I think the new implementation is simple and more readable. So it's a balancing question of readability vs. performance (I'm pursuing both 🙂).

@AaronRobinsonMSFT AaronRobinsonMSFT marked this pull request as draft July 2, 2024 16:16
@AaronRobinsonMSFT
Copy link
Member

@am11 I'm moving this to draft for now.

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Sep 1, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-PAL-coreclr community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants