Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding a JavaScriptEncoder implementation that doesn't encode the block list or surrogate pairs. #42847

Closed
dan655t opened this issue Sep 29, 2020 · 15 comments

Comments

@dan655t
Copy link

dan655t commented Sep 29, 2020

I would like to know if there is a way or any intention for the System.Text.Json serializer to support emojis defined with surrogate pairs?

When we serialize "📲" we get its representation as escaped unicode values. I would like it to remain unescaped.

static void Main()
{
    var encoderSettings = new TextEncoderSettings();
    encoderSettings.AllowRange(UnicodeRanges.All);
    var serializerOptions = new JsonSerializerOptions
    {
        Encoder = JavaScriptEncoder.Create(encoderSettings),
        WriteIndented = true
    };
    var json = JsonSerializer.Serialize("📲", serializerOptions);
    Console.WriteLine(json);
    // "\uD83D\uDCF2"
}

Thank you in advance.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Text.Json untriaged New issue has not been triaged by the area owner labels Sep 29, 2020
@GrabYourPitchforks
Copy link
Member

There is no built-in mechanism for allowing this. You'd have to subclass the JavaScriptEncoder type and set the JsonSerializerOptions.Encoder property to an instance of your custom type.

@layomia - We could consider allowing the unsafe relaxed escaper to allow known-good supplementary characters. This isn't trivial work, though.

@layomia layomia removed the untriaged New issue has not been triaged by the area owner label Sep 30, 2020
@layomia layomia added this to the Future milestone Sep 30, 2020
@dan655t
Copy link
Author

dan655t commented Sep 30, 2020

Thank you for the reply. This is blocking us from migrating off of Newtonsoft.

@terrajobst
Copy link
Member

@dan655

Could you describe why this is a blocker? Those values are correctly deserialized; it just means that if you look a the raw JSON payload you'll seem the escaped.

@dan655t
Copy link
Author

dan655t commented Mar 8, 2021

@terrajobst sure. Our mobile apps display the escaped text as is instead of the emojis.

@sgryphon
Copy link

sgryphon commented Mar 13, 2021

@terrajobst sure. Our mobile apps display the escaped text as is instead of the emojis.

That sounds like a problem with your decoding app not correctly handling valid JSON.

The JSON specification says: ' To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". '

Technically only the quotation mark and reverse solidus require escaping (it is the only way to represent them), so maybe they could add switches to control if other character do or do not get encoded (tabs, line feeds, control characters, values outside ASCII, values outside BMP). e.g. the UTF-8 byte sequence will be shorted than the 12-byte escaped surrogate pair, so save a few bytes.

But the decoding app should still handle all valid encodings.

You are asking for a work around to compensate for a bug in your mobile app.

@dan655t
Copy link
Author

dan655t commented Mar 13, 2021

You are asking for a work around to compensate for a bug in your mobile app.

I disagree with this take.

  • We do not force our users to update to the latest version of our client apps.
  • We use several different backend technologies. All of which use their de facto json serializer and they do not escape these emojis.
  • Newtonsoft just works.

I'm honestly surprised more people aren't having this issue considering how prevalent emojis are in user generated content.

@GrabYourPitchforks
Copy link
Member

I'm honestly surprised more people aren't having this issue considering how prevalent emojis are in user generated content.

Because as mentioned previously, this appears to be a bug in the component that is deserializing the JSON payload. All JSON deserializers in wide use that we're aware of follow the RFC and properly turn the \uXXXX\uYYYY pair back into the correct emoji.

Would subclassing the escaper (see my comment from a few months ago) work for your scenario? This puts you in full control over how the payload is generated, which should give you sufficient berth to work around other bugs in the deserialization component you use. I suspect if that component isn't properly decoding \uXXXX substrings, it probably has other bugs that you may need to work around.

@dan655t
Copy link
Author

dan655t commented Mar 13, 2021

I will look into a workaround. Thank you all for the responses. Will close this issue.

@dan655t dan655t closed this as completed Mar 13, 2021
@koculu
Copy link

koculu commented Mar 30, 2021

The decision that was made here conflicts with major browser's behavior including Microsoft Edge.
Please reconsider the issue as it really hurts.

!Text.Json serializer converts emoji icons to encoded codepoints and the browsers do not deserialize them correctly! (This statement is incorrect. Problem was in db serialization but not browsers.)

Demonstration of JSON serialization and deserialization that contains an emoji.
var jsObjectWithEmoji = { text: '😀' }; JSON.stringify(jsObjectWithEmoji); JSON.parse(JSON.stringify(jsObjectWithEmoji));

Microsoft Edge Console Demo:
image

@GrabYourPitchforks
Copy link
Member

the browsers do not deserialize them correctly!

Can you provide an example of a System.Text.Json-serialized payload that the browser API JSON.parse does not properly decode? The screenshot you provided above does not demonstrate such an example.

@koculu
Copy link

koculu commented Mar 30, 2021

Hi there!
This turns out a serialization issue on the graph database I used. Thank you for the pointers.
Browsers deserialize these unicode encodings correctly! That statement was clearly false.

On the other hand, browsers by default serialize and deserialize the emojis as is as seen in the screen shots.
It is a good idea to add that support to .NET.

@koculu
Copy link

koculu commented Mar 30, 2021

I switched to Newtonsoft Json after spending several hours on this.
Emojis are very common and .Net 5 should handle them out of the box.

  1. I don't want to waste extra space for a couple of well known emoji characters in my database.
  2. I don't want to waste development time for this kind of 3rd party issues.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 29, 2021
@dotnet dotnet unlocked this conversation May 26, 2023
@eiriktsarpalis
Copy link
Member

Reopening for future consideration.

@eiriktsarpalis eiriktsarpalis changed the title JsonSerializer.Serialize support for emojis defined with surrogate pairs Consider adding a JavaScriptEncoder implementation that doesn't encode the block list or surrogate pairs. Jun 5, 2023
@ghost
Copy link

ghost commented Jun 5, 2023

Tagging subscribers to this area: @dotnet/area-system-text-encodings-web
See info in area-owners.md if you want to be subscribed.

Issue Details

I would like to know if there is a way or any intention for the System.Text.Json serializer to support emojis defined with surrogate pairs?

When we serialize "📲" we get its representation as escaped unicode values. I would like it to remain unescaped.

static void Main()
{
    var encoderSettings = new TextEncoderSettings();
    encoderSettings.AllowRange(UnicodeRanges.All);
    var serializerOptions = new JsonSerializerOptions
    {
        Encoder = JavaScriptEncoder.Create(encoderSettings),
        WriteIndented = true
    };
    var json = JsonSerializer.Serialize("📲", serializerOptions);
    Console.WriteLine(json);
    // "\uD83D\uDCF2"
}

Thank you in advance.

Author: dan655t
Assignees: -
Labels:

area-System.Text.Encodings.Web

Milestone: Future

davidmatson added a commit to davidmatson/runtime that referenced this issue Jun 5, 2023
Also fixes dotnet#86800.
Also fixes dotnet#87138 (except docs outside this repo).
@eiriktsarpalis
Copy link
Member

Closing in favor of #87153.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

8 participants