perlunicode: Add discussion about malformations #23553

khwilliamson · 2025-08-09T17:38:06Z

And especially the REPLACEMENT CHARACTER

jkeenan · 2025-08-10T11:42:28Z

Until reading this, I had never heard of the "Unicode REPLACEMENT CHARACTER (U+FFFD)", even though I'd seen it thousands of times. So I'm not in a position to confirm the specifics of this PR. However, once those specifics have been reviewed, this can go in.

pod/perlunicode.pod

bulk88 · 2025-08-10T16:45:15Z

pod/perlunicode.pod

+Note that finding a REPLACEMENT CHARACTER in your string doesn't
+necessarily mean there is an attack.  It is a perfectly legal input
+character, for whatever reason.
+


I disagree. WinPerl's generic C coding policy is, and MS's help docs say, that finding a REPLACEMENT CHARACTER means an I/O error: the SATA/SCSI/IDE cable was yanked and no more information is available. Imagine feeding all of the BMP, or hundreds of invalid UTF-8 surrogates, into a state machine, de-dupe logic, or an HV* hash, and every last string pops out the front end or back end of the API as memcmp(username1, username2, len) == 0 identical. Yet they were all unique, different customers/IP addresses/zip codes/email addresses 1 millisecond ago.

What I meant is that TUS (The Unicode Standard) does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal.
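
Both points can be seen in a small sketch. Python is used purely for illustration here (this is not how Perl's decoder works), and the byte strings and the `decode_lossy` helper name are invented for the example. A decoder that substitutes U+FFFD maps distinct malformed inputs to the same text, while a genuine U+FFFD in the input is well-formed UTF-8 and passes strict decoding.

```python
# Sketch only: Python's built-in UTF-8 codec stands in for any decoder
# that substitutes U+FFFD for malformed input.

REPLACEMENT = "\ufffd"


def decode_lossy(raw: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each malformed sequence."""
    return raw.decode("utf-8", errors="replace")


# Two different (hypothetical) malformed usernames: the bytes 0xFF and
# 0xFE can never appear in well-formed UTF-8.
user_a = b"alice\xff"
user_b = b"alice\xfe"

assert user_a != user_b                              # distinct as bytes
assert decode_lossy(user_a) == decode_lossy(user_b)  # identical once decoded

# A real U+FFFD in the input is well-formed (bytes EF BF BD) and survives
# strict decoding, so its presence alone does not prove a malformation.
legit = "alice\ufffd".encode("utf-8")
assert legit == b"alice\xef\xbf\xbd"
assert legit.decode("utf-8", errors="strict") == "alice" + REPLACEMENT
```

After decoding, all three inputs are the same string, so the decoded text alone cannot say which case occurred. Telling them apart requires validating the bytes before or during decoding.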

I forgot the exact experiments I did with MS's multiple in-house, competing UTF-16 <-> UTF-8 converters, located on the Windows installer .iso in various .dlls. But at least one of their .dlls doesn't return U+FFFD on the UTF-8 side, but instead starts returning UTF-8 code points from this range https://en.wikipedia.org/wiki/Private_Use_Areas which I think MS is using to return their secret error codes, hidden inside PUA code points, describing how the UTF-16 input byte stream failed "validation", whatever that particular MS .dll means by "validating an input UTF-16 bytestream".

Hence I've never discussed, or considered for more than 10 seconds, blindly dropping MS algorithms or const RO static array tables of metadata into WinPerl in place of Perl's own in-house const RO static array tables of metadata. I don't trust the MS APIs; they are there to serve MS's in-house needs, users of MS OSes, and the publicly, commercially distributed Win32-only software ecosystem.

That isn't the same scope or Venn diagram of users as Perl 5 users. Perhaps it could be proven that a certain MS API is identical, for all possible inputs and outputs, to a Perl in-house API, but I'm not qualified to prove that through unit tests. Probably not worth the maintenance either.

bulk88 · 2025-08-10T17:27:01Z

pod/perlunicode.pod

+use the REPLACEMENT CHARACTER for the missing ones.  As long as most of
+the text is translatable, the results could be intelligible to a human
+reader.



ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ

I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is that if you have valid text in some encoding that you are translating to Unicode, and you encounter a character that doesn't have a Unicode equivalent, TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)
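
A minimal sketch of that substitution, in Python for illustration only. Windows-1252 is an assumed source encoding, picked because Python's `cp1252` codec leaves byte 0x81 unmapped; the sample bytes and the `to_unicode` helper are invented for the example. It also shows the "return failure" alternative.

```python
# Sketch: translate legacy-encoded bytes to Unicode, substituting U+FFFD
# for any byte that has no mapping in the source encoding.


def to_unicode(raw: bytes, encoding: str) -> str:
    """Decode `raw`, replacing untranslatable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


# 0xE9 is e-acute in cp1252; 0x81 has no assigned character there.
sample = b"caf\xe9 \x81 ok"
text = to_unicode(sample, "cp1252")

# The alternative: treat the untranslatable byte as a hard failure.
try:
    sample.decode("cp1252", errors="strict")
    failed = False
except UnicodeDecodeError:
    failed = True
```

With substitution, the rest of the text stays readable and only the unmapped byte is lost. With strict decoding, the caller gets nothing back and has to decide what to do.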

My opinion is that U+FFFD and a 7-bit ASCII "?" are not readable with human eyes. It's not an overcompressed JPEG image, or what I did with the KOI8 bitwise math above. It's a bad disk sector: you will never know what was behind the square, and no amount of money paid to a data recovery firm will get that character back or give you a good enough guess at what it used to be.

If it's AI-generated auto captions, you get "???" or "...", or nowadays AI algorithms just print "*music*" for 8 minutes instead of a 7-bit "?" on the screen. I just don't want anyone to think a U+FFFD or a '?' in a CSV file is ever acceptable coding practice, and to walk away and move on with daily life after seeing it for half a second in a system tracing log, SQL table, or dev tools console.
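
For readers who missed it, the KOI8 trick mentioned above can be sketched as follows (Python for illustration; the `strip_high_bit` helper name is invented). KOI8-R lays out its Cyrillic letters so that clearing the high bit of each byte leaves a Latin transliteration with the case swapped. The text is degraded but recoverable, unlike a character that has already been replaced.

```python
# Sketch: encode the Cyrillic line as KOI8-R, then clear bit 7 of every
# byte, as a 7-bit-only channel would. ASCII bytes (e.g. spaces) are
# unaffected because their high bit is already zero.


def strip_high_bit(text: str) -> str:
    """Encode as KOI8-R, mask each byte to 7 bits, and read as ASCII."""
    koi8 = text.encode("koi8_r")
    return bytes(b & 0x7F for b in koi8).decode("ascii")


# The start of the line quoted in this thread.
line = "ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦"
decoded = strip_high_bit(line)
print(decoded)
```

The box-drawing character at the end of the sample is byte 0xB8 in KOI8-R, which masks down to 0x38, the digit 8.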


khwilliamson · 2025-08-22T16:55:06Z

Significant revisions have been pushed

khwilliamson mentioned this pull request Aug 9, 2025

sv_vcatpvfn_flags: Use utf8_to_uv #23083

Merged

bram-perl approved these changes Aug 9, 2025
