@khwilliamson
Contributor

And especially the REPLACEMENT CHARACTER

@jkeenan
Contributor

jkeenan commented Aug 10, 2025

Until reading this, I had never heard of the "Unicode REPLACEMENT CHARACTER (U+FFFD)", even though I'd seen it thousands of times. So I'm not in a position to confirm the specifics of this p.r. However, once we review that, this can go in.

Note that finding a REPLACEMENT CHARACTER in your string doesn't
necessarily mean there is an attack. It is a perfectly legal input
character, for whatever reason.

Contributor

I disagree. WinPerl's generic C coding policy is, and MS's help docs say, that finding a REPLACEMENT CHARACTER means an I/O error: the SATA/SCSI/IDE cable was yanked and no more information is available. Imagine feeding all of the BMP, or hundreds of invalid UTF-8 surrogates, into a state machine, de-dupe logic, or an HV* hash, and every last string pops out the front end or back end of the API comparing identical under memcmp(username1, username2, len) == 0. Yet they were all unique, different customers/IP addresses/zip codes/email addresses 1 millisecond ago.
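
The collapse described in this comment can be reproduced in a few lines. This is a minimal sketch in Python rather than Perl or C, and the byte strings are made-up stand-ins for the "usernames" in the comment. It only shows that lossy decoding with U+FFFD substitution can map distinct ill-formed inputs to one identical string.

```python
# Three distinct, ill-formed UTF-8 byte strings (hypothetical "usernames").
raw_names = [b"user\xff", b"user\xfe", b"user\x80"]

# Lossy decoding: each invalid byte is replaced by U+FFFD.
decoded = [raw.decode("utf-8", errors="replace") for raw in raw_names]

distinct_before = len(set(raw_names))  # the inputs differ byte for byte
distinct_after = len(set(decoded))     # after replacement they compare equal

print(distinct_before, distinct_after)
```

A de-duplication step or hash keyed on the decoded strings would treat all three inputs as the same entry, which is the failure mode the comment warns about.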

Contributor Author

What I meant is that TUS (The Unicode Standard) does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal.
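
To illustrate that point, here is a small sketch (in Python, chosen only for brevity) showing that U+FFFD is itself well-formed: it survives a strict UTF-8 round trip, and the result is indistinguishable from a U+FFFD produced by error substitution. The variable names are illustrative.

```python
REPLACEMENT = "\ufffd"

# U+FFFD encodes to ordinary, well-formed UTF-8 (bytes EF BF BD).
encoded = REPLACEMENT.encode("utf-8")

# Strict decoding accepts it without raising: it is legal input.
legit = encoded.decode("utf-8")

# The same character produced by substituting for an invalid byte.
from_error = b"\xff".decode("utf-8", errors="replace")

# After the fact, the two cases cannot be told apart.
print(legit == from_error)
```

So the presence of U+FFFD alone does not say whether the character arrived as data or was substituted for something malformed.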

Contributor

I forget the exact experiments I did with MS's multiple in-house, competing UTF-16 <-> UTF-8 converters, located on the Windows installer .iso in various .dlls. But at least one of their .dlls doesn't return U+FFFD on the UTF-8 side; instead it starts returning UTF-8 code points from this range, https://en.wikipedia.org/wiki/Private_Use_Areas, which I think MS is using to return secret error codes, hidden inside PUA code points, describing how the UTF-16 input byte stream failed "validation", whatever that particular MS .dll means by "validating an input UTF-16 byte stream".

Hence I've never discussed, or considered for more than 10 seconds, blindly dropping MS algorithms/const RO static array tables of metadata into WinPerl, where Perl brews its own in-house const RO static array tables of metadata. I don't trust the MS APIs; they are there to serve MS's in-house needs, users of MS OSes, and the public, commercially distributed, Win32-only software ecosystem.

That isn't the same scope or Venn diagram of users as Perl 5 users. Perhaps it could be proven that a certain MS API is identical, for all possible inputs and outputs, to a Perl in-house API, but I'm not qualified to prove that through unit tests. Probably not worth the maintenance either.

use the REPLACEMENT CHARACTER for the missing ones. As long as most of
the text is translatable, the results could be intelligible to a human
reader.

Contributor

ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ
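
The line above appears to rely on a design property of KOI8-R: clearing the high bit of each Cyrillic letter yields a Latin letter with the case swapped. The sketch below (Python, with a hypothetical helper name `strip_high_bit`) applies that bit mask to short fragments of the line; it is an interpretation of the comment, not something the commenter posted.

```python
def strip_high_bit(text: str) -> str:
    """Encode text as KOI8-R, clear bit 7 of every byte, read it as ASCII."""
    koi8_bytes = text.encode("koi8_r")
    return bytes(b & 0x7F for b in koi8_bytes).decode("ascii")

# Fragments taken from the comment above.
print(strip_high_bit("ТХИС ИС ТРУЕ"))  # this is true
print(strip_high_bit("ыОУ"))           # You
```

Text that loses its high bit in KOI8-R therefore degrades to a readable transliteration instead of turning into replacement characters.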

Contributor Author

I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is that if you have valid text in some encoding that you are translating to Unicode, and you encounter a character that doesn't have a Unicode equivalent, TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)
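
A minimal sketch of the two options mentioned here, substitution versus returning failure, using Python's codecs. The assumption is that byte 0x81, which Python's cp1252 codec leaves unassigned, can stand in for a source character with no Unicode equivalent; the sample bytes are invented.

```python
legacy = b"caf\xe9 \x81"  # 0xE9 maps to U+00E9; 0x81 has no mapping in cp1252

# Option 1: substitute U+FFFD for the untranslatable byte.
text = legacy.decode("cp1252", errors="replace")

# Option 2: treat it as a failure instead of substituting.
try:
    legacy.decode("cp1252")  # strict is the default error handler
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(text), failed)
```

With substitution, the translatable part of the text survives and only the unmappable position is marked.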

Contributor

My opinion is that U+FFFD and the low 7-bit "?" are not readable with human eyes. It's not an overcompressed JPEG image, or what I did with the KOI8 bitwise math logic up above. It's a bad disk sector: you will never know what was behind the square, and no amount of $ to a data recovery firm will get that character back, or give you a good enough guess at what it used to be.

If it's AI-generated auto captions, you get "???" or "...", or nowadays AI algos just print "*music*" for 8 minutes instead of the low 7-bit "?" on the screen. I just don't want anyone to think a U+FFFD or a '?' in a CSV file is ever acceptable coding practice, and to walk away and move on with daily life after seeing it for half a second in a system tracing log/SQL table/dev tools console.

@khwilliamson khwilliamson force-pushed the perlunicode_replacement branch from 0a176b6 to 2c3d040 Compare August 21, 2025 21:21
@khwilliamson
Contributor Author

Significant revisions have been pushed

@khwilliamson khwilliamson merged commit ab396c3 into Perl:blead Sep 1, 2025
33 checks passed
@khwilliamson khwilliamson deleted the perlunicode_replacement branch September 1, 2025 15:18