perlunicode: Add discussion about malformations #23553
Until reading this, I had never heard of the "Unicode REPLACEMENT CHARACTER (U+FFFD)", even though I'd seen it thousands of times. So I'm not in a position to confirm the specifics of this p.r. However, once we review that, this can go in.
Note that finding a REPLACEMENT CHARACTER in your string doesn't
necessarily mean there is an attack. It is a perfectly legal input
character, for whatever reason.
I disagree. WinPerl's generic C coding policy is, and MS's help docs say, that finding a REPLACEMENT CHARACTER is an I/O error: the SATA/SCSI/IDE cable was yanked, and no more information is available. Imagine feeding all of the BMP, or hundreds of invalid UTF-8 surrogates, into a state machine, de-dupe logic, or an HV* hash, and every last string pops out of the front end or back end of the API identical, with memcmp(username1, username2, len) == 0. Yet they were all unique customers, IP addresses, zip codes, or email addresses 1 millisecond ago.
What I meant is that TUS does not prohibit someone from placing the REPLACEMENT CHARACTER in Unicode strings. It probably is a bad idea, but it isn't illegal.
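Both sides of this exchange can be illustrated with a minimal Python sketch. It is not taken from the PR: Python's built-in UTF-8 decoder stands in for whatever decoder an application uses, and the usernames and the `lossy_decode` helper are made up for illustration. It shows that a legitimately encoded U+FFFD passes strict decoding, and that distinct malformed inputs can collapse into one identical string once replacement is applied.

```python
REPLACEMENT = "\ufffd"

# U+FFFD is an ordinary, well-formed code point: it survives a strict
# UTF-8 round trip, so its presence alone proves nothing about malformation.
legal_bytes = REPLACEMENT.encode("utf-8")
assert legal_bytes == b"\xef\xbf\xbd"
assert legal_bytes.decode("utf-8") == REPLACEMENT


def lossy_decode(raw: bytes) -> str:
    """Hypothetical helper: decode UTF-8, substituting U+FFFD for bad bytes."""
    return raw.decode("utf-8", errors="replace")


# Two different byte strings, each ending in a byte that can never
# appear in well-formed UTF-8 (0xFF and 0xFE).
raw_a = b"user\xff"
raw_b = b"user\xfe"

name_a = lossy_decode(raw_a)
name_b = lossy_decode(raw_b)

# The inputs differed, but after replacement the decoded names compare equal.
print(raw_a == raw_b, name_a == name_b)
```

The collision is why lossy decoding is risky ahead of de-duplication or hashing; a strict decode that rejects the input keeps the two cases distinguishable.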
I forget the exact experiments I did with MS's multiple in-house, competing UTF-16 <-> UTF-8 converters, located on the Windows installer .iso in various .dlls. But at least one of their .dlls doesn't return U+FFFD on the UTF-8 side; instead it starts returning UTF-8 code points from this range https://en.wikipedia.org/wiki/Private_Use_Areas which I think MS is using to return secret error codes, hidden inside PUA code points, describing how the UTF-16 input byte stream failed "validation", whatever that particular MS .dll means by "validating an input UTF-16 byte stream".
Hence I've never discussed, or considered for more than 10 seconds, blindly dropping MS algorithms or const RO static array tables of metadata into WinPerl where Perl brews its own in-house const RO static array tables of metadata. I don't trust the MS APIs; they are there to serve MS's in-house needs, users of MS OSes, and the publicly, commercially distributed Win32-only software ecosystem.
That isn't the same scope or Venn diagram of users as Perl 5 users. Perhaps it could be proven that a certain MS API is identical, for all possible inputs and outputs, to a Perl in-house API, but I'm not qualified to prove that through unit tests. Probably not worth the maintenance either.
use the REPLACEMENT CHARACTER for the missing ones. As long as most of
the text is translatable, the results could be intelligible to a human
reader.
ыОУ СХОУЛД ГИЖЕ АН ХОНЕРАБЛЕ МЕНТИОН ТО КОИ╦ АС ТХЕ ОНЛЫ КОДЕ ПАГЕ ВХЕРЕ ТХИС ИС ТРУЕ
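The comment above appears to be English text whose bytes had the high bit set and were then displayed as KOI8-R; that reading is an inference from the text, not something the commenter spells out here. KOI8-R places Cyrillic letters so that clearing bit 7 yields a similar-sounding Latin letter (with case inverted). A minimal Python sketch of the reverse transformation, where `strip_high_bit` is a hypothetical helper name:

```python
def strip_high_bit(text: str) -> str:
    """Encode as KOI8-R, clear bit 7 of every byte, and read the result as ASCII."""
    raw = text.encode("koi8_r")
    return bytes(b & 0x7F for b in raw).decode("ascii")


# The first three words of the comment, recovered by bit masking alone.
print(strip_high_bit("ыОУ СХОУЛД ГИЖЕ"))
```

No mapping table is needed beyond the code page itself, which is the property the comment is pointing at.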
I'm unsure of your point here. I happen to be able to read Cyrillic, and surprisingly several of the words you gave here are pronounceable. My point is that if you are translating valid text in some encoding to Unicode and you encounter a character that doesn't have a Unicode equivalent, TUS explicitly says to substitute the REPLACEMENT CHARACTER in your translation, rather than doing anything else. (Maybe you could return failure, I suppose.)
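The two options mentioned here, substituting the REPLACEMENT CHARACTER or returning failure, can be sketched in Python. This is an illustration under assumptions, not code from the PR: Python's `cp1252` codec stands in for "some encoding" (byte 0x81 is unassigned in that codec), and both function names are hypothetical.

```python
from typing import Optional


def translate_with_replacement(raw: bytes, encoding: str = "cp1252") -> str:
    """Translate to Unicode, substituting U+FFFD for untranslatable bytes."""
    return raw.decode(encoding, errors="replace")


def translate_or_fail(raw: bytes, encoding: str = "cp1252") -> Optional[str]:
    """Translate to Unicode, returning None if any byte has no mapping."""
    try:
        return raw.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return None


# 0xE9 maps to U+00E9 in cp1252; 0x81 has no mapping in Python's cp1252 codec.
sample = b"caf\xe9 \x81 ok"

print(translate_with_replacement(sample))
print(translate_or_fail(sample))
```

With replacement, most of the text stays readable and only the unmapped position is lost; with failure, nothing is returned and the caller has to decide what to do.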
My opinion is that U+FFFD or a low 7-bit "?" is not readable with human eyes. It's not an overcompressed JPEG image, or what I did with KOI8 bitwise math logic up above. It's a bad disk sector: you will never know what was behind the square, and no amount of $ paid to a data recovery firm will get that character back, or give you a good enough guess at what it used to be.
If it's AI-generated auto captions, you get "???" or "...", or nowadays AI algos just print "*music*" for 8 minutes instead of a low 7-bit "?" on the screen. I just don't want anyone to think a U+FFFD or a '?' in a CSV file is ever acceptable coding practice, and to walk away and move on with daily life after seeing it for half a second in a system tracing log, SQL table, or dev tools console.
0a176b6 to 2c3d040
Significant revisions have been pushed
And especially the REPLACEMENT CHARACTER