Non-breaking space Unicode byte A0 gets mangled #54
Comments
The issue is that the default input character set for HTML is UTF-8. 0xA0 is the code point of the non-breaking space character, but its UTF-8 encoding is C2 A0. A0 on its own is not valid UTF-8. Lambda Soup (actually, its underlying parser Markup.ml) is reading the A0 and replacing it with the Unicode replacement character, which has code point 0xFFFD and UTF-8 encoding EF BF BD. This behavior is correct according to the HTML spec. Can you configure Kramdown to either emit HTML entities, or to emit UTF-8? Alternatively, Lambda Soup and Markup.ml can be used to read input in a different encoding. A bare A0 byte is not valid in any Unicode encoding, but I believe it is valid ISO 8859-1. It's not recommended to use this, but if Kramdown cannot be told to write Unicode, you could try replacing the way you parse your_input with the following pipeline:
your_input
|> Markup.string
|> Markup.parse_html ~encoding:Markup.Encoding.iso_8859_1
|> Markup.signals
|> Soup.from_signals
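To make the byte-level claims above easy to check, here is a minimal sketch using only the OCaml standard library. It assumes OCaml 4.14 or later (for `String.is_valid_utf_8`); the helper names `utf8_of_code_point` and `hex_of_string` are made up for this illustration and are not part of Lambda Soup or Markup.ml.

```ocaml
(* Sketch only: requires OCaml >= 4.14 for String.is_valid_utf_8.
   Helper names are hypothetical, chosen for this example. *)

(* Encode a single Unicode scalar value as a UTF-8 string. *)
let utf8_of_code_point cp =
  let buf = Buffer.create 4 in
  Buffer.add_utf_8_uchar buf (Uchar.of_int cp);
  Buffer.contents buf

(* Render a string as space-separated uppercase hex bytes. *)
let hex_of_string s =
  String.to_seq s
  |> Seq.map (fun c -> Printf.sprintf "%02X" (Char.code c))
  |> List.of_seq
  |> String.concat " "

let () =
  (* Non-breaking space and the replacement character, as UTF-8 bytes. *)
  Printf.printf "U+00A0 as UTF-8: %s\n" (hex_of_string (utf8_of_code_point 0x00A0));
  Printf.printf "U+FFFD as UTF-8: %s\n" (hex_of_string (utf8_of_code_point 0xFFFD));
  (* A lone A0 byte is a continuation byte with no lead byte. *)
  Printf.printf "bare A0 is valid UTF-8: %b\n" (String.is_valid_utf_8 "\xA0");
  Printf.printf "C2 A0 is valid UTF-8: %b\n" (String.is_valid_utf_8 "\xC2\xA0")
```

Running a check like `String.is_valid_utf_8` on the Kramdown output before parsing is one way to spot this kind of input early, rather than finding replacement characters in the result.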
I guess this must be a Kramdown bug, then, because it already emits HTML entities for single and double quotes and ellipses, among other things. I was surprised that it produced this single A0 byte when everything else in the output is plain 7-bit ASCII. I've just been piping the output from Kramdown through a sed script which converts the A0 byte into an HTML entity. Update:
I still say this is a bug in Kramdown, however, because the non-breaking space is being generated in footnotes, and in those same footnotes it also adds another HTML entity, ↩, which looks like a carriage return symbol and links back to the origin of the footnote. It seems inconsistent to me that I have to add this option for only one place (so far that I have found) to ensure HTML entities are generated instead of higher-valued bytes of an unspecified encoding. But I'll leave it alone for now. Final update: Maybe an enhancement to Lambda Soup would be to honor the environment's encoding?
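For anyone who would rather do the sed-style workaround inside their own OCaml code, here is a rough sketch. The function name `replace_bare_nbsp` is hypothetical, and the approach assumes the input is otherwise plain 7-bit ASCII, as described above. It is not safe for input that is already UTF-8, because A0 also occurs there as a continuation byte (including in C2 A0 itself).

```ocaml
(* Sketch only: replace every raw A0 byte with the &nbsp; entity.
   Assumes the rest of the input is 7-bit ASCII; do not use on UTF-8 input. *)
let replace_bare_nbsp (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if c = '\xA0' then Buffer.add_string buf "&nbsp;"
      else Buffer.add_char buf c)
    s;
  Buffer.contents buf
```

The result is pure ASCII, so it can be passed to the parser without specifying a non-default encoding.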
Thanks for looking into this!
Strictly speaking, it wouldn't be an enhancement. In the HTML spec, the parsing algorithm does not depend on the user's environment. I also don't think it's something that users expect. The vast majority are using UTF-8, and it would be surprising for their code to behave differently when deployed to another machine, potentially for another user, whose environment happens to be configured differently, even when the input data is exactly the same. There is an encoding detection procedure, but it works by assuming 7-bit ASCII and trying to find a
I am making use of Lambda Soup by way of the static site generator Soupault. I'm using Kramdown to convert plain text into HTML. Most of the output uses HTML entities, for example for smart quotes and ellipses. However, non-breaking spaces are output as a single byte with the hex value A0 instead of &nbsp;. This is getting mangled as seen here: