Non-breaking space Unicode byte A0 gets mangled #54

Closed
arkdae opened this issue Sep 2, 2023 · 3 comments

Comments

@arkdae

arkdae commented Sep 2, 2023

I am making use of Lambda Soup by way of the static site generator Soupault. I'm using Kramdown to convert plain text into HTML. Most of the output uses HTML entities, for example for smart quotes and ellipses. However, non-breaking spaces are output as a single byte with the hex value A0 instead of &nbsp;. This gets mangled, as seen here:

utop # Soup.parse "<html><body>Test\xA0test</body></html>" |> Soup.pretty_print;;
- : string =
"<html>\n <head></head>\n <body>\n  Test�test\n </body>\n</html>\n"
@aantron
Owner

aantron commented Sep 5, 2023

The issue is that the default input character set for HTML is UTF-8.

0xA0 is the numeric value of a non-breaking space character, but its UTF-8 encoding is C2 A0.

A0 on its own is not valid UTF-8. Lambda Soup (actually, its underlying parser Markup.ml) is reading the A0 and replacing it with the Unicode replacement character, with numeric value 0xFFFD and UTF-8 encoding EF BF BD. This behavior is correct according to the HTML spec.
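The byte relationships described above can be checked with nothing but the OCaml standard library (4.14 or later), independently of Lambda Soup; this is just an illustrative sketch:

```ocaml
(* Sketch: verify the encodings discussed above using only the OCaml
   standard library (4.14+). *)
let () =
  (* U+00A0 (no-break space) encodes in UTF-8 as the two bytes C2 A0. *)
  let b = Buffer.create 2 in
  Buffer.add_utf_8_uchar b (Uchar.of_int 0xA0);
  assert (Buffer.contents b = "\xC2\xA0");

  (* A bare A0 byte is malformed UTF-8: a conforming decoder reports an
     invalid sequence, for which it substitutes U+FFFD (Uchar.rep). *)
  let d = String.get_utf_8_uchar "\xA0" 0 in
  assert (not (Uchar.utf_decode_is_valid d));
  assert (Uchar.equal (Uchar.utf_decode_uchar d) Uchar.rep);

  (* U+FFFD itself encodes in UTF-8 as EF BF BD. *)
  let b2 = Buffer.create 3 in
  Buffer.add_utf_8_uchar b2 Uchar.rep;
  assert (Buffer.contents b2 = "\xEF\xBF\xBD")
```

Running this to completion without an assertion failure confirms why the output above shows the three bytes EF BF BD (rendered as �) where the single A0 byte went in.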

Can you configure Kramdown to either emit HTML entities, or to emit UTF-8?

Alternatively, Lambda Soup and Markup.ml can be used to read input in a different encoding. A bare A0 byte is not valid in any Unicode encoding, but it is a non-breaking space in ISO 8859-1. This isn't recommended, but if Kramdown cannot be told to write Unicode, you could try replacing Soup.parse your_input with

your_input
|> Markup.string
|> Markup.parse_html ~encoding:Markup.Encoding.iso_8859_1
|> Markup.signals
|> Soup.from_signals

@arkdae
Author

arkdae commented Sep 6, 2023

I guess this must be a Kramdown bug, then, because it already emits HTML entities for single and double quotes, ellipses, and other characters. I was surprised that it produced this single A0 byte when everything else in the output is plain 7-bit ASCII.

I've just been piping the output from Kramdown through a sed script which converts the A0 byte into &nbsp;.
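The same byte substitution that sed script performs can be sketched in OCaml using only the standard library (escape_nbsp is a hypothetical helper name, not part of Lambda Soup):

```ocaml
(* Hypothetical stdlib-only equivalent of the sed workaround: replace
   each bare A0 byte with the &nbsp; entity before the HTML is parsed. *)
let escape_nbsp s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if c = '\xA0' then Buffer.add_string b "&nbsp;"
      else Buffer.add_char b c)
    s;
  Buffer.contents b

let () = assert (escape_nbsp "Test\xA0test" = "Test&nbsp;test")
```

Since the result is pure 7-bit ASCII, it parses the same way under any input encoding assumption.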

Update:
I found another solution, run Kramdown like so:

kramdown --entity-output :symbolic

I still say this is a bug in Kramdown, though, because the non-breaking space is generated in footnotes, and in those same footnotes it also emits another HTML entity, &#8617; (a return arrow symbol for linking back to the origin of the footnote). It seems inconsistent to me that I have to add this option so that HTML entities are generated in just one place (so far that I have found), instead of higher-valued bytes of an unspecified encoding.

But I'll leave it alone for now.

Final update:
I guess it is not a bug in Kramdown. If I change my environment to LANG=en_US.UTF-8, then Kramdown outputs the two bytes C2 A0. So it was outputting ISO-8859-1 simply because that is what my environment was set to.

Maybe an enhancement to lambdasoup would be to honor the environment's encoding?

@aantron
Owner

aantron commented Sep 6, 2023

Thanks for looking into this!

Maybe an enhancement to lambdasoup would be to honor the environment's encoding?

Strictly speaking, it wouldn't be an enhancement. In the HTML spec, the parsing algorithm does not depend on the user's environment. I also don't think it's something that users expect: the vast majority are using UTF-8, and it would be surprising for their code to behave differently when deployed to another machine, potentially for another user whose environment happens to be configured differently, even when the input data is exactly the same.

There is an encoding detection procedure, but it works by assuming 7-bit ASCII and trying to find a <meta> tag before restarting parsing, or by looking for Unicode byte order marks. Those are absent in the vast majority of inputs Lambda Soup sees, so, in practice, Lambda Soup assumes UTF-8, though it can be forced to read just about any other encoding.

@aantron aantron closed this as completed Jul 25, 2024