-
Notifications
You must be signed in to change notification settings - Fork 871
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python Markdown turns </#rrggbb>
into HTML comment, even while in inline code
#1425
Comments
For context am I using Python Markdown through MkDocs, but this seems to be generally an issue as pointed out in this comment. The full discussion for context: mkdocs/mkdocs#3537 |
I imagine this is a quirk with the HTML parser. I'm not sure off the top of my head what specifically in the parser is causing this, but I would imagine this could be considered a bug. |
My only guess is that the parser turns invalid HTML tags into comments to avoid possible breaking of the content. I really hope this can be fixed, should it be a bug, or that there is a workaround for this. |
What I find most concerning is the content isn't preserved for inline code: >>> markdown.markdown('`</#rrggbb/>`')
'<p><code><!--#rrggbb/--></code></p>' The parser should not be making changes in this way. |
This is treated as a bogus comment by the HTML parser of Python: from html.parser import HTMLParser
class Parser(HTMLParser):
def handle_comment(self, v):
print("Comment:", v)
def handle_data(self, v):
print("Data:", v)
Parser().feed('<#rgba>Foo</#rgba>')
# prints:
# Data: <
# Data: #rgba>Foo
# Comment: #rgba See these lines: https://github.com/python/cpython/blob/936135bb97fe04223aa30ca6e98eac8f3ed6b349/Lib/html/parser.py#L277C15-L286 Perhaps Python-Markdown should parse backticks before html blocks, doing replacements as needed. If not, you can't put lazy bogus HTML comments inside code blocks: >>> markdown.markdown("`</? p>Hi</p>`\n\n```\n</+ ups>\n```")
'<p><code><!--? p-->Hi</p></code></p>\n<p>```</p>\n<!--+ ups-->\n\n<p>```</p>' |
There are multiple different factors which are resulting in this unexpected output. First, the Markdown parser passes the entire document to an HTML parser to identify and extract all raw HTML blocks. Anything that is not a raw HTML block (as defined by Markdown) is (supposedly) passed through unalerted. Later, the Markdown parser then handles anything which looks like HTML tags in the inline parser after code spans are parsed. In other words, the HTML block parser passes inline tags through as unaltered plain text, and that is parsed a second time by a much simpler parser (which just finds all content between opening and closed angle brackets). Unfortunately, in a few rare occasions, invalid HTML can cause the HTML block parser to alter the text passed through in some odd way, which appears to be the case here. It has been our position that it is not our responsibility to properly handle invalid HTML. Bad input results in bad output, even if that output is in an unexpected format. Of course, the issue here is that the invalid HTML is in a code span, and so a user should be able to presume that it would not be parsed by any HTML parser and be preserved as-is. Unfortunately, the way the current HTML block parser is implemented, it's not possible to completely avoid passing any part of the document through the HTML block parser. All we can do is try to properly identify which parts should be passed through unaltered. So, now lets take a look as what the HTML block parser is doing. The string As noted by @mondeja above, the bogus comment state is handled by the HTML parser here: Note that the |
Looking through the HTML parser code, there are only two situations in which
The first is obvious. The presumption is that a comment was intended, but the required In any event, when I originally worked on this, I missed the second case. My thinking was that the if we added On further reflection, I think it is better to just pass either case through as plain text. In that way we don't make any modifications to the user's input. An added benefit is that when the invalid HTML is in a code span, we aren't messing with it. And for the few occasions out there where a user was taking advantage of the existing behavior to get a comment where they wanted one, well, they were abusing a bug so they shouldn't expect it to continue to work with future updates. |
The solution to #1066 has a side effect of preventing various edge cases of invalid comments (tags that start with |
I think that's related to the spec you linked before, where it states to define this as a comment, should it be an invalid opening/closing tag from what I read. Tho, in all honesty do I wonder why this behaviour was chosen for that spec instead of something like escaping the tag itself (so turning |
As with most implementations, we now pass through bogus comments (as defined by the HTML Spec) unaltered except that they are HTML escaped. This deviates from the reference implementation which compeltely ignores them. As the reference implementation seems to not have even contemplated their existance, it is not being used as a reference in this instance. Fixes Python-Markdown#1425.
I have a fix in #1426. However, in the meantime, you can use raw HTML as a workaround:
As a reminder, wrapping your escaped HTML in raw code tags will prevent Markdown from further manipulating it. And because you are using raw HTML you need to do your own escaping anyway, which prevents the bug from triggering. As you wanted a code span anyway, this works. Of course, for instances where there is no code span involved, this doesn't help. |
As with most implementations, we now pass through bogus comments (as defined by the HTML Spec) unaltered except that they are HTML escaped. This deviates from the reference implementation which completely ignores them. As the reference implementation seems to not have even contemplated their existence, it is not being used as a reference in this instance. Fixes #1425.
Something small I want to mention here real quick is an interesting exception to that behaviour discussed here... That exception being uppercased letters it seems. Just wanted to mention this. Doesn't really change much I think. EDIT: Actually, I was wrong. That above case was in a code block, so surrounded by pre tags in the end... |
When you put
</#rrggbb>
as-is inside a inline code block will Python Markdown convert it into a HTML comment (<!--#rrggbb-->
), which feels odd.Is there a way to prevent this? Using HTML escape methods such as
</#rrggbb>
will not work, as Python Markdown will then treat the content literally and show it as-is.Additionally can and will this cause issues with inline-code highlighting used.
The text was updated successfully, but these errors were encountered: