supporting markdown links #514

Open
jvanasco opened this issue Oct 14, 2020 · 1 comment

Comments

@jvanasco
Contributor

I maintain a package that uses html5lib to translate Markdown into HTML (https://github.com/jvanasco/html5lib_to_markdown) alongside our Bleach usage for dealing with user-submitted text.

I thought I had a workaround for some odd behavior between Python2 and Python3, but after encountering some issues migrating the CI tests to tox, I dug into my library and this library... and I realized there was a bigger problem.

The problem is that while almost all of Markdown is valid HTML, it also supports a quick "link" format, which is written as a URL inside an unnamed tag:

<https://example.com/path/to>

While my first reaction was to handle this in a pre-processor, I remembered that context matters and I need to know if I encounter this in a code-formatting block or not -- so I need to integrate this with a tokenizer.
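To illustrate why context matters, here is a rough sketch of a line-based scan that only reports these links when they sit outside fenced blocks and inline code spans. It is not the tokenizer integration described above. The `find_autolinks` name, the regular expressions, and the limited set of schemes are all assumptions made for this example.

```python
import re

# Hypothetical pattern: "<", a known scheme, ":", then anything up to ">"
# that contains no whitespace and no angle brackets.
AUTOLINK = re.compile(r"<((?:https?|mailto):[^\s<>]+)>")
# Simplified inline code span: a single pair of backticks on one line.
INLINE_CODE = re.compile(r"`[^`]*`")


def find_autolinks(text):
    """Return link targets found outside fenced blocks and inline code."""
    found = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # every fence line toggles the state
            continue
        if in_fence:
            continue
        # Drop inline code so links inside backticks are ignored.
        visible = INLINE_CODE.sub("", line)
        found.extend(AUTOLINK.findall(visible))
    return found


sample = "\n".join([
    "See <https://example.com/path/to> and `<https://example.com/code>`",
    "```",
    "<https://example.com/fenced>",
    "```",
    "<mailto:j@example.com>",
])
print(find_autolinks(sample))
```

A scan like this has to duplicate knowledge about code contexts that the tokenizer already has, which is the reason for wanting the tokenizer to carry the information instead.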

When these links are handled by this library's tokenizer's emitCurrentToken, the current logic creates a token name of "http:", "https:", or "mailto:". This is great.

However, the token's raw data is cast into an ordered dict, which blows away any duplicate values and any chance to recreate the tag, and some other characters trip up the delimiting. For example:

<https://example.com/a/aa/b/bb/c/d/e/f/g?foo=bar&bar=foo;#biz>
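A minimal sketch of the duplicate problem, using only the standard library. The `raw_pairs` list is hypothetical: it assumes each path segment of a link such as `<https://example.com/a/b/a>` ends up as a valueless attribute name, and it does not call html5lib itself.

```python
from collections import OrderedDict

# Hypothetical (name, value) pairs for <https://example.com/a/b/a>,
# assuming each path segment is read as an attribute with no value.
raw_pairs = [("example.com", ""), ("a", ""), ("b", ""), ("a", "")]

# Casting to a mapping keeps one entry per name, so the repeated "a" collapses.
attrs = OrderedDict(raw_pairs)

# Rebuilding from the mapping loses the final path segment.
rebuilt = "/".join(attrs.keys())
print(rebuilt)  # example.com/a/b
```

Keeping the raw list of pairs next to the mapping would preserve enough information to rebuild the path in this simplified case.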

Is there any chance of html5lib supporting a use case of keeping the full data of these unnamed URL tags somehow? I don't expect them to be serialized by this library, as this is a weird HTMLish format that is not real HTML - but Markdown is a popular and widespread format that is mostly valid HTML, except for this one _____ tag.

There are a few ideas I had that are 70% towards a PR for this - but if this use-case is too outside the scope of this library, I need to spend my time looking for alternatives.

Thanks, J

@theRealProHacker
Contributor

Are you converting HTML to Markdown or Markdown to HTML? Because if you actually want to convert Markdown to HTML using an HTML parser, there are definitely more problems than just "this one _____ tag".

For example, consider:

```html
<pre>
```

As you mentioned, the HTML parser doesn't know about the quotes around the pre tag and will parse it as an HTML element, which is obviously not what you want.
