Consider ignoring HTML (inline) in `alt` on image #716

wooorm · 2022-06-28T14:21:47Z

CommonMark prescribes that markdown is interpreted, but corresponding tags not output, in alt on <img>:

![foo *bar*]

[foo *bar*]: train.jpg "train & tracks"

<p><img src="train.jpg" alt="foo bar" title="train &amp; tracks" /></p>

Example 571

(see also some more info in that section).

To paint a more illustrative picture of this, and introduce the problem:

![a *b* `c` [d](#) <e>f</e> &amp; \- **g**](#)

->

<p><img src="#" alt="a b c d <e>f</e> &amp; - g" /></p>

I see no good reason that actual HTML is used, while html-from-markdown is ignored.
I find that there is something to say for not doing this at all: for a *b* c, maybe the user actually wanted the asterisks in the alt.
However, that’s probably too much of a breaking change, and maybe this is fine.
And there is something to say for doing it for everything (including actual HTML), so that a <em>b</em> c is consistent with a *b* c and turns into a b c. I believe this to be the right call, and hence this issue.

Perhaps of note: HTML does not work in alt. Neither tags, nor comments, nor instructions, nothing. https://html.spec.whatwg.org/multipage/parsing.html#attribute-value-(double-quoted)-state.

I can do the work.


jgm · 2022-07-04T11:53:06Z

Note: the spec doesn't mandate any particular treatment of the "image description." It only recommends:

> Though this spec is concerned with parsing, not rendering, it is recommended that in rendering to HTML, only the plain string content of the image description be used. Note that in the above example, the alt attribute’s value is foo bar, not foo [bar](/url) or foo <a href="/url">bar</a>. Only the plain string content is rendered, without formatting.

So a fully compliant implementation is free to make a different decision.

> I see no good reason that actual HTML is used, while html-from-markdown is ignored.

cmark, the reference implementation, renders your example thus:

<p><img src="#" alt="a b c d &lt;e&gt;f&lt;/e&gt; &amp; - g" /></p>

I'm not sure what implementation preserved the <e>? I'd say it's a bug -- certainly an undesirable feature, and something to report to the relevant implementation.

wooorm · 2022-07-04T11:56:04Z

> I'm not sure what implementation preserved the <e>...

Everyone that follows CommonMark: < and > don’t need to be encoded in double quoted attributes in HTML. They can be, which is what you show, but CM does not require that.
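As a rough check of the encoding point, here is a small sketch using Python's standard `html.parser` (the `AltCollector` class and `parse` helper are names made up for this illustration). It feeds both serializations from this thread to an HTML parser and records the decoded `alt` values and the start tags seen. With this parser, the raw and the encoded forms give the same attribute value, and `<e>` never shows up as an element.

```python
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Records every start tag and the decoded alt value of each <img> (sketch only)."""

    def __init__(self):
        super().__init__()
        self.alts = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        self.tags.append(tag)
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))


def parse(html):
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector


# Output with raw < and > in the attribute, as in the original post.
raw = '<p><img src="#" alt="a b c d <e>f</e> &amp; - g" /></p>'
# Output with < and > encoded, as cmark renders it.
encoded = '<p><img src="#" alt="a b c d &lt;e&gt;f&lt;/e&gt; &amp; - g" /></p>'

print(parse(raw).alts)
print(parse(encoded).alts)
print(parse(raw).tags)
```

So the open question is not how `<e>` is encoded but whether it should be part of the alt text at all.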

wooorm · 2022-07-04T12:02:17Z

> So a fully compliant implementation is free to make a different decision.

Could we improve the recommendation to explain the term “plain string” in a way that represents how several markdown parsers work?

jgm · 2022-07-04T12:03:10Z

OK, I see your issue is not really about whether the < is encoded, but whether the <e> is included at all.

As I mentioned, the spec is really silent on this -- the examples in this section just embody a recommendation -- so you're free to omit the raw HTML when generating the alt attribute. I agree that that's probably a better thing to do than what cmark currently does -- indeed, my own Haskell commonmark parser omits the tags.

I'd be okay with changes to the spec along these lines, I think, as long as the reference implementations were also updated to match. Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing and need to adjust things for a case that is probably pretty rare.

wooorm · 2022-07-04T12:14:28Z

> the examples in this section just embody a recommendation

This is coming up several times with several people. One idea to clarify this is to mark test cases as illustrative/optional/recommendations with some attribute.

> Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing and need to adjust things for a case that is probably pretty rare.

This to me sounds like the reason for CommonMark to exist and have a set of test cases that people pull in to test their parsers with. Much of markdown is edge cases.
It also relates to my previous point on marking test cases as optional.

jgm · 2022-07-04T12:19:27Z

If you're up for making changes to spec, commonmark.js, and cmark, then you've got the green light!

wooorm · 2022-07-04T19:24:00Z

While I understand the question, I am hoping to contribute solely to the spec.
I have no C knowledge, so contributing to cmark is out of my reach.
I could potentially help with JavaScript, although I am already maintaining several markdown parsers and pressed for time, so I am not really interested in maintaining others as well.

jgm · 2022-07-05T08:29:23Z

The thing is, if you contribute in this way to the spec, then I have to modify the reference implementations, and that takes time. I don't like them to get out of sync, and I don't have time to spend on this right now. You can keep this issue open if you like.

wooorm · 2022-07-05T08:44:13Z

Right, that’s quite fair.
Have you thought about opening up maintenance of commonmark? I imagine that I’d feel more inclined to work more on this if I’d have more sense of ownership.
Another thought: while one reference parser is crucial, in the JavaScript world there are already several more popular commonmark-compliant markdown parsers. Perhaps it would relieve the burden of maintenance to archive commonmark.js? (cmark could be compiled to wasm for the dingus, or say my own micromark could be used)

colinodell · 2022-07-25T21:12:42Z

> Another thought: while one reference parser is crucial, in the JavaScript world there are already several more popular commonmark-compliant markdown parsers. Perhaps it would relieve the burden of maintenance to archive commonmark.js?

As someone who maintains a compliant parser and doesn't count C as a language I understand well, I personally find the JavaScript implementation to be extremely helpful in understanding the impact of spec changes. But I also know the burden of maintaining multiple projects too, so while I'd be bummed to lose the reference parser, it would be understandable.

rlidwka · 2023-11-19T18:29:53Z

Are there any updates on this? If not, can we at least get a clarification on what the expected behavior should be?

> Note, however, that such a change would be an annoyance to all current implementers, who would find their spec tests failing

The result of updating the spec is a one-time change for some implementors. The result of not updating it is a continued stream of issues regarding the inconsistency between parsers. The latter is arguably more annoying.

jgm · 2023-11-19T20:50:11Z

What is the precise change to the spec you think should be made?

rlidwka · 2023-11-20T07:50:04Z

Just a general overview here. I'll try to suggest precise changes in the next post.

Option 1 - user friendly

Make it so image alt in output HTML is copied verbatim from ![here]:

![*hello* <bar>]() => <img alt="*hello* <bar>">

This seems to be what users expect.

This also has usability issues, because literal ] becomes impossible to add in alt. Perhaps, ![[ hello ] world ]]() syntax can be used to remedy this (similar to backticks), but this is probably a departure too far from existing implementations.

From an implementation point of view it's also unclear how to do this:

- either you store raw input alongside the AST for links and images (performance concerns)
- or you try to reconstruct the initial input using source maps (not all implementations use source maps)
- or you mandate different handling for links and images (one produces AST, one doesn't), which is quite horrible
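To make the source-map variant concrete, here is a minimal sketch. It assumes a hypothetical `ImageNode` that only remembers the offsets of its description in the input; no real parser's data model is implied.

```python
from dataclasses import dataclass


@dataclass
class ImageNode:
    """Hypothetical image node that records where its description sits in the source."""
    desc_start: int  # offset of the first character after "!["
    desc_end: int    # offset of the "]" that closes the description


def raw_description(source: str, node: ImageNode) -> str:
    # Option 1: copy the description verbatim instead of walking the child nodes.
    return source[node.desc_start:node.desc_end]


source = "![*hello* <bar>]()"
node = ImageNode(desc_start=2, desc_end=source.index("]("))
print(raw_description(source, node))
```

The slice itself is cheap; the cost is that every image node has to carry positions, and the parser still has to find the closing `]` by the usual bracket-matching rules, which is where the literal `]` problem comes from.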

Option 2 - keep existing behavior

We now have 3 choices of what to do with ![hello <textarea>]():

- <img alt="hello <textarea>"> - that's commonmark.js
- <img alt="hello &lt;textarea&gt;"> - that's cmark
- <img alt="hello "> - that's haskell

Need to decide which one is correct.

Note that some implementations have an option to disable the html rule entirely (at least we do). Having different results based on whether the parser is able to parse html might be undesirable.

wooorm · 2023-11-20T08:00:28Z

In the OP I discussed these choices too, and I advised going with existing behavior, but not emitting html tags. Which is like opt 2 haskell

rlidwka · 2023-11-20T08:09:37Z

@jgm, now actually answering your question:

> What is the precise change to the spec you think should be made?

Maybe the spec should specify how to transform the AST into "plain string content" for the purposes of forming image alt:

<text>hello</text> -> hello
<code>hello</code> -> hello
<softbreak /> -> \n
<html_block><a></html_block> -> <a> (same for inline)

The last rule makes sense to me personally (since it leads to the same behavior whether the html rule exists or not). But maybe serializing HTML into an empty string is more "correct" in theory.
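The rules above could be sketched roughly as follows. The `Node` class is a hypothetical stand-in for an inline AST, using the node type names from the list, and the `keep_html` flag is an assumption added here to switch between the two candidate treatments of HTML.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical inline AST node; type names follow the list above."""
    type: str
    literal: str = ""
    children: List["Node"] = field(default_factory=list)


def plain_string(node: Node, keep_html: bool = True) -> str:
    if node.type in ("text", "code"):
        return node.literal
    if node.type == "softbreak":
        return "\n"
    if node.type in ("html_inline", "html_block"):
        # keep_html=True copies the tag as text; False serializes it to "".
        return node.literal if keep_html else ""
    # Containers (emph, strong, link, ...) contribute only their children.
    return "".join(plain_string(child, keep_html) for child in node.children)


# ![hello <textarea>]() parsed with the html rule enabled ...
with_html_rule = Node("image", children=[
    Node("text", "hello "),
    Node("html_inline", "<textarea>"),
])
# ... and with the html rule disabled, where the tag is ordinary text.
without_html_rule = Node("image", children=[Node("text", "hello <textarea>")])

for keep in (True, False):
    print(keep, repr(plain_string(with_html_rule, keep)),
          repr(plain_string(without_html_rule, keep)))
```

With `keep_html=True` both parses give the same string; with `keep_html=False` they differ, which is the inconsistency mentioned above. Either way the result is a plain string that still has to be escaped when it is written into the `alt` attribute.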

Also, I'd like to add a test to the spec along these lines:

If you use special symbols in image alt, you can wrap them into code span:
![`*em* <link>`]()

It's not going to fail anywhere (all parsers keep contents of code span as is hopefully), but it may be a useful suggestion for markdown writers (prettier/prettier#15140) on how to deal with special characters.

(or mention in any other way that automated software should escape user content inside image alt when auto-generating markdown)
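For tools that generate markdown, one conservative way to do that escaping is to backslash-escape every ASCII punctuation character, which CommonMark allows. The sketch below does exactly that; `escape_alt` is a name invented for this example, and a real serializer would probably escape far fewer characters.

```python
import re
import string

# CommonMark lets any ASCII punctuation character be backslash-escaped,
# so escaping all of them is safe, if noisy.
_PUNCTUATION = re.compile("[" + re.escape(string.punctuation) + "]")


def escape_alt(text: str) -> str:
    """Backslash-escape ASCII punctuation so the text is read literally."""
    return _PUNCTUATION.sub(lambda match: "\\" + match.group(0), text)


alt = "*em* <link>"
print("![" + escape_alt(alt) + "]()")
```

Unlike the code span approach, this keeps the description as plain text nodes, and it also covers a literal `]` or backtick in the description.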

wooorm mentioned this issue Sep 2, 2022:
Consider preventing autolinks in links #719 (Open)

This was referenced Sep 12, 2022:
HTML tags inside image alt shouldn't be parsed markdown-it/markdown-it#896 (Closed)
MD033 flagging HTML tags in image alt text strings DavidAnson/markdownlint#579 (Closed)

wooorm mentioned this issue Feb 2, 2023:
Generated Markdown is missing line breaks between heading and image syntax-tree/mdast-util-to-markdown#59 (Closed, 4 tasks)

This was referenced Jul 20, 2023:
MD033 triggered by "elements" inside of image description DavidAnson/markdownlint#913 (Closed)
Markdown: Prettier unescapes symbols in image descriptors prettier/prettier#15140 (Open)
