Ambiguity in block quote definition #460
Taking the definition at face value, consider some contents to be contained in a block quote:
- - - asdf
- sdfg
which is parsed alone as:
<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</li>
</ul>
By prepending the block quote marker, we get
>- - - asdf
> - sdfg
which is parsed as
<blockquote>
<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>
</blockquote>
Note that this is not the result we wanted, nor is it the one given by the specification, where we start with some contents Cs and prepend block quote markers to contain them. Instead of doing this, we could have started with the markdown
- - - asdf
- sdfg
which is parsed alone as:
<ul>
<li>
<ul>
<li>
<ul>
<li>asdf</li>
</ul>
</li>
</ul>
</li>
<li>sdfg</li>
</ul>
and prepended the block quote marker
>- - - asdf
> - sdfg
This time the result we get is per the specification, because we receive the block structure we started with nested in a block quote. Or we could have started with the same text and instead prepended
>- - - asdf
> - sdfg
and we would again receive the same HTML. However, the markdown we ended up with (by using the process given by the specification) is identical to the first markdown we considered, even though we started with different contents. The specification MUST describe a process to reverse this procedure in order to be unambiguous.
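The collision described above can be sketched in a few lines of Python. This is only an illustration: `prepend_markers` is a hypothetical helper, and the exact indentation of the two contents (two columns versus one column before `- sdfg`) is an assumption, since the spacing in the examples is not recoverable from the text.

```python
def prepend_markers(contents, markers):
    """Writer's-perspective construction: glue one marker onto each content line."""
    return [marker + line for marker, line in zip(markers, contents)]

# Assumed spacing: the second item is indented two columns in one
# version of the contents and one column in the other.
contents_a = ["- - - asdf", "  - sdfg"]
contents_b = ["- - - asdf", " - sdfg"]

quoted_a = prepend_markers(contents_a, [">", ">"])   # bare ">" on both lines
quoted_b = prepend_markers(contents_b, [">", "> "])  # bare ">", then "> "

# Different contents, identical quoted text: the construction cannot be inverted.
print(quoted_a == quoted_b)
```

Under these assumptions both constructions produce the same two lines of markdown, so a reader has no way to tell which contents the writer started from.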
IMO we should aim to preserve consistency in the given indentation, so that lines that line up in the original text still line up when considered as contents of a block quote. I would propose the following process:
Let Ls be an ordered set of lines.
A block quote marker holds the value of either A or B, and is preceded by 0 to 3 spaces.
A line L begins with the block quote marker B if B is a block quote marker at the start of L and A is not a block quote marker at the start of L.
Let Cs be an empty ordered set of lines.
If Cs is non-empty, then a block quote is defined with contents Cs.
This should ensure that if a single B-type block quote marker is used anywhere in the block quote, every line reflects the indentation that results from it; otherwise the block quote marker A is used. It should also preserve the "intuitive" indentation obtained by looking at the lines with respect to one another.
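A rough Python sketch of the marker-selection idea in this proposal might look like the following. It is not taken from any implementation. It assumes that A is `>` followed by one space and B is a bare `>` (the proposal text here does not spell out their values), it ignores tabs and lazy continuation lines, and `quote_contents` is a hypothetical name.

```python
import re

# Assumption: marker A is ">" plus one space, marker B is a bare ">".
# Either may be preceded by 0 to 3 spaces.
MARKER = re.compile(r"^ {0,3}>")

def quote_contents(lines):
    """Return the contents Cs of the block quote that starts at lines[0].

    Returns an empty list if lines[0] does not begin with a marker.
    """
    rests = []
    for line in lines:
        match = MARKER.match(line)
        if match is None:
            break  # lazy continuation lines are out of scope for this sketch
        rests.append(line[match.end():])  # everything after the ">"
    # One B-type line anywhere makes every line use the short marker,
    # so the relative indentation of the lines is left untouched.
    if any(not rest.startswith(" ") for rest in rests):
        return rests
    # Otherwise every line used marker A: drop the single space as well.
    return [rest[1:] for rest in rests]
```

For example, `quote_contents([">- a", ">  - b"])` keeps the two-column offset between the lines, while `quote_contents(["> - a", ">   - b"])` strips the extra space from both lines and arrives at the same contents.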
This definitely needs work, thanks.
The approach of the spec (for better or worse) was to specify
how to construct each of the block and inline element types
(writer's perspective), rather than how to parse (reader's
perspective). If all possible ways of constructing elements are
specified, then it should be possible to write a parser that
recognizes them (and the reference implementations are meant
to show that).
There are pitfalls, though, to this approach. If two
different constructions (for different element types) can
result in the same text, then we have a problem. This is
the kind of problem you're pointing out for block quotes.
There are a couple of places in the text where we resort
to specifying precedences, which isn't really in the spirit
of the writer's-perspective strategy outlined above, but is
necessary to avoid the problem.
Perhaps it would have been better to rewrite the spec from the
reader's (parser's) perspective, but I don't know if I have
energy for that.
I'd probably agree with that, but I think it's salvageable without a complete rewrite. I think the important thing to do is construct a definition from the reader's perspective, and see what that leaves the writer free to do.
For example, if the spec were to say that the writer should stick to a single marker type per block quote, and that the shorter marker takes priority if it is used on any line, then I think that would keep everything consistent with the parsing strategy I outlined above. The key thing, I think, is that the writer should not be using different marker lengths (so that indentation can be unambiguously preserved). The current reference implementation just grabs the longest marker it can find.
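To illustrate why per-line longest-marker matching can disturb alignment, here is a small Python sketch. It is not the reference implementation's code, only an approximation of the behaviour described above; `strip_longest` is a hypothetical name and the example spacing is an assumption.

```python
import re

# Up to 3 leading spaces, then ">", then one following space if there is one.
LONGEST_MARKER = re.compile(r"^ {0,3}> ?")

def strip_longest(lines):
    """Strip the longest available block quote marker from each line independently."""
    return [LONGEST_MARKER.sub("", line, count=1) for line in lines]

quoted = [">- - - asdf", ">  - sdfg"]
contents = strip_longest(quoted)

# In the quoted text the second "-" sits two columns to the right of the first;
# after stripping, it sits only one column to the right.
print(contents)
```

Because the first line has no space after `>` while the second does, the two lines lose different amounts of text, and lines that lined up one way in the source line up differently in the contents.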
I've worked an initial implementation of the algorithm I gave into the parser I'm working on, so the following is now produced (in YAML-ish notation):
>1. > asdf
> > sdfg
> 1. > asdf
> > sdfg
> 1. a
>2. b
> 1. a
>2. b
> 1. a
> 2. b
blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg
blockquote:
  ol:
    li:
      blockquote:
        p:
          text:
            asdf sdfg
blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b
blockquote:
  pre:
    code:
      1. a
  ol start="2":
    li:
      p:
        text:
          b
blockquote:
  ol:
    li:
      p:
        text:
          a
    li:
      p:
        text:
          b
Feel free to throw some test cases at me if you like the approach; I think I'm certainly going to use it.
Require that the same block quote marker be used to avoid ambiguity in parsing strategy (compatible with the algorithm described [here](commonmark#460 (comment)))
This is the simplest fix for commonmark#460 that I can think of that matches the behavior of the reference implementation. It's not simple, because the behavior being described is complex, but it needs to be spelled out.
Let's talk about block quotes.
The following sections need rephrasing.
Which block quote marker? There are two versions of the basic case for each line added.
Again, we have the problem of "which blockquote marker?".
These are not definitions. At best they are multivalued "functions".
They do not describe which text constitutes a block quote; they describe how some contents Cs may be mapped to a block quote.
These maps are not invertible. That is, a single block quote may map back to multiple versions of the contents Cs (by choosing different markers), and you can check that every one of these versions may be mapped to this same block quote by choosing different markers (though of course not uniquely).
Because there is no unique way to determine a block quote's contents, these cannot be definitions.
If these points were specified so that the contents do become uniquely defined, it would be less ambiguous to state the inverse of the current map (so that a block quote may be identified from the lines that actually define it, which is how a parser has to work in practice).