Alternate nested emphasis and strong emphasis delimiters in Markdown writer #10642

Closed
aphedges opened this issue Feb 24, 2025 · 6 comments

@aphedges

Explain the problem.
The Markdown writers should alternate the delimiters used for nested emphasis and strong emphasis so that they do not emit incorrectly formatted output.

I encountered nested <i> tags in the wild (they appear to be relatively common on Wikipedia), and I noticed that nested italics are rendered as strong emphasis instead of nested emphasis:

$ echo '<i><i>A</i></i>' | ./pandoc --from html --to gfm --trace
[trace] Parsed [Plain [Emph [Emph [Str "A"]]]] at line 1
**A**
$ echo '<strong>A</strong>' | ./pandoc --from html --to gfm --trace
[trace] Parsed [Plain [Strong [Str "A"]]] at line 1
**A**

I would instead expect <i><i>A</i></i> to be converted to _*A*_ or *_A_*. Both forms appear to be treated as nested emphasis according to the CommonMark specification (https://spec.commonmark.org/0.31.2/#emphasis-and-strong-emphasis) and Pandoc's own Markdown reader:

$ echo '*A B _*C*_*' | ./pandoc --from gfm --to native
[ Para
    [ Emph
        [ Str "A"
        , Space
        , Str "B"
        , Space
        , Emph [ Emph [ Str "C" ] ]
        ]
    ]
]
$ echo '*A B _*C*_*' | ./pandoc --from gfm --to gfm
*A B **C***

This issue is similar to #9521, but that bug report asks for the formatting to be dropped. I am instead asking that Pandoc not try to "clean" the formatting here and simply write Markdown that it can itself read back in. In addition, #9521 is tagged with format:HTML and reader, whereas this issue should be tagged format:Markdown and writer.
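To make the request concrete, here is a minimal Python sketch of a writer that alternates delimiters by nesting depth. It is not Pandoc's code: the `Str` and `Emph` classes and the `render` function are hypothetical stand-ins for the corresponding AST nodes, and the sketch ignores strong emphasis, escaping, and intraword emphasis.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


Inline = Union[Str, Emph]


def render(inlines: List[Inline], depth: int = 0) -> str:
    """Render a toy inline list, alternating `_` and `*` per nesting level."""
    out = []
    for node in inlines:
        if isinstance(node, Str):
            out.append(node.text)
        else:
            # Even depths use `_`, odd depths use `*`, so two adjacent
            # emphasis levels never merge into a `**` strong delimiter.
            delim = "_" if depth % 2 == 0 else "*"
            out.append(delim + render(node.children, depth + 1) + delim)
    return "".join(out)


print(render([Emph([Emph([Str("A")])])]))  # _*A*_
```

With this scheme the doubly nested input produces `_*A*_`, one of the two forms suggested above, rather than `**A**`.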

I'm not sure how one should handle nested intraword emphasis, but given the limitations of Markdown, it might be best to consider that impossible to write without problems.

Pandoc version?
macOS on Apple Silicon (albeit an x86_64 executable running under Rosetta 2)
pandoc 3.6.3-nightly-2025-02-24
Features: +server +lua
Scripting engine: Lua 5.4

aphedges added the bug label Feb 24, 2025
@jgm
Owner

jgm commented Feb 25, 2025

Pandoc's markdown parser can handle the sort of nested italics that seems sensible, for example

*This is *nested**

What it can't handle is a case where the entire phrase is nested:

**this is nested**

But why on earth would someone write something like that? Can you point to some real-world examples?

@aphedges
Author

I'm sorry that I wasn't clear, but I don't think we are on the same page.

I first noticed this bug on Circadian rhythm - Wikipedia, which includes nested <i> elements in its HTML. Here is the first example in the page's source:

<i><i lang="la"><a href="https://en.wiktionary.org/wiki/circa#Latin" class="extiw" title="wikt:circa">circa</a></i></i>

The problem is that the nested italics are written as **circa** in Markdown, which renders them as bold instead.

It's not particularly relevant, but the reason these nested italics happen is that the inner <i> tag is hidden within a template and not seen by the user writing the wikitext, who added their own italics.

Please let me know how I could have written my initial issue to avoid any confusion.

@jgm
Owner

jgm commented Mar 10, 2025

I see, your explanation about the template helps.

@aphedges
Author

Sorry for not testing this out earlier, but the fix loses formatting information present in the original text. Instead of treating Emph [Emph ils] as ils, would it be possible to treat Emph [Emph ils] as just Emph ils?

@jgm
Owner

jgm commented Mar 17, 2025

Usually when emph is nested, the convention is to alternate between italics and non-italics.

For example, if a book is called "Race in Huckleberry Finn" and another book that discusses this is called "A Commentary on [booktitle]", it will normally be formatted as "A Commentary on Race in Huckleberry Finn".

That is why this choice was made.
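The alternation convention can be sketched in Python as a function that flattens nested emphasis into runs of italic and upright text. This is only an illustration, not Pandoc's implementation: `Str`, `Emph`, and `italic_runs` are hypothetical names, and the nesting of the example title is an assumption made for the demonstration.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Str:
    text: str


@dataclass
class Emph:
    children: List["Inline"]


Inline = Union[Str, Emph]


def italic_runs(inlines: List[Inline], italic: bool = False) -> List[Tuple[str, bool]]:
    """Flatten inlines into (text, is_italic) runs; each Emph level flips the style."""
    runs = []
    for node in inlines:
        if isinstance(node, Str):
            runs.append((node.text, italic))
        else:
            # Entering an Emph toggles between italic and upright.
            runs.extend(italic_runs(node.children, not italic))
    return runs


# Assumed nesting: a title containing a title containing a title.
title = [Emph([Str("A Commentary on "),
               Emph([Str("Race in "),
                     Emph([Str("Huckleberry Finn")])])])]
print(italic_runs(title))

# The doubly wrapped case from this issue ends up upright.
print(italic_runs([Emph([Emph([Str("circa")])])]))
```

Under this rule a phrase wrapped in two emphasis levels comes out with no italics at all, which is the loss of formatting raised in the previous comment.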

@aphedges
Author

That's a good point! Thanks for the explanation!

I guess I'll just preprocess the HTML to fix the nested italics, then.
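One possible shape for that preprocessing step, as a rough sketch using only Python's standard library. The class and function names are hypothetical, and the approach has limits: it targets HTML fragments, drops comments and doctypes, does not treat script or style content specially, and discards the attributes of the inner <i> (for example lang="la").

```python
from html import escape
from html.parser import HTMLParser


class NestedItalicCollapser(HTMLParser):
    """Re-emit HTML, dropping <i> tags that open inside another <i>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.depth = 0  # number of currently open <i> elements

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.depth += 1
            if self.depth > 1:
                return  # nested <i>: skip the opening tag
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "i" and self.depth > 0:
            self.depth -= 1
            if self.depth >= 1:
                return  # closing tag of a nested <i>: skip it
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through unchanged.
        self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape text.
        self.out.append(escape(data, quote=False))


def collapse_nested_italics(html_text: str) -> str:
    parser = NestedItalicCollapser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(collapse_nested_italics("<i><i>A</i></i>"))  # <i>A</i>
```

The result can then be piped to pandoc as before, so that it sees a single level of emphasis.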
