Skip to content

Fix HTML tags appearing in wiki table of contents#36284

Merged
wxiaoguang merged 8 commits intogo-gitea:mainfrom
heathdutton:fix/36106-wiki-toc-strip-html
Jan 23, 2026
Merged

Fix HTML tags appearing in wiki table of contents#36284
wxiaoguang merged 8 commits intogo-gitea:mainfrom
heathdutton:fix/36106-wiki-toc-strip-html

Conversation

@heathdutton
Copy link
Copy Markdown
Contributor

@heathdutton heathdutton commented Jan 2, 2026

Fixes #36106

When wiki headings contain HTML elements (like <a name="anchor"></a>), the raw HTML code was appearing verbatim in the table of contents instead of being stripped out.

Before: ToC displays <a name="asdf"></a> has strange html
After: ToC displays has strange html


Also fix #17958 by the way

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Jan 2, 2026
@github-actions github-actions bot added the modifies/go Pull requests that update Go code label Jan 2, 2026
@lunny
Copy link
Copy Markdown
Member

lunny commented Jan 2, 2026

Could you add some test?

@heathdutton heathdutton force-pushed the fix/36106-wiki-toc-strip-html branch from f0b0b36 to d737929 Compare January 2, 2026 20:22
@lunny
Copy link
Copy Markdown
Member

lunny commented Jan 2, 2026

Whether any a tag in the heading should be ignored or we could have an option to do that?

@heathdutton
Copy link
Copy Markdown
Contributor Author

Good question! I went with stripping all HTML tags rather than just <a> specifically - the ToC should really just show plain text for navigation, so things like <span>, <strong>, etc. wouldn't make sense there either.

The heading itself still renders with the HTML in the document body, so anchor links like <a name="anchor"> still work fine for linking purposes.

I think adding an option would be overkill for this - can't think of a case where someone would actually want raw HTML showing up in their ToC. But happy to discuss if you see it differently!

@delvh
Copy link
Copy Markdown
Member

delvh commented Jan 3, 2026

can't think of a case where someone would actually want raw HTML showing up in their ToC

What about case <a> and <b>?

@delvh
Copy link
Copy Markdown
Member

delvh commented Jan 3, 2026

So, I think 'raw HTML' is useful when it is only accidentally HTML.
And that one can most certainly happen in titles

@heathdutton heathdutton force-pushed the fix/36106-wiki-toc-strip-html branch from d737929 to a1c7525 Compare January 3, 2026 01:22
@heathdutton
Copy link
Copy Markdown
Contributor Author

Good edge case to think about! I tested this and the fix handles it correctly:

## <a href="link">Click</a> and <b>Bold</b>

ToC shows: Click and Bold

The HTML tags get stripped but the text content inside them is preserved - which is exactly what we want for a readable ToC.

I've added test cases covering this scenario. Also verified that code spans like \`` are handled separately by the markdown parser and aren't affected.

@heathdutton heathdutton force-pushed the fix/36106-wiki-toc-strip-html branch from e211fab to 0b84de7 Compare January 3, 2026 03:16
@wxiaoguang wxiaoguang force-pushed the fix/36106-wiki-toc-strip-html branch from 0b84de7 to e211fab Compare January 3, 2026 03:17
@wxiaoguang
Copy link
Copy Markdown
Contributor

wxiaoguang commented Jan 3, 2026

I fixed the tests, it needs to clearly assert what we want.

And we can see that the result doesn't seem good.

By the way: no need to rebase or force push, see the contribution guideline https://github.com/go-gitea/gitea/blob/main/CONTRIBUTING.md#maintaining-open-prs

image

@heathdutton
Copy link
Copy Markdown
Contributor Author

Roger that, I'll step back.

@silverwind silverwind requested a review from Copilot January 23, 2026 13:54
@wxiaoguang
Copy link
Copy Markdown
Contributor

@silverwind this PR still has problems, see #36284 (comment), AI's review won't really help.

@wxiaoguang wxiaoguang marked this pull request as draft January 23, 2026 14:07
@silverwind silverwind review requested due to automatic review settings January 23, 2026 14:14
@silverwind
Copy link
Copy Markdown
Member

Sorry, I was just going by "ready" PRs.

@wxiaoguang
Copy link
Copy Markdown
Contributor

Well, it seems that this PR is unlikely to get progresses.

Then, it becomes my work, again.

Copy link
Copy Markdown
Contributor

@wxiaoguang wxiaoguang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, the result is what it should be.

@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Jan 23, 2026
@wxiaoguang wxiaoguang marked this pull request as ready for review January 23, 2026 16:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request fixes a bug where HTML tags in wiki headings were appearing verbatim in the table of contents instead of being stripped out. The fix refactors the ToC generation from an AST-based approach to an HTML-based approach, extracting plain text from heading nodes and properly escaping it during ToC rendering.

Changes:

  • Refactored ToC generation to work with HTML nodes instead of Markdown AST nodes
  • Added proper text extraction from heading elements that strips HTML tags while preserving text content
  • Implemented HTML-safe ToC rendering using new htmlutil helper functions

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
web_src/js/markup/anchors.ts Added FIXME comment expressing architectural opinion (non-functional)
routers/web/repo/wiki.go Updated to use new ToC rendering approach with proper HTML handling
modules/markup/render.go Added new types TocShowInSectionType and TocHeadingItem for ToC data structure
modules/markup/mdstripper/mdstripper.go Removed WithAutoHeadingID() parser option as heading IDs are now handled in HTML post-processing
modules/markup/markdown/transform_heading.go Deleted (replaced by HTML-based heading processing)
modules/markup/markdown/toc.go Deleted (replaced by RenderTocHeadingItems in html.go)
modules/markup/markdown/prefixed_id.go Deleted (ID prefixing now handled in HTML post-processing)
modules/markup/markdown/markdown.go Removed custom prefix ID generator and AutoHeadingID parser option
modules/markup/markdown/goldmark.go Updated to set ToC mode flags instead of generating AST nodes
modules/markup/html_toc_test.go Added comprehensive test verifying HTML tags are stripped from ToC
modules/markup/html_node.go Enhanced to extract heading text and populate ToC items during HTML post-processing
modules/markup/html.go Added RenderTocHeadingItems function with proper HTML escaping
modules/markup/common/footnote.go Updated footnote IDs to include "user-content-" prefix in renderer
modules/htmlutil/html.go Added HTMLPrintf, HTMLPrint, and HTMLPrintTag helper functions for safe HTML output

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@wxiaoguang wxiaoguang force-pushed the fix/36106-wiki-toc-strip-html branch from 1b6e34c to 560ff13 Compare January 23, 2026 16:44
@wxiaoguang
Copy link
Copy Markdown
Contributor

wxiaoguang commented Jan 23, 2026

Sorry, I was just going by "ready" PRs.

It is ready now. And maybe you can take a look at this FIXME later (in following up PRs) https://github.com/go-gitea/gitea/pull/36284/files#diff-a1ea66825d0703d3cd1da7b9428ad1efecce97e52fb828246cc2b7729283faa0R3

// FIXME: don't see why these tricks make sense. If these prefixes are not needed, they should be removed entirely by backend.

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Jan 23, 2026
@wxiaoguang wxiaoguang enabled auto-merge (squash) January 23, 2026 19:57
@wxiaoguang wxiaoguang merged commit 0f78b99 into go-gitea:main Jan 23, 2026
24 checks passed
@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 23, 2026

// FIXME: don't see why these tricks make sense. If these prefixes are not needed, they should be removed entirely by backend.

The reason for these prefixes is so that markdown content is not in direct control of the id namespace of the page because that may lead to broken JS features or even vulnerabilities.

It would be better if backend sets these prefixes.

@wxiaoguang
Copy link
Copy Markdown
Contributor

// FIXME: don't see why these tricks make sense. If these prefixes are not needed, they should be removed entirely by backend.

The reason for these prefixes is so that markdown content is not in direct control of the id namespace of the page because that may lead to broken JS features or even vulnerabilities.

It would be better if backend sets these prefixes.

But the frontend code just removes the user-content- prefix which was added by backend, so what's the purpose?

image image

@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 24, 2026

The removal on href attributes (but not on id) is so that these prefixes are invisible to the user and their links README.md#a-b work. This behaviour matches GitHub exactly.

@wxiaoguang
Copy link
Copy Markdown
Contributor

wxiaoguang commented Jan 24, 2026

The removal on href attributes (but not on id) is so that these prefixes are invisible to the user and their links README.md#a-b work. This behaviour matches GitHub exactly.

It matches, if eventually these prefixes don't exist when users visit the page, why backend should add them?

// FIXME: don't see why these tricks make sense. If these prefixes are not needed, they should be removed entirely by backend.

@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 24, 2026

why backend should add them

because it's usually best to alter HTML in backend, but I see this mechanism with the removal requires JS to work, so it can not work with JS disabled.

While could move the addition of the prefix to frontend but then we give markdown documents access to the full id namespace, which is bad because it can breaks the page's link targets.

I see no better way. The need for prefixes is to prevent markdown from altering the full id namespace.

@wxiaoguang
Copy link
Copy Markdown
Contributor

The need for prefixes is to prevent markdown from altering the full id namespace.

Sorry I don't understand.

If backend doesn't add these prefixes, isn't it the same as current approach for end users? The same as GitHub? So I don't see why "even vulnerabilities" #36284 (comment)

I don't see what is a "need for prefixes", what is the full "id" namespace. Can you show a real example?

@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 24, 2026

For example, repo homepage has #readme anchor, without the prefix a markdown document could generate the same anchor which results in invalid HTML because you can not have two same ids. When a user opens the page with that anchor, it's undefined behaviour to which anchor the browser would scroll to.

@wxiaoguang
Copy link
Copy Markdown
Contributor

wxiaoguang commented Jan 24, 2026

For example, repo homepage has #readme anchor, without the prefix a markdown document could generate the same anchor which results in invalid HTML because you can not have two same ids.

But you also always remove these prefixes, right? The same as GitHub?

Do you mean :

  • GitHub has vulnerabilities? And they should add user-content- prefix?
  • If no, why Gitea adds the prefix and then removes it?

@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 24, 2026

But you also removed these prefixes, right? The same as GitHub?

No, id is never altered by JS so the page stays without id conflicts at all times.

GitHub has vulnerabilities? And they should add user-content- prefix?

No, Github leaves the ids also unchanged in JS, so id uniqueness is guaranteed with or without JS.

If no, why Gitea adds the prefix and then removes it?

Both Github and Gitea only remove it from href, never from id.

@wxiaoguang
Copy link
Copy Markdown
Contributor

If no, why Gitea adds the prefix and then removes it?

Both Github and Gitea only remove it from href, never from id.

Hmm ... thanks, I think I can understand more, so the comment can be updated to clarify the purpose of that part of code

@silverwind
Copy link
Copy Markdown
Member

So in summary:

  • backend adds prefix to guarantee unique ids on the page
  • js strips prefix so users gets nice prefix-less links
  • js intercepts the hash navigation to add the prefix so the correct prefixed id element is focused

@silverwind
Copy link
Copy Markdown
Member

#36443

@wxiaoguang
Copy link
Copy Markdown
Contributor

wxiaoguang commented Jan 24, 2026

If no, why Gitea adds the prefix and then removes it?

Both Github and Gitea only remove it from href, never from id.

Hmm ... thanks, I think I can understand more, so the comment can be updated to clarify the purpose of that part of code

https://github.com/go-gitea/gitea/pull/36438/files#diff-a1ea66825d0703d3cd1da7b9428ad1efecce97e52fb828246cc2b7729283faa0

Fixed the fixme


Update: just found #36443

@silverwind
Copy link
Copy Markdown
Member

silverwind commented Jan 24, 2026

I find my description in #36443 better, let's continue there.

silverwind added a commit that referenced this pull request Jan 24, 2026
See discussion in #36284.

---------

Signed-off-by: silverwind <me@silverwind.io>
Co-authored-by: wxiaoguang <wxiaoguang@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
zjjhot added a commit to zjjhot/gitea that referenced this pull request Jan 26, 2026
* giteaofficial/main:
  Normalize guessed languages for code highlighting (go-gitea#36450)
  Add `knip` linter (go-gitea#36442)
  Fix various bugs (go-gitea#36446)
  Update tool dependencies (go-gitea#36445)
  Update JS dependencies, adjust webpack config, misc fixes (go-gitea#36431)
  fix: Improve image captcha contrast for dark mode (go-gitea#36265)
  Refactor template render (go-gitea#36438)
  Add documentation for markdown anchor post-processing (go-gitea#36443)
  Fix markup heading parsing, fix emphasis parsing (go-gitea#36284)
  Front port changelog for 1.25.4 (go-gitea#36432)
  Bugfix: Potential incorrect runID in run status update (go-gitea#36437)
  Restrict branch naming when new change matches with protection rules (go-gitea#36405)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. modifies/frontend modifies/go Pull requests that update Go code type/enhancement An improvement of existing functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

html is copied verbatim to ToC in wiki Markdown Formatting tokens in URLs should not be parsed

7 participants