Clarify link label matching #695

Open
dbuenzli opened this issue Nov 11, 2021 · 1 comment

@dbuenzli

In the 0.30 spec we have:

One label matches another just in case their normalized forms are equal. To normalize a label, strip off the opening and closing brackets, perform the Unicode case fold, [...]

"Perform the Unicode case fold" is a bit unclear, in the sense that I had to consult cmark to see what it was doing. If I understood correctly, this is definition R4 of the Unicode standard (p. 154), so maybe that could be referenced.

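For illustration, here is a minimal Python sketch of the normalization quoted above. It assumes that "the Unicode case fold" means full case folding, which is what Python's `str.casefold()` applies, and that the elided remainder of the spec sentence covers trimming and collapsing whitespace. `normalize_label` is a hypothetical helper name, not something taken from cmark or the spec.

```python
import re

def normalize_label(label: str) -> str:
    """Sketch of link label normalization (assumptions noted above)."""
    # Strip the opening and closing brackets if both are present.
    if label.startswith("[") and label.endswith("]"):
        label = label[1:-1]
    # Full case folding; for example "ß" folds to "ss".
    folded = label.casefold()
    # Assumed remainder of the rule: collapse runs of spaces, tabs and
    # line endings to one space, then trim the ends.
    return re.sub(r"[ \t\r\n]+", " ", folded).strip(" ")

def labels_match(a: str, b: str) -> bool:
    # Two labels match when their normalized forms are equal.
    return normalize_label(a) == normalize_label(b)
```

Under these assumptions, `labels_match("[Straße]", "[STRASSE]")` holds, which would not be the case with plain lowercasing.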

P.S. A better definition would likely have been R5 as it would handle correctly identifiers in different normal forms (like é composed in one id and é decomposed in another one) but you'd need to import the Unicode normalization and associated machinery into the definition of CommonMark.
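The composed versus decomposed "é" case can be made concrete with a small Python sketch. It assumes the normalization-aware comparison meant here is along the lines of Unicode's canonical caseless matching, that is, NFD before and after folding. The function names are hypothetical.

```python
import unicodedata

def fold_only_key(s: str) -> str:
    # Plain case folding, with no normalization step.
    return s.casefold()

def canonical_caseless_key(s: str) -> str:
    # Assumed scheme: NFD(casefold(NFD(s))), so canonically equivalent
    # strings end up with the same key.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

composed = "caf\u00e9"      # "é" as one code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Folding alone keeps the two spellings distinct ...
fold_only_differs = fold_only_key(composed) != fold_only_key(decomposed)
# ... while the normalization-aware key makes them compare equal.
canonical_equal = canonical_caseless_key(composed) == canonical_caseless_key(decomposed)
```

The cost noted in the P.S. is visible here: the second key depends on `unicodedata.normalize`, so the Unicode normalization tables become part of the matching definition.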

@kivikakk
Contributor

I'm currently bringing Comrak up to speed on the changes to CommonMark since GFM was rebased on it, and I hit some difficulty here too, since "Unicode case fold" has no precise meaning.

I might end up imitating the mechanism used in cmark directly (generating code based on CaseFolding-x.0.0.txt) since every Unicode library out there supports a slightly different set of things.
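As a rough idea of what generating a fold table from the data file involves, here is a Python sketch that parses a few lines in the `CaseFolding.txt` format and keeps the common (`C`) and full (`F`) mappings. The excerpt is inlined so the example is self-contained; a real generator would read the whole file. `build_fold_table` and `fold` are hypothetical names, and this is not a description of how cmark's generator works.

```python
# Excerpt in the CaseFolding.txt format: <code>; <status>; <mapping>; # <name>
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def build_fold_table(text: str) -> dict[str, str]:
    """Build a full case folding table from C and F entries only."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        # S (simple) and T (Turkic) entries are skipped for full folding.
        if status in ("C", "F"):
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split()
            )
    return table

def fold(s: str, table: dict[str, str]) -> str:
    # Characters without an entry fold to themselves.
    return "".join(table.get(ch, ch) for ch in s)
```

Choosing which status codes to keep is the point where libraries tend to diverge: a table built from `C` and `S` entries would leave "ß" unchanged, while the `C` and `F` table above folds it to "ss".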

kivikakk added a commit to kivikakk/comrak that referenced this issue Jul 10, 2024
We add `caseless` to do the folding. It matches upstream enough [^1],
unlike e.g. ICU4X's `CaseMapper` (doesn't fold Eszett to "ss"), and also
unlike ICU4X, it doesn't require us to bump our MSRV. 2/2 sgtm

A separate `--gfm-quirks` CLI option is added since base tests fail if
we just turn on all of GFM for them.

The nice thing about `caseless` is that while its last release may be
6 years ago, it depends on unicode-normalization ^0.1, the latest of
which is 5 months ago. It's also [very easy to read][caseless], so I'm
all good with this.

[^1] Not that straightforward: commonmark/commonmark-spec#695

[caseless]: https://github.com/unicode-rs/rust-caseless/blob/v0.2.1/src/lib.rs