fix: Simplify unicode punctuation #2841
Merged
Marked version: 5.0.5
Markdown flavor: GitHub Flavored Markdown
Description
Cleans up the unicode punctuation from #2811 by using `\p{P}` instead of a long list of unicode characters. A handful of punctuation characters, `` $+<=>`^|~ ``, are not included in that set for whatever reason, so they are still specified here. Includes the accompanying tweaks to a couple of other regexes to apply it correctly. This also lets us cover a slightly larger punctuation set, since my understanding is that JS unicode symbols end at `\uFFFF` but there are a few more after that.

And a tiny unrelated logic simplification in the emStrong Tokenizer.
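A minimal sketch of the idea, not the regex actually used in this PR: `\p{P}` under the `u` flag, plus the extra ASCII characters listed above (Unicode assigns those to the symbol categories rather than to punctuation, which is why `\p{P}` skips them). The name `punctuation` and the sample characters are assumptions chosen for illustration.

```javascript
// Hypothetical name; not taken from the marked source.
// \p{P} requires the `u` flag. The trailing characters are the ASCII
// symbols that \p{P} does not cover.
const punctuation = /[\p{P}$+<=>`^|~]/u;

// '\u3002' is IDEOGRAPHIC FULL STOP; '\u{10100}' is AEGEAN WORD SEPARATOR
// LINE, a punctuation character above \uFFFF.
const samples = ['!', '$', '~', '\u3002', '\u{10100}', 'a', ' '];

for (const ch of samples) {
  console.log(JSON.stringify(ch), punctuation.test(ch));
}
```

With the `u` flag the pattern is matched by code point, so punctuation above `\uFFFF` is covered without listing it explicitly.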
My only question is whether there is a better way to exclude single characters from `\p{P}`. For instance, in emStrong we don't include the current delimiter, `*` or `_`, in the punctuation checks. I get around this now with an additional lookahead regex, something like `(?!_)\p{P}`. This example lets us exclude `_` from the `\p{P}` group. I'm ok with this, but if there is some secret "subtraction from a unicode set" syntax, I would like to know.

I didn't add tests, but could potentially look up some of the characters that were missing previously and add them to the existing unicode test.
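For comparison, here is a sketch of three ways to exclude `_` from `\p{P}`. The first is the lookahead described above. The second is a double-negated character class that works with the `u` flag. The third uses the set-difference syntax (`--`) of the newer `v` flag, which as far as I know is only available in recent engines, so it is feature-detected here. All variable and function names are hypothetical and not taken from the marked source.

```javascript
// 1. Lookahead, as in the description: reject `_`, then match \p{P}.
const viaLookahead = /(?!_)\p{P}/u;

// 2. Double negation: "anything that is neither non-punctuation nor `_`",
//    i.e. punctuation other than `_`.
const viaNegatedClass = /[^\P{P}_]/u;

// 3. Set difference with the `v` flag (assumption: the engine supports it).
//    Built with the RegExp constructor so older engines fail at runtime
//    inside the try block instead of at parse time.
function trySetDifference() {
  try {
    return new RegExp('[\\p{P}--[_]]', 'v');
  } catch {
    return null; // `v` flag not supported here
  }
}

const viaSetDifference = trySetDifference();

for (const ch of ['_', '*', '!', 'a']) {
  console.log(
    JSON.stringify(ch),
    viaLookahead.test(ch),
    viaNegatedClass.test(ch),
    viaSetDifference ? viaSetDifference.test(ch) : 'v flag unavailable'
  );
}
```

The lookahead and the negated class both work wherever `\p{P}` itself is supported, so the choice between them is mostly about readability. The `v`-flag form is the closest thing to "subtraction from a unicode set", but it raises the minimum engine version.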
Contributor
Committer
In most cases, this should be a different person than the contributor.