Create Unicode regex util #5019

Closed
jodator opened this issue Apr 25, 2019 · 5 comments
Labels
package:utils resolution:expired This issue was closed due to lack of feedback. status:discussion status:stale type:feature This issue reports a feature request (an idea for a new functionality or a missing option).

Comments

@jodator
Contributor

jodator commented Apr 25, 2019

Whenever we start digging into RTL support or better support for other languages, problems with some regexes surface again and again.

ATM we need better regexes to:

All the groups are defined in the Unicode standard.

I'm not sure if we need all categories right now (thus it might be helpful).

We know that there's already a library called xregexp that adds support for these groups (and other features not present in the JS RegExp engine). It already defines those categories. The lib looks useful to me, but it has all the typical downsides of external libraries:

  1. It was not created at home.
  2. There is a risk that the project gets abandoned.
  3. It is another dependency.
  4. etc.

The XRegExp library compiles to a native JS RegExp, so only a little overhead is added when creating a regexp.

If we don't use the library, we need to create a util that provides the sets of characters that meet our needs and that can be used with the RegExp engine:

// Mimic: /\p{Ps}/

import unicodeRegExp from '@ckeditor/ckeditor5-utils/src/unicoderegexp';

// Proposed util (does not exist yet): returns the Ps (open punctuation) characters as a string.
const openPunctuation = unicodeRegExp.getOpenPunctuation();

// Wrap the characters in [ ] so that any single one of them matches.
const regex = new RegExp( `[${ openPunctuation }]` );

// Then it will be equivalent to:
const expandedRegex = new RegExp( '[\\(\\[\\{\u0F3A\u0F3C\u169B\u201A\u201E\u2045\u207D\u208D\u2308\u230A\u2329\u2768\u276A\u276C\u276E\u2770\u2772\u2774\u27C5\u27E6\u27E8\u27EA\u27EC\u27EE\u2983\u2985\u2987\u2989\u298B\u298D\u298F\u2991\u2993\u2995\u2997\u29D8\u29DA\u29FC\u2E22\u2E24\u2E26\u2E28\u2E42\u3008\u300A\u300C\u300E\u3010\u3014\u3016\u3018\u301A\u301D\uFD3F\uFE17\uFE35\uFE37\uFE39\uFE3B\uFE3D\uFE3F\uFE41\uFE43\uFE47\uFE59\uFE5B\uFE5D\uFF08\uFF3B\uFF5B\uFF5F\uFF62]' );

// Or using xregexp (two-letter category names need braces):
const xregexpRegex = XRegExp( '\\p{Ps}' );
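
To make the util idea a bit more concrete, here is a minimal sketch of how such a helper could turn code point ranges into the body of a character class. The names `toUnicodeEscape` and `buildCharacterClass` are hypothetical and exist neither in CKEditor 5 nor in xregexp, and the sketch assumes BMP code points only, which holds for the Ps characters listed above.

```javascript
// Hypothetical helpers for illustration only (not an existing CKEditor 5 API).

// Converts a BMP code point to a \uXXXX escape usable in a RegExp source string.
function toUnicodeEscape( codePoint ) {
	return '\\u' + codePoint.toString( 16 ).toUpperCase().padStart( 4, '0' );
}

// Builds the inside of a character class from [ start, end ] code point ranges.
function buildCharacterClass( ranges ) {
	return ranges
		.map( ( [ start, end ] ) => start === end ?
			toUnicodeEscape( start ) :
			toUnicodeEscape( start ) + '-' + toUnicodeEscape( end ) )
		.join( '' );
}

// A small, illustrative subset of the Ps category: "(", "[", "{" and U+201A.
const somePs = [ [ 0x28, 0x28 ], [ 0x5B, 0x5B ], [ 0x7B, 0x7B ], [ 0x201A, 0x201A ] ];

const openPunctuationRegExp = new RegExp( '[' + buildCharacterClass( somePs ) + ']' );
```

Emitting `\uXXXX` escapes instead of raw characters sidesteps escaping `[`, `(` and similar characters by hand inside the class.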
@jodator
Contributor Author

jodator commented Apr 25, 2019

@jodator
Contributor Author

jodator commented Apr 25, 2019

PS: The full list of Unicode characters in each category is here: https://www.unicode.org/Public/12.0.0/ucd/extracted/DerivedGeneralCategory.txt.
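
As a rough illustration of how that file could feed a util, the sketch below extracts the code point ranges of one category from text in that file's `start..end ; Category # comment` layout. The function name `parseCategoryRanges` is hypothetical, and the sample lines are hand-written in that layout for the demo, not copied from the file.

```javascript
// Hypothetical parser for illustration only. Expects lines such as:
//   0028          ; Ps #       LEFT PARENTHESIS
//   0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
function parseCategoryRanges( text, category ) {
	const ranges = [];

	for ( const rawLine of text.split( '\n' ) ) {
		// Drop the trailing comment and skip blank or comment-only lines.
		const line = rawLine.split( '#' )[ 0 ].trim();

		if ( !line ) {
			continue;
		}

		const [ codePoints, name ] = line.split( ';' ).map( part => part.trim() );

		if ( name !== category ) {
			continue;
		}

		// A single code point has no "..", so the end defaults to the start.
		const [ start, end = start ] = codePoints.split( '..' );

		ranges.push( [ parseInt( start, 16 ), parseInt( end, 16 ) ] );
	}

	return ranges;
}

// Hand-written sample in the same layout (assumption: not a verbatim excerpt).
const sample = [
	'# General_Category=Open_Punctuation',
	'0028          ; Ps #       LEFT PARENTHESIS',
	'005B          ; Ps #       LEFT SQUARE BRACKET',
	'0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z'
].join( '\n' );

const psRanges = parseCategoryRanges( sample, 'Ps' );
```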

@jodator
Contributor Author

jodator commented Apr 26, 2019

PS: There's a u flag for RegExps, but it is described as:

"unicode"; treat a pattern as a sequence of unicode code points

So, going by that description, it does not provide the Unicode category sequences like \p{Ps}.

Ref:

  1. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp#Regular_expression_and_Unicode_characters
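
One caveat worth noting: ES2018 added Unicode property escapes, which are recognized only when the u flag is set. Not every browser implemented them at the time of this discussion, so the snippet below is only a sketch of what a native solution looks like in an engine that supports them, not a drop-in replacement for the util proposed above.

```javascript
// Requires an engine with ES2018 Unicode property escapes; older engines
// throw a SyntaxError when parsing this literal.
const nativeOpenPunctuation = /\p{Ps}/u;

const matchesParenthesis = nativeOpenPunctuation.test( '(' );
const matchesCornerBracket = nativeOpenPunctuation.test( '\u300C' );
const matchesClosingParenthesis = nativeOpenPunctuation.test( ')' );
```

In a supporting engine the first two checks succeed and the third fails, since `)` belongs to the Pe (close punctuation) category.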

@mlewand mlewand transferred this issue from ckeditor/ckeditor5-utils Oct 9, 2019
@mlewand mlewand added this to the backlog milestone Oct 9, 2019
@mlewand mlewand added status:discussion type:feature This issue reports a feature request (an idea for a new functionality or a missing option). package:utils labels Oct 9, 2019
@pomek pomek removed this from the backlog milestone Feb 21, 2022
@CKEditorBot
Collaborator

There has been no activity on this issue for the past year. We've marked it as stale and will close it in 30 days. We understand it may be relevant, so if you're interested in the solution, leave a comment or reaction under this issue.

@CKEditorBot
Collaborator

We've closed your issue due to inactivity over the last year. We understand that the issue may still be relevant. If so, feel free to open a new one (and link this issue to it).

@CKEditorBot CKEditorBot added the resolution:expired This issue was closed due to lack of feedback. label Nov 7, 2023
@CKEditorBot CKEditorBot closed this as not planned Nov 7, 2023