More concise regex, any way? #14

Open
jiangweiatgithub opened this issue Jan 31, 2021 · 3 comments

@jiangweiatgithub

Given the following strings:
My Work 1
My Work 2
My Work 3
His Work 1
His Work 2
His Work 3
Their Work 1
Their Work 2
Their Work 3
The generated regex is:
/(My Work (1|2|3)|His Work (1|2|3)|Their Work (1|2|3))/

Is there any way to improve it to:
(My|His|Their) Work (1|2|3)
?
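As a quick sanity check on the proposal, the sketch below confirms that the generated pattern and the shorter one accept the same nine inputs and reject the same two near-misses. The `^`/`$` anchors and the two extra probe strings are additions for this check and do not appear in the generated output.

```javascript
// Rebuild the nine input strings from the issue.
const inputs = [];
for (const owner of ['My', 'His', 'Their']) {
  for (const n of [1, 2, 3]) {
    inputs.push(`${owner} Work ${n}`);
  }
}

// The pattern as currently generated, and the proposed shorter form.
// Anchors are added here (an assumption) so each test is a full-string match.
const generated = /^(My Work (1|2|3)|His Work (1|2|3)|Their Work (1|2|3))$/;
const proposed = /^(My|His|Their) Work (1|2|3)$/;

// Both patterns should accept every input...
const bothMatchAll = inputs.every((s) => generated.test(s) && proposed.test(s));

// ...and both should reject strings outside the set.
const bothRejectOthers = ['Our Work 1', 'My Work 4'].every(
  (s) => !generated.test(s) && !proposed.test(s)
);

console.log(bothMatchAll, bothRejectOthers);
```

The shorter form is only safe here because all 3 × 3 combinations of owner and number are present in the input.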

wimpyprogrammer self-assigned this Jan 31, 2021
wimpyprogrammer added the enhancement (New feature or request) label Jan 31, 2021
@wimpyprogrammer
Owner

Hi, thanks for the suggestion! Yeah that's a neat idea. The current logic is a bit simplistic and could be smarter.

My first thought is to calculate the longest common substring (maybe with a library like common-substrings) and use that to determine groupings. Although that approach gives me some unexpected results:

const commonSubstrings = require('common-substrings');

const tests = [
    'My Work 1',
    'My Work 2',
    'My Work 3',
    'His Work 1',
    'His Work 2',
    'His Work 3',
    'Their Work 1',
    'Their Work 2',
    'Their Work 3',
];

commonSubstrings(tests);

/*
[
  { source: [0, 1, 2], name: "My Work ", weight: 24 },
  { source: [0, 3, 6], name: " Work 1", weight: 21 },
  { source: [1, 4, 7], name: " Work 2", weight: 21 },
  { source: [2, 5, 8], name: " Work 3", weight: 21 },
  { source: [3, 4, 5], name: "His Work ", weight: 27 },
  { source: [6, 7, 8], name: "Their Work ", weight: 33 },
]
*/

The common substring of ' Work ' isn't detected, so oddly enough this would still lead to a RegEx like

/(Their Work (1|2|3)|His Work (1|2|3)|My Work (1|2|3))/

Maybe your example is a particularly hard one; I'm sure this approach would still be useful in other cases like:

  • One fish
  • Two fish
  • Red fish
  • Blue fish
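For a list like this one, a minimal sketch of the idea is to find the suffix shared by every string and factor it out of the alternation. `commonSuffix` and `factorSuffix` are hypothetical helpers written for this illustration, not functions from this project or from common-substrings, and the sketch assumes a non-empty list.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Longest suffix shared by every string in a non-empty list.
function commonSuffix(strings) {
  let suffix = strings[0];
  for (const s of strings) {
    // Trim the candidate from the left until this string ends with it.
    while (!s.endsWith(suffix)) {
      suffix = suffix.slice(1);
    }
  }
  return suffix;
}

// Hypothetical helper: move the shared suffix outside the alternation.
function factorSuffix(strings) {
  const suffix = commonSuffix(strings);
  const heads = strings.map((s) =>
    escapeRegExp(s.slice(0, s.length - suffix.length))
  );
  return `(${heads.join('|')})${escapeRegExp(suffix)}`;
}

const fishPattern = factorSuffix(['One fish', 'Two fish', 'Red fish', 'Blue fish']);
console.log(fishPattern); // (One|Two|Red|Blue) fish
```

Because only one side of each string varies, the factored pattern matches exactly the original inputs, which is what makes this case easier than the "My/His/Their Work" one.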

Is this a feature you'd like to work on?

@jiangweiatgithub
Author

jiangweiatgithub commented Jan 31, 2021

Hi Drew, thanks for the quick response!

TypeScript is new to me, but that library looks very promising. I tried adding the minOccurrence option, raising it from the default of 2 to 4 in the last line of your script, commonSubstrings(tests, {minOccurrence: 4, minLength: 3});, and ran it on https://npm.runkit.com/common-substrings. That gives the desired result, but as the only result:
{name: " Work ", source: [0, 1, 2, 3, 4, 5, 6, 7, 8], weight: 54}
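Once a shared substring such as " Work " is known, one possible way to use it is sketched below: split every string around it, collect the distinct prefixes and suffixes, and emit the factored pattern only when every prefix/suffix combination is present in the input. `factorAround` is a hypothetical helper for illustration, not part of this project or of common-substrings, and the pivot is passed in by hand rather than taken from the library's output.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: factor a list of strings around a shared pivot.
// Returns a pattern source string, or null when factoring is not safe.
function factorAround(strings, pivot) {
  const prefixes = new Set();
  const suffixes = new Set();
  for (const s of strings) {
    const at = s.indexOf(pivot);
    if (at === -1) return null; // pivot is not shared by every string
    prefixes.add(s.slice(0, at));
    suffixes.add(s.slice(at + pivot.length));
  }
  // The factored form matches every prefix/suffix combination, so it is
  // only equivalent to the input when all combinations actually occur.
  if (new Set(strings).size !== prefixes.size * suffixes.size) return null;

  const alternation = (set) => `(${[...set].map(escapeRegExp).join('|')})`;
  return `${alternation(prefixes)}${escapeRegExp(pivot)}${alternation(suffixes)}`;
}

const work = [
  'My Work 1', 'My Work 2', 'My Work 3',
  'His Work 1', 'His Work 2', 'His Work 3',
  'Their Work 1', 'Their Work 2', 'Their Work 3',
];
const workPattern = factorAround(work, ' Work ');
console.log(workPattern); // (My|His|Their) Work (1|2|3)
```

The completeness check matters: with 'Their Work 3' removed, the helper returns null rather than a pattern that would also match the missing string.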

I'm thinking of raising this with the original author of that library to see if he can do something.

@hanwencheng

hanwencheng commented Jun 23, 2021

Hi, sorry for the late response. "Work" is not shown in the first list because "My Work 1" consumes its substring "Work" ("My Work 1" fully includes the substring "Work", whereas "My Work" and "Work 1" only overlap). That happens with the default minOccurrence of 2; if you set minOccurrence to 4, the algorithm ignores "My Work 1", so "Work" is found. The algorithm consumes the longest common substrings first rather than the shortest, so it first consumes "My Work 1" and then whatever remains of "Work", but in this case every "Work" has already been consumed. It could start from the shortest instead, but that would require changing the algorithm's code.

For example, if you have the list:

const tests = [
    'My Work 1',
    'My Work 2',
    'My Work 3',
    'His Work 1',
    'His Work 2',
    'His Work 3',
    'Their Work 1',
    'Their Work 2',
    'Their Work 3',
    'Work 5',
    'Work 4',
];

The 'Work' common substring will be listed again.
