More concise regex, any way? #14
Comments
Hi, thanks for the suggestion! Yeah, that's a neat idea. The current logic is a bit simplistic and could be smarter. My first thought is to calculate the longest common substring (maybe with a library like common-substrings):
const commonSubstrings = require('common-substrings');
const tests = [
'My Work 1',
'My Work 2',
'My Work 3',
'His Work 1',
'His Work 2',
'His Work 3',
'Their Work 1',
'Their Work 2',
'Their Work 3',
];
commonSubstrings(tests);
/*
[
{ source: [0, 1, 2], name: "My Work ", weight: 24 },
{ source: [0, 3, 6], name: " Work 1", weight: 21 },
{ source: [1, 4, 7], name: " Work 2", weight: 21 },
{ source: [2, 5, 8], name: " Work 3", weight: 21 },
{ source: [3, 4, 5], name: "His Work ", weight: 27 },
{ source: [6, 7, 8], name: "Their Work ", weight: 33 },
]
*/
The common substring of /(Their Work (1|2|3)|His Work (1|2|3)|My Work (1|2|3))/ Maybe your example is a particularly hard one; I'm sure this approach would still be useful in other cases like:
Is this a feature you'd like to work on?
Hi Drew, thanks for the quick response! TypeScript is new to me, but that library looks very promising. I have just tried adding the I feel like forwarding this issue to the original author of that library to see if he can do something.
Hi, sorry for the late response. "My Work 1" consumes its substring "Work": "My Work 1" fully includes "Work", whereas "My Work" and "Work 1" only overlap. That is why "Work" is not shown in the first list. The default minOccurrence is 2, but if you set minOccurrence to 4, the algorithm will ignore "My Work 1", so "Work" will be found. The algorithm consumes the longest common substrings first rather than the shortest (it first consumes "My Work 1" and then what remains of "Work", but in our case every "Work" has already been consumed). You could make it start from the shortest instead, but that requires changing the algorithm's code. For example, if you have the following list:
const tests = [
'My Work 1',
'My Work 2',
'My Work 3',
'His Work 1',
'His Work 2',
'His Work 3',
'Their Work 1',
'Their Work 2',
'Their Work 3',
'Work 5',
'Work 4',
];
The 'Work' common substring will be listed again.
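The occurrence counts behind the minOccurrence remark can be checked by hand. The snippet below is only a rough sketch: `occurrences` and `survivors` are hypothetical helpers written for this illustration, not part of common-substrings, and the filter only mimics the idea of a minimum-occurrence threshold rather than the library's actual algorithm.

```javascript
// Hypothetical helper: in how many of the strings does `sub` appear?
const occurrences = (strings, sub) =>
  strings.filter((s) => s.includes(sub)).length;

// Hypothetical helper: keep only candidates that occur in at least
// `minOccurrence` strings (an assumption about what the option means).
const survivors = (strings, candidates, minOccurrence) =>
  candidates.filter((c) => occurrences(strings, c) >= minOccurrence);

const sample = [
  'My Work 1', 'My Work 2', 'My Work 3',
  'His Work 1', 'His Work 2', 'His Work 3',
  'Their Work 1', 'Their Work 2', 'Their Work 3',
];

console.log(occurrences(sample, 'My Work ')); // 3
console.log(occurrences(sample, ' Work '));   // 9
console.log(survivors(sample, ['My Work ', 'His Work ', ' Work '], 4));
```

With a threshold of 4, the three-string candidates drop out and only the substring shared by all nine strings remains, which matches the behaviour described above.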
Given the following strings:
My Work 1
My Work 2
My Work 3
His Work 1
His Work 2
His Work 3
Their Work 1
Their Work 2
Their Work 3
The generated regex is:
/(My Work (1|2|3)|His Work (1|2|3)|Their Work (1|2|3))/
Is there any way to improve it to:
(My|His|Their) Work (1|2|3)
?
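For illustration only, here is a minimal sketch of how the shorter form could be derived when the inputs happen to be every combination of a few prefixes and suffixes around one shared middle part. It is not how this project or common-substrings works; `longestSharedSubstring` and `factorAroundShared` are hypothetical names, and the brute-force search is only meant for small inputs.

```javascript
// Escape characters that have special meaning in a regular expression.
const escapeRegex = (s) => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Longest substring present in every input string (brute force).
function longestSharedSubstring(strings) {
  const [first, ...rest] = strings;
  let best = '';
  for (let i = 0; i < first.length; i++) {
    // Only try candidates longer than the best one found so far.
    for (let j = i + best.length + 1; j <= first.length; j++) {
      const candidate = first.slice(i, j);
      if (!rest.every((s) => s.includes(candidate))) break;
      best = candidate;
    }
  }
  return best;
}

// Returns a regex source such as "(a|b) mid (x|y)", or null when the
// inputs are not exactly the full prefix/suffix product.
function factorAroundShared(strings) {
  const shared = longestSharedSubstring(strings);
  if (!shared) return null;
  const prefixes = new Set();
  const suffixes = new Set();
  for (const s of strings) {
    const at = s.indexOf(shared);
    prefixes.add(s.slice(0, at));
    suffixes.add(s.slice(at + shared.length));
  }
  // Factoring is only safe if every combination is really in the input.
  const input = new Set(strings);
  const complete = [...prefixes].every((p) =>
    [...suffixes].every((x) => input.has(p + shared + x)));
  if (!complete || input.size !== prefixes.size * suffixes.size) return null;
  const group = (set) => `(${[...set].map(escapeRegex).join('|')})`;
  return `${group(prefixes)}${escapeRegex(shared)}${group(suffixes)}`;
}

console.log(factorAroundShared([
  'My Work 1', 'My Work 2', 'My Work 3',
  'His Work 1', 'His Work 2', 'His Work 3',
  'Their Work 1', 'Their Work 2', 'Their Work 3',
])); // (My|His|Their) Work (1|2|3)
```

The completeness check matters: if, say, 'His Work 3' were missing from the input, the factored pattern would match a string that was never given, so the sketch returns null and the caller would fall back to the longer alternation.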