Improve keyword recognition in categorisation: word plural forms #214

Open
2 of 4 tasks
KarenJewell opened this issue Jan 7, 2023 · 3 comments
Labels
good first issue Good for newcomers

Comments

@KarenJewell
Member

KarenJewell commented Jan 7, 2023

Is your feature request related to a problem? Please describe.
The current ODSCategories keyword map contains many contextually duplicate keywords to accommodate the various forms of a keyword. For example, "school" and "schools" are both listed to cover the plural form of the word. This "duplication" adds unnecessary bulk to the mapping: it lengthens the time needed to categorise and makes the mapping more cumbersome to curate.

The ultimate aim of this task is to:

  • reduce the size of the category-keyword map,
  • while also maintaining or improving the volume of matched datasets having those keywords.

Describe the solution you'd like

  • Required: we can regex match keywords in their plural form (an "s" suffix)
  • Required: ODSCategories_Keywords must retain the keyword as it appears in the mapping (not the variant found in the dataset)
  • Optional: consider other forms of plurals ("ies", "es")
  • Optional: consider plural forms in phrases (word groups)
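
The two required points above could be sketched roughly as follows. This is only an illustration under stated assumptions: `KEYWORD_CATEGORIES`, `build_pattern` and `match_keywords` are hypothetical names, and the sample keywords and categories are made up rather than taken from the real ODSCategories map.

```python
import re

# Hypothetical keyword-to-category map; the real ODSCategories map is not shown here.
KEYWORD_CATEGORIES = {
    "school": "Education",
    "hospital": "Health",
}


def build_pattern(keyword: str) -> re.Pattern:
    """Match the keyword as a whole word, with an optional trailing 's'."""
    return re.compile(r"\b" + re.escape(keyword) + r"s?\b", re.IGNORECASE)


def match_keywords(text: str) -> dict[str, str]:
    """Return {mapping keyword: category} for every keyword found in text.

    The key is always the keyword as written in the map, never the
    variant (e.g. "Schools") that appeared in the dataset text.
    """
    return {
        keyword: category
        for keyword, category in KEYWORD_CATEGORIES.items()
        if build_pattern(keyword).search(text)
    }
```

With a pattern like this, only "school" would need to be listed in the map: both "School meals" and "Primary Schools" would match, and the recorded keyword would be "school" in either case.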

Describe alternatives you've considered
Stemming. We could stem each word down, but stemming might introduce all manner of complexities we may not need to handle just yet. Simple "s" suffix matching will let us remove a whole chunk of our current keyword-category map - low-hanging fruit.

Additional context
See relevant docs: How-to-modify-category-keywords

@fozy81
Contributor

fozy81 commented May 26, 2023

Will take a look at this issue.

@fozy81
Contributor

fozy81 commented Jun 17, 2023

I've submitted PR #240, please review @KarenJewell

The PR only handles the simple case of a trailing 's' (upper or lowercase). It doesn't deal with other plurals ("ies", "es").
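
For reference, the remaining optional forms ("ies", "es") might be handled along these lines. This is a hypothetical sketch and not part of #240: `plural_pattern` is an assumed name, and the suffix rules are a rough approximation of English pluralisation that will miss irregular plurals.

```python
import re


def plural_pattern(keyword: str) -> re.Pattern:
    """Build a whole-word regex matching a keyword or a common plural of it."""
    lowered = keyword.lower()
    escaped = re.escape(keyword)
    if len(lowered) > 1 and lowered.endswith("y") and lowered[-2] not in "aeiou":
        # consonant + y: library -> library | libraries
        body = re.escape(keyword[:-1]) + r"(?:y|ies)"
    elif lowered.endswith(("s", "x", "z", "ch", "sh")):
        # sibilant ending: church -> church | churches
        body = escaped + r"(?:es)?"
    else:
        # default: school -> school | schools
        body = escaped + r"s?"
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)
```

Under these assumptions, `plural_pattern("library")` would match "Public Libraries" while the map keeps only the singular "library".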

@JackGilmore
Member

Merged #240


3 participants