Hello!

Curious if it would be possible to expose a regex `token_pattern` parameter like the one in CountVectorizer? This would help with filtering out (un)wanted characters during tokenization, e.g. hyphens, ampersands, apostrophes, etc.

The workaround I have found so far is to use a custom POS tagger (the `custom_pos_tagger` param of KeyphraseVectorizer) in which I don't change any POS patterns/behaviour, but instead recompile and modify the underlying spaCy tokenizer's prefix, suffix, and infix rules. Is there a simpler way of exposing such behaviour? Keen to hear your thoughts!

Thanks
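For reference, a minimal sketch of the spaCy side of that workaround, adapted from spaCy's documented recipe for modifying infix rules. The model name `en_core_web_sm` is just an example, and the wiring of the modified pipeline back through `custom_pos_tagger` is left out, since only the tokenizer change is illustrated here:

```python
import spacy
from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")

# Rebuild the default infix rules, leaving out the rule that splits
# on hyphens between letters, so "key-phrase" stays one token.
infixes = (
    LIST_ELLIPSES
    + LIST_ICONS
    + [
        r"(?<=[0-9])[+\-\*^](?=[0-9-])",
        r"(?<=[{al}{q}])\.(?=[{au}{q}])".format(
            al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
        ),
        r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
        # Omitted hyphen rule:
        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
        r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
    ]
)
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("state-of-the-art key-phrase extraction")])
# ['state-of-the-art', 'key-phrase', 'extraction']
```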
I'm running into the same issue: I need to keep hyphenated compound words within keyphrases.

Using spaCy's English model, I managed to remove the infix hyphen-splitting rule from the tokenizer before passing the model to the KeyphraseVectorizer. I then tracked the problem further down to where the transform is performed on the CountVectorizer: the compound words are discarded there because they don't match the default `token_pattern`.
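To make that concrete, here is a small standalone example with scikit-learn's CountVectorizer (not the internal one in KeyphraseVectorizers). Its default `token_pattern`, `r"(?u)\b\w\w+\b"`, has no hyphen in its character class, so hyphenated compounds get split; a pattern admitting internal hyphens keeps them whole. The exact replacement pattern below is just one possible choice:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "state-of-the-art key-phrase extraction"

# Default token_pattern r"(?u)\b\w\w+\b" splits on the hyphens.
default = CountVectorizer()
print(default.build_tokenizer()(text))
# ['state', 'of', 'the', 'art', 'key', 'phrase', 'extraction']

# A pattern that allows internal hyphens keeps compounds together.
hyphen_aware = CountVectorizer(token_pattern=r"(?u)\b\w[\w-]+\b")
print(hyphen_aware.build_tokenizer()(text))
# ['state-of-the-art', 'key-phrase', 'extraction']
```

So even with the spaCy tokenizer fixed, the compounds are lost again at this step unless the internal `token_pattern` is also exposed or relaxed.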