CounterVectorizer does not support strip_accents=unicode #1144

Open
thang-le-klaviyo opened this issue Dec 18, 2024 · 1 comment

Comments

@thang-le-klaviyo

Hi!
For one of our models, we use TfidfVectorizer with strip_accents='unicode' as one of its parameters. When we try to convert our pipeline with convert_sklearn, we get an error saying that only strip_accents=None is supported. Could support for strip_accents='unicode' be added? Thanks!
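For context, here is a minimal sketch of the text transformation that strip_accents='unicode' is understood to apply in scikit-learn: NFKD normalization followed by dropping combining marks. It uses only the standard library; the function name mirrors scikit-learn's helper, but this is a re-implementation written for illustration, not the library's code.

```python
import unicodedata


def strip_accents_unicode(text: str) -> str:
    """Illustrative stand-in for the accent stripping requested in this issue."""
    # Pure-ASCII input has no accents to remove.
    if text.isascii():
        return text
    # NFKD splits an accented character into its base character plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents_unicode("café crème à la carte"))  # cafe creme a la carte
```

A converted ONNX graph would need to reproduce this mapping for the vectorizer to produce the same tokens as the original pipeline.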

@xadupre
Collaborator

xadupre commented Dec 19, 2024

We should be able to do that, but not with standard ONNX operators alone: StringNormalizer (https://onnx.ai/onnx/operators/onnx__StringNormalizer.html) does not handle accents. With a regular expression and this op from onnxruntime-extensions, it should work: https://github.com/microsoft/onnxruntime-extensions/blob/main/operators/text/string_ecmaregex_split.cc. Are you OK with taking on an extra dependency?
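To make the regular-expression idea concrete, here is a rough Python sketch, not a converter implementation. It precomputes a fixed table of accented characters and replaces them with a single pattern, which is the kind of finite mapping a regex-based op could encode. The names build_accent_map and strip_accents_regex, and the choice of code point range (U+00C0 to U+024F), are assumptions made for this example; nothing here calls onnxruntime-extensions.

```python
import re
import unicodedata


def build_accent_map(first: int = 0x00C0, last: int = 0x024F) -> dict:
    """Map each precomposed character in the range to its accent-free form."""
    mapping = {}
    for code_point in range(first, last + 1):
        ch = chr(code_point)
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Only record characters that actually change.
        if stripped != ch:
            mapping[ch] = stripped
    return mapping


ACCENT_MAP = build_accent_map()
# One character class that matches every character in the table.
ACCENT_PATTERN = re.compile("[" + "".join(re.escape(c) for c in ACCENT_MAP) + "]")


def strip_accents_regex(text: str) -> str:
    """Replace every accented character found in the precomputed table."""
    return ACCENT_PATTERN.sub(lambda m: ACCENT_MAP[m.group(0)], text)


print(strip_accents_regex("café crème"))  # cafe creme
```

This sketch only covers precomposed characters inside the chosen range. Input that is already decomposed (a base letter followed by a combining mark) or that falls outside the range passes through unchanged, so a real converter would need a wider table or an additional normalization step.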
