CountVectorizer does not support strip_accents='unicode' #1144

thang-le-klaviyo · 2024-12-18T16:20:03Z

Hi!
For one of our models, we use TfidfVectorizer with strip_accents='unicode' as one of its parameters. However, when we try to convert our pipeline with convert_sklearn, we get an error saying that only strip_accents=None is supported. Could support for strip_accents='unicode' be added? Thanks!

xadupre · 2024-12-19T14:15:52Z

We should be able to do that, but not with standard ONNX ops: StringNormalizer (https://onnx.ai/onnx/operators/onnx__StringNormalizer.html) does not handle accents. With a regular expression and this op from onnxruntime-extensions, it should work: https://github.com/microsoft/onnxruntime-extensions/blob/main/operators/text/string_ecmaregex_split.cc. Are you OK with taking an extra dependency?
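
For context, here is a minimal sketch of what strip_accents='unicode' amounts to, written with the Python standard library only: NFKD normalization followed by dropping combining characters. The function name and sample strings are illustrative. It is an assumption, not something confirmed in this thread, that running such a step on the raw text before it reaches the converted model would be an acceptable stopgap while the converter only supports strip_accents=None.

```python
import unicodedata


def strip_accents_unicode(text: str) -> str:
    """Remove accents by NFKD-decomposing and dropping combining marks.

    Illustrative helper: intended to approximate the preprocessing that
    strip_accents='unicode' asks for, not the converter's implementation.
    """
    if text.isascii():
        # Pure ASCII text has no accents to strip.
        return text
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical input documents, cleaned before being sent to the model.
docs = ["Café crème", "naïve façade", "plain ascii"]
cleaned = [strip_accents_unicode(d) for d in docs]
print(cleaned)
```

If this route is taken, the same function would have to be applied both at training time and at inference time so that the vocabulary and the incoming tokens stay consistent.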
