[Feature] New transform to annotate with any classifier model with multi-classifier support #924

Harmedox · 2025-01-08T03:40:58Z

Search before asking

I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The lang_id transform lets you specify a language identification model and annotates each document with the detected language and a confidence score.

Challenge:
The lang_id transform is currently limited to language identification models. This proposal is for a transform that annotates documents using any classifier model, irrespective of the underlying classification task. It assumes that the model outputs a label with an associated confidence/probability score. The new transform will have the following enhancements:

A classifier model, or a list of models, can be loaded from a variety of sources, e.g., HF, S3, etc.
The names of the label and score annotations (or a list of label-score name pairs) are configurable.
When a list of models is used, a matching number of label-score name pairs must be configured.
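
To make the configuration rules above concrete, here is a minimal sketch of how the pairing and annotation could work. It is not the proposed implementation: `ClassifierSpec`, `build_specs`, `annotate`, the `contents` text column, and the two toy classifiers are all hypothetical names chosen for illustration, and plain dicts stand in for whatever table format the transform would actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical type: a classifier maps a text to a (label, score) pair.
Classifier = Callable[[str], Tuple[str, float]]


@dataclass
class ClassifierSpec:
    """One classifier plus the names of the two annotations it writes."""
    model: Classifier
    label_column: str
    score_column: str


def build_specs(models: Sequence[Classifier],
                column_names: Sequence[Tuple[str, str]]) -> List[ClassifierSpec]:
    """Pair each model with its (label, score) names, enforcing matching counts."""
    if len(models) != len(column_names):
        raise ValueError(
            f"{len(models)} models but {len(column_names)} label-score name pairs"
        )
    seen = set()
    for pair in column_names:
        for name in pair:
            if name in seen:
                raise ValueError(f"duplicate annotation name: {name}")
            seen.add(name)
    return [ClassifierSpec(m, lab, sc) for m, (lab, sc) in zip(models, column_names)]


def annotate(docs: List[Dict[str, object]],
             specs: Sequence[ClassifierSpec],
             text_column: str = "contents") -> List[Dict[str, object]]:
    """Return copies of the documents with one label and one score per classifier."""
    annotated = []
    for doc in docs:
        row = dict(doc)  # do not mutate the input document
        text = str(doc[text_column])
        for spec in specs:
            label, score = spec.model(text)
            row[spec.label_column] = label
            row[spec.score_column] = score
        annotated.append(row)
    return annotated


# Toy stand-ins for real models, only so that the sketch runs end to end.
def toy_lang_classifier(text: str) -> Tuple[str, float]:
    return ("en", 0.9) if text.isascii() else ("other", 0.6)


def toy_domain_classifier(text: str) -> Tuple[str, float]:
    return ("code", 0.8) if "def " in text else ("general", 0.5)


specs = build_specs(
    [toy_lang_classifier, toy_domain_classifier],
    [("lang", "lang_score"), ("domain", "domain_score")],
)
rows = annotate([{"contents": "def f(): pass"}], specs)
```

With a single model and a single name pair, this reduces to what lang_id does today, which is why the generic version could replace it.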

Are you willing to submit a PR?

Yes I am willing to submit a PR!

touma-I · 2025-01-09T00:03:34Z

@Harmedox How would this transform be used in pre-training, fine-tuning, RAG, or any other application? Would this be a substitute for the current lang_id? If so, can you quantify the value? I understand "how" you want to do it, but I am not clear on the "why". Can you please elaborate? Thanks.

@shahrokhDaijavad Any point of view on the actual use case for this new transform? Thanks.

Harmedox · 2025-01-09T00:23:58Z

Text classification is a common use case in data preprocessing, and language identification is one example of how this transform can be used. Another example is classification into domain categories such as medical, code, technology, or health, which lets us filter high-quality domain-specific content for pre-training.

In principle, if I have a classification model (whether it performs language identification or any other task) and I want to annotate each document with its label and confidence score, this is the transform I would use.
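
As a rough illustration of the downstream filtering step, assuming documents already carry a label and a score annotation: the function name `filter_by_label`, the column names `domain` and `domain_score`, and the 0.7 threshold are all hypothetical, not part of any existing transform.

```python
from typing import Dict, List, Set


def filter_by_label(rows: List[Dict[str, object]],
                    label_column: str,
                    score_column: str,
                    wanted_labels: Set[str],
                    min_score: float) -> List[Dict[str, object]]:
    """Keep rows whose label is wanted and whose score meets the threshold."""
    kept = []
    for row in rows:
        label = row.get(label_column)
        score = float(row.get(score_column, 0.0))  # missing score counts as 0.0
        if label in wanted_labels and score >= min_score:
            kept.append(row)
    return kept


annotated_rows = [
    {"contents": "aspirin dosage guidance", "domain": "medical", "domain_score": 0.92},
    {"contents": "patient intake form", "domain": "medical", "domain_score": 0.41},
    {"contents": "def main(): ...", "domain": "code", "domain_score": 0.88},
]

# Keep only confidently classified medical documents.
medical_rows = filter_by_label(
    annotated_rows, "domain", "domain_score", {"medical"}, min_score=0.7
)
```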

Would this be a substitute to the current lang_id ?

Yes, it should be. This would be a more generic version of lang_id.

Harmedox added the enhancement New feature or request label Jan 8, 2025
