[Feature] New transform to annotate with any classifier model with multi-classifier support #924

Open
1 of 2 tasks
Harmedox opened this issue Jan 8, 2025 · 2 comments
Labels
enhancement New feature or request

Comments

Harmedox commented Jan 8, 2025

Search before asking

  • I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

The lang_id transform lets you specify a language identification model, which it uses to annotate each document with a language label and a score.

Challenge:
The lang_id transform is currently limited to language identification models. The proposal is a transform that can annotate documents using any classifier model, irrespective of the underlying classification task. It assumes that the model outputs a label with an associated confidence or probability score. The new transform would have the following enhancements:

  1. A classifier model, or a list of models, can be loaded from a variety of sources, e.g., HF, S3, etc.
  2. The names of the label and score annotations (or a list of label-score name pairs) are configurable.
  3. When a list of models is used, a matching number of label-score name pairs must be configured.
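
To make the three points above concrete, here is a minimal sketch of the annotation step. It is not the proposed implementation: the `annotate` function, the `Classifier` type, the `contents` text field, and the toy classifiers are all hypothetical names chosen for illustration. A real transform would load models from HF, S3, or similar instead of using keyword rules.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical interface: a classifier is any callable mapping text to (label, score).
Classifier = Callable[[str], Tuple[str, float]]


def annotate(
    docs: List[Dict[str, object]],
    classifiers: Sequence[Classifier],
    output_names: Sequence[Tuple[str, str]],
    text_key: str = "contents",  # assumed name of the text field
) -> List[Dict[str, object]]:
    """Return copies of docs with one label and one score field per classifier."""
    # Point 3: the number of label-score name pairs must match the number of models.
    if len(classifiers) != len(output_names):
        raise ValueError(
            f"got {len(classifiers)} classifiers but "
            f"{len(output_names)} label-score name pairs"
        )
    # Guard against two models writing to the same field.
    flat_names = [name for pair in output_names for name in pair]
    if len(set(flat_names)) != len(flat_names):
        raise ValueError("label and score names must be unique")

    annotated = []
    for doc in docs:
        row = dict(doc)  # copy so the input is not mutated
        text = str(row[text_key])
        for classify, (label_name, score_name) in zip(classifiers, output_names):
            label, score = classify(text)
            row[label_name] = label
            row[score_name] = score
        annotated.append(row)
    return annotated


# Toy stand-ins for real models, used only to keep the sketch runnable.
def toy_lang_classifier(text: str) -> Tuple[str, float]:
    return ("en", 0.9) if " the " in f" {text.lower()} " else ("und", 0.5)


def toy_domain_classifier(text: str) -> Tuple[str, float]:
    return ("code", 0.8) if "def " in text else ("other", 0.6)


docs = [{"contents": "The cat sat"}, {"contents": "def f(): pass"}]
result = annotate(
    docs,
    [toy_lang_classifier, toy_domain_classifier],
    [("lang", "lang_score"), ("domain", "domain_score")],
)
```

Treating each classifier as a plain callable keeps the annotation loop independent of where the model came from, so model loading could be handled separately per source.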

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
Harmedox added the enhancement label Jan 8, 2025
Collaborator

touma-I commented Jan 9, 2025

@Harmedox How would this transform be used in pre-training, fine-tuning, RAG, or any other application? Would this be a substitute for the current lang_id? If so, can you quantify the value? I understand "how" you want to do it, but I am not clear on the "why". Can you please elaborate? Thanks.

@shahrokhDaijavad Any point of view on the actual use case for this new transform? Thanks.

Author

Harmedox commented Jan 9, 2025

Text classification is a common use case in data preprocessing. Language identification is one example of how this transform can be used. Another is classification into domain categories such as medical, code, technology, or health, which lets us filter for high-quality, domain-specific content to use in pre-training.

In principle, if I have a classification model (whether it performs language identification or any other task) and I want to annotate each document with the label and confidence score, this is the transform I would use.
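
As a rough illustration of the domain-filtering use case, the sketch below keeps only documents whose annotated label is in a wanted set and whose score clears a threshold. The field names `domain` and `domain_score`, the `filter_by_label` helper, the sample rows, and the 0.8 cutoff are all assumptions made for this example, not part of any existing transform.

```python
from typing import Dict, Iterable, List, Set


def filter_by_label(
    rows: Iterable[Dict[str, object]],
    label_key: str,
    score_key: str,
    wanted_labels: Set[str],
    min_score: float,
) -> List[Dict[str, object]]:
    """Keep rows whose label is wanted and whose score is at least min_score."""
    return [
        row
        for row in rows
        if row[label_key] in wanted_labels and float(row[score_key]) >= min_score
    ]


# Hypothetical rows, as they might look after annotation by a domain classifier.
annotated_rows = [
    {"contents": "Dosage guidance for ...", "domain": "medical", "domain_score": 0.93},
    {"contents": "def parse(x): ...", "domain": "code", "domain_score": 0.88},
    {"contents": "Celebrity gossip ...", "domain": "other", "domain_score": 0.97},
    {"contents": "Maybe about health?", "domain": "medical", "domain_score": 0.41},
]

kept = filter_by_label(
    annotated_rows,
    label_key="domain",
    score_key="domain_score",
    wanted_labels={"medical", "code"},
    min_score=0.8,  # assumed cutoff; a real pipeline would tune this
)
```

Because the filter only reads the label and score fields, it works the same way whether the annotations came from a language identifier or any other classifier.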

Would this be a substitute for the current lang_id?

Yes, it should be. This would be a more generic version of lang_id.
