Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple transformer vectorisers from sklearn-based ones #174

Closed
aCampello opened this issue Aug 23, 2022 · 2 comments
Closed

Decouple transformer vectorisers from sklearn-based ones #174

aCampello opened this issue Aug 23, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@aCampello
Copy link

aCampello commented Aug 23, 2022

Description

Thank you very much for the nice repository. We are currently using XLinear, together with the multi-thread Tfidf vectoriser

class Tfidf(Vectorizer):
"""Multithreaded tfidf vectorizer with C++ backend.
Supports 'word', 'char' and 'char_wb' tokenization.
"""

It works as a charm. However, with the new transformer vectorisers, to use Tfidf I need to unnecessarily install all transformers/torch dependencies in my virtual environment. It would be good if the code was decoupled, so a user can just use Tfidf if needed, with a slim environment. In the past, I used to just install pecos with no dependencies, and then only install the specific dependencies for the specific modules - this is desirable for a production environment.

Proposal:

  • That the pertained transformers in
    class PretrainedTransformer(Vectorizer):
    """Vectorizer with a variety of Transformer models."""

    Be moved to an independent file, say pretrained_vectorizers.py, alongside transformers and torch imports.

I'm happy to give this change a go, and contribute with a MR if the maintainers upvote it.

@aCampello aCampello added the enhancement New feature or request label Aug 23, 2022
@aCampello
Copy link
Author

Any thoughts by the maintainers? @OctoberChang @weiliw-amz ?

@OctoberChang
Copy link
Contributor

@aCampello , as suggested, we decoupled the vectorizer.py to contain only tf-idf based vectorizers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants