
Restructure Tokenizer and Splitter modules #3002

Merged (2 commits) on Nov 27, 2022
Conversation

alanakbik (Collaborator)

When initializing a Sentence object with use_tokenizer=True (the default), the Sentence constructor triggered an import of SegtokTokenizer. According to this thread, re-importing a module incurs a cost, and since we often create large numbers of Sentence objects, this is unnecessary overhead.

This PR makes the import global. To do this, it splits the flair.tokenization module into tokenization (for all tokenizers) and splitter (for all sentence splitters), so that flair.tokenization no longer imports flair.data.
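The before/after pattern can be sketched as below. This is a minimal illustration, not Flair's actual code: the function names are made up, and `re` stands in for the tokenizer module being imported. An import statement inside a function does not re-execute the module, but it does pay a lookup in `sys.modules` plus the name binding on every call.

```python
import timeit

# Before: the import sits inside the function, so every call pays
# the import-machinery lookup (the module itself is cached and only
# executed once, but the lookup is repeated per call).
def make_sentence_lazy(text):
    import re  # stand-in for the tokenizer import in the constructor
    return re.findall(r"\S+", text)

# After: the import is module-level and happens once at load time.
import re

def make_sentence_global(text):
    return re.findall(r"\S+", text)

lazy = timeit.timeit(lambda: make_sentence_lazy("a a a"), number=100_000)
glob = timeit.timeit(lambda: make_sentence_global("a a a"), number=100_000)
print(f"in-function import: {lazy:.3f}s, module-level import: {glob:.3f}s")
```

The per-call lookup is cheap, which is consistent with the benchmark result reported below in this thread; the main benefit of the change is cleaner module structure.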

@alanakbik (Collaborator, Author)


However, it doesn't seem to make a measurable difference in runtime. Tested with:

```python
import timeit

# timeit() with no arguments runs the statement 1,000,000 times
t = timeit.Timer(setup='from flair.data import Sentence',
                 stmt='Sentence("a a a a a a a a a a ")')
print(t.timeit())
```

@alanakbik alanakbik merged commit 247bff4 into master Nov 27, 2022
@alanakbik alanakbik deleted the import-tokenizer branch November 27, 2022 12:35