This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Tokenizer's num_words filtering is based on word's index #311

Open
pierreelliott opened this issue Oct 13, 2020 · 2 comments

Comments

pierreelliott commented Oct 13, 2020

In the Tokenizer's texts_to_sequences_generator method, the num_words check is based on the word's index. I understand that this check is fast, but wouldn't it be a problem if the ordering changed (i.e., if the index were no longer based on frequency)?

for w in seq:
    i = self.word_index.get(w)
    if i is not None:
        if num_words and i >= num_words:
            # Index-based cutoff: any word whose index is >= num_words
            # is mapped to OOV (if an oov_token is set) or dropped.
            if oov_token_index is not None:
                vect.append(oov_token_index)
        else:
            vect.append(i)
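
For context, here is a minimal sketch of the intended behavior (assuming keras_preprocessing is installed; the example corpus is made up): after fit_on_texts, word_index is ordered by descending frequency, so "index >= num_words" amounts to "not among the most frequent words".

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3)  # keeps indices 1 and 2 (index 0 is reserved)
tokenizer.fit_on_texts(['the cat sat', 'the cat ran', 'the dog barked'])

# fit_on_texts assigns the lowest indices to the most frequent words:
print(tokenizer.word_index)  # {'the': 1, 'cat': 2, ...}

# 'sat' has index >= num_words, so it is silently dropped (no oov_token set).
print(tokenizer.texts_to_sequences(['the cat sat']))  # [[1, 2]]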

Dref360 (Contributor) commented Oct 14, 2020

Hello,
Note: I'm far from an expert in NLP.

Do you have an example where you wouldn't use frequency?

As long as word_index is sorted in order of importance, it should work, I think.

pierreelliott (Author) commented

Hi,

In my current project, we defined an external index/word mapping, as our dataset often changes but our vocabulary does not. So the tokens won't always be sorted in order of importance.

For the record, I don't need this particular method (yet, I think...), but the assumption the check makes about the data feels a little too rigid to me.
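
To illustrate, here is a hypothetical sketch of that scenario (the vocabulary is invented): word_index is assigned from an external mapping instead of fit_on_texts, so the indices no longer reflect frequency and the index-based check filters out the wrong words.

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3, oov_token='<OOV>')
# External, fixed vocabulary: 'the' is very frequent in the corpus but
# happens to have a high index in the mapping.
tokenizer.word_index = {'<OOV>': 1, 'cat': 2, 'the': 7}

# 'the' is replaced by the OOV index purely because 7 >= num_words,
# not because it is rare.
print(tokenizer.texts_to_sequences(['the cat']))  # [[1, 2]]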
