This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

Tokenizer's num_words filtering is based on word's index #311

Open
pierreelliott opened this issue Oct 13, 2020 · 2 comments

Comments

pierreelliott commented Oct 13, 2020

In the Tokenizer's texts_to_sequences_generator method, the num_words check is based on the word's index. I understand that this check is fast, but wouldn't it be a problem if the ordering changed (i.e., if the index were no longer based on frequency)?

for w in seq:
    i = self.word_index.get(w)
    if i is not None:
        if num_words and i >= num_words:
            # Index-based cutoff: any word whose index is >= num_words
            # is mapped to OOV (if an oov_token is set) or dropped.
            if oov_token_index is not None:
                vect.append(oov_token_index)
        else:
            vect.append(i)
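
For context, here is a minimal sketch of the intended behavior (assuming keras_preprocessing is installed; the example corpus is made up): after fit_on_texts, word_index is ordered by descending frequency, so "index >= num_words" amounts to "not among the most frequent words".

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3)  # keeps indices 1 and 2 (index 0 is reserved)
tokenizer.fit_on_texts(['the cat sat', 'the cat ran', 'the dog barked'])

# fit_on_texts assigns the lowest indices to the most frequent words:
print(tokenizer.word_index)  # {'the': 1, 'cat': 2, ...}

# 'sat' has index >= num_words, so it is silently dropped (no oov_token set).
print(tokenizer.texts_to_sequences(['the cat sat']))  # [[1, 2]]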

Dref360 (Contributor) commented Oct 14, 2020

Hello,
Note: I'm far from an expert in NLP.

Do you have an example where you wouldn't use frequency?

As long as word_index is sorted in order of importance, it should work, I think.

pierreelliott (Author) commented

Hi,

In my current project, we defined an external index/word mapping, as our dataset often changes but our vocabulary does not. So the tokens won't always be sorted in order of importance.

For the record, I don't need this particular method (yet, I think...), but the assumption the check makes about the data feels a little too rigid to me.
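
To illustrate, here is a hypothetical sketch of that scenario (the vocabulary is invented): word_index is assigned from an external mapping instead of fit_on_texts, so the indices no longer reflect frequency and the index-based check filters out the wrong words.

from keras_preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=3, oov_token='<OOV>')
# External, fixed vocabulary: 'the' is very frequent in the corpus but
# happens to have a high index in the mapping.
tokenizer.word_index = {'<OOV>': 1, 'cat': 2, 'the': 7}

# 'the' is replaced by the OOV index purely because 7 >= num_words,
# not because it is rare.
print(tokenizer.texts_to_sequences(['the cat']))  # [[1, 2]]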
