- This repository holds every attempt and model tried for the Billion Word Imputation Kaggle Challenge.
The task defined in the Kaggle challenge is: given a "billion" words forming sentences pulled from the Chelba et al. training dataset, and a smaller testing dataset of sentences with one word removed from each, train something (no suggestions given) to both find the location of the missing word and fill it in with the correct word. My approach began by omitting the subtask of locating the index of the missing word, because the two usual ways to complete it (an N-gram model or Word Distance Statistics) were too computationally expensive for my machine and difficult to implement properly in a pipeline. Instead, I took my training set and created my own set of missing words, marking the removed indices with either a "[MASK]" symbol or its token id, 103, as sketched below.
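A minimal sketch of how such a masked training set could be built is shown below. The 15% masking rate matches the data description further down; the function itself is illustrative, not the exact script used in the notebooks.

```python
import random

MASK_TOKEN = "[MASK]"   # BERT's mask placeholder; its token id is 103
MASK_RATE = 0.15        # roughly 15% of the words in each sentence

def mask_sentence(sentence, rate=MASK_RATE, rng=random):
    """Replace a random subset of words with the [MASK] placeholder."""
    words = sentence.split()
    n_to_mask = max(1, int(len(words) * rate))  # always hide at least one word
    for idx in rng.sample(range(len(words)), n_to_mask):
        words[idx] = MASK_TOKEN
    return " ".join(words)

print(mask_sentence("Elk calling is now its own sport ."))
```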
With this new training set in place, I could approach the task of filling in the correct word using a Masked Language Model (MLM). Three strategies were attempted: Masked Language Modeling with BERT, Masked Language Modeling with BERT and HuggingFace, and a Next Word Prediction BI-LSTM.
My best (and only working) model, Masked Language Modeling with BERT, predicted masked words with probabilities on the order of 0.02 ± 0.02. Comparison against the Kaggle leaderboard is not applicable because I could not make a submission, having greatly altered both the challenge and my approach.
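For context, masked-word probabilities like those quoted above can be read off a BERT fill-mask pass such as the one sketched here. It uses the HuggingFace `transformers` fill-mask pipeline; the `bert-base-uncased` checkpoint is an assumption, and the repository notebooks contain the actual code.

```python
from transformers import pipeline

# Fill-mask pipeline; the bert-base-uncased checkpoint is an assumption.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "Elk calling is now its own [MASK] ."
for candidate in fill_mask(sentence, top_k=5):
    # `score` is the softmax probability of the candidate token, the quantity
    # behind the 0.02 ± 0.02 figure reported above.
    print(f"{candidate['token_str']:>12s}  p = {candidate['score']:.4f}")
```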
- Data:
- Type:
- Input (training): Full sentences in English ranging from 3 to 25 words
- Input (testing): Sentences with approximately 15% of their words masked with placeholders.
- Size: 4.2GB
- Instances (Train, Test, Validation Split): 5000 sentences for training and 1250 masked sentences for testing, giving an 80/20 split.
- Preprocessing: None needed.
- Sample sentences:
- 'The U.S. Centers for Disease Control and Prevention initially advised school systems to close if outbreaks occurred , then reversed itself , saying the apparent mildness of the virus meant most schools and day care centers should stay open , even if they had confirmed cases of swine flu .'
- 'When Ms. Winfrey invited Suzanne Somers to share her controversial views about bio-identical hormone treatment on her syndicated show in 2009 , it won Ms. Winfrey a rare dollop of unflattering press , including a Newsweek cover story titled " Crazy Talk : Oprah , Wacky Cures & You . "'
- 'Elk calling -- a skill that hunters perfected long ago to lure game with the promise of a little romance -- is now its own sport .'
- "Don 't !"
- 'Fish , ranked 98th in the world , fired 22 aces en route to a 6-3 , 6-7 ( 5 / 7 ) , 7-6 ( 7 / 4 ) win over seventh-seeded Argentinian David Nalbandian .'
- Tokenized sentence with padding (representative of the first sample sentence):
- tensor([ 101, 1996, 1057, 1012, 1055, 1012, 6401, 2005, 4295, 2491, 1998, 9740, 3322, 9449, 2082, 3001, 2000, 2485, 2065, 8293, 2015, 4158, 1010, 2059, 11674, 2993, 1010, 3038, 1996, 6835, 10256, 2791, 1997, 1996, 7865, 103, 2087, 2816, 1998, 2154, 2729, 6401, 2323, 2994, 2330, 1010, 2130, 2065, 2027, 2018, 4484, 3572, 1997, 25430, 3170, 19857, 1012, 102, 0, 0, 0, 0, 0, 0, 0,..., 0, 0, 0, 0, 0]) where...
- 101 = [CLS] (classification token, marks the start of the sequence)
- 102 = [SEP] (separator token, marks the end of the sequence)
- 103 = [MASK] (masked/removed token)
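A tensor like the one above can be reproduced with the `bert-base-uncased` tokenizer; the `max_length` of 128 below is an assumption about the padding length used.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

sentence = "Elk calling -- a skill that hunters perfected long ago to lure game with the promise of a little romance -- is now its own sport ."
encoded = tokenizer(
    sentence,
    padding="max_length",  # pad with 0s out to a fixed length, as in the tensor above
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
print(encoded["input_ids"])
print(tokenizer.convert_tokens_to_ids(["[CLS]", "[SEP]", "[MASK]"]))  # [101, 102, 103]
```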
For both the Masked Language Modeling with BERT and the Masked Language Modeling with BERT and HuggingFace models, the input was the unedited full sentences and the expected output was "unmasked" sentences from the test set. Only the Masked Language Modeling with BERT model gave me any tangible output; the other model killed my kernel every time I attempted to train it, regardless of how many parameters I moved around.
For the Next Word Prediction BI-LSTM model, the input was sentences that were "chopped off" at the index of the masked word. The input is a bit different because the model is expected to predict the next word in the sentence, which can then be spliced back together with the remainder of the sentence. The furthest I got with this model was for it to return a list of words it "believed" could be the next word; I could not get it to actually choose one (the sketch below shows one way that final selection could be made).
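To illustrate the missing final step, this sketch collapses a next-word output distribution to a single word and splices it back into the sentence. The `model`, `tokenizer`, and `max_len` arguments stand in for a trained Keras next-word model, its fitted `Tokenizer`, and its input length; the names are hypothetical, not the repository's exact code.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def fill_gap(model, tokenizer, words, mask_index, max_len):
    """Predict the missing word at `mask_index` from the left context and splice it in.

    `words` is the sentence (as a list) with the word already missing, so the
    remainder of the sentence starts at `mask_index`.
    """
    left_context = " ".join(words[:mask_index])              # sentence "chopped off" at the gap
    seq = tokenizer.texts_to_sequences([left_context])
    seq = pad_sequences(seq, maxlen=max_len, padding="pre")
    probs = model.predict(seq, verbose=0)[0]                 # distribution over the vocabulary
    best_id = int(np.argmax(probs))                          # choose the single most likely word
    index_to_word = {i: w for w, i in tokenizer.word_index.items()}
    predicted = index_to_word.get(best_id, "[UNK]")
    return " ".join(words[:mask_index] + [predicted] + words[mask_index:])
```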
Training was set to run on CPU for the Masked Language Modeling with BERT and Masked Language Modeling with BERT and HuggingFace models; the last model was not pinned to any device in particular. Training for all models took upwards of 9 hours before the training/testing sets were cut down; after that, training took around 45 minutes. The only training curves I was able to extract were for the Next Word Prediction BI-LSTM model, and they look very linear because of my machine's inability to complete a sufficient number of epochs without killing the kernel. The decision to stop training was dictated largely by how many epochs my machine could handle. I did try Google Colab to offload the computational work to a remote machine, but was met with a similar demise.
- Training curve: Next Word Prediction BI-LSTM (accuracy vs. epochs)
- Training curve: Next Word Prediction BI-LSTM (loss vs. epochs)
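For reference, the device handling mentioned above amounts to something like the following; `bert-base-uncased` is an assumed checkpoint, and on a machine or Colab runtime with a GPU the same code would pick it up automatically.

```python
import torch
from transformers import BertForMaskedLM

# Prefer a GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to(device)
print(f"Training on: {device}")
```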
- Given that my attempts did not end in comparable or interpretable results, I will state that an MLM pipelined with either an N-gram model or Word Distance Statistics (WDS) to locate the missing word would be the most effective route (a sketch of the N-gram locating step follows this list).
- Next steps would be to take another stab at integrating the missing-word-locating models into the picture and to look more closely into HuggingFace's pretrained embedders.
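One way the N-gram locating step could look: train a bigram model on the full training sentences and flag the weakest word-to-word transition in a test sentence as the most likely gap. The add-alpha smoothing and the "lowest bigram probability" heuristic below are simplifying assumptions, not a tested implementation.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = s.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def locate_gap(sentence, unigrams, bigrams, alpha=1.0):
    """Return the index where a word most plausibly went missing."""
    words = sentence.split()
    vocab = len(unigrams)

    def p(w1, w2):  # add-alpha smoothed P(w2 | w1)
        return (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * vocab)

    scores = [p(w1, w2) for w1, w2 in zip(words, words[1:])]
    return scores.index(min(scores)) + 1  # insert position after the weakest transition
```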
Results in their current state are not readily reproducible.
- MLM with Bert and HuggingFace Directory contains the script for the respective model.
- MLM with Bert Directory contains the script for the respective model and the model file.
- NWP BI-LSTM Directory (Next Word Prediction BI-LSTM) contains the script for the respective model.
- Note that all of these notebooks should contain enough text for someone to understand what is happening.
- pandas
- NumPy
- PyTorch
- HuggingFace Transformers
- Keras
- TensorFlow
- tqdm