
Added PubMed embeddings computed by @jessepeng #519

Merged
Merged 1 commit into release-0.4.1 on Feb 19, 2019

Conversation

alanakbik
Collaborator

@jessepeng computed a character LM over PubMed abstracts and shared the models with us. This PR adds them as FlairEmbeddings.

Init with:

from flair.embeddings import FlairEmbeddings

embeddings_f = FlairEmbeddings('pubmed-forward')
embeddings_b = FlairEmbeddings('pubmed-backward')
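
For a quick sanity check, here is a minimal usage sketch (the example sentence is just illustrative; stacking the two directions is the usual way to combine them):

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings

# load both directions of the PubMed character LM
embeddings_f = FlairEmbeddings('pubmed-forward')
embeddings_b = FlairEmbeddings('pubmed-backward')

# stack forward and backward embeddings into a single representation
stacked = StackedEmbeddings([embeddings_f, embeddings_b])

# embed an example sentence and inspect the per-token vectors
sentence = Sentence('Metformin is a first-line treatment for type 2 diabetes .')
stacked.embed(sentence)
for token in sentence:
    print(token.text, token.embedding.shape)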

@alanakbik alanakbik merged commit aaa8a2f into release-0.4.1 Feb 19, 2019
@alanakbik alanakbik deleted the GH-518-pubmed-flair branch February 19, 2019 13:28
@khituras
Contributor

khituras commented Feb 19, 2019

Are the hidden layer size and the number of layers known for these models? That would be interesting information for comparative experiments.

@alanakbik
Collaborator Author

Hi @khituras - I believe the model was trained with a hidden size of 1150 and 3 layers, with BPTT truncated at a sequence length of 240. It was only trained on a 5% sample of PubMed abstracts up to 2015, which is 1,219,734 abstracts.

@jessepeng is this correct?

@jessepeng

Yes, this is correct. Below are the hyperparameters used for training:
• 3 Layers LSTM
• Hidden size 1150
• Embedding size 200
• Dropout 0.5
• Sequence Length 240
• LR 20
• Batch size 100
• Annealing 0.25
• Gradient clipping 0.25
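
For reference, a rough sketch of how these settings map onto the Tutorial 9 training code; the corpus path and output directory are placeholders, and this is not the exact script that was run:

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# Flair's default character dictionary
dictionary = Dictionary.load('chars')

# character-level corpus over the PubMed abstracts (placeholder path)
corpus = TextCorpus('/path/to/pubmed/corpus',
                    dictionary,
                    True,  # forward direction
                    character_level=True)

# 3-layer LSTM, hidden size 1150, embedding size 200, dropout 0.5
language_model = LanguageModel(dictionary,
                               is_forward_lm=True,
                               hidden_size=1150,
                               nlayers=3,
                               embedding_size=200,
                               dropout=0.5)

# sequence length 240, batch size 100, LR 20; the reported annealing
# and gradient clipping of 0.25 match the trainer defaults
trainer = LanguageModelTrainer(language_model, corpus)
trainer.train('resources/language_models/pubmed-forward',
              sequence_length=240,
              mini_batch_size=100,
              learning_rate=20,
              anneal_factor=0.25,
              clip=0.25)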

@khituras
Contributor

@jessepeng Thank you so much for this specification. Was there some specific evaluation strategy that led you to choose these parameters?
@alanakbik Will those be available in the documentation for the embeddings? I think that would be very important, for any embedding actually, so users know what they are working with and whether it would make sense to train embeddings themselves with different parameters.

@alanakbik
Collaborator Author

Yes, good point - we'll add this to the documentation with the release!

@pinal-patel

Could you share the statistics of the test and validation datasets, and the perplexity of the models on each?

@jessepeng

@khituras No, I chose most of those parameters because they were the standard parameters of Flair. I did, however, choose the number of layers and the hidden dimension to match a word-level LM I also trained on the same corpus. The architecture and hyperparameters I chose for that LM follow Merity et al. (2017).

@pinal-patel The dataset consisting of the aforementioned 1,219,734 abstracts was split 60/10/30 into train/validation/test sets. The perplexities on train/val/test were 2.15/2.08/2.07 for the forward model and 2.19/2.10/2.09 for the backward model.

@shreyashub

@jessepeng Did you start the training from scratch on PubMed abstracts, or did you further fine-tune a model trained on Wikipedia or a similar dataset?
Also, how long did it take and on what hardware?

@shreyashub

@jessepeng ?
@alanakbik, if I need to further train these embeddings on more data, what changes need to be made to Tutorial 9?

@jessepeng

@shreyashub I started training from scratch. I trained each direction for about 10 days on a GeForce GTX Titan X.

@alanakbik
Collaborator Author

Hello @shreyashub, to fine-tune an existing LanguageModel, you only need to load an existing one instead of instantiating a new one. The rest of the training code remains the same as in Tutorial 9:

from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# load an existing LM, such as the one behind a FlairEmbeddings instance
language_model = FlairEmbeddings('news-forward-fast').lm

# reuse its character dictionary and direction (forward/backward)
dictionary = language_model.dictionary
is_forward_lm = language_model.is_forward_lm

# get your corpus, processed at the character level and in the matching direction
corpus = TextCorpus('/path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# use the model trainer to fine-tune this model on your corpus
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)

Note that when you fine-tune, you automatically use the same character dictionary as before and automatically copy the direction (forward/backward).

@shreyashub

shreyashub commented Jun 29, 2019

Since PooledFlairEmbeddings('pubmed-forward').lm does not exist, do we train FlairEmbeddings and then use them in PooledFlairEmbeddings? But I don't think that makes sense. What can I do? @alanakbik

@alanakbik
Collaborator Author

Yes that works - the pooled variant just builds on top of FlairEmbeddings, so you can train with FlairEmbeddings('pubmed-forward').lm and then use the resulting embeddings either as FlairEmbeddings or as PooledFlairEmbeddings.
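
For example, a minimal sketch; the model path below is a placeholder for wherever your own training run saved its best checkpoint:

from flair.embeddings import FlairEmbeddings, PooledFlairEmbeddings

# load the fine-tuned character LM from your training output (placeholder path)
finetuned = FlairEmbeddings('resources/taggers/language_model/best-lm.pt')

# use it directly as contextual string embeddings ...
plain_embeddings = finetuned

# ... or wrap it in the pooled variant, which additionally keeps a
# memory of the embeddings seen for each word
pooled_embeddings = PooledFlairEmbeddings(finetuned)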
