
Validation perplexity is 146.71 at the end of training (24 epochs) #3

Open
ygoncharov opened this issue Feb 7, 2016 · 12 comments

@ygoncharov

(it should get ~82 on valid and ~79 on test)

$ python main.py --dataset ptb

.....

epoch: [24] [ 250/ 265] loss: 3.466149
Valid: loss: 5.225354, perplexity: 185.927017
{'perplexity': 83.749542031012467, 'epoch': 24, 'valid_perplexity': 146.71359295576036, 'learning_rate': 0.5}
[] Saving checkpoints...
Test: loss: 4.836956, perplexity: 126.084908
[] Test loss: 4.954320, perplexity: 141.786226
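
(Note: the perplexities in the log are just exp of the corresponding average cross-entropy losses; a quick sanity check in Python, using the test loss reported above:)

```python
import math

# Value copied from the log output above.
test_loss = 4.954320
print(math.exp(test_loss))  # ~141.79, matching the reported test perplexity
```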

@carpedm20
Owner

I'm working on this issue, and I don't think the current implementation differs from the original model. I checked the model's validity by comparing the losses on a single batch during the early epochs, and there were no differences. I also checked that the perplexity on the training set goes down to 90.

[figure: training loss curve]

One thing I'm working on is changing the testing algorithm, which differs from the original. The original code calculates the perplexity over all test data in a single forward pass, whereas this repo calculates the perplexity of the test data the same way as the training data, i.e. batch-averaged perplexity. This should reduce the perplexity somewhat, but I'm not sure it will make the results comparable.

If you find any other differences, feel free to share them with me 😄
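
To make the distinction above concrete, here is a small numpy sketch (illustrative only, not this repo's actual evaluation code) contrasting one reading of batch-averaged perplexity with whole-corpus perplexity; the losses and token counts are made up:

```python
import numpy as np

# Hypothetical per-batch average cross-entropy losses and token counts on the test set.
batch_losses = np.array([4.9, 5.1, 4.7])
batch_tokens = np.array([700, 700, 350])

# One reading of "batch-averaged perplexity": exponentiate each batch's loss, then average.
batch_averaged_ppl = np.mean(np.exp(batch_losses))

# Corpus-level perplexity (the original code's convention, per the comment above):
# exponentiate the token-weighted average loss over the whole test set in one pass.
corpus_ppl = np.exp(np.sum(batch_losses * batch_tokens) / np.sum(batch_tokens))

print(batch_averaged_ppl, corpus_ppl)  # the two conventions generally give different numbers
```

Because the two conventions aggregate the losses differently, numbers computed one way are not directly comparable to numbers computed the other way.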

@carpedm20 carpedm20 added the bug label Feb 7, 2016
@carpedm20 carpedm20 self-assigned this Feb 7, 2016
@yoonkim

yoonkim commented Feb 8, 2016

Cool stuff!
I noticed in the README that you are using 100/150 hidden units for the small/large models respectively. I actually use 300/650 hidden units, so this might explain the difference in performance. Also, it seems like you are using RMSProp? I've found vanilla SGD with a starting learning rate of 1.0 (halved every time the perplexity does not improve on the dev set) to work much better than other optimization methods, including RMSProp.

Hope this helps.
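
A minimal sketch of the schedule described above, assuming hypothetical train_one_epoch and evaluate_perplexity helpers and in-scope model/data objects (not this repo's actual functions):

```python
# SGD with a starting learning rate of 1.0, halved whenever the
# validation perplexity fails to improve (as described above).
learning_rate = 1.0
best_valid_ppl = float("inf")
max_epochs = 25  # assumed value

for epoch in range(max_epochs):
    train_one_epoch(model, train_data, learning_rate)    # assumed helper
    valid_ppl = evaluate_perplexity(model, valid_data)   # assumed helper

    if valid_ppl < best_valid_ppl:
        best_valid_ppl = valid_ppl
    else:
        learning_rate /= 2.0  # halve the learning rate on a dev-set plateau
```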

@carpedm20
Owner

@yoonkim Hi! Thanks for sharing your great work; I enjoyed the paper very much! Actually, the README was outdated and I had forgotten to update it (it's fixed now); the code already uses the same hidden units, optimizer, and decay you mentioned.

@yoonkim

yoonkim commented Feb 8, 2016

Ah ok! A few other things to check may be:

  • batch size
  • parameter initialization (a rough sketch of the latter follows below)
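
For the second bullet, a rough numpy sketch of uniform parameter initialization; the 0.05 range and the shapes are assumptions for illustration (the original Torch code exposes a similar param_init setting), not values read from this repo:

```python
import numpy as np

init_scale = 0.05  # assumed initialization range, not taken from this repo

# Hypothetical parameter shapes, for illustration only.
lstm_weights = np.random.uniform(-init_scale, init_scale, size=(650, 4 * 650))
char_embeddings = np.random.uniform(-init_scale, init_scale, size=(51, 15))
```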

@carpedm20
Owner

Thanks! I'll dig into those. Also, what was the perplexity on the training set at the end of training?

@yoonkim

yoonkim commented Feb 8, 2016

I think it should be a lot lower. I don't recall the numbers exactly, but since the dataset is small and the model has a lot of capacity (even with dropout), training PPL should be well below 50.

@nileshkulkarni

nileshkulkarni commented Jun 2, 2016

@carpedm20 Hi,
Did you find any pointers on this issue of high test perplexity? I was trying to debug it, and any help would be appreciated.

@yss4

yss4 commented Jul 4, 2016

@carpedm20 Hello, thanks for sharing your code on GitHub. I also noticed that the problem of high perplexity on the PTB test set is still unresolved. Have you had a chance to deal with this issue, or do you have any pointers for fixing it? Thanks in advance.

@carpedm20
Owner

@nileshkulkarni @yss4 No, I haven't found the cause of the problem yet, and I'm not working on this project right now. But if you notice any code that differs from the original paper, please share it and I'll take a look.

@mkroutikov

mkroutikov commented Sep 16, 2016

@carpedm20 This implementation is NOT identical to the original.

Interested readers can have a look at my code here:
https://github.com/mkroutikov/tf-lstm-char-cnn
which does reproduce Yoon Kim's result in TF.

@hejunqing

I ran the code yesterday and got an averaged validation PPL of 156.097 and an averaged test PPL of 149.565. So I am reading your code and the original. The first difference I found is the criterion: yours is cross-entropy (CE) while the original uses NLL. Does it matter?
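
On the CE-vs-NLL question: with a softmax output and index (one-hot) targets, cross-entropy and the negative log-likelihood of the log-softmax are the same quantity, so the naming difference alone should not change the loss. A minimal numpy check:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])  # scores for a toy 3-word vocabulary
target = 1                           # index of the true next word

# Cross-entropy computed from softmax probabilities.
probs = np.exp(logits) / np.sum(np.exp(logits))
ce = -np.log(probs[target])

# NLL computed from log-softmax outputs.
log_probs = logits - np.log(np.sum(np.exp(logits)))
nll = -log_probs[target]

print(ce, nll)  # identical up to floating-point error
```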

@guanghuixu

Thanks for sharing your code. I want to know how I can train a model at the word level. I found your code has settings like (use_char = True, use_word = False). Is it enough to set use_word = True? Looking forward to your answer, thank you.
