Skip to content
This repository was archived by the owner on Jan 15, 2024. It is now read-only.

[BUGFIX] Fix Vocab with unknown_token remapped to != 0 via token_to_idx arg#862

Merged
leezu merged 3 commits intodmlc:masterfrom
leezu:fixvocabunktokenremapping
Aug 4, 2019
Merged

[BUGFIX] Fix Vocab with unknown_token remapped to != 0 via token_to_idx arg#862
leezu merged 3 commits intodmlc:masterfrom
leezu:fixvocabunktokenremapping

Conversation

@leezu
Copy link
Contributor

@leezu leezu commented Aug 3, 2019

Description

#732 forgot to remap the default specified in DefaultLookupDict, but only remapped the idx_to_token / token_to_idx data structures.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Fix Vocab with unknown_token remapped to != 0 via token_to_idx arg

Comments

  • Needs to be backported to 0.7

@leezu leezu requested a review from eric-haibin-lin August 3, 2019 20:26
@leezu leezu requested a review from szha as a code owner August 3, 2019 20:26
@leezu leezu mentioned this pull request Aug 3, 2019
@mli
Copy link
Member

mli commented Aug 3, 2019

Job PR-862/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-862/1/index.html

leezu added 3 commits August 3, 2019 21:13
Confirmed that vocab[vocab.unknown_token] still == 0 for all models created
prior to the flexible vocab PR. Ie:

- book_corpus_wiki_en_uncased
- wiki_multilingual_uncased
- openwebtext_book_corpus_wiki_en_uncased
- wiki_multilingual_cased
- wiki_cn_cased
@leezu leezu force-pushed the fixvocabunktokenremapping branch from 5723c73 to 112e133 Compare August 3, 2019 21:17
@mli
Copy link
Member

mli commented Aug 3, 2019

Job PR-862/2 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-862/2/index.html

@codecov
Copy link

codecov bot commented Aug 3, 2019

Codecov Report

Merging #862 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #862      +/-   ##
==========================================
+ Coverage   90.17%   90.17%   +<.01%     
==========================================
  Files          66       66              
  Lines        6342     6344       +2     
==========================================
+ Hits         5719     5721       +2     
  Misses        623      623
Impacted Files Coverage Δ
src/gluonnlp/vocab/vocab.py 97.31% <100%> (+0.02%) ⬆️

for token in special_tokens:
assert token in vocab, "Token %s not found in the vocab" % token
assert vocab['RandomWordByHaibin'] == 0
assert vocab['RandomWordByHaibin'] == vocab[vocab.unknown_token]
Copy link
Member

@eric-haibin-lin eric-haibin-lin Aug 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not my random word 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

6d72708 :)

@leezu leezu merged commit 0008445 into dmlc:master Aug 4, 2019
@leezu leezu deleted the fixvocabunktokenremapping branch August 4, 2019 09:28
leezu added a commit that referenced this pull request Aug 5, 2019
* Fix Vocab with unknown_token remapped to != 0 via token_to_idx arg

* Add test

* Update test_pretrained_bert_models

Confirmed that vocab[vocab.unknown_token] still == 0 for all models created
prior to the flexible vocab PR. Ie:

- book_corpus_wiki_en_uncased
- wiki_multilingual_uncased
- openwebtext_book_corpus_wiki_en_uncased
- wiki_multilingual_cased
- wiki_cn_cased

* Use cuda 10.1 on CI

* Disable false positive pylint warnings

Fixed in master branch by #838
Fix does not apply here as it would require dropping py2.

* Don't test against v1.6 beta and v1.4.1 at the same time.

Test v0.7x branch only on v1.4 due to incompatible doctest output on both mxnet
versions

* Disable test_finetune_chinese_inference due to DNS error
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants