This repository was archived by the owner on Jan 15, 2024. It is now read-only.

[BUGFIX] Fix handling of duplicate special tokens in Vocabulary #749

Merged
eric-haibin-lin merged 2 commits into dmlc:master from leezu:fixvocabspecialandreservedtokenshandling
Jun 4, 2019

Conversation

@leezu (Contributor) commented Jun 3, 2019

An example use case is in scripts/tests/test_dataprocessor.py, where eos_token == padding_token. Prior to this fix, eos_token == padding_token led to a
corrupted vocabulary index.

Example of the corrupted index:

> v = nlp.Vocab(...)
> v.idx_to_token
['<unk>', '<eos>', '<eos>', 'c', 'b', 'a']
> v.token_to_idx
{'<unk>': 0, '<eos>': 2, 'c': 3, 'b': 4, 'a': 5}
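The corruption can be reproduced with a plain-Python sketch of index construction. The helper names below are illustrative, not gluonnlp's actual code: naive construction appends every special token, so a duplicate (e.g. eos_token == padding_token) gets two slots in idx_to_token while the token_to_idx dict keeps only the last position.

```python
def build_index_naive(special_tokens, counter_tokens):
    # Buggy behavior: duplicates are kept in idx_to_token, and the dict
    # comprehension silently maps the duplicated token to its *last* index.
    idx_to_token = list(special_tokens) + list(counter_tokens)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

def build_index_fixed(special_tokens, counter_tokens):
    # Fixed behavior: de-duplicate while preserving first-seen order,
    # so idx_to_token and token_to_idx stay mutually consistent.
    seen = set()
    idx_to_token = []
    for tok in list(special_tokens) + list(counter_tokens):
        if tok not in seen:
            seen.add(tok)
            idx_to_token.append(tok)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

specials = ['<unk>', '<eos>', '<eos>']  # eos_token == padding_token
tokens = ['c', 'b', 'a']

naive_idx, naive_map = build_index_naive(specials, tokens)
# naive_idx  -> ['<unk>', '<eos>', '<eos>', 'c', 'b', 'a']
# naive_map['<eos>'] -> 2  (the second slot; index 1 is unreachable)

fixed_idx, fixed_map = build_index_fixed(specials, tokens)
# fixed_idx  -> ['<unk>', '<eos>', 'c', 'b', 'a']
```

The naive output reproduces the corrupted index shown above exactly.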

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Fix handling of duplicate special tokens when creating a Vocabulary

Comments

  • This fix does not impact backward compatibility: previously serialized
    vocabularies will be restored exactly as serialized.
  • Newly created vocabularies may differ if, up to now, their index would
    have been corrupt.
  • The changes to the deserialization process in
    [FEATURE] Flexible vocabulary #732, and the sanity checks they run during
    deserialization, would prevent deserialization of vocabularies with a
    corrupt index. I added some backwards-compatibility code to #732 for
    loading such corrupted serialized files.

leezu added 2 commits June 3, 2019 12:19
@leezu leezu requested a review from szha as a code owner June 3, 2019 12:35
@codecov (bot) commented Jun 3, 2019

Codecov Report

Merging #749 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
- Coverage    90.5%   90.48%   -0.02%     
==========================================
  Files          65       65              
  Lines        6076     6077       +1     
==========================================
  Hits         5499     5499              
- Misses        577      578       +1
Impacted Files Coverage Δ
src/gluonnlp/vocab/vocab.py 97.95% <100%> (+0.01%) ⬆️
src/gluonnlp/data/dataloader.py 83.62% <0%> (-0.87%) ⬇️

@leezu added labels bug (Something isn't working) and release focus (Progress focus for release) Jun 3, 2019
@leezu leezu requested a review from eric-haibin-lin June 3, 2019 12:37
leezu added a commit to leezu/gluon-nlp that referenced this pull request Jun 3, 2019
This adds backwards compatibility for deserializing some vocabularies serialized
before dmlc#749
@mli (Member) commented Jun 3, 2019

Job PR-749/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-749/1/index.html

@szha (Member) left a comment

How does it affect deserialization of existing vocab (e.g. concern of index mismatch in token embedding)?

@szha (Member) left a comment

In case this would cause index mismatch for any token embedding, we should also find a way to alert user.

@leezu (Contributor, Author) commented Jun 3, 2019

This does not affect deserialization in the current codebase.

If #732 is merged, deserialization would fail; however, #732 also contains some backward-compatibility code.
That code ensures that deserialization keeps working as it does currently, and prints a warning.

@eric-haibin-lin eric-haibin-lin merged commit ef513bd into dmlc:master Jun 4, 2019
@leezu leezu mentioned this pull request Jun 4, 2019
@leezu leezu deleted the fixvocabspecialandreservedtokenshandling branch June 4, 2019 08:38
