This repository was archived by the owner on Jan 15, 2024. It is now read-only.

[BUGFIX] Fix handling of duplicate special tokens in Vocabulary #749

Merged
eric-haibin-lin merged 2 commits into dmlc:master from leezu:fixvocabspecialandreservedtokenshandling
Jun 4, 2019

Conversation

@leezu (Contributor) commented Jun 3, 2019

An example use case is in scripts/tests/test_dataprocessor.py, where eos_token == padding_token. Prior to this fix, eos_token == padding_token led to a
corrupted vocabulary index.

Example of the corrupted index:

> v = nlp.Vocab(...)
> v.idx_to_token
['<unk>', '<eos>', '<eos>', 'c', 'b', 'a']
> v.token_to_idx
{'<unk>': 0, '<eos>': 2, 'c': 3, 'b': 4, 'a': 5}
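The corruption can be reproduced with a plain-Python sketch of index construction. The helper names below are illustrative, not gluonnlp's actual code: naive construction appends every special token, so a duplicate (e.g. eos_token == padding_token) gets two slots in idx_to_token while the token_to_idx dict keeps only the last position.

```python
def build_index_naive(special_tokens, counter_tokens):
    # Buggy behavior: duplicates are kept in idx_to_token, and the dict
    # comprehension silently maps the duplicated token to its *last* index.
    idx_to_token = list(special_tokens) + list(counter_tokens)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

def build_index_fixed(special_tokens, counter_tokens):
    # Fixed behavior: de-duplicate while preserving first-seen order,
    # so idx_to_token and token_to_idx stay mutually consistent.
    seen = set()
    idx_to_token = []
    for tok in list(special_tokens) + list(counter_tokens):
        if tok not in seen:
            seen.add(tok)
            idx_to_token.append(tok)
    token_to_idx = {tok: i for i, tok in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

specials = ['<unk>', '<eos>', '<eos>']  # eos_token == padding_token
tokens = ['c', 'b', 'a']

naive_idx, naive_map = build_index_naive(specials, tokens)
# naive_idx  -> ['<unk>', '<eos>', '<eos>', 'c', 'b', 'a']
# naive_map['<eos>'] -> 2  (the second slot; index 1 is unreachable)

fixed_idx, fixed_map = build_index_fixed(specials, tokens)
# fixed_idx  -> ['<unk>', '<eos>', 'c', 'b', 'a']
```

The naive output reproduces the corrupted index shown above exactly.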

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Fix handling of duplicate special tokens when creating a Vocabulary

Comments

  • This fix does not impact backward compatibility: previously serialized
    vocabularies will be restored exactly as serialized.
  • Newly created vocabularies may differ if, up to now, their index would
    have been corrupt.
  • The changes to the deserialization process in
    [FEATURE] Flexible vocabulary #732, and the sanity checks they run during
    deserialization, would prevent deserialization of vocabularies with a
    corrupt index. I added some backwards-compatibility code to #732 for
    loading such corrupted serialized files.

leezu added 2 commits June 3, 2019 12:19
@leezu leezu requested a review from szha as a code owner June 3, 2019 12:35
@codecov (bot) commented Jun 3, 2019

Codecov Report

Merging #749 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #749      +/-   ##
==========================================
- Coverage    90.5%   90.48%   -0.02%     
==========================================
  Files          65       65              
  Lines        6076     6077       +1     
==========================================
  Hits         5499     5499              
- Misses        577      578       +1
Impacted Files Coverage Δ
src/gluonnlp/vocab/vocab.py 97.95% <100%> (+0.01%) ⬆️
src/gluonnlp/data/dataloader.py 83.62% <0%> (-0.87%) ⬇️

@leezu added labels bug (Something isn't working) and release focus (Progress focus for release) Jun 3, 2019
@leezu leezu requested a review from eric-haibin-lin June 3, 2019 12:37
leezu added a commit to leezu/gluon-nlp that referenced this pull request Jun 3, 2019
This adds backwards compatibility for deserializing some vocabularies serialized
before dmlc#749
@mli (Member) commented Jun 3, 2019

Job PR-749/1 is complete.
Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-749/1/index.html

@szha (Member) left a comment

How does it affect deserialization of existing vocab (e.g. concern of index mismatch in token embedding)?

@szha (Member) left a comment

In case this would cause index mismatch for any token embedding, we should also find a way to alert user.

@leezu (Contributor, Author) commented Jun 3, 2019

This does not affect deserialization in the current codebase.

If #732 is merged, deserialization would fail; however, #732 also contains some backward-compatibility code.
That code ensures that deserialization keeps working as it does currently, and prints a warning.

@eric-haibin-lin eric-haibin-lin merged commit ef513bd into dmlc:master Jun 4, 2019
@leezu leezu mentioned this pull request Jun 4, 2019
@leezu leezu deleted the fixvocabspecialandreservedtokenshandling branch June 4, 2019 08:38
