This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Add vocabulary and embedding #10074

Merged
merged 20 commits into apache:nlp_toolkit on Mar 15, 2018

Conversation

astonzhang
Member

Description

Add vocabulary and embedding

Checklist

Essentials

  • Passed code style checking (make lint)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, the expected performance on the test set, and a reference to the original paper if applicable
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

Changes

  • utils (and, when applicable, API doc)
  • vocab (and, when applicable, API doc)
  • embedding (and, when applicable, API doc)

Comments

  • NA

yzhliu and others added 9 commits March 11, 2018 15:58
* [REVIEW REQUIRED] Revert PR #9484 & add additional dependency licenses to LICENSE file (#9701)

* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (#9484)"

This reverts commit 8930d96.

* Some more LICENSE fixes

* Adding some more packages to the LICENSE file

* Adding dependencies of dependencies

* update v1.1.0 change log to NEWS.md

* sync README.md from v1.1.0 branch

* revert to correct jenkins url in README
* parallelization for roipooling

* remove some useless computation

* remove useless muls

* add author and retriggering

* retrigger again
* Bug Fix and performance optimized for rtc

1. "super().__init__()" bug is fixed in python 2.
2. Kernel is initialized in the stage of operator init.

* Update custom_softmax_rtc.py

fix unnecessary format
astonzhang requested a review from szha as a code owner on March 12, 2018, 16:27


def test_token_embedding_from_file():
    embed_root = 'embedding'
Contributor:
Please use a tempfile instead
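A minimal sketch of the suggested change, assuming Python 3 (tempfile.TemporaryDirectory does not exist on Python 2) and that the test only needs a scratch directory to write embedding files into:

```python
import tempfile

def test_token_embedding_from_file():
    # A temporary directory replaces the fixed 'embedding' path, so the test
    # cleans up after itself and cannot collide with concurrent runs.
    with tempfile.TemporaryDirectory() as embed_root:
        ...  # rest of the test, unchanged, writing files under embed_root
```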



def test_vocab_set_embedding_with_one_custom_embedding():
    embed_root = 'embedding'
Contributor:
Please use a tempfile instead



def test_vocabulary_with_two_custom_embeddings():
    embed_root = '.'
Contributor:
Please use a tempfile instead

# coding: utf-8
# pylint: disable=consider-iterating-dictionary
# pylint: disable=super-init-not-called
# pylint: disable=arguments-differ
Contributor:
Are the last two pylint ignores really invalid?

piiswrong (Contributor):
This should be in gluon. If we put vocab and embedding in mxnet.text and textdataset in gluon, it's going to be really confusing.

:func:`~mxnet.contrib.text.embedding.create`.

Suggested change:

:func:`~mxnet.text.embedding.create`.

Examples
--------
>>> @mxnet.contrib.text.embedding.register

Suggested change:

>>> @mxnet.text.embedding.register

Member Author: Thanks!
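For context, a hedged sketch of how the register decorator would be used after the suggested rename; `MyTokenEmbedding` is a made-up class, and the `mxnet.text` module path follows the reviewer's suggestion rather than a released API:

```python
from mxnet import text  # path per the suggestion above, not a released API

@text.embedding.register
class MyTokenEmbedding(text.embedding.TokenEmbedding):
    """Once registered, the class becomes creatable by name, e.g.
    text.embedding.create('mytokenembedding', ...)."""
    def __init__(self, **kwargs):
        super(MyTokenEmbedding, self).__init__(**kwargs)
```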

cjolivier01 (Member) left a comment:

Please add a JIRA

@@ -29,10 +29,10 @@
import warnings
import zipfile

from . import _constants as C
from mxnet import ndarray as nd
from mxnet import nd
Member:
We usually use relative imports.
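A short sketch of the style the reviewer means, mirroring the existing `from . import _constants as C`; the three-dot depth is an assumption that the file sits at mxnet/gluon/text/embedding.py (a hypothetical location):

```python
from . import _constants as C
# Relative form of `from mxnet import ndarray as nd`: `...` climbs from
# mxnet.gluon.text up to the mxnet package root.
from ... import ndarray as nd
```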


```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)
```
Member:
It doesn't seem necessary to create vocab just to access embedding vector.

Member Author: resolved.



The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Member:
Should explain why a counter is needed first.
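A hedged sketch of the point being requested: the counter exists to decide which tokens get indexed when the vocabulary is built. The argument names below follow mxnet.contrib.text.vocab.Vocabulary and may differ in the final gluon API:

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)
>>> # The counter's frequencies control indexing: rare tokens can be cut off
>>> # via min_freq, and the vocabulary size capped via most_freq_count.
>>> vocab = text.vocab.Vocabulary(counter, most_freq_count=None, min_freq=1)
```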

Member Author: resolved.

file is:

'(token_1)(ed)(v_11)(ed)(v_12)(ed)...(ed)(v_1k)\\\\n
(token_2)(ed)(v_21)(ed)(v_22)(ed)...(ed)(v_2k)\\\\n...'
Member:
Use an example for the file format inside a code block so that it's easier to understand the file format. Currently it looks confusing. http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-10074/10/api/python/gluon/text.html#mxnet.gluon.text.embedding.TokenEmbedding.from_file
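A hedged illustration of the layout being requested, where each line is a token followed by its k vector elements joined by the element delimiter (ed); here the delimiter is a space, and the tokens, values, and file name are all invented:

```python
# k = 3 in this made-up example; every line must carry the same k elements.
sample = ("hello 0.1 0.2 0.3\n"
          "world 0.4 0.5 0.6\n")
with open('my_embedding.vec', 'w') as f:  # hypothetical file name
    f.write(sample)
```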

Member Author: resolved.

Examples
--------
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')
>>> text_data = " hello world \n hello nice world \n hi world \n"

def __len__(self):
    return len(self._idx_to_token)

def set_embedding(self, embeddings):
Member:
Use (self, *embeddings) instead. embeddings should not be a list. After the change, it should be possible to do:

vocab.set_embedding(fasttext_emb, glove_embed)
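A minimal sketch of the suggested signature change; the body is elided and nothing here is the final implementation:

```python
def set_embedding(self, *embeddings):
    # Accept one or more TokenEmbedding instances as positional arguments
    # rather than a single list, enabling calls like
    # vocab.set_embedding(fasttext_emb, glove_embed).
    for embedding in embeddings:
        ...  # attach each embedding's vectors to this vocabulary's tokens
```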

Member:
Remember to update the doc/example accordingly.

Member Author: resolved, with updated test cases.

frequent words 'world' and 'hello' are also indexed.


### Assign token embedding to vocabulary
Member:
Assign doesn't seem like the right verb. Maybe attach?

Member Author: Resolved.


@property
def reserved_tokens(self):
    return self._reserved_tokens
Member:
Should the reserved_tokens property always include unknown_token, given that they are all indexed first?

Member Author: resolved, with a more detailed API specification.

szha merged commit 092df8f into apache:nlp_toolkit on Mar 15, 2018
szha pushed a commit to szha/mxnet that referenced this pull request Mar 15, 2018
* [MXNET-67] Sync master with v1.1.0 branch (apache#10031)

* [REVIEW REQUIRED] Revert PR apache#9484 & add additional dependency licenses to LICENSE file (apache#9701)

* Revert "[Review Required] Fixing Licenses: Cleaning up the Top Level LICENSE file (apache#9484)"

This reverts commit 8930d96.

* Some more LICENSE fixes

* Adding some more packages to the LICENSE file

* Adding dependencies of dependencies

* update v1.1.0 change log to NEWS.md

* sync README.md from v1.1.0 branch

* revert to correct jenkins url in README

* Parallelization for ROIpooling OP (apache#9958)

* parallelization for roipooling

* remove some useless computation

* remove useless muls

* add author and retriggering

* retrigger again

* comments to copy and copyto are corrected (apache#10040)

* Bug Fix and performance optimized for rtc (apache#10018)

* Bug Fix and performance optimized for rtc

1. "super().__init__()" bug is fixed in python 2.
2. Kernel is initialized in the stage of operator init.

* Update custom_softmax_rtc.py

fix unnecessary format

* set embedding

* Code and test revised

* api implementation done

* license and news

* readme and cpp

* pylint disable

* Add API doc

* less pylint disable

* remove contrib

* move to gluon, revise api doc

* fix import order

* re-test

* relative imports

* re-run test

* revise implementation, test case, and api doc

* re-test
10 participants