Add vocabulary and embedding #10074

Merged

Commits (20 in total; the changes shown below are from 18 commits):
- 4e2f8e9 [MXNET-67] Sync master with v1.1.0 branch (#10031) (yzhliu)
- 59f0306 Parallelization for ROIpooling OP (#9958) (xinyu-intel)
- 1e270b1 comments to copy and copyto are corrected (#10040) (chsin)
- 63074ce Bug Fix and performance optimized for rtc (#10018) (chinakook)
- df974e0 set embedding
- 9c806f5 Code and test revised
- 5797aab api implementation done
- a2215ca license and news
- c69cb07 readme and cpp
- 1863d91 pylint disable
- c378669 Add API doc
- 5edca9d less pylint disable
- c208477 remove contrib
- 56d5307 move to gluon, revise api doc
- 5ba2225 fix import order
- 47d7ed4 re-test
- 63923db relative imports
- 616cff9 re-run test
- 14735e1 revise implementation, test case, and api doc
- 240ef86 re-test
@@ -0,0 +1,332 @@

# Gluon Text API

## Overview

The `mxnet.gluon.text` APIs refer to classes and functions related to text data processing, such
as building indices for text tokens, loading pre-trained embedding vectors for them, and storing
the vectors in the `mxnet.ndarray.NDArray` format.

This document lists the text APIs in `mxnet.gluon`:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.gluon.text.embedding
    mxnet.gluon.text.vocab
    mxnet.gluon.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are
imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.gluon import text
>>> import collections

```
### Access pre-trained word embeddings for indexed words

As a common use case, let us access pre-trained word embedding vectors for indexed words in just a
few lines of code.

To begin with, let us create a fastText word embedding instance by specifying the embedding name
`fasttext` and the pre-trained file name `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Now, suppose that we have a simple text data set in the string format. We can count word frequency
in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
This counting step is needed because a vocabulary is built from token frequencies: the counts
decide which words get indexed and in what order (most frequent first).
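
For a quick sanity check, we can inspect `counter` directly. A sketch of what to expect, assuming
`count_tokens_from_str` returns a `collections.Counter` (the counts below follow from `text_data`;
the display order may differ):

```python
>>> counter
Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})

```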

Suppose that we want to build indices for all the keys in `counter` and load the defined fastText
word embedding for all such indexed words. We need a `Vocabulary` instance with `counter` and
`fasttext` as its arguments.

```python
>>> my_vocab = text.Vocabulary(counter, embedding=fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words, such as 'hello'
and 'world'.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```
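
Since the accessed embedding vectors are ordinary NDArrays, we can compute with them directly. As a
minimal sketch, assuming only standard `mxnet.ndarray` operations, the cosine similarity of the two
vectors above is:

```python
>>> vecs = my_vocab.embedding[['hello', 'world']]
>>> nd.dot(vecs[0], vecs[1]) / (nd.norm(vecs[0]) * nd.norm(vecs[1]))

```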
|
||
### Using pre-trained word embeddings in `gluon` | ||
|
||
To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain | ||
indices of the words 'hello' and 'world'. | ||
|
||
```python | ||
>>> my_vocab[['hello', 'world']] | ||
[2, 1] | ||
|
||
``` | ||
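
The reverse mapping is also available: the API reference below lists a `to_tokens` member on
`Vocabulary`. Assuming it accepts a list of indices, the round trip looks like this:

```python
>>> my_vocab.to_tokens([2, 1])
['hello', 'world']

```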
|
||
We can obtain the vector representation for the words 'hello' and 'world' by specifying their | ||
indices (2 and 1) and the weight matrix `my_vocab.embedding.idx_to_vec` in | ||
`mxnet.gluon.nn.Embedding`. | ||
|
||
```python | ||
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape | ||
>>> layer = gluon.nn.Embedding(input_dim, output_dim) | ||
>>> layer.initialize() | ||
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec) | ||
>>> layer(nd.array([2, 1])) | ||
|
||
[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01 | ||
... | ||
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02] | ||
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01 | ||
... | ||
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]] | ||
<NDArray 2x300 @cpu(0)> | ||
|
||
``` | ||
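
Because `my_vocab[...]` returns plain integer indices, the two steps can also be chained, avoiding
the hard-coded indices 2 and 1; this produces the same 2x300 NDArray as above:

```python
>>> layer(nd.array(my_vocab[['hello', 'world']]))

```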
|
||
## Vocabulary | ||
|
||
The vocabulary builds indices for text tokens and can be assigned with token embeddings. The input | ||
counter whose keys are candidate indices may be obtained via | ||
[`count_tokens_from_str`](#mxnet.gluon.text.utils.count_tokens_from_str). | ||
|
||
|
||
```eval_rst | ||
.. currentmodule:: mxnet.gluon.text.vocab | ||
.. autosummary:: | ||
:nosignatures: | ||
|
||
Vocabulary | ||
``` | ||
|
||
Suppose that we have a simple text data set in the string format. We can count word frequency in the | ||
data set. | ||
|
||
```python | ||
>>> text_data = " hello world \n hello nice world \n hi world \n" | ||
>>> counter = text.utils.count_tokens_from_str(text_data) | ||
|
||
``` | ||
|
||
The obtained `counter` has key-value pairs whose keys are words and values are word frequencies. | ||
Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown | ||
token representation '(unk)' and a reserved token '(pad)'. | ||
|
||
```python | ||
>>> my_vocab = text.Vocabulary(counter, max_size=2, unknown_token='(unk)', | ||
... reserved_tokens=['(pad)']) | ||
|
||
``` | ||
|
||
We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping | ||
indices to tokens), `unknown_token` (representation of any unknown token) and `reserved_tokens` | ||
(reserved tokens). | ||
|
||
|
||
```python | ||
>>> my_vocab.token_to_idx | ||
{'(unk)': 0, '(pad)': 1, 'world': 2, 'hello': 3} | ||
>>> my_vocab.idx_to_token | ||
['(unk)', '(pad)', 'world', 'hello'] | ||
>>> my_vocab.unknown_token | ||
'(unk)' | ||
>>> my_vocab.reserved_tokens | ||
['(pad)'] | ||
>>> len(my_vocab) | ||
4 | ||
>>> my_vocab[['hello', 'world']] | ||
[3, 2] | ||
``` | ||
|
||
Besides the specified unknown token '(unk)' and reserved_token '(pad)' are indexed, the 2 most | ||
frequent words 'world' and 'hello' are also indexed. | ||
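
Any token that was not indexed maps to the index of the unknown token. For example, 'hi' and 'nice'
were counted but fell outside `max_size=2`, so both should look up to index 0 (the index of '(unk)'
shown above):

```python
>>> my_vocab[['hi', 'nice']]
[0, 0]

```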

### Assign token embedding to vocabulary

A vocabulary instance can be assigned a token embedding.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Let us define a fastText word embedding instance with the pre-trained file `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Suppose that we want to build indices for the 2 most frequent keys in `counter` and load the defined
fastText word embedding for these 2 words.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, max_size=2, embedding=fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

Let us define a GloVe word embedding instance with the pre-trained file `glove.6B.50d.txt`. Then,
we can assign it to the vocabulary in place of the fastText embedding via `set_embedding`.

```python
>>> glove = text.embedding.create('glove', file_name='glove.6B.50d.txt')
>>> my_vocab.set_embedding(glove)

```

Now we are ready to access the GloVe word embedding vectors for indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ -0.38497001 0.80092001
...
0.048833 0.67203999]
[ -0.41486001 0.71847999
...
-0.37639001 -0.67541999]]
<NDArray 2x50 @cpu(0)>

```

If a token is unknown to `my_vocab`, its embedding vector is initialized according to the default
specification in `glove` (all elements are 0).

```python
>>> my_vocab.embedding['nice']

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 50 @cpu(0)>

```
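
If we want 'nice' to receive a pre-trained vector instead, one option is to rebuild the vocabulary
without the size cap so that every counted word is indexed. A sketch, reusing `counter` and `glove`
from above and assuming that omitting `max_size` indexes all keys in `counter`:

```python
>>> bigger_vocab = text.Vocabulary(counter, embedding=glove)
>>> bigger_vocab.embedding['nice']  # now a 50-dimensional pre-trained GloVe vector

```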

## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use
[`embedding.create(embedding_name, file_name)`](#mxnet.gluon.text.embedding.create).

To get all the available `embedding_name` and `file_name` values, use
[`embedding.get_file_names()`](#mxnet.gluon.text.embedding.get_file_names).

```python
>>> text.embedding.get_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}

```

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use
[`TokenEmbedding.from_file`](#mxnet.gluon.text.embedding.TokenEmbedding.from_file).
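
As a hedged sketch of the custom route: `my_embedding.vec` below is a hypothetical file, we assume
the common text format of one token per line followed by its vector elements, and we assume
`from_file` takes the file path as its first argument (see the API reference below for the exact
signature):

```python
>>> my_embedding = text.embedding.TokenEmbedding.from_file('my_embedding.vec')

```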

```eval_rst
.. currentmodule:: mxnet.gluon.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_file_names
    TokenEmbedding
    GloVe
    FastText
```

See [Assign token embedding to vocabulary](#assign-token-embedding-to-vocabulary) for how to assign
token embeddings to a vocabulary and use them.

### Implement a new text token embedding

To implement a new `embedding`, create a subclass of `mxnet.gluon.text.embedding.TokenEmbedding`
and add the decorator `@mxnet.gluon.text.embedding.TokenEmbedding.register` before the class. See
[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/text/embedding.py)
for examples.
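
A minimal sketch of what such a subclass might look like; `MyTokenEmbedding` is hypothetical, and
the exact constructor contract of `TokenEmbedding` should be checked against `embedding.py`:

```python
>>> @text.embedding.TokenEmbedding.register
... class MyTokenEmbedding(text.embedding.TokenEmbedding):
...     """A hypothetical embedding: load vectors from a custom source, then
...     defer to the TokenEmbedding base class for index bookkeeping."""
...     def __init__(self, **kwargs):
...         super(MyTokenEmbedding, self).__init__(**kwargs)

```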

## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.gluon.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
```

## API Reference

<script type="text/javascript" src='../../_static/js/auto_module_index.js'></script>

```eval_rst

.. automodule:: mxnet.gluon.text.embedding
    :members: register, create, get_file_names
.. autoclass:: mxnet.gluon.text.embedding.TokenEmbedding
    :members: from_file
.. autoclass:: mxnet.gluon.text.embedding.GloVe
.. autoclass:: mxnet.gluon.text.embedding.FastText

.. automodule:: mxnet.gluon.text.vocab
.. autoclass:: mxnet.gluon.text.vocab.Vocabulary
    :members: set_embedding, to_tokens

.. automodule:: mxnet.gluon.text.utils
    :members: count_tokens_from_str

```
<script>auto_index("api-reference");</script>
@@ -0,0 +1,26 @@

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# coding: utf-8
# pylint: disable=wildcard-import
"""This module includes utilities for indexing and embedding text."""

from .vocab import *

from . import embedding

from .utils import *
Review comment: It doesn't seem necessary to create vocab just to access embedding vector.

Reply: resolved