
Add vocabulary and embedding #10074

Merged
merged 20 commits into from
Mar 15, 2018
332 changes: 332 additions & 0 deletions docs/api/python/gluon/text.md
@@ -0,0 +1,332 @@
# Gluon Text API

## Overview

The `mxnet.gluon.text` APIs refer to classes and functions related to text data processing, such
as building indices for text tokens, loading pre-trained embedding vectors for them, and storing
the vectors in the `mxnet.ndarray.NDArray` format.

This document lists the text APIs in `mxnet.gluon`:

```eval_rst
.. autosummary::
    :nosignatures:

    mxnet.gluon.text.embedding
    mxnet.gluon.text.vocab
    mxnet.gluon.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are
imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.gluon import text
>>> import collections

```

### Access pre-trained word embeddings for indexed words

As a common use case, let us access pre-trained word embedding vectors for indexed words in just a
few lines of code.

To begin with, let us create a fastText word embedding instance by specifying the embedding name
`fasttext` and the pre-trained file name `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Now, suppose that we have a simple text data set in the string format. We can count word frequency
in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
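
For instance, printing `counter` for the string above shows each word together with its frequency.
The exact ordering of the display may vary; this sketch assumes `count_tokens_from_str` returns a
`collections.Counter`:

```python
>>> counter
Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})

```
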
Suppose that we want to build indices for all the keys in `counter` and load the defined fastText
word embedding for all such indexed words. We need a Vocabulary instance with `counter` and
`fasttext` as its arguments.

```python
>>> my_vocab = text.Vocabulary(counter, embedding=fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words, such as 'hello'
and 'world'.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

### Using pre-trained word embeddings in `gluon`

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain
indices of the words 'hello' and 'world'.

```python
>>> my_vocab[['hello', 'world']]
[2, 1]

```

We can obtain the vector representation for the words 'hello' and 'world' by specifying their
indices (2 and 1) and the weight matrix `my_vocab.embedding.idx_to_vec` in
`mxnet.gluon.nn.Embedding`.

```python
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```
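
As a quick follow-up, we can, for example, average the two embedding vectors to obtain a simple
fixed-length sentence representation. This is only a minimal illustration on top of the layer
defined above, not part of the text API:

```python
>>> sentence_vec = layer(nd.array([2, 1])).mean(axis=0)
>>> sentence_vec.shape
(300,)

```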

## Vocabulary

The vocabulary builds indices for text tokens and can be assigned token embeddings. The input
counter, whose keys are the candidate tokens to index, may be obtained via
[`count_tokens_from_str`](#mxnet.gluon.text.utils.count_tokens_from_str).


```eval_rst
.. currentmodule:: mxnet.gluon.text.vocab
.. autosummary::
    :nosignatures:

    Vocabulary
```

Suppose that we have a simple text data set in the string format. We can count word frequency in the
data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Suppose that we want to build indices for the 2 most frequent keys in `counter` with the unknown
token representation '(unk)' and a reserved token '(pad)'.

```python
>>> my_vocab = text.Vocabulary(counter, max_size=2, unknown_token='(unk)',
... reserved_tokens=['(pad)'])

```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token` (mapping
indices to tokens), `unknown_token` (representation of any unknown token) and `reserved_tokens`
(reserved tokens).


```python
>>> my_vocab.token_to_idx
{'(unk)': 0, '(pad)': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['(unk)', '(pad)', 'world', 'hello']
>>> my_vocab.unknown_token
'(unk)'
>>> my_vocab.reserved_tokens
['(pad)']
>>> len(my_vocab)
4
>>> my_vocab[['hello', 'world']]
[3, 2]
```

Besides the specified unknown token '(unk)' and the reserved token '(pad)', the 2 most frequent
words 'world' and 'hello' are also indexed.
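
We can also map indices back to tokens. Assuming `to_tokens` accepts a list of indices, and
recalling that any word absent from the vocabulary (such as 'nice' or 'hi' here) maps to the index
of the unknown token:

```python
>>> my_vocab.to_tokens([3, 2])
['hello', 'world']
>>> my_vocab[['nice', 'hi']]
[0, 0]

```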


### Assign token embedding to vocabulary

A vocabulary instance can be assigned a token embedding.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
Let us define the fastText word embedding instance with the pre-trained file `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Suppose that we want to build indices for the 2 most frequent keys in `counter` and load the
defined fastText word embedding for these 2 words.

```python
>>> my_vocab = text.vocab.Vocabulary(counter, max_size=2, embedding=fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

Let us define a GloVe word embedding instance with the pre-trained file `glove.6B.50d.txt`. Then,
we can assign it to the vocabulary, replacing the previously assigned fastText embedding.

```python
>>> glove = text.embedding.create('glove', file_name='glove.6B.50d.txt')
>>> my_vocab.set_embedding(glove)

```

Now we are ready to access the GloVe word embedding vectors for indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ -0.38497001 0.80092001
...
0.048833 0.67203999]
[ -0.41486001 0.71847999
...
-0.37639001 -0.67541999]]
<NDArray 2x50 @cpu(0)>

```

If a token is unknown to `my_vocab`, its embedding vector is initialized according to the default
specification in `glove` (all elements are 0).

```python

>>> my_vocab.embedding['nice']

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 50 @cpu(0)>

```



## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use
[`embedding.create(embedding_name, file_name)`](#mxnet.gluon.text.embedding.create).

To get all the available `embedding_name` and `file_name`, use
[`embedding.get_file_names()`](#mxnet.gluon.text.embedding.get_file_names).

```python
>>> text.embedding.get_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}

```

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use
[`TokenEmbedding.from_file`](#mxnet.gluon.text.embedding.TokenEmbedding.from_file).
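
For illustration, here is a minimal sketch that writes a tiny two-dimensional embedding file and
loads it. The file name and vector values are made up, and we assume that `from_file` takes the
path of a whitespace-delimited file as its first argument and that the loaded embedding can be
indexed with a list of tokens, just like `my_vocab.embedding` above.

```python
>>> elems = 'hello 0.1 0.2\nworld 0.3 0.4\n'
>>> with open('my_embedding.txt', 'w') as f:
...     num_written = f.write(elems)
>>> my_embedding = text.embedding.TokenEmbedding.from_file('my_embedding.txt')
>>> my_embedding[['hello', 'world']].shape  # assumed indexing behavior
(2, 2)

```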


```eval_rst
.. currentmodule:: mxnet.gluon.text.embedding
.. autosummary::
    :nosignatures:

    register
    create
    get_file_names
    TokenEmbedding
    GloVe
    FastText
```

See [Assign token embedding to vocabulary](#assign-token-embedding-to-vocabulary) for how to
assign token embeddings to a vocabulary and use them.


### Implement a new text token embedding

To implement a new text token embedding, create a subclass of
`mxnet.gluon.text.embedding.TokenEmbedding` and add the decorator
``@mxnet.gluon.text.embedding.TokenEmbedding.register`` before this class. See
[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/text/embedding.py)
for examples.
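
A bare-bones sketch of what such a subclass might look like is shown below. The class name, its
constructor arguments, and the loading logic are hypothetical; a real subclass should follow the
conventions of the existing `GloVe` and `FastText` classes in `embedding.py`.

```python
>>> @text.embedding.TokenEmbedding.register
... class MyTextEmbedding(text.embedding.TokenEmbedding):
...     # Hypothetical skeleton. A real subclass loads its pre-trained file in
...     # the constructor; the accepted keyword arguments depend on the
...     # TokenEmbedding base class.
...     def __init__(self, file_name='my_vectors.txt', **kwargs):
...         super(MyTextEmbedding, self).__init__(**kwargs)
...         # Read the vectors from `file_name` and index them here.

```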


## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.gluon.text.utils
.. autosummary::
    :nosignatures:

    count_tokens_from_str
```


## API Reference

<script type="text/javascript" src='../../_static/js/auto_module_index.js'></script>

```eval_rst

.. automodule:: mxnet.gluon.text.embedding
    :members: register, create, get_file_names
.. autoclass:: mxnet.gluon.text.embedding.TokenEmbedding
    :members: from_file
.. autoclass:: mxnet.gluon.text.embedding.GloVe
.. autoclass:: mxnet.gluon.text.embedding.FastText

.. automodule:: mxnet.gluon.text.vocab
.. autoclass:: mxnet.gluon.text.vocab.Vocabulary
    :members: set_embedding, to_tokens

.. automodule:: mxnet.gluon.text.utils
    :members: count_tokens_from_str

```
<script>auto_index("api-reference");</script>
26 changes: 26 additions & 0 deletions python/mxnet/gluon/text/__init__.py
@@ -0,0 +1,26 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# coding: utf-8
# pylint: disable=wildcard-import
"""This module includes utilities for indexing and embedding text."""

from .vocab import *

from . import embedding

from .utils import *