
Add vocabulary and embedding #10074

Merged (20 commits, Mar 15, 2018)
347 changes: 347 additions & 0 deletions docs/api/python/gluon/text.md
@@ -0,0 +1,347 @@
# Gluon Text API

## Overview

The `mxnet.gluon.text` APIs comprise classes and functions for text data processing, such as
building indices for text tokens, loading pre-trained embedding vectors for them, and storing the
vectors in the `mxnet.ndarray.NDArray` format.

This document lists the text APIs in `mxnet.gluon`:

```eval_rst
.. autosummary::
:nosignatures:

mxnet.gluon.text.embedding
mxnet.gluon.text.vocab
mxnet.gluon.text.utils
```

All the code demonstrated in this document assumes that the following modules or packages are
imported.

```python
>>> from mxnet import gluon
>>> from mxnet import nd
>>> from mxnet.gluon import text
>>> import collections

```

### Indexing words and using pre-trained word embeddings in `gluon`

As a common use case, let us index words, attach pre-trained word embeddings to them, and use
such embeddings in `gluon`, all in just a few lines of code.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
This allows us to filter out infrequent words (see details in the
[Vocabulary API specifications](#mxnet.gluon.text.vocab.Vocabulary)). For example, assuming
`count_tokens_from_str` returns a `collections.Counter`, we can inspect its contents:
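
```python
>>> counter
Counter({'world': 3, 'hello': 2, 'nice': 1, 'hi': 1})

```
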
Suppose that we want to build indices for all the keys in `counter`. To do so, we create a
`Vocabulary` instance with `counter` as its argument.

```python
>>> my_vocab = text.Vocabulary(counter)

```

To attach word embeddings to the indexed words in `my_vocab`, let us first create a fastText word
embedding instance by specifying the embedding name `fasttext` and the pre-trained file name
`wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Now we can attach the word embedding `fasttext` to the indexed words in `my_vocab`.

```python
>>> my_vocab.set_embedding(fasttext)

```

Now we are ready to access the fastText word embedding vectors for indexed words, such as 'hello'
and 'world'.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

To demonstrate how to use pre-trained word embeddings in the `gluon` package, let us first obtain
indices of the words 'hello' and 'world'.

```python
>>> my_vocab[['hello', 'world']]
[2, 1]

```

We can obtain the vector representation for the words 'hello' and 'world' by specifying their
indices (2 and 1) and the weight matrix `my_vocab.embedding.idx_to_vec` in
`mxnet.gluon.nn.Embedding`.

```python
>>> input_dim, output_dim = my_vocab.embedding.idx_to_vec.shape
>>> layer = gluon.nn.Embedding(input_dim, output_dim)
>>> layer.initialize()
>>> layer.weight.set_data(my_vocab.embedding.idx_to_vec)
>>> layer(nd.array([2, 1]))

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```
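
As a quick sanity check on the attached vectors, we can compare 'hello' and 'world' with cosine
similarity. The helper below is an illustrative sketch built from `mxnet.ndarray` operations; it
is not part of the text API.

```python
>>> def cosine_similarity(x, y):
...     # Cosine similarity of two 1-D NDArrays.
...     return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))
...
>>> vecs = my_vocab.embedding[['hello', 'world']]
>>> cosine_similarity(vecs[0], vecs[1])  # a scalar NDArray in [-1, 1]

```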

## Vocabulary

The vocabulary builds indices for text tokens, and token embeddings can be attached to it. The
input counter, whose keys are the candidate tokens to index, may be obtained via
[`count_tokens_from_str`](#mxnet.gluon.text.utils.count_tokens_from_str).


```eval_rst
.. currentmodule:: mxnet.gluon.text.vocab
.. autosummary::
:nosignatures:

Vocabulary
```

Suppose that we have a simple text data set in the string format. We can count word frequency in the
data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.utils.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
This allows us to filter out infrequent words. Suppose that we want to build indices for the 2 most
frequent keys in `counter` with the unknown token representation '(unk)' and a reserved token
'(pad)'.

```python
>>> my_vocab = text.Vocabulary(counter, max_size=2, unknown_token='(unk)',
... reserved_tokens=['(pad)'])

```

We can access properties such as `token_to_idx` (mapping tokens to indices), `idx_to_token`
(mapping indices to tokens), `unknown_token` (the representation of any unknown token) and
`reserved_tokens` (the list of reserved tokens).


```python
>>> my_vocab.token_to_idx
{'(unk)': 0, '(pad)': 1, 'world': 2, 'hello': 3}
>>> my_vocab.idx_to_token
['(unk)', '(pad)', 'world', 'hello']
>>> my_vocab.unknown_token
'(unk)'
>>> my_vocab.reserved_tokens
['(pad)']
>>> len(my_vocab)
4
>>> my_vocab[['hello', 'world']]
[3, 2]
```

Besides the specified unknown token '(unk)' and the reserved token '(pad)', the 2 most frequent
words 'world' and 'hello' are also indexed.
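
Any other word, such as 'hi' or 'nice', maps to the index of the unknown token, and `to_tokens`
maps indices back to tokens. As a quick check (assuming out-of-vocabulary lookups return the
unknown token's index 0):

```python
>>> my_vocab[['hi', 'nice']]
[0, 0]
>>> my_vocab.to_tokens([0, 1, 2, 3])
['(unk)', '(pad)', 'world', 'hello']

```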


### Attach token embedding to vocabulary

Token embeddings can be attached to a vocabulary instance.

To begin with, suppose that we have a simple text data set in the string format. We can count word
frequency in the data set.

```python
>>> text_data = " hello world \n hello nice world \n hi world \n"
>>> counter = text.count_tokens_from_str(text_data)

```

The obtained `counter` has key-value pairs whose keys are words and values are word frequencies.
This allows us to filter out infrequent words.
Suppose that we want to build indices for the 2 most frequent keys in `counter`.

```python
>>> my_vocab = text.Vocabulary(counter, max_size=2)

```
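
As a quick check, assuming the default unknown token occupies one index in addition to the 2 most
frequent words, the vocabulary now holds 3 tokens.

```python
>>> len(my_vocab)
3

```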

Let us create a fastText word embedding instance with the pre-trained file `wiki.simple.vec`.

```python
>>> fasttext = text.embedding.create('fasttext', file_name='wiki.simple.vec')

```

Now we can attach the word embedding `fasttext` to the indexed words in `my_vocab`.

```python
>>> my_vocab.set_embedding(fasttext)

```

Now we are ready to access the fastText word embedding vectors for the indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ 3.95669997e-01 2.14540005e-01 -3.53889987e-02 -2.42990002e-01
...
-7.54180014e-01 -3.14429998e-01 2.40180008e-02 -7.61009976e-02]
[ 1.04440004e-01 -1.08580001e-01 2.72119999e-01 1.32990003e-01
...
-3.73499990e-01 5.67310005e-02 5.60180008e-01 2.90190000e-02]]
<NDArray 2x300 @cpu(0)>

```

Let us now create a GloVe word embedding instance with the pre-trained file `glove.6B.50d.txt` and
re-attach it to the vocabulary.

```python
>>> glove = text.embedding.create('glove', file_name='glove.6B.50d.txt')
>>> my_vocab.set_embedding(glove)

```

Now we are ready to access the GloVe word embedding vectors for the indexed words.

```python
>>> my_vocab.embedding[['hello', 'world']]

[[ -0.38497001 0.80092001
...
0.048833 0.67203999]
[ -0.41486001 0.71847999
...
-0.37639001 -0.67541999]]
<NDArray 2x50 @cpu(0)>

```

If a token is unknown to `my_vocab`, its embedding vector is initialized according to the default
specification in `glove` (all elements are 0).

```python
>>> my_vocab.embedding['nice']

[ 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
...
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
<NDArray 50 @cpu(0)>

```



## Text token embedding

To load token embeddings from an externally hosted pre-trained token embedding file, such as those
of GloVe and FastText, use
[`embedding.create(embedding_name, file_name)`](#mxnet.gluon.text.embedding.create).

To get all the available `embedding_name` and `file_name` values, use
[`embedding.get_file_names()`](#mxnet.gluon.text.embedding.get_file_names).

```python
>>> text.embedding.get_file_names()
{'glove': ['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', ...],
'fasttext': ['wiki.en.vec', 'wiki.simple.vec', 'wiki.zh.vec', ...]}

```

Alternatively, to load embedding vectors from a custom pre-trained text token embedding file, use
[`TokenEmbedding.from_file`](#mxnet.gluon.text.embedding.TokenEmbedding.from_file).
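
For instance, given a plain-text file with one token per line followed by its space-separated
vector elements, loading it might look like the sketch below. The file name `my_embedding.vec` is
hypothetical, and the exact keyword arguments of `from_file` are listed in the API reference below.

```python
>>> my_embedding = text.embedding.TokenEmbedding.from_file('my_embedding.vec')

```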


```eval_rst
.. currentmodule:: mxnet.gluon.text.embedding
.. autosummary::
:nosignatures:

register
create
get_file_names
TokenEmbedding
GloVe
FastText
```

See [Attach token embedding to vocabulary](#attach-token-embedding-to-vocabulary) for how to attach
token embeddings to a vocabulary and use them.


### Implement a new text token embedding

To implement a new text token embedding, create a subclass of
`mxnet.gluon.text.embedding.TokenEmbedding` and add the decorator
`@mxnet.gluon.text.embedding.TokenEmbedding.register` before the class definition. See
[`embedding.py`](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/text/embedding.py)
for examples.
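
The sketch below illustrates the shape of such a subclass. It assumes, as the `GloVe` and
`FastText` classes suggest, that `register` keys the subclass by its lowercased class name; the
class name, file name, and loading logic here are hypothetical placeholders.

```python
>>> @text.embedding.TokenEmbedding.register
... class MyTokenEmbedding(text.embedding.TokenEmbedding):
...     """A hypothetical custom token embedding."""
...     def __init__(self, file_name='my_embedding.vec', **kwargs):
...         super(MyTokenEmbedding, self).__init__(**kwargs)
...         # Parse `file_name` here and populate the token-to-vector mapping,
...         # following the loading logic of GloVe and FastText in embedding.py.
...
>>> my_embedding = text.embedding.create('mytokenembedding',
...                                      file_name='my_embedding.vec')

```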


## Text utilities

The following functions provide utilities for text data processing.

```eval_rst
.. currentmodule:: mxnet.gluon.text.utils
.. autosummary::
:nosignatures:

count_tokens_from_str
```
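
For example, assuming `count_tokens_from_str` keeps the `token_delim` and `seq_delim` keyword
arguments of its `mxnet.contrib.text` predecessor, tokens can also be counted from a
comma-delimited string:

```python
>>> text.utils.count_tokens_from_str('hello,world,hello', token_delim=',')
Counter({'hello': 2, 'world': 1})

```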


## API Reference

<script type="text/javascript" src='../../_static/js/auto_module_index.js'></script>

```eval_rst

.. automodule:: mxnet.gluon.text.embedding
:members: register, create, get_file_names
.. autoclass:: mxnet.gluon.text.embedding.TokenEmbedding
:members: from_file
.. autoclass:: mxnet.gluon.text.embedding.GloVe
.. autoclass:: mxnet.gluon.text.embedding.FastText

.. automodule:: mxnet.gluon.text.vocab
.. autoclass:: mxnet.gluon.text.vocab.Vocabulary
:members: set_embedding, to_tokens

.. automodule:: mxnet.gluon.text.utils
:members: count_tokens_from_str

```
<script>auto_index("api-reference");</script>
26 changes: 26 additions & 0 deletions python/mxnet/gluon/text/__init__.py
@@ -0,0 +1,26 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

# coding: utf-8
# pylint: disable=wildcard-import
"""This module includes utilities for indexing and embedding text."""

from .vocab import *

from . import embedding

from .utils import *