
[Clojure] Add fastText example #15340

Merged: 11 commits into apache:master on Jun 30, 2019

Conversation

@AlexChalk (Contributor) commented Jun 24, 2019

Description

  • Right now the CNN text classification example provides support for glove and word2vec embeddings. This PR extends that support to fastText.
  • Additionally, this PR updates the README.md instructions for running examples from the REPL.

Resolves #14118

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR).
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable.
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change.

Changes

  • CNN text classification (including docs)

@gigasquid (Member)

Great! I look forward to reviewing this 😸

@gigasquid changed the title from "Add fastText example" to "[Clojure] Add fastText example" on Jun 24, 2019
@gigasquid (Member)

Pinging @Chouffe to take a look if he has time as well, since he helped shape the issue ticket 😄

@Chouffe (Contributor) commented Jun 25, 2019

I will take a look shortly @adc17
Looking forward to reviewing it!

@Chouffe (Contributor) commented Jun 25, 2019

I was able to run the fastText example:

```
cnn-text-classification.classifier=> (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
Loading all the movie reviews from  data/mr-data
WARN  org.apache.mxnet.WarnIfNotDisposed: LEAK: [one-time warning] An instance of org.apache.mxnet.Symbol was not disposed. Set property mxnet.traceLeakedObjects to true to enable tracing
Loading the fastText pre-trained word embeddings from  data/fastText/wiki.simple.vec
Shuffling the data and splitting into training and test sets
{:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 300, :pretrained-embedding :fastText}
Getting ready to train for  10  epochs
===========
WARN  org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN  org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN  org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN  org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN  org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
INFO  org.apache.mxnet.module.BaseModule: Epoch[0] Train-accuracy=0.5326316
INFO  org.apache.mxnet.module.BaseModule: Epoch[0] Time cost=4463
INFO  org.apache.mxnet.module.BaseModule: Epoch[0] Validation-accuracy=0.59
...
INFO  org.apache.mxnet.module.BaseModule: Epoch[8] Train-accuracy=0.9836842
INFO  org.apache.mxnet.module.BaseModule: Epoch[8] Time cost=4093
INFO  org.apache.mxnet.module.BaseModule: Epoch[8] Validation-accuracy=0.73
INFO  org.apache.mxnet.module.BaseModule: Epoch[9] Train-accuracy=0.9878947
INFO  org.apache.mxnet.module.BaseModule: Epoch[9] Time cost=3861
INFO  org.apache.mxnet.module.BaseModule: Epoch[9] Validation-accuracy=0.75
```

Thanks a lot for adding this @adc17! It seems to work really well :-)
I am wondering if we should add bash scripts to fetch the data (for the word2vec and fastText embeddings).

@Chouffe (Contributor) left a review:

Overall looks really good!
I left some minor comments @adc17.

```diff
         :let [fields (.split line " ")]]
     [(aget fields 0)
      (mapv #(Float/parseFloat ^String %) (rest fields))]))

 (defn load-glove [glove-file-path]
   (println "Loading the glove pre-trained word embeddings from " glove-file-path)
-  (into {} (read-text-embedding-pairs (io/reader glove-file-path))))
+  (into {} (read-text-embedding-pairs (line-seq (io/reader glove-file-path)))))
```
@Chouffe (Contributor):

It is becoming hard to read. Can we use a threading macro here?

```clojure
(->> (io/reader path)
     line-seq
     read-text-embedding-pairs
     (into {}))
```

@AlexChalk (Contributor, Author):

Yep, I considered this myself!
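
For context, the full helper touched by this hunk, with the threading suggestion applied to `load-glove`, might look like the following (a sketch reconstructed from the fragment above; the formatting in the actual example may differ):

```clojure
;; Reconstructed sketch: read-text-embedding-pairs parses one embedding
;; per line ("word f1 f2 ...") into a [word, vector-of-floats] pair.
(defn read-text-embedding-pairs [lines]
  (for [^String line lines
        :let [fields (.split line " ")]]
    [(aget fields 0)
     (mapv #(Float/parseFloat ^String %) (rest fields))]))

(defn load-glove [glove-file-path]
  (println "Loading the glove pre-trained word embeddings from " glove-file-path)
  (->> (io/reader glove-file-path)
       line-seq
       read-text-embedding-pairs
       (into {})))
```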


```diff
+(defn load-fasttext [fasttext-file-path]
+  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
+  (into {} (read-text-embedding-pairs (remove-fasttext-metadata (line-seq (io/reader fasttext-file-path))))))
```
@Chouffe (Contributor):

Can we also use a threading macro here for readability?
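
A direct translation of the glove version above, assuming the same helpers, would be:

```clojure
(->> (io/reader fasttext-file-path)
     line-seq
     remove-fasttext-metadata
     read-text-embedding-pairs
     (into {}))
```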

```diff
@@ -190,6 +198,8 @@
         vocab-embeddings (case pretrained-embedding
                            :glove (->> (load-glove (glove-file-path embedding-size))
                                        (build-vocab-embeddings vocab embedding-size))
+                           :fastText (->> (load-fasttext fasttext-file-path)
```
@Chouffe (Contributor):

I would prefer a keyword like :fast-text or :fasttext instead of having an uppercase character.
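
With the rename, the dispatch above would read roughly as follows (a sketch; the trailing `(build-vocab-embeddings vocab embedding-size)` step in the fastText branch is assumed from the glove branch, since the hunk cuts off before it):

```clojure
(case pretrained-embedding
  :glove    (->> (load-glove (glove-file-path embedding-size))
                 (build-vocab-embeddings vocab embedding-size))
  :fasttext (->> (load-fasttext fasttext-file-path)
                 (build-vocab-embeddings vocab embedding-size)))
```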


```diff
+(def remove-fasttext-metadata rest)
+
+(defn load-fasttext [fasttext-file-path]
```
@Chouffe (Contributor):

Should we mark the IO functions with ! to convey it?

`(defn load-fasttext! ...)`
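
Combining this with the earlier threading suggestion, the final loader might look like this (a sketch, not the merged code):

```clojure
(defn load-fasttext! [fasttext-file-path]
  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
  (->> (io/reader fasttext-file-path)
       line-seq
       remove-fasttext-metadata
       read-text-embedding-pairs
       (into {})))
```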

````diff
 ```
 Note that loading word2vec embeddings consumes memory and takes some time.

 You can also train them using `JVM_OPTS="-Xmx8g" lein run` once you've modified
 the parameters to `train-convnet` (see above) in `src/cnn_text_classification/classifier.clj`.
 In order to run training with word2vec on the complete data set, you will need to run:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
````
@Chouffe (Contributor):

Thanks for fixing the README!

````diff
@@ -58,15 +72,15 @@ you'll need to unzip them and place them in the `contrib/clojure-package/data` d

 Then you can run training on a subset of examples through the repl using:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
````
@Chouffe (Contributor):

Thanks for fixing the README! Should we use `(context/default-context)` instead though?
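
For reference, the suggested variant would look like this (assuming `context` is the example's alias for `org.apache.clojure-mxnet.context`, whose `default-context` returns `(context/cpu 0)` unless overridden):

```clojure
(train-convnet {:devs [(context/default-context)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
```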


````diff
+Then you can run training on a subset of examples through the repl using:
+```
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
````
@Chouffe (Contributor):

Should we use `(context/default-context)` instead?

```diff
@@ -29,8 +29,7 @@ You also must download the glove word embeddings. The suggested one to use is th
 ## Usage

 You can run through the repl with
-`(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
-
+`(train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
```
@Chouffe (Contributor):

Thanks for fixing the README! Should we use `(context/default-context)` instead?

@gigasquid (Member)

Thanks for your contribution @adc17 and thanks for reviewing @Chouffe 😸

I took a look and it looks good to me too. When the feedback items are addressed and CI is green, it will be good to merge.

@AlexChalk (Contributor, Author) commented Jun 28, 2019

Thanks for the thorough review @Chouffe; I agree with all your suggestions.

> I am wondering if we should add bash scripts to fetch the data (for the word2vec and fastText embeddings).

I did this for fastText; the word2vec link is a Google Drive address that requires confirmation prior to download, so I didn't script it for now. We'd need a workaround like this to automate the download: https://stackoverflow.com/a/32742700/7028216
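
For the fastText side, a Clojure sketch of the fetch would look like the following (hypothetical helper, not the script in this PR; assumes the wiki.simple.vec URL from the fastText project and the data path shown in the training log above):

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: streams the fastText wiki.simple vectors to the
;; path the example expects (data/fastText/wiki.simple.vec).
(defn download-fasttext-embeddings! []
  (let [url  "https://dl.fbaipublicfiles.com/fasttext/vectors-wiki/wiki.simple.vec"
        dest "data/fastText/wiki.simple.vec"]
    (io/make-parents dest)
    (with-open [in  (io/input-stream url)
                out (io/output-stream dest)]
      (io/copy in out))))
```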

@gigasquid (Member) left a review:

Thanks again for your contribution 💯

@gigasquid gigasquid merged commit b869ecd into apache:master Jun 30, 2019