Conversation
Great! I look forward to reviewing this 😸

pinging @Chouffe to take a look if he has time as well since he helped shape the issue ticket 😄

I will take a look shortly @adc17
I could run fastText:

```
cnn-text-classification.classifier=> (train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
Loading all the movie reviews from data/mr-data
WARN org.apache.mxnet.WarnIfNotDisposed: LEAK: [one-time warning] An instance of org.apache.mxnet.Symbol was not disposed. Set property mxnet.traceLeakedObjects to true to enable tracing
Loading the fastText pre-trained word embeddings from data/fastText/wiki.simple.vec
Shuffling the data and splitting into training and test sets
{:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 300, :pretrained-embedding :fastText}
Getting ready to train for 10 epochs
===========
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
WARN org.apache.mxnet.DataDesc: Found Undefined Layout, will use default index 0 for batch axis
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
[18:54:04] src/operator/tensor/./matrix_op-inl.h:200: Using target_shape will be deprecated.
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Train-accuracy=0.5326316
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Time cost=4463
INFO org.apache.mxnet.module.BaseModule: Epoch[0] Validation-accuracy=0.59
...
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Train-accuracy=0.9836842
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Time cost=4093
INFO org.apache.mxnet.module.BaseModule: Epoch[8] Validation-accuracy=0.73
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Train-accuracy=0.9878947
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Time cost=3861
INFO org.apache.mxnet.module.BaseModule: Epoch[9] Validation-accuracy=0.75
```

Thanks a lot for adding this @adc17! It seems to work really well :-)
Overall looks really good!
I left some minor comments @adc17.
```diff
     :let [fields (.split line " ")]]
   [(aget fields 0)
    (mapv #(Float/parseFloat ^String %) (rest fields))]))

 (defn load-glove [glove-file-path]
   (println "Loading the glove pre-trained word embeddings from " glove-file-path)
-  (into {} (read-text-embedding-pairs (io/reader glove-file-path))))
+  (into {} (read-text-embedding-pairs (line-seq (io/reader glove-file-path)))))
```
It is becoming hard to read. Can we use a threading macro here?
```
(->> (io/reader path)
     line-seq
     read-text-embedding-pairs
     (into {}))
```
Yep, I considered this myself!
```
(defn load-fasttext [fasttext-file-path]
  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
  (into {} (read-text-embedding-pairs (remove-fasttext-metadata (line-seq (io/reader fasttext-file-path))))))
```
Can we also use a threading macro here for readability?
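Applying the same pattern as above, a threaded version of `load-fasttext` might look like this (a sketch only, reusing the names from the diff; not the final code in the PR):

```
;; Sketch: same behavior as the nested version, read top to bottom.
(defn load-fasttext [fasttext-file-path]
  (println "Loading the fastText pre-trained word embeddings from " fasttext-file-path)
  (->> (io/reader fasttext-file-path)
       line-seq
       remove-fasttext-metadata
       read-text-embedding-pairs
       (into {})))
```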
```diff
@@ -190,6 +198,8 @@
         vocab-embeddings (case pretrained-embedding
                            :glove (->> (load-glove (glove-file-path embedding-size))
                                        (build-vocab-embeddings vocab embedding-size))
+                           :fastText (->> (load-fasttext fasttext-file-path)
```
I would prefer a keyword like `:fast-text` or `:fasttext` instead of having an uppercase character.
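For illustration, a sketch of how the `case` branch might read with a lowercase keyword (the `build-vocab-embeddings` call on the fastText branch is assumed to mirror the `:glove` branch, which is truncated in the diff above):

```
;; Hypothetical rename: :fastText -> :fast-text
(case pretrained-embedding
  :glove     (->> (load-glove (glove-file-path embedding-size))
                  (build-vocab-embeddings vocab embedding-size))
  :fast-text (->> (load-fasttext fasttext-file-path)
                  (build-vocab-embeddings vocab embedding-size)))
```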
```
(def remove-fasttext-metadata rest)

(defn load-fasttext [fasttext-file-path]
```
Should we mark the IO functions with `!` to convey it? e.g. `(defn load-fasttext! ...)`
````diff
 ```
 Note that loading word2vec embeddings consumes memory and takes some time.

 You can also train them using `JVM_OPTS="-Xmx8g" lein run` once you've modified
 the parameters to `train-convnet` (see above) in `src/cnn_text_classification/classifier.clj`.
 In order to run training with word2vec on the complete data set, you will need to run:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 1000 :num-epoch 10 :pretrained-embedding :word2vec})
````
Thanks for fixing the README!
````diff
@@ -58,15 +72,15 @@ you'll need to unzip them and place them in the `contrib/clojure-package/data` d

 Then you can run training on a subset of examples through the repl using:
 ```
-(train-convnet {:embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :word2vec})
````
Thanks for fixing the README! Should we use `(context/default-context)` instead though?
````diff
 Then you can run training on a subset of examples through the repl using:
 ```
+(train-convnet {:devs [(context/cpu 0)] :embedding-size 300 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :fastText})
````
Should we use `(context/default-context)` instead?
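For reference, the suggested call would then look something like this (a sketch of the reviewer's suggestion; `context/default-context` lets MXNet pick the default device rather than hard-coding CPU 0):

```
;; Sketch: same parameters as the README example, but with the default device.
(train-convnet {:devs [(context/default-context)]
                :embedding-size 300 :batch-size 100 :test-size 100
                :num-epoch 10 :max-examples 1000
                :pretrained-embedding :fastText})
```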
````diff
@@ -29,8 +29,7 @@ You also must download the glove word embeddings. The suggested one to use is th

 ## Usage

 You can run through the repl with
-`(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
-
+`(train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})`
````
Thanks for fixing the README! Should we use `(context/default-context)` instead?
Thanks for the thorough review @Chouffe; I agree with all your suggestions.
I did this for fastText; the word2vec link is a Google Drive address that requires confirmation prior to download, so I didn't script it for now. We'd need a workaround like this to automate the download: https://stackoverflow.com/a/32742700/7028216
Force-pushed from 52601e4 to f454f3c.
Thanks again for your contribution 💯
Description
Resolves #14118