
[Clojure] - Provide support for Fasttext embedding in CNN Text Classification example #14118

Closed
gigasquid opened this issue Feb 11, 2019 · 10 comments · Fixed by #15340

@gigasquid
Member

Right now the CNN text classification example provides support for GloVe and word2vec embeddings.
It would be great to also provide support for BERT to give users an example of how to integrate that into their code as well.

CNN Text Classification Example: https://github.com/apache/incubator-mxnet/tree/master/contrib/clojure-package/examples/cnn-text-classification

Reference implementation of BERT embedding for MXNet (Python): https://github.com/imgarylai/bert-embedding

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@gigasquid
Member Author

gigasquid commented Feb 25, 2019

This was originally for BERT, but @Chouffe helped me understand that integrating that model is more complicated than I originally thought. So I'm changing it to fastText: https://fasttext.cc/

@gigasquid gigasquid changed the title [Clojure] - Provide support for BERT embedding in CNN Text Classification example [Clojure] - Provide support for Fasttext embedding in CNN Text Classification example Feb 25, 2019
@AlexChalk
Contributor

This sounds like 'figure out how to use the lib and document it as code', so a good ticket for someone new to machine learning? I'll have a go if that's correct.

@gigasquid
Member Author

That's correct :) Give a shout if you have any questions or issues. The #clojure-mxnet Slack room is also good; see how to join here: http://mxnet.incubator.apache.org/versions/master/community/contribute.html

@AlexChalk
Contributor

Hi @gigasquid, sorry for the delay on this.

The fastText data format looks almost identical to GloVe's, so with a few modifications (e.g. removing line 1 of the data, which is a header), I think a loader as simple as the sketch below will work.
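
A minimal sketch of what I mean, assuming the same whitespace-separated "word v1 v2 ... vN" line format as GloVe after the header is dropped (the function name and file layout are illustrative, not from the example code):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Illustrative fastText loader: parse a .vec file into {word [floats]}.
;; The only real difference from a GloVe loader is dropping the header line.
(defn load-fasttext! [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)  ;; line 1 is a "<vocab-size> <dim>" header, not a vector
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & vals]]
                [word (mapv #(Float/parseFloat %) vals)]))
         (into {}))))
```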

However, I'm having trouble running the GloVe (and word2vec) examples off master (macOS Mojave). Can you reproduce this?

lein repl
(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})
=> Loading all the movie reviews from  data/mr-data
=> Loading the glove pre-trained word embeddings from  data/glove/glove.6B.50d.txt
=> Shuffling the data and splitting into training and test sets
=> {:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 50, :pretrained-embedding :glove}
=> ClassCastException [Ljava.lang.Object; cannot be cast to [Lorg.apache.mxnet.Context;  org.apache.clojure-mxnet.module/module (module.clj:65)

@gigasquid
Member Author

@adc17 Sorry for the trouble. It looks like the code was refactored and the README instructions weren't updated. It now requires a :devs key to tell it whether to run on CPU or GPU and how many devices to use; see the main code in the classifier for the correct usage.

From the REPL you can use (train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove}) and it should work.
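
For GPUs, an illustrative variant (assuming context here is the example's alias for org.apache.clojure-mxnet.context; the device count is hypothetical):

```clojure
;; Illustrative only: context/gpu takes a device index, like context/cpu.
(train-convnet {:devs (mapv context/gpu (range 2)) ;; e.g. two GPUs
                :embedding-size 50 :batch-size 100 :test-size 100
                :num-epoch 10 :max-examples 1000
                :pretrained-embedding :glove})
```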

If you could update the documentation to help others in the future that would be great 😸

@AlexChalk
Contributor

No problem, sorry for not spotting this myself.

I should be able to submit a PR this weekend, and I'll update the docs at the same time.

@AlexChalk
Contributor

@gigasquid this will take longer than expected, as I'm running into OOMs.

OutOfMemoryError GC overhead limit exceeded  java.util.Arrays.copyOfRange (Arrays.java:3664)

The same thing happens for GloVe when I use the 200d+ word vectors (only without the stack trace).

Seeing as fastText only gives us 300d word vectors, I'm a long way off successfully running them.

This kind of memory optimization is something I've never done before, so I'm now out of my depth in terms of making things work with fastText's pretrained .vec files.

I can look at parsing their binary format in a similar way to what's currently done with word2vec (those vectors were also 300d and my system could handle them), but again, I've never really done this before 😟.

For reference, I'm on a late-2016 MacBook Pro (4-core 2 GHz i5, 8 GB RAM).
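
One generic knob worth trying before changing the parsing, assuming the example is run with Leiningen: raise the JVM heap ceiling via :jvm-opts in the example's project.clj (:jvm-opts is standard Leiningen; the 4g value is an untested guess):

```clojure
;; project.clj fragment (illustrative): give the JVM more heap so the
;; 300d embedding maps have room. -Xmx4g is a guess, not a tested value.
:jvm-opts ["-Xmx4g"]
```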

@AlexChalk
Contributor

One workaround is to not use all 1M embeddings; I can just take the first 100K from the file (sketched below). If that sounds OK, let me know and I'll submit a PR.
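
A minimal sketch of that workaround, reusing the illustrative loader and requires from my earlier comment (the 100K cut-off is the only new part; the path is hypothetical):

```clojure
;; Illustrative: realize only the first n embeddings to bound memory use.
;; fastText .vec files are sorted by frequency, so the first n are the
;; n most frequent words.
(defn load-fasttext-truncated! [path n]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)   ;; skip the "<vocab-size> <dim>" header line
         (take n)   ;; keep only the first n embeddings
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & vals]]
                [word (mapv #(Float/parseFloat %) vals)]))
         (into {}))))

;; e.g. (load-fasttext-truncated! "data/fasttext/wiki.en.vec" 100000)
```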

@AlexChalk
Contributor

Scratch that, I've just discovered the 'wiki.simple' pretrained embeddings, which are small enough to run locally 🎊: #15340
