
[Clojure] - Provide support for Fasttext embedding in CNN Text Classification example #14118

Closed
gigasquid opened this issue Feb 11, 2019 · 10 comments · Fixed by #15340

@gigasquid
Member

Right now the CNN text classification example provides support for GloVe and word2vec embeddings.
It would be great to also provide support for BERT to give users an example of how to integrate that into their code as well.

CNN Text Classification Example: https://github.com/apache/incubator-mxnet/tree/master/contrib/clojure-package/examples/cnn-text-classification

Reference implementation of BERT embedding for MXNet (Python): https://github.com/imgarylai/bert-embedding

@mxnet-label-bot
Contributor

Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Feature

@gigasquid
Member Author

gigasquid commented Feb 25, 2019

This was originally for BERT, but @Chouffe helped me understand that integrating that model is more complicated than I originally thought. So I'm changing it to fastText: https://fasttext.cc/

@gigasquid gigasquid changed the title [Clojure] - Provide support for BERT embedding in CNN Text Classification example [Clojure] - Provide support for Fasttext embedding in CNN Text Classification example Feb 25, 2019
@AlexChalk
Contributor

This sounds like 'figure out how to use the lib and document it as code', so a good ticket for someone new to machine learning? I'll have a go if that's correct.

@gigasquid
Member Author

That's correct :) Give a shout if you have any questions or issues. The #clojure-mxnet Slack room is also good; see how to join here: http://mxnet.incubator.apache.org/versions/master/community/contribute.html

@AlexChalk
Contributor

Hi @gigasquid, sorry for the delay on this.

The fastText data format looks almost identical to GloVe's, so with a few modifications (e.g. removing line 1 of the data, which is a header), I think a loader as simple as the sketch below will work.
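
A minimal sketch of what I mean, assuming the same whitespace-separated "word v1 v2 ... vN" line format as GloVe after the header is dropped (the function name and file layout are illustrative, not from the example code):

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

;; Illustrative fastText loader: parse a .vec file into {word [floats]}.
;; The only real difference from a GloVe loader is dropping the header line.
(defn load-fasttext! [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)  ;; line 1 is a "<vocab-size> <dim>" header, not a vector
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & vals]]
                [word (mapv #(Float/parseFloat %) vals)]))
         (into {}))))
```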

However, I'm having trouble running the GloVe (and word2vec) examples off master (macOS Mojave). Can you reproduce this?

lein repl
(train-convnet {:embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove})
=> Loading all the movie reviews from  data/mr-data
=> Loading the glove pre-trained word embeddings from  data/glove/glove.6B.50d.txt
=> Shuffling the data and splitting into training and test sets
=> {:sentence-count 2000, :sentence-size 62, :vocab-size 8078, :embedding-size 50, :pretrained-embedding :glove}
=> ClassCastException [Ljava.lang.Object; cannot be cast to [Lorg.apache.mxnet.Context;  org.apache.clojure-mxnet.module/module (module.clj:65)

@gigasquid
Member Author

@adc17 Sorry for the trouble. It looks like the code was refactored and the README instructions weren't updated. It now requires a :devs key to tell it whether to run on CPU or GPU and how many devices to use; see the main code in the classifier for the correct usage.

From the REPL you can use (train-convnet {:devs [(context/cpu 0)] :embedding-size 50 :batch-size 100 :test-size 100 :num-epoch 10 :max-examples 1000 :pretrained-embedding :glove}) and it should work.
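
For GPUs, an illustrative variant (assuming context here is the example's alias for org.apache.clojure-mxnet.context; the device count is hypothetical):

```clojure
;; Illustrative only: context/gpu takes a device index, like context/cpu.
(train-convnet {:devs (mapv context/gpu (range 2)) ;; e.g. two GPUs
                :embedding-size 50 :batch-size 100 :test-size 100
                :num-epoch 10 :max-examples 1000
                :pretrained-embedding :glove})
```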

If you could update the documentation to help others in the future that would be great 😸

@AlexChalk
Contributor

No problem, sorry for not spotting this myself.

I should be able to submit a PR this weekend, and I'll update the docs at the same time.

@AlexChalk
Contributor

@gigasquid this will take longer than expected, as I'm running into OOMs.

OutOfMemoryError GC overhead limit exceeded  java.util.Arrays.copyOfRange (Arrays.java:3664)

The same thing happens for GloVe when I use the 200d+ word vectors (only without the stack trace).

Seeing as fastText only gives us 300d word vectors, I'm a long way off successfully running them.

This kind of memory optimization is something I've never done before, so I'm now out of my depth in terms of making things work with fastText's pretrained .vec files.

I can look at parsing their binary format in a similar way to what's currently done with word2vec (those vectors were also 300d and my system could handle them), but again, I've never really done this before 😟.

For reference, I'm on a late-2016 MacBook Pro (4-core 2 GHz i5, 8 GB RAM).
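
One generic knob worth trying before changing the parsing, assuming the example is run with Leiningen: raise the JVM heap ceiling via :jvm-opts in the example's project.clj (:jvm-opts is standard Leiningen; the 4g value is an untested guess):

```clojure
;; project.clj fragment (illustrative): give the JVM more heap so the
;; 300d embedding maps have room. -Xmx4g is a guess, not a tested value.
:jvm-opts ["-Xmx4g"]
```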

@AlexChalk
Contributor

One workaround is to not use all 1M embeddings; I can just take the first 100K from the file (sketched below). If that sounds OK, let me know and I'll submit a PR.
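
A minimal sketch of that workaround, reusing the illustrative loader and requires from my earlier comment (the 100K cut-off is the only new part; the path is hypothetical):

```clojure
;; Illustrative: realize only the first n embeddings to bound memory use.
;; fastText .vec files are sorted by frequency, so the first n are the
;; n most frequent words.
(defn load-fasttext-truncated! [path n]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)   ;; skip the "<vocab-size> <dim>" header line
         (take n)   ;; keep only the first n embeddings
         (map #(string/split (string/trim %) #"\s+"))
         (map (fn [[word & vals]]
                [word (mapv #(Float/parseFloat %) vals)]))
         (into {}))))

;; e.g. (load-fasttext-truncated! "data/fasttext/wiki.en.vec" 100000)
```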

@AlexChalk
Contributor

Scratch that, I've just discovered the 'wiki.simple' pretrained embeddings, which are small enough to run locally 🎊: #15340
