[Clojure] - Provide support for Fasttext embedding in CNN Text Classification example #14118
Hey, this is the MXNet Label Bot.
This was originally for BERT, but @Chouffe helped me understand that it is more complicated than I originally thought with that model, so I'm changing it to https://fasttext.cc/
This sounds like 'figure out how to use the lib and document it as code', so a good ticket for someone new to machine learning? I'll have a go if that's correct.
That's correct :) Give a shout if you have any questions or issues. The #clojure-mxnet Slack room is also good; see about joining here: http://mxnet.incubator.apache.org/versions/master/community/contribute.html
Hi @gigasquid, sorry for the delay on this. The fastText data format looks almost identical to GloVe's, so with a few modifications (e.g. removing the first line of the data, which is a header), I think something as simple as the sketch below will work. However, I'm having trouble running the glove (and word2vec) examples off master (macOS Mojave). Can you repro this?
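Here is the kind of thing I mean, as a minimal sketch; the function names are mine (not from the example), and it assumes the `<vocab-size> <dimension>` header line is the only real difference from the GloVe format:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(defn- parse-embedding-line
  "Splits one line of a .vec file into [word vector-of-floats]."
  [line]
  (let [[word & vals] (string/split line #"\s+")]
    [word (mapv #(Float/parseFloat %) vals)]))

(defn load-fasttext-embeddings
  "Reads a fastText .vec file into a map of word -> float vector.
   Unlike GloVe files, fastText .vec files begin with a
   `<vocab-size> <dimension>` header line, so we drop it."
  [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)                    ;; skip the fastText header line
         (map parse-embedding-line)
         (into {}))))
```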
@adc17 Sorry for the trouble. It looks like the code was refactored and the README instructions weren't updated. It requires a […]. From the repl you can use […]. If you could update the documentation to help others in the future, that would be great 😸
No problem, sorry for not spotting this myself. I should be able to submit a PR this weekend, and I'll update the docs at the same time.
@gigasquid this will take longer than expected, as I'm running into OOMs.
The same thing happens for GloVe when I use the 200d+ word vectors (only without the stack trace). Seeing as fastText only gives us 300d word vectors, I'm a long way off successfully running them. This kind of memory optimization is something I've never done before, so I'm now out of my depth in terms of making things work with fastText's pretrained vectors. I can look at parsing their binary training format in a similar way to what's currently done with word2vec (those were 300d and my system could use them), but again, I've never really done this before 😟. For reference, I'm on a late 2016 MacBook Pro (with the 4-core 2GHz i5 and 8GB RAM).
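One possible mitigation, sketched here under the assumption that the example's project.clj accepts a plain `:jvm-opts` entry (the project name and dependency version below are illustrative, not the example's actual ones), is to raise the JVM's maximum heap:

```clojure
;; Hypothetical project.clj tweak: raise the max heap for the example.
;; On an 8GB machine the JVM default max heap is roughly 2GB, so this
;; buys headroom, though the full 1M x 300d vocab may still not fit.
(defproject cnn-text-classification "0.1.0-SNAPSHOT"
  :dependencies [[org.clojure/clojure "1.9.0"]]
  :jvm-opts ["-Xmx6g"])
```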
One workaround is to not use all 1M embeddings; I can just take the first 100K from the file (something like the sketch below). If that sounds OK, let me know and I'll submit a PR.
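A minimal sketch of that workaround, reusing the hypothetical `parse-embedding-line` helper from the earlier comment:

```clojure
(defn load-fasttext-embeddings-capped
  "Like load-fasttext-embeddings, but reads only the first `max-words`
   entries so the full 1M-word vocab never has to fit in memory."
  [path max-words]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (drop 1)             ;; fastText header line
         (take max-words)     ;; e.g. 100000 instead of the full 1M
         (map parse-embedding-line)
         (into {}))))
```

Usage would be something like `(load-fasttext-embeddings-capped "path/to/wiki.en.vec" 100000)`, with the path being whatever the downloaded file is called.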
Scratch that, I've just discovered the 'wiki.simple' pretrained embeddings, which are small enough to run locally 🎊: #15340
Right now the CNN text classification example provides support for GloVe and word2vec embeddings.
It would be great to also provide support for BERT, to give users an example of how to integrate that into their code as well.
CNN Text Classification Example: https://github.com/apache/incubator-mxnet/tree/master/contrib/clojure-package/examples/cnn-text-classification
Reference implementation of BERT embedding for MXNet (Python): https://github.com/imgarylai/bert-embedding