
Conversation

@lpapenme

What changes were proposed in this pull request?

Word2Vec is trained unsupervised: the more data it is trained on, the more accurate the resulting word vectors. Hence, Word2Vec should support being fit on additional data.
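
For illustration, here is a minimal sketch of how such an incremental fit could be used. It follows the standard Word2Vec usage from the Spark docs; the incremental method `fitAdditional` is a hypothetical stand-in for whatever this patch actually adds:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("w2v-incremental").getOrCreate()

// Toy corpus: one tokenized document per row.
val initialCorpus = spark.createDataFrame(Seq(
  "spark is a distributed engine".split(" "),
  "word2vec learns word vectors".split(" ")
).map(Tuple1.apply)).toDF("words")

val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vectors")
  .setVectorSize(10)
  .setMinCount(0)

val model = word2Vec.fit(initialCorpus)

// Hypothetical incremental step in the spirit of this PR; the method
// name and signature added by the patch may differ.
// val refinedModel = model.fitAdditional(additionalCorpus)
```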

How was this patch tested?

Additional unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@MLnick
Contributor

MLnick commented Sep 19, 2017

Hi there - I don't see the value of adding a few words from a String array to the training. You're effectively adding a second corpus (non-distributed, and therefore limited in size) to the training.

Word2Vec is aimed at training on a larger corpus of text. If you want more accuracy, train on a larger training set.

Could you close this PR please?

@lpapenme
Author

At the moment, it is not possible to improve a model's accuracy by incorporating additional data. I think this should be supported, since it can increase a classifier's performance significantly. With this implementation, I was able to train unsupervised on a Wikipedia dump, which is pretty large. However, distributing the additional data set is a good point.

@MLnick
Contributor

MLnick commented Sep 19, 2017

I'm sorry, but I still don't understand the intention here. You can already train on a Wikipedia dump (or any other dataset) by passing that dataset as the input DataFrame to Word2Vec.

If you want to "incorporate additional data", why not just union the additional sentences/documents with your other training set?
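
For instance, with the existing API (a minimal sketch; the function and column names are placeholders, and both DataFrames must share the same schema):

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.DataFrame

// Assumes both corpora have a "words" column of tokenized documents
// (an array of strings per row) with identical schemas.
def fitOnCombinedCorpus(mainCorpus: DataFrame, extraCorpus: DataFrame): Word2VecModel = {
  val combined = mainCorpus.union(extraCorpus)
  new Word2Vec()
    .setInputCol("words")
    .setOutputCol("vectors")
    .fit(combined)
}
```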

@lpapenme
Author

The problem emerges in cases where you have built a whole pipeline. You have a set of documents you want to classify; these documents have some additional features and are preprocessed in the pipeline. When you get to Word2Vec, you want to vectorize your documents, but you see poor performance from your word vectors and want to tune them by adding additional documents. You don't want these additional documents to be part of the whole pipeline, because they cannot pass the previous preprocessing steps.

That was my intention in adding this. It is probably a very rare use case; I don't know.
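
For completeness, one workaround with the existing API is to fit Word2Vec outside the pipeline on the union of both corpora, and then insert the fitted model into the pipeline as a transformer stage. A sketch, with illustrative stage and column names:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}
import org.apache.spark.sql.DataFrame

// mainDocs: raw documents with a "text" column, preprocessed by the pipeline.
// extraTokenized: additional documents, already tokenized into a "words"
// column, which cannot pass through the pipeline's earlier stages.
def buildPipeline(mainDocs: DataFrame, extraTokenized: DataFrame): Pipeline = {
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

  // Fit Word2Vec outside the pipeline, on the combined corpus.
  val w2vModel = new Word2Vec()
    .setInputCol("words")
    .setOutputCol("vectors")
    .fit(tokenizer.transform(mainDocs).select("words").union(extraTokenized))

  // A fitted Word2VecModel is a Transformer, so it can serve directly
  // as a pipeline stage; Pipeline.fit() passes transformers through.
  new Pipeline().setStages(Array[PipelineStage](tokenizer, w2vModel))
}
```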

@AmplabJenkins

Can one of the admins verify this patch?

asfgit closed this in a3ba3a8 on Nov 11, 2018. The closing commit's message:
Closes apache#21766
Closes apache#21679
Closes apache#21161
Closes apache#20846
Closes apache#19434
Closes apache#18080
Closes apache#17648
Closes apache#17169

Add:
Closes apache#22813
Closes apache#21994
Closes apache#22005
Closes apache#22463

Add:
Closes apache#15899

Add:
Closes apache#22539
Closes apache#21868
Closes apache#21514
Closes apache#21402
Closes apache#21322
Closes apache#21257
Closes apache#20163
Closes apache#19691
Closes apache#18697
Closes apache#18636
Closes apache#17176

Closes apache#23001 from wangyum/CloseStalePRs.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
