
Conversation

@lpapenme

What changes were proposed in this pull request?

Word2Vec is trained unsupervised: the more data it is trained on, the more accurate the resulting word vectors. Hence, Word2Vec should support being fit on additional data.
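
For illustration, here is a minimal sketch of how such an incremental fit could be used. It follows the standard Word2Vec usage from the Spark docs; the incremental method `fitAdditional` is a hypothetical stand-in for whatever this patch actually adds:

```scala
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("w2v-incremental").getOrCreate()

// Toy corpus: one tokenized document per row.
val initialCorpus = spark.createDataFrame(Seq(
  "spark is a distributed engine".split(" "),
  "word2vec learns word vectors".split(" ")
).map(Tuple1.apply)).toDF("words")

val word2Vec = new Word2Vec()
  .setInputCol("words")
  .setOutputCol("vectors")
  .setVectorSize(10)
  .setMinCount(0)

val model = word2Vec.fit(initialCorpus)

// Hypothetical incremental step in the spirit of this PR; the method
// name and signature added by the patch may differ.
// val refinedModel = model.fitAdditional(additionalCorpus)
```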

How was this patch tested?

Additional unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@MLnick
Contributor

MLnick commented Sep 19, 2017

Hi there - I don't see the value of adding a few words from a String array to the training. You're effectively adding a second corpus (non-distributed, and therefore limited in size) to the training.

Word2Vec is aimed at training on a larger corpus of text. If you want more accuracy, train on a larger training set.

Could you close this PR please?

@lpapenme
Author

At the moment, it is not possible to improve a model's accuracy by incorporating additional data. I think this should be supported, since it can increase a classifier's performance significantly. With this implementation, I was able to train unsupervised on a Wikipedia dump, which is pretty large. However, distributing the additional data set is a good point.

@MLnick
Contributor

MLnick commented Sep 19, 2017

I'm sorry, but I still don't understand the intention here. You can already train on a Wikipedia dump (or any other dataset) by passing that dataset as the input DataFrame to Word2Vec.

If you want to "incorporate additional data", why not just union the additional sentences/documents with your other training set?
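
For instance, with the existing API (a minimal sketch; the function and column names are placeholders, and both DataFrames must share the same schema):

```scala
import org.apache.spark.ml.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.sql.DataFrame

// Assumes both corpora have a "words" column of tokenized documents
// (an array of strings per row) with identical schemas.
def fitOnCombinedCorpus(mainCorpus: DataFrame, extraCorpus: DataFrame): Word2VecModel = {
  val combined = mainCorpus.union(extraCorpus)
  new Word2Vec()
    .setInputCol("words")
    .setOutputCol("vectors")
    .fit(combined)
}
```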

@lpapenme
Author

The problem emerges in cases where you have built a whole pipeline. You have a set of documents you want to classify; these documents have some additional features and are preprocessed in the pipeline. When you get to Word2Vec, you want to vectorize your documents, but you see poor performance from your word vectors and want to tune them by adding additional documents. You don't want these additional documents to be part of the whole pipeline, because they cannot pass the previous preprocessing steps.

That was my intention in adding this. It is probably a very rare use case; I don't know.
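
For completeness, one workaround with the existing API is to fit Word2Vec outside the pipeline on the union of both corpora, and then insert the fitted model into the pipeline as a transformer stage. A sketch, with illustrative stage and column names:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{Tokenizer, Word2Vec}
import org.apache.spark.sql.DataFrame

// mainDocs: raw documents with a "text" column, preprocessed by the pipeline.
// extraTokenized: additional documents, already tokenized into a "words"
// column, which cannot pass through the pipeline's earlier stages.
def buildPipeline(mainDocs: DataFrame, extraTokenized: DataFrame): Pipeline = {
  val tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words")

  // Fit Word2Vec outside the pipeline, on the combined corpus.
  val w2vModel = new Word2Vec()
    .setInputCol("words")
    .setOutputCol("vectors")
    .fit(tokenizer.transform(mainDocs).select("words").union(extraTokenized))

  // A fitted Word2VecModel is a Transformer, so it can serve directly
  // as a pipeline stage; Pipeline.fit() passes transformers through.
  new Pipeline().setStages(Array[PipelineStage](tokenizer, w2vModel))
}
```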

@AmplabJenkins

Can one of the admins verify this patch?

asfgit closed this in a3ba3a8 on Nov 11, 2018. The closing commit's message:
Closes apache#21766
Closes apache#21679
Closes apache#21161
Closes apache#20846
Closes apache#19434
Closes apache#18080
Closes apache#17648
Closes apache#17169

Add:
Closes apache#22813
Closes apache#21994
Closes apache#22005
Closes apache#22463

Add:
Closes apache#15899

Add:
Closes apache#22539
Closes apache#21868
Closes apache#21514
Closes apache#21402
Closes apache#21322
Closes apache#21257
Closes apache#20163
Closes apache#19691
Closes apache#18697
Closes apache#18636
Closes apache#17176

Closes apache#23001 from wangyum/CloseStalePRs.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
