[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover #11871
Conversation
@burakkose I guess it is okay to copy the lists from NLTK since it is Apache licensed. Could you add a header to each stopword file and put a link there? It helps us review the changes. Thanks!
ok to test
@@ -0,0 +1,319 @@
a
If the other lists are from NLTK, maybe we should use their English stopwords too. It would be good to make a quick comparison and check the differences.
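One quick way to run that comparison: a minimal sketch, assuming NLTK is installed with its stopwords corpus available, and that the PR's English list sits in a local english.txt (the path is hypothetical):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # fetch the NLTK stopwords corpus if missing

# Hypothetical path to the English stop words file added in this PR.
with open("english.txt") as f:
    spark_words = set(f.read().split())

nltk_words = set(stopwords.words("english"))

print("only in Spark's list:", sorted(spark_words - nltk_words))
print("only in NLTK's list: ", sorted(nltk_words - spark_words))
```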
@burakkose I made a quick pass. Just want to mention another option for the implementation. Instead of having dedicated setters for the language and extra/ignored words, we could expose a static loader and let users compose the list themselves:

```scala
val stopWords = StopWordsRemover.loadStopWords("turkish").toSet ++ Set("a") -- Set("b")
val swr = new StopWordsRemover()
  .setStopWords(stopWords.toArray)
...
```

This makes the code more composable.
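For reference, a runnable sketch of that composable pattern against the API that eventually shipped in Spark 2.0, where the static loader was named `loadDefaultStopWords` rather than `loadStopWords` (column names here are illustrative):

```scala
import org.apache.spark.ml.feature.StopWordsRemover

// Start from the bundled Turkish list, then add and subtract words as needed.
val stopWords = StopWordsRemover.loadDefaultStopWords("turkish").toSet ++ Set("a") -- Set("b")

val swr = new StopWordsRemover()
  .setInputCol("tokens")    // illustrative column names
  .setOutputCol("filtered")
  .setStopWords(stopWords.toArray)
```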
Test build #53745 has finished for PR 11871 at commit …
@burakkose the …
@srowen, I got them from http://www.nltk.org/nltk_data/ (Stopwords Corpus), and they mentioned that …
Test build #53784 has finished for PR 11871 at commit …
After updating the English stop words list, "d" is a stop word.
@mengxr, can you review again?
Test build #53788 has finished for PR 11871 at commit …
Test build #53822 has finished for PR 11871 at commit …
Test build #54184 has finished for PR 11871 at commit …
added a static method to Python, a readme for the resources, deleted …
Jenkins retest this please
Test build #54446 has finished for PR 11871 at commit …
```scala
terms.filter(s => !lowerStopWords.contains(toLower(s)))
}
udf { terms: Seq[String] =>
  terms.filter(s => !stopWordsSet.contains(s))
```
I think this can be `terms.filterNot(stopWordsSet.contains)`?
It seems like this code path will always pay the cost of making a set out of the stop words. It's not huge, but I wonder if it makes sense to store a reference to the set once?
Can you give more information about that case? What would be the best approach, in your view?
Can you save a reference to the active set of stop words instead of making the list into a set each time? It might be more natural to have a defensive copy anyway.
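A minimal sketch of that suggestion (names are illustrative, not the PR's actual code): build the set once, as an immutable defensive copy, and let the UDF close over it so each row only pays for lookups.

```scala
import org.apache.spark.sql.functions.udf

// Built once when the stop words are assigned, not on every transform call.
val stopWords: Array[String] = Array("a", "an", "the") // example words
val stopWordsSet: Set[String] = stopWords.toSet        // immutable defensive copy

val removeStopWords = udf { terms: Seq[String] =>
  terms.filterNot(stopWordsSet.contains) // per-row work is just set lookups
}
```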
Yes, I will fix it. Do you have any additional suggestions about the pull request, such as extra features?
See the question below about null words.
@burakkose There were some merge conflicts introduced by recent commits, so please rebase onto master when you update this PR. Thanks!
```diff
@@ -98,7 +46,7 @@ class StopWordsRemover(override val uid: String)
   /**
    * the stop words set to be filtered out
-   * Default: [[StopWords.English]]
+   * Default: [[Array.empty]]
```
The scaladoc could be a little clearer here: I think we should mention that `Array.empty` actually implies loading the English stop words. (Or we could just make the default the loaded English stop words, as is done in the PySpark code.)
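A possible clarified scaladoc along those lines (wording illustrative, not what was merged):

```scala
/**
 * The words to be filtered out.
 * Default: `Array.empty`, in which case the English stop words
 * (equivalent to loading the "english" list) are applied instead.
 */
```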
I"m going to send a PR based on this. So we can catch 2.0. |
@mengxr, sorry, I couldn't find free time. I actually wrote new code and was just waiting for the tests. I am going to send a new PR.
@burakkose I sent out a PR at #12843 and it would be great if you could help review it. I think we should get this one into Spark 2.0. There is also a TODO to add locale support. If you have time, could you start working on https://issues.apache.org/jira/browse/SPARK-15064? Thanks!
# Conflicts:
#	mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
#	mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
#	python/pyspark/ml/feature.py
#	python/pyspark/ml/tests.py
Test build #57686 has finished for PR 11871 at commit …
Test build #57689 has finished for PR 11871 at commit …
@mengxr, can you check? I added the locale support and applied your changes. I haven't opened a new pull request for the locale support.
Test build #57785 has finished for PR 11871 at commit …
Test build #57806 has finished for PR 11871 at commit …
…ds for Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:

* load English stopwords as default
* convert stopwords to a list in Python
* update some tests and docs

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes #12843 from mengxr/SPARK-14050.

(cherry picked from commit e20cd9f)
Signed-off-by: Xiangrui Meng <[email protected]>
Hello, I'm new to programming in PySpark and I have a problem with this code: `from pyspark.ml.feature import StopWordsRemover`. I want to use `loadDefaultStopWords("french")` but I don't know how to use it.
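`loadDefaultStopWords` is a static method that simply returns a list of words; the usual pattern is to pass its result to the remover via `stopWords`. A minimal PySpark sketch (Spark 2.0+; column names and data are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.appName("french-stopwords").getOrCreate()

# Static method: returns the bundled French stop word list.
french_stop_words = StopWordsRemover.loadDefaultStopWords("french")

remover = StopWordsRemover(inputCol="tokens", outputCol="filtered",
                           stopWords=french_stop_words)

df = spark.createDataFrame([(["je", "mange", "une", "pomme"],)], ["tokens"])
remover.transform(df).show(truncate=False)
# "je" and "une" are in the French list, so "filtered" keeps ["mange", "pomme"]
```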
Apache Spark is a global engine, so it should appeal to as many users as possible. I added multiple language support to StopWordsRemover, using NLTK's language lists for everything except English (the English list is unchanged). I added some helper functions such as setLanguage, setAdditionalWords, and setIgnoredWords.
English is the default. If you want to change the language, use setLanguage, for example setLanguage("danish").
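A sketch of how the proposed setters would read, per this description. These are the setters as proposed in this PR, not the merged API (the follow-up #12843 replaced them with a static loader plus `setStopWords`), and the argument types are assumptions:

```scala
// Proposed (pre-merge) API, for illustration only.
val remover = new StopWordsRemover()
  .setInputCol("tokens")
  .setOutputCol("filtered")
  .setLanguage("danish")            // switch from the English default
  .setAdditionalWords(Array("foo")) // extra words to filter out
  .setIgnoredWords(Array("bar"))    // default-list words to keep
```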