
[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover #11871

Closed
wants to merge 23 commits into from

Conversation

burakkose
Contributor

Apache Spark is a global engine, so it should appeal to as many users as possible. I added multi-language support to StopWordsRemover, using NLTK's stop word lists for every language except English (the English list is unchanged). I also added some convenience methods such as setLanguage, setAdditionalWords, and setIgnoredWords.

English is the default. If you want to change the language, use setLanguage; for example, setLanguage("danish").
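The behaviour the proposal describes can be sketched in plain Python. This is an illustrative stand-in, not Spark's API: the class, the tiny word lists, and the snake_case method names are all invented here to show the intended semantics of setLanguage, setAdditionalWords, and setIgnoredWords.

```python
# Plain-Python sketch of the proposed StopWordsRemover behaviour.
# The word lists below are tiny illustrative samples, not NLTK's lists.
DEFAULT_LISTS = {
    "english": {"a", "an", "the", "is"},
    "danish": {"og", "i", "jeg", "det"},
}

class StopWordsRemoverSketch:
    def __init__(self):
        self.language = "english"   # English is the default
        self.additional = set()
        self.ignored = set()

    def set_language(self, lang):
        self.language = lang
        return self

    def set_additional_words(self, words):
        self.additional = set(words)
        return self

    def set_ignored_words(self, words):
        self.ignored = set(words)
        return self

    def transform(self, tokens):
        # Effective list = defaults for the language, plus additional
        # words, minus explicitly ignored words.
        stop = (DEFAULT_LISTS[self.language] | self.additional) - self.ignored
        return [t for t in tokens if t not in stop]

remover = StopWordsRemoverSketch().set_language("danish")
print(remover.transform(["jeg", "har", "og", "bog"]))  # ['har', 'bog']
```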

@mengxr
Contributor

mengxr commented Mar 22, 2016

@burakkose I guess it is okay to copy the lists from NLTK since it is Apache licensed. Could you add a header to each stopword file and put a link there? It helps us review the changes. Thanks!

@mengxr
Contributor

mengxr commented Mar 22, 2016

ok to test

@@ -0,0 +1,319 @@
a

If the other lists are from NLTK, maybe we should use their English stopwords too. It would be good to make a quick comparison and check the differences.

@mengxr
Contributor

mengxr commented Mar 22, 2016

@burakkose I made a quick pass. I just want to mention another option for the implementation. Instead of having language, ignoredWords, and additionalWords, we can separate the lists from StopWordsRemover:

val stopWords = StopWordsRemover.loadStopWords("turkish").toSet ++ Set("a") -- Set("b")
val swr = new StopWordsRemover()
  .setStopWords(stopWords.toArray)
...

This makes the code more composable.
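The composition suggested above can be illustrated in plain Python: build the word set outside the transformer with ordinary set operations instead of baking language/additional/ignored parameters into it. The loader and the Turkish word list here are illustrative stand-ins, not Spark's API.

```python
# Illustrative stand-in for a per-language stop word loader.
STOP_LISTS = {"turkish": ["ve", "bir", "bu"]}

def load_stop_words(language):
    return STOP_LISTS[language]

# Compose the final list with plain set operations:
# defaults, plus "a", minus "b" -- mirroring the Scala snippet above.
stop_words = (set(load_stop_words("turkish")) | {"a"}) - {"b"}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["bu", "a", "b", "kitap"], stop_words))  # ['b', 'kitap']
```

The transformer then only needs one parameter (the final word list), and every policy question about how lists combine stays in user code.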

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53745 has finished for PR 11871 at commit 6d215b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 22, 2016

@burakkose the LICENSE file and other license info probably needs updating if you're including this. I can help if you'll point me to the source of this data and its license.

@burakkose
Contributor Author

@srowen, I got them from http://www.nltk.org/nltk_data/ (the Stopwords Corpus). They mention:

"They were obtained from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
The English list has been augmented (nltk/nltk_data#22)"

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53784 has finished for PR 11871 at commit 41cd258.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

After updating the English stop words list, "d" is a stop word.
@burakkose
Contributor Author

@mengxr , can you review again?

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53788 has finished for PR 11871 at commit 4d1812a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53822 has finished for PR 11871 at commit a308622.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54184 has finished for PR 11871 at commit 789342f.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@burakkose
Contributor Author

I added a static method to Python and a README for the resources, and deleted StopWords and language. But we need to retest; @srowen, can you request it?

@srowen
Member

srowen commented Mar 29, 2016

Jenkins retest this please

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54446 has finished for PR 11871 at commit 789342f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  terms.filter(s => !lowerStopWords.contains(toLower(s)))
}
udf { terms: Seq[String] =>
  terms.filter(s => !stopWordsSet.contains(s))
Member


I think this can be terms.filterNot(stopWordsSet.contains)?
It seems like this code path will always pay the cost of making a set out of the stopwords. It's not huge, but I wonder if it makes sense to store a reference to the set once?
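The suggestion above can be sketched in plain Python (not Spark's actual implementation): build the stop word set once in the constructor, keep a defensive copy, and let the filter reuse it on every call instead of rebuilding it.

```python
# Pure-Python sketch of "store a ref to the set once": the set is built
# a single time in __init__, not on every transform() call. The class
# name and parameters are illustrative, not Spark's API.
class Remover:
    def __init__(self, stop_words, case_sensitive=False):
        # frozenset is a defensive, immutable copy of the caller's list.
        if case_sensitive:
            self._stop_set = frozenset(stop_words)
            self._key = lambda s: s
        else:
            self._stop_set = frozenset(w.lower() for w in stop_words)
            self._key = str.lower

    def transform(self, tokens):
        # Scala's terms.filterNot(stopWordsSet.contains) corresponds to
        # this comprehension with the membership test negated.
        return [t for t in tokens if self._key(t) not in self._stop_set]

r = Remover(["The", "a"])
print(r.transform(["The", "quick", "a", "fox"]))  # ['quick', 'fox']
```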

Contributor Author


Can you give more information about that case? What would be the best approach in your view?

Member


Can you save a reference to the active set of stopwords instead of making the list into a set each time? might be more natural to have a defensive copy anyway.

Contributor Author


Yes, I will fix it. Do you have any other suggestions for the pull request, such as additional features?

Member


See the question below about null words.

@mengxr
Contributor

mengxr commented Apr 11, 2016

@burakkose There were some merge conflicts introduced by recent commits, so please rebase onto master when you update this PR. Thanks!

@@ -98,7 +46,7 @@ class StopWordsRemover(override val uid: String)

/**
* the stop words set to be filtered out
* Default: [[StopWords.English]]
* Default: [[Array.empty]]
Contributor


This could be a little clearer in the scaladoc: I think we should mention that Array.empty actually implies loading the English stop words. (Or we could just make the default the loaded English stop words, as is done in the PySpark code.)
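The sentinel behaviour discussed here can be sketched in plain Python. This is an illustration under the assumption described in the comment, not Spark's code; the function name and the three-word English list are invented for the example.

```python
# Sketch of "Array.empty implies the English defaults": an empty
# stopWords value acts as a sentinel meaning "load the English list".
ENGLISH = ["a", "an", "the"]  # illustrative stand-in for the real list

def effective_stop_words(stop_words):
    # Falsy (empty) input falls back to the English defaults;
    # anything else is used as-is.
    return list(stop_words) if stop_words else ENGLISH

assert effective_stop_words([]) == ["a", "an", "the"]
assert effective_stop_words(["og"]) == ["og"]
```

Making the default the already-loaded English list (as PySpark does) avoids the sentinel entirely, at the cost of loading the list even when the user immediately overrides it.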

@mengxr
Contributor

mengxr commented May 2, 2016

I'm going to send a PR based on this so we can catch 2.0.

@burakkose
Contributor Author

@mengxr, I couldn't find free time, sorry about that. I had actually written new code and was just waiting for the tests. I am going to send a new PR.

@mengxr
Contributor

mengxr commented May 2, 2016

@burakkose I sent out a PR at #12843 and it would be great if you can help review it. I think we should get this one into Spark 2.0. There is also a TODO to add locale support. If you have time, could you start working on https://issues.apache.org/jira/browse/SPARK-15064? Thanks!

burakkose added 2 commits May 4, 2016 01:14
# Conflicts:
#	mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
#	mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
#	python/pyspark/ml/feature.py
#	python/pyspark/ml/tests.py
@SparkQA

SparkQA commented May 3, 2016

Test build #57686 has finished for PR 11871 at commit bca7c01.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2016

Test build #57689 has finished for PR 11871 at commit cb786ee.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@burakkose
Contributor Author

@mengxr, can you check? I added the locale support and applied your changes. I haven't opened a new pull request for the locale support.

@SparkQA

SparkQA commented May 4, 2016

Test build #57785 has finished for PR 11871 at commit 01471ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2016

Test build #57806 has finished for PR 11871 at commit dec0634.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in e20cd9f May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…ds for Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to a list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes #12843 from mengxr/SPARK-14050.

(cherry picked from commit e20cd9f)
Signed-off-by: Xiangrui Meng <[email protected]>
@Halawa13

Halawa13 commented Dec 8, 2016

Hello,

I'm new to PySpark programming, and I have a problem with this code:

from pyspark.ml.feature import StopWordsRemover
df = spark.createDataFrame([(0, ["je", "suis", "malade", "comme", "la", "dernierer"]), (1, ["si", "non", "tu", "vas", "bien"])], ["label", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show(truncate=False)

I want to use loadDefaultStopWords("french"), but I don't know how to use it. I tried:
remover.loadDefaultStopWords("french").transform(df).show(truncate=False)
It is not working.
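loadDefaultStopWords is a static helper that returns a plain list of words; it does not configure the remover, which is why chaining transform onto its result fails. The list has to be passed to the remover's stopWords parameter. The PySpark lines below are shown as comments (they need a running SparkSession); the executable part mimics the filtering with a small, deliberately incomplete French list used purely for illustration.

```python
# Intended PySpark pattern (shown as comments, needs a SparkSession):
#   french = StopWordsRemover.loadDefaultStopWords("french")
#   remover = StopWordsRemover(inputCol="raw", outputCol="filtered",
#                              stopWords=french)
#   remover.transform(df).show(truncate=False)
#
# Pure-Python mimic of what that filtering does. This word set is an
# illustrative sample, not the real French stop word list.
FRENCH = {"je", "suis", "comme", "la", "si", "non", "tu"}

def filter_row(tokens):
    # Keep only tokens that are not stop words.
    return [t for t in tokens if t not in FRENCH]

rows = [["je", "suis", "malade", "comme", "la", "dernierer"],
        ["si", "non", "tu", "vas", "bien"]]
print([filter_row(r) for r in rows])  # [['malade', 'dernierer'], ['vas', 'bien']]
```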
