
[SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover #11871

Closed
wants to merge 23 commits into from

Conversation

burakkose
Contributor

Apache Spark is a global engine, so it should appeal to as many users as possible. I added multi-language support to StopWordsRemover, using NLTK's stop word lists for every language except English (the English list is unchanged). I also added some convenience methods such as setLanguage, setAdditionalWords, and setIgnoredWords.

English is the default. If you want to change the language, use setLanguage; for example, setLanguage("danish").
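The behaviour the proposal describes can be sketched in plain Python. This is an illustrative stand-in, not Spark's API: the class, the tiny word lists, and the snake_case method names are all invented here to show the intended semantics of setLanguage, setAdditionalWords, and setIgnoredWords.

```python
# Plain-Python sketch of the proposed StopWordsRemover behaviour.
# The word lists below are tiny illustrative samples, not NLTK's lists.
DEFAULT_LISTS = {
    "english": {"a", "an", "the", "is"},
    "danish": {"og", "i", "jeg", "det"},
}

class StopWordsRemoverSketch:
    def __init__(self):
        self.language = "english"   # English is the default
        self.additional = set()
        self.ignored = set()

    def set_language(self, lang):
        self.language = lang
        return self

    def set_additional_words(self, words):
        self.additional = set(words)
        return self

    def set_ignored_words(self, words):
        self.ignored = set(words)
        return self

    def transform(self, tokens):
        # Effective list = defaults for the language, plus additional
        # words, minus explicitly ignored words.
        stop = (DEFAULT_LISTS[self.language] | self.additional) - self.ignored
        return [t for t in tokens if t not in stop]

remover = StopWordsRemoverSketch().set_language("danish")
print(remover.transform(["jeg", "har", "og", "bog"]))  # ['har', 'bog']
```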

@mengxr
Contributor

mengxr commented Mar 22, 2016

@burakkose I guess it is okay to copy the lists from NLTK since it is Apache licensed. Could you add a header to each stopword file and put a link there? It helps us review the changes. Thanks!

@mengxr
Contributor

mengxr commented Mar 22, 2016

ok to test

@@ -0,0 +1,319 @@
a

If the other lists are from NLTK, maybe we should use their English stopwords too. It would be good to make a quick comparison and check the differences.

@mengxr
Contributor

mengxr commented Mar 22, 2016

@burakkose I made a quick pass. I just want to mention another option for the implementation. Instead of having language, ignoredWords, and additionalWords, we can separate the lists from StopWordsRemover:

val stopWords = StopWordsRemover.loadStopWords("turkish").toSet ++ Set("a") -- Set("b")
val swr = new StopWordsRemover()
  .setStopWords(stopWords.toArray)
...

This makes the code more composable.
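The composition suggested above can be illustrated in plain Python: build the word set outside the transformer with ordinary set operations instead of baking language/additional/ignored parameters into it. The loader and the Turkish word list here are illustrative stand-ins, not Spark's API.

```python
# Illustrative stand-in for a per-language stop word loader.
STOP_LISTS = {"turkish": ["ve", "bir", "bu"]}

def load_stop_words(language):
    return STOP_LISTS[language]

# Compose the final list with plain set operations:
# defaults, plus "a", minus "b" -- mirroring the Scala snippet above.
stop_words = (set(load_stop_words("turkish")) | {"a"}) - {"b"}

def remove_stop_words(tokens, stop_words):
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words(["bu", "a", "b", "kitap"], stop_words))  # ['b', 'kitap']
```

The transformer then only needs one parameter (the final word list), and every policy question about how lists combine stays in user code.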

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53745 has finished for PR 11871 at commit 6d215b3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 22, 2016

@burakkose the LICENSE file and other license info probably needs updating if you're including this. I can help if you'll point me to the source of this data and its license.

@burakkose
Contributor Author

@srowen, I got them from http://www.nltk.org/nltk_data/ (the Stopwords Corpus). They mention:

"They were obtained from:
http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/
The English list has been augmented (nltk/nltk_data#22)"

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53784 has finished for PR 11871 at commit 41cd258.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

After updating the English stop words list, "d" is a stop word.
@burakkose
Contributor Author

@mengxr , can you review again?

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53788 has finished for PR 11871 at commit 4d1812a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 22, 2016

Test build #53822 has finished for PR 11871 at commit a308622.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 25, 2016

Test build #54184 has finished for PR 11871 at commit 789342f.

  • This patch fails from timeout after a configured wait of 250m.
  • This patch merges cleanly.
  • This patch adds no public classes.

@burakkose
Contributor Author

I added a static method to Python and a README for the resources, and deleted StopWords and language. But we need to retest; @srowen, can you request it?

@srowen
Member

srowen commented Mar 29, 2016

Jenkins retest this please

@SparkQA

SparkQA commented Mar 29, 2016

Test build #54446 has finished for PR 11871 at commit 789342f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  terms.filter(s => !lowerStopWords.contains(toLower(s)))
}
udf { terms: Seq[String] =>
  terms.filter(s => !stopWordsSet.contains(s))
Member


I think this can be terms.filterNot(stopWordsSet.contains)?
It seems like this code path will always pay the cost of making a set out of the stopwords. It's not huge, but I wonder if it makes sense to store a reference to the set once?
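The suggestion above can be sketched in plain Python (not Spark's actual implementation): build the stop word set once in the constructor, keep a defensive copy, and let the filter reuse it on every call instead of rebuilding it.

```python
# Pure-Python sketch of "store a ref to the set once": the set is built
# a single time in __init__, not on every transform() call. The class
# name and parameters are illustrative, not Spark's API.
class Remover:
    def __init__(self, stop_words, case_sensitive=False):
        # frozenset is a defensive, immutable copy of the caller's list.
        if case_sensitive:
            self._stop_set = frozenset(stop_words)
            self._key = lambda s: s
        else:
            self._stop_set = frozenset(w.lower() for w in stop_words)
            self._key = str.lower

    def transform(self, tokens):
        # Scala's terms.filterNot(stopWordsSet.contains) corresponds to
        # this comprehension with the membership test negated.
        return [t for t in tokens if self._key(t) not in self._stop_set]

r = Remover(["The", "a"])
print(r.transform(["The", "quick", "a", "fox"]))  # ['quick', 'fox']
```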

Contributor Author


Can you give more information about that case? What would be the best approach in your view?

Member


Can you save a reference to the active set of stopwords instead of making the list into a set each time? might be more natural to have a defensive copy anyway.

Contributor Author


Yes, I will fix it. Do you have any other suggestions for the pull request, such as additional features?

Member


See the question below about null words.

@mengxr
Contributor

mengxr commented Apr 11, 2016

@burakkose There were some merge conflicts introduced by recent commits, so please rebase onto master when you update this PR. Thanks!

@@ -98,7 +46,7 @@ class StopWordsRemover(override val uid: String)

/**
* the stop words set to be filtered out
* Default: [[StopWords.English]]
* Default: [[Array.empty]]
Contributor


This could be a little clearer in the scaladoc: I think we should mention that Array.empty actually implies loading the English stop words. (Or we could just make the default the loaded English stop words, as is done in the PySpark code.)
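The sentinel behaviour discussed here can be sketched in plain Python. This is an illustration under the assumption described in the comment, not Spark's code; the function name and the three-word English list are invented for the example.

```python
# Sketch of "Array.empty implies the English defaults": an empty
# stopWords value acts as a sentinel meaning "load the English list".
ENGLISH = ["a", "an", "the"]  # illustrative stand-in for the real list

def effective_stop_words(stop_words):
    # Falsy (empty) input falls back to the English defaults;
    # anything else is used as-is.
    return list(stop_words) if stop_words else ENGLISH

assert effective_stop_words([]) == ["a", "an", "the"]
assert effective_stop_words(["og"]) == ["og"]
```

Making the default the already-loaded English list (as PySpark does) avoids the sentinel entirely, at the cost of loading the list even when the user immediately overrides it.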

@mengxr
Contributor

mengxr commented May 2, 2016

I'm going to send a PR based on this so we can catch 2.0.

@burakkose
Contributor Author

@mengxr, I couldn't find free time, sorry about that. I had actually written new code and was just waiting for the tests. I am going to send a new PR.

@mengxr
Contributor

mengxr commented May 2, 2016

@burakkose I sent out a PR at #12843 and it would be great if you can help review it. I think we should get this one into Spark 2.0. There is also a TODO to add locale support. If you have time, could you start working on https://issues.apache.org/jira/browse/SPARK-15064? Thanks!

burakkose added 2 commits May 4, 2016 01:14
# Conflicts:
#	mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala
#	mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
#	python/pyspark/ml/feature.py
#	python/pyspark/ml/tests.py
@SparkQA

SparkQA commented May 3, 2016

Test build #57686 has finished for PR 11871 at commit bca7c01.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 3, 2016

Test build #57689 has finished for PR 11871 at commit cb786ee.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@burakkose
Contributor Author

@mengxr, can you check? I added the locale support and applied your changes. I haven't opened a new pull request for the locale support.

@SparkQA

SparkQA commented May 4, 2016

Test build #57785 has finished for PR 11871 at commit 01471ec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 4, 2016

Test build #57806 has finished for PR 11871 at commit dec0634.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in e20cd9f May 6, 2016
asfgit pushed a commit that referenced this pull request May 6, 2016
…ds for Stop Words Remover

## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* convert stopwords to a list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <[email protected]>
Author: Xiangrui Meng <[email protected]>
Author: Burak KOSE <[email protected]>

Closes #12843 from mengxr/SPARK-14050.

(cherry picked from commit e20cd9f)
Signed-off-by: Xiangrui Meng <[email protected]>
@Halawa13

Halawa13 commented Dec 8, 2016

Hello,

I'm new to PySpark programming, and I have a problem with this code:

from pyspark.ml.feature import StopWordsRemover
df = spark.createDataFrame([(0, ["je", "suis", "malade", "comme", "la", "dernierer"]), (1, ["si", "non", "tu", "vas", "bien"])], ["label", "raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show(truncate=False)

I want to use loadDefaultStopWords("french"), but I don't know how to use it. I tried:
remover.loadDefaultStopWords("french").transform(df).show(truncate=False)
It is not working.
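loadDefaultStopWords is a static helper that returns a plain list of words; it does not configure the remover, which is why chaining transform onto its result fails. The list has to be passed to the remover's stopWords parameter. The PySpark lines below are shown as comments (they need a running SparkSession); the executable part mimics the filtering with a small, deliberately incomplete French list used purely for illustration.

```python
# Intended PySpark pattern (shown as comments, needs a SparkSession):
#   french = StopWordsRemover.loadDefaultStopWords("french")
#   remover = StopWordsRemover(inputCol="raw", outputCol="filtered",
#                              stopWords=french)
#   remover.transform(df).show(truncate=False)
#
# Pure-Python mimic of what that filtering does. This word set is an
# illustrative sample, not the real French stop word list.
FRENCH = {"je", "suis", "comme", "la", "si", "non", "tu"}

def filter_row(tokens):
    # Keep only tokens that are not stop words.
    return [t for t in tokens if t not in FRENCH]

rows = [["je", "suis", "malade", "comme", "la", "dernierer"],
        ["si", "non", "tu", "vas", "bien"]]
print([filter_row(r) for r in rows])  # [['malade', 'dernierer'], ['vas', 'bien']]
```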
