[SPARK-15064][ML] Locale support in StopWordsRemover #12968
   * Default: English locale ("en")
   * @group param
   */
  val locale: Param[String] = new Param[String](this, "locale",
Hm, shouldn't all this be linked to the stop words set? If you loaded the French stop words, wouldn't you always want the French locale?
Yes, but how can we know that the user loaded the French stop words? A user can load stop words with StopWordsRemover.loadDefaultStopWords("french") and then set them with new StopWordsRemover().setStopWords(stopWords). Do you have any suggestions for that case?
For supported languages, we can know the appropriate locale and maintain an internal mapping. So "french" is known to map to Locale.FRENCH. For loading an arbitrary list, we don't know, but you could provide an overload where you provide a Locale.
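The internal mapping suggested above could look roughly like the following. This is only a sketch: `StopWordLocales` and `localeFor` are hypothetical names that do not exist in Spark, and only a handful of the bundled languages are listed.

```scala
import java.util.Locale

// Hypothetical helper, not part of Spark: maps language names accepted by
// StopWordsRemover.loadDefaultStopWords to a java.util.Locale.
object StopWordLocales {
  // Only a few languages are listed here for illustration.
  private val byLanguage: Map[String, Locale] = Map(
    "english" -> Locale.ENGLISH,
    "french"  -> Locale.FRENCH,
    "german"  -> Locale.GERMAN,
    "italian" -> Locale.ITALIAN,
    "turkish" -> Locale.forLanguageTag("tr")
  )

  /** Locale for a bundled stop-word language, or None if it is not mapped. */
  def localeFor(language: String): Option[Locale] =
    byLanguage.get(language.toLowerCase(Locale.ROOT))
}
```

For an arbitrary user-supplied list, `localeFor` has nothing to look up, which is where an overload that takes an explicit `Locale` would come in.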
(@burakkose I think the
@HyukjinKwon, thank you for letting me know. Yes, you're right.
  setDefault(stopWords -> StopWordsRemover.loadDefaultStopWords("english"), caseSensitive -> false)
  /**
   * Locale for doing a case sensitive comparison
   * Default: English locale ("en")
Shall we list the available options, or provide a reference here?
It's done
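One way to enumerate the options asked about above is to query the JVM itself. The sketch below is not taken from the patch; `LocaleOptions`, `available`, and `isSupported` are hypothetical names, and the result depends on the locales installed in the running JVM.

```scala
import java.util.Locale

// Hypothetical helper for illustration only.
object LocaleOptions {
  // String forms (for example "en" or "fr_FR") of the locales this JVM reports.
  // Locale.ROOT has an empty string form, so it is filtered out.
  def available: Set[String] =
    Locale.getAvailableLocales().map(_.toString).filter(_.nonEmpty).toSet

  // True if the given string matches one of the reported locales exactly.
  def isSupported(name: String): Boolean = available.contains(name)
}
```

A check like `LocaleOptions.isSupported($(locale))` could then reject unknown values early instead of silently falling back to some default.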
Made a pass. That's all from me.
This is blocking the user guide/examples update for 2.0.
# Conflicts: # mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
Can you specify what this is blocking?
    } else {
      // TODO: support user locale (SPARK-15064)
      val toLower = (s: String) => if (s != null) s.toLowerCase else s
      val loadedLocale = StopWordsRemover.loadLocale($(locale))
Maybe just new Locale($(locale))
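To illustrate why the locale matters for the case-insensitive path, here is a small sketch of null-safe lower-casing under an explicit locale. `LocaleLowercase` and `toLower` are hypothetical names, not code from the patch; the Turkish dotless "ı" is the usual example of locale-dependent case mapping.

```scala
import java.util.Locale

// Hypothetical helper for illustration only.
object LocaleLowercase {
  // Null-safe lower-casing that uses the given locale rather than the JVM default.
  def toLower(locale: Locale)(s: String): String =
    if (s != null) s.toLowerCase(locale) else s

  def main(args: Array[String]): Unit = {
    val turkish = Locale.forLanguageTag("tr")
    println(toLower(Locale.ENGLISH)("TITLE")) // prints "title"
    println(toLower(turkish)("TITLE"))        // prints "tıtle" (dotless i)
  }
}
```

Calling `s.toLowerCase` with no argument uses the JVM's default locale, so the same stop-word list can match differently on differently configured machines.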
I'm not sure if this will be shipped with Spark 2.0. If so, we should update the user guide accordingly.
  /**
   * Locale for doing a case sensitive comparison
   * Default: English locale ("en")
   * @see [[http://www.localeplanet.com/java/]]
Please link to the official Java doc: https://docs.oracle.com/javase/8/docs/api/java/util/Locale.html or the Locale class.
@burakkose, is this something you are still working on? If so, can you update it to master and look at @mengxr's comments? If you're no longer interested in working on it, no worries.
Can one of the admins verify this patch?
Closes apache#12968 Closes apache#16215 Closes apache#16212 Closes apache#16086 Closes apache#15713 Closes apache#16413 Closes apache#16396 Author: Sean Owen <[email protected]> Closes apache#16447 from srowen/CloseStalePRs.
What changes were proposed in this pull request?
How was this patch tested?
unit tests