Closed
Changes from 1 commit
16 changes: 10 additions & 6 deletions docs/ml-features.md
@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
`StopWordsRemover` takes as input a sequence of strings (e.g. the output
of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. We provide [a list of stop
-words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
-default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
-if the matches should be case sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are provided
+("danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian",

Member:
Languages should be like Danish, Dutch, ...

Contributor Author:
I'm afraid it will confuse users as currently it does not support StopWordsRemover.loadDefaultStopWords("English") (with Capital E). Maybe we should use language.toLower in loadDefaultStopWords.

Member:
OK, in that case just clarify that these strings are arguments to the method.

+"norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish"),
+which are accessible by calling `StopWordsRemover.loadDefaultStopWords(language)`.
+A boolean parameter `caseSensitive` indicates if the matches should be case
+sensitive (false by default).

**Examples**

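For reviewers, the behavior discussed in this hunk can be sketched in plain Python. This is only an illustration of the documented semantics, not Spark's implementation: `DEFAULT_STOP_WORDS`, `load_default_stop_words`, and `remove_stop_words` are hypothetical names, the word list is a tiny stand-in for the bundled lists, and the lowercase lookup is the normalization proposed in the review thread, not confirmed current behavior.

```python
# Hypothetical stand-in for the bundled per-language lists (the real lists ship with Spark).
DEFAULT_STOP_WORDS = {
    "english": ["i", "the", "a", "had"],
}


def load_default_stop_words(language):
    """Look up a stop-word list by language name, ignoring the case of the name."""
    key = language.lower()  # normalization suggested in the review thread (assumption)
    if key not in DEFAULT_STOP_WORDS:
        raise ValueError(f"no default stop words for language: {language!r}")
    return list(DEFAULT_STOP_WORDS[key])


def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that matches a stop word; matching ignores case by default."""
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    blocked = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in blocked]


if __name__ == "__main__":
    words = load_default_stop_words("English")
    print(remove_stop_words(["I", "saw", "the", "red", "balloon"], words))
```

With `case_sensitive=True`, a token such as `"The"` would survive a list that only contains `"the"`, which is why the default in the documentation is case-insensitive matching.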
@@ -346,7 +347,10 @@ for more details on the API.

Binarization is the process of thresholding numerical features to binary (0/1) features.

-`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
+`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
+for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
+to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
+for `inputCol`.

<div class="codetabs">
<div data-lang="scala" markdown="1">
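The thresholding rule in this hunk, including the new note that both scalar and vector inputs are accepted, can be illustrated with a minimal plain-Python sketch. It is not Spark's `Binarizer`: the function name `binarize` is hypothetical, and a Python list stands in for a Vector column value.

```python
def binarize(value, threshold):
    """Map a number, or each element of a list/tuple, to 1.0 or 0.0.

    Values strictly greater than the threshold become 1.0;
    values equal to or less than the threshold become 0.0.
    """
    if isinstance(value, (list, tuple)):
        # Vector-like input: apply the same rule element by element.
        return [binarize(v, threshold) for v in value]
    return 1.0 if value > threshold else 0.0


if __name__ == "__main__":
    print(binarize(0.8, 0.5))
    print(binarize([0.1, 0.5, 0.9], 0.5))
```

Note that a value exactly equal to the threshold maps to 0.0, matching the wording "equal to or less than" in the paragraph.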