[Spark-9062] [ML] Change output type of Tokenizer to Array(String, true) #7414

hhbyyh · 2015-07-15T06:40:43Z

jira: https://issues.apache.org/jira/browse/SPARK-9062

Currently, the output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and other transformers, since their input type is Array(String, true). A Seq[String] returned from a udf is treated as Array(String, true) by default.
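To illustrate the mismatch, here is a minimal sketch in plain Python. It is a simplified model, not Spark's code: the `ArrayType` dataclass, the `is_compatible` helper, and the variable names are all hypothetical, and it assumes the downstream transformer compares its expected input type to the actual column type by strict equality, so that the nullability flag takes part in the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:
    """Toy stand-in for an array column type (hypothetical, not Spark's class)."""
    element_type: str
    contains_null: bool  # the second argument in Array(String, true/false)


def is_compatible(actual: ArrayType, expected: ArrayType) -> bool:
    """Assumed check: types match only if every field, including the null flag, is equal."""
    return actual == expected


# Output type of Tokenizer before and after this change, as described in the PR.
tokenizer_before = ArrayType("string", contains_null=False)
tokenizer_after = ArrayType("string", contains_null=True)

# Input type expected by Word2Vec and other transformers, as described in the PR.
word2vec_input = ArrayType("string", contains_null=True)

before_ok = is_compatible(tokenizer_before, word2vec_input)
after_ok = is_compatible(tokenizer_after, word2vec_input)
```

Under that assumption, `before_ok` is false and `after_ok` is true: the element types agree in both cases, and only the nullability flag decides the result.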

I'm not sure what the recommended way is for Tokenizer to handle null values in the input. Any suggestions are welcome.

mengxr · 2015-07-15T06:49:08Z

I think this is a minor issue since we don't output null. +1 on the changes to be consistent with auto schema inference.

SparkQA · 2015-07-15T07:16:00Z

Test build #37328 has finished for PR 7414 at commit c01bd7a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-07-17T20:42:39Z

LGTM, I'll merge with master.

change output type of tokenizer

c01bd7a

asfgit closed this in 806c579 Jul 17, 2015

srowen mentioned this pull request Sep 21, 2016

[SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array #15179

Closed
