Skip to content

Conversation

@hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Jul 15, 2015

jira: https://issues.apache.org/jira/browse/SPARK-9062

Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default.

I'm not sure what's the recommended way for Tokenizer to handle the null value in the input. Any suggestion will be welcome.

@mengxr
Copy link
Contributor

mengxr commented Jul 15, 2015

I think this is a minor issue since we don't output null. +1 on the changes to be consistent with auto schema inference.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37328 has finished for PR 7414 at commit c01bd7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

LGTM, I'll merge with master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants