Skip to content

Conversation

@viirya
Copy link
Member

@viirya viirya commented Apr 8, 2016

What changes were proposed in this pull request?

Currently we use SQLUserDefinedType annotation to register UDTs for user classes. However, by doing this, we add Spark dependency to user classes.

For some user classes, it is unnecessary to add such dependency that will increase deployment difficulty.

We should provide alternative approach to register UDTs for user classes without SQLUserDefinedType annotation.

How was this patch tested?

UserDefinedTypeSuite

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

cc @dbtsai You might be interested in this.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55341 has finished for PR 12259 at commit b270bb9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Copy link
Contributor

MLnick commented Apr 8, 2016

@viirya we don't actually want to delete VectorUDT as it's used throughout MLlib. Also, in practice when does UDTRegistration.register need to get called? As ideally MLlib will need to automatically register its UDTs without requiring code change.

@viirya viirya changed the title [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation [SPARK-14487][SQL][WIP] User Defined Type registration without SQLUserDefinedType annotation Apr 8, 2016
@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

@MLnick I see. I forget to upload another scala file. I will do it once I am back to laptop.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

I separate VectorUDT as individual file that I will upload later.

@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

For these built-in UDTs, we should pre-register them.

I was thinking that we can re-register them in an object. However, as Scala's object is lazy initialized, we can't run registration before user classes are used.

So for these built-in UDTs, we are only allowed to use existing annotation of UDT.

But this registration is still useful because you can manually register external user classes without adding Spark dependency and annotation on it.

@MLnick
Copy link
Contributor

MLnick commented Apr 8, 2016

Sure - my question is when and where should that happen? i.e. it should happen before they are used of course. So do we create a hook somewhere (perhaps in SQLContext init? elsewhere?)

@viirya viirya force-pushed the improve-sql-usertype branch from 92968a1 to b3acdb9 Compare April 8, 2016 14:06
@viirya
Copy link
Member Author

viirya commented Apr 8, 2016

hmm, is it better to make UDTRegistration private and expose UDTRegistration through sqlContext.udt then let users to do registration by call sqlContext.udt.register(userClass, udtClass)?

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55345 has finished for PR 12259 at commit b3acdb9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class VectorUDT extends UserDefinedType[Vector]

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55347 has finished for PR 12259 at commit 1f58662.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55351 has finished for PR 12259 at commit a84d9dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 9, 2016

Test build #55407 has finished for PR 12259 at commit e59f1d5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class UDTRegistrationSuite extends QueryTest with SharedSQLContext with ParquetTest

@viirya viirya changed the title [SPARK-14487][SQL][WIP] User Defined Type registration without SQLUserDefinedType annotation [SPARK-14487][SQL] User Defined Type registration without SQLUserDefinedType annotation Apr 11, 2016
@viirya
Copy link
Member Author

viirya commented Apr 11, 2016

cc @marmbrus @yhuai @rxin

@SparkQA
Copy link

SparkQA commented Apr 11, 2016

Test build #55503 has finished for PR 12259 at commit eafdd58.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 11, 2016

This is a good idea, but in 2.0 we are going to hide UserDefinedFunction for now and re-design this API in 2.1. We should revisit it together.

@dbtsai
Copy link
Member

dbtsai commented Apr 11, 2016

+cc @mengxr

@rxin In SPARK-13944, the matrix and vector classes will be moved out to spark-mllib-local jar, and that will no longer be able to access the current UDT annotation. As a result, something like this will be necessary. As this is a private api, we may revisit later in 2.1.

@viirya Can we rebase and review this code once the spark-mllib-local build is done, and the platform independent spark ml implementation of vector and matrix is merged?

Thanks.

@viirya
Copy link
Member Author

viirya commented Apr 11, 2016

@dbtsai yes. that will be good.

asfgit pushed a commit that referenced this pull request Apr 15, 2016
… in mllib-local

## What changes were proposed in this pull request?

This task will copy the Vector and Matrix classes from mllib to ml package in mllib-local jar. The UDTs and `since` annotation in ml vector and matrix will be removed from now. UDTs will be achieved by #SPARK-14487, and `since` will be replaced by /*  since 1.2.0 */

The BLAS implementation will be copied, and some of the test utilities will be copies as well.

Summary of changes:

1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
  - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in ml version.
2. In  mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
  - `Since` was removed, and we'll use standard `/* Since /*` Java doc. Will be in another PR.
  - `UDT` related code was removed, and will use `SPARK-13944` #12259  to replace the annotation.
3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
  - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - `Since` was removed.
  - `UDT` related code was removed.
  - In `def parseNumeric`, it was throwing `throw new SparkException(s"Cannot parse $other.")`, and now it's throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
  - For consistency with ML version of vector, `def parseNumeric` is now throwing `throw new IllegalArgumentException(s"Cannot parse $other.")`
5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
  - All the `throw new SparkException` were replaced by `throw new IllegalArgumentException`

## How was this patch tested?

unit tests

Author: DB Tsai <[email protected]>

Closes #12317 from dbtsai/dbtsai-ml-vector.
@SparkQA
Copy link

SparkQA commented Apr 24, 2016

Test build #56826 has finished for PR 12259 at commit 3917f6b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 24, 2016

Test build #56828 has finished for PR 12259 at commit 06bdbc5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 24, 2016

retest this please.

@SparkQA
Copy link

SparkQA commented Apr 24, 2016

Test build #56831 has finished for PR 12259 at commit 06bdbc5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 24, 2016

@mengxr @dbtsai Your comments are addressed now. Please take a look again when you have a chance. Thanks.

@dbtsai
Copy link
Member

dbtsai commented Apr 25, 2016

I think override def pyUDT: String = "pyspark.mllib.linalg.MatrixUDT" has to be changed; otherwise, this will cause inconsistent result. Maybe this can be removed from now, and a followup PR can be done for fixing this together with Python code.

@viirya
Copy link
Member Author

viirya commented Apr 26, 2016

@dbtsai yea. Forget to do that. Updating now.

@viirya
Copy link
Member Author

viirya commented Apr 26, 2016

Created a JIRA https://issues.apache.org/jira/browse/SPARK-14906 to follow this.

@SparkQA
Copy link

SparkQA commented Apr 26, 2016

Test build #56946 has finished for PR 12259 at commit de7ea5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Apr 26, 2016

ping @dbtsai @mengxr Can you take a look? Any more comments or is this ready to merge now? Thanks.

import org.apache.spark.sql.types._

@BeanInfo
private[ml] case class MyVectorPoint(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto on unit testing

@mengxr
Copy link
Contributor

mengxr commented Apr 27, 2016

Made another pass. Please also address my previous comment on UDTRegistration. It doesn't seem necessary to overload methods with different types (Are they all covered by unit tests?). Could you take a look and see whether we can simplify it further? Thanks!

@viirya
Copy link
Member Author

viirya commented Apr 28, 2016

@mengxr Thanks for these comments. I simplified UDTRegistration, VectorUDTSuite and MatrixUDTSuite. Please take a look and let me know if you have further feedback or it is ready to merge now.

@SparkQA
Copy link

SparkQA commented Apr 28, 2016

Test build #57207 has finished for PR 12259 at commit 45e87d9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MatrixUDTSuite extends SparkFunSuite
    • class VectorUDTSuite extends SparkFunSuite

*
* @since 1.6.0
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: remove empty line

@mengxr
Copy link
Contributor

mengxr commented Apr 28, 2016

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented Apr 28, 2016

Test build #57217 has finished for PR 12259 at commit 9ed0f30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 28, 2016

Test build #57219 has finished for PR 12259 at commit 1c230ae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Apr 28, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 7c6937a Apr 28, 2016
@viirya
Copy link
Member Author

viirya commented Apr 28, 2016

@mengxr @dbtsai Thanks for reviewing this!

object UDTRegistration extends Serializable with Logging {

/** The mapping between the Class between UserDefinedType and user classes. */
private lazy val udtMap: mutable.Map[String, String] = mutable.Map(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would new entry be added to this map for user UDTs ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, user UDTs can be registered in this map. However, as this is just private api, we may expect to refactor it in 2.1 then have it public.

@viirya viirya deleted the improve-sql-usertype branch December 27, 2023 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants