[SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations #2983
mateiz wants to merge 11 commits into apache:master from
Conversation
Test build #22389 has started for PR 2983 at commit

Test build #22389 has finished for PR 2983 at commit

Test FAILed.

Test build #22396 has started for PR 2983 at commit

Test build #22396 has finished for PR 2983 at commit

Test FAILed.

Test build #22406 has started for PR 2983 at commit
I've marked this as not WIP anymore, because the main TODOs left are in the Hive support. I intend to send that as a separate patch, though I can also add it here. Right now this maps each Hive decimal type to the unlimited-precision decimal, whereas we should really respect the precision and scale set in the Hive metastore in Hive 13; but the previous Spark SQL code doesn't respect that either, so this is not a regression.
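For illustration only, here is a rough sketch of that distinction. The object and method names are made up for this example and this is not the PR's actual Hive integration; it only assumes the `DecimalType(precision, scale)` and `DecimalType.Unlimited` forms introduced in this patch (package path as used by Catalyst in this era of Spark):

```scala
// Hypothetical sketch: how respecting Hive 13's declared precision/scale might
// look, versus the current behavior of always using unlimited precision.
import org.apache.spark.sql.catalyst.types.{DataType, DecimalType}

object HiveDecimalSketch {
  private val FixedDecimal = """decimal\(\s*(\d+)\s*,\s*(\d+)\s*\)""".r

  def toCatalystType(hiveTypeString: String): DataType =
    hiveTypeString.trim.toLowerCase match {
      case FixedDecimal(precision, scale) =>
        // Hive 13+ records precision and scale in the metastore.
        DecimalType(precision.toInt, scale.toInt)
      case "decimal" =>
        // No precision info: fall back to unlimited precision, which is what
        // this patch currently does for every Hive decimal column.
        DecimalType.Unlimited
      case other =>
        sys.error(s"Type not handled in this sketch: $other")
    }
}
```

With metastore metadata like `decimal(10,2)` this would yield `DecimalType(10, 2)`, while the current behavior corresponds to taking the `DecimalType.Unlimited` branch for every Hive decimal column.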
Test build #22409 has started for PR 2983 at commit

Test build #22406 has finished for PR 2983 at commit

Test FAILed.

Test build #22409 has finished for PR 2983 at commit

Test FAILed.

Jenkins, test this please

Test build #22414 has started for PR 2983 at commit

Test build #22414 has finished for PR 2983 at commit

Test FAILed.

Test build #22418 has started for PR 2983 at commit

Test build #22418 has finished for PR 2983 at commit

Test FAILed.

Jenkins, test this please

Test build #22443 has started for PR 2983 at commit

Test build #22443 has finished for PR 2983 at commit

Test FAILed.

Jenkins, test this please

Test build #22459 has started for PR 2983 at commit

Test build #22459 has finished for PR 2983 at commit

Test FAILed.

Test PASSed.
| | "TimestampType" ^^^ TimestampType | ||
| ) | ||
|
|
||
| protected lazy val fixedDecimalType: Parser[DataType] = |
I think this is technically not required. This parser is only for reading data from old Parquet files that were encoded with old versions of Spark SQL. Hopefully we can drop it completely some day.
I see, wouldn't it be less confusing to leave it in for now though? I can also remove it if you prefer.
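For readers without the diff open, here is a minimal, self-contained sketch of the kind of rule being discussed, written with the standard Scala parser combinators. The enclosing object and the exact grammar are assumptions for this example; in Spark the rule sits inside the DataType string parser used to read schemas that older versions of Spark SQL wrote into Parquet metadata:

```scala
// A sketch, not the actual Spark source: parse the case-class-style strings
// ("DecimalType(10,2)", "DecimalType()", "TimestampType", ...) that older
// Spark SQL versions wrote into Parquet file metadata.
import scala.util.parsing.combinator.RegexParsers
import org.apache.spark.sql.catalyst.types.{DataType, DecimalType, TimestampType}

object LegacyDataTypeParserSketch extends RegexParsers {
  // Fixed-precision form added in this PR, e.g. "DecimalType(10,2)".
  protected lazy val fixedDecimalType: Parser[DataType] =
    ("DecimalType(" ~> "[0-9]+".r) ~ ("," ~> "[0-9]+".r <~ ")") ^^ {
      case precision ~ scale => DecimalType(precision.toInt, scale.toInt)
    }

  protected lazy val primitiveType: Parser[DataType] =
    fixedDecimalType |
    "DecimalType()" ^^^ DecimalType.Unlimited |
    "TimestampType" ^^^ TimestampType

  def apply(s: String): DataType = parseAll(primitiveType, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => sys.error(s"Unsupported dataType: $s, $failure")
  }
}

// LegacyDataTypeParserSketch("DecimalType(10,2)") == DecimalType(10, 2)
```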
Hey Matei, this is pretty awesome! A few minor comments, and then this should be good to merge. Otherwise LGTM.
- …r now
- Implement CAST to fixed-precision decimal
- Add rules for propagating precision through decimal calculations. These work by casting things to Decimal.Unlimited to do the actual operation, then adding a cast on the result. They will result in more casts than needed, but on the other hand they avoid having each arithmetic operator know about decimal precision rules. We might be able to add more rules later to eliminate some intermediate casts.
- Optimize sums and averages on fixed-precision Decimals
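To make the precision-propagation point concrete, here is a small sketch of the bounds for two common cases. The formulas are the usual Hive/SQL-standard ones, and the object and method names are invented for this example rather than taken from the PR; the rewrite described above wraps each operation roughly as `Cast(Add(Cast(e1, unlimited), Cast(e2, unlimited)), DecimalType(p, s))`, with `(p, s)` computed from the operand types like this:

```scala
// Sketch of precision/scale propagation for addition and multiplication.
// The bounds follow the standard Hive/SQL rules; names are illustrative only.
import org.apache.spark.sql.catalyst.types.DecimalType

object DecimalPrecisionSketch {
  // decimal(p1, s1) + decimal(p2, s2)
  def forAdd(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
    val scale = math.max(s1, s2)
    val precision = math.max(p1 - s1, p2 - s2) + scale + 1
    DecimalType(precision, scale)
  }

  // decimal(p1, s1) * decimal(p2, s2)
  def forMultiply(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType =
    DecimalType(p1 + p2 + 1, s1 + s2)
}

// Example: adding decimal(5,2) and decimal(7,3) gives decimal(8,3):
// max(2,3) = 3 digits after the point, max(5-2, 7-3) + 1 = 5 before it.
```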
Test build #22705 has started for PR 2983 at commit

Test build #22705 has finished for PR 2983 at commit

Test FAILed.

Test build #504 has started for PR 2983 at commit

Test build #504 has finished for PR 2983 at commit
This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.