[SPARK-23934][SQL] Adding map_from_entries function #21282

mn-mikke · 2018-05-09T17:36:05Z

What changes were proposed in this pull request?

The PR adds the map_from_entries function that returns a map created from the given array of entries.

How was this patch tested?

New tests added into:

CollectionExpressionsSuite
DataFrameFunctionsSuite

CodeGen Examples

Primitive-type Keys and Values

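// Imports needed outside spark-shell defaults: debugCodegen and map_from_entries.
import org.apache.spark.sql.execution.debug._
import org.apache.spark.sql.functions.map_from_entries

val idf = Seq(
  Seq((1, 10), (2, 20), (3, 10)),
  Seq((1, 10), null, (2, 20))
).toDF("a")
idf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen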

Result:

/* 042 */         boolean project_isNull_0 = false;
/* 043 */         MapData project_value_0 = null;
/* 044 */
/* 045 */         for (int project_idx_2 = 0; !project_isNull_0 && project_idx_2 < inputadapter_value_0.numElements(); project_idx_2++) {
/* 046 */           project_isNull_0 |= inputadapter_value_0.isNullAt(project_idx_2);
/* 047 */         }
/* 048 */         if (!project_isNull_0) {
/* 049 */           final int project_numEntries_0 = inputadapter_value_0.numElements();
/* 050 */
/* 051 */           final long project_keySectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4);
/* 052 */           final long project_valueSectionSize_0 = UnsafeArrayData.calculateSizeOfUnderlyingByteArray(project_numEntries_0, 4);
/* 053 */           final long project_byteArraySize_0 = 8 + project_keySectionSize_0 + project_valueSectionSize_0;
/* 054 */           if (project_byteArraySize_0 > 2147483632) {
/* 055 */             final Object[] project_keys_0 = new Object[project_numEntries_0];
/* 056 */             final Object[] project_values_0 = new Object[project_numEntries_0];
/* 057 */
/* 058 */             for (int project_idx_1 = 0; project_idx_1 < project_numEntries_0; project_idx_1++) {
/* 059 */               InternalRow project_entry_1 = inputadapter_value_0.getStruct(project_idx_1, 2);
/* 060 */
/* 061 */               project_keys_0[project_idx_1] = project_entry_1.getInt(0);
/* 062 */               project_values_0[project_idx_1] = project_entry_1.getInt(1);
/* 063 */             }
/* 064 */
/* 065 */             project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0);
/* 066 */
/* 067 */           } else {
/* 068 */             final byte[] project_byteArray_0 = new byte[(int)project_byteArraySize_0];
/* 069 */             UnsafeMapData project_unsafeMapData_0 = new UnsafeMapData();
/* 070 */             Platform.putLong(project_byteArray_0, 16, project_keySectionSize_0);
/* 071 */             Platform.putLong(project_byteArray_0, 24, project_numEntries_0);
/* 072 */             Platform.putLong(project_byteArray_0, 24 + project_keySectionSize_0, project_numEntries_0);
/* 073 */             project_unsafeMapData_0.pointTo(project_byteArray_0, 16, (int)project_byteArraySize_0);
/* 074 */             ArrayData project_keyArrayData_0 = project_unsafeMapData_0.keyArray();
/* 075 */             ArrayData project_valueArrayData_0 = project_unsafeMapData_0.valueArray();
/* 076 */
/* 077 */             for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) {
/* 078 */               InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2);
/* 079 */
/* 080 */               project_keyArrayData_0.setInt(project_idx_0, project_entry_0.getInt(0));
/* 081 */               project_valueArrayData_0.setInt(project_idx_0, project_entry_0.getInt(1));
/* 082 */             }
/* 083 */
/* 084 */             project_value_0 = project_unsafeMapData_0;
/* 085 */           }
/* 086 */
/* 087 */         }
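
The size arithmetic in the generated code above can be read as follows. This is a minimal sketch, assuming each section is laid out as an 8-byte element count, a null bitset of one 64-bit word per 64 elements, and the element data rounded up to a multiple of 8 bytes. `UnsafeSizeSketch` and its method names are hypothetical stand-ins for illustration, not Spark's API.

```scala
object UnsafeSizeSketch {
  // Largest byte-array length the generated code accepts before falling back
  // to Object[] arrays (the literal 2147483632 in the output above).
  val MaxRoundedArrayLength: Long = Int.MaxValue - 15

  // Round a byte count up to the next multiple of 8.
  def roundToWord(numBytes: Long): Long = (numBytes + 7) / 8 * 8

  // Assumed section layout: 8-byte element count + null bitset + aligned data.
  def sectionSize(numElements: Long, elementSize: Int): Long = {
    val header = 8 + (numElements + 63) / 64 * 8
    header + roundToWord(elementSize * numElements)
  }

  // Mirrors `8 + keySectionSize + valueSectionSize` from the output above;
  // the leading 8 bytes appear to hold the key section size.
  def mapSize(numEntries: Long, keySize: Int, valueSize: Int): Long =
    8 + sectionSize(numEntries, keySize) + sectionSize(numEntries, valueSize)

  // True when the unsafe (byte-array) branch would be taken.
  def fitsInByteArray(numEntries: Long, keySize: Int, valueSize: Int): Boolean =
    mapSize(numEntries, keySize, valueSize) <= MaxRoundedArrayLength
}
```

Under these assumptions, three int-to-int entries need 32 bytes per section and 72 bytes in total, so the byte-array branch is taken; only very large arrays reach the Object[] fallback.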

Non-primitive-type Keys and Values

val sdf = Seq(
  Seq(("a", null), ("b", "bb"), ("c", "aa")),
  Seq(("a", "aa"), null, (null, "bb"))
).toDF("a")
sdf.filter('a.isNotNull).select(map_from_entries('a)).debugCodegen

Result:

/* 042 */         boolean project_isNull_0 = false;
/* 043 */         MapData project_value_0 = null;
/* 044 */
/* 045 */         for (int project_idx_1 = 0; !project_isNull_0 && project_idx_1 < inputadapter_value_0.numElements(); project_idx_1++) {
/* 046 */           project_isNull_0 |= inputadapter_value_0.isNullAt(project_idx_1);
/* 047 */         }
/* 048 */         if (!project_isNull_0) {
/* 049 */           final int project_numEntries_0 = inputadapter_value_0.numElements();
/* 050 */
/* 051 */           final Object[] project_keys_0 = new Object[project_numEntries_0];
/* 052 */           final Object[] project_values_0 = new Object[project_numEntries_0];
/* 053 */
/* 054 */           for (int project_idx_0 = 0; project_idx_0 < project_numEntries_0; project_idx_0++) {
/* 055 */             InternalRow project_entry_0 = inputadapter_value_0.getStruct(project_idx_0, 2);
/* 056 */
/* 057 */             if (project_entry_0.isNullAt(0)) {
/* 058 */               throw new RuntimeException("The first field from a struct (key) can't be null.");
/* 059 */             }
/* 060 */
/* 061 */             project_keys_0[project_idx_0] = project_entry_0.getUTF8String(0);
/* 062 */             project_values_0[project_idx_0] = project_entry_0.getUTF8String(1);
/* 063 */           }
/* 064 */
/* 065 */           project_value_0 = org.apache.spark.sql.catalyst.util.ArrayBasedMapData.apply(project_keys_0, project_values_0);
/* 066 */
/* 067 */         }

mn-mikke · 2018-05-09T17:36:33Z

cc @ueshin @gatorsmile

HyukjinKwon · 2018-05-10T01:24:55Z

ok to test

HyukjinKwon · 2018-05-10T01:25:00Z

add to whitelist

kiszk · 2018-05-10T03:58:16Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+      if (key == null) {
+        throw new RuntimeException("The first field from a struct (key) can't be null.")
+      }
+      if (keySet.contains(key)) {


Is this check necessary for now? Other operations (e.g. CreateMap) allow us to create a map with duplicate keys. Would it be better to be consistent within Spark?

Yeah, we've already touched on this topic in your PR for SPARK-23933. I think that if hashing is added to maps in the future, these duplicate-key checks will have to be introduced anyway. So if we add them now, we can avoid breaking changes later. But I understand your point of view.

Presto also doesn't support duplicates:

presto:default> SELECT map_from_entries(ARRAY[(1, 'x'), (1, 'y')]);
Query 20180510_090536_00005_468a9 failed: Duplicate keys (1) are not allowed

WDYT @ueshin @gatorsmile

I'm sorry for the long delay.
Let's just ignore duplicate keys, like CreateMap does, for now. We will need to discuss map-related topics such as duplicate keys, equality, and ordering.

Ok, no problem. I've removed the duplicate-key checks.
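
The outcome of this thread can be illustrated with a small model. This is a sketch only, assuming a map is stored as two parallel arrays of keys and values, as the ArrayBasedMapData.apply(keys, values) call in the generated code suggests. `EntriesSketch`, `ArrayMap`, and both methods are hypothetical names, not Spark code.

```scala
object EntriesSketch {
  // Hypothetical stand-in for a map held as parallel key and value arrays.
  final case class ArrayMap[K, V](keys: Vector[K], values: Vector[V])

  // Lenient variant: duplicate keys are kept as-is, mirroring the decision
  // to behave like CreateMap for now.
  def fromEntries[K, V](entries: Seq[(K, V)]): ArrayMap[K, V] =
    ArrayMap(entries.map(_._1).toVector, entries.map(_._2).toVector)

  // Strict variant: a duplicate-key check in the style of the Presto
  // behaviour quoted in this thread, shown only for contrast.
  def fromEntriesStrict[K, V](entries: Seq[(K, V)]): ArrayMap[K, V] = {
    val seen = scala.collection.mutable.HashSet.empty[K]
    entries.foreach { case (k, _) =>
      // HashSet.add returns false when the key was already present.
      if (!seen.add(k)) {
        throw new RuntimeException(s"Duplicate keys ($k) are not allowed")
      }
    }
    fromEntries(entries)
  }
}
```

With the lenient variant, `EntriesSketch.fromEntries(Seq(1 -> "x", 1 -> "y"))` keeps both entries, while the strict variant throws on the second key.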

SparkQA · 2018-05-10T05:04:01Z

Test build #90434 has finished for PR 21282 at commit 8c6039c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class MapFromEntries(child: Expression) extends UnaryExpression

ueshin · 2018-05-10T07:19:32Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+  since = "2.4.0")
+case class MapFromEntries(child: Expression) extends UnaryExpression
+{
+  private lazy val resolvedDataType: Option[MapType] = child.dataType match {


@transient?

ueshin · 2018-05-10T07:50:25Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+  private lazy val resolvedDataType: Option[MapType] = child.dataType match {
+    case ArrayType(
+      StructType(Array(
+        StructField(_, keyType, false, _),


We don't need key field to be nullable = false because we check the nullability when creating an array?

ueshin · 2018-05-10T08:00:56Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+      StructType(Array(
+        StructField(_, keyType, false, _),
+        StructField(_, valueType, valueNullable, _))),
+      false) => Some(MapType(keyType, valueType, valueNullable))


Can we reject an array with containsNull = true here? The array might not contain nulls.
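
One way to read the two comments above is that the type check could accept either nullability flag and leave null detection to evaluation time. The sketch below illustrates that reading with miniature stand-in types (`DType`, `Field`, `StructT`, `ArrayT`, `MapT`), all hypothetical and much simpler than Spark's DataType hierarchy; it is not the code from this patch.

```scala
object TypeResolutionSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  final case class Field(name: String, dataType: DType, nullable: Boolean)
  final case class StructT(fields: List[Field]) extends DType
  final case class ArrayT(elementType: DType, containsNull: Boolean) extends DType
  final case class MapT(keyType: DType, valueType: DType, valueContainsNull: Boolean)
      extends DType

  // Accepts array<struct<k, v>> regardless of key nullability or containsNull.
  // The two flags are returned so an evaluator would know which runtime
  // null checks it still has to perform.
  def resolve(input: DType): Option[(MapT, Boolean, Boolean)] = input match {
    case ArrayT(StructT(List(Field(_, k, keyNullable), Field(_, v, valueNullable))),
                containsNull) =>
      Some((MapT(k, v, valueNullable), keyNullable, containsNull))
    case _ => None
  }
}
```

Any input that is not an array of two-field structs resolves to `None`, which a caller could turn into a type-check failure.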

adrian-wang · 2018-05-10T10:13:18Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+  """,
+  since = "2.4.0")
+case class MapFromEntries(child: Expression) extends UnaryExpression
+{


SparkQA · 2018-05-10T17:03:10Z

Test build #90459 has finished for PR 21282 at commit 25aa879.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class MapFromEntries(child: Expression) extends UnaryExpression

SparkQA · 2018-05-17T12:21:34Z

Test build #90721 has finished for PR 21282 at commit 8d12d9f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-05-17T17:48:19Z

Test build #90741 has finished for PR 21282 at commit 7fd824e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysOverlap(left: Expression, right: Expression)

mn-mikke · 2018-05-17T19:20:33Z

retest this please

SparkQA · 2018-05-17T22:54:55Z

Test build #90748 has finished for PR 21282 at commit 7fd824e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysOverlap(left: Expression, right: Expression)

kiszk · 2018-05-18T03:39:50Z

retest this please

SparkQA · 2018-05-18T06:24:56Z

Test build #90773 has finished for PR 21282 at commit 7fd824e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysOverlap(left: Expression, right: Expression)

mn-mikke · 2018-05-18T07:24:14Z

retest this please

SparkQA · 2018-05-18T11:13:48Z

Test build #90782 has finished for PR 21282 at commit 7fd824e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ArraysOverlap(left: Expression, right: Expression)

SparkQA · 2018-05-28T17:48:22Z

Test build #91229 has finished for PR 21282 at commit 45e4633.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-03T02:45:22Z

Test build #91421 has finished for PR 21282 at commit 10ace84.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-06-04T19:57:48Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala

+    var i = 0
+    var j = 0
+    while (i < length) {
+      if (!arrayData.isNullAt(i)) {


We should throw an exception if arrayData.isNullAt(i)?

Hi @ueshin,
wouldn't it be better to return null in this case, following the null handling of other functions like flatten?

flatten(array(array(1,2), null, array(3,4))) => null

WDYT?

Yeah, that sounds reasonable. Thanks.
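
The null handling agreed on here can be sketched in plain Scala. Assumptions: a null result is modeled as `None`, a null entry or key as a Scala `null`, and `NullHandlingSketch.mapFromEntries` is a hypothetical helper rather than Spark's implementation.

```scala
object NullHandlingSketch {
  type Entry = (String, String)

  // Returns None (standing in for SQL NULL) if any entry is null, like the
  // flatten example above. A null key inside a non-null entry is an error.
  def mapFromEntries(entries: Seq[Entry]): Option[Seq[Entry]] =
    if (entries.exists(_ == null)) {
      None
    } else {
      entries.foreach { case (k, _) =>
        if (k == null) {
          throw new RuntimeException("The first field from a struct (key) can't be null.")
        }
      }
      Some(entries)
    }
}
```

In this model a null value is allowed, a null entry makes the whole result null, and only a null key raises an exception.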

HyukjinKwon · 2018-06-13T14:12:20Z

ok to test

SparkQA · 2018-06-13T18:00:16Z

Test build #91774 has finished for PR 21282 at commit 10ace84.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2018-06-21T15:00:10Z

Test build #92173 has finished for PR 21282 at commit 599656e.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-21T19:39:20Z

Test build #92175 has finished for PR 21282 at commit 4eaedc5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2018-06-22T07:15:45Z

LGTM.

ueshin · 2018-06-22T07:17:39Z

Thanks! merging to master.

[SPARK-23934][SQL] Adding map_from_entries function

8c6039c

kiszk reviewed May 10, 2018

ueshin reviewed May 10, 2018

adrian-wang reviewed May 10, 2018

[SPARK-23934][SQL] Addressing review comments

25aa879

[SPARK-23934][SQL] Resolving conflicts.

8d12d9f

[SPARK-23934][SQL] Merging master to the feature branch.

7fd824e

Merge remote-tracking branch 'spark/master' into feature/array-api-ma…

45e4633

…p_from_entries-to-master

mn-mikke added 2 commits June 2, 2018 20:31

[SPARK-23934][SQL] Merging master to the feature branch.

83165e0

[SPARK-23934][SQL] Ignoring key duplicities

10ace84

ueshin reviewed Jun 4, 2018

mn-mikke added 2 commits June 14, 2018 17:22

Merge remote-tracking branch 'spark/master' into feature/array-api-ma…

6cca713

…p_from_entries-to-master

Merge remote-tracking branch 'spark/master' into feature/array-api-ma…

44c513c

…p_from_entries-to-master

[SPARK-23934][SQL] Handling of null entries

599656e

[SPARK-23934][SQL] Fixing scala style

4eaedc5

asfgit closed this in 92c2f00 Jun 22, 2018

mn-mikke mentioned this pull request Aug 7, 2018

[SPARK-23939][SQL] Add transform_keys function #22013

Closed
