data source v2 prototype #8
Conversation
This is fine unless the filter is very expensive (for example if it is some kind of UDF).
According to the public Filter interface, we can only push down very simple filters, so this should not have performance issues. But we may need to add unhandledFilters to CatalystFilterPushDownSupport, which supports arbitrary expressions and may have very expensive predicates.
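For illustration, a minimal sketch of what an unhandledFilters-style contract could look like on the Catalyst-level mixin. The interface name comes from this thread; the method name and signature are assumptions, not the prototype's actual API:
```java
import java.util.List;
import org.apache.spark.sql.catalyst.expressions.Expression;

public interface CatalystFilterPushDownSupport {
  // Receives arbitrary Catalyst predicates and returns the subset this data
  // source cannot evaluate itself; Spark keeps a Filter node only for those,
  // so an expensive predicate (e.g. a UDF) is not silently evaluated twice.
  List<Expression> pushCatalystFilters(List<Expression> predicates);
}
```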
Does this work with nested columns?
Good question. It will be hard/tricky to support nested column pruning with this API. How about we use requiredSchema: StructType as the parameter?
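For illustration, a minimal sketch of the StructType-based variant suggested here; the interface and method names are assumed:
```java
import org.apache.spark.sql.types.StructType;

public interface ColumnPruningSupport {
  // requiredSchema may itself contain pruned nested structs, so the reader
  // can skip unused nested fields, not just unused top-level columns.
  void pruneColumns(StructType requiredSchema);
}
```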
shall we keep the hashFunction parameter or have 2 interfaces: DefaultHashPartitionPushDownSupport and HiveHashPartitionPushDownSupport?
It will be hard/tricky to support nested column pruning with this API. How about we use requiredSchema: StructType as the parameter? cc @marmbrus
That could work.
functionalists -> functions
I'd create an abstraction to make it clear this is case insensitive.
we should
getWriter -> createWriter
getReader -> createReader
allow -> allows, or accepts
or supportsUserDefinedSchema
maybe generalize it to statistics support, and look for the useful statistics ....
Make the name more general? StatsEstimationSupport
it's unclear whether this is the returned data's schema, or the full schema.
how about renaming it to actualSchema?
readSchema?
"generator" is not a great name here since it almost implies the data is just generated rather than being scanned (read).
OK after reading through all the code I'd call this DataReadTask, or just ReadTask
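For illustration, a rough sketch of the shape a ReadTask could take under that naming; the methods shown are assumptions, not the prototype's actual API:
```java
import java.io.Serializable;
import java.util.Iterator;

public interface ReadTask<T> extends Serializable {
  // Preferred executor hosts for this split, e.g. HDFS block locations.
  String[] preferredLocations();

  // Opens the split and returns its records; called on an executor.
  Iterator<T> open();
}
```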
why don't we put the rest of the interfaces for enhancing this reader in one file? we can also get rid of the reader package.
Actually, after thinking more about it, I think this is fine.
How do we deal with the case in which a data source can read some schemas in columnar format but not others? Do we require the data source itself to handle that internally and do the conversion from row to column when needed?
If users implement ColumnarReadSupport, they have to output columnar data for all required columns. If some columns are not in columnar format, the implementation itself is responsible for handling that internally and doing the conversion.
Alternatively, we can add one more method to ColumnarReadSupport: boolean supportColumnar(). By default it's true, but users can override it according to the required schema, to decide whether to do this scan in a columnar fashion or a row fashion.
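For illustration, a minimal sketch of that alternative; only the supportColumnar method comes from the comment above, the rest is assumed:
```java
public interface ColumnarReadSupport {
  // Whether this particular scan, given the columns already pruned and the
  // filters already pushed down, can be executed in a columnar fashion.
  // Returning false makes Spark fall back to the row-based read path.
  default boolean supportColumnar() { return true; }
}
```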
I'd add the supportsColumnarReads or supportsColumnBatch method
This is a weird one and I'm not sure it can be useful as is. There are typically two cases:
- the data source is already partitioned by a key, and as a result Spark doesn't need to do a shuffle.
- the data source is capable of doing shuffle by arbitrary column itself.
This interface is designed for case 2, but case 1 is far more common than case 2.
Actually it's designed for case 1...
At the beginning, I was thinking of doing hash partitioning propagation, which looks more natural for supporting case 1. However, some data sources may be partitioned in multiple ways, e.g. hash partitioned by a, b and also hash partitioned by c, d. It would be a little messy to ask a data source to propagate all the possible hash partitionings.
Instead, I think it can be simpler if we do hash partitioning push down. The implementation just needs to tell us whether it can do this hash partitioning or not. That said, this method should return a boolean instead of void.
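For illustration, a minimal sketch of the push-down-with-boolean-result idea; the interface and method names are assumed:
```java
public interface HashPartitionPushDownSupport {
  // Returns true if this source's output is already hash-partitioned by the
  // given columns into numPartitions buckets, letting Spark skip the shuffle
  // (case 1 above); returning false keeps the normal Exchange.
  boolean pushHashPartitioning(String[] partitionColumns, int numPartitions);
}
```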
I'd force them to implement everything. When implementing a data source this is really trivial if it is a no-op. Forcing them to implement it would make it less likely for somebody to forget.
would we be able to implement v1 using v2?
I don't think so. For data source v1, users give us an RDD directly, and I'm not sure how to turn an RDD into a list of ReadTasks.
fix the space around <Row>
Better to write the test suite itself in Scala since ScalaTest has better reporting, etc.
Add a comment about nested columns so it is clear that this is not just for top-level pruning.
return a message
use Array.
readRows -> getReadTasks
get -> create, since it might be expensive. get suggests that it is cheap
return boolean
add TopKPushDown
Do we need that? If we push both sort and limit down, it is top-k, isn't it?
add unHandledFilters
Nit: Also, describe the meaning of the returned boolean by adding @return?
I think the first sentence, "Whether or not this data source can accept a user-specified schema", already describes the meaning?
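For illustration, what the suggested doc could look like; the enclosing interface name is hypothetical, and the method name follows the earlier supportsUserDefinedSchema suggestion:
```java
public interface UserDefinedSchemaSupport {
  /**
   * Whether or not this data source can accept a user-specified schema.
   *
   * @return true if the scan can be created with a user-specified schema,
   *         false if the source always infers or provides its own schema.
   */
  boolean supportsUserDefinedSchema();
}
```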
job-level transaction support
are failed -> failed
Does it mean some partitions abort? Do we hold until all the partitions return results (either commit or abort)?
The avoid -> To avoid
and write -> writes
which is responsible to write -> that is responsible for writing
How about the other parameters
seed?
withReplacement?
lowerBound?
upperBound?
Actually, sample has different algorithms. Do we need to pass the method identifier?
For example, https://www.toadworld.com/platforms/ibmdb2/w/wiki/7748.data-sampling
Do Spark or end users need to care about sampling algorithms? I think it's more flexible for data source implementations to implement sampling if the fraction is the only requirement.
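For illustration, a minimal fraction-only sketch of what that could look like; the names and signature are assumed:
```java
public interface SamplePushDownSupport {
  // Returns true if the source will sample roughly `fraction` of its rows
  // itself, with whatever algorithm it prefers, so Spark can drop its own
  // Sample operator; false means Spark samples after the scan as usual.
  boolean pushSample(double fraction);
}
```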
remove that
and is responsible to output -> is responsible for outputting
locations -> location
remove be.
, implementations -> . Implementations
## What changes were proposed in this pull request?
This PR aims at improving the way physical plans are explained in Spark. Currently, the explain output for a physical plan can look very cluttered, and each operator's string representation can be very wide and wrap around in the display, making it a little hard to follow. This especially happens when explaining a query that 1) operates on wide tables or 2) has complex expressions, etc.

This PR attempts to split the output into two sections. In the header section, we display the basic operator tree with a number associated with each operator. In this section, we strictly control what we output for each operator. In the footer section, each operator is verbosely displayed.

Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be correlated by the originating expression id from its parent plan.

To illustrate, here is a simple plan displayed in the old vs. new way.

Example query 1:
```
EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
```
Old:
```
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
   +- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
      +- Exchange hashpartitioning(key#2, 200)
         +- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
            +- *(1) Project [key#2, val#3]
               +- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
                  +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (8)
+- Filter (7)
   +- HashAggregate (6)
      +- Exchange (5)
         +- HashAggregate (4)
            +- Project (3)
               +- Filter (2)
                  +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))

(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]

(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]

(5) Exchange
Input: [key#2, max#11]

(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]

(7) Filter [codegen id : 2]
Input : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))

(8) Project [codegen id : 2]
Output : [key#2, max(val)#5]
Input : [key#2, max(val)#5, max(val#3)#8]
```
Example query 2 (subquery):
```
SELECT * FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3
```
Old:
```
*(1) Project [key#2, val#3]
+- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
   :  +- Subquery scalar-subquery#39
   :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45])
   :        +- Exchange SinglePartition
   :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
   :              +- *(1) Project [key#26]
   :                 +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
   :                    :  +- Subquery scalar-subquery#38
   :                    :     +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43])
   :                    :        +- Exchange SinglePartition
   :                    :           +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
   :                    :              +- *(1) Project [key#28]
   :                    :                 +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
   :                    :                    +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
   :                    +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
   +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (3)
+- Filter (2)
   +- Scan parquet default.explain_temp1 (1)

(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]

(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))

(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]

===== Subqueries =====

Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
HashAggregate (9)
+- Exchange (8)
   +- HashAggregate (7)
      +- Project (6)
         +- Filter (5)
            +- Scan parquet default.explain_temp2 (4)

(4) Scan parquet default.explain_temp2 [codegen id : 1]
Output: [key#26, val#27]

(5) Filter [codegen id : 1]
Input : [key#26, val#27]
Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))

(6) Project [codegen id : 1]
Output : [key#26]
Input : [key#26, val#27]

(7) HashAggregate [codegen id : 1]
Input: [key#26]

(8) Exchange
Input: [max#35]

(9) HashAggregate [codegen id : 2]
Input: [max#35]

Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
HashAggregate (15)
+- Exchange (14)
   +- HashAggregate (13)
      +- Project (12)
         +- Filter (11)
            +- Scan parquet default.explain_temp3 (10)

(10) Scan parquet default.explain_temp3 [codegen id : 1]
Output: [key#28, val#29]

(11) Filter [codegen id : 1]
Input : [key#28, val#29]
Condition : (isnotnull(val#29) AND (val#29 > 0))

(12) Project [codegen id : 1]
Output : [key#28]
Input : [key#28, val#29]

(13) HashAggregate [codegen id : 1]
Input: [key#28]

(14) Exchange
Input: [max#37]

(15) HashAggregate [codegen id : 2]
Input: [max#37]
```
Note: I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow and will not be able to immediately incorporate the feedback; I will start to work on it as soon as I can. Also, currently this PR provides the basic infrastructure for the explain enhancement. The details for individual operators will be implemented in follow-up PRs.

## How was this patch tested?
Added a new test `explain.sql` that tests basic scenarios. Need to add more tests.

Closes apache#24759 from dilipbiswal/explain_feature.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
This PR aims to disable `to_timestamp('366', 'DD')` to recover the `ansi` test suite in Java 11+.
### Why are the changes needed?
Currently, the daily Java 11 and 17 GitHub Action jobs are broken.
- https://github.com/apache/spark/runs/5511239176?check_suite_focus=true
- https://github.com/apache/spark/runs/5513540604?check_suite_focus=true
**Java 8**
```
$ bin/spark-shell --conf spark.sql.ansi.enabled=true
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/12 00:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.16.0.31:4040
Spark context available as 'sc' (master = local[*], app id = local-1647075572229).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_322)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sql("select to_timestamp('366', 'DD')").show
java.time.format.DateTimeParseException: Text '366' could not be parsed, unparsed text found at index 2. If necessary set spark.sql.ansi.enabled to false to bypass this error.
```
**Java 11+**
```
$ bin/spark-shell --conf spark.sql.ansi.enabled=true
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/03/12 01:00:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://172.16.0.31:4040
Spark context available as 'sc' (master = local[*], app id = local-1647075607932).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.0-SNAPSHOT
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 11.0.12)
Type in expressions to have them evaluated.
Type :help for more information.
scala> sql("select to_timestamp('366', 'DD')").show
java.time.DateTimeException: Invalid date 'DayOfYear 366' as '1970' is not a leap year. If necessary set spark.sql.ansi.enabled to false to bypass this error.
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Test with Java 11+.
**BEFORE**
```
$ java -version
openjdk version "17.0.2" 2022-01-18 LTS
OpenJDK Runtime Environment Zulu17.32+13-CA (build 17.0.2+8-LTS)
OpenJDK 64-Bit Server VM Zulu17.32+13-CA (build 17.0.2+8-LTS, mixed mode, sharing)
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/datetime-parsing-invalid.sql"
...
[info] SQLQueryTestSuite:
01:23:00.219 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
01:23:05.209 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs:
[info] - ansi/datetime-parsing-invalid.sql *** FAILED *** (267 milliseconds)
[info] ansi/datetime-parsing-invalid.sql
[info] Expected "java.time.[format.DateTimeParseException
[info] Text '366' could not be parsed, unparsed text found at index 2]. If necessary set s...", but got "java.time.[DateTimeException
[info] Invalid date 'DayOfYear 366' as '1970' is not a leap year]. If necessary set s..." Result did not match for query #8
[info] select to_timestamp('366', 'DD') (SQLQueryTestSuite.scala:476)
...
[info] Run completed in 7 seconds, 389 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 1, canceled 0, ignored 0, pending 0
[info] *** 1 TEST FAILED ***
[error] Failed tests:
[error] org.apache.spark.sql.SQLQueryTestSuite
[error] (sql / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 21 s, completed Mar 12, 2022, 1:23:05 AM
```
**AFTER**
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/datetime-parsing-invalid.sql"
...
[info] SQLQueryTestSuite:
[info] - ansi/datetime-parsing-invalid.sql (390 milliseconds)
...
[info] Run completed in 7 seconds, 673 milliseconds.
[info] Total number of tests run: 1
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 20 s, completed Mar 12, 2022, 1:24:52 AM
```
Closes apache#35825 from dongjoon-hyun/SPARK-38534.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…n properly
### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly
### Why are the changes needed?
Bug fix for Spark Connect; it won't affect classic Spark SQL.
Before this PR:
```
from pyspark.sql import functions as sf
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")
join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)
join2.schema
```
fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```
That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2]
! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index)
! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
Can not resolve 'id with plan 7
```
`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
+- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
### Does this PR introduce _any_ user-facing change?
yes, bug fix
### How was this patch tested?
added ut
### Was this patch authored or co-authored using generative AI tooling?
ci
Closes apache#45214 from zhengruifeng/connect_fix_read_join.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…plan properly

### What changes were proposed in this pull request?
Make `ResolveRelations` handle plan id properly.

cherry-pick bugfix apache#45214 to 3.5

### Why are the changes needed?
Bug fix for Spark Connect; it won't affect classic Spark SQL.
Before this PR:
```
from pyspark.sql import functions as sf
spark.range(10).withColumn("value_1", sf.lit(1)).write.saveAsTable("test_table_1")
spark.range(10).withColumnRenamed("id", "index").withColumn("value_2", sf.lit(2)).write.saveAsTable("test_table_2")
df1 = spark.read.table("test_table_1")
df2 = spark.read.table("test_table_2")
df3 = spark.read.table("test_table_1")
join1 = df1.join(df2, on=df1.id==df2.index).select(df2.index, df2.value_2)
join2 = df3.join(join1, how="left", on=join1.index==df3.id)
join2.schema
```
fails with
```
AnalysisException: [CANNOT_RESOLVE_DATAFRAME_COLUMN] Cannot resolve dataframe column "id". It's probably because of illegal references like `df1.select(df2.col("a"))`. SQLSTATE: 42704
```
That is because the existing plan caching in `ResolveRelations` doesn't work with Spark Connect:
```
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations ===
'[#12]Join LeftOuter, '`==`('index, 'id) '[#12]Join LeftOuter, '`==`('index, 'id)
!:- '[#9]UnresolvedRelation [test_table_1], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
!+- '[#11]Project ['index, 'value_2] : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#10]Join Inner, '`==`('id, 'index) +- '[#11]Project ['index, 'value_2]
! :- '[#7]UnresolvedRelation [test_table_1], [], false +- '[#10]Join Inner, '`==`('id, 'index)
! +- '[#8]UnresolvedRelation [test_table_2], [], false :- '[#9]SubqueryAlias spark_catalog.default.test_table_1
! : +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
! +- '[#8]SubqueryAlias spark_catalog.default.test_table_2
! +- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_2`, [], false
Can not resolve 'id with plan 7
```
`[#7]UnresolvedRelation [test_table_1], [], false` was wrongly resolved to the cached one
```
:- '[#9]SubqueryAlias spark_catalog.default.test_table_1
+- 'UnresolvedCatalogRelation `spark_catalog`.`default`.`test_table_1`, [], false
```
### Does this PR introduce _any_ user-facing change?
yes, bug fix

### How was this patch tested?
added ut

### Was this patch authored or co-authored using generative AI tooling?
ci

Closes apache#46291 from zhengruifeng/connect_fix_read_join_35.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
This is a trivial change to replace the loop index from `int` to `long`. Surprisingly, a microbenchmark shows more than a 2x performance uplift.
Analysis
--------
The hot loop of the `arrayEquals` method is simplified below. The loop index `i` is defined as an `int` and is compared with `length`, which is a `long`, to determine if the loop should end.
```
public static boolean arrayEquals(
Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) {
......
int i = 0;
while (i <= length - 8) {
if (Platform.getLong(leftBase, leftOffset + i) !=
Platform.getLong(rightBase, rightOffset + i)) {
return false;
}
i += 8;
}
......
}
```
Strictly speaking, there's a code bug here. If `length` is greater than 2^31 + 8, this loop will never end because `i`, as a 32-bit integer, is at most 2^31 - 1. But the compiler must treat this behaviour as intentional and generate code that strictly matches the logic, which prevents it from generating optimal code.
Defining the loop index `i` as `long` corrects this issue. Besides more accurate code logic, the JIT is able to optimize this code much more aggressively. Per the microbenchmark, this trivial change improves performance significantly on both Arm and x86 platforms.
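For reference, the same hot loop with the index widened to `long`, mirroring the snippet above (elisions kept as in the original):
```
public static boolean arrayEquals(
  Object leftBase, long leftOffset, Object rightBase, long rightOffset, final long length) {
  ......
  long i = 0;  // was: int i = 0;
  while (i <= length - 8) {
    if (Platform.getLong(leftBase, leftOffset + i) !=
      Platform.getLong(rightBase, rightOffset + i)) {
      return false;
    }
    i += 8;
  }
  ......
}
```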
Benchmark
---------
Source code:
https://gist.github.com/cyb70289/258e261f388e22f47e4d961431786d1a
Result on Arm Neoverse N2:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 674.313 ± 0.213 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 313.563 ± 2.338 ns/op
```
Result on Intel Cascade Lake:
```
Benchmark Mode Cnt Score Error Units
ArrayEqualsBenchmark.arrayEqualsInt avgt 10 1130.695 ± 0.168 ns/op
ArrayEqualsBenchmark.arrayEqualsLong avgt 10 461.979 ± 0.097 ns/op
```
Deep dive
---------
Diving down to the machine code level, we can see why the gap is so big. Listed below is the arm64 assembly generated by the OpenJDK 17 C2 compiler.
For `int i`, the machine code is similar to the source code, with no deep optimization. Safepoint polling is expensive in this short loop.
```
// jit c2 machine code snippet
0x0000ffff81ba8904: mov w15, wzr // int i = 0
0x0000ffff81ba8908: nop
0x0000ffff81ba890c: nop
loop:
0x0000ffff81ba8910: ldr x10, [x13, w15, sxtw] // Platform.getLong(leftBase, leftOffset + i)
0x0000ffff81ba8914: ldr x14, [x12, w15, sxtw] // Platform.getLong(rightBase, rightOffset + i)
0x0000ffff81ba8918: cmp x10, x14
0x0000ffff81ba891c: b.ne 0x0000ffff81ba899c // return false if not equal
0x0000ffff81ba8920: ldr x14, [x28, #848] // x14 -> safepoint
0x0000ffff81ba8924: add w15, w15, #0x8 // i += 8
0x0000ffff81ba8928: ldr wzr, [x14] // safepoint polling
0x0000ffff81ba892c: sxtw x10, w15 // extend i to long
0x0000ffff81ba8930: cmp x10, x11
0x0000ffff81ba8934: b.le 0x0000ffff81ba8910 // if (i <= length - 8) goto loop
```
For `long i`, the JIT is able to do much more aggressive optimization. E.g., the code snippet below unrolls the loop by four.
```
// jit c2 machine code snippet
unrolled_loop:
0x0000ffff91de6fe0: sxtw x10, w7
0x0000ffff91de6fe4: add x23, x22, x10
0x0000ffff91de6fe8: add x24, x21, x10
0x0000ffff91de6fec: ldr x13, [x23] // unroll-1
0x0000ffff91de6ff0: ldr x14, [x24]
0x0000ffff91de6ff4: cmp x13, x14
0x0000ffff91de6ff8: b.ne 0x0000ffff91de70a8
0x0000ffff91de6ffc: ldr x13, [x23, #8] // unroll-2
0x0000ffff91de7000: ldr x14, [x24, #8]
0x0000ffff91de7004: cmp x13, x14
0x0000ffff91de7008: b.ne 0x0000ffff91de70b4
0x0000ffff91de700c: ldr x13, [x23, #16] // unroll-3
0x0000ffff91de7010: ldr x14, [x24, #16]
0x0000ffff91de7014: cmp x13, x14
0x0000ffff91de7018: b.ne 0x0000ffff91de70a4
0x0000ffff91de701c: ldr x13, [x23, #24] // unroll-4
0x0000ffff91de7020: ldr x14, [x24, #24]
0x0000ffff91de7024: cmp x13, x14
0x0000ffff91de7028: b.ne 0x0000ffff91de70b0
0x0000ffff91de702c: add w7, w7, #0x20
0x0000ffff91de7030: cmp w7, w11
0x0000ffff91de7034: b.lt 0x0000ffff91de6fe0
```
### What changes were proposed in this pull request?
A trivial change to replace loop index `i` of method `arrayEquals` from `int` to `long`.
### Why are the changes needed?
To improve performance and fix a possible bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#49568 from cyb70289/arrayEquals.
Authored-by: Yibo Cai <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…onicalized expressions
### What changes were proposed in this pull request?
Make PullOutNonDeterministic use canonicalized expressions to dedup group and aggregate expressions. This affects PySpark UDFs in particular. Example:
```
from pyspark.sql.functions import col, avg, udf
pythonUDF = udf(lambda x: x).asNondeterministic()
spark.range(10)\
.selectExpr("id", "id % 3 as value")\
.groupBy(pythonUDF(col("value")))\
.agg(avg("id"), pythonUDF(col("value")))\
.explain(extended=True)
```
Currently results in a plan like this:
```
Aggregate [_nondeterministic#15], [_nondeterministic#15 AS dummyNondeterministicUDF(value)#12, avg(id#0L) AS avg(id)#13, dummyNondeterministicUDF(value#6L)#8 AS dummyNondeterministicUDF(value)#14]
+- Project [id#0L, value#6L, dummyNondeterministicUDF(value#6L)#7 AS _nondeterministic#15]
   +- Project [id#0L, (id#0L % cast(3 as bigint)) AS value#6L]
      +- Range (0, 10, step=1, splits=Some(2))
```
and then it throws:
```
[MISSING_AGGREGATION] The non-aggregating expression "value" is based on columns which are not participating in the GROUP BY clause. Add the columns or the expression to the GROUP BY, aggregate the expression, or use "any_value(value)" if you do not care which of the values within a group is returned. SQLSTATE: 42803
```
- How canonicalization fixes this:
  - Nondeterministic PythonUDF expressions always have distinct resultIds per UDF.
  - The fix is to canonicalize the expressions when matching. Canonicalized means that we set the resultIds to -1, allowing us to dedup the PythonUDF expressions.
  - For deterministic UDFs, this rule does not apply and the "Post Analysis" batch extracts and deduplicates the expressions, as expected.
### Why are the changes needed?
- The output of the query with the fix applied still makes sense: the nondeterministic UDF is invoked only once, in the Project.
### Does this PR introduce _any_ user-facing change?
Yes, it's additive: it enables queries to run that previously threw errors.
### How was this patch tested?
- added unit test
### Was this patch authored or co-authored using generative AI tooling?
No
Closes apache#52061 from benrobby/adhoc-fix-pull-out-nondeterministic.
Authored-by: Ben Hurdelhey <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>