
pull bot commented Nov 5, 2022


### What changes were proposed in this pull request?

This is an update of #29178, which was closed because the root cause of the error was only vaguely described there. Here I explain why `HiveHBaseTableInputFormat` does not work well with `NewHadoopRDD` (see the next section).

This PR modifies `TableReader.scala` to create the old-API `HadoopRDD` when the input format is `org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat`.
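
For illustration, a minimal sketch of the check, assuming hypothetical helpers (`createOldHadoopRDD`, `createNewHadoopRDD`) that stand in for the real RDD-creation code paths:

```scala
// Hypothetical helper stubs standing in for the real RDD-creation paths in
// TableReader.scala (the real method names may differ).
def createOldHadoopRDD(): Unit = println("using HadoopRDD (old org.apache.hadoop.mapred API)")
def createNewHadoopRDD(): Unit = println("using NewHadoopRDD (new org.apache.hadoop.mapreduce API)")

// Only HiveHBaseTableInputFormat needs the old-API path, because only
// getSplits(JobConf, int) triggers the HBase table initialisation.
val hiveHBaseInputFormat = "org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat"

def createRddForTable(inputFormatClassName: String): Unit =
  if (inputFormatClassName == hiveHBaseInputFormat) createOldHadoopRDD()
  else createNewHadoopRDD()
```

The real change lives in `TableReader.scala`; the sketch only shows the shape of the branching.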

Environments (Cloudera distribution 7.1.7.SP1):
- Hadoop 3.1.1
- Hive 3.1.300
- Spark 3.2.1
- HBase 2.2.3

### Why are the changes needed?

With `NewHadoopRDD`, the following exception is raised:

```
java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
  at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:253)
  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:446)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:429)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:48)
  at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3715)
  at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3706)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3704)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2728)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2935)
  at org.apache.spark.sql.Dataset.getRows(Dataset.scala:287)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:326)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:806)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:765)
  at org.apache.spark.sql.Dataset.show(Dataset.scala:774)
  ... 47 elided
Caused by: java.lang.IllegalStateException: The input format instance has not been properly initialized. Ensure you call initializeTable either in your constructor or initialize method
  at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getTable(TableInputFormatBase.java:557)
  at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:248)
  ... 86 more
```

### Short summary of the root cause

There are two interfaces:

- the new `org.apache.hadoop.mapreduce.InputFormat`: providing a one arg method `getSplits(JobContext context)` (returning `List<InputSplit>`)
- the old `org.apache.hadoop.mapred.InputFormat`: providing a two arg method `getSplits(JobConf job, int numSplits)` (returning `InputSplit[]`)

In Hive, both are implemented by `HiveHBaseTableInputFormat`, but only the old method performs the required initialisation, and that is why `NewHadoopRDD` fails here.
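
For reference, here is a small sketch of how a caller reaches each entry point (the renamed imports are only for readability; this is not Spark or Hive code):

```scala
import org.apache.hadoop.mapred.{InputFormat => OldInputFormat, InputSplit => OldSplit, JobConf}
import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat, InputSplit => NewSplit, Job}

// Old mapred API: the two-arg method. HiveHBaseTableInputFormat initialises the
// HBase table inside this call, so the caller gets usable splits.
def oldApiSplits[K, V](fmt: OldInputFormat[K, V], conf: JobConf): Array[OldSplit] =
  fmt.getSplits(conf, 1)

// New mapreduce API: the one-arg method. For HiveHBaseTableInputFormat this goes
// straight to TableInputFormatBase.getSplits, whose initialize(JobContext) hook is
// a no-op by default, hence the "has not been properly initialized" error above.
def newApiSplits[K, V](fmt: NewInputFormat[K, V], job: Job): java.util.List[NewSplit] =
  fmt.getSplits(job)
```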

### Detailed analyses

All links below refer to specific commits of the master branches of the mentioned components at the time of writing (so the line numbers stay correct in the future, since `master` itself is a moving target).

Spark's `NewHadoopRDD` uses the new interface and its one-arg method:
https://github.com/apache/spark/blob/5556cfc59aa97a3ad4ea0baacebe19859ec0bcb7/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L136

Hive, on the other hand, ties the initialisation to the two-arg method coming from the old interface.
See [Hive#getSplits](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L268):

```java
  @Override
  public InputSplit[] getSplits(final JobConf jobConf, final int numSplits) throws IOException {
```

This calls `getSplitsInternal` which contains the [initialisation](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L299) too:
```
    initializeTable(conn, tableName);
```

Interestingly, Hive also uses the one-arg method internally within `getSplitsInternal` [here](https://github.com/apache/hive/blob/fd029c5b246340058aee513980b8bf660aee0227/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java#L356), but by that point the initialisation has already been done.

Calling the new interface method (which is what `NewHadoopRDD` does) goes straight to the HBase method: [org.apache.hadoop.hbase.mapreduce.TableInputFormatBase#getSplits](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L230).

There, the only hook is a `JobContext`-based [initialisation](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L234-L237),
which by default is an [empty method](https://github.com/apache/hbase/blob/63cdd026f08cdde6ac0fde1342ffd050e8e02441/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.java#L628-L640):
```java
  /**
   * Handle subclass specific set up. Each of the entry points used by the MapReduce framework,
   * {@link #createRecordReader(InputSplit, TaskAttemptContext)} and {@link #getSplits(JobContext)},
   * will call {@link #initialize(JobContext)} as a convenient centralized location to handle
   * retrieving the necessary configuration information and calling
   * {@link #initializeTable(Connection, TableName)}. Subclasses should implement their initialize
   * call such that it is safe to call multiple times. The current TableInputFormatBase
   * implementation relies on a non-null table reference to decide if an initialize call is needed,
   * but this behavior may change in the future. In particular, it is critical that initializeTable
   * not be called multiple times since this will leak Connection instances.
   */
  protected void initialize(JobContext context) throws IOException {
  }
```

This method is not overridden by Hive, and since `HiveHBaseTableInputFormat` is an internal Hive class it is hard to argue for changing it there, so it is easier to fix this in Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1) Create the HBase table and insert a row:

```
 hbase(main):001:0> create 'hbase_test', 'cf1'
 hbase(main):002:0> put 'hbase_test', 'r1', 'cf1:v1', '123'
```

2) Create the Hive external table mapped to the HBase table (from the `hive>` shell):

```
CREATE EXTERNAL TABLE `hivetest.hbase_test`(
  `key` string COMMENT '',
  `value` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping'=':key,cf1:v1',
  'serialization.format'='1')
TBLPROPERTIES (
  'hbase.table.name'='hbase_test')
```
 

3) Query the Hive table from spark-shell while the data lives in HBase:

```
scala> spark.sql("select * from hivetest.hbase_test").show()
22/11/05 01:14:16 WARN conf.HiveConf: HiveConf of name hive.masking.algo does not exist
22/11/05 01:14:16 WARN client.HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = f05b6866-86df-4d88-9eea-f1c45043bb5f
+---+-----+
|key|value|
+---+-----+
| r1|  123|
+---+-----+
```

Closes #38516 from attilapiros/SPARK-32380.

Authored-by: attilapiros <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
github-actions bot added the SQL label Nov 5, 2022
pull bot added the ⤵️ pull label and removed the SQL label Nov 5, 2022
pull bot merged commit 7009ef0 into huangxiaopingRD:master Nov 6, 2022
huangxiaopingRD pushed a commit that referenced this pull request Feb 10, 2023
…elper#replaceNotIncludedMsg` to remove `@hashCode`

### What changes were proposed in this pull request?
The daily build GA `Master Scala 2.13 + Hadoop 3 + JDK 8` started failing after apache#39468 was merged; the failing tests include:

- org.apache.spark.sql.SQLQueryTestSuite
- org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite

and the failure message is similar to the following:

```
2023-01-11T01:03:46.3478323Z 01:03:46.347 ERROR org.apache.spark.sql.SQLQueryTestSuite: Error using configs:
2023-01-11T01:03:46.3677880Z [info]- explain.sql *** FAILED *** (348 milliseconds)
2023-01-11T01:03:46.3678710Z [info]  explain.sql
2023-01-11T01:03:46.3679479Z [info]  Expected "...es.CatalogFileIndex[7d811218, [key, val]
2023-01-11T01:03:46.3680509Z [info]  +- Project [key#x, val#x]
2023-01-11T01:03:46.3681033Z [info]     +- SubqueryAlias spark_catalog.default.explain_temp4
2023-01-11T01:03:46.3684259Z [info]        +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3684922Z [info]
2023-01-11T01:03:46.3685766Z [info]  == Optimized Logical Plan ==
2023-01-11T01:03:46.3687590Z [info]  InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex7d811218, [key, val]
2023-01-11T01:03:46.3688465Z [info]  +- WriteFiles
2023-01-11T01:03:46.3690929Z [info]     +- Sort [val#x ASC NULLS FIRST], false
2023-01-11T01:03:46.3691387Z [info]        +- Project [key#x, empty2null(val#x) AS val#x]
2023-01-11T01:03:46.3692078Z [info]           +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3692549Z [info]
2023-01-11T01:03:46.3693443Z [info]  == Physical Plan ==
2023-01-11T01:03:46.3695233Z [info]  Execute InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex7d811218], [key, val]
2023-01-11T01:03:46.3696100Z [info]  +- Writ...", but got "...es.CatalogFileIndex[cdfa8472, [key, val]
2023-01-11T01:03:46.3698327Z [info]  +- Project [key#x, val#x]
2023-01-11T01:03:46.3698881Z [info]     +- SubqueryAlias spark_catalog.default.explain_temp4
2023-01-11T01:03:46.3699680Z [info]        +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3704986Z [info]
2023-01-11T01:03:46.3705457Z [info]  == Optimized Logical Plan ==
2023-01-11T01:03:46.3717140Z [info]  InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndexcdfa8472, [key, val]
2023-01-11T01:03:46.3718309Z [info]  +- WriteFiles
2023-01-11T01:03:46.3718964Z [info]     +- Sort [val#x ASC NULLS FIRST], false
2023-01-11T01:03:46.3719752Z [info]        +- Project [key#x, empty2null(val#x) AS val#x]
2023-01-11T01:03:46.3723046Z [info]           +- Relation spark_catalog.default.explain_temp4[key#x,val#x] parquet
2023-01-11T01:03:46.3723598Z [info]
2023-01-11T01:03:46.3726955Z [info]  == Physical Plan ==
2023-01-11T01:03:46.3728111Z [info]  Execute InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndexcdfa8472], [key, val]
2023-01-11T01:03:46.3898445Z [info]  +- Writ..." Result did not match for query #21
2023-01-11T01:03:46.3902948Z [info]  EXPLAIN EXTENDED INSERT INTO TABLE explain_temp5 SELECT * FROM explain_temp4 (SQLQueryTestSuite.scala:495)
2023-01-11T01:03:46.3903881Z [info]  org.scalatest.exceptions.TestFailedException:
2023-01-11T01:03:46.3904492Z [info]  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
2023-01-11T01:03:46.3905449Z [info]  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
2023-01-11T01:03:46.3906493Z [info]  at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3907683Z [info]  at org.scalatest.Assertions.assertResult(Assertions.scala:847)
2023-01-11T01:03:46.3908243Z [info]  at org.scalatest.Assertions.assertResult$(Assertions.scala:842)
2023-01-11T01:03:46.3908812Z [info]  at org.scalatest.funsuite.AnyFunSuite.assertResult(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3910011Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$11(SQLQueryTestSuite.scala:495)
2023-01-11T01:03:46.3910611Z [info]  at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563)
2023-01-11T01:03:46.3911163Z [info]  at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561)
2023-01-11T01:03:46.3912094Z [info]  at scala.collection.AbstractIterable.foreach(Iterable.scala:926)
2023-01-11T01:03:46.3912781Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$9(SQLQueryTestSuite.scala:486)
2023-01-11T01:03:46.3913371Z [info]  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
2023-01-11T01:03:46.3914237Z [info]  at org.scalatest.Assertions.withClue(Assertions.scala:1065)
2023-01-11T01:03:46.3915165Z [info]  at org.scalatest.Assertions.withClue$(Assertions.scala:1052)
2023-01-11T01:03:46.3915725Z [info]  at org.scalatest.funsuite.AnyFunSuite.withClue(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3916341Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:462)
2023-01-11T01:03:46.3917485Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$35(SQLQueryTestSuite.scala:364)
2023-01-11T01:03:46.3918517Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$35$adapted(SQLQueryTestSuite.scala:362)
2023-01-11T01:03:46.3919102Z [info]  at scala.collection.immutable.List.foreach(List.scala:333)
2023-01-11T01:03:46.3919675Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.runTest(SQLQueryTestSuite.scala:362)
2023-01-11T01:03:46.3921754Z [info]  at org.apache.spark.sql.SQLQueryTestSuite.$anonfun$createScalaTestCase$6(SQLQueryTestSuite.scala:269)
2023-01-11T01:03:46.3922358Z [info]  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
2023-01-11T01:03:46.3923784Z [info]  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
2023-01-11T01:03:46.3924473Z [info]  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
2023-01-11T01:03:46.3925286Z [info]  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
2023-01-11T01:03:46.3926199Z [info]  at org.scalatest.Transformer.apply(Transformer.scala:22)
2023-01-11T01:03:46.3927071Z [info]  at org.scalatest.Transformer.apply(Transformer.scala:20)
2023-01-11T01:03:46.3928583Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
2023-01-11T01:03:46.3929225Z [info]  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
2023-01-11T01:03:46.3930091Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
2023-01-11T01:03:46.3933329Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
2023-01-11T01:03:46.3933893Z [info]  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
2023-01-11T01:03:46.3934875Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
2023-01-11T01:03:46.3935479Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
2023-01-11T01:03:46.3936453Z [info]  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3937318Z [info]  at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
2023-01-11T01:03:46.3940707Z [info]  at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
2023-01-11T01:03:46.3941350Z [info]  at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3941962Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
2023-01-11T01:03:46.3943332Z [info]  at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
2023-01-11T01:03:46.3944504Z [info]  at scala.collection.immutable.List.foreach(List.scala:333)
2023-01-11T01:03:46.3950194Z [info]  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
2023-01-11T01:03:46.3950748Z [info]  at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
2023-01-11T01:03:46.3951912Z [info]  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
2023-01-11T01:03:46.3952515Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
2023-01-11T01:03:46.3953476Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
2023-01-11T01:03:46.3954069Z [info]  at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3966445Z [info]  at org.scalatest.Suite.run(Suite.scala:1114)
2023-01-11T01:03:46.3967583Z [info]  at org.scalatest.Suite.run$(Suite.scala:1096)
2023-01-11T01:03:46.3968377Z [info]  at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564)
2023-01-11T01:03:46.3969537Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
2023-01-11T01:03:46.3970510Z [info]  at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
2023-01-11T01:03:46.3971298Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
2023-01-11T01:03:46.3972182Z [info]  at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
2023-01-11T01:03:46.3973529Z [info]  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3974433Z [info]  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
2023-01-11T01:03:46.3977778Z [info]  at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
2023-01-11T01:03:46.3984781Z [info]  at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
2023-01-11T01:03:46.3985521Z [info]  at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:66)
2023-01-11T01:03:46.3986684Z [info]  at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
2023-01-11T01:03:46.3987264Z [info]  at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
2023-01-11T01:03:46.3987774Z [info]  at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
2023-01-11T01:03:46.3988269Z [info]  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
2023-01-11T01:03:46.3989260Z [info]  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
2023-01-11T01:03:46.3996895Z [info]  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
2023-01-11T01:03:46.3997550Z [info]  at java.lang.Thread.run(Thread.java:750)
```

The reason for the failure is that the `CatalogFileIndex` reference printed in the plan includes an object hash code, which differs between the Scala 2.12 and Scala 2.13 runs:

When using Scala 2.12, the `hashCode` of `CatalogFileIndex` is `7d811218`

```
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndex7d811218, [key, val]
```

When using Scala 2.13, the `hashCode` of `CatalogFileIndex` is `cdfa8472`

```
InsertIntoHadoopFsRelationCommand Location [not included in comparison]/{warehouse_dir}/explain_temp5], Append, `spark_catalog`.`default`.`explain_temp5`, org.apache.spark.sql.execution.datasources.CatalogFileIndexcdfa8472, [key, val]

```

So this PR adds a new `replaceAll` action to `SQLQueryTestHelper#replaceNotIncludedMsg` to remove the `hashCode`,
so that `CatalogFileIndex` prints the same results under Scala 2.12 and 2.13.
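
Conceptually, the added `replaceAll` does something like the following (a sketch only; the exact regex in `SQLQueryTestHelper` may differ):

```scala
// Normalise away object hash codes such as "CatalogFileIndex@7d811218" so the
// golden-file comparison gives the same text on Scala 2.12 and 2.13.
def removeObjectHashes(plan: String): String =
  plan.replaceAll("@[0-9a-f]+", "")

// removeObjectHashes("...datasources.CatalogFileIndex@7d811218, [key, val]")
//   == "...datasources.CatalogFileIndex, [key, val]"
```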

### Why are the changes needed?
Make the daily build GA `Master Scala 2.13 + Hadoop 3 + JDK 8` run successfully.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GitHub Actions
- Manual test with this PR:

```
gh pr checkout 39598
dev/change-scala-version.sh 2.13
build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z explain.sql" -Pscala-2.13
build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z explain-aqe.sql" -Pscala-2.13
build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z explain.sql" -Pscala-2.13
build/sbt -Phive-thriftserver "hive-thriftserver/testOnly org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite -- -z explain-aqe.sql" -Pscala-2.13
```

Closes apache#39598 from LuciferYang/SPARK-41708-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
huangxiaopingRD pushed a commit that referenced this pull request Dec 18, 2023
### What changes were proposed in this pull request?

This PR aims to add `Python 3.10` to Infra docker images.

### Why are the changes needed?

This is a preparation to add a daily `Python 3.10` GitHub Action job later for Apache Spark 4.0.0.

Note that Python 3.10 is installed in the last step to avoid the following issue, which happens when we install Python 3.9 and 3.10 in the same stage via the package manager.
```
#21 13.03 ERROR: Cannot uninstall 'blinker'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
#21 ERROR: process "/bin/sh -c python3.9 -m pip install numpy 'pyarrow>=14.0.0' 'pandas<=2.1.3' scipy unittest-xml-reporting plotly>=4.8 'mlflow>=2.3.1' coverage matplotlib openpyxl 'memory-profiler==0.60.0' 'scikit-learn==1.1.*'" did not complete successfully: exit code: 1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

1. I verified that the Python CI is not affected and still uses Python 3.9.5 only.
```
========================================================================
Running PySpark tests
========================================================================
Running PySpark tests. Output is in /__w/spark/spark/python/unit-tests.log
Will test against the following Python executables: ['python3.9']
Will test the following Python modules: ['pyspark-errors']
python3.9 python_implementation is CPython
python3.9 version is: Python 3.9.5
Starting test(python3.9): pyspark.errors.tests.test_errors (temp output: /__w/spark/spark/python/target/fd967f24-3607-4aa6-8190-3f8d7de522e1/python3.9__pyspark.errors.tests.test_errors___zauwgy1.log)
Finished test(python3.9): pyspark.errors.tests.test_errors (0s)
Tests passed in 0 seconds
```

2. The `Base Image Build` step passes for the new Python 3.10.

![Screenshot 2023-11-16 at 10 53 37 AM](https://github.com/apache/spark/assets/9700541/6bbb3461-c5f0-4d60-94f6-7cd8df0594ed)

3. Since the new Python 3.10 is not used in CI, we need to validate it manually like the following.

```
$ docker run -it --rm ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6895105871 python3.10 --version
Python 3.10.13
```

```
$ docker run -it --rm ghcr.io/dongjoon-hyun/apache-spark-ci-image:master-6895105871 python3.10 -m pip freeze
alembic==1.12.1
annotated-types==0.6.0
blinker==1.7.0
certifi==2019.11.28
chardet==3.0.4
charset-normalizer==3.3.2
click==8.1.7
cloudpickle==2.2.1
contourpy==1.2.0
coverage==7.3.2
cycler==0.12.1
databricks-cli==0.18.0
dbus-python==1.2.16
deepspeed==0.12.3
distro-info==0.23+ubuntu1.1
docker==6.1.3
entrypoints==0.4
et-xmlfile==1.1.0
filelock==3.9.0
Flask==3.0.0
fonttools==4.44.3
gitdb==4.0.11
GitPython==3.1.40
googleapis-common-protos==1.56.4
greenlet==3.0.1
grpcio==1.56.2
grpcio-status==1.48.2
gunicorn==21.2.0
hjson==3.1.0
idna==2.8
importlib-metadata==6.8.0
itsdangerous==2.1.2
Jinja2==3.1.2
joblib==1.3.2
kiwisolver==1.4.5
lxml==4.9.3
Mako==1.3.0
Markdown==3.5.1
MarkupSafe==2.1.3
matplotlib==3.8.1
memory-profiler==0.60.0
mlflow==2.8.1
mpmath==1.3.0
networkx==3.0
ninja==1.11.1.1
numpy==1.26.2
oauthlib==3.2.2
openpyxl==3.1.2
packaging==23.2
pandas==2.1.3
Pillow==10.1.0
plotly==5.18.0
protobuf==3.20.3
psutil==5.9.6
py-cpuinfo==9.0.0
pyarrow==14.0.1
pydantic==2.5.1
pydantic_core==2.14.3
PyGObject==3.36.0
PyJWT==2.8.0
pynvml==11.5.0
pyparsing==3.1.1
python-apt==2.0.1+ubuntu0.20.4.1
python-dateutil==2.8.2
pytz==2023.3.post1
PyYAML==6.0.1
querystring-parser==1.2.4
requests==2.31.0
requests-unixsocket==0.2.0
scikit-learn==1.1.3
scipy==1.11.3
six==1.14.0
smmap==5.0.1
SQLAlchemy==2.0.23
sqlparse==0.4.4
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
threadpoolctl==3.2.0
torch==2.0.1+cpu
torcheval==0.0.7
torchvision==0.15.2+cpu
tqdm==4.66.1
typing_extensions==4.8.0
tzdata==2023.3
unattended-upgrades==0.1
unittest-xml-reporting==3.2.0
urllib3==2.1.0
websocket-client==1.6.4
Werkzeug==3.0.1
zipp==3.17.0
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#43840 from dongjoon-hyun/SPARK-45953.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
pull bot pushed a commit that referenced this pull request Jul 3, 2025
…pressions in `buildAggExprList`

### What changes were proposed in this pull request?

Trim aliases before matching Sort/Having/Filter expressions with semantically equal expressions from the Aggregate below in `buildAggExprList`.

### Why are the changes needed?
For a query like:
```
SELECT course, year, GROUPING(course) FROM courseSales GROUP BY CUBE(course, year) ORDER BY GROUPING(course)
```

Plan after `ResolveReferences` and before `ResolveAggregateFunctions` looks like:

```
!Sort [cast((shiftright(tempresolvedcolumn(spark_grouping_id#18L, spark_grouping_id, false), 1) & 1) as tinyint) AS grouping(course)#22 ASC NULLS FIRST], true
 +- Aggregate [course#19, year#20, spark_grouping_id#18L], [course#19, year#20, cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint) AS grouping(course)#21 AS grouping(course)#15]
....
```
Because the aggregate list contains an `Alias(Alias(cast((shiftright(spark_grouping_id#18L, 1) & 1) as tinyint)))` entry, the expression coming from `SortOrder` won't be matched as semantically equal, and this results in adding an unnecessary `Project`. By stripping inner aliases from the aggregate list (they are going to be removed in `CleanupAliases` anyway) we can match the `SortOrder` expression and resolve it as `grouping(course)#15`.
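
To illustrate the effect with a toy expression tree (illustrative only, not Catalyst code):

```scala
// Toy expression ADT, just to show why nested aliases break structural matching.
sealed trait Expr
case class Col(name: String) extends Expr
case class Cast(child: Expr) extends Expr
case class Alias(child: Expr, name: String) extends Expr

// Strip alias wrappers anywhere in the tree so that two expressions that differ
// only by intermediate aliases compare as equal (Catalyst's semanticEquals plays
// the role of plain case-class equality here).
def trimAliases(e: Expr): Expr = e match {
  case Alias(child, _) => trimAliases(child)
  case Cast(child)     => Cast(trimAliases(child))
  case other           => other
}

val fromAggregateList = Alias(Alias(Cast(Col("spark_grouping_id")), "grouping(course)"), "grouping(course)")
val fromSortOrder     = Cast(Col("spark_grouping_id"))

// Without trimming they differ; after trimming both reduce to Cast(Col(...)).
assert(trimAliases(fromAggregateList) == trimAliases(fromSortOrder))
```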

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#51339 from mihailotim-db/mihailotim-db/fix_inner_aliases_semi_structured.

Authored-by: Mihailo Timotic <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>