[SPARK-21354] [SQL] INPUT FILE related functions do not support more than one sources #18580
gatorsmile wants to merge 4 commits into apache:master
Conversation
Test build #79426 has finished for PR 18580 at commit
```scala
case e @ (_: InputFileName | _: InputFileBlockLength | _: InputFileBlockStart)
    if getNumLeafNodes(operator) > 1 =>
  e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
```
nit: one sources -> one source.
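To see the shape of this check, here is a self-contained sketch with a toy plan tree. The `Plan`, `Leaf`, `Join`, and `Project` types below are hypothetical stand-ins for Spark's `LogicalPlan` hierarchy, not the real classes; `numLeafNodes` mirrors the idea of `getNumLeafNodes` in the hunk above:

```scala
// Hypothetical toy plan nodes, standing in for Spark's LogicalPlan classes.
sealed trait Plan { def children: Seq[Plan] }
case class Leaf(name: String) extends Plan { def children: Seq[Plan] = Nil }
case class Join(left: Plan, right: Plan) extends Plan { def children: Seq[Plan] = Seq(left, right) }
case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }

// Count data-source leaves in the tree, like getNumLeafNodes in the diff.
def numLeafNodes(p: Plan): Int = p match {
  case _: Leaf => 1
  case other   => other.children.map(numLeafNodes).sum
}

// The analysis rule fails when an input-file function sees more than one source.
def checkInputFileFunc(funcName: String, plan: Plan): Unit =
  require(numLeafNodes(plan) <= 1,
    s"'$funcName' does not support more than one source")

checkInputFileFunc("input_file_name", Project(Leaf("t1")))  // single source: passes
```

A `Join(Leaf("t1"), Leaf("t2"))` has two leaves, so the same call would throw, which is exactly the situation the new rule rejects.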
```scala
val e = intercept[AnalysisException] {
  sql(s"SELECT *, $func() FROM tab1 JOIN tab2 ON tab1.id = tab2.id")
}.getMessage
assert(e.contains(s"'$func' does not support more than one sources"))
```
@dongjoon-hyun What is the output of Hive for your case?
The following is the output of the current Spark:

scala> spark.range(10).write.saveAsTable("t1")
scala> spark.range(100,110).write.saveAsTable("t2")
scala> sql("select *, input_file_name() from t1").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|3 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
scala> sql("select *, input_file_name() from t2").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
scala> sql("select *, input_file_name() from ((select * from t1) union all (select * from t2))").show(false)
+---+-------------------------------------------------------------------------------------------------------------------+
|id |input_file_name() |
+---+-------------------------------------------------------------------------------------------------------------------+
|3 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|4 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00003-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|8 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|9 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00007-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|0 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00000-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|1 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00001-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|2 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00002-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|5 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00004-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|6 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00005-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|7 |file:///Users/dongjoon/spark/spark-warehouse/t1/part-00006-b0ca8fa4-03ae-4e3a-b4b4-a13d601cd155-c000.snappy.parquet|
|103|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|104|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00003-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|108|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|109|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00007-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|100|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00000-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|101|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00001-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|102|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00002-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|105|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00004-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|106|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00005-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
|107|file:///Users/dongjoon/spark/spark-warehouse/t2/part-00006-76ea547d-0187-40f0-b5dd-f9f1fffeeabf-c000.snappy.parquet|
+---+-------------------------------------------------------------------------------------------------------------------+
In case of
Does Hive report an error?
Let me check that. BTW, I think Spark is better than Hive. :)
This is the error and the difference. [Screenshots: HIVE vs. SPARK output]
For union, does Hive output an error?
It was the same error,
cc @cloud-fan
```scala
private def getNumInputFileBlockSources(operator: LogicalPlan): Int = {
  operator match {
    case _: LeafNode => 1
```
Shall we only consider file data source leaf nodes?
Unable to check it in CheckAnalysis. Both HadoopRDD and FileScanRDD have the same issue. To block both, we need to add the check as another rule.
Test build #79496 has finished for PR 18580 at commit
Retest this please.
Test build #79576 has finished for PR 18580 at commit
```scala
case o =>
  val numInputFileBlockSources = o.children.map(checkNumInputFileBlockSources(e, _)).sum
  if (numInputFileBlockSources > 1) {
    e.failAnalysis(s"'${e.prettyName}' does not support more than one sources")
```
We need to check this as early as possible; otherwise, Union might eat it.
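The Union concern can be illustrated with another toy tree: because non-leaf nodes sum over their children, a UNION ALL of two tables counts as two sources, but only if the check runs before any rewrite collapses the plan. The `Node`, `Source`, and `UnionAll` types here are hypothetical sketches, not Spark's classes:

```scala
// Hypothetical toy plan nodes, not Spark's real LogicalPlan classes.
sealed trait Node { def children: Seq[Node] }
case class Source(table: String) extends Node { def children: Seq[Node] = Nil }
case class UnionAll(inputs: Seq[Node]) extends Node { def children: Seq[Node] = inputs }

// Mirrors the recursion in checkNumInputFileBlockSources above:
// leaves count as one source, other nodes sum their children.
def countSources(n: Node): Int = n match {
  case _: Source => 1
  case other     => other.children.map(countSources).sum
}

// UNION ALL of t1 and t2 reads from two sources, so the rule
// rejects input_file_name here just as it does for a join.
countSources(UnionAll(Seq(Source("t1"), Source("t2"))))  // → 2
```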
LGTM, pending jenkins
Test build #79651 has finished for PR 18580 at commit
thanks, merging to master!
What changes were proposed in this pull request?
The built-in functions `input_file_name`, `input_file_block_start`, and `input_file_block_length` do not support more than one source, matching Hive's behavior. Currently, Spark does not block such queries, and the outputs are ambiguous/non-deterministic: the value could come from either side. This PR blocks them and issues an error.
How was this patch tested?
Added a test case.