
Conversation

@windpiger (Contributor) commented Mar 6, 2017

What changes were proposed in this pull request?

In SPARK-5068, we introduced a SQLConf, spark.sql.hive.verifyPartitionPath. If it is set to true, it avoids failing the task when a partition location does not exist in the filesystem.

This situation should always return an empty result rather than fail the task, so here we remove this conf.

Additionally, the function verifyPartitionPath has a bug: if the partition path is a custom path,

it will still filter all partition paths in the parameter partitionToDeserializer,
and it will scan paths which do not belong to the table. E.g., if the custom path is /root/a
and the partitionSpec is b=1/c=2, this will lead to scanning / because of getPathPatternByPath.
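
The failure mode can be illustrated with a simplified, hypothetical sketch of the pattern logic (this is not the actual getPathPatternByPath implementation; the helper name and the component-stripping logic are assumptions for illustration only):

```scala
// Hypothetical simplification of the pattern logic: replace the last
// `parNum` path components with "*" wildcards.
def pathPattern(parNum: Int, partitionPath: String): String = {
  var p = partitionPath
  for (_ <- 1 to parNum) {
    // strip one trailing path component per partition column
    p = p.substring(0, p.lastIndexOf('/'))
  }
  p + "/*" * parNum
}

// Normal layout: the pattern stays under the table root.
pathPattern(2, "/warehouse/tbl/b=1/c=2")  // "/warehouse/tbl/*/*"

// Custom location /root/a with partitionSpec b=1/c=2: stripping two
// components climbs above the custom path, so the pattern globs from "/".
pathPattern(2, "/root/a")  // "/*/*"
```

With a two-level partitionSpec but a one-level custom location, the wildcard pattern escapes the table's own directory and matches the entire filesystem root.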

How was this patch tested?

Modified an existing test case.

…eturn empty when the location does not exists
SparkQA commented Mar 6, 2017

Test build #73991 has finished for PR 17176 at commit 95aa931.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger force-pushed the removeHiveVerfiyPath branch from 4bb0e28 to 8128567 on March 6, 2017 12:05
}
// convert /demo/data/year/month/day to /demo/data/*/*/*/
def getPathPatternByPath(parNum: Int, tempPath: Path, partitionName: String): String = {
// if the partition path does not end with partition name, we should not
@windpiger (Contributor Author) commented:

If the partition location has been altered to another location, we should not apply this pattern, or we will list files matching the pattern which do not belong to the partition.

SparkQA commented Mar 6, 2017

Test build #73998 has finished for PR 17176 at commit 8128567.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 6, 2017

Test build #73992 has finished for PR 17176 at commit 4bb0e28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author) commented:

retest this please

SparkQA commented Mar 6, 2017

Test build #74016 has finished for PR 17176 at commit 8128567.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger (Contributor Author) commented:

Why did Jenkins fail?

SparkQA commented Mar 7, 2017

Test build #74072 has finished for PR 17176 at commit 22b1f53.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

def verifyPartitionPath(
    partitionToDeserializer: Map[HivePartition, Class[_ <: Deserializer]]):
    Map[HivePartition, Class[_ <: Deserializer]] = {
  if (!sparkSession.sessionState.conf.verifyPartitionPath) {
@windpiger (Contributor Author) commented:

After PR https://github.com/apache/spark/pull/17187, reading a Hive table that does not use `stored by` will no longer use HiveTableScanExec.

This function has a bug: if the partition path is a custom path,

  1. it will still filter all partition paths in the parameter partitionToDeserializer;
  2. it will scan paths which do not belong to the table. E.g., if the custom path is /root/a
    and the partitionSpec is b=1/c=2, this will lead to scanning / because of getPathPatternByPath.

SparkQA commented Mar 7, 2017

Test build #74106 has finished for PR 17176 at commit 262e2f2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 7, 2017

Test build #74107 has finished for PR 17176 at commit 3a15e5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member) commented:

@windpiger If you do not have the bandwidth to continue, how about closing it now?

case (partition, partDeserializer) =>
  val partPath = partition.getDataLocation
  val fs = partPath.getFileSystem(hadoopConf)
  fs.exists(partPath)

Review comment:

Sending an RPC request to the NameNode for each partition can result in poor performance.
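
One possible mitigation, sketched under assumptions (`existingPaths` is a hypothetical helper, and `java.nio.file` stands in for the Hadoop FileSystem API): group partition paths by parent directory and issue one listing call per parent, instead of one exists() call per partition.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical sketch, not the PR's code: answer per-partition existence
// checks from one directory listing per parent, reducing the number of
// round trips to the (simulated) filesystem.
def existingPaths(partitionPaths: Seq[Path]): Set[Path] = {
  partitionPaths
    .groupBy(_.getParent)
    .flatMap { case (parent, children) =>
      if (parent == null || !Files.isDirectory(parent)) Nil
      else {
        // one listing call per parent directory instead of one exists() per child
        val stream = Files.list(parent)
        val listed = try stream.iterator().asScala.toSet finally stream.close()
        children.filter(listed.contains)
      }
    }
    .toSet
}
```

For a table whose partitions share a handful of parent directories, this turns O(partitions) existence RPCs into O(parents) listing RPCs.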

@asfgit closed this in a3ba3a8 on Nov 11, 2018
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Closes apache#21766
Closes apache#21679
Closes apache#21161
Closes apache#20846
Closes apache#19434
Closes apache#18080
Closes apache#17648
Closes apache#17169

Add:
Closes apache#22813
Closes apache#21994
Closes apache#22005
Closes apache#22463

Add:
Closes apache#15899

Add:
Closes apache#22539
Closes apache#21868
Closes apache#21514
Closes apache#21402
Closes apache#21322
Closes apache#21257
Closes apache#20163
Closes apache#19691
Closes apache#18697
Closes apache#18636
Closes apache#17176

Closes apache#23001 from wangyum/CloseStalePRs.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>