[SPARK-6352] [SQL] Add DirectParquetOutputCommitter by ypcat · Pull Request #5042 · apache/spark

ypcat · 2015-03-16T09:13:43Z

Add a DirectParquetOutputCommitter class that skips _temporary directory when saving to s3. To enable it, set hadoop configuration "spark.sql.parquet.output.committer.class" to "org.apache.spark.sql.parquet.DirectParquetOutputCommitter".

ash211 · 2015-03-16T16:15:16Z

Jenkins this is ok to test

SparkQA · 2015-03-16T16:41:26Z

Test build #28650 has finished for PR 5042 at commit 769bd67.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)

shaneknapp · 2015-03-16T16:45:02Z

jenkins, test this please

SparkQA · 2015-03-16T16:48:39Z

Test build #28653 has finished for PR 5042 at commit 769bd67.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DirectParquetOutputCommitter(outputPath: Path, context: TaskAttemptContext)

SparkQA · 2015-03-17T07:33:14Z

Test build #28709 has finished for PR 5042 at commit c42468c.

This patch fails RAT tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-03-17T09:18:03Z

Test build #28712 has finished for PR 5042 at commit 0d540b9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Add a new configuration key: spark.sql.parquet.output.committer.class which should be a sub-class of ParquetOutputCommitter

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala

ypcat · 2015-03-23T10:45:59Z

I change the design to allow more general usage. User can set "spark.sql.parquet.output.committer.class" to a class extending ParquetOutputFormat.
I still include the implementation of DirectParquetOutputCommitter, if it is unwanted, we can move it to be only used in test case.

SparkQA · 2015-03-23T12:03:44Z

Test build #28993 has finished for PR 5042 at commit e17bf47.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-03-23T12:27:30Z

Test build #28992 has finished for PR 5042 at commit 9ae7545.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

liancheng · 2015-04-11T15:37:58Z

Hey @ypcat, sorry for the late review. If I understand this correctly, with DirectParquetOutputCommitter, all tasks writes to the final output directory directly. Then what if speculative tasks are scheduled? This may end up with two tasks trying to write to the same file simultaneously, right?

liancheng · 2015-04-11T16:21:44Z

At least we should state explicitly that speculative tasks should be disabled when using the direct output committer.

JoshRosen · 2015-04-11T18:59:55Z

@liancheng, if this is intended only for writing directly to S3, then I think we purposely want to avoid writing to _temporary because the move / rename of the temporary file to the final file is expensive and because S3 PUT operations are atomic. If you have two tasks racing to write to the same file, then the last writer should win and we shouldn't have to worry about partially-written files. As with DirectOutputCommitter, this is not safe for use on filesystems where file creation / overwrite isn't atomic, such as HDFS, but that's not a use-case that we have to worry about here (I think this is only for use by advanced users who know that they're writing directly to S3).

ypcat · 2015-04-13T03:21:20Z

@liancheng this DirectParquetOutputCommitter is based on DirectOutputCommitter in this thread and was intended to use on s3. I was not aware of the problem of speculation. I think we can either:

exclude DirectParquetOutputCommitter and let user implement their own like DirectOutputCommitter.
write down the consequences in documentation as @JoshRosen described.

liancheng · 2015-04-13T13:29:31Z

sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala

Should we check for mapreduce.output.fileoutputformat.outputdir instead?

Oh I see, this is for Hadoop 1 compatibility.

liancheng · 2015-04-13T13:42:41Z

@JoshRosen @ypcat Thanks for the detailed explanation, I didn't realize S3 PUT operations are atomic at first. This LGTM. Merging to master.

@ypcat Thanks again for working on this!

JoshRosen · 2015-04-13T17:36:05Z

We can't change this now, but note that the commit message / PR description here is slightly out of date w.r.t. the actual changes being committed, since we now use the spark.sql.parquet.output.committer.class configuration instead of spark.sql.parquet.useDirectParquetOutputCommitter.

We may want to document this configuration at some point; if we do, I think the right place is the Parquet Configuration section in the SQL programming guide: https://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

JoshRosen · 2015-04-13T17:42:27Z

Actually, on giving this a closer look I'm not sure whether this faithfully respects all of the Parquet configurations for controlling OutputCommitter behavior. For instance, it looks like there's a ParquetOutputFormat.ENABLE_JOB_SUMMARY configuration that can be used to suppress the writing of the summary file (https://github.com/Parquet/parquet-mr/blame/fa8957d7939b59e8d391fa17000b34e865de015d/parquet-hadoop/src/main/java/parquet/hadoop/ParquetOutputCommitter.java#L48). I don't really know much about Parquet, so I can't really speak to whether this is a configuration that would be commonly used, but we should probably try to support it.

ypcat · 2015-04-14T06:57:10Z

@JoshRosen I updated PR description. It looks like the parquet code base I used was too old that it did not include the ParquetOutputFormat.ENABLE_JOB_SUMMARY thing. I will add support to this.

JoshRosen · 2015-04-14T21:07:11Z

It looks like this patch broke the Hadoop 1.x build because the test code uses Configuration.unset, which isn't present in Hadoop 1.x:

[info] Compiling 1 Scala source to /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/examples/target/scala-2.10/classes...
[error] /home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop1.0/label/centos/sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala:402: value unset is not a member of org.apache.hadoop.conf.Configuration
[error]       configuration.unset("spark.sql.parquet.output.committer.class")
[error]                     ^
[error] one error found

To unbreak the build, I'm going to revert this patch and re-open the JIRA. Let's open a new PR to fix this. At the same time, we can also address the ENABLE_JOB_SUMMARY flag.

ypcat · 2015-04-15T06:14:32Z

I cannot find a way to unset a config value in hadoop 1.x API. The closest thing is to set it to a default value, which I think should be fine in test code.
And I found I cannot add more commits to this PR since it is closed. Should we reopen it or use a new PR?

ash211 · 2015-04-15T06:59:35Z

I think it's best to open a new PR

ypcat · 2015-04-15T10:33:27Z

New PR is #5525

Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3. Fix compilation error introduced by #5042. Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag. Author: Pei-Lun Lee <pllee@appier.com> Closes #5525 from ypcat/spark-6352 and squashes the following commits: 54c6b15 [Pei-Lun Lee] error handling 472870e [Pei-Lun Lee] add back custom parquet output committer ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x 8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer. 0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case 0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter 769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter

Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3. Fix compilation error introduced by apache#5042. Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag. Author: Pei-Lun Lee <pllee@appier.com> Closes apache#5525 from ypcat/spark-6352 and squashes the following commits: 54c6b15 [Pei-Lun Lee] error handling 472870e [Pei-Lun Lee] add back custom parquet output committer ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x 8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer. 0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case 0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter 769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter

Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3. Fix compilation error introduced by apache#5042. Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag. Author: Pei-Lun Lee <pllee@appier.com> Closes apache#5525 from ypcat/spark-6352 and squashes the following commits: 54c6b15 [Pei-Lun Lee] error handling 472870e [Pei-Lun Lee] add back custom parquet output committer ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x 8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer. 0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case 0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter 769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter Conflicts: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala

Add new config "spark.sql.parquet.output.committer.class" to allow custom parquet output committer and an output committer class specific to use on s3. Fix compilation error introduced by apache#5042. Respect ParquetOutputFormat.ENABLE_JOB_SUMMARY flag. Author: Pei-Lun Lee <pllee@appier.com> Closes apache#5525 from ypcat/spark-6352 and squashes the following commits: 54c6b15 [Pei-Lun Lee] error handling 472870e [Pei-Lun Lee] add back custom parquet output committer ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x 8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer. 0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case 0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter 769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter

Pei-Lun Lee added 2 commits March 16, 2015 17:03

DirectParquetOutputCommitter

f75e261

DirectParquetOutputCommitter

769bd67

Pei-Lun Lee added 2 commits March 17, 2015 15:27

[SPARK-6532] [SQL] hide class DirectParquetOutputCommitter

0fc03ca

[SPARK-6352] [SQL] add test case

c42468c

[SPARK-6352] [SQL] add license

0d540b9

Pei-Lun Lee added 2 commits March 23, 2015 18:32

[SPARL-6352] [SQL] Change to allow custom parquet output committer.

9ae7545

Add a new configuration key: spark.sql.parquet.output.committer.class which should be a sub-class of ParquetOutputCommitter

Merge branch 'master' of https://github.com/apache/spark into spark-6352

e17bf47

Conflicts: sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetIOSuite.scala

liancheng reviewed Apr 13, 2015
View reviewed changes

asfgit closed this in b29663e Apr 13, 2015

ypcat mentioned this pull request Apr 15, 2015

[SPARK-6352] [SQL] Custom parquet output committer #5525

Closed

mccheah mentioned this pull request May 14, 2015

[SPARK-6352] [SQL] Custom parquet output committer palantir/spark#10

Closed

Conversation

ypcat commented Mar 16, 2015

Uh oh!

ash211 commented Mar 16, 2015

Uh oh!

SparkQA commented Mar 16, 2015

Uh oh!

shaneknapp commented Mar 16, 2015

Uh oh!

SparkQA commented Mar 16, 2015

Uh oh!

SparkQA commented Mar 17, 2015

Uh oh!

SparkQA commented Mar 17, 2015

Uh oh!

ypcat commented Mar 23, 2015

Uh oh!

SparkQA commented Mar 23, 2015

Uh oh!

SparkQA commented Mar 23, 2015

Uh oh!

liancheng commented Apr 11, 2015

Uh oh!

liancheng commented Apr 11, 2015

Uh oh!

JoshRosen commented Apr 11, 2015

Uh oh!

ypcat commented Apr 13, 2015

Uh oh!

liancheng Apr 13, 2015

Choose a reason for hiding this comment

Uh oh!

liancheng Apr 13, 2015

Choose a reason for hiding this comment

Uh oh!

liancheng commented Apr 13, 2015

Uh oh!

JoshRosen commented Apr 13, 2015

Uh oh!

JoshRosen commented Apr 13, 2015

Uh oh!

ypcat commented Apr 14, 2015

Uh oh!

JoshRosen commented Apr 14, 2015

Uh oh!

ypcat commented Apr 15, 2015

Uh oh!

ash211 commented Apr 15, 2015

Uh oh!

ypcat commented Apr 15, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants