
[SPARK-2119][SQL] Improved Parquet performance when reading off S3 #1370

Closed
liancheng wants to merge 4 commits into apache:master from liancheng:faster-parquet

Conversation

liancheng
Contributor

JIRA issue: [SPARK-2119](https://issues.apache.org/jira/browse/SPARK-2119)

Essentially, this PR fixes three issues to gain much better performance when reading large Parquet files off S3.

1. When reading the schema, fetch Parquet metadata from a part-file rather than the `_metadata` file (see the sketch after this list).

   The `_metadata` file contains the metadata of all row groups and can be very large if there are many row groups. Since schema information and row group metadata are coupled within a single Thrift object, we would have to read the whole `_metadata` file just to fetch the schema. On the other hand, the schema is replicated in the footer of every part-file, and footers are fairly small.

2. Add only the root directory of the Parquet file, rather than all the part-files, to the input paths.

   The HDFS API automatically filters out hidden files and underscore files (`_SUCCESS` & `_metadata`), so there is no need to filter out the part-files and add them individually to the input paths. What makes it much worse is that `FileInputFormat.listStatus()` calls `FileSystem.globStatus()` on each individual input path sequentially, and each call results in a blocking remote S3 HTTP request.

3. Worked around PARQUET-16.

   PARQUET-16 is essentially similar to the issue above: it results in lots of sequential `FileSystem.getFileStatus()` calls, which are in turn translated into a bunch of remote S3 HTTP requests.

   `FilteringParquetRowInputFormat` should be cleaned up once PARQUET-16 is fixed.
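To make the first two points concrete, below is a minimal Scala sketch (not the PR's actual implementation) that fetches the Parquet schema from the footer of a single part-file and registers only the root directory as an input path. It assumes the pre-Apache `parquet.hadoop` package of parquet-mr and the Hadoop `FileSystem`/`FileInputFormat` APIs used at the time; `findSinglePartFile`, `readSchemaFromPartFile`, and `configureInputPaths` are made-up names for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import parquet.hadoop.ParquetFileReader
import parquet.schema.MessageType

// Hypothetical helper: pick any visible part-file under the Parquet root directory
// with a single remote listing call.
def findSinglePartFile(fs: FileSystem, root: Path): Path =
  fs.listStatus(root)
    .map(_.getPath)
    .find { p =>
      val name = p.getName
      !name.startsWith("_") && !name.startsWith(".") // skip _metadata, _SUCCESS, hidden files
    }
    .getOrElse(sys.error(s"No part-file found under $root"))

// Point 1: read the footer of one small part-file instead of the potentially huge
// _metadata file; the schema is replicated in every part-file footer.
def readSchemaFromPartFile(conf: Configuration, root: Path): MessageType = {
  val fs = root.getFileSystem(conf)
  val footer = ParquetFileReader.readFooter(conf, findSinglePartFile(fs, root))
  footer.getFileMetaData.getSchema
}

// Point 2: register only the root directory; FileInputFormat expands it in one
// listing and skips hidden/underscore files, instead of issuing one globStatus()
// call (one blocking S3 HTTP request) per part-file.
def configureInputPaths(job: Job, root: Path): Unit =
  FileInputFormat.setInputPaths(job, root)
```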

Below are the micro-benchmark results. The dataset used is an S3 Parquet file consisting of 3,793 partitions, about 110MB per partition on average. The benchmark was done on a 9-node AWS cluster.

- Creating a Parquet `SchemaRDD` (the Parquet schema is fetched)

      val tweets = parquetFile(uri)

  - Before: 17.80s
  - After: 8.61s

- Fetching partition information

      tweets.getPartitions

  - Before: 700.87s
  - After: 21.47s

- Counting the whole file (both steps above are executed together)

      parquetFile(uri).count()

  - Before: ??? (not tested yet)
  - After: 53.26s

@SparkQA

SparkQA commented Jul 11, 2014

QA tests have started for PR 1370. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull

@SparkQA

SparkQA commented Jul 11, 2014

QA results for PR 1370:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16555/consoleFull


// NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
// groups. Since Parquet schema is replicated among all row groups, we only need to touch a
Contributor
Are we making a new assumption here that all of the data has the same schema? I know we don't promise support for that now, but it would be nice to do in the future.

Contributor (Author)
Yes, we are making this assumption; I will add a comment here. (Also, checking schema consistency could be quite inefficient for a large Parquet file with lots of row groups.)
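For context only (this PR does not do it), a naive consistency check would have to open every part-file footer and compare schemas, which is exactly the kind of per-file remote access the PR tries to avoid. A minimal sketch of that idea, assuming the `parquet.hadoop` API and a hypothetical `schemasAreConsistent` helper:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader

// Hypothetical sketch: each readFooter() call below is a separate remote request,
// so on S3 this scales linearly with the number of part-files.
def schemasAreConsistent(conf: Configuration, partFiles: Seq[Path]): Boolean = {
  val schemas = partFiles.map { file =>
    ParquetFileReader.readFooter(conf, file).getFileMetaData.getSchema
  }
  schemas.distinct.size <= 1
}
```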

@SparkQA

SparkQA commented Jul 16, 2014

QA tests have started for PR 1370. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16704/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 16, 2014

QA results for PR 1370:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16704/consoleFull

@marmbrus
Contributor

Thanks! I've merged this into master.

@asfgit asfgit closed this in efc452a Jul 16, 2014
asfgit pushed a commit that referenced this pull request Aug 27, 2014
[SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter

```if (!fs.getFileStatus(path).isDir) throw Exception``` makes no sense after this commit #1370.

Be careful if someone is working on SPARK-2551: make sure the new change passes the test case ```test("Read a parquet file instead of a directory")```.

Author: chutium <[email protected]>

Closes #2044 from chutium/parquet-singlefile and squashes the following commits:

4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter

(cherry picked from commit 48f4278)
Signed-off-by: Michael Armbrust <[email protected]>
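As a rough illustration of the behavior change SPARK-3138 asks for (a hedged sketch, not the actual Spark code), path resolution could accept either a single Parquet file or a directory of part-files instead of rejecting non-directories; `readFooterForInput` is a made-up name for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader
import parquet.hadoop.metadata.ParquetMetadata

// Hypothetical sketch: instead of "if (!fs.getFileStatus(path).isDir) throw ...",
// accept a single Parquet file as well as a directory of part-files.
def readFooterForInput(conf: Configuration, pathString: String): ParquetMetadata = {
  val path = new Path(pathString)
  val fs = path.getFileSystem(conf)
  val status = fs.getFileStatus(path)
  val footerPath =
    if (status.isDirectory) {
      // Directory of part-files: read the footer of any visible part-file (see PR #1370).
      fs.listStatus(path)
        .map(_.getPath)
        .find(p => !p.getName.startsWith("_") && !p.getName.startsWith("."))
        .getOrElse(sys.error(s"No part-file found under $path"))
    } else {
      path // a single Parquet file given directly
    }
  ParquetFileReader.readFooter(conf, footerPath)
}
```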
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014

[SPARK-2119][SQL] Improved Parquet performance when reading off S3

Author: Cheng Lian <[email protected]>

Closes apache#1370 from liancheng/faster-parquet and squashes the following commits:

94a2821 [Cheng Lian] Added comments about schema consistency
d2c4417 [Cheng Lian] Worked around PARQUET-16 to improve Parquet performance
1c0d1b9 [Cheng Lian] Accelerated Parquet schema retrieving
5bd3d29 [Cheng Lian] Fixed Parquet log level
liancheng deleted the faster-parquet branch September 24, 2014 00:09