[SPARK-3091] [SQL] Add support for caching metadata on Parquet files #2005

mateiz · 2014-08-17T23:53:13Z

For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.
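To illustrate the per-instance versus cross-instance distinction, here is a minimal Python sketch (not the Scala code in this PR). `FooterReader`, `slow_load`, and the path are hypothetical names invented for the example. The point is only that a class-level cache lets a second instance reuse what the first one already read.

```python
class FooterReader:
    """Hypothetical stand-in for an input format that reads file footers."""

    # Class-level cache: shared by every instance, unlike a per-instance dict.
    _shared_footers = {}

    def __init__(self, load_footer):
        # load_footer is the expensive call (for example, a remote read).
        self._load_footer = load_footer

    def get_footer(self, path):
        cache = FooterReader._shared_footers
        if path not in cache:
            cache[path] = self._load_footer(path)
        return cache[path]


calls = []  # records every expensive load that actually happens


def slow_load(path):
    calls.append(path)
    return {"path": path, "schema": "..."}


first = FooterReader(slow_load)
second = FooterReader(slow_load)
first.get_footer("/data/part-0.parquet")   # miss: performs the load
second.get_footer("/data/part-0.parquet")  # hit: served from the shared cache
```

With a per-instance dictionary, `second` would have repeated the load. This sketch never evicts anything, so it would need a size bound or expiry in real use.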

Note: this PR leaves the cache turned off by default for 1.1, but I believe it's safe to turn it on afterwards. The keys in the hash maps are FileStatus objects that include a modification time, so a modified file gets a new key and never hits a stale entry. The location cache could become invalid if files have moved within HDFS, but that's rare, so I just made it invalidate entries every 15 minutes.
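The two invalidation rules described above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Guava-based Scala code from the PR: `ExpiringCache`, `lookup`, the fake clock, and the `(path, modification_time)` tuple standing in for a FileStatus key are all hypothetical.

```python
import time


class ExpiringCache:
    """Tiny time-expiring cache; entries older than ttl_seconds are reloaded."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries = {}       # key -> (value, time stored)

    def get(self, key, loader):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        value = loader(key)
        self._entries[key] = (value, now)
        return value


fake_now = [0.0]  # controllable clock, in seconds
locations = ExpiringCache(ttl_seconds=15 * 60, clock=lambda: fake_now[0])
loads = []


def lookup(key):
    loads.append(key)            # count the expensive lookups
    return ["host-a", "host-b"]  # made-up block locations


locations.get(("/data/part-0.parquet", 1000), lookup)  # miss
locations.get(("/data/part-0.parquet", 1000), lookup)  # hit
locations.get(("/data/part-0.parquet", 2000), lookup)  # miss: file was modified
fake_now[0] += 15 * 60
locations.get(("/data/part-0.parquet", 2000), lookup)  # miss: entry expired
```

Because the modification time is part of the key, a rewritten file can never be served from an old entry. The timeout only covers the case the key cannot detect, such as blocks moving. Unlike a real cache library, this sketch does not proactively evict old entries.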

SparkQA · 2014-08-17T23:55:37Z

QA tests have started for PR 2005 at commit 22072b0.

This patch merges cleanly.

SparkQA · 2014-08-18T00:16:00Z

QA tests have started for PR 2005 at commit c71e9ed.

This patch merges cleanly.

SparkQA · 2014-08-18T00:25:37Z

QA tests have started for PR 2005 at commit dae8efe.

This patch merges cleanly.

SparkQA · 2014-08-18T01:35:10Z

QA tests have finished for PR 2005 at commit dae8efe.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

marmbrus · 2014-08-18T17:59:54Z

Only the Thrift server tests failed. I'm going to go ahead and merge. Thanks Matei!

For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.

Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.

Author: Matei Zaharia

Closes #2005 from mateiz/parquet-cache and squashes the following commits:

dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers

(cherry picked from commit 9eb74c7)
Signed-off-by: Michael Armbrust


mateiz added 4 commits August 17, 2014 14:33

4094df6 First attempt at caching Parquet footers
453bd21 Bug fix
8fb56ce Cache file block locations too
22072b0 Use Guava caches and add a config option for caching metadata
c71e9ed Handle empty statuses directly
dae8efe Bug fix

asfgit closed this in 9eb74c7 Aug 18, 2014
