PARQUET-16: Avoid calling getFileStatus() on all part-files #17

liancheng · 2014-07-11T02:22:07Z

This PR improves the performance of ParquetInputFormat.getSplits(JobContext), especially when reading large files from S3.

When calling getSplits(JobContext), all the FileStatus objects are already fetched via listStatus, but abandoned immediately. PR #2 added an LRU cache for Footer objects, and fortunately corresponding FileStatus objects are cached together. In this PR, two private methods are added to leverage the LRU cache and prevent retrieving FileStatus objects sequentially when calling getSplits(JobContext).
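To illustrate the difference in round trips, here is a minimal, self-contained sketch. It does not use the Hadoop or Parquet classes: StatusSketch, CountingFileSystem and SplitPlanningSketch are hypothetical stand-ins, and the fake file system only counts simulated remote calls.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a file status; NOT org.apache.hadoop.fs.FileStatus.
final class StatusSketch {
    final String path;
    final long length;

    StatusSketch(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical fake file system that counts simulated remote round trips.
final class CountingFileSystem {
    private final Map<String, Long> files = new LinkedHashMap<>();
    int remoteCalls = 0;

    void addFile(String path, long length) {
        files.put(path, length);
    }

    // One round trip returns the status of every file (like a directory listing).
    List<StatusSketch> listStatus() {
        remoteCalls++;
        List<StatusSketch> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : files.entrySet()) {
            result.add(new StatusSketch(e.getKey(), e.getValue()));
        }
        return result;
    }

    // One round trip per file (like a per-file status lookup).
    StatusSketch getFileStatus(String path) {
        remoteCalls++;
        return new StatusSketch(path, files.get(path));
    }
}

final class SplitPlanningSketch {
    // Before: list once, drop the statuses, then look every file up again.
    static long totalLengthRefetching(CountingFileSystem fs) {
        long total = 0;
        for (StatusSketch listed : fs.listStatus()) {
            total += fs.getFileStatus(listed.path).length;
        }
        return total;
    }

    // After: list once and reuse the statuses the listing already returned.
    static long totalLengthReusing(CountingFileSystem fs) {
        long total = 0;
        for (StatusSketch listed : fs.listStatus()) {
            total += listed.length;
        }
        return total;
    }

    public static void main(String[] args) {
        CountingFileSystem fs = new CountingFileSystem();
        for (int i = 0; i < 3; i++) {
            fs.addFile("part-" + i, 100L);
        }
        totalLengthRefetching(fs);
        System.out.println("refetching: " + fs.remoteCalls + " calls");
        fs.remoteCalls = 0;
        totalLengthReusing(fs);
        System.out.println("reusing: " + fs.remoteCalls + " calls");
    }
}
```

With N part-files, the first variant makes N + 1 simulated calls and the second makes one, which is why the gap grows with the number of part-files and with the per-call latency of the store.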

liancheng · 2014-07-11T04:20:47Z

Hmm... The build failed because of an unknown Maven issue.

liancheng · 2014-07-15T02:20:54Z

Hey @dvryaboy, would you mind helping to re-test and review this PR? Currently we mimic this PR via reflection tricks to gain better performance in Apache Spark (apache/spark#1370). It would be great if we could have this merged into Parquet. Please let me know your concerns if any, especially those related to PR #2. Thanks!

dvryaboy · 2014-07-16T14:58:33Z

Hi, sorry, got busy with real life. Will review and help merge this weekend.

julienledem · 2014-07-18T23:13:45Z

I think this conflicts with https://github.com/apache/incubator-parquet-mr/pull/2
@liancheng could you merge master in your branch to fix it?

julienledem · 2014-07-18T23:14:20Z

parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java

-import java.util.Comparator;
-import java.util.List;
-import java.util.Map;
+import java.util.*;


We prefer explicit imports. This makes the code easier to read outside of an IDE (like on GitHub).

liancheng · 2014-07-30T07:20:25Z

@julienledem Made changes to make this PR compatible with #2 and got rid of the side effects.

tsdeng · 2014-08-01T18:13:35Z

Is LRU cache really necessary? LRU cache is useful when keys are hit at different frequencies or probabilities. This does not seem to be the case for footers.

Here we just want to cache based on the JobContext, so if the same JobContext asks for the footers again, we can provide the cached footers; otherwise we invalidate the cache.
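A minimal sketch of that alternative, assuming a cache that remembers only the most recent key. SingleEntryCache and its loader parameter are hypothetical names for illustration, and using equals() on the key is an assumption about how two job contexts would be compared.

```java
import java.util.Objects;
import java.util.function.Function;

// Hypothetical single-entry cache: it keeps the value for the last key only.
final class SingleEntryCache<K, V> {
    private K cachedKey;
    private V cachedValue;
    private boolean hasEntry = false;

    // Returns the cached value when the key matches the previous call;
    // otherwise drops the old entry and loads a fresh value.
    V get(K key, Function<K, V> loader) {
        if (hasEntry && Objects.equals(cachedKey, key)) {
            return cachedValue;
        }
        cachedValue = loader.apply(key);
        cachedKey = key;
        hasEntry = true;
        return cachedValue;
    }

    public static void main(String[] args) {
        SingleEntryCache<String, String> footers = new SingleEntryCache<>();
        int[] loads = {0};
        Function<String, String> loader = job -> {
            loads[0]++;
            return "footers for " + job;
        };
        footers.get("job-1", loader); // loads
        footers.get("job-1", loader); // served from the cache
        footers.get("job-2", loader); // different key: invalidates and reloads
        System.out.println("loads: " + loads[0]);
    }
}
```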

tsdeng · 2014-08-01T19:03:19Z

The purpose of the other PR you referenced (https://github.com/apache/incubator-parquet-mr/pull/2) is actually to avoid returning the cached footers when a different jobContext is given.

So a list of footers cached based on jobContext should solve it?
@matt-martin

liancheng · 2014-08-02T05:42:19Z

@tsdeng Hmm, I'm afraid caching footers doesn't solve this performance issue. It's the getFileStatus() call in getSplits(Configuration, List<Footer>) that affects the performance, especially since those FileStatus objects are already retrieved in getSplits(JobContext) with listStatus(). There's no reason to retrieve them again one by one in a sequential manner.

liancheng · 2014-08-04T03:22:24Z

Just in case someone didn't notice, this comment may help understand this issue.

Basically, it's the FileStatuses (which help building ParquetInputSplits) rather than the footers that are important. And I'm actually not trying to "cache" the FileStatuses, just passing them from the upper part of the call chain to the lower part with the help of the existing footersCache since retrieved FileStatuses are already there anyway.
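A rough sketch of that hand-off, under the assumption that each cache entry holds a footer together with the status it was read for. All names here (CachedFooterSketch, FooterLruSketch, CallChainSketch) are hypothetical, a plain long stands in for the FileStatus, and the LRU is built on java.util.LinkedHashMap rather than on the actual cache from PR #2.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical cache value: a footer plus the status data listed for the same file.
final class CachedFooterSketch {
    final long length;   // stand-in for the information a FileStatus carries
    final String footer; // stand-in for the parsed footer

    CachedFooterSketch(long length, String footer) {
        this.length = length;
        this.footer = footer;
    }
}

// Hypothetical LRU cache: access-ordered LinkedHashMap that evicts the eldest entry.
final class FooterLruSketch extends LinkedHashMap<String, CachedFooterSketch> {
    private static final long serialVersionUID = 1L;
    private final int maxEntries;

    FooterLruSketch(int maxEntries) {
        super(16, 0.75f, true); // true = iterate in access order
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, CachedFooterSketch> eldest) {
        return size() > maxEntries;
    }
}

final class CallChainSketch {
    private final FooterLruSketch footersCache = new FooterLruSketch(100);

    // Upper part of the call chain: statuses come from the listing,
    // and each footer is cached together with its status.
    void cacheFooters(Map<String, Long> listedLengths) {
        for (Map.Entry<String, Long> e : listedLengths.entrySet()) {
            footersCache.put(e.getKey(),
                new CachedFooterSketch(e.getValue(), "footer of " + e.getKey()));
        }
    }

    // Lower part: read the status from the cache entry instead of asking
    // the file system again for each file.
    long lengthFor(String path) {
        CachedFooterSketch entry = footersCache.get(path);
        if (entry == null) {
            throw new IllegalStateException("no cached entry for " + path);
        }
        return entry.length;
    }
}
```

The cache is only a carrier here: nothing new is stored, the lower layer just stops discarding data that the upper layer already has.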

liancheng · 2014-08-07T03:17:44Z

Hey guys, would anyone like to share some further thoughts on this PR?

I should mention that by working around this issue in a similar way in Spark SQL, we observed an order-of-magnitude performance boost when reading a large Parquet file (over 300GB, with 3000+ part-files and a 140MB _metadata) from S3, because getFileStatus is much more expensive on S3.

liancheng · 2014-09-03T23:01:59Z

Rebased to master.

Myasuka · 2015-01-20T07:26:10Z

Hi, guys.
We store Parquet files totaling 110GB in 6000+ parts on Tachyon, and run Spark SQL queries such as select count(*). However, before the Spark tasks launch, 18~20 seconds are spent getting file statuses. I can see a lot of log output printed on the console, which looks like INFO - getFileStatus(tachyon://****/**): HDFS Path: hdfs://****/** TPath: tachyon://****/** These calls cost a lot of time. When we query the same data in Parquet format on HDFS, these getFileStatus calls do not occur, so querying on HDFS performs better.

Our Spark version is 1.2.0, Tachyon version is 0.6-SNAPSHOT, and Hadoop version is 2.4.0. I think this PR is related to our problem.

liancheng · 2015-01-20T19:19:46Z

Hey @Myasuka, Spark SQL already works around this issue by providing a customized input format. Since you only observe those getFileStatus calls on Tachyon, I guess it's Tachyon rather than Parquet that issues these calls. However, I'm rather unfamiliar with Tachyon. Please correct me if I'm wrong here.

Myasuka · 2015-02-10T04:12:51Z

Hi @liancheng, sorry for the late reply. It seems that it's a Tachyon bug. You can see the call stack traces below: the 1st pic is querying directly from HDFS while the other one is querying from Tachyon. Moreover, I have created an issue in Tachyon.

liancheng · 2015-02-10T19:41:03Z

Closing this since #45 fixed this issue from another angle.


julienledem reviewed Jul 18, 2014

liancheng added 2 commits September 3, 2014 15:50

Leveraged footersCache to pass FileStatus objects without side effects

bc043d5

Reverted code style changes to make this PR cleaner

c4f2fb8

liancheng force-pushed the parquet-16 branch from fd351b2 to c4f2fb8 Compare September 3, 2014 22:51

liancheng closed this Feb 10, 2015

liancheng deleted the parquet-16 branch February 10, 2015 19:41

parthchandra pushed a commit to parthchandra/incubator-parquet-mr that referenced this pull request May 13, 2022

Remove extra copy when decompressing data page with encryption on (ap…

9fbda8d

…ache#17)

sunchao added a commit to sunchao/parquet-mr that referenced this pull request Jun 16, 2022

Remove extra copy when decompressing data page with encryption on (ap…

214c18f

…ache#17)
