PARQUET-16: Avoid calling getFileStatus() on all part-files #17

Closed
wants to merge 2 commits into from

Conversation

liancheng
Contributor

JIRA issue: PARQUET-16

This PR improves the performance of ParquetInputFormat.getSplits(JobContext), especially when reading large files from S3.

When getSplits(JobContext) is called, all the FileStatus objects are already fetched via listStatus, but they are discarded immediately. PR #2 added an LRU cache for Footer objects, and fortunately the corresponding FileStatus objects are cached together with them. In this PR, two private methods are added to leverage the LRU cache and avoid retrieving the FileStatus objects again, one at a time, when calling getSplits(JobContext).
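
To illustrate the idea in the description above, here is a minimal, self-contained sketch. It does not use the Hadoop or Parquet APIs: FakeFileSystem, FakeStatus, indexByPath, and the call counter are hypothetical stand-ins assumed purely for illustration, and this is not the code of the PR. It shows how the statuses returned by a single listing can be indexed by path so that later stages look them up locally instead of issuing one remote call per part-file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StatusReuseSketch {
    // Index one listing by path so later stages can look statuses up locally.
    static Map<String, FakeStatus> indexByPath(List<FakeStatus> listing) {
        Map<String, FakeStatus> byPath = new HashMap<>();
        for (FakeStatus status : listing) {
            byPath.put(status.path, status);
        }
        return byPath;
    }

    public static void main(String[] args) {
        FakeFileSystem fs = new FakeFileSystem();
        for (int i = 0; i < 3000; i++) {
            fs.add("part-" + i + ".parquet", 1024L);
        }

        // A single listing call, then purely local lookups.
        Map<String, FakeStatus> byPath = indexByPath(fs.listStatus());
        long totalBytes = 0;
        for (FakeStatus status : byPath.values()) {
            totalBytes += status.length;
        }
        System.out.println("remote calls with reuse: " + fs.remoteCalls);
        System.out.println("total bytes: " + totalBytes);
    }
}

// Hypothetical stand-in for a file status; NOT the Hadoop FileStatus class.
final class FakeStatus {
    final String path;
    final long length;

    FakeStatus(String path, long length) {
        this.path = path;
        this.length = length;
    }
}

// Hypothetical file system that counts simulated remote round trips.
final class FakeFileSystem {
    private final Map<String, FakeStatus> files = new HashMap<>();
    int remoteCalls = 0;

    void add(String path, long length) {
        files.put(path, new FakeStatus(path, length));
    }

    // One simulated round trip returns the statuses of every file.
    List<FakeStatus> listStatus() {
        remoteCalls++;
        return new ArrayList<>(files.values());
    }

    // One simulated round trip per file.
    FakeStatus getFileStatus(String path) {
        remoteCalls++;
        return files.get(path);
    }
}
```

In this sketch the number of simulated round trips stays at one regardless of how many part-files exist, whereas calling getFileStatus per file would grow linearly with the file count.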

@liancheng
Contributor Author

Hmm... The build failed because of an unknown Maven issue.

@liancheng
Contributor Author

Hey @dvryaboy, would you mind helping to re-test and review this PR? We currently mimic this PR via reflection tricks to gain better performance in Apache Spark (apache/spark#1370). It would be great if we could have this merged into Parquet. Please let me know if you have any concerns, especially ones related to PR #2. Thanks!

@dvryaboy
Contributor

Hi, sorry, got busy with real life. Will review and help merge this weekend.

@julienledem
Member

I think this conflicts with https://github.com/apache/incubator-parquet-mr/pull/2
@liancheng could you merge master in your branch to fix it?

import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.*;
Member

We prefer explicit imports. This makes the code easier to read outside of an IDE (like on GitHub).

@liancheng
Contributor Author

@julienledem Made changes to make this PR compatible with #2 and got rid of the side effects.

@tsdeng
Contributor

tsdeng commented Aug 1, 2014

Is an LRU cache really necessary? An LRU cache is useful when keys are hit with different frequencies or probabilities. This does not seem to be the case for footers.

Here we just want to cache based on the JobContext, so if the same JobContext asks for the footers again, we can provide the cached footers; otherwise we invalidate the cache.
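
The suggestion above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from parquet-mr: PerContextCache and its get method are hypothetical names, plain Objects stand in for job contexts, strings stand in for footers, and "the same JobContext" is assumed to mean the same object reference.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

final class PerContextCacheDemo {
    public static void main(String[] args) {
        PerContextCache<Object, String> cache = new PerContextCache<>();
        Object jobA = new Object(); // stands in for one job context
        Object jobB = new Object(); // stands in for a different job context

        cache.get(jobA, () -> Arrays.asList("footer-1", "footer-2"));
        cache.get(jobA, () -> Arrays.asList("unused"));   // same context: served from cache
        cache.get(jobB, () -> Arrays.asList("footer-3")); // different context: reloaded
        System.out.println("loader ran " + cache.loads + " times");
    }
}

// Hypothetical single-entry cache: it remembers the values loaded for the most
// recent key only and reloads whenever a different key object is presented.
final class PerContextCache<K, V> {
    private K cachedKey;
    private List<V> cachedValues;
    int loads = 0; // number of times the loader actually ran

    List<V> get(K key, Supplier<List<V>> loader) {
        // Reference comparison: "same context" means the same object here.
        if (cachedValues == null || cachedKey != key) {
            cachedValues = loader.get();
            cachedKey = key;
            loads++;
        }
        return cachedValues;
    }
}
```

Compared with an LRU cache, this design holds at most one entry and needs no eviction policy, at the cost of reloading whenever callers alternate between contexts.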

@tsdeng
Contributor

tsdeng commented Aug 1, 2014

The other PR you referenced: https://github.com/apache/incubator-parquet-mr/pull/2
Its purpose is actually to avoid returning the cached footers when a different jobContext is given.

So a list of footers cached based on jobContext should solve it?
@matt-martin

@liancheng
Contributor Author

@tsdeng Hmm, I'm afraid caching footers doesn't solve this performance issue. It's the getFileStatus() call in getSplits(Configuration, List<Footer>) that affects performance, especially since those FileStatus objects are already retrieved in getSplits(JobContext) with listStatus(). There's no reason to retrieve them again one by one, sequentially.

@liancheng
Contributor Author

Just in case someone didn't notice, this comment may help understand this issue.

Basically, it's the FileStatuses (which help build ParquetInputSplits) rather than the footers that are important. And I'm actually not trying to "cache" the FileStatuses, just passing them from the upper part of the call chain to the lower part with the help of the existing footersCache, since the retrieved FileStatuses are already there anyway.
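
As a rough sketch of "passing them down the call chain", the lower-level method can take the already-known file information as an argument rather than fetching it itself. Everything here is assumed for illustration: FooterWithStatus, buildSplits, and the string-based split description are hypothetical, and cutting files into fixed byte ranges is a deliberate simplification, not how Parquet actually forms its splits.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CallChainSketch {
    // Lower part of the chain: it needs file lengths, but receives them as
    // arguments, so it never has to query the file system itself.
    static List<String> buildSplits(List<FooterWithStatus> entries, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<String> splits = new ArrayList<>();
        for (FooterWithStatus entry : entries) {
            for (long start = 0; start < entry.fileLength; start += splitSize) {
                long end = Math.min(start + splitSize, entry.fileLength);
                splits.add(entry.path + "[" + start + "," + end + ")");
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Upper part of the chain: pretend these came from one listing call.
        List<FooterWithStatus> listed = Arrays.asList(
                new FooterWithStatus("part-0.parquet", 250L, "footer-0"),
                new FooterWithStatus("part-1.parquet", 100L, "footer-1"));
        for (String split : buildSplits(listed, 100L)) {
            System.out.println(split);
        }
    }
}

// Hypothetical pairing of a footer with the file information fetched alongside it.
final class FooterWithStatus {
    final String path;
    final long fileLength; // taken from the status obtained during listing
    final String footer;   // placeholder for parsed footer metadata

    FooterWithStatus(String path, long fileLength, String footer) {
        this.path = path;
        this.fileLength = fileLength;
        this.footer = footer;
    }
}
```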

@liancheng
Contributor Author

Hey guys, would anyone like to share some further thoughts on this PR?

I should mention that by working around this issue in a similar way in Spark SQL, we observed an order-of-magnitude performance boost when reading a large Parquet file (over 300GB, with 3000+ part-files and a 140MB _metadata) from S3, because getFileStatus is much more expensive on S3.

@liancheng
Contributor Author

Rebased to master.

@Myasuka
Member

Myasuka commented Jan 20, 2015

Hi, guys.
We store Parquet files totaling 110GB in 6000+ parts on Tachyon and run Spark SQL queries such as select count (*). However, before the Spark tasks launch, 18~20 seconds are spent getting file statuses. I can see a lot of log output printed to the console, which looks like INFO - getFileStatus(tachyon://****/**): HDFS Path: hdfs://****/** TPath: tachyon://****/**. These calls cost a lot of time. When we query the same data in Parquet format on HDFS, these getFileStatus calls do not occur, so querying on HDFS performs better.

Our Spark version is 1.2.0, Tachyon version is 0.6-SNAPSHOT, and Hadoop version is 2.4.0. I think this PR is related to our problem.

@liancheng
Contributor Author

Hey @Myasuka, Spark SQL already worked around this issue by providing a customized input format. Since you only observe those getFileStatus calls on Tachyon, I guess it's Tachyon rather than Parquet that issued these calls. However, I'm rather unfamiliar with Tachyon. Please correct me if I'm wrong here.

@Myasuka
Member

Myasuka commented Feb 10, 2015

Hi @liancheng, sorry for the late reply. It seems that it's a Tachyon bug. You can see the call stack traces below: the first picture is from querying directly from HDFS, while the other one is from querying from Tachyon. Moreover, I have created an issue in Tachyon.

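[Image: fromhdfs (call stack when querying directly from HDFS)]
[Image: withtachyon (call stack when querying from Tachyon)]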

@liancheng
Contributor Author

Closing this since #45 fixed this issue from another angle.

@liancheng liancheng closed this Feb 10, 2015
@liancheng liancheng deleted the parquet-16 branch February 10, 2015 19:41
parthchandra pushed a commit to parthchandra/incubator-parquet-mr that referenced this pull request May 13, 2022
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Jun 16, 2022
5 participants