-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-16: Avoid calling getFileStatus() on all part-files #17
Conversation
Hmm... Build failed because of unknown Maven issue. |
Hey @dvryaboy, would you mind to help to re-test and review this PR? Currently we mimicked this PR via reflection tricks to gain better performance in Apache Spark (apache/spark#1370). It would be great if we can have this merged into Parquet. Please let me know your concerns if any, especially those related to PR #2. Thanks! |
Hi, sorry, got busy with real life. Will review and help merge this |
I think this conflicts with https://github.com/apache/incubator-parquet-mr/pull/2 |
import java.util.Comparator; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we prefer the explicit imports. this makes it easier to read outside of an IDE (like on github)
@julienledem Made changes to make this PR compatible with #2 and got rid of the side effects. |
Is LRU cache really necessary? LRU cache is useful when keys are hit at different frequencies or probabilities. This does not seem to be the case for footers. Here we just want to cache based on the jobcontext, so if the same jobcontext asks the footers again, we can provide the cached footers, other wise we invalidate the cache. |
The other PR you referenced https://github.com/apache/incubator-parquet-mr/pull/2 So a list of footers cached based on jobContext should solve it? |
@tsdeng Hmm, I'm afraid caching footers doesn't solve this performance issue. It's the |
Just in case someone didn't notice, this comment may help understand this issue. Basically, it's the |
Hey guys, would anyone like to share some further thoughts on this PR? I should mention that by working around this issue in a similar way in Spark SQL, we observed an order of magnitude performance boost when reading large Parquet file (over 300GB, with 3000+ part-files and a 140MB |
fd351b2
to
c4f2fb8
Compare
Rebased to master. |
Hi, guys. Our Spark version is 1.2.0, Tachyon version is 0.6-SNAPSHOT, and Hadoop version is 2.4.0, I think this PR is related to our problem. |
Hey @Myasuka, Spark SQL already worked around this issue by providing a customized input format. Since you only observe those |
Hi, @liancheng , sorry for late reply. It seems that it's Tachyon's bug, you can see the call stack tracks below, the 1st pic is quering directly from HDFS while the other one is quering from Tachyon. Moreover, I have created an issue in Tachyon. |
Closing this since #45 fixed this issue from another angle. |
JIRA issue: PARQUET-16
This PR improves performance of
ParquetInputFormat.getSplit(JobContext)
, especially when reading large files from S3.When calling
getSplits(JobContext)
, all theFileStatus
objects are already fetched vialistStatus
, but abandoned immediately. PR #2 added an LRU cache forFooter
objects, and fortunately correspondingFileStatus
objects are cached together. In this PR, two private methods are added to leverage the LRU cache and prevent retrievingFileStatus
objects sequentially when callinggetSplits(JobContext)
.