PARQUET-84: Avoid reading rowgroup metadata in memory on the client side. #45
Conversation
```java
public static final MetadataFilter NO_FILTER = new NoFilter();
public static final MetadataFilter SKIP_ROW_GROUPS = new SkipMetadataFilter();
/**
 * [ startOffset, endOffset (
```
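To illustrate the idea behind the two constants under review, here is a hypothetical, simplified sketch; the real classes live in parquet-mr's `ParquetMetadataConverter`, and the names and shapes below are illustrative only, not the actual API:

```java
// Hypothetical sketch of the MetadataFilter idea in this PR: the filter
// decides how much row-group metadata the footer reader materializes.
import java.util.Collections;
import java.util.List;

public class MetadataFilterSketch {
    // A row group summarized by its byte range in the file.
    static final class RowGroup {
        final long startOffset;
        final long endOffset;
        RowGroup(long startOffset, long endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    interface MetadataFilter {
        List<RowGroup> filter(List<RowGroup> rowGroups);
    }

    // NO_FILTER keeps everything: the old behavior, where the client
    // materializes metadata for every row group of every file.
    static final MetadataFilter NO_FILTER = rowGroups -> rowGroups;

    // SKIP_ROW_GROUPS drops all row-group metadata on the client side;
    // each task later re-reads the footer of the one file it processes.
    static final MetadataFilter SKIP_ROW_GROUPS = rowGroups -> Collections.emptyList();
}
```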
fix comment: [ startOffset, endOffset )
thx
Force-pushed from 7957a92 to 5b6bd1b
```java
}
}
/**
 * [ startOffset, endOffset (
```
fix comment: endOffset )
LGTM!
@tsdeng and the build is green!
PARQUET-84: Avoid reading rowgroup metadata in memory on the client side.

This will improve reading big datasets with a large schema (thousands of columns). Instead, row group metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.

Author: julien <[email protected]>

Closes apache#45 from julienledem/skip_reading_row_groups and squashes the following commits:

ccdd08c [julien] fix parquet-hive
24a2050 [julien] Merge branch 'master' into skip_reading_row_groups
3d7e35a [julien] address review feedback
5b6bd1b [julien] more tests
323d254 [julien] add unit tests
f599259 [julien] review feedback
fb11f02 [julien] fix backward compatibility check
2c20b46 [julien] cleanup readFooters methods
3da37d8 [julien] fix read summary
ab95a45 [julien] cleanup
4d16df3 [julien] implement task side metadata
9bb8059 [julien] first stab at integrating skipping row groups
(Same commit message and squashed commits as above, cherry-picked.)

Conflicts:
parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
parquet-hadoop/src/test/java/parquet/hadoop/example/TestInputOutputFormat.java

Resolution: conflicts were from whitespace changes and strict type checking (not backported). Removed dependence on strict type checking.
apache#43 added logic to return null when `compressedPages` becomes empty. However, this is not correct with async I/O enabled, since the first page may not have been read yet when the method is called. This fixes it by adding an `isFinished` variable to `ColumnChunkPageReadStore` to indicate whether all the pages have been consumed. In addition, a few precondition checks were added to make sure the object cannot reach an invalid state.
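The ambiguity described above can be sketched as follows; this is a hypothetical, simplified model of the fix (class and method names invented for illustration, not the actual `ColumnChunkPageReadStore` code):

```java
// With async I/O, an empty deque is ambiguous: it can mean "all pages
// consumed" OR "the next page has not been read yet". An explicit counter
// of pages still owed to the caller disambiguates the two states.
import java.util.ArrayDeque;

public class PageReaderSketch {
    private final ArrayDeque<String> compressedPages = new ArrayDeque<>();
    private int remaining; // pages not yet returned to the caller

    PageReaderSketch(int totalPages) {
        this.remaining = totalPages;
    }

    // Called by the async I/O thread as pages arrive.
    void onPageArrived(String page) {
        compressedPages.add(page);
    }

    boolean isFinished() {
        return remaining == 0;
    }

    String readPage() {
        if (isFinished()) {
            return null; // genuinely done: every page was consumed
        }
        if (compressedPages.isEmpty()) {
            // precondition violated: caller got ahead of the async reader
            throw new IllegalStateException("page not yet read by async I/O");
        }
        remaining--;
        return compressedPages.poll();
    }
}
```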
Follow-up of apache#45. This fixes the precondition check of the `getPageValueCount` method.
This will improve reading big datasets with a large schema (thousands of columns). Instead, row group metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.
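A task that reads metadata for only its own portion of the data needs a rule for which row groups belong to it. One common scheme, sketched here with hypothetical names (not necessarily the exact rule parquet-mr uses), is half-open midpoint containment, matching the `[ startOffset, endOffset )` interval discussed in the review:

```java
// A task claims a row group iff the row group's midpoint falls in the
// task's half-open byte range [splitStart, splitEnd). The half-open bound
// guarantees a row group sitting exactly on a split boundary is claimed
// by exactly one of the two adjacent tasks, never both and never neither.
public class SplitRangeSketch {
    static boolean claims(long splitStart, long splitEnd,
                          long rowGroupStart, long rowGroupEnd) {
        long mid = rowGroupStart + (rowGroupEnd - rowGroupStart) / 2;
        return splitStart <= mid && mid < splitEnd; // strict '<' at the end
    }
}
```

For example, a row group spanning bytes 80–120 has midpoint 100: the split `[0, 100)` does not claim it, while the adjacent split `[100, 200)` does.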