PARQUET-84: Avoid reading rowgroup metadata in memory on the client side. #45

Closed
wants to merge 12 commits into from


julienledem
Member

This improves reading large datasets with a large schema (thousands of columns).
Instead of being loaded on the client side, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.
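
To illustrate the idea, here is a minimal, self-contained sketch of how a footer filter could select row groups. It is not the code from this PR: the `RowGroup` and `Filter` types and the `range` helper are hypothetical stand-ins, and selecting by the row group's starting position is an assumption made for the example. Only the names `NO_FILTER` and `SKIP_ROW_GROUPS` and the half-open interval `[ startOffset, endOffset )` come from the diff under review.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowGroupRangeSketch {

  /** Hypothetical stand-in for one row group entry in a file footer. */
  static final class RowGroup {
    final long startingPos;
    final long rowCount;

    RowGroup(long startingPos, long rowCount) {
      this.startingPos = startingPos;
      this.rowCount = rowCount;
    }
  }

  /** Hypothetical filter type; not the MetadataFilter API of the PR. */
  interface Filter {
    List<RowGroup> apply(List<RowGroup> all);
  }

  /** Keeps every row group (what a client did before this change). */
  static final Filter NO_FILTER = all -> new ArrayList<>(all);

  /** Drops all row groups, so the client only keeps schema-level metadata. */
  static final Filter SKIP_ROW_GROUPS = all -> new ArrayList<>();

  /**
   * Keeps row groups whose start falls in [ startOffset, endOffset ).
   * Assumption: membership is decided by the starting position only.
   */
  static Filter range(long startOffset, long endOffset) {
    return all -> {
      List<RowGroup> kept = new ArrayList<>();
      for (RowGroup rg : all) {
        if (rg.startingPos >= startOffset && rg.startingPos < endOffset) {
          kept.add(rg);
        }
      }
      return kept;
    };
  }

  public static void main(String[] args) {
    List<RowGroup> footer = Arrays.asList(
        new RowGroup(4, 100), new RowGroup(1000, 100), new RowGroup(2000, 50));

    // Client side: no row group metadata is kept in memory.
    System.out.println(SKIP_ROW_GROUPS.apply(footer).size()); // prints 0

    // Task side: each task keeps only the row groups of its own split.
    System.out.println(range(0, 1000).apply(footer).size());    // prints 1
    System.out.println(range(1000, 3000).apply(footer).size()); // prints 2
  }
}
```

Because the interval is half-open, a row group starting exactly at a split boundary (offset 1000 above) belongs to exactly one task, which is why the closing bracket in the Javadoc matters.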

public static final MetadataFilter NO_FILTER = new NoFilter();
public static final MetadataFilter SKIP_ROW_GROUPS = new SkipMetadataFilter();
/**
* [ startOffset, endOffset (
Contributor

fix comment: [ startOffset, endOffset )

Member Author

thx

@julienledem julienledem changed the title from "Avoid reading rowgroup metadata in memory on the client side." to "PARQUET-84: Avoid reading rowgroup metadata in memory on the client side." Sep 3, 2014
@julienledem julienledem force-pushed the skip_reading_row_groups branch from 7957a92 to 5b6bd1b Compare September 3, 2014 00:32
}
}
/**
* [ startOffset, endOffset (
Contributor

comment endOffset )

@tsdeng
Contributor

tsdeng commented Sep 4, 2014

LGTM!

@julienledem
Member Author

@tsdeng and the build is green!

@asfgit asfgit closed this in 5dafd12 Sep 5, 2014
tongjiechen pushed a commit to tongjiechen/incubator-parquet-mr that referenced this pull request Oct 8, 2014
…ide.

This improves reading large datasets with a large schema (thousands of columns).
Instead of being loaded on the client side, rowgroup metadata can be read in the tasks, where each task reads only the metadata of the file it is reading.

Author: julien <[email protected]>

Closes apache#45 from julienledem/skip_reading_row_groups and squashes the following commits:

ccdd08c [julien] fix parquet-hive
24a2050 [julien] Merge branch 'master' into skip_reading_row_groups
3d7e35a [julien] adress review feedback
5b6bd1b [julien] more tests
323d254 [julien] sdd unit tests
f599259 [julien] review feedback
fb11f02 [julien] fix backward compatibility check
2c20b46 [julien] cleanup readFooters methods
3da37d8 [julien] fix read summary
ab95a45 [julien] cleanup
4d16df3 [julien] implement task side metadata
9bb8059 [julien] first stab at integrating skipping row groups
@julienledem julienledem deleted the skip_reading_row_groups branch October 30, 2014 23:30
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Feb 6, 2015
Conflicts:
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetFileWriter.java
	parquet-hadoop/src/main/java/parquet/hadoop/ParquetInputFormat.java
	parquet-hadoop/src/test/java/parquet/hadoop/example/TestInputOutputFormat.java
Resolution:
    Conflicts were from whitespace changes and strict type checking (not
    backported). Removed dependence on strict type checking.
rdblue pushed a commit to rdblue/parquet-mr that referenced this pull request Mar 9, 2015
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Aug 1, 2022
apache#43 added the logic to return null when `compressedPages` becomes empty. However, this is not correct with async IO enabled, since the first page may not have been read yet when the method is called.

This fixes it by adding an `isFinished` variable to indicate whether all the pages have been consumed in the `ColumnChunkPageReadStore`. In addition, this adds a few pre-condition checks to make sure the object won't run into an invalid state.
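
A rough sketch of the problem and the fix described above, under stated assumptions: `PageReader`, `onPageLoaded`, and `readAll` are hypothetical names invented for this example, pages are plain strings, and the background loader is simulated with a sleeping thread. Only `compressedPages` and `isFinished` are names taken from the commit message; this is not the actual `ColumnChunkPageReadStore` code.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class AsyncPageSketch {

  /** Hypothetical reader whose pages arrive asynchronously. */
  static final class PageReader {
    private final BlockingQueue<String> compressedPages = new LinkedBlockingQueue<>();
    private final int totalPages;
    private int consumed = 0;
    private boolean isFinished;

    PageReader(int totalPages) {
      // Pre-condition check: reject an invalid page count up front.
      if (totalPages < 0) {
        throw new IllegalArgumentException("totalPages must be >= 0");
      }
      this.totalPages = totalPages;
      this.isFinished = (totalPages == 0);
    }

    /** Called by the (simulated) async loader when a page has been read. */
    void onPageLoaded(String page) {
      compressedPages.add(page);
    }

    /**
     * Returns null only once every page has been consumed. An empty queue
     * alone is not treated as "done", because the loader may simply not
     * have delivered the first page yet; in that case this call waits.
     */
    String readPage() {
      if (isFinished) {
        return null;
      }
      try {
        String page = compressedPages.take(); // blocks until a page arrives
        consumed++;
        if (consumed == totalPages) {
          isFinished = true;
        }
        return page;
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IllegalStateException("interrupted while waiting for a page", e);
      }
    }
  }

  /** Reads every page from a slow simulated loader and returns the count. */
  static int readAll(int totalPages) {
    PageReader reader = new PageReader(totalPages);
    Thread loader = new Thread(() -> {
      try {
        for (int i = 0; i < totalPages; i++) {
          Thread.sleep(20); // the consumer usually gets here first
          reader.onPageLoaded("page-" + i);
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    loader.start();
    int count = 0;
    while (reader.readPage() != null) {
      count++;
    }
    try {
      loader.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(readAll(3)); // prints 3
  }
}
```

Had `readPage` returned null whenever the queue was empty, the consumer loop in `readAll` would most likely have stopped before the first page was delivered.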
sunchao added a commit to sunchao/parquet-mr that referenced this pull request Sep 16, 2022
Follow-up of apache#45. This fixes the pre-condition check of the `getPageValueCount` method.
2 participants