-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1212: Column indexes: Show indexes in tools #479
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| List<String> blockIndexes; | ||
|
|
||
| @Parameter(names = { "-i", "--column-index" }, description = "Shows the column indexes; " | ||
| + "the default behavior is if both -i and -o would be set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(subjective, optional) I would suggest "active by default unless -o is used"
| boolean showColumnIndex; | ||
|
|
||
| @Parameter(names = { "-o", "--offset-index" }, description = "Shows the offset indexes; " | ||
| + "the default behavior is if both -i and -o would be set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(subjective, optional) I would suggest "active by default unless -i is used"
| + "blocks are referenced by their indexes from 0") | ||
| List<String> blockIndexes; | ||
|
|
||
| @Parameter(names = { "-i", "--column-index" }, description = "Shows the column indexes; " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(subjective, optional) I find -i very counter-intuitive. Could this be -c and --column something else? For example -f/--field? Alternatively, -k/--column? Or -p/--column-path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--column -c is used in other commands. I think, it is better to keep this option as is to be common between the different commands.
| return 0; | ||
| } | ||
|
|
||
| private String getBlockIndex(int i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add a short comment about what happens here and why?
If I understand correclty:
- without blockIndexes, we address into blocks directly, but
- blockIndexes introduces and additional layer of indirection if exists
Although I'm still unsure about why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks to your comment, finally I undestood what is going on but I find it very confusing.
- If the user wants to see all blocks, the index of elements returned by getBlocks() corresponds to their actual index in the file.
- If, on the other hand, the user wants to view specific blocks, getBlocks() returns elements in the order they were requested and the actual index can only be obtained via a reverse lookup.
I think it would be a lot cleaner the following way:
- The getBlocks() method should return an index to block map instead of a list of blocks.
- This way when iterating though the results of getBlocks(), we won't have to do a reverse lookup for the actual index, since it is a part of the container returned.
| /** | ||
| * parquet-tools command to print column and offset indexes. | ||
| */ | ||
| public class ColumnIndexCommand extends ArgsOnlyCommand { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comments as for the CLI command.
| registry.put("merge", MergeCommand.class); | ||
| registry.put("rowcount", RowCountCommand.class); | ||
| registry.put("size", SizeCommand.class); | ||
| registry.put("columnindex", ColumnIndexCommand.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the command name should be the same in the CLI and in parquet-tools.
zivanfi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think the logic is a lot clearer now!
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch. * PARQUET-1211: Column indexes: read/write API (#456) * PARQUET-1212: Column indexes: Show indexes in tools (#479) * PARQUET-1213: Column indexes: Limit index size (#480) * PARQUET-1214: Column indexes: Truncate min/max values (#481) * PARQUET-1364: Invalid row indexes for pages starting with nulls (#507) * PARQUET-1310: Column indexes: Filtering (#509) * PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515) * PARQUET-1389: Improve value skipping at page synchronization (#514) * PARQUET-1381: Fix missing endRecord after merging columnIndex
No description provided.