Skip to content

Conversation

@gszadovszky
Copy link
Contributor

No description provided.

List<String> blockIndexes;

@Parameter(names = { "-i", "--column-index" }, description = "Shows the column indexes; "
+ "the default behavior is if both -i and -o would be set")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(subjective, optional) I would suggest "active by default unless -o is used"

boolean showColumnIndex;

@Parameter(names = { "-o", "--offset-index" }, description = "Shows the offset indexes; "
+ "the default behavior is if both -i and -o would be set")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(subjective, optional) I would suggest "active by default unless -i is used"

+ "blocks are referenced by their indexes from 0")
List<String> blockIndexes;

@Parameter(names = { "-i", "--column-index" }, description = "Shows the column indexes; "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(subjective, optional) I find -i very counter-intuitive. Could this be -c and --column something else? For example -f/--field? Alternatively, -k/--column? Or -p/--column-path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--column -c is used in other commands. I think, it is better to keep this option as is to be common between the different commands.

return 0;
}

private String getBlockIndex(int i) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a short comment about what happens here and why?

If I understand correclty:

  • without blockIndexes, we address into blocks directly, but
  • blockIndexes introduces and additional layer of indirection if exists

Although I'm still unsure about why.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to your comment, finally I undestood what is going on but I find it very confusing.

  • If the user wants to see all blocks, the index of elements returned by getBlocks() corresponds to their actual index in the file.
  • If, on the other hand, the user wants to view specific blocks, getBlocks() returns elements in the order they were requested and the actual index can only be obtained via a reverse lookup.

I think it would be a lot cleaner the following way:

  • The getBlocks() method should return an index to block map instead of a list of blocks.
  • This way when iterating though the results of getBlocks(), we won't have to do a reverse lookup for the actual index, since it is a part of the container returned.

/**
* parquet-tools command to print column and offset indexes.
*/
public class ColumnIndexCommand extends ArgsOnlyCommand {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comments as for the CLI command.

registry.put("merge", MergeCommand.class);
registry.put("rowcount", RowCountCommand.class);
registry.put("size", SizeCommand.class);
registry.put("columnindex", ColumnIndexCommand.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the command name should be the same in the CLI and in parquet-tools.

Copy link
Contributor

@zivanfi zivanfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I think the logic is a lot clearer now!

@zivanfi zivanfi merged commit 6165a0c into apache:column-indexes May 22, 2018
zivanfi pushed a commit that referenced this pull request Oct 18, 2018
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants