Skip to content

Conversation

@gszadovszky
Copy link
Contributor

No description provided.

@gszadovszky gszadovszky changed the title Parquet 1211 Parquet 1211: Write column indexes: read/write API Feb 13, 2018
@gszadovszky
Copy link
Contributor Author

Please note that the current parquet-format version is 2.4.1-SNAPSHOT which works only if you build it locally before building parquet. For the final commit we will need a released version of parquet-format that contains my update.

Copy link
Contributor

@zivanfi zivanfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please change "Parquet 1211" to "PARQUET-1211" in the PR description. I may have more useful suggestions as well once I read through the code. :)

public abstract class ColumnIndexBuilder {

static abstract class ColumnIndexBase implements ColumnIndex {
private static ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@gszadovszky gszadovszky changed the title Parquet 1211: Write column indexes: read/write API PARQUET-1211: Write column indexes: read/write API Feb 15, 2018
null,
stringBinary("Slartibartfast")));
assertEquals(BoundaryOrder.ASCENDING, columnIndex.getBoundaryOrder());
assertCorrectNullCounts(columnIndex, 1, 2, 3, 4, 5, 6, 7, 8);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is passed? why not "assertCorrectNullCounts(columnIndex, 11, 21, 31, 41, 51, 61, 71, 81)"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the List<Long> is created by using Arrays.asList I have to use long values otherwise a List<Integer> would be created. 1l, 2l etc. are the long literals in java. Others might use capital L as such 1L, 2L etc.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh, I got it wrong. It's letter "l", not number "1".

Copy link
Contributor

@zivanfi zivanfi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only got to OffsetIndexBuilder.java, will continue reviewing from that point later. Only minor nits so far.

static abstract class ColumnIndexBase implements ColumnIndex {
private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);
private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40;
private static final String INNER_ETC = "(...)";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I would suggest naming this TOSTRING_TRUNCATION_MARKER.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

private static final ByteBuffer EMPTY_BYTE_BUFFER = ByteBuffer.allocate(0);
private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40;
private static final String INNER_ETC = "(...)";
private static final int FIRST_LENGTH = (MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length()) / 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I would suggest naming this TOSTRING_TRUNCATION_START_POS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

private static final int MAX_VALUE_LENGTH_FOR_TOSTRING = 40;
private static final String INNER_ETC = "(...)";
private static final int FIRST_LENGTH = (MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length()) / 2;
private static final int LAST_LENGTH = MAX_VALUE_LENGTH_FOR_TOSTRING - INNER_ETC.length() - FIRST_LENGTH;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I would suggest naming this TOSTRING_TRUNCATION_END_POS.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

public String toString() {
try (Formatter formatter = new Formatter()) {
formatter.format("Boudary order: %s\n", boundaryOrder);
String minMaxPart = " %-" + MAX_VALUE_LENGTH_FOR_TOSTRING + "s %-" + MAX_VALUE_LENGTH_FOR_TOSTRING + "s\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks somewhat scary, but I don't know better either (only in C/C++, where printf supports specifying the desired lengths in parameters).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know a better way either. 😞

String nullCount = nullCounts == null ? "--" : Long.toString(nullCounts[i]);
String min, max;
if (nullPages[i]) {
min = max = "--";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) Replace "--"-s with a constant named TOSTRING_MISSING_VALUE_MARKER or similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* the statistics to be added
*/
public void add(Statistics<?> stats) {
if (stats.hasNonNullValue()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this handle that case correctly when we don't have min/max values in spite of non-null values being present? (For int96-s, for example.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we collecting statistics for all the types (even if not supported) only we don't write them to the file. ColumnIndex is working similarly.

List<ByteBuffer> maxValues) {
clear();
int requiredSize = nullPages.size();
if ((nullCounts != null && nullCounts.size() != requiredSize) || minValues.size() != requiredSize
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is nullCounts checked for being null but minValues and maxValues being used without a similar check?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I see now, only nullCounts is optional.

}

/**
* Builds the column index. It also resets all the collected data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether the build method resetting the builder is something that API consumers would expect based on its name. Would it make sense to separate the building and the resetting?

}
}

// min_i <= min_i+1 && max_i <= max_i+1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) It is a bit hard to parse at first whether the +1 is in the index. It would be better as:
// min[i] <= min[i+1] && max[i] <= max[i+1]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

* @param firstRowIndex
* the index of the first row in the page (within the row group)
*/
public void add(long offset, int compressedPageSize, long firstRowIndex) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this method public? It seems to me that if one circumvents the other add method that has just 2 parameters, previousRowCount will not be updated thus this add() should only be called by the other add(). Do I miss something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two argument add method is used by the writers as that point we don't have the real file offsets. That's why we have the build(long) to shift the related values at build time.
The three argument add method is used by the metadata converter when reading the Parquet file. In this case the build() method is used as already the correct offsets are read and no shifting is required.
I'll add some comments to make it more clear.

w.start();
w.startBlock(4);
w.startColumn(C1, 7, CODEC);
w.writeDataPage(7, 4, BytesInput.from(BYTES3), STATS1, BIT_PACKED, BIT_PACKED, PLAIN);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STATS1? no definition?

BinaryStatistics STATS1 = new BinaryStatistics();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, you've checked my change by rebasing to the actual master where STATS1 and STATS2 are removed and exchanged to EMPTY_STATS. In the current change STATS1 and STATS2 still exist so it is correct as is.
I did not rebase this change yet because it would cause loosing the pointers of the review findings. I'll do the rebase as soon as we'll have the required parquet-format release and this change can be finalized.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I build the branch which merging your patch, and found the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for drawing my attention to this one. I'll keep in mind when I'm rebasing.

w.writeDataPage(7, 4, BytesInput.from(BYTES3), STATS1, BIT_PACKED, BIT_PACKED, PLAIN);
w.endColumn();
w.startColumn(C2, 8, CODEC);
w.writeDataPage(8, 4, BytesInput.from(BYTES4), STATS2, BIT_PACKED, BIT_PACKED, PLAIN);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

STATS2? no definition?

BinaryStatistics STATS2 = new BinaryStatistics();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See STATS1.

stringBinary("Slartibartfast"),
null,
null,
stringBinary("Perfect"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a serious typo here: The THHGTTG character is called Prefect, not Perfect. :)

3000, 3000, 29,
6000, 1200, 56);

builder = OffsetIndexBuilder.getBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may be mistaken, but recreating the builder with the same values seems unnecessary to me.

builder.add(5, 6);
builder.add(7, 8);
assertNull(builder.build());
builder.add(1, 2);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these repeated adds intentional?

48000, 22000, 211,
90000, 30000, 361);

builder = OffsetIndexBuilder.getBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like before, this seems unnecessary.

builder.add(7, 8, 9);
builder.add(10, 11, 12);
assertNull(builder.build());
builder.add(1, 2, 3);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like unintentional duplication.

/**
* @param column
* the column chunk which the column index is to be returned for
* @return the column index for the specified column chunk or {@code null} if the there is no index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) s/the there/there/ (in this line and in another line below as well)


assertNull("Should handle null column index", ParquetMetadataConverter
.toParquetColumnIndex(Types.required(PrimitiveTypeName.INT32).named("test_int32"), null));
assertNull("Should handle unsupported types", ParquetMetadataConverter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should ignore unsupported types.


public class TestColumnChunkPageWriteStore {

// OutputFile implementation to reach out the PositionOutputStream internally used by the writer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/reach out/expose/

@gszadovszky gszadovszky changed the base branch from master to column-indexes May 17, 2018 10:03
@gszadovszky gszadovszky changed the title PARQUET-1211: Write column indexes: read/write API PARQUET-1211: Column indexes: read/write API May 17, 2018
@zivanfi zivanfi merged commit aa571d7 into apache:column-indexes May 17, 2018
* @deprecated will be removed in 2.0.0. This method does not support writing column indexes; Use
* {@link #writePage(BytesInput, int, int, Statistics, Encoding, Encoding, Encoding)} instead
*/
@Deprecated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this requires deprecation. It's part of the column API that is internal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, it is not documented anywhere and we do know that at least two different projects use it directly (Spark and Hive for vectorization).
I think, it is better to be on the safe side and handle every java-public API members as Public so we won't break any API consumer.

zivanfi pushed a commit that referenced this pull request Oct 18, 2018
This is a squashed feature branch merge including the changes listed below. The detailed history can be found in the 'column-indexes' branch.

* PARQUET-1211: Column indexes: read/write API (#456)
* PARQUET-1212: Column indexes: Show indexes in tools (#479)
* PARQUET-1213: Column indexes: Limit index size (#480)
* PARQUET-1214: Column indexes: Truncate min/max values (#481)
* PARQUET-1364: Invalid row indexes for pages starting with nulls (#507)
* PARQUET-1310: Column indexes: Filtering (#509)
* PARQUET-1386: Fix issues of NaN and +-0.0 in case of float/double column indexes (#515)
* PARQUET-1389: Improve value skipping at page synchronization (#514)
* PARQUET-1381: Fix missing endRecord after merging columnIndex
}

@Override
int compareMinValues(PrimitiveComparator<Binary> comparator, int index1, int index2) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gszadovszky, why does the type of the PrimitiveComparator use Binary for all types? Or just a mistake?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ConeyLiu, PrimitiveComparator<Binary> covers all potential use case since the only non-java-primitive value type in parquet-mr is Binary and the related methods for these primitive types are part of the PrimitiveComparator interface.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants