Skip to content

Conversation

@gszadovszky
Copy link
Contributor

Make sure you have checked all steps below.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@plygrnd
Copy link

plygrnd commented Sep 23, 2021

TravisCI runs contain multiple ParquetDecodingExceptions (example). Is this expected?

@gszadovszky
Copy link
Contributor Author

TravisCI runs contain multiple ParquetDecodingExceptions (example). Is this expected?

@plygrnd, yes this is expected. The test class TestCorruptThriftRecords verifies if the correct exceptions are thrown. It uses MR to execute the tests and it seems MR logs the exceptions before they are thrown to the caller.

@plygrnd
Copy link

plygrnd commented Sep 23, 2021

Okay, LGTM then.

@shangxinli
Copy link
Contributor

LGTM

* A specific IOException thrown when invalid values are found in the Parquet file metadata (including the footer,
* page header etc.).
*/
public static class InvalidParquetMetadataException extends IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this is an IOException. What is the value of making this a checked exception? Why not just make this a RuntimeException? Or use some existing one like IllegalStateException or ParquetDecodingException?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module parquet-format-structures is the one all the others are depended on. Parquet exceptions are implemented in another module so I cannot use them here. Since we already throw IOExceptions I've felt extending it would be a good idea. But you might be right. I am happy to extend RuntimeException instead of IOException.

return pageHeader;
}

private static <T> void validateValue(Predicate<? super T> validator, T value, String metaName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why accept a predicate? Most check methods like this use a boolean. I would expect it to work like a Precondition:

if (!isValid) {
  throw new ParquetDecodingException(...);
}

int size = pageHeader.getCompressed_page_size()
validateValue(size >= 0, String.format("Compressed page size must be positive, but was: %s", size));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure why I've implemented this way. I'm fine rewriting to use a simple boolean.

public static PageHeader readPageHeader(InputStream from,
BlockCipher.Decryptor decryptor, byte[] AAD) throws IOException {
return read(from, new PageHeader(), decryptor, AAD);
return validate(read(from, new PageHeader(), decryptor, AAD));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more clear if you called MetadataValidator.validate rather than just validate.

fail("Expected exception but did not thrown");
} catch (InvalidParquetMetadataException e) {
assertTrue("Exception message does not contain the expected parts",
e.getMessage().contains("pageHeader.compressed_page_size"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't there an assertion helper so you don't need to catch the exception? Something like assertThrows in the codebase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is something already implemented but in another module and I cannot use it here.

* A specific RuntimeException thrown when invalid values are found in the Parquet file metadata (including the
* footer, page header etc.).
*/
public static class InvalidParquetMetadataException extends RuntimeException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I'd prefer it if the exception weren't an inner class since that makes it harder to reference. But this isn't a blocker.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a fair point anyway. I'll move it before merging. Thanks a lot for the review.

* page header etc.).
*/
public class InvalidParquetMetadataException extends RuntimeException {
<T> InvalidParquetMetadataException(String message) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the type parameter for?

import org.apache.parquet.format.Util.DefaultFileMetaDataConsumer;
import org.junit.Test;

import org.apache.parquet.format.Util.DefaultFileMetaDataConsumer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: unnecessary change.

@gszadovszky gszadovszky merged commit 1695d92 into apache:master Sep 30, 2021
gszadovszky added a commit that referenced this pull request Sep 30, 2021
gszadovszky added a commit that referenced this pull request Sep 30, 2021
@rdblue
Copy link
Contributor

rdblue commented Oct 3, 2021

@gszadovszky, thanks for getting this done.

sunchao pushed a commit to sunchao/parquet-mr that referenced this pull request Mar 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants