Old Parquet files with wrong Compressed Size not Readable #2926

pyckle · 2024-06-23T17:01:19Z

In certain circumstances, the CLI fails to read old (perhaps ancient) Parquet files whose column metadata carries an incorrect compressed_size field, one that does not include the dictionary page (at least according to the comment in the code). The code that is supposed to handle this case reads the extra bytes into a byte buffer but never flips that buffer afterwards. It appears to have been broken for a few years now.
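To illustrate why a missing flip matters, here is a minimal, self-contained Java sketch. It is not the parquet-java code: `FlipSketch` and `readExtraBytes` are hypothetical names, and the four-byte input merely stands in for the extra bytes. It only shows the general `java.nio.ByteBuffer` behaviour, namely that a buffer filled by writing has nothing left to read until `flip()` is called.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical illustration; names do not come from parquet-java.
class FlipSketch {

    // Reads exactly `count` bytes from `in` into a new heap buffer.
    // With flip == false the buffer is returned as writing left it:
    // position == count and limit == capacity, so remaining() is 0.
    static ByteBuffer readExtraBytes(InputStream in, int count, boolean flip) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(count);
        while (buffer.hasRemaining()) {
            int n = in.read(buffer.array(), buffer.position(), buffer.remaining());
            if (n < 0) {
                throw new EOFException("stream ended before " + count + " bytes were read");
            }
            buffer.position(buffer.position() + n);
        }
        if (flip) {
            // Sets limit to the current position and position to 0,
            // which makes the bytes just written available for reading.
            buffer.flip();
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3, 4};
        ByteBuffer unflipped = readExtraBytes(new ByteArrayInputStream(data), data.length, false);
        ByteBuffer flipped = readExtraBytes(new ByteArrayInputStream(data), data.length, true);
        System.out.println("unflipped remaining = " + unflipped.remaining());
        System.out.println("flipped remaining   = " + flipped.remaining());
    }
}
```

Any consumer that reads from the unflipped buffer sees it as empty, even though the bytes were read from the stream correctly.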

I have written a PR that includes a defective Parquet file exhibiting this issue and a unit test that fails without the additional flip, and I have validated that the code works once the flip is added.

This is a minor issue that I found while learning the code rather than while addressing a production problem, so there's no urgency.

pyckle added a commit to pyckle/parquet-java that referenced this issue Jun 23, 2024

Fix error reading files with buggy compressed size apache#2926

502c80d
