[BUG] Fix Parquet reads with chunk sizing #2658
Merged
Another followup to #2586.
Problem statement
#2586 incorrectly handles value reading and chunking. In that PR, only local tests were used. Locally, chunk sizes of up to `128 * 1024` rows are allowed, so the chunk size exceeded the total number of rows to read. However, non-local reads such as reads from S3 instead have a default chunk size of `2048`. This results in a scenario where the chunk size is less than the total number of rows to read.

When this happens, if the row count of a data page aligns with the chunk size, we continue reading the next data page to see if the last row contains more leaf values. If the first value belongs to a new record, the number of rows seen is incremented. It is then always the case that `rows read > additional rows to read` (which is 0), so the exit condition of `rows read == additional rows to read` is never fulfilled, and we continue reading values into the chunk until the page runs out of values. This can repeat for every subsequent data page.

The end result is that we can have columns with incorrectly sized chunks that are incongruous with the chunk sizes of other columns, causing Daft to error out.
TLDR: chunk sizes were not being respected during parquet reads.
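For intuition, here is a heavily simplified sketch of the failure mode (this is not Daft's actual reader code; names such as `read_page_buggy`, `rep_levels`, `rows_read`, and `additional_rows_to_read` are illustrative). The point is the ordering: the value is consumed first and the strict equality exit check runs afterwards, so a record boundary at the start of the next page pushes the counter past the target and the check never fires.

```rust
// Illustrative sketch only: decode one page's repetition levels into the current chunk.
// A repetition level of 0 marks the start of a new record (row).
fn read_page_buggy(
    rep_levels: &[u32],
    rows_read: &mut usize,
    additional_rows_to_read: usize, // rows still needed to complete the current chunk
    chunk: &mut Vec<u32>,
) {
    for &rep in rep_levels {
        if rep == 0 {
            *rows_read += 1;
        }
        chunk.push(rep);

        // The exit check runs *after* the value is consumed. If the previous page already
        // satisfied the chunk and this page's first value starts a new record, `rows_read`
        // jumps past `additional_rows_to_read`, this strict equality never holds, and the
        // rest of the page is drained into an oversized chunk.
        if *rows_read == additional_rows_to_read {
            return;
        }
    }
}
```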
Solution
Instead of checking the `rows read == additional rows to read` condition at the end of the loop where we iterate through a page's values, we move the check to the start and `peek` at the value to decide whether we should continue iterating for the current chunk.

Additionally, we modify the change in #2643 so that the remaining number of values to read is zeroed out iff the number of rows read equals the total number of rows to read, and not when the number of rows read equals the number of additional rows to read (which only applies to the current chunk).
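A sketch of the corrected ordering under the same illustrative names (again, a simplified model of the check-then-peek behaviour described above, not the literal patch):

```rust
use std::iter::Peekable;

// Illustrative sketch only: check whether the chunk already has all the rows it needs
// *before* consuming the next value, peeking at its repetition level so that values
// belonging to the next chunk are left unconsumed.
fn read_page_fixed<I: Iterator<Item = u32>>(
    rep_levels: &mut Peekable<I>,
    rows_read: &mut usize,
    additional_rows_to_read: usize,
    chunk: &mut Vec<u32>,
) {
    loop {
        match rep_levels.peek() {
            // A repetition level of 0 would start a new record; if the chunk is already
            // full, stop here without consuming the value.
            Some(&0) if *rows_read == additional_rows_to_read => return,
            Some(_) => {
                let rep = rep_levels.next().unwrap();
                if rep == 0 {
                    *rows_read += 1;
                }
                chunk.push(rep);
            }
            None => return, // page exhausted; the caller moves on to the next page
        }
    }
}
```

Trailing leaf values of the last row (repetition level > 0) are still consumed into the current chunk, which is why the check only stops at a value that would start a new record.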
Example
As an example, consider a parquet file with the schema `nested struct<field0 string, field1 string>`. Let `field0` be dictionary encoded while `field1` uses fallback encoding. Given `4097` rows, we might get the following page layout:

Before this PR, after page `0-2` is read, we've read enough rows to fill up a chunk of size `2048` (which is our default chunk size when reading from S3). However, from #2586, we still read page `0-3` to check whether the last row contains multiple leaf values. Before #2643, what happens is that we see a repetition level of 0, so we increment the number of rows seen; now `rows seen > additional rows to read for the page`, and we never fulfill the strict `rows seen == additional rows to read` condition to stop reading into the chunk. After #2643, we correctly note that the chunk is full and exit, but we have also consumed a value that belongs to the next chunk, so we end up with insufficient values in the end.