Reading FIXED_LEN_BYTE_ARRAY columns with nulls is inefficient #6296

Closed
etseidl opened this issue Aug 23, 2024 · 1 comment · Fixed by #6297
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet Changes to the parquet crate

etseidl commented Aug 23, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When reading a Parquet file that has FIXED_LEN_BYTE_ARRAY columns containing nulls, one necessary operation is moving the fixed-length data to the correct location within the output buffer to account for null slots. This is handled by the pad_nulls function in the ValuesBuffer trait. The inner loop of this function

  for i in 0..byte_length {
      self.buffer[level_pos_bytes + i] = self.buffer[value_pos_bytes + i]
  }

works well when the fixed width is low (<= 4), but for larger widths this loop is quite inefficient.

Describe the solution you'd like
Rewriting the inner loop for longer fixed-size arrays can speed this operation up considerably. In particular, copying between two non-overlapping slices of the buffer lets the compiler vectorize the move, e.g.

  let split = self.buffer.split_at_mut(level_pos_bytes);
  let dst = &mut split.1[..byte_length];
  let src = &split.0[value_pos_bytes..value_pos_bytes + byte_length];
  for i in 0..byte_length {
      dst[i] = src[i]
  }
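
To show the proposed inner loop in context, here is a minimal, self-contained sketch of a null-padding pass over a fixed-width buffer. It is not the actual pad_nulls implementation: the function name pad_nulls_fixed and the `valid: &[bool]` mask (standing in for a bit-packed validity bitmap) are assumptions made for illustration. Values are moved from back to front so that no packed value is overwritten before it has been relocated.

```rust
/// Hypothetical, simplified stand-in for `pad_nulls`: spreads densely packed
/// fixed-width values out to the slots marked `true` in `valid`.
/// Null slots keep whatever bytes they already hold.
fn pad_nulls_fixed(buffer: &mut Vec<u8>, byte_length: usize, valid: &[bool]) {
    // Number of packed (non-null) values currently at the front of the buffer.
    let values_read = buffer.len() / byte_length;
    // Grow the buffer to hold one slot per level, null or not.
    buffer.resize(valid.len() * byte_length, 0);

    // Indices of valid slots, visited from last to first.
    let valid_positions = valid
        .iter()
        .enumerate()
        .rev()
        .filter_map(|(i, &is_valid)| is_valid.then_some(i));

    for (value_pos, level_pos) in (0..values_read).rev().zip(valid_positions) {
        // Once a value is already in its final slot, all earlier ones are too.
        if level_pos <= value_pos {
            break;
        }
        let level_pos_bytes = level_pos * byte_length;
        let value_pos_bytes = value_pos * byte_length;

        // Split so source and destination are provably disjoint slices;
        // this holds because value_pos < level_pos.
        let split = buffer.split_at_mut(level_pos_bytes);
        let dst = &mut split.1[..byte_length];
        let src = &split.0[value_pos_bytes..value_pos_bytes + byte_length];
        for i in 0..byte_length {
            dst[i] = src[i];
        }
    }
}
```

For example, calling this with a buffer of three packed 2-byte values and the mask `[true, false, true, false, true]` leaves the values in slots 0, 2 and 4 of a 10-byte buffer.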

Describe alternatives you've considered
I tried Vec::copy_within but it was slower than the vectorized copy.
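
For comparison, the copy_within alternative would presumably replace the inner loop with a single call along these lines. This is only a sketch: the helper name move_value_copy_within is hypothetical, and the method itself is the standard slice copy_within, which Vec exposes through deref.

```rust
/// Hypothetical helper: moves one fixed-width value within `buffer` using the
/// standard library's slice `copy_within` instead of an explicit loop.
fn move_value_copy_within(
    buffer: &mut [u8],
    value_pos_bytes: usize,
    level_pos_bytes: usize,
    byte_length: usize,
) {
    // Copies `byte_length` bytes starting at `value_pos_bytes` to `level_pos_bytes`.
    buffer.copy_within(value_pos_bytes..value_pos_bytes + byte_length, level_pos_bytes);
}
```

copy_within allows the source and destination ranges to overlap, which may be one reason it optimizes differently from the split-slice loop.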

Additional context

@etseidl etseidl added the enhancement Any new improvement worthy of a entry in the changelog label Aug 23, 2024
@alamb alamb added the parquet Changes to the parquet crate label Oct 2, 2024

alamb commented Oct 2, 2024

label_issue.py automatically added labels {'parquet'} from #6297
