@alamb
Contributor

@alamb alamb commented Jan 13, 2026

Which issue does this PR close?

- Closes #9157.

Rationale for this change

@jhorstmann found that UTF-8 validation can be bypassed by abusing the ArrayData APIs when converting a BinaryViewArray to a StringViewArray.

What changes are included in this PR?

  1. Add an assert to prevent the bypass
  2. Add tests
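
To illustrate the idea behind the new assert, here is a minimal toy sketch using only the standard library. The `DataType`, `ArrayData`, and `Utf8View` types below are hypothetical stand-ins, not the arrow-rs definitions, and the assert message is made up. The sketch assumes that data tagged `Utf8View` was validated when it was built, so the only thing the conversion needs to check is the tag.

```rust
// Toy model only: hypothetical stand-ins, not the arrow-rs types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    BinaryView,
    Utf8View,
}

/// Untyped container: a type tag plus raw bytes.
/// Assumption: bytes tagged `Utf8View` were UTF-8 checked when built,
/// bytes tagged `BinaryView` were not.
pub struct ArrayData {
    pub data_type: DataType,
    pub bytes: Vec<u8>,
}

/// Typed wrapper whose readers assume `bytes` is valid UTF-8.
#[derive(Debug)]
pub struct Utf8View {
    bytes: Vec<u8>,
}

impl Utf8View {
    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }
}

impl From<ArrayData> for Utf8View {
    fn from(data: ArrayData) -> Self {
        // The guard: refuse data carrying any other type tag. Without it,
        // `BinaryView` bytes (never UTF-8 checked) would be accepted as strings.
        assert_eq!(
            data.data_type,
            DataType::Utf8View,
            "mismatched data type, expected Utf8View"
        );
        Utf8View { bytes: data.bytes }
    }
}
```

In this model, converting `ArrayData` tagged `BinaryView` panics instead of silently producing a string view over unchecked bytes.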

Are these changes tested?

Yes, new unit tests are added

Are there any user-facing changes?

Yes: the conversion now panics (via the new assert) if these APIs are misused.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 13, 2026

#[should_panic(expected = "invalid utf-8 sequence")]
#[test]
fn invalid_array_data() {
Contributor Author

This test also fails on main but I wanted to make it super clear you can't build an invalid Utf8ViewArray with the ArrayDataBuilder (as expected)
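
As a rough illustration of why a validating builder rejects such input, the sketch below models a builder whose `build` step checks UTF-8 for string-typed data. Every name here (`ArrayDataBuilder::new`, `bytes`, `build`, the `DataType` enum) is a hypothetical toy and does not reproduce the arrow-rs `ArrayDataBuilder` API or its error type.

```rust
use std::str::Utf8Error;

// Toy model only: hypothetical stand-ins, not the arrow-rs types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    BinaryView,
    Utf8View,
}

#[derive(Debug)]
pub struct ArrayData {
    data_type: DataType,
    bytes: Vec<u8>,
}

impl ArrayData {
    pub fn data_type(&self) -> DataType {
        self.data_type
    }

    pub fn bytes(&self) -> &[u8] {
        &self.bytes
    }
}

pub struct ArrayDataBuilder {
    data_type: DataType,
    bytes: Vec<u8>,
}

impl ArrayDataBuilder {
    pub fn new(data_type: DataType) -> Self {
        Self { data_type, bytes: Vec::new() }
    }

    pub fn bytes(mut self, bytes: Vec<u8>) -> Self {
        self.bytes = bytes;
        self
    }

    /// Validating build: string-typed data must be valid UTF-8,
    /// binary-typed data is accepted as-is.
    pub fn build(self) -> Result<ArrayData, Utf8Error> {
        if self.data_type == DataType::Utf8View {
            std::str::from_utf8(&self.bytes)?;
        }
        Ok(ArrayData { data_type: self.data_type, bytes: self.bytes })
    }
}
```

Calling `build().unwrap()` on invalid bytes tagged `Utf8View` would panic in this model, which is the shape of behavior a `#[should_panic]` test can pin down.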

  let views = ScalarBuffer::new(views, offset, len);
  Self {
-     data_type: T::DATA_TYPE,
+     data_type,
Contributor Author

@jhorstmann noted that reusing data_type here might be faster as it avoids a call to DataType::drop 🤷
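
The point about avoiding a drop can be sketched with a toy type that counts destructor calls. This is only an illustration of move semantics under assumed names (`DataType`, `Array`, `rebuild`, `reuse`, `drops` are all hypothetical); it does not measure performance and says nothing about how costly the real `DataType` destructor is.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many times a toy `DataType` value has been dropped.
static DROPS: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a type descriptor with a destructor.
pub struct DataType(pub &'static str);

impl Drop for DataType {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

pub struct Array {
    pub data_type: DataType,
}

/// Discards the incoming value and constructs a fresh one:
/// the destructor of `incoming` runs inside this function.
pub fn rebuild(incoming: DataType) -> Array {
    drop(incoming);
    Array { data_type: DataType("Utf8View") }
}

/// Moves the incoming value into the array: no destructor runs here.
pub fn reuse(incoming: DataType) -> Array {
    Array { data_type: incoming }
}

/// Current number of observed drops.
pub fn drops() -> usize {
    DROPS.load(Ordering::SeqCst)
}
```

Both functions return an array with the same type tag; the only difference in this model is whether the incoming value is dropped along the way.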

  fn from(data: ArrayData) -> Self {
-     let (_data_type, len, nulls, offset, mut buffers, _child_data) = data.into_parts();
+     let (data_type, len, nulls, offset, mut buffers, _child_data) = data.into_parts();
+     assert_eq!(
Contributor Author

This is the equivalent check in GenericByteArray:

Contributor

@scovich scovich left a comment

LGTM

Contributor

@etseidl etseidl left a comment

LGTM. Verified test fails if the assert is removed.

Co-authored-by: Martin Hilton <mhilton@influxdata.com>
@Dandandan
Contributor

One conflict!

@alamb
Contributor Author

alamb commented Jan 14, 2026

Conflict resolved so merging!

@alamb alamb merged commit 991eb2a into apache:main Jan 14, 2026
26 checks passed
Dandandan pushed a commit to Dandandan/arrow-rs that referenced this pull request Jan 15, 2026
…iewArray (apache#9158)

# Which issue does this PR close?

- closes apache#9157

# Rationale for this change

@jhorstmann found it is possible to bypass utf8 validation by abusing
the ArrayData APIs

# What changes are included in this PR?

1. Add an assert to prevent the bypass
2. Add tests

# Are these changes tested?

Yes, new unit tests are added

# Are there any user-facing changes?

error if APIs are misused

---------

Co-authored-by: Martin Hilton <mhilton@influxdata.com>
Development

Successfully merging this pull request may close these issues.

Possible to bypass the Utf8 check when converting a BinaryViewArray to StringViewArray

6 participants