Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ByteView should be using signed types? #6172

Open
a10y opened this issue Aug 1, 2024 · 2 comments
Open

ByteView should be using signed types? #6172

a10y opened this issue Aug 1, 2024 · 2 comments
Labels
question Further information is requested

Comments

@a10y
Copy link
Contributor

a10y commented Aug 1, 2024

Which part is this question about

Related to the continuation of StringView/BinaryView support in #6163

In arrow_data, the ByteView type is used to encapsulate this structure from the spec:

image

Notably, the spec dictates that all of these values must be signed integers. However, we're using u32.

The arrow-rs builder for GenericByteViewArray doesn't seem to have any range checks on the block, offset and len values for the view structure, which means, I think, you can happily construct a StringView array with arrow-rs, and then attempt to pass it to PyArrow or Java over IPC and have it fail at runtime.

Describe your question

Should we either

  1. be using i32 instead of u32 internally
  2. be adding constraints on the builder methods to ensure that we don't allow adding strings > 2GB
  3. Has someone noticed this before and addressed it and it's not actually a problem
@a10y a10y added the question Further information is requested label Aug 1, 2024
@a10y a10y changed the title ByteView should be using signed types ByteView should be using signed types? Aug 1, 2024
@alamb
Copy link
Contributor

alamb commented Aug 1, 2024

I think using i32 in the code to conform to the spec makes sense -- arrow normally uses signed integers to allow interop with Java as I understand.

maybe @tustvold or @ariesdevil remembers why we used u32 instead 🤔

@tustvold
Copy link
Contributor

tustvold commented Aug 1, 2024

I think the original version of the spec used unsigned which was what the implementation was then based off, I don't see an issue with using i32 or clamping u32

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

3 participants