[SPARK-39575][AVRO] add ByteBuffer#rewind after ByteBuffer#get in Avr… #36973
Conversation
@gengliangwang Would you like to review this PR?
+CC @xkrogen
I think you're right but can you explain why this is needed?
HeapByteBuffer.get(bytes) copies the data from the current position to the end of the buffer into bytes and moves the position to the end, so the next call returns empty bytes. You can take a look at the added unit test: without the fix, the second call of the deserializer returns an InternalRow with an empty binary column.
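For illustration, here is a minimal standalone sketch of the behaviour described above. It uses plain `java.nio` only and is not the AvroDeserializer code; the class and method names (`ByteBufferRewindDemo`, `readWithoutRewind`, `readWithRewind`) are made up for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class, not part of Spark or Avro.
final class ByteBufferRewindDemo {

    // Copies the remaining bytes and leaves the position at the limit,
    // so a second call on the same buffer sees nothing left to read.
    static byte[] readWithoutRewind(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        return bytes;
    }

    // Same read, followed by rewind(), which resets the position to 0
    // so the same buffer can be read again.
    static byte[] readWithRewind(ByteBuffer buf) {
        byte[] bytes = new byte[buf.remaining()];
        buf.get(bytes);
        buf.rewind();
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer a = ByteBuffer.wrap(new byte[] {1, 2, 3});
        System.out.println(readWithoutRewind(a).length); // 3
        System.out.println(readWithoutRewind(a).length); // 0

        ByteBuffer b = ByteBuffer.wrap(new byte[] {1, 2, 3});
        System.out.println(readWithRewind(b).length); // 3
        System.out.println(readWithRewind(b).length); // 3
    }
}
```

Note that `rewind()` resets the position to 0 rather than to wherever it was before the read; the sketch assumes the buffer starts at position 0, as a buffer created with `ByteBuffer.wrap(array)` does.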
Sounds good. I wonder why this never surfaced before. It seems like it would mean any binary cols in Avro don't work?
Maybe it is just not common to call the deserializer twice on the same Avro data object.
xkrogen
left a comment
Code changes LGTM. I checked the other place the deserializer uses a ByteBuffer, in the (BYTES, _: DecimalType) case, and it doesn't have the same problem because decimalConversions.fromBytes() duplicates the ByteBuffer before extracting from it.
Can you update the PR description to have more details on why this is needed, basically what you described in your comment? It might also be helpful to update the summary to be more descriptive about the impact rather than the mechanics of the change, something like "Fix repeated deserialization of BYTES type in AvroDeserializer" (and then the body can describe the mechanical/technical aspects of the change).
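As a rough sketch of why a duplicate-based read does not hit this problem: `ByteBuffer#duplicate` returns a view that shares the bytes but has its own position, so reading from the view leaves the original buffer untouched. This only illustrates the general `java.nio` technique; it is not the code of `decimalConversions.fromBytes()`, and the names `DuplicateReadDemo` and `readViaDuplicate` are hypothetical.

```java
import java.nio.ByteBuffer;

// Hypothetical demo class, not part of Spark or Avro.
final class DuplicateReadDemo {

    // Reads the remaining bytes through a duplicate view, so the
    // position of the caller's buffer is never advanced.
    static byte[] readViaDuplicate(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        byte[] bytes = new byte[view.remaining()];
        view.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {4, 5, 6});
        System.out.println(readViaDuplicate(buf).length); // 3
        System.out.println(readViaDuplicate(buf).length); // 3
        System.out.println(buf.position());               // 0
    }
}
```

Unlike `rewind()`, this approach also preserves a non-zero starting position, at the cost of allocating a small view object per read.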
Why is there a nested type here instead of just "type": "bytes" at the top-level?
This is not necessary and it has been removed.
@xkrogen I have already updated the PR description.
Can one of the admins verify this patch?
Can you try rebasing and pushing? Some doc build steps failed, which is unrelated to the change, but rebasing might resolve it.
@srowen Thanks for your help. I have rebased and pushed. The GitHub Actions checks now all pass.
…oDeserializer

### What changes were proposed in this pull request?
Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.

### Why are the changes needed?
- HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
- The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add ut in AvroCatalystDataConversionSuite.

Closes #36973 from wzx140/avro-fix.

Authored-by: wangzixuan.wzxuan
Signed-off-by: Sean Owen
(cherry picked from commit 558b395)
Signed-off-by: Sean Owen
Merged to master/3.3/3.2
…oDeserializer
What changes were proposed in this pull request?
Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.
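Why are the changes needed?
- HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
- The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.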
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a unit test in AvroCatalystDataConversionSuite.