Skip to content

Conversation

@wzx140
Copy link
Contributor

@wzx140 wzx140 commented Jun 24, 2022

…oDeserializer

What changes were proposed in this pull request?

Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.

Why are the changes needed?

  • HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
  • The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Add ut in AvroCatalystDataConversionSuite.

@wzx140
Copy link
Contributor Author

wzx140 commented Jun 24, 2022

@gengliangwang Would you like to review this pr?

@mridulm
Copy link
Contributor

mridulm commented Jun 24, 2022

+CC @xkrogen

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right but can you explain why this is needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes. You can take a look at added unit test. The second call of deserializer will return an InternalRow with empty binary column.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I wonder why this never surfaced before? seems like it would mean any binary cols in Avro don't work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is not common to call this twice to deserialize one avro data object

Copy link
Contributor

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code changes LGTM. I checked the other place the deserializer uses a ByteBuffer, in the (BYTES, _: DecimalType) case, and it doesn't have the same problem because decimalConversions.fromBytes() duplicates the ByteBuffer before extracting from it.

Can you update the PR description to have more details on why this is needed, basically what you described in your comment? It might also be helpful to update the summary to be more descriptive about the impact rather than the mechanics of the change, something like "Fix repeated deserialization of BYTES type in AvroDeserializer" (and then the body can describe the mechanical/technical aspects of the change).

Comment on lines 373 to 375
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is there a nested type here instead of just "type": "bytes" at the top-level?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary and it has been removed.

@wzx140
Copy link
Contributor Author

wzx140 commented Jun 24, 2022

Code changes LGTM. I checked the other place the deserializer uses a ByteBuffer, in the (BYTES, _: DecimalType) case, and it doesn't have the same problem because decimalConversions.fromBytes() duplicates the ByteBuffer before extracting from it.

Can you update the PR description to have more details on why this is needed, basically what you described in your comment? It might also be helpful to update the summary to be more descriptive about the impact rather than the mechanics of the change, something like "Fix repeated deserialization of BYTES type in AvroDeserializer" (and then the body can describe the mechanical/technical aspects of the change).

@xkrogen Already updated the PR description.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Jun 26, 2022

Can you try rebasing and pushing? some doc build steps failed, which is unrelated to the change, but might resolve it

@wzx140
Copy link
Contributor Author

wzx140 commented Jun 27, 2022

@srowen Thanks for your help. I have rebased and pushed. Now github Action is all successful.

@srowen srowen closed this in 558b395 Jun 27, 2022
srowen pushed a commit that referenced this pull request Jun 27, 2022
…oDeserializer

### What changes were proposed in this pull request?
Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.

### Why are the changes needed?
- HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
- The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add ut in AvroCatalystDataConversionSuite.

Closes #36973 from wzx140/avro-fix.

Authored-by: wangzixuan.wzxuan <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 558b395)
Signed-off-by: Sean Owen <[email protected]>
srowen pushed a commit that referenced this pull request Jun 27, 2022
…oDeserializer

### What changes were proposed in this pull request?
Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.

### Why are the changes needed?
- HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
- The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add ut in AvroCatalystDataConversionSuite.

Closes #36973 from wzx140/avro-fix.

Authored-by: wangzixuan.wzxuan <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 558b395)
Signed-off-by: Sean Owen <[email protected]>
@srowen
Copy link
Member

srowen commented Jun 27, 2022

Merged to master/3.3/3.2

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…oDeserializer

### What changes were proposed in this pull request?
Add ByteBuffer#rewind after ByteBuffer#get in AvroDeserializer.

### Why are the changes needed?
- HeapBuffer.get(bytes) puts the data from POS to the end into bytes, and sets POS as the end. The next call will return empty bytes.
- The second call of AvroDeserializer will return an InternalRow with empty binary column when avro record has binary column.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add ut in AvroCatalystDataConversionSuite.

Closes apache#36973 from wzx140/avro-fix.

Authored-by: wangzixuan.wzxuan <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 558b395)
Signed-off-by: Sean Owen <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants