-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21046][SQL] simplify the array offset and length in ColumnVector #18260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| if (this.arrayLengths != null) { | ||
| System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity); | ||
| System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); | ||
| // need 2 ints as offset and length for each array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add a comment regarding that intData[] is used for other purpose compared to the original intention (i.e. storing int column data).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea good idea.
| * Returns the offset of the array at rowid. | ||
| */ | ||
| public abstract int getArrayOffset(int rowId); | ||
| public void arrayWriteEnd(int rowId, int offset, int length) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assumed that there is arrayWriteStart since we newly created arrayWriteEnd.
After reading a comment, I understood your intention. Is there any other method name (e.g. arraySetColumn)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no there is no arrayWriteStart.
Maybe I should pick another name, how about putArrayOffsetAndLength or just keep the original name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like putArrayOffsetAndLength
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
|
Test build #77874 has finished for PR 18260 at commit
|
| // The data stored in these two allocations need to maintain binary compatible. We can | ||
| // directly pass this buffer to external components. | ||
| private long nulls; | ||
| // The actually data of this column vector will be store here. If it's an array column vector, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: will be store -> will be stored ?
| // Array for each type. Only 1 is populated for any type. | ||
| private byte[] byteData; | ||
| private short[] shortData; | ||
| // This is not used used to store data for int column vector, but also can store offsets and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: This is not used used to store -> This is not used to store ?
| Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4); | ||
| // need 2 ints as offset and length for each array. | ||
| this.data = Platform.reallocateMemory(data, oldCapacity * 8, newCapacity * 8); | ||
| putInt(0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to putInt here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually we don't, as the new memory region should be filled with 0. This is just a safe guard to be more robust about the underlying memory allocation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had second thought that putArrayOffsetAndLength(0,0,0) is better for the purpose of the guard?
| } | ||
| arrayLengths = newLengths; | ||
| arrayOffsets = newOffsets; | ||
| putInt(0, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto
| // need 2 ints as offset and length for each array. | ||
| if (intData == null || intData.length < newCapacity * 2) { | ||
| int[] newData = new int[newCapacity * 2]; | ||
| if (intData != null) System.arraycopy(intData, 0, newData, 0, capacity * 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should pass intData.length instead of capacity * 2 ?
nvm: here intData.length should be capacity * 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
| column.putArray(3, 3, 3) | ||
| column.putArrayOffsetAndLength(0, 0, 1) | ||
| column.putArrayOffsetAndLength(1, 1, 2) | ||
| column.putArrayOffsetAndLength(2, 2, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be column.putArrayOffsetAndLength(2, 3, 0) ?
|
LGTM except for one comment, pending Jenkins. |
| Platform.reallocateMemory(lengthData, oldCapacity * 4, newCapacity * 4); | ||
| this.offsetData = | ||
| Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4); | ||
| // need 2 ints as offset and length for each array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: for each array -> for each array element.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. This's quite a bit ambiguous. Nvm. for each array is good.
|
Test build #77902 has finished for PR 18260 at commit
|
|
LGTM except for one question regarding array capacity after this change. |
| private short[] shortData; | ||
| // This is not only used to store data for int column vector, but also can store offsets and | ||
| // lengths for array column vector. | ||
| private int[] intData; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question I just have is, the capacity of ColumnVector is bound by MAX_CAPACITY. Previously we store offset and length individually, so we can have MAX_CAPACITY arrays at most. Now we store offset and length together in data/intData which is bound to MAX_CAPACITY, doesn't it say we can just have MAX_CAPACITY / 2 arrays at most?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Is it possible to use longData, which has a pair of 32-bit offset and length, to keep MAX_CAPACITY array length?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I see. We only check the limit of MAX_CAPACITY before actually going into reserveInternal.
Even we pass this check, we still face problem when allocating intData, see the comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kiszk Do you meant we store a pair of offset/length together as an element in longData?
|
Test build #77903 has finished for PR 18260 at commit
|
| System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity); | ||
| // need 2 ints as offset and length for each array. | ||
| if (intData == null || intData.length < newCapacity * 2) { | ||
| int[] newData = new int[newCapacity * 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
newCapacity here can be MAX_CAPACITY at most. When newCapacity is more than MAX_CAPACITY / 2, seems this allocation would cause problem?
|
Test build #77900 has finished for PR 18260 at commit
|
|
Test build #77905 has finished for PR 18260 at commit
|
| private byte[] byteData; | ||
| private short[] shortData; | ||
| private int[] intData; | ||
| // This is not only used to store data for int column vector, but also can store offsets and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int column vector -> long column vector.
| */ | ||
| public abstract int getArrayOffset(int rowId); | ||
| public void putArrayOffsetAndSize(int rowId, int offset, int size) { | ||
| long offsetAndSize = (offset << 32) | size; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
offset should be converted to long before shifting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch
|
LGTM except for two comments. |
|
Test build #77912 has finished for PR 18260 at commit
|
|
Test build #77933 has finished for PR 18260 at commit
|
|
|
||
| reference.zipWithIndex.foreach { v => | ||
| assert(v._1.length == column.getArrayLength(v._2), "MemoryMode=" + memMode) | ||
| assert(v._1.length == column.getInt(v._2 * 2 + 1), "MemoryMode=" + memMode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, we should also change here.
|
Test build #77936 has finished for PR 18260 at commit
|
|
The spark R test failure is not related, I'm merging it to master, thanks for your review! |
|
Why are we doing this? Isn't it better potentially for compression to store them separately? We can also easily remove the offset for fixed length arrays. |
|
Sorry I missed this part. Currently we don't externalize |
## What changes were proposed in this pull request? Currently when a `ColumnVector` stores array type elements, we will use 2 arrays for lengths and offsets and implement them individually in on-heap and off-heap column vector. In this PR, we use one array to represent both offsets and lengths, so that we can treat it as `ColumnVector` and all the logic can go to the base class `ColumnVector` ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#18260 from cloud-fan/put.
What changes were proposed in this pull request?
Currently when a
ColumnVectorstores array type elements, we will use 2 arrays for lengths and offsets and implement them individually in on-heap and off-heap column vector.In this PR, we use one array to represent both offsets and lengths, so that we can treat it as
ColumnVectorand all the logic can go to the base classColumnVectorHow was this patch tested?
existing tests.