Conversation

@cloud-fan
Contributor

What changes were proposed in this pull request?

Currently, when a ColumnVector stores array-type elements, we use two separate arrays for lengths and offsets and implement them individually in the on-heap and off-heap column vectors.

In this PR, we use one array to represent both offsets and lengths, so that we can treat it as a ColumnVector and move all the logic into the base class ColumnVector.
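The single-array layout described above can be sketched as follows. This is an illustrative class, not Spark's actual API: one int[] replaces the separate offset and length arrays by storing two ints per row, the same layout an int column of twice the capacity would use, which is what lets the base class reuse its int get/put logic.

```java
// Sketch of the combined layout (illustrative names, not Spark's API):
// slot 2*rowId holds the offset, slot 2*rowId + 1 holds the length.
public class PackedArrayMeta {
    private final int[] intData;  // 2 ints per row: [offset, length, offset, length, ...]

    public PackedArrayMeta(int capacity) {
        this.intData = new int[capacity * 2];
    }

    public void putArrayOffsetAndLength(int rowId, int offset, int length) {
        intData[rowId * 2] = offset;
        intData[rowId * 2 + 1] = length;
    }

    public int getArrayOffset(int rowId) {
        return intData[rowId * 2];
    }

    public int getArrayLength(int rowId) {
        return intData[rowId * 2 + 1];
    }
}
```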

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cloud-fan commented Jun 10, 2017

cc @ueshin @sameeragarwal @kiszk

if (this.arrayLengths != null) {
System.arraycopy(this.arrayLengths, 0, newLengths, 0, capacity);
System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity);
// need 2 ints as offset and length for each array.
Member

It would be good to add a comment noting that intData[] is used for a purpose other than its original intention (i.e. storing int column data).

Contributor Author

yea good idea.

* Returns the offset of the array at rowid.
*/
public abstract int getArrayOffset(int rowId);
public void arrayWriteEnd(int rowId, int offset, int length) {
Member

I assumed there would be an arrayWriteStart, since we newly created arrayWriteEnd. After reading the comment, I understood your intention. Is there a better method name (e.g. arraySetColumn)?

Contributor Author

No, there is no arrayWriteStart.

Maybe I should pick another name; how about putArrayOffsetAndLength, or should we just keep the original name?

Member

I like putArrayOffsetAndLength

Member

+1

@SparkQA

SparkQA commented Jun 10, 2017

Test build #77874 has finished for PR 18260 at commit d4267b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// The data stored in these two allocations need to maintain binary compatible. We can
// directly pass this buffer to external components.
private long nulls;
// The actually data of this column vector will be store here. If it's an array column vector,
Member

nit: will be store -> will be stored ?

// Array for each type. Only 1 is populated for any type.
private byte[] byteData;
private short[] shortData;
// This is not used used to store data for int column vector, but also can store offsets and
Member

nit: This is not used used to store -> This is not used to store ?

Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4);
// need 2 ints as offset and length for each array.
this.data = Platform.reallocateMemory(data, oldCapacity * 8, newCapacity * 8);
putInt(0, 0);
Member

Do we need to putInt here?

Contributor Author

Actually we don't, as the new memory region should be filled with 0. This is just a safeguard to be more robust against the underlying memory allocation details.

Member

I see, thanks.

Member

On second thought, would putArrayOffsetAndLength(0, 0, 0) be better for the purpose of the guard?

}
arrayLengths = newLengths;
arrayOffsets = newOffsets;
putInt(0, 0);
Member

ditto

// need 2 ints as offset and length for each array.
if (intData == null || intData.length < newCapacity * 2) {
int[] newData = new int[newCapacity * 2];
if (intData != null) System.arraycopy(intData, 0, newData, 0, capacity * 2);
Member

@ueshin ueshin Jun 12, 2017

I guess we should pass intData.length instead of capacity * 2?
nvm: here intData.length should be capacity * 2.

Contributor Author

good catch!

column.putArray(3, 3, 3)
column.putArrayOffsetAndLength(0, 0, 1)
column.putArrayOffsetAndLength(1, 1, 2)
column.putArrayOffsetAndLength(2, 2, 0)
Member

This should be column.putArrayOffsetAndLength(2, 3, 0) ?

@ueshin
Member

ueshin commented Jun 12, 2017

LGTM except for one comment, pending Jenkins.

Platform.reallocateMemory(lengthData, oldCapacity * 4, newCapacity * 4);
this.offsetData =
Platform.reallocateMemory(offsetData, oldCapacity * 4, newCapacity * 4);
// need 2 ints as offset and length for each array.
Member

nit: for each array -> for each array element.

Member

@viirya viirya Jun 12, 2017

Oh, this is a bit ambiguous. Never mind, "for each array" is good.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77902 has finished for PR 18260 at commit 2b78043.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Jun 12, 2017

LGTM except for one question regarding array capacity after this change.

private short[] shortData;
// This is not only used to store data for int column vector, but also can store offsets and
// lengths for array column vector.
private int[] intData;
Member

@viirya viirya Jun 12, 2017

One question I just have: the capacity of a ColumnVector is bound by MAX_CAPACITY. Previously we stored offset and length individually, so we could have MAX_CAPACITY arrays at most. Now we store offset and length together in data/intData, which is bound by MAX_CAPACITY; doesn't that mean we can only have MAX_CAPACITY / 2 arrays at most?

Member

Good catch. Is it possible to use longData, with each element holding a pair of 32-bit offset and length, to keep the MAX_CAPACITY array limit?

Member

@viirya viirya Jun 12, 2017

Oh, I see. We only check the MAX_CAPACITY limit before actually going into reserveInternal.

Even if we pass this check, we still face a problem when allocating intData; see the comment below.

Member

@kiszk Do you mean we store a pair of offset/length together as an element in longData?

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77903 has finished for PR 18260 at commit 368c346.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

System.arraycopy(this.arrayOffsets, 0, newOffsets, 0, capacity);
// need 2 ints as offset and length for each array.
if (intData == null || intData.length < newCapacity * 2) {
int[] newData = new int[newCapacity * 2];
Member

newCapacity here can be MAX_CAPACITY at most. When newCapacity is more than MAX_CAPACITY / 2, it seems this allocation would cause a problem?
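The concern raised here can be seen with plain int arithmetic, assuming (as in this discussion) that MAX_CAPACITY is Integer.MAX_VALUE:

```java
// Illustrates the allocation concern: if newCapacity may reach
// Integer.MAX_VALUE (the assumed MAX_CAPACITY), sizing the backing array
// as newCapacity * 2 overflows 32-bit int arithmetic.
public class CapacityOverflowDemo {
    public static int slotsFor(int newCapacity) {
        // Wraps around once newCapacity > Integer.MAX_VALUE / 2.
        return newCapacity * 2;
    }

    public static void main(String[] args) {
        int slots = slotsFor(Integer.MAX_VALUE);
        // 0x7FFFFFFF * 2 wraps to 0xFFFFFFFE == -2, so `new int[slots]`
        // would throw NegativeArraySizeException instead of allocating.
        System.out.println(slots);
    }
}
```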

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77900 has finished for PR 18260 at commit a61ba71.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77905 has finished for PR 18260 at commit 1dae660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

@viirya @kiszk Good catch! Fixed by using a long to store offset and size.
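A hypothetical sketch of this fix (illustrative class, not the actual patch): each row's offset/size pair goes into a single long slot, so the row count stays bound by MAX_CAPACITY rather than MAX_CAPACITY / 2.

```java
// Sketch of the long-based fix (illustrative names): one long per row,
// high 32 bits = offset, low 32 bits = size.
public class LongPackedArrayMeta {
    private final long[] longData;  // one slot per row, so capacity rows fit

    public LongPackedArrayMeta(int capacity) {
        this.longData = new long[capacity];
    }

    public void putArrayOffsetAndSize(int rowId, int offset, int size) {
        // Cast offset to long before shifting, and mask size so sign
        // extension cannot pollute the high 32 bits.
        longData[rowId] = ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    public int getArrayOffset(int rowId) {
        return (int) (longData[rowId] >>> 32);
    }

    public int getArrayLength(int rowId) {
        return (int) longData[rowId];  // truncation keeps the low 32 bits
    }
}
```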

private byte[] byteData;
private short[] shortData;
private int[] intData;
// This is not only used to store data for int column vector, but also can store offsets and
Member

int column vector -> long column vector.

*/
public abstract int getArrayOffset(int rowId);
public void putArrayOffsetAndSize(int rowId, int offset, int size) {
long offsetAndSize = (offset << 32) | size;
Member

offset should be converted to long before shifting?

Member

Good catch.
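The pitfall being pointed out: for an int operand, Java takes the shift distance mod 32, so shifting an int by 32 is a no-op rather than a promotion to long. A minimal demonstration with hypothetical helper names:

```java
// Demonstrates why the cast matters: `offset << 32` on an int leaves
// offset unchanged (shift distance is taken mod 32 for int operands),
// so the offset bits are silently lost.
public class ShiftPitfall {
    public static long buggyPack(int offset, int size) {
        return (offset << 32) | size;  // offset << 32 == offset; offset bits lost
    }

    public static long fixedPack(int offset, int size) {
        return ((long) offset << 32) | (size & 0xFFFFFFFFL);
    }

    public static void main(String[] args) {
        System.out.println(buggyPack(5, 7));  // 5 | 7 == 7: the offset vanished
        System.out.println(fixedPack(5, 7));  // (5L << 32) | 7: both fields kept
    }
}
```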

@viirya
Member

viirya commented Jun 12, 2017

LGTM except for two comments.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77912 has finished for PR 18260 at commit e6e60e0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77933 has finished for PR 18260 at commit fdc0870.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


reference.zipWithIndex.foreach { v =>
assert(v._1.length == column.getArrayLength(v._2), "MemoryMode=" + memMode)
assert(v._1.length == column.getInt(v._2 * 2 + 1), "MemoryMode=" + memMode)
Member

Ah, we should also change it here.

@SparkQA

SparkQA commented Jun 12, 2017

Test build #77936 has finished for PR 18260 at commit 04790bb.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

cloud-fan commented Jun 12, 2017

The SparkR test failure is unrelated. I'm merging this to master, thanks for your review!

@asfgit asfgit closed this in 22dd65f Jun 12, 2017
@rxin
Contributor

rxin commented Jun 12, 2017

Why are we doing this? Isn't it potentially better for compression to store them separately? We could also easily remove the offset for fixed-length arrays.

@cloud-fan
Contributor Author

cloud-fan commented Jun 13, 2017

Sorry, I missed this part. Currently we don't externalize ColumnVector, so I didn't go in this direction. I'm reverting it until we have a consensus about how to change the ColumnVector APIs.

dataknocker pushed a commit to dataknocker/spark that referenced this pull request Jun 16, 2017

Author: Wenchen Fan <[email protected]>

Closes apache#18260 from cloud-fan/put.