[SPARK-20783][SQL] Create ColumnVector to abstract existing compressed column (batch method) #18704
|
Test build #79837 has finished for PR 18704 at commit
|
|
Test build #79838 has finished for PR 18704 at commit
|
|
Test build #79839 has finished for PR 18704 at commit
|
|
Test build #79843 has finished for PR 18704 at commit
|
|
ping @rxin |
|
@rxin Could you please review this PR? |
|
Test build #80958 has finished for PR 18704 at commit
|
|
Test build #80961 has finished for PR 18704 at commit
|
|
Test build #80964 has finished for PR 18704 at commit
|
|
retest this please |
|
Test build #80978 has finished for PR 18704 at commit
|
|
Test build #80982 has finished for PR 18704 at commit
|
|
@cloud-fan I updated this implementation by using |
now we can move them to WritableColumnVector
Got it. Rebased in my local version.
done
can we delay the decompression and set the dictionary to ColumnVector?
Sure, I will do that.
done
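To illustrate the suggestion above about delaying decompression, here is a minimal sketch of a vector that keeps the dictionary and the per-row dictionary ids and only decodes a value when it is read. The class `LazyDictionaryIntVector` and its methods are hypothetical names for illustration, not Spark's `ColumnVector` API.

```java
// Hypothetical sketch, not Spark's ColumnVector: values stay dictionary-encoded
// and are decoded only when a row is read.
final class LazyDictionaryIntVector {
    private final int[] dictionary; // distinct values
    private final int[] ids;        // one dictionary index per row

    LazyDictionaryIntVector(int[] dictionary, int[] ids) {
        this.dictionary = dictionary;
        this.ids = ids;
    }

    // Decode a single row on demand instead of materializing the whole column.
    int getInt(int rowId) {
        return dictionary[ids[rowId]];
    }

    int numRows() {
        return ids.length;
    }
}
```

With this layout, attaching the dictionary costs nothing per row up front; the lookup cost is paid only for rows that are actually accessed.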
|
Test build #81093 has finished for PR 18704 at commit
|
|
@cloud-fan could you please review this again? |
|
ping @cloud-fan |
|
Test build #81295 has finished for PR 18704 at commit
|
|
@cloud-fan Resolved conflict, could you please review? |
|
ping @cloud-fan |
is it possible to avoid boxing here? e.g. we can have a lot of primitive array members.
Yeah, I removed boxing.
This description is a little vague, as the input data is byte[]. Can we say more about this? e.g. endianness.
I see.
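The endianness question matters because the same bytes decode to different integers depending on byte order. The sketch below shows this under the assumption that the input is little-endian; the thread does not say which order the documented method uses, and `ByteArrayInts.readLittleEndianInts` is a hypothetical helper, not a Spark method.

```java
// Hypothetical helper: decodes `count` 4-byte integers from `src`, starting at
// byte offset `srcIndex`, assuming little-endian byte order.
final class ByteArrayInts {
    static int[] readLittleEndianInts(byte[] src, int srcIndex, int count) {
        java.nio.ByteBuffer buf = java.nio.ByteBuffer
            .wrap(src, srcIndex, count * 4)
            .order(java.nio.ByteOrder.LITTLE_ENDIAN);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = buf.getInt(); // relative read, advances by 4 bytes
        }
        return out;
    }
}
```

For example, the bytes `{1, 0, 0, 0}` decode to 1 in little-endian order but to 16777216 in big-endian order, which is why the doc comment should state the order explicitly.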
@ueshin The comment at line 145 may be wrong. It says: Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)
It should be: Sets values from [src + srcIndex, src + srcIndex + count) to [rowId, rowId + count)
What do you think?
If we need to update them, should we do it in this PR, or is it better to create another PR?
let's update them in this PR. BTW WritableColumnVector may be exposed to end users, so that they can build columnar batches for the data source v2 columnar scan, so the documentation is very important.
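The corrected wording discussed above describes a copy from the source range into the row range. A minimal standalone sketch of that direction follows; `SimpleIntVector` is a hypothetical class written for illustration, and only the parameter order (`rowId`, `count`, `src`, `srcIndex`) is taken from the comment under review.

```java
// Hypothetical illustration of the documented copy direction:
// src[srcIndex, srcIndex + count) is copied into rows [rowId, rowId + count).
final class SimpleIntVector {
    private final int[] data;

    SimpleIntVector(int capacity) {
        this.data = new int[capacity];
    }

    void putInts(int rowId, int count, int[] src, int srcIndex) {
        // Source is `src`, destination is this vector's storage.
        System.arraycopy(src, srcIndex, data, rowId, count);
    }

    int getInt(int rowId) {
        return data[rowId];
    }
}
```

Calling `putInts(1, 2, new int[] {7, 8, 9}, 1)` on an empty vector of capacity 5 writes 8 and 9 into rows 1 and 2 and leaves the other rows untouched.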
typo? ordinal?
good catch, done
If we need to throw exception at last, why not do it at the beginning?
thanks, fixed.
hmmm, is there any way to reduce the code duplication? maybe codegen?
Removed code duplication by using a function object. How about this?
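One way to read the function-object approach mentioned above: the loop that walks rows and handles nulls is written once, and the type-specific write is passed in as a function. The sketch below assumes a sorted list of null row positions; `NullAwareCopy`, `RowWriter`, and `copySkippingNulls` are hypothetical names, and the PR's actual code may be structured differently.

```java
// Hypothetical sketch: one shared loop, with the per-type write supplied as a
// function object so int, long, double, etc. do not each need their own copy.
final class NullAwareCopy {
    interface RowWriter {
        void put(int rowId, int valueIndex);
    }

    // `nullPositions` must be sorted ascending. Returns the number of
    // non-null values written.
    static int copySkippingNulls(int numRows, int[] nullPositions,
                                 boolean[] isNull, RowWriter writer) {
        int nextNull = 0;
        int valueIndex = 0;
        for (int rowId = 0; rowId < numRows; rowId++) {
            if (nextNull < nullPositions.length && nullPositions[nextNull] == rowId) {
                isNull[rowId] = true; // null rows consume no packed value
                nextNull++;
            } else {
                writer.put(rowId, valueIndex++);
            }
        }
        return valueIndex;
    }
}
```

A caller for an `int` column could pass `(rowId, i) -> out[rowId] = packed[i]`, and a `long` column would pass the same shape of lambda over `long[]` arrays, so only the one-line writer differs between types.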
nit: indentation is wrong here.
ditto
|
retest this please |
|
LGTM, pending jenkins |
|
Test build #82420 has finished for PR 18704 at commit
|
|
I will rebase this in the next few hours. |
revert unexpected style change
|
Test build #82426 has finished for PR 18704 at commit
|
|
@cloud-fan merged with the latest master and addressed your comment for indent |
|
thanks, merging to master! |
What changes were proposed in this pull request?
This PR abstracts data compressed by CompressibleColumnAccessor using ColumnVector in a batch method. When ColumnAccessor.decompress is called, the ColumnVector will hold the uncompressed data. This batch decompression does not use InternalRow, which reduces the number of memory accesses.
As a first step of this implementation, this JIRA supports primitive data types. Another PR will support array and other data types.
This implementation decompresses data in batch into an uncompressed column batch, as @rxin suggested here. Another implementation uses an adapter approach, as @cloud-fan suggested.
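As a rough illustration of what decompressing a whole column in one batch call can look like, the sketch below expands a run-length encoded `int` column directly into a primitive array, with no per-row object in between. The input layout (parallel `values` and `runLengths` arrays) and the class `BatchRunLength` are assumptions made for this example and do not reflect Spark's actual encoding format.

```java
// Hypothetical sketch of batch decompression: the whole column is expanded in
// one call into a primitive array, rather than one row at a time.
final class BatchRunLength {
    static int[] decompress(int[] values, int[] runLengths) {
        // Validate up front so no partial output is produced on bad input.
        if (values.length != runLengths.length) {
            throw new IllegalArgumentException("values and runLengths differ in length");
        }
        int total = 0;
        for (int len : runLengths) {
            total += len;
        }
        int[] out = new int[total];
        int pos = 0;
        for (int run = 0; run < values.length; run++) {
            // Fill the half-open range [pos, pos + runLength) with the run's value.
            java.util.Arrays.fill(out, pos, pos + runLengths[run], values[run]);
            pos += runLengths[run];
        }
        return out;
    }
}
```

For instance, values `{7, 3}` with run lengths `{2, 3}` expand to `{7, 7, 3, 3, 3}`.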
How was this patch tested?
Added test suites