
Conversation

@kiszk (Member) commented Jul 21, 2017

What changes were proposed in this pull request?

This PR abstracts the data compressed by CompressibleColumnAccessor behind ColumnVector using a batch method. When ColumnAccessor.decompress is called, the ColumnVector is filled with the uncompressed data. This batch decompression avoids going through InternalRow, which reduces the number of memory accesses.

As a first step, this PR supports primitive data types; a follow-up PR will support arrays and other data types.

This implementation decompresses data in batch into an uncompressed column batch, as @rxin suggested here. An alternative implementation would use the adapter approach that @cloud-fan suggested.
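To make the batch approach concrete, below is a minimal, self-contained Scala sketch of decompressing a run-length-encoded column straight into a primitive column buffer, with no per-row InternalRow. The names (IntColumnVector, BatchDecompressor) are hypothetical stand-ins and do not match Spark's actual ColumnAccessor/ColumnVector API.

```scala
// Hypothetical sketch only: batch decompression into a columnar buffer.
// Class and method names here are illustrative, not Spark's internal API.
final class IntColumnVector(capacity: Int) {
  private val values = new Array[Int](capacity) // primitive storage, no boxing
  def putInt(rowId: Int, value: Int): Unit = values(rowId) = value
  def getInt(rowId: Int): Int = values(rowId)
}

object BatchDecompressor {
  /** Expands parallel (value, runLength) arrays into the vector in one tight loop. */
  def decompressRunLengthEncoded(
      values: Array[Int],
      runLengths: Array[Int],
      out: IntColumnVector): Int = {
    var rowId = 0
    var run = 0
    while (run < values.length) {
      var i = 0
      while (i < runLengths(run)) { // no per-row InternalRow allocation
        out.putInt(rowId, values(run))
        rowId += 1
        i += 1
      }
      run += 1
    }
    rowId // total number of rows written
  }
}
```

Once the vector is filled, downstream operators can read it positionally with getInt(rowId) instead of iterating row objects.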

How was this patch tested?

Added test suites

@SparkQA commented Jul 21, 2017

Test build #79837 has finished for PR 18704 at commit c09f05f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 21, 2017

Test build #79838 has finished for PR 18704 at commit ec368d8.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class CachedBatchColumnVector extends ReadOnlyColumnVector

@SparkQA commented Jul 21, 2017

Test build #79839 has finished for PR 18704 at commit d6e8fef.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Jul 21, 2017

@rxin Could you please review this PR? This is the batch approach that you suggested here.
Regarding the test failure, the issue is confined to the test suite.

@SparkQA commented Jul 21, 2017

Test build #79843 has finished for PR 18704 at commit bd0c334.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Jul 23, 2017

ping @rxin

@kiszk (Member, Author) commented Jul 26, 2017

ping @rxin

@kiszk (Member, Author) commented Jul 31, 2017

@rxin Could you please review this PR?

@SparkQA commented Aug 22, 2017

Test build #80958 has finished for PR 18704 at commit a24a971.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80961 has finished for PR 18704 at commit 6367a4c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80964 has finished for PR 18704 at commit 9c8960b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Aug 22, 2017

retest this please

@SparkQA commented Aug 22, 2017

Test build #80978 has finished for PR 18704 at commit 9c8960b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 22, 2017

Test build #80982 has finished for PR 18704 at commit 8f542b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk changed the title from "[SPARK-20783][SQL] Create CachedBatchColumnVector to abstract existing compressed column (batch method)" to "[SPARK-20783][SQL] Create ColumnVector to abstract existing compressed column (batch method)" on Aug 22, 2017
@kiszk (Member, Author) commented Aug 22, 2017

@cloud-fan I updated this implementation to use ColumnVector, as we discussed. I would appreciate it if you could discuss the two implementations (this batch approach and the on-demand approach) with @rxin.
cc @ueshin

@cloud-fan (Contributor) commented Aug 24, 2017

now we can move them to WritableColumnVector

Member Author:

Got it. Rebased in my local version.

Member Author:

done

Contributor:

Can we delay the decompression and set the dictionary on the ColumnVector?

Member Author:

Sure, I will do that.

Member Author:

done
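The idea behind the suggestion above can be sketched as follows: instead of eagerly expanding a dictionary-encoded column, keep the per-row ids plus a decode table and resolve values only on access. This is a simplified, hypothetical sketch; Spark's actual Dictionary and WritableColumnVector.setDictionary contract is richer than this.

```scala
// Minimal sketch of lazy dictionary decoding: keep compressed ids, decode on read.
// Names and interfaces here are illustrative, not Spark's internal API.
trait IntDictionary {
  def decodeToInt(id: Int): Int
}

final class ArrayBackedDictionary(entries: Array[Int]) extends IntDictionary {
  override def decodeToInt(id: Int): Int = entries(id)
}

final class DictionaryColumnVector(ids: Array[Int], dict: IntDictionary) {
  // Decompression is deferred: each read resolves the id through the dictionary.
  def getInt(rowId: Int): Int = dict.decodeToInt(ids(rowId))
}

object LazyDictionaryExample {
  def main(args: Array[String]): Unit = {
    val dict = new ArrayBackedDictionary(Array(100, 200, 300))
    val col  = new DictionaryColumnVector(Array(0, 2, 2, 1), dict)
    println((0 until 4).map(col.getInt)) // Vector(100, 300, 300, 200)
  }
}
```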

@SparkQA commented Aug 24, 2017

Test build #81093 has finished for PR 18704 at commit fb0d4e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnDictionary implements Dictionary

@kiszk (Member, Author) commented Aug 25, 2017

@cloud-fan could you please review this again?

@kiszk (Member, Author) commented Aug 31, 2017

ping @cloud-fan

@SparkQA commented Aug 31, 2017

Test build #81295 has finished for PR 18704 at commit 097fc05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ColumnDictionary implements Dictionary

@kiszk (Member, Author) commented Sep 1, 2017

@cloud-fan I resolved the conflict; could you please review?

@kiszk (Member, Author) commented Sep 6, 2017

ping @cloud-fan

@kiszk (Member, Author) commented Sep 11, 2017

ping @cloud-fan

Contributor:

Is it possible to avoid boxing here? For example, we could have a set of primitive array members, one per type.

Member Author:

Yeah, I removed boxing.
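To illustrate the boxing concern discussed above, the sketch below contrasts a generic Any-backed buffer (which boxes every primitive value) with per-type primitive arrays. It is illustrative only and not the code in this PR.

```scala
// Illustrative only: why per-type primitive arrays avoid boxing.

// Storing Int values through a generic container boxes each element:
final class BoxedBuffer(capacity: Int) {
  private val values = new Array[Any](capacity)          // each Int becomes a java.lang.Integer
  def put(i: Int, v: Any): Unit = values(i) = v
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]  // unboxes on every read
}

// Keeping one primitive array per type stays on the unboxed fast path:
final class PrimitiveBuffers(capacity: Int) {
  val intValues: Array[Int]       = new Array[Int](capacity)
  val longValues: Array[Long]     = new Array[Long](capacity)
  val doubleValues: Array[Double] = new Array[Double](capacity)

  def putInt(i: Int, v: Int): Unit = intValues(i) = v    // no per-value allocation
  def getInt(i: Int): Int = intValues(i)
}
```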

Contributor:

This description is a little vague, since the input data is a byte[]. Can we say more about it, e.g. the endianness?

Member Author:

I see.
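One way to make the byte[] layout explicit, as the comment asks, is to document and read it with a fixed byte order. The ByteBuffer sketch below assumes little-endian data purely for illustration; the actual byte order of Spark's cached batches is defined by its own writers, not by this example.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object ByteArrayReader {
  // Reads `count` 4-byte ints from `bytes` starting at `offset`, assuming a
  // little-endian encoding. The byte order here is an assumption for illustration.
  def readInts(bytes: Array[Byte], offset: Int, count: Int): Array[Int] = {
    val buf = ByteBuffer.wrap(bytes, offset, count * 4).order(ByteOrder.LITTLE_ENDIAN)
    Array.fill(count)(buf.getInt()) // each getInt() advances the buffer position by 4
  }
}
```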

@kiszk (Member, Author) commented Sep 12, 2017

@ueshin Line 145 may have a mistake in its comment: "Sets values from [rowId, rowId + count) to [src + srcIndex, src + srcIndex + count)".
It should be: "Sets values from [src + srcIndex, src + srcIndex + count) to [rowId, rowId + count)".

What do you think?
If we need to update them, should we do it in this PR, or is it better to create another PR?

Contributor:

Let's update them in this PR. By the way, WritableColumnVector may be exposed to end users so that they can build a columnar batch for the data source v2 columnar scan, so the documentation is very important.

Contributor:

typo? ordinal?

Member Author:

good catch, done

Contributor:

If we need to throw an exception at the end anyway, why not do it at the beginning?

Member Author:

thanks, fixed.

Contributor:

hmmm, is there any way to reduce the code duplication? maybe codegen?

Member Author:

Removed code duplication by using a function object. How about this?
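As a rough picture of the function-object approach mentioned above, a single expansion loop can take the per-run write operation as a parameter instead of repeating the loop body for each primitive type. This Scala sketch uses assumed names (RunLengthExpand, putRun) and is a simplification, not the PR's actual code.

```scala
object RunLengthExpand {
  // One generic driver loop; the type-specific write is supplied as a function object.
  // Arguments to putRun: (first rowId of the run, run index, run length).
  def expand(runLengths: Array[Int], putRun: (Int, Int, Int) => Unit): Unit = {
    var rowId = 0
    var run = 0
    while (run < runLengths.length) {
      putRun(rowId, run, runLengths(run))
      rowId += runLengths(run)
      run += 1
    }
  }
}

object RunLengthExpandExample {
  def main(args: Array[String]): Unit = {
    val intValues  = Array(7, 9)
    val runLengths = Array(3, 2)
    val out        = new Array[Int](5)

    // Only this function object changes per element type; the driver loop is shared.
    RunLengthExpand.expand(runLengths, (rowId, run, len) => {
      var i = 0
      while (i < len) { out(rowId + i) = intValues(run); i += 1 }
    })
    println(out.mkString(", ")) // 7, 7, 7, 9, 9
  }
}
```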

Contributor:

nit: indentation is wrong here.

Contributor:

ditto

@cloud-fan (Contributor):
retest this please

@cloud-fan (Contributor):
LGTM, pending jenkins

@SparkQA commented Oct 3, 2017

Test build #82420 has finished for PR 18704 at commit 549b10f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk (Member, Author) commented Oct 3, 2017

I will rebase this in the next few hours.

@SparkQA commented Oct 3, 2017

Test build #82426 has finished for PR 18704 at commit c16230d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PassThroughSuite extends SparkFunSuite

@kiszk (Member, Author) commented Oct 4, 2017

@cloud-fan I merged with the latest master and addressed your comment about indentation.

@cloud-fan (Contributor):
thanks, merging to master!
