[SPARK-22712][SQL] Use buildReaderWithPartitionValues in native OrcFileFormat#19907

Closed
dongjoon-hyun wants to merge 2 commits into apache:master from dongjoon-hyun:SPARK-ORC-BUILD-READER

Conversation

@dongjoon-hyun
Member

What changes were proposed in this pull request?

To support vectorization in the native OrcFileFormat later, we need to use buildReaderWithPartitionValues instead of buildReader, as ParquetFileFormat does. This PR replaces buildReader with buildReaderWithPartitionValues.
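For context, the two hooks on the FileFormat trait share the same parameter list; the difference is the contract: buildReaderWithPartitionValues must return rows that already contain the partition columns. A sketch of the signature, written from memory of the Spark 2.x FileFormat trait rather than copied from this PR:

```scala
// Sketch, assuming the Spark 2.x FileFormat trait. The returned function maps
// a PartitionedFile to an iterator of rows that ALREADY include the partition
// values appended after the data columns (requiredSchema ++ partitionSchema),
// which is what lets the scan operator treat all sources uniformly.
def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
```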

How was this patch tested?

Pass the Jenkins with the existing test cases.

```diff
-  override def buildReader(
+  override def buildReaderWithPartitionValues(
```
Member Author


Hi, @cloud-fan. We left this behind during the previous ORC PR.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84543 has finished for PR 19907 at commit 199c835.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```diff
-    val unsafeProjection = UnsafeProjection.create(requiredSchema)
     val deserializer = new OrcDeserializer(dataSchema, requiredSchema, requestedColIds)
+    val colIds = requestedColIds ++ List.fill(partitionSchema.length)(-1).toArray[Int]
+    val unsafeProjection = UnsafeProjection.create(resultSchema)
```
Contributor


Can we follow Parquet and just join the data row and the partition row, then do a final unsafe projection? It's much easier, and there is no performance difference.
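The join-then-project approach suggested here mirrors what ParquetFileFormat does on its non-vectorized path. A rough illustration of the pattern, using Spark's internal JoinedRow and GenerateUnsafeProjection; here iter and file are assumed to come from the surrounding reader closure, and this is a sketch, not the exact code merged in this PR:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.JoinedRow
import org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection

// Combined output schema: data columns first, then partition columns.
val fullSchema = requiredSchema.toAttributes ++ partitionSchema.toAttributes

// JoinedRow views two rows as one without copying; the single unsafe
// projection at the end does the only copy per output row.
val joinedRow = new JoinedRow()
val appendPartitionColumns = GenerateUnsafeProjection.generate(fullSchema, fullSchema)

iter.map { dataRow: InternalRow =>
  appendPartitionColumns(joinedRow(dataRow, file.partitionValues))
}
```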

Member Author


Parquet vectorization works like the following:

```scala
      // UnsafeRowParquetRecordReader appends the columns internally to avoid another copy.
      if (parquetReader.isInstanceOf[VectorizedParquetRecordReader] &&
          enableVectorizedReader) {
        iter.asInstanceOf[Iterator[InternalRow]]
      }
```

Member Author


Oh, I see. You meant the non-vectorized path. Sorry, I was confused because I focused too much on the vectorized path. I'll do that.

@SparkQA

SparkQA commented Dec 6, 2017

Test build #84570 has finished for PR 19907 at commit f69fc4e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Dec 7, 2017

Hi, @gatorsmile.
Could you review this, too?

@cloud-fan
Contributor

Thanks, merging to master!

@asfgit asfgit closed this in dd59a4b Dec 7, 2017
@dongjoon-hyun
Member Author

Thank you, @cloud-fan !

@dongjoon-hyun dongjoon-hyun deleted the SPARK-ORC-BUILD-READER branch December 7, 2017 15:19