@gengliangwang
Member

@gengliangwang gengliangwang commented Mar 11, 2019

What changes were proposed in this pull request?

Migrate JSON to File Data Source V2

How was this patch tested?

Unit test

@SparkQA

SparkQA commented Mar 11, 2019

Test build #103339 has finished for PR 24058 at commit aac8841.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class JsonDataSourceV2 extends FileDataSourceV2
      • case class JsonPartitionReaderFactory(
      • case class JsonScan(
      • class JsonScanBuilder (
      • case class JsonTable(
      • class JsonWriteBuilder(options: DataSourceOptions) extends FileWriteBuilder(options)

@gengliangwang gengliangwang force-pushed the jsonV2 branch 2 times, most recently from 7b7fb79 to 7133bd2 on March 27, 2019 17:25
@gengliangwang gengliangwang changed the title from "[WIP][SPARK-27128][SQL] Migrate JSON to File Data Source V2" to "[SPARK-27128][SQL] Migrate JSON to File Data Source V2" on Mar 27, 2019
@SparkQA

SparkQA commented Mar 27, 2019

Test build #104023 has finished for PR 24058 at commit 7133bd2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 28, 2019

Test build #104039 has finished for PR 24058 at commit 67dcaa2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Mar 29, 2019

Test build #104075 has finished for PR 24058 at commit 67dcaa2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@SparkQA

SparkQA commented Mar 29, 2019

Test build #104080 has finished for PR 24058 at commit 67dcaa2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104379 has finished for PR 24058 at commit ce5f77b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class JsonDataSourceV2 extends FileDataSourceV2
      • case class JsonPartitionReaderFactory(
      • case class JsonScan(
      • class JsonScanBuilder (
      • case class JsonTable(
      • class JsonWriteBuilder(

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104380 has finished for PR 24058 at commit c11d54f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class JsonOutputWriter(

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104382 has finished for PR 24058 at commit c74e09a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title from "[SPARK-27128][SQL] Migrate JSON to File Data Source V2" to "[WIP][SPARK-27128][SQL] Migrate JSON to File Data Source V2" on Apr 11, 2019
@SparkQA

SparkQA commented Apr 11, 2019

Test build #104517 has finished for PR 24058 at commit cbcd2c7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Apr 12, 2019

Test build #104537 has finished for PR 24058 at commit cbcd2c7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member Author

retest this please.

@SparkQA

SparkQA commented Apr 12, 2019

Test build #104542 has finished for PR 24058 at commit cbcd2c7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 12, 2019

Test build #104549 has finished for PR 24058 at commit 838b5f6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang gengliangwang changed the title from "[WIP][SPARK-27128][SQL] Migrate JSON to File Data Source V2" to "[SPARK-27128][SQL] Migrate JSON to File Data Source V2" on Apr 12, 2019
@gengliangwang
Member Author

This is ready. Please help review it. @cloud-fan @dongjoon-hyun @HyukjinKwon

Member

hmm, it doesn't say "JSON" now?

Member Author

The current error message is changed to `Unable to infer schema for $tableName`, where `tableName` is `shortName + path`. I can create another PR to fix that.

@HyukjinKwon
Member

Looks good if this matches the CSV one. I will take a closer look later this week.

cloud-fan pushed a commit that referenced this pull request Apr 15, 2019
…ailure in file source V2

## What changes were proposed in this pull request?

Since https://github.com/apache/spark/pull/23383/files#diff-db4a140579c1ac4b1dbec7fe5057eecaR36, the exception message of schema inference failure in file source V2 is `tableName`, which is equivalent to `shortName + path`.

While in file source V1, the message is `Unable to infer schema from ORC/CSV/JSON...`.
We should make the message in V2 consistent with V1, so that in the future migration the related test cases don't need to be modified. #24058 (review)

## How was this patch tested?

Revert the modified unit test cases in https://github.com/apache/spark/pull/24005/files#diff-b9ddfbc9be8d83ecf100b3b8ff9610b9R431 and https://github.com/apache/spark/pull/23383/files#diff-9ab56940ee5a53f2bb81e3c008653362R577, and test with them.

Closes #24369 from gengliangwang/reviseInferSchemaMessage.

Authored-by: Gengliang Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
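
To make the difference between the two message shapes discussed in this commit concrete, here is a minimal Scala sketch. The object and method names (`InferSchemaMessages`, `v2TableName`, `messageWithTableName`, `messageWithFormatName`) are hypothetical and do not mirror Spark's internals. The exact message wording and the way the short name and paths are joined are assumptions made for illustration.

```scala
// Illustrative only: these helpers are hypothetical, not Spark's actual code.
object InferSchemaMessages {
  // Assumption: the V2 table name is the source's short name followed by its paths.
  def v2TableName(shortName: String, paths: Seq[String]): String =
    s"$shortName ${paths.mkString(";")}"

  // Message keyed on the table name (short name + path), as described for V2.
  def messageWithTableName(shortName: String, paths: Seq[String]): String =
    s"Unable to infer schema for ${v2TableName(shortName, paths)}"

  // Message keyed on the format name only, as described for V1.
  def messageWithFormatName(formatName: String): String =
    s"Unable to infer schema for $formatName"
}
```

Under these assumptions, a test that checks for the upper-case format name (for example `msg.contains("JSON")`) would pass against the format-name message but not against the table-name message, which is why aligning V2 with V1 avoids touching existing test cases.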
@SparkQA

SparkQA commented Apr 15, 2019

Test build #104588 has finished for PR 24058 at commit 77bfc25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class JsonOutputWriter(
      • class JsonDataSourceV2 extends FileDataSourceV2
      • case class JsonPartitionReaderFactory(
      • case class JsonScan(
      • class JsonScanBuilder (
      • case class JsonTable(
      • class JsonWriteBuilder(

@gengliangwang
Member Author

retest this please.

val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", "CoL3").distinct()
checkAnswer(df, Row("a", "e", "c"))

df.explain(true)
Contributor

we should remove it

  test("Incorrect result caused by the rule OptimizeMetadataOnlyQuery") {
-   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true") {
+   withSQLConf(OPTIMIZER_METADATA_ONLY.key -> "true",
+     SQLConf.USE_V1_SOURCE_READER_LIST.key -> "json") {
Contributor

isn't v2 disabled by default?

Member Author

V2 reader is enabled by default.
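
As a rough illustration of why the test opts JSON back into V1, here is a minimal Scala sketch of how a list-valued setting such as `USE_V1_SOURCE_READER_LIST` could be interpreted. The names `V1FallbackSketch` and `useV1Reader` are hypothetical, and the comma-separated, case-insensitive parsing is an assumption rather than a statement of Spark's actual implementation.

```scala
import java.util.Locale

// Illustrative only: a hypothetical helper, not Spark's actual resolution logic.
object V1FallbackSketch {
  // Assumption: the setting holds comma-separated source short names,
  // compared case-insensitively and ignoring surrounding whitespace.
  def useV1Reader(useV1SourceList: String, shortName: String): Boolean = {
    val v1Sources = useV1SourceList
      .toLowerCase(Locale.ROOT)
      .split(",")
      .map(_.trim)
    v1Sources.contains(shortName.toLowerCase(Locale.ROOT))
  }
}
```

With this reading, an empty list leaves every source on the V2 reader, while a value of `"json"` routes only JSON reads to V1, which matches the intent of the `withSQLConf` change in the test above.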

}.getMessage
assert(msg.contains("only include the internal corrupt record column"))
intercept[catalyst.errors.TreeNodeException[_]] {
spark.read.schema(schema).json(path).filter($"_corrupt_record".isNotNull).count()
Contributor

do we change the behavior for this case?

Member Author

@SparkQA

SparkQA commented Apr 17, 2019

Test build #104652 has finished for PR 24058 at commit 77bfc25.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • class JsonOutputWriter(
      • class JsonDataSourceV2 extends FileDataSourceV2
      • case class JsonPartitionReaderFactory(
      • case class JsonScan(
      • class JsonScanBuilder (
      • case class JsonTable(
      • class JsonWriteBuilder(

@SparkQA

SparkQA commented Apr 17, 2019

Test build #104659 has finished for PR 24058 at commit 92bfe89.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 17, 2019

Test build #104668 has finished for PR 24058 at commit 4715b0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@SparkQA

SparkQA commented Apr 22, 2019

Test build #104788 has finished for PR 24058 at commit 4715b0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

7 participants