Skip to content

Conversation

@heary-cao
Copy link
Contributor

@heary-cao heary-cao commented Oct 26, 2018

What changes were proposed in this pull request?

Refactor JSONBenchmark to use main method

use spark-submit:
bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Generate benchmark result:
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"

How was this patch tested?

manual tests

@heary-cao
Copy link
Contributor Author

cc @dongjoon-hyun, @wangyum

@dongjoon-hyun
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented Oct 26, 2018

Test build #98072 has finished for PR 22844 at commit 937111f.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 26, 2018

Test build #98075 has finished for PR 22844 at commit 62af4fd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@heary-cao
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Oct 27, 2018

Test build #98117 has finished for PR 22844 at commit 5c05263.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 27, 2018

Test build #98120 has finished for PR 22844 at commit ff86e34.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 27, 2018

Test build #98123 has finished for PR 22844 at commit ebef3f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your contribution, @heary-cao .
But, the values seems to be manually copied. Could you run your benchmark actually?
In general, during refactoring, you should rerun the benchmark by yourself and check whether the PR doesn't create some regression.

cc @yucai and @wangyum

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update without sbt usage to:

bin/spark-submit --class <this class> --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>

Copy link
Member

@wangyum wangyum Oct 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also update the usage in description:

bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry @heary-cao I mean update here to:

bin/spark-submit --class <this class> --jars <spark core test jar>,<spark catalyst test jar> <spark sql test jar>

and update PR description to:

bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmarks --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98193 has finished for PR 22844 at commit 15c0893.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98192 has finished for PR 22844 at commit 7d23c18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 29, 2018

Test build #98199 has finished for PR 22844 at commit fbdbf83.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#22872 has updated runBenchmarkSuite's signature.

Suggested change
override def runBenchmarkSuite(): Unit = {
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for @yucai 's comment.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98242 has finished for PR 22844 at commit 2c73385.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98253 has finished for PR 22844 at commit 4116879.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_163-b01 on Windows 7 6.1
Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the same reason (#22845 (comment)), it's difficult to figure out the CPU.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make a PR to you.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heary-cao . Could you review and merge heary-cao#3 ?

JSON per-line parsing: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
No encoding 12107 / 12246 8.3 121.1 1.0X
UTF-8 is set 12375 / 12475 8.1 123.8 1.0X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @HyukjinKwon . According to the ratio, it seems to be a regression on No encoding case. How do you think this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait .. this is almost 50% slower. This had to be around 8000ish.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also run this benchmark and got the same ratio. So it's a little weird.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, this benchmark was added rather we can make sure setting encoding does not affect the performance without encoding (right @MaxGekk ?). We should fix this. @cloud-fan

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a quick look within few days. This is per line basic case where many users are affected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. This is also because of count optimization. ratio is weird but actually it's performance improvement for both cases. shouldn't be a big deal.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it's by a8a1ac0

Before:

JSON per-line parsing:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
No encoding                                 35786 / 36446          2.8         357.9       1.0X
UTF-8 is set                                57486 / 58714          1.7         574.9       0.6X

After:

JSON per-line parsing:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
No encoding                                 11142 / 11425          9.0         111.4       1.0X
UTF-8 is set                                11139 / 11293          9.0         111.4       1.0X

Looks not a regression.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the confirmation!

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98262 has finished for PR 22844 at commit 473a3a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 30, 2018

Test build #98263 has finished for PR 22844 at commit 422df47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

Since this is JSON benchmark, could you merge this, @HyukjinKwon ? :)

@HyukjinKwon
Copy link
Member

Merged to master.

@asfgit asfgit closed this in f6ff632 Oct 31, 2018
@dongjoon-hyun
Copy link
Member

Thank you, @heary-cao , @yucai , @wangyum , and @HyukjinKwon .

JSON parsing of wide lines: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
No encoding 39789 / 40053 0.3 3978.9 1.0X
UTF-8 is set 39505 / 39584 0.3 3950.5 1.0X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The numbers for currently used Jackson parser should be slightly different. The PR #22920 triggers creation of Jackson parser.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on the PR. Please add another benchmark cases instead of changing the existing numbers.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Refactor JSONBenchmark to use main method

use spark-submit:
`bin/spark-submit --class org.apache.spark.sql.execution.datasources.json.JSONBenchmark --jars ./core/target/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar,./sql/catalyst/target/spark-catalyst_2.11-3.0.0-SNAPSHOT-tests.jar ./sql/core/target/spark-sql_2.11-3.0.0-SNAPSHOT-tests.jar`

Generate benchmark result:
`SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JSONBenchmark"`

## How was this patch tested?

manual tests

Closes apache#22844 from heary-cao/JSONBenchmarks.

Lead-authored-by: caoxuewen <[email protected]>
Co-authored-by: heary <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants