Conversation

@MaxGekk MaxGekk commented Nov 1, 2018

What changes were proposed in this pull request?

Added a new benchmark which forcibly invokes the Jackson parser, to check the overhead of its creation for short and wide JSON strings. The existing benchmarks do not allow checking that, due to an optimisation introduced by #21909 for an empty schema pushed down to the JSON datasource: the count() action passes an empty schema as the required schema to the datasource, so the Jackson parser is not created at all in that case.
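
The behaviour described above can be sketched as follows (a hedged illustration, not code from the PR; `spark`, `schema` and `jsonPath` are assumed placeholders for a running SparkSession, a user schema and a JSON file path):

```scala
import org.apache.spark.sql.Row

// count() lets Spark push an empty required schema down to the JSON
// datasource, so no Jackson parser is created at all:
spark.read.schema(schema).json(jsonPath).count()

// A trivial, non-pushable filter forces every row to be materialised,
// so the Jackson parser is created and its overhead becomes measurable:
spark.read.schema(schema).json(jsonPath)
  .filter((_: Row) => true)
  .count()
```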

Besides the new benchmark, I also refactored the existing benchmarks:

  • Added numIters to control the number of iterations in each benchmark
  • Renamed JSON per-line parsing -> count a short column, JSON parsing of wide lines -> count a wide column, and Count a dataset with 10 columns -> Select a subset of 10 columns.

spark.read
  .schema(schema)
  .json(path.getAbsolutePath)
  .filter((_: Row) => true) // a non-pushable filter: forces row materialisation, so the parser is created
Member

@MaxGekk. This is another benchmark case, isn't it?
We should have separate benchmark cases for these.

Member

This is not a follow-up. Please create another JIRA to add these test cases.

Member Author

@MaxGekk MaxGekk Nov 1, 2018

> This is another benchmark case, isn't it?

Originally I added the benchmark to check how specifying the encoding impacts performance (see #20937). This worked well until #21909. Currently the benchmark just tests how fast the JSON datasource can create empty rows (in the case of count()), which is already checked by another benchmark.

I believe this PR is just a follow-up of #21909, which should have included the changes proposed in this PR.

Member

@dongjoon-hyun dongjoon-hyun Nov 1, 2018

@MaxGekk. In your PR (#21909), you already showed the effect via a benchmark.

What I mean is that both test cases are meaningful and worth having. :) And we need to compare both results in future releases.

In any case, please create new, separate benchmark cases for this PR.

Member

BTW, what do you want to call the old and new test cases?

  • For the old case, we can give it a new name.
  • For the new case, JSON per-line parsing no longer looks accurate because we have filters now.

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98351 has finished for PR 22920 at commit c7d5cc4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 1, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98356 has finished for PR 22920 at commit c7d5cc4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2018

Test build #98411 has finished for PR 22920 at commit 9e32447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  val numIters = 2
Member

@dongjoon-hyun dongjoon-hyun Nov 3, 2018

Thank you for updating, @MaxGekk .
Do we have a reason to decrease this value from 3 to 2 in this PR?
If this is for reducing the running time, let's keep the original value.
This benchmark is not executed frequently.
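
For context, the numIters value under discussion feeds Spark's internal Benchmark helper; the pattern can be sketched as follows (a simplified, hedged sketch: `rowsNum`, `output`, `schema`, `path` and `spark` are placeholders, and exact signatures may differ across Spark versions):

```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Row

val numIters = 3  // keeping the original iteration count, as requested above

val benchmark = new Benchmark("count a short column", rowsNum, output = output)

// numIters controls how many timed iterations each case runs; results are
// aggregated across iterations, so a larger value smooths out JVM noise at
// the cost of a longer run.
benchmark.addCase("No encoding", numIters) { _ =>
  spark.read.schema(schema).json(path).filter((_: Row) => true).count()
}
benchmark.run()
```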

@dongjoon-hyun
Member

@MaxGekk .

  1. Could you review Update iteration and result (MaxGekk/spark#14) and merge it?
    I restored the iteration number and updated the result on EC2.

  2. Also, please create a new JIRA issue. If you reuse a JIRA issue like this, the fix version and the content become less coherent. SPARK-24959 is for 2.4.0 and this is for 3.0, isn't it?

@MaxGekk MaxGekk changed the title [SPARK-24959][SQL][FOLLOWUP] Creating Jackson parser in the encoding JSON benchmarks [SPARK-25931][SQL] Benchmarking creation of Jackson parser Nov 3, 2018
@SparkQA

SparkQA commented Nov 3, 2018

Test build #98424 has finished for PR 22920 at commit 4ecc7e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.
Merged to master.

@asfgit asfgit closed this in 42b6c1f Nov 3, 2018
@dongjoon-hyun
Member

Thank you, @MaxGekk and @HyukjinKwon !

@MaxGekk
Member Author

MaxGekk commented Nov 3, 2018

@dongjoon-hyun Thank you for re-running the benchmarks on EC2, and @HyukjinKwon for the review.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#22920 from MaxGekk/json-benchmark-follow-up.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@MaxGekk MaxGekk deleted the json-benchmark-follow-up branch August 17, 2019 13:35