Conversation

@MaxGekk MaxGekk commented Nov 1, 2018

What changes were proposed in this pull request?

Added a new benchmark which forcibly invokes the Jackson parser, to check the overhead of its creation for short and wide JSON strings. The existing benchmarks do not allow checking that, due to an optimisation introduced by #21909 for an empty schema pushed down to the JSON datasource: the count() action passes an empty schema as the required schema to the datasource, so the Jackson parser is not created at all in that case.
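
The behaviour described above can be sketched as follows (a hedged illustration, not code from the PR; `spark`, `schema` and `jsonPath` are assumed placeholders for a running SparkSession, a user schema and a JSON file path):

```scala
import org.apache.spark.sql.Row

// count() lets Spark push an empty required schema down to the JSON
// datasource, so no Jackson parser is created at all:
spark.read.schema(schema).json(jsonPath).count()

// A trivial, non-pushable filter forces every row to be materialised,
// so the Jackson parser is created and its overhead becomes measurable:
spark.read.schema(schema).json(jsonPath)
  .filter((_: Row) => true)
  .count()
```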

Besides the new benchmark, I also refactored the existing benchmarks:

  • Added numIters to control the number of iterations in each benchmark
  • Renamed JSON per-line parsing -> count a short column, JSON parsing of wide lines -> count a wide column, and Count a dataset with 10 columns -> Select a subset of 10 columns.

spark.read
  .schema(schema)
  .json(path.getAbsolutePath)
  .filter((_: Row) => true) // a non-pushable filter: forces row materialisation, so the parser is created
Member

@MaxGekk. This is another benchmark case, isn't it?
We should have separate benchmark cases for these.

Member

This is not a follow-up. Please create another JIRA to add these test cases.

Member Author

@MaxGekk MaxGekk Nov 1, 2018

> This is another benchmark case, isn't it?

Originally I added the benchmark to check how specifying the encoding impacts performance (see #20937). This worked well until #21909. Currently the benchmark just tests how fast the JSON datasource can create empty rows (in the case of count()), which is already checked by another benchmark.

I believe this PR is just a follow-up of #21909, which should have included the changes proposed in this PR.

Member

@dongjoon-hyun dongjoon-hyun Nov 1, 2018

@MaxGekk. In your PR (#21909), you already showed the effect via a benchmark.

What I mean is that both test cases are meaningful and worth having. :) And we need to compare both results in future releases.

In any case, please create new, separate benchmark cases for this PR.

Member

BTW, what do you want to call the old and new test cases?

  • For the old case, we can give it a new name.
  • For the new case, JSON per-line parsing no longer looks accurate because we have filters now.

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98351 has finished for PR 22920 at commit c7d5cc4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Nov 1, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98356 has finished for PR 22920 at commit c7d5cc4.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 3, 2018

Test build #98411 has finished for PR 22920 at commit 9e32447.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  val numIters = 2
Member

@dongjoon-hyun dongjoon-hyun Nov 3, 2018

Thank you for updating, @MaxGekk .
Do we have a reason to decrease this value from 3 to 2 in this PR?
If this is for reducing the running time, let's keep the original value.
This benchmark is not executed frequently.
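
For context, the numIters value under discussion feeds Spark's internal Benchmark helper; the pattern can be sketched as follows (a simplified, hedged sketch: `rowsNum`, `output`, `schema`, `path` and `spark` are placeholders, and exact signatures may differ across Spark versions):

```scala
import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.Row

val numIters = 3  // keeping the original iteration count, as requested above

val benchmark = new Benchmark("count a short column", rowsNum, output = output)

// numIters controls how many timed iterations each case runs; results are
// aggregated across iterations, so a larger value smooths out JVM noise at
// the cost of a longer run.
benchmark.addCase("No encoding", numIters) { _ =>
  spark.read.schema(schema).json(path).filter((_: Row) => true).count()
}
benchmark.run()
```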

@dongjoon-hyun
Member

@MaxGekk .

  1. Could you review Update iteration and result (MaxGekk/spark#14) and merge it?
    I restored the iteration number and updated the result on EC2.

  2. Also, please create a new JIRA issue. If you reuse a JIRA issue like this, the fix version and the content become less coherent. SPARK-24959 is for 2.4.0 and this is for 3.0, isn't it?

@MaxGekk MaxGekk changed the title [SPARK-24959][SQL][FOLLOWUP] Creating Jackson parser in the encoding JSON benchmarks [SPARK-25931][SQL] Benchmarking creation of Jackson parser Nov 3, 2018
@SparkQA

SparkQA commented Nov 3, 2018

Test build #98424 has finished for PR 22920 at commit 4ecc7e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.
Merged to master.

@asfgit asfgit closed this in 42b6c1f Nov 3, 2018
@dongjoon-hyun
Member

Thank you, @MaxGekk and @HyukjinKwon !

@MaxGekk
Member Author

MaxGekk commented Nov 3, 2018

@dongjoon-hyun Thank you for re-running the benchmarks on EC2, and @HyukjinKwon for the review.

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019

Closes apache#22920 from MaxGekk/json-benchmark-follow-up.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
@MaxGekk MaxGekk deleted the json-benchmark-follow-up branch August 17, 2019 13:35