[SPARK-25931][SQL] Benchmarking creation of Jackson parser #22920
Conversation
spark.read
  .schema(schema)
  .json(path.getAbsolutePath)
  .filter((_: Row) => true)
@MaxGekk . This is another benchmark case, isn't it?
We should have different benchmark cases for these.
This is not a follow-up. Please create another JIRA to add these test cases.
This is another benchmark case, isn't it?

Originally I added the benchmark to check how specifying the encoding impacts performance (see #20937). This worked well until #21909. Currently the benchmark just tests how fast the JSON datasource can create empty rows (in the case of count()), which is already checked by another benchmark.

I believe this PR is just a follow-up of #21909, which should have included the changes proposed in this PR.
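To make the distinction concrete, here is a minimal sketch (spark-shell style; the `path` value and the schema derivation are placeholder assumptions, not from the PR) of how a trivial row-level filter defeats the empty-schema pushdown, so `count()` actually exercises the Jackson parser:

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Hypothetical inputs: a JSON file path and an explicit schema for it.
val path = "/tmp/short.json"
val schema = spark.read.json(path).schema

// count() alone lets Spark push an empty required schema down to the JSON
// datasource (the #21909 optimisation), so no Jackson parser is created.
val fastCount = spark.read.schema(schema).json(path).count()

// A typed row-level filter forces every line through the Jackson parser
// before counting, which is the overhead the new benchmark measures.
val parsedCount = spark.read
  .schema(schema)
  .json(path)
  .filter((_: Row) => true)
  .count()
```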
BTW, what do you want to call the old and new test cases?
- For the old case, we can give it a new name.
- For the new case, `JSON per-line parsing` looks a little bit inaccurate because we have filters now.
Test build #98351 has finished for PR 22920 at commit

jenkins, retest this, please

Test build #98356 has finished for PR 22920 at commit

Test build #98411 has finished for PR 22920 at commit
}

override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
  val numIters = 2
Thank you for updating, @MaxGekk .
Do we have a reason to decrease this value from 3 to 2 in this PR?
If this is for reducing the running time, let's keep the original value.
This benchmark is not executed frequently.
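For illustration, a sketch of how `numIters` is typically threaded through Spark's `Benchmark` helper (the case body and row count are hypothetical, the constructor details may differ from the PR, and `spark`, `schema`, and `path` are assumed from the surrounding suite):

```scala
import org.apache.spark.benchmark.Benchmark

// Iteration count under review: the original value of 3 is kept here.
val rowsNum = 10 * 1000 * 1000
val numIters = 3
val benchmark = new Benchmark("JSON benchmark", rowsNum, minNumIters = numIters)

// Each registered case is measured numIters times and the results averaged.
benchmark.addCase("count a short column", numIters) { _ =>
  spark.read.schema(schema).json(path).count()
}
benchmark.run()
```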
@MaxGekk .

Update iteration and result.

Test build #98424 has finished for PR 22920 at commit
dongjoon-hyun left a comment
+1, LGTM.
Merged to master.
Thank you, @MaxGekk and @HyukjinKwon !

@dongjoon-hyun Thank you for re-running the benchmarks on EC2, and @HyukjinKwon for review.
What changes were proposed in this pull request?
Added a new benchmark which forcibly invokes the Jackson parser to check the overhead of its creation for short and wide JSON strings. Existing benchmarks do not allow checking that due to an optimisation introduced by #21909 for an empty schema pushed down to the JSON datasource. The `count()` action passes an empty schema as the required schema to the datasource, and the Jackson parser is not created at all in that case.

Besides the new benchmark, I also refactored the existing benchmarks:
- Added `numIters` to control the number of iterations in each benchmark
- Renamed `JSON per-line parsing` -> `count a short column`, `JSON parsing of wide lines` -> `count a wide column`, and `Count a dataset with 10 columns` -> `Select a subset of 10 columns`
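For reference, a minimal sketch of generating short and wide JSON inputs of the kind such a benchmark needs (paths, column counts, and row counts are illustrative, not the PR's actual values):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Short lines: one small column per JSON record.
spark.range(0, 1000000)
  .select($"id".cast("string").as("shortColumn"))
  .write.json("/tmp/short.json")

// Wide lines: many columns per record, so the per-line parsing work is large
// compared to the one-off cost of creating the parser.
val wideCols = (0 until 100).map(i => lit(s"value_$i").as(s"col$i"))
spark.range(0, 100000)
  .select(wideCols: _*)
  .write.json("/tmp/wide.json")
```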