@MaxGekk MaxGekk commented Oct 21, 2020

What changes were proposed in this pull request?

  1. Turn the SQL config spark.sql.legacy.parquet.int96RebaseModeInWrite, which was added by #30056 ([SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing), off and on in DateTimeRebaseBenchmark. The Parquet readers should infer the correct rebasing mode automatically from metadata.
  2. Regenerate benchmark results of DateTimeRebaseBenchmark in the environment:
| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by sudo add-apt-repository ppa:openjdk-r/ppa & sudo apt install openjdk-11-jdk |

Why are the changes needed?

To have up-to-date benchmark results for INT96, which is the default Parquet type for Catalyst's timestamp type.

Does this PR introduce any user-facing change?

No

How was this patch tested?

By updating benchmark results:

$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"

MaxGekk commented Oct 21, 2020

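| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| after 1900, rebase LEGACY | 27305 | 27305 | 0 | 3.7 | 273.0 | 0.1X |
| after 1900, rebase CORRECTED | 27715 | 27715 | 0 | 3.6 | 277.2 | 0.1X |
| before 1900, rebase LEGACY | 30911 | 30911 | 0 | 3.2 | 309.1 | 0.1X |
| before 1900, rebase CORRECTED | 27944 | 27944 | 0 | 3.6 | 279.4 | 0.1X |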

@MaxGekk MaxGekk Oct 21, 2020

Parquet writer without rebasing is ~10% faster.
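
The ~10% figure can be checked against the quoted writer rows. The sketch below is only an illustration, not part of the benchmark: it assumes the first numeric column of each row is the best time in milliseconds (as in Spark's standard benchmark output), and the helper name time_reduction_pct is hypothetical.

```python
def time_reduction_pct(legacy_ms: float, corrected_ms: float) -> float:
    """Percentage of time saved by CORRECTED relative to LEGACY."""
    return (legacy_ms - corrected_ms) / legacy_ms * 100.0


# First numeric column (assumed: best time, ms) of the "before 1900" writer rows.
before_1900 = time_reduction_pct(30911, 27944)
# Same calculation for the "after 1900" writer rows, for comparison.
after_1900 = time_reduction_pct(27305, 27715)

print(f"before 1900: {before_1900:.1f}% less time in CORRECTED mode")
print(f"after 1900: {after_1900:.1f}% less time in CORRECTED mode")
```

On these rows the "before 1900" case comes out just under 10%, while the "after 1900" difference is small and slightly negative.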

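| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| before 1900, vec off, rebase LEGACY | 20371 | 20458 | 81 | 4.9 | 203.7 | 0.8X |
| before 1900, vec off, rebase CORRECTED | 17484 | 17541 | 54 | 5.7 | 174.8 | 1.0X |
| before 1900, vec on, rebase LEGACY | 10284 | 10327 | 45 | 9.7 | 102.8 | 1.6X |
| before 1900, vec on, rebase CORRECTED | 7044 | 7073 | 37 | 14.2 | 70.4 | 2.4X |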

@MaxGekk MaxGekk Oct 21, 2020

Vectorized reader speedup without rebasing: ~30%

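| Case | Best Time(ms) | Avg Time(ms) | Stdev(ms) | Rate(M/s) | Per Row(ns) | Relative |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| after 1900, vec on, rebase LEGACY | 7183 | 7255 | 94 | 13.9 | 71.8 | 2.3X |
| after 1900, vec on, rebase CORRECTED | 7047 | 7137 | 86 | 14.2 | 70.5 | 2.4X |
| before 1900, vec off, rebase LEGACY | 20371 | 20458 | 81 | 4.9 | 203.7 | 0.8X |
| before 1900, vec off, rebase CORRECTED | 17484 | 17541 | 54 | 5.7 | 174.8 | 1.0X |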

Parquet-MR (vec off) speedup without rebasing: ~15%
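
The two reader figures (~30% for the vectorized reader, ~15% for Parquet-MR) can be recomputed from the "before 1900" reader rows quoted in this thread. The following is a rough sketch rather than anything from the PR: it assumes each row is a case name followed by six numeric columns, of which the first is the best time in milliseconds, and the names ROW_RE, best_times and reductions are made up for illustration.

```python
import re

# Reader rows copied from the review comments in this thread.
ROWS = """\
before 1900, vec off, rebase LEGACY 20371 20458 81 4.9 203.7 0.8X
before 1900, vec off, rebase CORRECTED 17484 17541 54 5.7 174.8 1.0X
before 1900, vec on, rebase LEGACY 10284 10327 45 9.7 102.8 1.6X
before 1900, vec on, rebase CORRECTED 7044 7073 37 14.2 70.4 2.4X
"""

# Case name, rebase mode, then the first numeric column (assumed: best time, ms).
ROW_RE = re.compile(
    r"^(?P<case>.+?), rebase (?P<mode>LEGACY|CORRECTED) (?P<best_ms>\d+) "
)


def best_times(text):
    """Map (case, mode) -> best time in ms for every row that matches."""
    times = {}
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            times[(m["case"], m["mode"])] = int(m["best_ms"])
    return times


def reductions(times):
    """Percent of time saved by CORRECTED relative to LEGACY, per case."""
    out = {}
    for (case, mode), legacy_ms in times.items():
        if mode == "LEGACY" and (case, "CORRECTED") in times:
            corrected_ms = times[(case, "CORRECTED")]
            out[case] = (legacy_ms - corrected_ms) / legacy_ms * 100.0
    return out


result = reductions(best_times(ROWS))
for case, pct in result.items():
    print(f"{case}: {pct:.1f}% less time without rebasing")
```

Measured as time saved relative to LEGACY, both cases land close to the rounded figures given in the comments.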

SparkQA commented Oct 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34699/

SparkQA commented Oct 21, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34699/

SparkQA commented Oct 21, 2020

Test build #130090 has finished for PR 30118 at commit c6d5b5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon

Merged to master.

@MaxGekk MaxGekk deleted the int96-rebase-benchmark branch December 11, 2020 20:28