Skip to content

Match hive support for type widening in the hive connector#19983

Merged
tdcmeehan merged 1 commit intoprestodb:masterfrom
mblanco-denodo:issue_19877
Oct 4, 2023
Merged

Match hive support for type widening in the hive connector#19983
tdcmeehan merged 1 commit intoprestodb:masterfrom
mblanco-denodo:issue_19877

Conversation

@mblanco-denodo
Copy link
Contributor

Test plan - Added test to hive module. Executed hive and parquet tests locally.

== RELEASE NOTES ==

General Changes
* Added type widening support to the parquet column readers (int, bigint, and float) to be able to perform type conversions between inmediate numerical types
* Updated docs accordingly (hive connector - schema evolution)

Hive Changes
* Changes in ParquetPageSourceFactory so that the type-check does not fail with the new widenings supported

@mblanco-denodo mblanco-denodo requested review from a team and shangxinli as code owners June 26, 2023 15:22
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 26, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: mblanco-denodo / name: Miguel Blanco Godón (6ae6a33)

@tdcmeehan
Copy link
Contributor

High level seems ok. Please squash commits, and please read the following guidelines for commit message structure and content. I'll follow up with a more detailed review early next week.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (docs)

@github-actions
Copy link

github-actions bot commented Sep 4, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 7b95c5b...6ae6a33.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/connector/hive.rst

Copy link
Contributor

@tdcmeehan tdcmeehan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look good logically, but I'm worried about hot path branch mispredictions in the readValue method. Can you run some benchmarks in BenchmarkParquetReader and see if you find a difference in performance?

@mblanco-denodo
Copy link
Contributor Author

These changes look good logically, but I'm worried about hot path branch mispredictions in the readValue method. Can you run some benchmarks in BenchmarkParquetReader and see if you find a difference in performance?

I executed them in master:

# Run complete. Total time: 02:12:51

 

Benchmark                                                 Mode  Cnt   Score   Error  Units
BenchmarkParquetReader.readBooleanNoNull                  avgt   60   0.839 ± 0.033  ns/op
BenchmarkParquetReader.readBooleanWithNull                avgt   60   8.098 ± 0.156  ns/op
BenchmarkParquetReader.readInt32NoNull                    avgt   60   4.909 ± 0.060  ns/op
BenchmarkParquetReader.readInt32WithNull                  avgt   60  10.017 ± 0.105  ns/op
BenchmarkParquetReader.readInt64NoNull                    avgt   60  11.138 ± 0.142  ns/op
BenchmarkParquetReader.readInt64WithNull                  avgt   60  12.864 ± 0.222  ns/op
BenchmarkParquetReader.readInt96NoNull                    avgt   60  29.110 ± 0.395  ns/op
BenchmarkParquetReader.readInt96WithNull                  avgt   60  23.407 ± 0.729  ns/op
BenchmarkParquetReader.readListBooleanWithNull            avgt   60  74.151 ± 1.453  ns/op
BenchmarkParquetReader.readListInt32WithNull              avgt   60  73.887 ± 0.813  ns/op
BenchmarkParquetReader.readListInt64WithNull              avgt   60  78.858 ± 0.996  ns/op
BenchmarkParquetReader.readListInt96WithNull              avgt   60  91.713 ± 1.235  ns/op
BenchmarkParquetReader.readListSliceDictionaryWithNull    avgt   60  79.282 ± 1.215  ns/op
BenchmarkParquetReader.readSliceDictionaryNoNull          avgt   60   8.132 ± 0.118  ns/op
BenchmarkParquetReader.readSliceDictionaryWithNull        avgt   60  11.716 ± 0.200  ns/op
BenchmarkParquetReader.readStructBooleanWithNull          avgt   60  17.400 ± 0.291  ns/op
BenchmarkParquetReader.readStructInt32WithNull            avgt   60  21.285 ± 0.298  ns/op
BenchmarkParquetReader.readStructInt64WithNull            avgt   60  24.061 ± 0.512  ns/op
BenchmarkParquetReader.readStructInt96WithNull            avgt   60  33.878 ± 0.953  ns/op
BenchmarkParquetReader.readStructSliceDictionaryWithNull  avgt   60  24.537 ± 0.467  ns/op

and in the branch:

# Run complete. Total time: 02:09:09

 

Benchmark                                                 Mode  Cnt   Score   Error  Units
BenchmarkParquetReader.readBooleanNoNull                  avgt   60   0.819 ± 0.011  ns/op
BenchmarkParquetReader.readBooleanWithNull                avgt   60   7.668 ± 0.141  ns/op
BenchmarkParquetReader.readInt32NoNull                    avgt   60   3.883 ± 0.036  ns/op
BenchmarkParquetReader.readInt32WithNull                  avgt   60   8.919 ± 0.148  ns/op
BenchmarkParquetReader.readInt64NoNull                    avgt   60   8.779 ± 0.114  ns/op
BenchmarkParquetReader.readInt64WithNull                  avgt   60  11.474 ± 0.160  ns/op
BenchmarkParquetReader.readInt96NoNull                    avgt   60  25.877 ± 0.329  ns/op
BenchmarkParquetReader.readInt96WithNull                  avgt   60  20.038 ± 0.286  ns/op
BenchmarkParquetReader.readListBooleanWithNull            avgt   60  70.582 ± 0.861  ns/op
BenchmarkParquetReader.readListInt32WithNull              avgt   60  75.856 ± 1.109  ns/op
BenchmarkParquetReader.readListInt64WithNull              avgt   60  79.186 ± 1.017  ns/op
BenchmarkParquetReader.readListInt96WithNull              avgt   60  90.857 ± 1.286  ns/op
BenchmarkParquetReader.readListSliceDictionaryWithNull    avgt   60  81.905 ± 3.759  ns/op
BenchmarkParquetReader.readSliceDictionaryNoNull          avgt   60   8.467 ± 0.161  ns/op
BenchmarkParquetReader.readSliceDictionaryWithNull        avgt   60  11.963 ± 0.370  ns/op
BenchmarkParquetReader.readStructBooleanWithNull          avgt   60  17.610 ± 0.288  ns/op
BenchmarkParquetReader.readStructInt32WithNull            avgt   60  22.664 ± 0.393  ns/op
BenchmarkParquetReader.readStructInt64WithNull            avgt   60  25.653 ± 0.627  ns/op
BenchmarkParquetReader.readStructInt96WithNull            avgt   60  32.365 ± 0.429  ns/op
BenchmarkParquetReader.readStructSliceDictionaryWithNull  avgt   60  26.756 ± 0.511  ns/op

The run times appear to be within expectations considering the nature of the tests. Let me know if this should be enough or I should execute each one several times and get an average to be more confident about the changes.

@tdcmeehan
Copy link
Contributor

I am satisfied with it. @shangxinli any concerns before I merge?

@shangxinli
Copy link
Collaborator

The added binary files are concerns. They are not easy to maintain to keep test reliable. Can you use parquet writer to generate those files at runtime?

Add automatic type conversions consuming parquet files from
the hive connector. Conversions are not bidirectional:

     - integer -> bigint -> float -> double

     - bigint -> float -> double

     - float -> double

Add automated testing

Changes in TestHiveFileFormat as now certain tests (double column
reading float value) should not fail

Update doc

Resolves: prestodb#19983
@mblanco-denodo
Copy link
Contributor Author

Changed tests to generate parquet files at runtime as suggested

@tdcmeehan tdcmeehan merged commit aaf917d into prestodb:master Oct 4, 2023
@mbasmanova
Copy link
Contributor

I'm seeing compilation failures that seem to be caused by this change.

https://github.com/prestodb/presto/actions/runs/6417847641/job/17424417483?pr=21042

Error:  Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.8.0:testCompile (default-testCompile) on project presto-hive: Compilation failure: Compilation failure: 
Error:  /home/runner/work/presto/presto/presto-hive/src/test/java/com/facebook/presto/hive/TestHiveTypeWidening.java:[120,22] method writeParquetFileFromPresto in class com.facebook.presto.hive.parquet.ParquetTester cannot be applied to given types;
Error:    required: java.io.File,java.util.List<com.facebook.presto.common.type.Type>,java.util.List<java.lang.String>,java.lang.Iterable<?>[],int,org.apache.parquet.hadoop.metadata.CompressionCodecName,org.apache.parquet.column.ParquetProperties.WriterVersion
Error:    found: java.io.File,java.util.List<com.facebook.presto.common.type.Type>,java.util.List<java.lang.String>,java.lang.Iterable[],int,org.apache.parquet.hadoop.metadata.CompressionCodecName
Error:    reason: actual and formal argument lists differ in length
Error:  /home/runner/work/presto/presto/presto-hive/src/test/java/com/facebook/presto/hive/TestHiveTypeWidening.java:[128,22] method writeParquetFileFromPresto in class com.facebook.presto.hive.parquet.ParquetTester cannot be applied to given types;
Error:    required: java.io.File,java.util.List<com.facebook.presto.common.type.Type>,java.util.List<java.lang.String>,java.lang.Iterable<?>[],int,org.apache.parquet.hadoop.metadata.CompressionCodecName,org.apache.parquet.column.ParquetProperties.WriterVersion
Error:    found: java.io.File,java.util.List<com.facebook.presto.common.type.Type>,java.util.List<java.lang.String>,java.lang.Iterable[],int,org.apache.parquet.hadoop.metadata.CompressionCodecName
Error:    reason: actual and formal argument lists differ in length
Error:  -> [Help 1]

tdcmeehan added a commit to tdcmeehan/presto that referenced this pull request Oct 5, 2023
prestodb#20957 and prestodb#19983 were merged at similar times, and when taken
together broke the build (as changes required by prestodb#20957 aren't
present in prestodb#19983).
@tdcmeehan
Copy link
Contributor

I raised #21043 to fix the build, it appears #20957 wasn't rebased prior to merge.

tdcmeehan added a commit that referenced this pull request Oct 5, 2023
#20957 and #19983 were merged at similar times, and when taken
together broke the build (as changes required by #20957 aren't
present in #19983).
@wanglinsong wanglinsong mentioned this pull request Dec 8, 2023
26 tasks
kaikalur pushed a commit to kaikalur/presto that referenced this pull request Mar 14, 2024
prestodb#20957 and prestodb#19983 were merged at similar times, and when taken
together broke the build (as changes required by prestodb#20957 aren't
present in prestodb#19983).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants