Skip to content

Support reading uniontype as struct from Avro/ORC Hive tables#3483

Merged
dain merged 1 commit intotrinodb:masterfrom
lxynov:uniontype
Jun 7, 2020
Merged

Support reading uniontype as struct from Avro/ORC Hive tables#3483
dain merged 1 commit intotrinodb:masterfrom
lxynov:uniontype

Conversation

@lxynov
Copy link
Copy Markdown
Member

@lxynov lxynov commented Apr 20, 2020

Reading uniontypes by converting them into structs. Take type
"uniontype<int, double>" as an example:

  1. It will be regarded as "struct<tag int, field0 int, field1 string>"
  2. Data {1: 'hello'}, {0: 312}, {1: 'world'} will be read as [1, NULL,
    'hello'], [0, 312, NULL], [1, NULL, 'world']

Writing into uniontypes remains unsupported.

(Note: support for Parquet is not added because Parquet itself doesn't support union types yet.)

Closes #1751

@cla-bot cla-bot bot added the cla-signed label Apr 20, 2020
@lxynov lxynov requested review from dain and ebyhr April 20, 2020 01:22
@electrum
Copy link
Copy Markdown
Member

Starting the names at “field1” would be more consistent with SQL one-based numbering.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining whyh this is storage format dependent

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've added such a comment

@lxynov lxynov force-pushed the uniontype branch 4 times, most recently from 354e320 to 6d54bec Compare April 22, 2020 17:20
@lxynov
Copy link
Copy Markdown
Member Author

lxynov commented Apr 22, 2020

Starting the names at “field1” would be more consistent with SQL one-based numbering.

@electrum Thanks for the suggestion, but the tag field starts at zero in Hive. If we start the names at "field1", we'll also need to make tag start at 1 to reduce confusion? But this will be inconsistent with Hive. What do you think?

@electrum
Copy link
Copy Markdown
Member

That makes sense. Let’s keep it consistent with Hive.

Copy link
Copy Markdown
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. One comment about using dictionary block instead of copying the data. Also, some union tests to AbstractTestHiveFileFormats?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying the data use a dictionary block.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dain Thanks for the pointer, but I didn't find a good way to handle NULLs. Is there a way to append a NULL to the raw block so that it can serve as a dictionary?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. This is the same problem we had with unnest. In that case, we scanned the block for a null and if present, we used that; otherwise we copied. We can leave this for now.

Copy link
Copy Markdown
Member

@dain dain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know when it is updated

Reading uniontypes by converting them into structs. Take type
"uniontype<int, double>" as an example:
1. It will be regarded as "struct<tag int, field0 int, field1 string>"
2. Data {1: 'hello'}, {0: 312}, {1: 'world'} will be read as [1, NULL,
   'hello'], [0, 312, NULL], [1, NULL, 'world']

Writing into uniontypes remains unsupported.
@lxynov
Copy link
Copy Markdown
Member Author

lxynov commented May 17, 2020

@dain Thanks for the review! I've rebased it to the latest master.

Also, some union tests to AbstractTestHiveFileFormats?

AbstractTestHiveFileFormats uses serializeObject to build write objects so union types will be written as structs eventually. So I didn't add these tests. Only a product test TestReadUniontype was added which uses HQLs to write union types.

@dain dain self-requested a review May 19, 2020 03:36
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes. This is the same problem we had with unnest. In that case, we scanned the block for a null and if present, we used that; otherwise we copied. We can leave this for now.

@dain dain merged commit e3d798d into trinodb:master Jun 7, 2020
@dain dain mentioned this pull request Jun 8, 2020
9 tasks
@lxynov lxynov deleted the uniontype branch June 12, 2020 22:25
zhenxiao pushed a commit to prestodb/presto that referenced this pull request Jul 15, 2021
Cherry-pick of trinodb/trino#1067, trinodb/trino#2042, trinodb/trino#4055, trinodb/trino#1629, trinodb/trino#3483

Co-authored-by: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
Co-authored-by: David Phillips <david@acz.org>
Co-authored-by: Xingyuan Lin <linxingyuan1102@gmail.com>
Co-authored-by: Dain Sundstrom <dain@iq80.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Add support for UNIONTYPE in Hive

4 participants