Skip to content

[GOBBLIN-1774] Util for detecting non optional uniontype columns based on Hive Table metadata#3632

Merged
Will-Lo merged 1 commit intoapache:masterfrom
homatthew:mh-complex-union-util-GOBBLIN-1774
Feb 13, 2023
Merged

[GOBBLIN-1774] Util for detecting non optional uniontype columns based on Hive Table metadata#3632
Will-Lo merged 1 commit intoapache:masterfrom
homatthew:mh-complex-union-util-GOBBLIN-1774

Conversation

@homatthew
Copy link
Copy Markdown
Contributor

@homatthew homatthew commented Jan 31, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):

Problem Statement:

Non-optional union: A union type with more than one non-null option type
e.g. [“int”, “string”] or [“null”, “int”, “string”]
Alternate phrasing:
A schema where there 's a union type of >= 2 branches which are non null.
[int, null] - no
[int, string] - yes
[int, string, null] yes

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 31, 2023

Codecov Report

Merging #3632 (12318bf) into master (667d797) will increase coverage by 0.04%.
The diff coverage is 83.33%.

@@             Coverage Diff              @@
##             master    #3632      +/-   ##
============================================
+ Coverage     46.56%   46.61%   +0.04%     
- Complexity    10666    10707      +41     
============================================
  Files          2133     2133              
  Lines         83541    83612      +71     
  Branches       9288     9299      +11     
============================================
+ Hits          38905    38977      +72     
  Misses        41074    41074              
+ Partials       3562     3561       -1     
Impacted Files Coverage Δ
...che/gobblin/hive/metastore/HiveMetaStoreUtils.java 55.55% <83.33%> (+9.20%) ⬆️
...lin/util/filesystem/FileSystemInstrumentation.java 85.71% <0.00%> (-7.15%) ⬇️
...service/modules/orchestration/DagManagerUtils.java 83.47% <0.00%> (-6.09%) ⬇️
...che/gobblin/data/management/copy/CopyableFile.java 73.33% <0.00%> (-2.01%) ⬇️
...in/service/modules/orchestration/Orchestrator.java 52.73% <0.00%> (-0.88%) ⬇️
...blin/service/monitoring/KafkaJobStatusMonitor.java 45.98% <0.00%> (-0.29%) ⬇️
...lin/service/monitoring/SpecStoreChangeMonitor.java 0.00% <0.00%> (ø)
...ervice/modules/flow/BaseFlowToJobSpecCompiler.java 55.00% <0.00%> (ø)
...ervice/monitoring/DagActionStoreChangeMonitor.java 0.00% <0.00%> (ø)
...lin/data/management/copy/ManifestBasedDataset.java 51.78% <0.00%> (ø)
... and 12 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@homatthew homatthew force-pushed the mh-complex-union-util-GOBBLIN-1774 branch from bd2f2fc to 79323de Compare February 1, 2023 07:05
@homatthew homatthew marked this pull request as ready for review February 1, 2023 07:08
@homatthew homatthew force-pushed the mh-complex-union-util-GOBBLIN-1774 branch from 79323de to e7255cb Compare February 7, 2023 02:13
Copy link
Copy Markdown
Contributor

@vikrambohra vikrambohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix it and ship it!

* See https://github.com/apache/iceberg/issues/189
* Util for detecting if a table has a complex union (aka non-optional unions) column types.
*
* @param t
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Hive table t

return t.getColumns().stream()
.map(HiveRegistrationUnit.Column::getType)
.map(Object::toString)
.anyMatch(columnType -> columnType.contains("uniontype"));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to me we only check union type but not non-optional union type. Is this intentional? If so, please update the description to reflect that as well.

@homatthew homatthew changed the title [GOBBLIN-1774] Util for detecting non optional unions Hive tables [GOBBLIN-1774] Util for detecting non optional uniontype columns based on Hive Table metadata Feb 9, 2023
if (!isAvroFormat(hiveTable)) {
// All values in ORC are optional / nullable
return false;
if (hiveTable.getProps().contains("avro.schema.literal")) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is true for all tables written and managed by Gobblin

.map(HiveRegistrationUnit.Column::getType)
.map(Object::toString)
.anyMatch(columnType -> columnType.contains("uniontype") && !columnType.contains("void"));
if (isNonAvroFormat(hiveTable)) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fallback case if schema literal is not set. Where we can use the ORC type parser to determine if the column is a non-optional union.

This does not work if the underlying table is not ORC based

@homatthew homatthew force-pushed the mh-complex-union-util-GOBBLIN-1774 branch from fd8276c to 12318bf Compare February 10, 2023 18:53
@homatthew homatthew force-pushed the mh-complex-union-util-GOBBLIN-1774 branch from a4b426d to ab7ef60 Compare February 10, 2023 21:21
Copy link
Copy Markdown
Contributor

@vikrambohra vikrambohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

- This util will be used across GMIP and compaction to handle non optional unions
Non optional unions are compatible with Avro / Orc but not in Iceberg, so special
workarounds are necessary to have tables with both types of data
@homatthew homatthew force-pushed the mh-complex-union-util-GOBBLIN-1774 branch from ab7ef60 to 91aaa25 Compare February 11, 2023 01:52
@homatthew
Copy link
Copy Markdown
Contributor Author

Squashed

@Will-Lo Will-Lo merged commit 9bc9df5 into apache:master Feb 13, 2023
phet added a commit to phet/gobblin that referenced this pull request Feb 13, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
phet added a commit to phet/gobblin that referenced this pull request Mar 24, 2023
* upstream/master:
  [GOBBLIN-1774] Util for detecting non optional uniontypes Hive tables (apache#3632)
  [GOBBLIN-1773] Fix bugs in quota manager (apache#3636)
  [GOBBLIN-1782] Fix Merge State for Flow Pending Resume statuses (apache#3639)
  [GOBBLIN-1755] Support extended ACLs and sticky bit for file based distcp (apache#3616)
  [GOBBLIN-1780] Refactor/rename YarnServiceIT to YarnServiceTest (apache#3637)
  [GOBBLIN-1778] Add house keeping thread in DagManager to periodically sync in memory state with mysql table (apache#3635)
  Register gauge metrics for change monitors (apache#3634)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants