
[GOBBLIN-1779] Ability to filter datasets that contain non optional unions#3648

Merged
Will-Lo merged 2 commits into apache:master from homatthew:mh-batch-complexunion-GOBBLIN-1779
Mar 9, 2023

Conversation

@homatthew
Contributor

@homatthew homatthew commented Feb 22, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Problem Statement:
    The goal is to make it easy for users to filter out datasets that contain non-optional unions. Avro's and ORC's flexible schemas support non-optional unions, but Iceberg does not. This can cause mismatched schemas when the same data is written by different writers: e.g. the Iceberg writer will write a struct, while the native ORC writer will write a uniontype.
  • [GOBBLIN-1774] Util for detecting non optional uniontype columns based on Hive Table metadata #3632
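To make the distinction concrete, here is a minimal, hypothetical sketch (not the GOBBLIN-1774 util itself) of detecting uniontype columns from Hive type strings:

```java
import java.util.Arrays;
import java.util.List;

public class UnionCheck {
  // Hypothetical sketch: flags Hive columns whose type string declares a
  // uniontype, e.g. "uniontype<int,string>". The real detection in Gobblin
  // is based on Hive Table metadata (see GOBBLIN-1774).
  static boolean containsUnionType(List<String> columnTypes) {
    return columnTypes.stream().anyMatch(t -> t.contains("uniontype<"));
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("int", "struct<a:int>", "uniontype<int,string>");
    System.out.println(containsUnionType(schema)); // prints "true"
  }
}
```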

Future users may want to implement similar complex filtering on the results of their dataset finders, and this filtering may be quite involved. The goal is to make it easy for users to build their own filtering predicates.
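The decorator idea described here can be sketched roughly as follows; the names are illustrative, not the actual Gobblin API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch of a filtering decorator over a dataset finder.
interface Finder<T> {
  List<T> find();
}

class FilteringFinder<T> implements Finder<T> {
  private final Finder<T> delegate;
  private final List<Predicate<T>> allow;
  private final List<Predicate<T>> deny;

  FilteringFinder(Finder<T> delegate, List<Predicate<T>> allow, List<Predicate<T>> deny) {
    this.delegate = delegate;
    this.allow = allow;
    this.deny = deny;
  }

  @Override
  public List<T> find() {
    // Keep a dataset only if every allow-predicate passes and no deny-predicate matches.
    return delegate.find().stream()
        .filter(d -> allow.stream().allMatch(p -> p.test(d)))
        .filter(d -> deny.stream().noneMatch(p -> p.test(d)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Finder<Integer> base = () -> Arrays.asList(1, 2, 3, 4, 5, 6);
    Predicate<Integer> even = d -> d % 2 == 0;
    Predicate<Integer> big = d -> d > 4;
    Finder<Integer> filtered = new FilteringFinder<>(base, Arrays.asList(even), Arrays.asList(big));
    System.out.println(filtered.find()); // prints "[2, 4]"
  }
}
```

Because the predicates live outside any concrete finder, new filtering rules plug in without touching finder subclasses.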

Design:

Alternative designs:

  • Alternatively, I could have directly modified the relevant sources or dataset finders to do the filtering there. The problem with that approach is that we end up with class-specific configs that are niche and not generalizable.
  • There is also a lot of business logic involved in complex filtering. Adding this business logic to a specific dataset finder would create spaghetti super classes.
  • Gobblin is OSS and there are many internal implementations of dataset finders, each of which would require extra dev effort to implement filtering.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Hive metastore tests (screenshot)
    Dataset finder tests (screenshot)

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch 2 times, most recently from 6196a0a to 35c2cee Compare February 22, 2023 21:11
}

@SuppressWarnings("unchecked")
public static <T extends org.apache.gobblin.dataset.Dataset> DatasetsFinder<T> instantiateDatasetFinder(
Contributor Author

New method used in DatasetFinderFilteringDecorator to instantiate dataset finder
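A hedged sketch of what such a reflective instantiation helper can look like; the actual method in DatasetUtils may differ in signature:

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

public class FinderFactory {
  // Sketch of reflective instantiation. Assumes a single-argument
  // Properties constructor purely for illustration.
  @SuppressWarnings("unchecked")
  static <T> T instantiate(String className, Properties props) throws ReflectiveOperationException {
    Class<?> clazz = Class.forName(className);
    Constructor<?> ctor = clazz.getConstructor(Properties.class);
    return (T) ctor.newInstance(props);
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    // java.util.Properties itself has a Properties(Properties) constructor,
    // so it works as a stand-in target here.
    Properties p = instantiate("java.util.Properties", new Properties());
    System.out.println(p.getClass().getName()); // prints "java.util.Properties"
  }
}
```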

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 35c2cee to 3966509 Compare February 22, 2023 21:23
stopMetastore();
}
@BeforeClass
@BeforeSuite
Contributor Author

@BeforeClass causes some strange flakiness in when this executes. @BeforeSuite and @AfterSuite give the correct behavior (they act as the entry point and exit point for all the test code that uses Hive).

import org.testng.annotations.Test;

@Slf4j
@Test(dependsOnGroups = "icebergMetadataWriterTest")
Contributor Author

Putting dependsOnGroups here allows the class to be run on its own while still being thread-safe with the other Hive metastore tests.

Adding a per-method depends would make it impossible to run the Iceberg tests without first running all of the Hive tests (setting a breakpoint in the underlying code then becomes a pain).


@AfterSuite
public void clean() throws Exception {
FileUtils.forceDeleteOnExit(tmpDir);
Contributor Author

Note: I don't need to call stopMetastore here because it is handled in the @AfterClass of HiveMetastoreTest.

}

private Optional<HiveTable> getTable(T dataset) throws IOException {
DbAndTable dbAndTable = getDbAndTable(dataset);
Contributor Author

In initial iterations, I tried using the HiveRegistrationPolicy to get the HiveSpec based on the dataset URN. But a dataset URN isn't necessarily a path; its shape depends on the underlying dataset.

So I expect the user to configure the expected pattern for extracting the db and table.

Contributor Author

We do something similar in compaction (except hard coded)
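A rough sketch of the configured-pattern approach described above; the regex and URN shapes are illustrative only:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DbTableExtractor {
  private final Pattern pattern;

  public DbTableExtractor(String regex) {
    // Groups 1 and 2 are expected to capture db and table respectively.
    this.pattern = Pattern.compile(regex);
  }

  public Optional<String[]> extract(String urn) {
    Matcher m = pattern.matcher(urn);
    return m.matches() ? Optional.of(new String[]{m.group(1), m.group(2)}) : Optional.empty();
  }

  public static void main(String[] args) {
    DbTableExtractor extractor = new DbTableExtractor("/data/([^/]+)/([^/]+)");
    extractor.extract("/data/mydb/mytable")
        .ifPresent(dt -> System.out.println(dt[0] + "." + dt[1])); // prints "mydb.mytable"
  }
}
```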

@homatthew homatthew marked this pull request as ready for review February 22, 2023 21:30
@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 3966509 to 181b2cc Compare February 22, 2023 22:23
* </ul>
*/
@FunctionalInterface
public interface CheckedExceptionPredicate<T, E extends Exception> {
Contributor

Overall a good thought. Should we consider existing Predicate interfaces in Guava or other util libraries?

Contributor Author

Existing interfaces don't cover this case AFAIK (but more than happy to take suggestions). And this is a pretty common workaround.
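For reference, the common workaround looks roughly like this; a sketch, not the exact CheckedExceptionPredicate in this PR:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Predicate;

// A predicate that may throw a checked exception, plus a wrapper that
// "tunnels" it through an unchecked one so the predicate can be used
// inside streams.
@FunctionalInterface
interface CheckedPredicate<T> {
  boolean test(T t) throws IOException;

  static <T> Predicate<T> wrapToTunneled(CheckedPredicate<T> wrapped) {
    return t -> {
      try {
        return wrapped.test(t);
      } catch (IOException e) {
        // The caller catches UncheckedIOException at the stream boundary
        // and unwraps the original IOException.
        throw new UncheckedIOException(e);
      }
    };
  }
}
```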

@Override
public List<T> findDatasets() throws IOException {
List<T> datasets = datasetFinder.findDatasets();
List<T> allowedDatasets = null;
Contributor

Instantiate with an empty list to avoid an NPE.

Contributor Author

In hindsight, does this value really matter? We could just return allowedDatasets as-is.

But to answer your question: we can never return null from this method. We either throw an exception or allowedDatasets is overwritten by datasets (which cannot be null).

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 181b2cc to f260386 Compare March 9, 2023 00:25
@codecov-commenter

codecov-commenter commented Mar 9, 2023

Codecov Report

Merging #3648 (f260386) into master (26d6ed3) will increase coverage by 0.06%.
The diff coverage is 64.47%.

@@             Coverage Diff              @@
##             master    #3648      +/-   ##
============================================
+ Coverage     46.91%   46.98%   +0.06%     
- Complexity    10756    10776      +20     
============================================
  Files          2135     2138       +3     
  Lines         83834    83919      +85     
  Branches       9320     9324       +4     
============================================
+ Hits          39332    39428      +96     
+ Misses        40933    40920      -13     
- Partials       3569     3571       +2     
Impacted Files Coverage Δ
...bblin/util/function/CheckedExceptionPredicate.java 0.00% <0.00%> (ø)
.../gobblin/data/management/dataset/DatasetUtils.java 53.12% <66.66%> (+1.51%) ⬆️
...tes/DatasetHiveSchemaContainsNonOptionalUnion.java 70.83% <70.83%> (ø)
...ment/dataset/DatasetsFinderFilteringDecorator.java 88.23% <88.23%> (ø)
.../org/apache/gobblin/metrics/RootMetricContext.java 79.68% <0.00%> (+1.56%) ⬆️
...main/java/org/apache/gobblin/yarn/YarnService.java 28.63% <0.00%> (+5.57%) ⬆️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 83.87% <0.00%> (+19.35%) ⬆️


@vikrambohra
Contributor

+1 LGTM

Comment on lines +74 to +81
allowedDatasets = datasets.parallelStream()
.filter(dataset -> allowDatasetPredicates.stream()
.map(CheckedExceptionPredicate::wrapToTunneled)
.allMatch(p -> p.test(dataset)))
.filter(dataset -> denyDatasetPredicates.stream()
.map(CheckedExceptionPredicate::wrapToTunneled)
.noneMatch(predicate -> predicate.test(dataset)))
.collect(Collectors.toList());
Contributor

I like how clean this is :) and it's parallel, which is pretty nice. I recall being told early on that Java 8 streams use more memory than traditional looping, so just be wary of that.

Contributor

Also this is on the dataset level not the file level so it shouldn't be an issue

Contributor Author

Good callout about the drawbacks of streams. In general, I don't think this code is a hot spot, so I am okay with being addicted to the syntactic sugar until the profiler shows otherwise.

Something something premature optimization is the root of all evil. I think even our largest use case is in the cardinality of thousands, so it should be okay for this specific piece.
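For comparison, a plain-loop equivalent of the allow/deny stream pipeline (with illustrative names) that avoids the stream machinery:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class LoopFilter {
  // Same semantics as the stream pipeline: keep a dataset only if every
  // allow-predicate passes and no deny-predicate matches.
  static <T> List<T> filter(List<T> datasets, List<Predicate<T>> allow, List<Predicate<T>> deny) {
    List<T> allowed = new ArrayList<>();
    outer:
    for (T dataset : datasets) {
      for (Predicate<T> p : allow) {
        if (!p.test(dataset)) continue outer; // failed an allow-predicate
      }
      for (Predicate<T> p : deny) {
        if (p.test(dataset)) continue outer; // matched a deny-predicate
      }
      allowed.add(dataset);
    }
    return allowed;
  }

  public static void main(String[] args) {
    Predicate<Integer> even = d -> d % 2 == 0;
    Predicate<Integer> big = d -> d > 4;
    System.out.println(filter(Arrays.asList(1, 2, 3, 4, 5, 6),
        Arrays.asList(even), Arrays.asList(big))); // prints "[2, 4]"
  }
}
```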


4 participants