
[GOBBLIN-1779] Ability to filter datasets that contain non optional unions#3648

Merged
Will-Lo merged 2 commits into apache:master from homatthew:mh-batch-complexunion-GOBBLIN-1779
Mar 9, 2023

Conversation

@homatthew
Contributor

@homatthew homatthew commented Feb 22, 2023

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Here are some details about my PR, including screenshots (if applicable):
    Problem Statement:
    The goal is to make it easy for users to filter out datasets that contain non-optional unions. Avro's and ORC's flexible schemas support non-optional unions, but Iceberg does not. This can cause mismatched schemas when the same data is written by different writers: e.g. the Iceberg writer will write a struct, while the native ORC writer will write a uniontype.
  • [GOBBLIN-1774] Util for detecting non optional uniontype columns based on Hive Table metadata #3632
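To make the distinction concrete, here is a minimal, hypothetical sketch (not the GOBBLIN-1774 util itself) of detecting uniontype columns from Hive type strings:

```java
import java.util.Arrays;
import java.util.List;

public class UnionCheck {
  // Hypothetical sketch: flags Hive columns whose type string declares a
  // uniontype, e.g. "uniontype<int,string>". The real detection in Gobblin
  // is based on Hive Table metadata (see GOBBLIN-1774).
  static boolean containsUnionType(List<String> columnTypes) {
    return columnTypes.stream().anyMatch(t -> t.contains("uniontype<"));
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList("int", "struct<a:int>", "uniontype<int,string>");
    System.out.println(containsUnionType(schema)); // prints "true"
  }
}
```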

Future users may want to implement similar complex filtering on the results of their dataset finders, and this filtering may be quite involved. The goal is to make it easy for users to build their own filtering predicates.
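The decorator idea described here can be sketched roughly as follows; the names are illustrative, not the actual Gobblin API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative sketch of a filtering decorator over a dataset finder.
interface Finder<T> {
  List<T> find();
}

class FilteringFinder<T> implements Finder<T> {
  private final Finder<T> delegate;
  private final List<Predicate<T>> allow;
  private final List<Predicate<T>> deny;

  FilteringFinder(Finder<T> delegate, List<Predicate<T>> allow, List<Predicate<T>> deny) {
    this.delegate = delegate;
    this.allow = allow;
    this.deny = deny;
  }

  @Override
  public List<T> find() {
    // Keep a dataset only if every allow-predicate passes and no deny-predicate matches.
    return delegate.find().stream()
        .filter(d -> allow.stream().allMatch(p -> p.test(d)))
        .filter(d -> deny.stream().noneMatch(p -> p.test(d)))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    Finder<Integer> base = () -> Arrays.asList(1, 2, 3, 4, 5, 6);
    Predicate<Integer> even = d -> d % 2 == 0;
    Predicate<Integer> big = d -> d > 4;
    Finder<Integer> filtered = new FilteringFinder<>(base, Arrays.asList(even), Arrays.asList(big));
    System.out.println(filtered.find()); // prints "[2, 4]"
  }
}
```

Because the predicates live outside any concrete finder, new filtering rules plug in without touching finder subclasses.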

Design:

Alternative designs:

  • Alternatively, I could have directly modified the relevant sources or dataset finders to do the filtering there. The problem with that approach is that we end up with class-specific configs that are niche and not generalizable.
  • There is also a lot of business logic involved in complex filtering. Adding this business logic to a specific dataset finder would create spaghetti super classes.
  • Gobblin is OSS and there are many internal implementations of dataset finders, each of which would require extra dev effort to implement filtering.

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    Hive metastore tests (screenshot)
    Dataset finder tests (screenshot)

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch 2 times, most recently from 6196a0a to 35c2cee Compare February 22, 2023 21:11
}

@SuppressWarnings("unchecked")
public static <T extends org.apache.gobblin.dataset.Dataset> DatasetsFinder<T> instantiateDatasetFinder(
Contributor Author

New method used in DatasetFinderFilteringDecorator to instantiate dataset finder
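A hedged sketch of what such a reflective instantiation helper can look like; the actual method in DatasetUtils may differ in signature:

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

public class FinderFactory {
  // Sketch of reflective instantiation. Assumes a single-argument
  // Properties constructor purely for illustration.
  @SuppressWarnings("unchecked")
  static <T> T instantiate(String className, Properties props) throws ReflectiveOperationException {
    Class<?> clazz = Class.forName(className);
    Constructor<?> ctor = clazz.getConstructor(Properties.class);
    return (T) ctor.newInstance(props);
  }

  public static void main(String[] args) throws ReflectiveOperationException {
    // java.util.Properties itself has a Properties(Properties) constructor,
    // so it works as a stand-in target here.
    Properties p = instantiate("java.util.Properties", new Properties());
    System.out.println(p.getClass().getName()); // prints "java.util.Properties"
  }
}
```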

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 35c2cee to 3966509 Compare February 22, 2023 21:23
stopMetastore();
}
@BeforeClass
@BeforeSuite
Contributor Author

@BeforeClass causes some strange flakiness in when this executes. @BeforeSuite and @AfterSuite give the correct behavior (they act as the entry point and exit point for all the test code that uses Hive).

import org.testng.annotations.Test;

@Slf4j
@Test(dependsOnGroups = "icebergMetadataWriterTest")
Contributor Author

Putting dependsOnGroups here allows the class to be run on its own while still being thread-safe with the other Hive metastore tests.

Adding a per-method depends would make it impossible to run the Iceberg tests without first running all of the Hive tests (setting a breakpoint in the underlying code then becomes a pain).


@AfterSuite
public void clean() throws Exception {
FileUtils.forceDeleteOnExit(tmpDir);
Contributor Author

Note: I don't need to call stopMetastore here because it is handled in the @AfterClass of HiveMetastoreTest.

}

private Optional<HiveTable> getTable(T dataset) throws IOException {
DbAndTable dbAndTable = getDbAndTable(dataset);
Contributor Author

In initial iterations, I tried using the HiveRegistrationPolicy to get the HiveSpec based on the dataset URN. But a dataset URN isn't necessarily a path; its shape depends on the underlying dataset.

So I expect the user to configure the expected pattern for extracting the db and table.

Contributor Author

We do something similar in compaction (except hard coded)
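A rough sketch of the configured-pattern approach described above; the regex and URN shapes are illustrative only:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DbTableExtractor {
  private final Pattern pattern;

  public DbTableExtractor(String regex) {
    // Groups 1 and 2 are expected to capture db and table respectively.
    this.pattern = Pattern.compile(regex);
  }

  public Optional<String[]> extract(String urn) {
    Matcher m = pattern.matcher(urn);
    return m.matches() ? Optional.of(new String[]{m.group(1), m.group(2)}) : Optional.empty();
  }

  public static void main(String[] args) {
    DbTableExtractor extractor = new DbTableExtractor("/data/([^/]+)/([^/]+)");
    extractor.extract("/data/mydb/mytable")
        .ifPresent(dt -> System.out.println(dt[0] + "." + dt[1])); // prints "mydb.mytable"
  }
}
```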

@homatthew homatthew marked this pull request as ready for review February 22, 2023 21:30
@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 3966509 to 181b2cc Compare February 22, 2023 22:23
* </ul>
*/
@FunctionalInterface
public interface CheckedExceptionPredicate<T, E extends Exception> {
Contributor

Overall a good thought. Should we consider existing Predicate interfaces in Guava or other util libraries?

Contributor Author

Existing interfaces don't cover this case AFAIK (but more than happy to take suggestions). And this is a pretty common workaround.
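For reference, the common workaround looks roughly like this; a sketch, not the exact CheckedExceptionPredicate in this PR:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.Predicate;

// A predicate that may throw a checked exception, plus a wrapper that
// "tunnels" it through an unchecked one so the predicate can be used
// inside streams.
@FunctionalInterface
interface CheckedPredicate<T> {
  boolean test(T t) throws IOException;

  static <T> Predicate<T> wrapToTunneled(CheckedPredicate<T> wrapped) {
    return t -> {
      try {
        return wrapped.test(t);
      } catch (IOException e) {
        // The caller catches UncheckedIOException at the stream boundary
        // and unwraps the original IOException.
        throw new UncheckedIOException(e);
      }
    };
  }
}
```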

@Override
public List<T> findDatasets() throws IOException {
List<T> datasets = datasetFinder.findDatasets();
List<T> allowedDatasets = null;
Contributor

Instantiate with an empty list to avoid an NPE.

Contributor Author

In hindsight, does this value really matter? We could just return allowedDatasets as-is.

But to answer your question: we can never return null from this method. We either throw an exception or allowedDatasets is overwritten by datasets (which cannot be null).

@homatthew homatthew force-pushed the mh-batch-complexunion-GOBBLIN-1779 branch from 181b2cc to f260386 Compare March 9, 2023 00:25
@codecov-commenter

codecov-commenter commented Mar 9, 2023

Codecov Report

Merging #3648 (f260386) into master (26d6ed3) will increase coverage by 0.06%.
The diff coverage is 64.47%.

@@             Coverage Diff              @@
##             master    #3648      +/-   ##
============================================
+ Coverage     46.91%   46.98%   +0.06%     
- Complexity    10756    10776      +20     
============================================
  Files          2135     2138       +3     
  Lines         83834    83919      +85     
  Branches       9320     9324       +4     
============================================
+ Hits          39332    39428      +96     
+ Misses        40933    40920      -13     
- Partials       3569     3571       +2     
Impacted Files Coverage Δ
...bblin/util/function/CheckedExceptionPredicate.java 0.00% <0.00%> (ø)
.../gobblin/data/management/dataset/DatasetUtils.java 53.12% <66.66%> (+1.51%) ⬆️
...tes/DatasetHiveSchemaContainsNonOptionalUnion.java 70.83% <70.83%> (ø)
...ment/dataset/DatasetsFinderFilteringDecorator.java 88.23% <88.23%> (ø)
.../org/apache/gobblin/metrics/RootMetricContext.java 79.68% <0.00%> (+1.56%) ⬆️
...main/java/org/apache/gobblin/yarn/YarnService.java 28.63% <0.00%> (+5.57%) ⬆️
...a/org/apache/gobblin/cluster/GobblinHelixTask.java 83.87% <0.00%> (+19.35%) ⬆️


@vikrambohra
Contributor

+1 LGTM

Comment on lines +74 to +81
allowedDatasets = datasets.parallelStream()
.filter(dataset -> allowDatasetPredicates.stream()
.map(CheckedExceptionPredicate::wrapToTunneled)
.allMatch(p -> p.test(dataset)))
.filter(dataset -> denyDatasetPredicates.stream()
.map(CheckedExceptionPredicate::wrapToTunneled)
.noneMatch(predicate -> predicate.test(dataset)))
.collect(Collectors.toList());
Contributor

I like how clean this is :) and it's parallel, which is pretty nice. I recall being told early on that Java 8 streams use more memory than traditional looping, so just be wary of that.

Contributor

Also this is on the dataset level not the file level so it shouldn't be an issue

Contributor Author

Good callout about the drawbacks of streams. In general, I don't think this code is a hot spot, so I am okay with being addicted to the syntactic sugar until the profiler shows otherwise.

Something something premature optimization is the root of all evil. I think even our largest use case is in the cardinality of thousands, so it should be okay for this specific piece.
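For comparison, a plain-loop equivalent of the allow/deny stream pipeline (with illustrative names) that avoids the stream machinery:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class LoopFilter {
  // Same semantics as the stream pipeline: keep a dataset only if every
  // allow-predicate passes and no deny-predicate matches.
  static <T> List<T> filter(List<T> datasets, List<Predicate<T>> allow, List<Predicate<T>> deny) {
    List<T> allowed = new ArrayList<>();
    outer:
    for (T dataset : datasets) {
      for (Predicate<T> p : allow) {
        if (!p.test(dataset)) continue outer; // failed an allow-predicate
      }
      for (Predicate<T> p : deny) {
        if (p.test(dataset)) continue outer; // matched a deny-predicate
      }
      allowed.add(dataset);
    }
    return allowed;
  }

  public static void main(String[] args) {
    Predicate<Integer> even = d -> d % 2 == 0;
    Predicate<Integer> big = d -> d > 4;
    System.out.println(filter(Arrays.asList(1, 2, 3, 4, 5, 6),
        Arrays.asList(even), Arrays.asList(big))); // prints "[2, 4]"
  }
}
```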


4 participants