Conversation

@nsivabalan (Contributor)

  • Adding support for DynamicBloomFilter (link) to tune the bloom filter size based on the total number of entries.
    • Added a BloomFilter interface and two implementations: SimpleBloomFilter (the existing one) and HudiDynamicBloomFilter (the new one).
    • Added a BloomFilterFactory to assist in creating the right BloomFilter based on the version (sketched below).
    • The version is stored in the parquet metadata footer. If no version is found, a SimpleBloomFilter is created.
    • Introduced a config named "hoodie.bloom.index.auto.tune.enable" in HoodieIndexConfig which, when enabled, creates new bloom filters as HudiDynamicBloomFilter.
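A rough sketch of the version-based dispatch described above; the class and method names approximate the PR's design (the "DYNAMIC_V0" type code is an assumption here), not exact Hudi APIs:

```java
// Illustrative sketch, not Hudi's exact code: pick the filter implementation
// based on the type/version code read from the parquet metadata footer.
public class BloomFilterFactory {

  // versionFromFooter is whatever was read from the parquet footer,
  // or null for files written before filters were versioned.
  public static BloomFilter fromSerializedString(String serialized, String versionFromFooter) {
    if (versionFromFooter == null) {
      // No version recorded: fall back to the existing simple filter.
      return new SimpleBloomFilter(serialized);
    }
    switch (versionFromFooter) {
      case "DYNAMIC_V0":
        return new HudiDynamicBloomFilter(serialized);
      default:
        return new SimpleBloomFilter(serialized);
    }
  }
}
```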

@nsivabalan force-pushed the DynamicBloomFilter branch 2 times, most recently from 5f66849 to cd1199e on October 28, 2019 06:28
@vinothchandar (Member)

@nsivabalan can you rebase against master and see if that helps the test pass?

@vinothchandar (Member) left a comment

Left some comments... can we also add a test for the "dynamic" nature of the filter, e.g. having more entries should result in a larger filter with the same FP ratio.. And also, how are you enforcing a maximum dynamic bloom filter size? Can you share data on how big the bloom filter would be if you, say, wrote 1M keys at FPP ratio 10^-9?

@nsivabalan (Contributor, Author)

> @nsivabalan can you rebase against master and see if that helps the test pass?

Nope. I didn't import the checkstyle. Will fix it.

@nsivabalan (Contributor, Author)

Should I create a package for bloom.filter in org.apache.hudi.common and move all BloomFilter-related classes to it?

@vinothchandar (Member)

Consolidating into one package makes sense.

@nsivabalan (Contributor, Author)

nsivabalan commented Nov 1, 2019

> Left some comments... can we also add a test for the "dynamic" nature of the filter, e.g. having more entries should result in a larger filter with the same FP ratio.. And also, how are you enforcing a maximum dynamic bloom filter size? Can you share data on how big the bloom filter would be if you, say, wrote 1M keys at FPP ratio 10^-9?

A few questions/clarifications:

  • I guess you can't bound the size in a dynamic bloom filter. The size will grow according to the number of entries added. The initial number of entries passed in is used to set the minimum size.
  • I am trying to find ways to test the FP ratio. Not sure how you would test that.
  • I was able to verify that adding more entries to the filter than the initial size increases the size of the bloom filter.
  • Here are the sizes of the dynamic bloom filter with error rate 10^-9 and initial number of entries 10k:

    | Entries added | Size |
    | ------------- | ---- |
    | 100 | 71940 bytes ~= 71 KB |
    | 1000 | 71940 bytes ~= 71 KB |
    | 10000 | 71940 bytes ~= 71 KB |
    | 100000 | 719088 bytes ~= 720 KB |
    | 1000000 | 7190568 bytes ~= 7.1 MB |

  • Not sure we really need a (unit) test to ensure that the size grows as the number of entries added increases. The only assertion we can make is that the size is greater compared to when fewer entries are added.
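For reference, the textbook sizing of a bloom filter holding n keys at false-positive rate p is given by the standard formula below (the Hadoop implementation and the string serialization add overhead on top of this):

```latex
m = \frac{-n \ln p}{(\ln 2)^2},
\qquad n = 10^6,\; p = 10^{-9}
\;\Rightarrow\; m \approx \frac{10^6 \cdot 20.7}{0.4805}
\approx 4.3 \times 10^7 \text{ bits} \approx 5.4\ \text{MB}
```

which is in the same ballpark as the ~7.1 MB measured above.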

@vinothchandar (Member)

@nsivabalan Here is the problem as I see it, w.r.t. bounding size.. Currently we have a low default of 60K entries, which comes out to reading ~400kb from the parquet footer. Not too shabby an overhead.. My understanding is that the parquet footers are all read at once and even query engines would read the footer.. So if we don't bound the size of the dynamic bloom filter to say 1MB or so, queries can pay a penalty? (I don't know how big this would be or if it's okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need our own impl of dynamic bloom if we were to limit the size.. correct? How doable is that?

> I am trying to find ways to test the FP ratio. Not sure how you would test that.

The way I have done it in the past is to generate a lot of keys and hold them in two lists: added, notAdded.. I add the ones from added to the bloom filter and then check for false positives using the notAdded list.. the % of notAdded keys that had a hit is your FP. For this impl, we need to ensure that the FP ratio remains the same even as you increase the size of the added/notAdded lists..
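A minimal sketch of that approach, assuming a filter type with add/mightContain methods (the names below are illustrative, not Hudi's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical filter interface, just for this sketch.
interface KeyFilter {
  void add(String key);
  boolean mightContain(String key);
}

public class FpRatioCheck {
  // Observed FP rate: fraction of never-added keys that still hit the filter.
  static double observedFpRate(KeyFilter filter, int numKeys) {
    List<String> added = new ArrayList<>();
    List<String> notAdded = new ArrayList<>();
    for (int i = 0; i < numKeys; i++) {
      added.add("in-" + UUID.randomUUID());     // goes into the filter
      notAdded.add("out-" + UUID.randomUUID()); // never added; disjoint by prefix
    }
    added.forEach(filter::add);
    long hits = notAdded.stream().filter(filter::mightContain).count();
    return (double) hits / numKeys;
  }
}
```

Repeating the measurement with a larger numKeys should leave the returned rate roughly constant if the filter grows correctly.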

Other small points:

  • If we don't have a way to configure the bloom filter type to use, we should add one.
  • We should reconsider whether the default error rate here should be 10^-9; relaxing it will also help reduce the size.. we already have techniques like range pruning to reduce the amount of comparisons.. Assuming even a large 100M entries inserted into a partition, if the bloom filter had 10^-8, it might be enough to prevent false positives, right? I guess this will drop the storage needed considerably?

@nsivabalan (Contributor, Author)

> @nsivabalan Here is the problem as I see it, w.r.t. bounding size.. Currently we have a low default of 60K entries, which comes out to reading ~400kb from the parquet footer. Not too shabby an overhead.. My understanding is that the parquet footers are all read at once and even query engines would read the footer.. So if we don't bound the size of the dynamic bloom filter to say 1MB or so, queries can pay a penalty? (I don't know how big this would be or if it's okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need our own impl of dynamic bloom if we were to limit the size.. correct? How doable is that?
>
> > I am trying to find ways to test the FP ratio. Not sure how you would test that.
>
> The way I have done it in the past is to generate a lot of keys and hold them in two lists: added, notAdded.. I add the ones from added to the bloom filter and then check for false positives using the notAdded list.. the % of notAdded keys that had a hit is your FP. For this impl, we need to ensure that the FP ratio remains the same even as you increase the size of the added/notAdded lists..
>
> Other small points:
>
> • If we don't have a way to configure the bloom filter type to use, we should add one.
> • We should reconsider whether the default error rate here should be 10^-9; relaxing it will also help reduce the size.. we already have techniques like range pruning to reduce the amount of comparisons.. Assuming even a large 100M entries inserted into a partition, if the bloom filter had 10^-8, it might be enough to prevent false positives, right? I guess this will drop the storage needed considerably?

Thanks Vinoth for the detailed response.

  • To bound the dynamic bloom filter, we have two options: (a) I can search around for a ready-to-use solution; (b) if nothing exists, we can come up with our own DynamicBloom with a max bound on the number of entries. Until we reach the max number of entries the FP ratio will be honored, after which the FP ratio may start to increase. I can get this done.
  • W.r.t. testing the FP ratio, thanks for the idea. I had a similar idea.
  • Here are the size differences between FP ratios 10^-8 and 10^-9. Key size considered is 50 bytes.

    | Entries | Size at FP 10^-8 | Size at FP 10^-9 |
    | ------- | ---------------- | ---------------- |
    | 10k | 380k | 430k |
    | 100k | 760k | 860k |
    | 1M | 6.5MB | 7.3MB |
    | 10M | 64MB | 72MB |

@vinothchandar added the status:in-progress label Nov 11, 2019
@vinothchandar changed the title from "[HUDI-106] Adding support for DynamicBloomFilter" to "[WIP] [HUDI-106] Adding support for DynamicBloomFilter" Nov 11, 2019
@vinothchandar self-assigned this Nov 13, 2019
@nsivabalan (Contributor, Author)

@vinothchandar: w.r.t. your point "if we don't have a way to configure the bloom type to use, we should add one", we have BLOOM_INDEX_AUTO_TUNE_ENABLE_PROP in IndexConfig. Were you hinting at this or something else? If it is a new one you want to introduce, how will it be different from this config?

@vinothchandar (Member)

Instead of BLOOM_INDEX_AUTO_TUNE_ENABLE_PROP, should we just make it BLOOM_INDEX_FILTER_TYPE_PROP? But at a higher level, yes, I wanted to just control this with a flag and turn it off by default for now..

@nsivabalan what's the current state of this PR? Is it still WIP?
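For illustration, a filter-type switch like the one discussed would be set like any other index config; the exact key and value strings below are assumptions, not the final names:

```java
import java.util.Properties;

public class ConfigExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical key/value: select the dynamic filter instead of the default simple one.
    props.setProperty("hoodie.bloom.index.filter.type", "DYNAMIC_V0");
  }
}
```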

@nsivabalan (Contributor, Author)

@vinothchandar: thanks.
Here are the two pending items:

  1. You suggested checking if there is a way to bound the max number of entries. Looks like there is no readily available solution. I have an idea of how to get this done, but I need time to test and validate it.
  2. You wanted to validate the FP ratio. I can't do this on my local laptop; I have to run it on a cluster to get these numbers. Will work on it this week.

But if you feel we can go ahead with the current state as-is, let me know.
I have fixed the config, btw; we can remove WIP if you feel the above two items can be worked on later.

@vinothchandar (Member)

Let's get the size capping in and we can merge this. Validating the FP ratio would also be good to do.. Let me clarify what I mean: I just want to verify that as you increase the number of entries, the FP ratio stays constant and the size of the bloom filter increases (up to the limit configured).

@nsivabalan changed the title from "[WIP] [HUDI-106] Adding support for DynamicBloomFilter" to "[HUDI-106] Adding support for DynamicBloomFilter" Nov 27, 2019
@nsivabalan (Contributor, Author)

I have added our own impl of DynamicBloom wherein we cap the max number of entries. In other words, up until maxNumEntries the FP ratio will be honored, after which there are no guarantees.
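A minimal sketch of the capping idea (not the actual Hudi implementation), built on Hadoop's org.apache.hadoop.util.bloom primitives: grow by one fixed-size filter per batch of entries until the configured maximum is reached, after which the last filter keeps absorbing keys and the FP guarantee degrades.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class CappedDynamicBloomFilter {
  private final int entriesPerFilter; // capacity of each underlying filter
  private final int maxNumEntries;    // cap beyond which we stop growing
  private final int vectorSize;       // bits per underlying filter
  private final int numHashes;
  private final List<BloomFilter> filters = new ArrayList<>();
  private int entriesInCurrent = 0;
  private int totalEntries = 0;

  public CappedDynamicBloomFilter(int entriesPerFilter, int maxNumEntries,
                                  int vectorSize, int numHashes) {
    this.entriesPerFilter = entriesPerFilter;
    this.maxNumEntries = maxNumEntries;
    this.vectorSize = vectorSize;
    this.numHashes = numHashes;
    filters.add(new BloomFilter(vectorSize, numHashes, Hash.MURMUR_HASH));
  }

  public void add(Key key) {
    if (entriesInCurrent >= entriesPerFilter && totalEntries < maxNumEntries) {
      // Under the cap: append a fresh filter so the FP ratio stays honored.
      filters.add(new BloomFilter(vectorSize, numHashes, Hash.MURMUR_HASH));
      entriesInCurrent = 0;
    }
    // At or past the cap, the last filter absorbs all further keys,
    // and the FP ratio may start to rise.
    filters.get(filters.size() - 1).add(key);
    entriesInCurrent++;
    totalEntries++;
  }

  public boolean mightContain(Key key) {
    // A key may live in any of the underlying filters.
    return filters.stream().anyMatch(f -> f.membershipTest(key));
  }
}
```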

@nsivabalan (Contributor, Author)

Here are the results from FP ratio testing.
Experiment params: error rate 1.0E-6. For the HoodieDynamic capped bloom filter, max number of entries = 5 × initial numEntries.

| InitNumEntries/EntriesAdded | Hadoop Dynamic (non-capped) | HoodieDynamic (capped) |
| --------------------------- | --------------------------- | ---------------------- |
| 5k/10k | 1.9E-6 | 1.8E-6 |
| 5k/50k | 9.1E-6 | 0.0162626 |
| 10k/10k | 1.4E-6 | 1.1E-6 |
| 10k/50k | 3.7E-6 | 5.6E-6 |
| 10k/60k | 7.6E-6 | 5.7E-5 |

Serialized size in bytes:

| InitNumEntries/EntriesAdded | Simple | Hadoop Dynamic (non-capped) | HoodieDynamic (capped) |
| --------------------------- | ------ | --------------------------- | ---------------------- |
| 5k/10k | 23980 | 47996 | 47996 |
| 5k/50k | 23980 | 287796 | 119936 |
| 10k/10k | 47944 | 47976 | 47976 |
| 10k/50k | 47944 | 287692 | 239748 |
| 10k/60k | 47944 | 575348 | 239748 |

@vinothchandar removed the status:in-progress label Dec 3, 2019
@vinothchandar (Member) left a comment

Left some high-level comments..

(Also some detailed ones at the start, then realized the PR may not be ready for final review)..

@bvaradar could you help Siva with the licensing stuff..

@vinothchandar (Member) left a comment

This looks almost ready. Have we added test classes for InternalDynamicBloomFilter and the bounded filter classes? We need some tests that check the bounding & dynamic aspects.

@vinothchandar (Member)

We can merge once these are addressed.. Let's get this across the line! :)

@nsivabalan (Contributor, Author)

Fixed the checkstyle issue.
Pending items:

  • One nit in CleanerUtils
  • > Have we added test classes for InternalDynamicBloomFilter and the bounded filter classes? We need some tests that check the bounding & dynamic aspects.

Will update you once done.

@nsivabalan (Contributor, Author)

nsivabalan commented Dec 15, 2019

@vinothchandar: the diff is ready; you are good to review. Sorry, forgot to ping you.

@vinothchandar (Member) left a comment

Two small nits.

    if (index != 0) {
      int curLength = serString.length();
      if (index > indexForMaxGrowth) {
        assert curLength == lastKnownBloomSize;
@vinothchandar (Member):

Could you please use the JUnit assert methods? Also, would this be better written as a parameterized test? (like the bloom filter test; I'll leave it to you to make the final call)

@nsivabalan (Contributor, Author):

Fixed the assert. But I didn't feel the need for a parameterized test, since we are only testing the dynamic filter for boundedness once the threshold is met.
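For reference, the JUnit form of that check would look like this (a fragment mirroring the snippet under review, reusing its variable names):

```java
import static org.junit.Assert.assertEquals;

// ...inside the loop that serializes the filter after each batch of inserts...
if (index != 0) {
  int curLength = serString.length();
  if (index > indexForMaxGrowth) {
    // Past max growth, the serialized filter size must stop changing.
    assertEquals(lastKnownBloomSize, curLength);
  }
}
```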

@vinothchandar merged commit 14881e9 into apache:master Dec 18, 2019
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Feb 25, 2020
- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Mar 6, 2020
…st for hoodie-client module (apache#930)

[HUDI-106] Adding support for DynamicBloomFilter (apache#976)

- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
lyogev pushed a commit to YotpoLtd/incubator-hudi that referenced this pull request Mar 30, 2020
- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
kroushan-nit pushed a commit to kroushan-nit/hudi-oss-fork that referenced this pull request Nov 10, 2024