[HUDI-106] Adding support for DynamicBloomFilter #976
nsivabalan commented on Oct 28, 2019:
- Adds support for a DynamicBloomFilter (link) to tune the bloom filter size based on the total number of entries.
- Added a BloomFilter interface and two implementations: SimpleBloomFilter (the existing one) and HudiDynamicBloomFilter (the new one).
- Added a BloomFilterFactory to assist in creating the right BloomFilter based on version.
- The version is stored in the parquet metadata footer. If no version is found, a SimpleBloomFilter is created.
- Introduced a config named "hoodie.bloom.index.auto.tune.enable" in HoodieIndexConfig which, when enabled, creates the new BloomFilter as a HudiDynamicBloomFilter.
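The version-based dispatch described above can be sketched as follows. This is an illustrative sketch only: the class names, the footer key, and the type codes below are assumptions made for the sketch, not necessarily the PR's actual API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the factory idea: pick an implementation from a type code stored
// in the parquet footer, defaulting to the simple filter when none is found.
public class BloomFilterFactorySketch {

  interface BloomFilter {
    void add(String key);
    boolean mightContain(String key);
    String typeCode();
  }

  // Stand-in for the existing fixed-size filter. A Set is used here purely
  // so the sketch runs; the real filter hashes keys into a bit vector.
  static class SimpleBloomFilter implements BloomFilter {
    private final Set<String> keys = new HashSet<>();
    public void add(String key) { keys.add(key); }
    public boolean mightContain(String key) { return keys.contains(key); }
    public String typeCode() { return "SIMPLE"; }
  }

  // Stand-in for the new filter that grows with the number of entries.
  static class DynamicBloomFilter extends SimpleBloomFilter {
    @Override
    public String typeCode() { return "DYNAMIC"; }
  }

  // A missing or SIMPLE type code falls back to the simple filter, which
  // keeps files written before this change readable.
  static BloomFilter fromFooter(Map<String, String> footerMeta) {
    String code = footerMeta.get("hoodie_bloom_filter_type_code"); // assumed key
    if (code == null || code.equals("SIMPLE")) {
      return new SimpleBloomFilter();
    }
    return new DynamicBloomFilter();
  }

  public static void main(String[] args) {
    Map<String, String> footer = new HashMap<>();
    System.out.println(fromFooter(footer).typeCode());      // SIMPLE
    footer.put("hoodie_bloom_filter_type_code", "DYNAMIC");
    System.out.println(fromFooter(footer).typeCode());      // DYNAMIC
  }
}
```

Defaulting to SIMPLE on a missing footer entry is what makes the change backwards compatible with already-written files.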
@nsivabalan can you rebase against master and see if that helps the tests pass?
vinothchandar left a comment:
Left some comments. Can we also add a test for the "dynamic" nature of the filter, e.g. having more entries should result in a larger filter with the same fp ratio? Also, how are you enforcing a maximum dynamic bloom filter size? Can you share data on how big the bloom filter would be if you, say, wrote 1M keys at an fpp ratio of 10^-9?
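The size question can be estimated with the standard bloom filter sizing formula m = -n * ln(p) / (ln 2)^2, which gives the optimal number of bits for n entries at false-positive probability p. A quick back-of-the-envelope sketch (my own estimate, ignoring serialization overhead; these are not numbers from the PR):

```java
public class BloomSizeEstimate {
  // Optimal bit count for n entries at false-positive probability p:
  // m = ceil(-n * ln(p) / (ln 2)^2)   (standard bloom filter sizing formula)
  static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  static long optimalBytes(long n, double p) {
    return optimalBits(n, p) / 8;
  }

  public static void main(String[] args) {
    // Roughly 316 KB for a 60K-entry filter and roughly 5.1 MB for 1M keys,
    // both at fpp 1e-9.
    System.out.println(optimalBytes(60_000, 1e-9));
    System.out.println(optimalBytes(1_000_000, 1e-9));
  }
}
```

By this estimate a 1M-key filter at fpp 10^-9 needs a few megabytes of footer space, which is why bounding the serialized size matters for query engines that read the footer.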
Nope. I didn't import the checkstyle. Will fix it.

Should I create a package for bloom.filter in org.apache.hudi.common and move all BloomFilter related classes to it?

Consolidating the package makes sense.

A few questions/clarifications:
@nsivabalan Here is the problem as I see it w.r.t. bounding the size. Currently we have a low default of 60K entries, which comes out to reading ~400KB from the parquet footer; not too shabby an overhead. My understanding is that the parquet footers are all read at once, and even query engines read the footer. So if we don't bound the size of the dynamic bloom filter to, say, 1MB or so, queries can pay a penalty (I don't know how big this would be or if it's okay), but we won't be offering the user the choice to make tradeoffs. IIUC we need our own impl of the dynamic bloom filter if we are to limit the size, correct? How doable is that?
Other small points:
Thanks Vinoth for the detailed response.
@vinothchandar : w.r.t. your point "if we don't have a way to configure the bloom type to use, we should add one", we have BLOOM_INDEX_AUTO_TUNE_ENABLE_PROP in HoodieIndexConfig. Were you hinting at this or something else? If it is a new one you want to introduce, how will it be different from this config?
@nsivabalan what's the current state of this PR? Is it still WIP?
@vinothchandar : thanks. But if you feel we can go ahead with the current state as is, let me know.
Let's get the size capping in, and we can merge this. Validating the FP ratio would also be good to do. Let me clarify what I mean: I just want to verify that as you increase the number of entries, the fp ratio stays constant and the size of the bloom filter increases (up to the configured limit).
I have added our own impl of DynamicBloom wherein we cap the max number of entries. In other words, up until maxNumEntries the FP ratio will be honored, after which there are no guarantees.
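The bounded-growth idea above can be sketched as follows: entries go into sub-filters sized for a fixed batch, growing by one sub-filter per batch until the key cap is reached, after which the last sub-filter keeps absorbing keys and the FP guarantee lapses. This is a minimal self-contained illustration of the concept, not the PR's actual implementation, and the hash scheme (double hashing off String.hashCode) is a placeholder.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Minimal sketch of a bounded dynamic bloom filter (illustrative only).
public class BoundedDynamicBloomSketch {
  private final int bitsPerFilter;
  private final int numHashes;
  private final int batchSize;      // entries each sub-filter is sized for
  private final int maxNumEntries;  // cap after which growth stops
  private final List<BitSet> filters = new ArrayList<>();
  private int numEntries = 0;

  BoundedDynamicBloomSketch(int batchSize, int maxNumEntries, double fpp) {
    this.batchSize = batchSize;
    this.maxNumEntries = maxNumEntries;
    // Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
    this.bitsPerFilter = (int) Math.ceil(
        -batchSize * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    this.numHashes = Math.max(1,
        (int) Math.round((double) bitsPerFilter / batchSize * Math.log(2)));
    filters.add(new BitSet(bitsPerFilter));
  }

  void add(String key) {
    // Grow only while still under the cap and the current batch is full;
    // past the cap, keys keep landing in the last sub-filter.
    if (numEntries > 0 && numEntries < maxNumEntries
        && numEntries % batchSize == 0) {
      filters.add(new BitSet(bitsPerFilter));
    }
    BitSet last = filters.get(filters.size() - 1);
    for (int i = 0; i < numHashes; i++) {
      last.set(Math.floorMod(hash(key, i), bitsPerFilter));
    }
    numEntries++;
  }

  boolean mightContain(String key) {
    outer:
    for (BitSet f : filters) {
      for (int i = 0; i < numHashes; i++) {
        if (!f.get(Math.floorMod(hash(key, i), bitsPerFilter))) {
          continue outer;
        }
      }
      return true; // all bits set in this sub-filter
    }
    return false;
  }

  int numSubFilters() {
    return filters.size();
  }

  // Placeholder double-hashing scheme; a real filter would use murmur etc.
  private static int hash(String key, int i) {
    int h1 = key.hashCode();
    int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
    return h1 + i * h2;
  }
}
```

With batchSize 100 and maxNumEntries 300, inserting 400 keys yields exactly three sub-filters: the filter stops growing at the cap, and only the FP ratio (never membership of inserted keys) degrades beyond it.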
Here are the results from FP ratio testing.
Serialized size in bytes
vinothchandar left a comment:
Left some high-level comments (also some detailed ones at the start, then realized the PR may not be ready for final review).
@bvaradar could you help Siva with the licensing stuff?
This looks almost ready. Have we added test classes for InternalDynamicBloomFilter and the bounded filter classes? We need some tests that check the bounding and dynamic aspects.
We can merge once these are addressed. Let's get this across the line! :)
Fixed the checkstyle issue. Will update you once done.
@vinothchandar : the diff is ready, you are good to review. Sorry, forgot to ping you.
vinothchandar left a comment:
Two small nits.
if (index != 0) {
  int curLength = serString.length();
  if (index > indexForMaxGrowth) {
    assert curLength == lastKnownBloomSize;
Could you please use the JUnit assert methods? Also, is this better written as a parameterized test (like the bloom filter test)? Leaving it to you to make the final call.
Fixed the assert. But I didn't feel a need for a parameterized test, since we are only testing the dynamic filter for boundedness once the threshold is met.
- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; defaults to current simple bloom filter