Conversation

@nsivabalan (Contributor)

  • Adding support for DynamicBloomFilter (link) to tune the bloom filter size based on the total number of entries.
    • Added a BloomFilter interface and two implementations: SimpleBloomFilter (the existing one) and HudiDynamicBloomFilter (the new one).
    • Added a BloomFilterFactory to assist in creating the right BloomFilter based on the version (sketched below).
    • The version is stored in the parquet metadata footer. If no version is found, a SimpleBloomFilter is created.
    • Introduced a config named "hoodie.bloom.index.auto.tune.enable" in HoodieIndexConfig which, when enabled, creates new bloom filters as HudiDynamicBloomFilter.
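A rough sketch of the version-based dispatch described above; the class and method names approximate the PR's design (the "DYNAMIC_V0" type code is an assumption here), not exact Hudi APIs:

```java
// Illustrative sketch, not Hudi's exact code: pick the filter implementation
// based on the type/version code read from the parquet metadata footer.
public class BloomFilterFactory {

  // versionFromFooter is whatever was read from the parquet footer,
  // or null for files written before filters were versioned.
  public static BloomFilter fromSerializedString(String serialized, String versionFromFooter) {
    if (versionFromFooter == null) {
      // No version recorded: fall back to the existing simple filter.
      return new SimpleBloomFilter(serialized);
    }
    switch (versionFromFooter) {
      case "DYNAMIC_V0":
        return new HudiDynamicBloomFilter(serialized);
      default:
        return new SimpleBloomFilter(serialized);
    }
  }
}
```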

@nsivabalan force-pushed the DynamicBloomFilter branch 2 times, most recently from 5f66849 to cd1199e on October 28, 2019 06:28
@vinothchandar (Member)

@nsivabalan can you rebase against master and see if that helps the test pass?

@vinothchandar (Member) left a comment

Left some comments... can we also add a test for the "dynamic" nature of the filter, e.g. having more entries should result in a larger filter with the same FP ratio.. And also, how are you enforcing a maximum dynamic bloom filter size? Can you share data on how big the bloom filter would be if you, say, wrote 1M keys at FPP ratio 10^-9?

@nsivabalan (Contributor, Author)

> @nsivabalan can you rebase against master and see if that helps the test pass?

Nope. I didn't import the checkstyle. Will fix it.

@nsivabalan (Contributor, Author)

Should I create a package for bloom.filter in org.apache.hudi.common and move all BloomFilter-related classes to it?

@vinothchandar (Member)

Consolidating into one package makes sense.

@nsivabalan (Contributor, Author)

nsivabalan commented Nov 1, 2019

> Left some comments... can we also add a test for the "dynamic" nature of the filter, e.g. having more entries should result in a larger filter with the same FP ratio.. And also, how are you enforcing a maximum dynamic bloom filter size? Can you share data on how big the bloom filter would be if you, say, wrote 1M keys at FPP ratio 10^-9?

A few questions/clarifications:

  • I guess you can't bound the size in a dynamic bloom filter. The size will grow according to the number of entries added. The initial number of entries passed in is used to set the minimum size.
  • I am trying to find ways to test the FP ratio. Not sure how you would test that.
  • I was able to verify that adding more entries to the filter than the initial size increases the size of the bloom filter.
  • Here are the sizes of the dynamic bloom filter with error rate 10^-9 and initial number of entries 10k:

    | Entries added | Size |
    | ------------- | ---- |
    | 100 | 71940 bytes ~= 71 KB |
    | 1000 | 71940 bytes ~= 71 KB |
    | 10000 | 71940 bytes ~= 71 KB |
    | 100000 | 719088 bytes ~= 720 KB |
    | 1000000 | 7190568 bytes ~= 7.1 MB |

  • Not sure we really need a (unit) test to ensure that the size grows as the number of entries added increases. The only assertion we can make is that the size is greater compared to when fewer entries are added.
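For reference, the textbook sizing of a bloom filter holding n keys at false-positive rate p is given by the standard formula below (the Hadoop implementation and the string serialization add overhead on top of this):

```latex
m = \frac{-n \ln p}{(\ln 2)^2},
\qquad n = 10^6,\; p = 10^{-9}
\;\Rightarrow\; m \approx \frac{10^6 \cdot 20.7}{0.4805}
\approx 4.3 \times 10^7 \text{ bits} \approx 5.4\ \text{MB}
```

which is in the same ballpark as the ~7.1 MB measured above.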

@vinothchandar (Member)

@nsivabalan Here is the problem as I see it, w.r.t. bounding size.. Currently we have a low default of 60K entries, which comes out to reading ~400kb from the parquet footer. Not too shabby an overhead.. My understanding is that the parquet footers are all read at once and even query engines would read the footer.. So if we don't bound the size of the dynamic bloom filter to say 1MB or so, queries can pay a penalty? (I don't know how big this would be or if it's okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need our own impl of dynamic bloom if we were to limit the size.. correct? How doable is that?

> I am trying to find ways to test the FP ratio. Not sure how you would test that.

The way I have done it in the past is to generate a lot of keys and hold them in two lists: added, notAdded.. I add the ones from added to the bloom filter and then check for false positives using the notAdded list.. the % of notAdded keys that had a hit is your FP. For this impl, we need to ensure that the FP ratio remains the same even as you increase the size of the added/notAdded lists..
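A minimal sketch of that approach, assuming a filter type with add/mightContain methods (the names below are illustrative, not Hudi's actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical filter interface, just for this sketch.
interface KeyFilter {
  void add(String key);
  boolean mightContain(String key);
}

public class FpRatioCheck {
  // Observed FP rate: fraction of never-added keys that still hit the filter.
  static double observedFpRate(KeyFilter filter, int numKeys) {
    List<String> added = new ArrayList<>();
    List<String> notAdded = new ArrayList<>();
    for (int i = 0; i < numKeys; i++) {
      added.add("in-" + UUID.randomUUID());     // goes into the filter
      notAdded.add("out-" + UUID.randomUUID()); // never added; disjoint by prefix
    }
    added.forEach(filter::add);
    long hits = notAdded.stream().filter(filter::mightContain).count();
    return (double) hits / numKeys;
  }
}
```

Repeating the measurement with a larger numKeys should leave the returned rate roughly constant if the filter grows correctly.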

Other small points:

  • If we don't have a way to configure the bloom filter type to use, we should add one.
  • We should reconsider whether the default error rate here should be 10^-9; relaxing it will also help reduce the size.. we already have techniques like range pruning to reduce the amount of comparisons.. Assuming even a large 100M entries inserted into a partition, if the bloom filter had 10^-8, it might be enough to prevent false positives, right? I guess this will drop the storage needed considerably?

@nsivabalan (Contributor, Author)

> @nsivabalan Here is the problem as I see it, w.r.t. bounding size.. Currently we have a low default of 60K entries, which comes out to reading ~400kb from the parquet footer. Not too shabby an overhead.. My understanding is that the parquet footers are all read at once and even query engines would read the footer.. So if we don't bound the size of the dynamic bloom filter to say 1MB or so, queries can pay a penalty? (I don't know how big this would be or if it's okay) But we won't offer the user the choice to make tradeoffs.. IIUC we need our own impl of dynamic bloom if we were to limit the size.. correct? How doable is that?
>
> > I am trying to find ways to test the FP ratio. Not sure how you would test that.
>
> The way I have done it in the past is to generate a lot of keys and hold them in two lists: added, notAdded.. I add the ones from added to the bloom filter and then check for false positives using the notAdded list.. the % of notAdded keys that had a hit is your FP. For this impl, we need to ensure that the FP ratio remains the same even as you increase the size of the added/notAdded lists..
>
> Other small points:
>
> • If we don't have a way to configure the bloom filter type to use, we should add one.
> • We should reconsider whether the default error rate here should be 10^-9; relaxing it will also help reduce the size.. we already have techniques like range pruning to reduce the amount of comparisons.. Assuming even a large 100M entries inserted into a partition, if the bloom filter had 10^-8, it might be enough to prevent false positives, right? I guess this will drop the storage needed considerably?

Thanks Vinoth for the detailed response.

  • To bound the dynamic bloom filter, we have two options: (a) I can search around for a ready-to-use solution; (b) if nothing exists, we can come up with our own DynamicBloom with a max bound on the number of entries. Until we reach the max number of entries the FP ratio will be honored, after which the FP ratio may start to increase. I can get this done.
  • W.r.t. testing the FP ratio, thanks for the idea. I had a similar idea.
  • Here are the size differences between FP ratios 10^-8 and 10^-9. Key size considered is 50 bytes.

    | Entries | Size at FP 10^-8 | Size at FP 10^-9 |
    | ------- | ---------------- | ---------------- |
    | 10k | 380k | 430k |
    | 100k | 760k | 860k |
    | 1M | 6.5MB | 7.3MB |
    | 10M | 64MB | 72MB |

@vinothchandar added the status:in-progress label Nov 11, 2019
@vinothchandar changed the title from "[HUDI-106] Adding support for DynamicBloomFilter" to "[WIP] [HUDI-106] Adding support for DynamicBloomFilter" Nov 11, 2019
@vinothchandar self-assigned this Nov 13, 2019
@nsivabalan (Contributor, Author)

@vinothchandar: w.r.t. your point "if we don't have a way to configure the bloom type to use, we should add one", we have BLOOM_INDEX_AUTO_TUNE_ENABLE_PROP in IndexConfig. Were you hinting at this or something else? If it is a new one you want to introduce, how will it be different from this config?

@vinothchandar (Member)

Instead of BLOOM_INDEX_AUTO_TUNE_ENABLE_PROP, should we just make it BLOOM_INDEX_FILTER_TYPE_PROP? But at a higher level, yes, I wanted to just control this with a flag and turn it off by default for now..

@nsivabalan what's the current state of this PR? Is it still WIP?
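For illustration, a filter-type switch like the one discussed would be set like any other index config; the exact key and value strings below are assumptions, not the final names:

```java
import java.util.Properties;

public class ConfigExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical key/value: select the dynamic filter instead of the default simple one.
    props.setProperty("hoodie.bloom.index.filter.type", "DYNAMIC_V0");
  }
}
```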

@nsivabalan (Contributor, Author)

@vinothchandar: thanks.
Here are the two pending items:

  1. You suggested checking if there is a way to bound the max number of entries. Looks like there is no readily available solution. I have an idea of how to get this done, but I need time to test and validate it.
  2. You wanted to validate the FP ratio. I can't do this on my local laptop; I have to run it on a cluster to get these numbers. Will work on it this week.

But if you feel we can go ahead with the current state as-is, let me know.
I have fixed the config, btw; we can remove WIP if you feel the above two items can be worked on later.

@vinothchandar (Member)

Let's get the size capping in and we can merge this. Validating the FP ratio would also be good to do.. Let me clarify what I mean: I just want to verify that as you increase the number of entries, the FP ratio stays constant and the size of the bloom filter increases (up to the limit configured).

@nsivabalan changed the title from "[WIP] [HUDI-106] Adding support for DynamicBloomFilter" to "[HUDI-106] Adding support for DynamicBloomFilter" Nov 27, 2019
@nsivabalan (Contributor, Author)

I have added our own impl of DynamicBloom wherein we cap the max number of entries. In other words, up until maxNumEntries the FP ratio will be honored, after which there are no guarantees.
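A minimal sketch of the capping idea (not the actual Hudi implementation), built on Hadoop's org.apache.hadoop.util.bloom primitives: grow by one fixed-size filter per batch of entries until the configured maximum is reached, after which the last filter keeps absorbing keys and the FP guarantee degrades.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class CappedDynamicBloomFilter {
  private final int entriesPerFilter; // capacity of each underlying filter
  private final int maxNumEntries;    // cap beyond which we stop growing
  private final int vectorSize;       // bits per underlying filter
  private final int numHashes;
  private final List<BloomFilter> filters = new ArrayList<>();
  private int entriesInCurrent = 0;
  private int totalEntries = 0;

  public CappedDynamicBloomFilter(int entriesPerFilter, int maxNumEntries,
                                  int vectorSize, int numHashes) {
    this.entriesPerFilter = entriesPerFilter;
    this.maxNumEntries = maxNumEntries;
    this.vectorSize = vectorSize;
    this.numHashes = numHashes;
    filters.add(new BloomFilter(vectorSize, numHashes, Hash.MURMUR_HASH));
  }

  public void add(Key key) {
    if (entriesInCurrent >= entriesPerFilter && totalEntries < maxNumEntries) {
      // Under the cap: append a fresh filter so the FP ratio stays honored.
      filters.add(new BloomFilter(vectorSize, numHashes, Hash.MURMUR_HASH));
      entriesInCurrent = 0;
    }
    // At or past the cap, the last filter absorbs all further keys,
    // and the FP ratio may start to rise.
    filters.get(filters.size() - 1).add(key);
    entriesInCurrent++;
    totalEntries++;
  }

  public boolean mightContain(Key key) {
    // A key may live in any of the underlying filters.
    return filters.stream().anyMatch(f -> f.membershipTest(key));
  }
}
```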

@nsivabalan (Contributor, Author)

Here are the results from FP ratio testing.
Experiment params: error rate 1.0E-6. For the HoodieDynamic capped bloom filter, max number of entries = 5 × initial numEntries.

| InitNumEntries/EntriesAdded | Hadoop Dynamic (non-capped) | HoodieDynamic (capped) |
| --------------------------- | --------------------------- | ---------------------- |
| 5k/10k | 1.9E-6 | 1.8E-6 |
| 5k/50k | 9.1E-6 | 0.0162626 |
| 10k/10k | 1.4E-6 | 1.1E-6 |
| 10k/50k | 3.7E-6 | 5.6E-6 |
| 10k/60k | 7.6E-6 | 5.7E-5 |

Serialized size in bytes:

| InitNumEntries/EntriesAdded | Simple | Hadoop Dynamic (non-capped) | HoodieDynamic (capped) |
| --------------------------- | ------ | --------------------------- | ---------------------- |
| 5k/10k | 23980 | 47996 | 47996 |
| 5k/50k | 23980 | 287796 | 119936 |
| 10k/10k | 47944 | 47976 | 47976 |
| 10k/50k | 47944 | 287692 | 239748 |
| 10k/60k | 47944 | 575348 | 239748 |

@vinothchandar removed the status:in-progress label Dec 3, 2019
@vinothchandar (Member) left a comment

Left some high-level comments..

(Also some detailed ones at the start, then realized the PR may not be ready for final review)..

@bvaradar could you help Siva with the licensing stuff..

@vinothchandar (Member) left a comment

This looks almost ready. Have we added test classes for InternalDynamicBloomFilter and the bounded filter classes? We need some tests that check the bounding & dynamic aspects.

@vinothchandar (Member)

We can merge once these are addressed.. Let's get this across the line! :)

@nsivabalan (Contributor, Author)

Fixed the checkstyle issue.
Pending items:

  • One nit in CleanerUtils
  • > Have we added test classes for InternalDynamicBloomFilter and the bounded filter classes? We need some tests that check the bounding & dynamic aspects.

Will update you once done.

@nsivabalan (Contributor, Author)

nsivabalan commented Dec 15, 2019

@vinothchandar: the diff is ready; you are good to review. Sorry, forgot to ping you.

@vinothchandar (Member) left a comment

Two small nits.

    if (index != 0) {
      int curLength = serString.length();
      if (index > indexForMaxGrowth) {
        assert curLength == lastKnownBloomSize;
@vinothchandar (Member):

Could you please use the JUnit assert methods? Also, would this be better written as a parameterized test? (like the bloom filter test; I'll leave it to you to make the final call)

@nsivabalan (Contributor, Author):

Fixed the assert. But I didn't feel the need for a parameterized test, since we are only testing the dynamic filter for boundedness once the threshold is met.
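For reference, the JUnit form of that check would look like this (a fragment mirroring the snippet under review, reusing its variable names):

```java
import static org.junit.Assert.assertEquals;

// ...inside the loop that serializes the filter after each batch of inserts...
if (index != 0) {
  int curLength = serString.length();
  if (index > indexForMaxGrowth) {
    // Past max growth, the serialized filter size must stop changing.
    assertEquals(lastKnownBloomSize, curLength);
  }
}
```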

@vinothchandar merged commit 14881e9 into apache:master Dec 18, 2019
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Feb 25, 2020
- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
sumit-dp pushed a commit to Schedule1/incubator-hudi that referenced this pull request Mar 6, 2020
…st for hoodie-client module (apache#930)

[HUDI-106] Adding support for DynamicBloomFilter (apache#976)

- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
lyogev pushed a commit to YotpoLtd/incubator-hudi that referenced this pull request Mar 30, 2020
- Introduced configs for bloom filter type
- Implemented dynamic bloom filter with configurable max number of keys
- BloomFilterFactory abstractions; Defaults to current simple bloom filter
kroushan-nit pushed a commit to kroushan-nit/hudi-oss-fork that referenced this pull request Nov 10, 2024