[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows #3149

nsivabalan · 2021-06-24T12:41:45Z

What is the purpose of the pull request

Adding bulk_insert sort modes for row writer path

Adding bulk_insert sort modes for row writer path
Caching RowCreateHandles if applicable
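
For context, a minimal sketch of how a writer might pick a sort mode on the row writer path. The option keys are the Hudi config names as I understand them around this release, so treat them as assumptions and check them against your version. The helper class `BulkInsertOptions` is hypothetical and not part of Hudi.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi: collects the writer options that
// select bulk_insert, the row writer path, and a sort mode.
class BulkInsertOptions {
    // sortMode is expected to be one of NONE, GLOBAL_SORT, PARTITION_SORT.
    static Map<String, String> forSortMode(String sortMode) {
        Map<String, String> opts = new LinkedHashMap<>();
        // Assumed config keys; verify against the Hudi version in use.
        opts.put("hoodie.datasource.write.operation", "bulk_insert");
        opts.put("hoodie.datasource.write.row.writer.enable", "true");
        opts.put("hoodie.bulkinsert.sort.mode", sortMode);
        return opts;
    }
}
```

A map like this would typically be handed to the Spark writer through `options(...)`, together with the usual table name and key settings.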

Verify this pull request

This pull request is already covered by existing tests, such as TestBulkInsertInternalPartitionerForRows

This change added tests and can be verified as follows:

TestHoodieBulkInsertDataInternalWriter

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

hudi-bot · 2021-06-24T12:48:00Z

CI report:

bbf3285 Azure: FAILURE

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run travis: re-run the last Travis build
@hudi-bot run azure: re-run the last Azure build

codecov-commenter · 2021-06-24T14:25:17Z

Codecov Report

Merging #3149 (bbf3285) into master (6e24434) will decrease coverage by 0.00%.
The diff coverage is 56.00%.

@@             Coverage Diff              @@
##             master    #3149      +/-   ##
============================================
- Coverage     47.61%   47.61%   -0.01%     
- Complexity     5487     5496       +9     
============================================
  Files           924      929       +5     
  Lines         41206    41242      +36     
  Branches       4134     4135       +1     
============================================
+ Hits          19621    19637      +16     
- Misses        19843    19860      +17     
- Partials       1742     1745       +3

| Flag | Coverage Δ |
|---|---|
| hudicli | `39.97% <ø> (ø)` |
| hudiclient | `34.60% <52.63%> (+0.02%)` ⬆️ |
| hudicommon | `48.56% <ø> (-0.02%)` ⬇️ |
| hudiflink | `59.58% <ø> (ø)` |
| hudihadoopmr | `51.29% <ø> (ø)` |
| hudisparkdatasource | `67.23% <58.06%> (-0.10%)` ⬇️ |
| hudisync | `54.48% <ø> (ø)` |
| huditimelineservice | `64.07% <ø> (ø)` |
| hudiutilities | `58.59% <ø> (ø)` |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
|---|---|
| ...a/org/apache/hudi/config/HoodieInternalConfig.java | `0.00% <0.00%> (ø)` |
| ...java/org/apache/hudi/config/HoodieWriteConfig.java | `42.88% <ø> (ø)` |
| ...n/java/org/apache/hudi/internal/DefaultSource.java | `0.00% <0.00%> (ø)` |
| ...org/apache/hudi/spark3/internal/DefaultSource.java | `0.00% <0.00%> (ø)` |
| ...nal/HoodieDataSourceInternalBatchWriteBuilder.java | `0.00% <0.00%> (ø)` |
| ...spark3/internal/HoodieDataSourceInternalTable.java | `0.00% <0.00%> (ø)` |
| .../BulkInsertInternalPartitionerWithRowsFactory.java | `50.00% <50.00%> (ø)` |
| ...ecution/bulkinsert/NonSortPartitionerWithRows.java | `66.66% <66.66%> (ø)` |
| ...n/bulkinsert/PartitionSortPartitionerWithRows.java | `66.66% <66.66%> (ø)` |
| ...tion/bulkinsert/GlobalSortPartitionerWithRows.java | `75.00% <75.00%> (ø)` |

... and 16 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6e24434...bbf3285.

vinothchandar

Left some comments.

vinothchandar · 2021-07-06T07:17:05Z

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java

this name needs some work.

also seems like you are deriving this value in HoodieSparkSqlWriter? I would like to avoid this new config if possible. can you clarify why we need this. it was not very clear to me

When switching over to a new RowCreateHandle, this config decides whether to cache and reuse the write handles or to close the current one right away when switching to a different partition. I have left a comment at the exact place where it is used as well.

I named the variable based on the method we have in the interface.

boolean arePartitionRecordsSorted();

Thought it would be consistent.
I could maybe name this variable "arePartitionerRecordsSortedInBulkInsert".

also, guess this is not a public config as such. Do we know where these configs should go? I see we have defined BULKINSERT_INPUT_DATA_SCHEMA_DDL in HoodieWriteConfig which is mainly used for internal purposes.
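
The handle-switching behaviour described in this thread can be sketched roughly as follows. This is a simplified illustration under assumptions, not the code in this PR: `FakeHandle` and `HandleSwitchSketch` are hypothetical stand-ins, and a "row" is reduced to its partition path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a per-partition row create handle.
class FakeHandle {
    final String partitionPath;
    boolean closed = false;
    FakeHandle(String partitionPath) { this.partitionPath = partitionPath; }
    void close() { closed = true; }
}

// If records arrive sorted by partition, a partition is never revisited, so
// the previous handle can be closed as soon as the partition changes.
// Otherwise handles are cached so a later record for the same partition
// reuses the open handle instead of creating another file.
class HandleSwitchSketch {
    private final boolean arePartitionRecordsSorted;
    private final Map<String, FakeHandle> cache = new HashMap<>();
    private FakeHandle current;
    int handlesCreated = 0;

    HandleSwitchSketch(boolean arePartitionRecordsSorted) {
        this.arePartitionRecordsSorted = arePartitionRecordsSorted;
    }

    void write(String partitionPath) {
        if (current != null && current.partitionPath.equals(partitionPath)) {
            return; // same partition: keep using the current handle
        }
        if (arePartitionRecordsSorted) {
            if (current != null) {
                current.close(); // sorted input: this partition will not come back
            }
            current = newHandle(partitionPath);
        } else {
            current = cache.computeIfAbsent(partitionPath, this::newHandle);
        }
    }

    private FakeHandle newHandle(String partitionPath) {
        handlesCreated++;
        return new FakeHandle(partitionPath);
    }

    void close() {
        if (current != null) {
            current.close();
        }
        for (FakeHandle h : cache.values()) {
            h.close();
        }
    }
}
```

With unsorted input such as partitions a, b, a, the cached variant creates two handles, whereas closing eagerly would create three. That is why the flag has to reflect what the partitioner actually guarantees.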


vinothchandar · 2021-07-06T07:23:20Z

.../java/org/apache/hudi/execution/bulkinsert/BulkInsertInternalPartitionerWithRowsFactory.java

drop Internal from the name?

The existing factory class for the write client path is called BulkInsertInternalPartitionerFactory, hence the name. The reason is that we have an interface called BulkInsertPartitioner, with a few out-of-the-box partitioners, and users can define their own as well. Hence the factories for the built-in ones are named "internal". I can fix the name for both factories if you prefer.
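
To make the naming discussion concrete, here is a rough sketch of a factory that maps a sort mode to a built-in partitioner. All types below are simplified, hypothetical stand-ins: the real interface works on Spark datasets and takes an output parallelism, and the difference between global and per-partition sorting is not modelled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Mirrors the three sort modes named in this PR.
enum SortMode { NONE, GLOBAL_SORT, PARTITION_SORT }

// Simplified stand-in for the partitioner interface.
// A row is modelled as {partitionPath, recordKey}.
interface RowPartitioner {
    List<String[]> repartitionRecords(List<String[]> rows);
    boolean arePartitionRecordsSorted();
}

// Hypothetical built-in partitioner: optionally sorts by partition path, then key.
class SimplePartitioner implements RowPartitioner {
    private final boolean sort;

    SimplePartitioner(boolean sort) { this.sort = sort; }

    @Override
    public List<String[]> repartitionRecords(List<String[]> rows) {
        List<String[]> out = new ArrayList<>(rows); // leave the input untouched
        if (sort) {
            out.sort(Comparator.<String[], String>comparing(r -> r[0])
                    .thenComparing(r -> r[1]));
        }
        return out;
    }

    @Override
    public boolean arePartitionRecordsSorted() { return sort; }
}

// Factory for the built-in ("internal") partitioners only; a user-defined
// partitioner would be supplied separately.
class RowPartitionerFactory {
    static RowPartitioner get(SortMode mode) {
        switch (mode) {
            case NONE:
                return new SimplePartitioner(false);
            case GLOBAL_SORT:
            case PARTITION_SORT:
                return new SimplePartitioner(true);
            default:
                throw new IllegalArgumentException("Unsupported sort mode: " + mode);
        }
    }
}
```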

.../src/main/java/org/apache/hudi/execution/bulkinsert/RDDPartitionSortPartitionerWithRows.java

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala

...-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java

…r Class (apache#1927)

…r to bulk insert of Rows

change the insert overwrite return type

…to BulkInsert with Rows (apache#3149)

…rom OSS master.
It consists following changes: - JdbcSource: This class extends RowSource and implements fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) - SqlQueryBuilder: A simple utility class to build sql queries fluently. - Implements two modes of fetching: full and incremental. Full is a complete scan of RDBMS table. Incremental is delta since last checkpoint. Incremental mode falls back to full fetch in case of any exception. [MINOR] Remove unused module (#3116) [MINOR] Put Azure cache tasks first (#3118) [HUDI-1248] Increase timeout for deltaStreamerTestRunner in TestHoodieDeltaStreamer (#3110) [HUDI-2049] StreamWriteFunction should wait for the next inflight instant time before flushing (#3123) [HUDI-2050] Support rollback inflight compaction instances for batch flink compactor (#3124) [HUDI-1776] Support AlterCommand For Hoodie (#3086) [HUDI-2043] HoodieDefaultTimeline$filterPendingCompactionTImeline() method have wrong filter condition (#3109) [HUDI-2031] JVM occasionally crashes during compaction when spark speculative execution is enabled (#3093) * unit tests added [HUDI-2047] Ignore FileNotFoundException in WriteProfiles #getWritePathsOfInstant (#3125) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1883] Support Truncate Table For Hoodie (#3098) [HUDI-2013] Removed option to fallback to file listing when Metadata Table is enabled. (#3079) [HUDI-1717] Metadata Reader should merge all the un-synced but complete instants from the dataset timeline. 
(#3082) [HUDI-1988] FinalizeWrite() been executed twice in AbstractHoodieWriteClient$commitstats (#3050) [HUDI-2054] Remove the duplicate name for flink write pipeline (#3135) [HUDI-1826] Add ORC support in HoodieSnapshotExporter (#3130) [HUDI-2038] Support rollback inflight compaction instances for CompactionPlanOperator (#3105) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2064] Fix TestHoodieBackedMetadata#testOnlyValidPartitionsAdded (#3141) [HUDI-2061] Incorrect Schema Inference For Schema Evolved Table (#3137) [HUDI-2053] Insert Static Partition With DateType Return Incorrect Partition Value (#3133) [HUDI-2069] Fix KafkaAvroSchemaDeserializer to not rely on reflection (#3111) [HUDI-2069] KafkaAvroSchemaDeserializer should get sourceSchema passed instead using Reflection [HUDI-2062] Catch FileNotFoundException in WriteProfiles #getCommitMetadata Safely (#3138) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2068] Skip the assign state for SmallFileAssign when the state can not assign initially (#3148) Add ability to provide multi-region (global) data consistency across HMS in different regions (#2542) [global-hive-sync-tool] Add a global hive sync tool to sync hudi table across clusters. 
Add a way to rollback the replicated time stamp if we fail to sync or if we partly sync Co-authored-by: Jagmeet Bali <[email protected]> [MINOR] Removing un-used files and references (#3150) [HUDI-2060] Added tests for KafkaOffsetGen (#3136) [MINOR] Remove unused methods (#3152) [HUDI-2073] Fix the bug of hoodieClusteringJob never quit (#3157) Co-authored-by: yuezhang <[email protected]> [HUDI-2074] Use while loop instead of recursive call in MergeOnReadInputFormat#MergeIterator to avoid StackOverflow (#3159) [MINOR] Drop duplicate keygenerator class configuration setting (#3167) [HUDI-2067] Sync FlinkOptions config to FlinkStreamerConfig (#3151) [HUDI-1910] Commit Offset to Kafka after successful Hudi commit (#3092) [HUDI-2084] Resend the uncommitted write metadata when start up (#3168) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2081] Move schema util tests out from TestHiveSyncTool (#3166) [HUDI-2094] Supports hive style partitioning for flink writer (#3178) [HUDI-2097] Fix Flink unable to read commit metadata error (#3180) [HUDI-2085] Support specify compaction paralleism and compaction target io for flink batch compaction (#3169) [HUDI-2092] Fix NPE caused by FlinkStreamerConfig#writePartitionUrlEncode null value (#3176) [HUDI-2006] Adding more yaml templates to test suite (#3073) [HUDI-2103] Add rebalance before index bootstrap (#3185) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1944] Support Hudi to read from committed offset (#3175) * [HUDI-1944] Support Hudi to read from committed offset * [HUDI-1944] Adding group option to KafkaResetOffsetStrategies * [HUDI-1944] Update Exception msg [HUDI-2052] Support load logFile in BootstrapFunction (#3134) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-89] Add configOption & refactor all configs based on that (#2833) Co-authored-by: Wenning Ding <[email protected]> [MINOR] Update .asf.yaml to codify notification settings, turn on jira comments, gh discussions (#3164) - Turn on comment for jira, so we can track PR 
activity better - Create a notification settings that match https://gitbox.apache.org/schemes.cgi?hudi - Try and turn on "discussions" on Github, to experiment [MINOR] Fix broken build due to FlinkOptions (#3198) [HUDI-2088] Missing Partition Fields And PreCombineField In Hoodie Properties For Table Written By Flink (#3171) [MINOR] Add Documentation to KEYGENERATOR_TYPE_PROP (#3196) [HUDI-2105] Compaction Failed For MergeInto MOR Table (#3190) [HUDI-2051] Enable Hive Sync When Spark Enable Hive Meta For Spark Sql (#3126) [HUDI-2112] Support reading pure logs file group for flink batch reader after compaction (#3202) [HUDI-2114] Spark Query MOR Table Written By Flink Return Incorrect Timestamp Value (#3208) [HUDI-2121] Add operator uid for flink stateful operators (#3212) [HUDI-2123] Exception When Merge With Null-Value Field (#3214) [HUDI-2124] A Grafana dashboard for HUDI. (#3216) [HUDI-2057] CTAS Generate An External Table When Create Managed Table (#3146) [HUDI-1930] Bootstrap support configure KeyGenerator by type (#3170) * [HUDI-1930] Bootstrap support configure KeyGenerator by type [HUDI-2116] Support batch synchronization of partition datas to hive metastore to avoid oom problem (#3209) [HUDI-2126] The coordinator send events to write function when there are no data for the checkpoint (#3219) [HUDI-2127] Initialize the maxMemorySizeInBytes in log scanner (#3220) [HUDI-2058]support incremental query for insert_overwrite_table/insert_overwrite operation on cow table (#3139) [HUDI-2129] StreamerUtil.medianInstantTime should return a valid date time string (#3221) [HUDI-2131] Exception Throw Out When MergeInto With Decimal Type Field (#3224) [HUDI-2122] Improvement in packaging insert into smallfiles (#3213) [HUDI-2132] Make coordinator events as POJO for efficient serialization (#3223) [HUDI-2106] Fix flink batch compaction bug while user don't set compaction tasks (#3192) [HUDI-2133] Support hive1 metadata sync for flink writer (#3225) [HUDI-2089]fix the bug 
that metatable cannot support non_partition table (#3182) [HUDI-2028] Implement RockDbBasedMap as an alternate to DiskBasedMap in ExternalSpillableMap (#3194) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2135] Add compaction schedule option for flink (#3226) [HUDI-2055] Added deltastreamer metric for time of lastSync (#3129) [HUDI-2046] Loaded too many classes like sun/reflect/GeneratedSerializationConstructorAccessor in JVM metaspace (#3121) Loaded too many classes when use kryo of spark to hudi Co-authored-by: weiwei.duan <[email protected]> [HUDI-1996] Adding functionality to allow the providing of basic auth creds for confluent cloud schema registry (#3097) * adding support for basic auth with confluent cloud schema registry [HUDI-2093] Fix empty avro schema path caused by duplicate parameters (#3177) * [HUDI-2093] Fix empty avro schema path caused by duplicate parameters * rename shcmea option key * fix doc * rename var name [HUDI-2113] Fix integration testing failure caused by sql results out of order (#3204) [HUDI-2016] Fixed bootstrap of Metadata Table when some actions are in progress. (#3083) Metadata Table cannot be bootstrapped when any action is in progress. This is detected by the presence of inflight or requested instants. The bootstrapping is initiated in preWrite and postWrite of each commit. So bootstrapping will be retried again until it succeeds. Also added metrics for when the bootstrapping fails or a table is re-bootstrapped. This will help detect tables which are not getting bootstrapped. [HUDI-2140] Fixed the unit test TestHoodieBackedMetadata.testOnlyValidPartitionsAdded. 
(#3234) [HUDI-2115] FileSlices in the filegroup is not descending by timestamp (#3206) [HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (#3149) [HUDI-2069] Refactored String constants (#3172) [HUDI-1105] Adding dedup support for Bulk Insert w/ Rows (#2206) [HUDI-2134]Add generics to avoif forced conversion in BaseSparkCommitActionExecutor#partition (#3232) [HUDI-2009] Fixing extra commit metadata in row writer path (#3075) [HUDI-2099]hive lock which state is WATING should be released, otherwise this hive lock will be locked forever (#3186) [MINOR] Fix build broken from #3186 (#3245) [HUDI-2136] Fix conflict when flink-sql-connector-hive and hudi-flink-bundle are both in flink lib (#3227) [HUDI-2087] Support Append only in Flink stream (#3174) Co-authored-by: 喻兆靖 <[email protected]> Revert "[HUDI-2087] Support Append only in Flink stream (#3174)" (#3251) This reverts commit 371526789d663dee85041eb31c27c52c81ef87ef. [HUDI-2147] Remove unused class AvroConvertor in hudi-flink (#3243) [MINOR] Fix some wrong assert reasons (#3248) [HUDI-2087] Support Append only in Flink stream (#3252) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2143] Tweak the default compaction target IO to 500GB when flink async compaction is off (#3238) [HUDI-2142] Support setting bucket assign parallelism for flink write task (#3239) [HUDI-1483] Support async clustering for deltastreamer and Spark streaming (#3142) - Integrate async clustering service with HoodieDeltaStreamer and HoodieStreamingSink - Added methods in HoodieAsyncService to reuse code [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer [HUDI-2107] Support Read Log Only MOR Table For Spark (#3193) [HUDI-2144]Bug-Fix:Offline clustering(HoodieClusteringJob) will cause insert action losing data (#3240) * fixed * add testUpsertPartitionerWithSmallFileHandlingAndClusteringPlan ut * fix CheckStyle Co-authored-by: yuezhang <[email protected]> [MINOR] Fix 
EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION config (#3250) [HUDI-2171] Add parallelism conf for bootstrap operator [HUDI-2168] Fix for AccessControlException for anonymous user (#3264) [HUDI-1969] Support reading logs for MOR Hive rt table (#3033) [HUDI-2165] Support Transformer for HoodieFlinkStreamer (#3270) * [HUDI-2165] Support Transformer for HoodieFlinkStreamer [HUDI-2180] Fix Compile Error For Spark3 (#3274) [HUDI-1828] Update unit tests to support ORC as the base file format (#3237) [MINOR] Correct the logs of enable/not-enable async cleaner service. (#3271) Co-authored-by: yuezhang <[email protected]> [HUDI-2149] Ensure and Audit docs for every configuration class in the codebase (#3272) - Added docs when missing - Rewrote, reworded as needed - Made couple more classes extend HoodieConfig [HUDI-2029] Implement compression for DiskBasedMap in Spillable Map (#3128) [HUDI-2153] Fix BucketAssignFunction Context NullPointerException [MINOR] Refactor hive sync tool to reduce duplicate code (#3276) * [MINOR] Refactor hive sync tool to reduce duplicate code [MINOR] Allow users to choose ORC as base file format in Spark SQL (#3279) [HUDI-1633] Make callback return HoodieWriteStat (#2445) * CALLBACK add partitionPath * callback can send hoodieWriteStat * add ApiMaturityLevel [HUDI-2185] Remove the default parallelism of index bootstrap and bucket assigner Revert "[HUDI-2087] Support Append only in Flink stream (#3252)" This reverts commit 783c9cb3 [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp (#2438) [HUDI-1884] MergeInto Support Partial Update For COW (#3154) [HUDI-2193] Remove state in BootstrapFunction [HUDI-2161] Adding support to disable meta columns with bulk insert operation (#3247) [HUDI-1860] Add INSERT_OVERWRITE and INSERT_OVERWRITE_TABLE support to DeltaStreamer (#3184) [HUDI-2145] Create new bucket when NewFileAssignState filled (#3258) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2198] Clean and reset the bootstrap 
events for coordinator when task failover (#3304) [HUDI-2007] Fixing hudi_test_suite for spark nodes and adding spark bulk_insert node (#3074) [MINOR] Disable codecov (#3314) [HUDI-2192] Clean up Multiple versions of scala libraries detected Warning (#3292) [HUDI-2204] Add marker files for flink writer (#3316) [HUDI-2195] Sync Hive Failed When Execute CTAS In Spark2 And Spark3 (#3299) [HUDI-2206] Fix checkpoint blocked because getLastPendingInstant() action after than restoreWriteMetadata() action (#3326) [HUDI-2205] Rollback inflight compaction for flink writer (#3320) [HUDI-2139] MergeInto MOR Table May Result InCorrect Result (#3230) [HUDI-2211] Fix NullPointerException in TestHoodieConsoleMetrics (#3331) [HUDI-2212] Missing PrimaryKey In Hoodie Properties For CTAS Table (#3332) [HUDI-2213] Remove unnecessary parameter for HoodieMetrics constructor and fix NPE in UT (#3333) [HUDI-1848] Adding support for HMS for running DDL queries in hive-sy… (#2879) * [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool * [HUDI-1848] Fixing test cases * [HUDI-1848] CR changes * [HUDI-1848] Fix checkstyle violations * [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels. 
* [HUDI-1848] Adding the complex schema and resolving merge conflicts * [HUDI-1848] Adding some more javadocs * [HUDI-1848] Added javadocs for DDLExecutor impls * [HUDI-1848] Fixed style issue [MINOR] Replace deprecated method isDir with isDirectory (#3319) [HUDI-1241] Automate the generation of configs webpage as configs are added to Hudi repo (#3302) [HUDI-2216] Correct the words fiels in the comments to fields (#3339) [MINOR] Close log scanner after compaction completed (#3294) [HUDI-2214]residual temporary files after clustering are not cleaned up (#3335) [HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306) [MINOR] Correct the words accroding in the comments to according (#3343) Correct the words 'accroding' in the comments to 'according' [HUDI-2209] Bulk insert for flink writer (#3334) [HUDI-2219] Fix NPE of HoodieConfig (#3342) [HUDI-2217] Fix no value present in incremental query on MOR (#3340) [HUDI-2223] Fix Alter Partitioned Table Failed (#3350) [HUDI-2227] Only sync hive meta on successful commit for flink batch writer (#3351) [HUDI-2215] Add rateLimiter when Flink writes to hudi. 
(#3338) Co-authored-by: wangminchao <[email protected]> [HUDI-2044] Integrate consumers with rocksDB and compression within External Spillable Map (#3318) [HUDI-2230] Make codahale times transient to avoid serializable exceptions (#3345) [HUDI-2245] BucketAssigner generates the fileId evenly to avoid data skew (#3362) [HUDI-2244] Fix database alreadyExists exception while hive sync (#3361) [HUDI-2228] Add option 'hive_sync.mode' for flink writer (#3352) [HUDI-2241] Explicit parallelism for flink bulk insert (#3357) [HUDI-1425] Performance loss with the additional hoodieRecords.isEmpty() in HoodieSparkSqlWriter#write (#2296) [MINOR] fix check style error (#3365) [HUDI-2117] Unpersist the input rdd after the commit is completed to … (#3207) Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2251] Fix Exception Cause By Table Name Case Sensitivity For Append Mode Write (#3367) [HUDI-2253] Refactoring few tests to reduce runningtime. DeltaStreamer and MultiDeltaStreamer tests. Bulk insert row writer tests (#3371) Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2252] Default consumes from the latest instant for flink streaming reader (#3368) [HUDI-2254] Builtin sort operator for flink bulk insert (#3372) [HUDI-2184] Support setting hive sync partition extractor class based on flink configuration (#3284) [HUDI-2218] Fix missing HoodieWriteStat in HoodieCreateHandle (#3341) [HUDI-2164] Let users build cluster plan and execute this plan at once using HoodieClusteringJob for async clustering (#3259) * add --mode schedule/execute/scheduleandexecute * fix checkstyle * add UT testHoodieAsyncClusteringJobWithScheduleAndExecute * log changed * try to make ut success * try to fix ut * modify ut * review changed * code review * code review * code review * code review Co-authored-by: yuezhang <[email protected]> [HUDI-2177][HUDI-2200] Adding virtual keys support for MOR table (#3315) [MINOR] Improving runtime of TestStructuredStreaming by 2 mins (#3382) 
[HUDI-2225] Add a compaction job in hudi-examples (#3347) [HUDI-2269] Release the disk map resource for flink streaming reader (#3384) [HUDI-2072] Add pre-commit validator framework (#3153) * [HUDI-2072] Add pre-commit validator framework * trigger Travis rebuild [HUDI-2272] Pass base file format to sync clients (#3397) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-1371] [HUDI-1893] Support metadata based listing for Spark DataSource and Spark SQL (#2893) [HUDI-2255] Refactor Datasource options (#3373) Co-authored-by: Wenning Ding <[email protected]> [HUDI-2090] Ensure Disk Maps create a subfolder with appropriate prefixes and cleans them up on close (#3329) * Add UUID to the folder name for External Spillable File System * Fix to ensure that Disk maps folders do not interefere across users * Fix test * Fix test * Rebase with latest mater and address comments * Add Shutdown Hooks for the Disk Map Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2258] Metadata table for flink (#3381) [HUDI-2087] Support Append only in Flink stream (#3390) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2232] [SQL] MERGE INTO fails with table having nested struct (#3379) [HUDI-2273] Migrating some long running tests to functional test profile (#3398) [HUDI-2233] Use HMS To Sync Hive Meta For Spark Sql (#3387) [HUDI-2274] Allows INSERT duplicates for Flink MOR table (#3403) [HUDI-2278] Use INT64 timestamp with precision 3 for flink parquet writer (#3414) [HUDI-2182] Support Compaction Command For Spark Sql (#3277) [MINOR] fix compile error in compaction command (#3421) [HUDI-1468] Support custom clustering strategies and preserve commit metadata as part of clustering (#3419) Co-authored-by: Satish Kotha <[email protected]> [HUDI-1842] Spark Sql Support For pre-existing Hoodie Table (#3393) [HUDI-2243] Support Time Travel Query For Hoodie Table (#3360) [HUDI-2247] Filter file where length less than parquet MAGIC length (#3363) Co-authored-by: 喻兆靖 <[email protected]> 
[HUDI-2208] Support Bulk Insert For Spark Sql (#3328) [HUDI-2194] Skip the latest N partitions when choosing partitions to create ClusteringPlan (#3300) * skip from latest partitions based on hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions && 0(default means skip nothing) * change config verison * add ut Co-authored-by: yuezhang <[email protected]> [HUDI-1771] Propagate CDC format for hoodie (#3285) [HUDI-2288] Support storage on ks3 for hudi (#3434) Co-authored-by: xuzifu <xuzifu.com> [MINOR] Fix travis from errors (#3432) [HUDI-1129] Improving schema evolution support in hudi (#2927) * Adding support to ingest records with old schema after table's schema is evolved * Rebasing against latest master - Trimming test file to be < 800 lines - Renaming config names * Addressing feedback Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Delete useless com.uber.hoodie.hadoop.hive.HoodieCombineHiveInputFormat (#3298) [MINOR] Fix contribution link in PULL_REQUEST_TEMPLATE (#3425) [HUDI-2042] Compare the field object directly in OverwriteWithLatestAvroPayload (#3108) [HUDI-2170] [HUDI-1763] Always choose the latest record for HoodieRecordPayload (#3401) [HUDI-1939] remove joda time in hivesync module (#3430) [HUDI-2292] MOR should not predicate pushdown when reading with payload_combine type (#3443) [HUDI-1774] Adding support for delete_partitions to spark data source (#3437) [HUDI-2286] Handle the case of failed deltacommit on the metadata table. (#3428) A failed deltacommit on the metadata table will be automatically rolled back. Assuming the failed commit was "t10", the rollback will happen the next time at "t11". Post rollback, when we try to sync the dataset to the metadata table, we should look for all unsynched instants including t11. Current code ignores t11 since the latest commit timestamp on metadata table is t11 (due to rollback). 
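The HUDI-2286 note above can be illustrated with a small standalone sketch. It does not use any Hudi classes: the instant names, the list-based timelines, and both helper methods (`afterLatest`, `notYetSynced`) are hypothetical, and the second helper is only one possible way to avoid the problem, not necessarily what #3428 implements. It assumes dataset instants t10 and t11 both completed, while on the metadata table t10 failed and was rolled back at t11.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class UnsyncedInstants {
    // Naive selection (hypothetical): keep only dataset instants that are
    // strictly later than the latest timestamp on the metadata table timeline.
    static List<String> afterLatest(List<String> datasetInstants, String latestMetadataTs) {
        return datasetInstants.stream()
            .filter(ts -> ts.compareTo(latestMetadataTs) > 0)
            .collect(Collectors.toList());
    }

    // Hypothetical alternative: keep every dataset instant that has no
    // completed deltacommit with the same timestamp on the metadata table.
    static List<String> notYetSynced(List<String> datasetInstants, Set<String> syncedTs) {
        return datasetInstants.stream()
            .filter(ts -> !syncedTs.contains(ts))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> dataset = List.of("t09", "t10", "t11");
        // The rollback of the failed t10 deltacommit is itself stamped t11,
        // so t11 becomes the latest timestamp on the metadata table.
        System.out.println(afterLatest(dataset, "t11"));          // prints []
        // Only t09 has a completed deltacommit on the metadata table.
        System.out.println(notYetSynced(dataset, Set.of("t09"))); // prints [t10, t11]
    }
}
```

The first call shows why a "strictly later than the latest timestamp" filter skips t11 once the rollback has advanced the metadata timeline; the second selects by what was actually synced rather than by the latest timestamp.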
[HUDI-2298] The HoodieMergedLogRecordScanner should set up the operation of the chosen record (#3456) [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233) - Can be enabled for cloud stores like S3. Not supported for hdfs yet, due to partial write failures. [HUDI-1518] Remove the logic that delete replaced file when archive (#3310) * remove delete replaced file when archive * done * remove unsed import * remove delete replaced files when archive realted UT * code reviewed Co-authored-by: yuezhang <[email protected]> [HUDI-2017] Add API to set a metric in the registry. (#3084) Registry.add() API adds the new value to existing metric value. For some use-cases We need a API to set/replace the existing value. Metadata Table is synced in preWrite() and postWrite() functions of commit. As part of the sync, the current sizes and basefile/logfile counts are published as metrics. If we use the Registry.add() API, the count and sizes are incorrectly published as sum of the two values. This is corrected by using the Registry.set() API instead. [MINOR] Correct TestKafkaSource class and comment (#3451) MINOR (#3459) MOVE hoodie Deltrstreamer to hudi-utilties [HUDI-2294] Adding virtual keys support to deltastreamer (#3450) [HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427) * [HUDI-1292] Created a config to enable/disable syncing of metadata table. - Metadata Table should only be synced from a single pipeline to prevent conflicts. 
- Skip syncing metadata table for clustering and compaction - Renamed useFileListingMetadata Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Deprecate older configs (#3464) Rename and deprecate props in HoodieWriteConfig Rename and deprecate older props [HUDI-2279]Support column name matching for insert * and update set * in merge into (#3415) [MINOR] Tweak change log more as FULL for flink streaming source (#3466) MINOR fix method use error (#3467) [HUDI-1363] Provide option to drop partition columns (#3465) - Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2151] Flipping defaults (#3452) [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. (#3210) * [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. If the rolled-back instant was synced to the Metadata Table, a corresponding deltacommit with the same timestamp should have been created on the Metadata Table timeline. To ensure we can always perform this check, the Metadata Table instants should not be archived until their corresponding instants are present in the dataset timeline. But ensuring this requires a large number of instants to be kept on the metadata table. In this change, the metadata table will keep at least the number of instants that the main dataset is keeping. If the instant being rolled back was before the metadata table timeline, the code will throw an exception and the metadata table will have to be re-bootstrapped. This should be a very rare occurrence and should occur only when the dataset is being repaired by rolling back multiple commits or restoring to a much older time. * Fixed checkstyle * Improvements from review comments. Fixed checkstyle Replaced explicit null check with Option.ofNullable Removed redundant function getSynedInstantTime * Renamed getSyncedInstantTime and getSyncedInstantTimeForReader. 
Sync is confusing so renamed to getUpdateTime() and getReaderTime(). * Removed getReaderTime which is only for testing as the same method can be accessed during testing differently without making it part of the public interface. * Fix compilation error * Reverting changes to HoodieMetadataFileSystemView Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2307] When using delete_partition with ds should not rely on the primary key (#3469) - Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2305] Add MARKERS.type and fix marker-based rollback (#3472) - Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes. [HUDI-1897] Deltastreamer source for AWS S3 (#3433) - Added two sources for two stage pipeline. a. S3EventsSource that fetches events from SQS and ingests to a meta hoodie table. b. S3EventsHoodieIncrSource reads S3 events from this meta hoodie table, fetches actual objects from S3 and ingests to sink hoodie table. - Added selectors to assist in S3EventsSource. Co-authored-by: Satish M <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Adding back all old default val members to DataSourceOptions (#3474) - Added @Deprecated - Added @deprecated javadoc to keys and defaults suggested how to migrate - Moved all deprecated members to bottom to improve readability [HUDI-2268] Add upgrade and downgrade to and from 0.9.0 (#3470) - Added upgrade and downgrade step to and from 0.9.0. Upgrade adds few table properties. Downgrade recreates timeline server based marker files if any. Moving to 0.10.0-SNAPSHOT on master branch. 
[HOT-FIX] Add apache license to spark_command.txt.template (#3477) [MINOR] Fix SelectPackages in HoodieSparkFunctionalTestSuite (#3476) [HUDI-2191] Bump flink version to 1.13.1 (#3291) [HUDI-2301] fix FileSliceMetrics utils bug (#3487) HUDI-1674 (#3488) [HUDI-2167] HoodieCompactionConfig get HoodieCleaningPolicy NullPointerException close apache/hudi#3402 [HUDI-2316] Support Flink batch upsert (#3494) [HUDI-1363] Include _hoodie_operation meta column in removeMetadataFields (#3501) [MINOR] Fixing release validation script (#3493) [MINOR] Some cosmetic changes for Flink (#3503) [HUDI-2322] Use correct meta columns while preparing dataset for bulk insert (#3504) Restore 0.8.0 config keys with deprecated annotation (#3506) Co-authored-by: Sagar Sumit <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2339] Create Table If Not Exists Failed After Alter Table (#3510) Keep non-conflicting names for common configs between DataSourceOptions and HoodieWriteConfig (#3511) [HUDI-2340] Merge the data set for flink bounded source when changelog mode turns off (#3513) [HUDI-2342] Optimize Bootstrap operator (#3516) Co-authored-by: 喻兆靖 <[email protected]> Support referencing subquery with column aliases by table alias in merge into (#3380) [MINOR] Fix BatchBootstrapOperator initialization (#3520) [HUDI-2345] Hoodie columns sort partitioner for bulk insert (#3523) Co-authored-by: yuezhang <[email protected]> [HUDI-2349] Adding spark delete node to integ test suite (#3528) [HUDI-2262] reduce build warnings (#3481) [HUDI-2352] The upgrade downgrade action of flink writer should be singleton (#3531) [MINOR] Update DOAP with 0.9.0 Release (#3537) [HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes [HUDI-2366] fix too many logs (#3543) [HUDI-2357] MERGE INTO doesn't work for tables created using CTAS (#3534) [HUDI-2368] Catch Throwable in BoundedInMemoryExecutor (#3546) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2321] Use the 
caller classloader for ReflectionUtils (#3535) Based on the discussion on stackoverflow: https://stackoverflow.com/questions/1771679/difference-between-threads-context-class-loader-and-normal-classloader The Thread.currentThread().getContextClassLoader() should never be used because the context classloader is not immutable, user can overwrite it when thread switches, it is also nullable. The objection here: https://stackoverflow.com/a/36228195 says the Thread.currentThread().getContextClassLoader() is a JDK design error and the context classloader is never suggested to be used. The API that needs classloader should ask the user to set up the right classloader. [HUDI-2264] Refactor HoodieSparkSqlWriterSuite to add setup and teardown (#3544) [HUDI-2229] Refact HoodieFlinkStreamer to reuse the pipeline of HoodieTableSink (#3495) Co-authored-by: mikewu <[email protected]> [HUDI-2365]Optimizing overwriteField method with Objects.equals (#3542) Optimizing overwriteField method with Objects.equals [HUDI-2371] Improvement flink streaming reader (#3552) - Support reading empty table - Fix filtering by partition path - Support reading from earliest commit [HUDI-2320] Add support ByteArrayDeserializer in AvroKafkaSource (#3502) [HUDI-2378] Add configs for common and pre validate (#3564) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2379] Include the pending compaction file groups for flink (#3567) streaming reader [HUDI-2280] Use GitHub Actions to build different scala spark versions (#3556) [HUDI-2384] Change log file size config to long (#3577) [HUDI-2376] Add pipeline for Append mode (#3573) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2392] Do not send partition delete record when changelog mode enabled (#3586) [MINOR] Skip checkstyle and rat in Azure (#3593) - make tests run through without being blocked by style issues - let GitHub Actions tasks give quick feedback on build, style and other checks [HUDI-1989] Fix flakiness in TestHoodieMergeOnReadTable 
(#3574) * [HUDI-1989] Refactor clustering tests for MoR table * refactor assertion helper * add CheckedFunction * SparkClientFunctionalTestHarness.java * put back original test case * move testcases out from TestHoodieMergeOnReadTable.java * add TestHoodieSparkMergeOnReadTableRollback.java * use SparkClientFunctionalTestHarness * add tag [HUDI-1989] Disable HDFSParquetImporter related tests (#3597) Also mark HDFSParquetImportCommand and HDFSParquetImporter as deprecated. [HUDI-2380] The default archive folder should be 'archived' (#3568) [HUDI-2399] Rebalance CI jobs for shorter wait time (#3604) [MINOR] Fixing some functional tests by moving to right packages (#3596) [HUDI-2079] Make CLI command tests functional (#3601) Make all tests in org.apache.hudi.cli.commands extend org.apache.hudi.cli.functional.CLIFunctionalTestHarness and tag as "functional". This also resolves a blocker where DFS init consistently failed when moving to ubuntu 18.04 MINOR_CHECKSTYLE (#3616) Fix checkstyle [HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409) Update Azure CI ubuntu from 16.04 to 18.04 due to 16.04 will be removed soon Fixed some consistently failed tests * fix TestCOWDataSourceStorage TestMORDataSourceStorage * reset mocks Also update readme badge Co-authored-by: Raymond Xu <[email protected]> [HUDI-2401] Load archived instants for flink streaming reader (#3610) [MINOR] Remove commenting from Github, JIRA bridge (#3620) [HUDI-2403] Add metadata table listing for flink query source (#3618) Add the document to the PUSHGATEWAY configuration item (#3627) [MINOR] Remove unused variables (#3631) [HUDI-2408] Deprecate FunctionalTestHarness to avoid init DFS (#3628) [HUDI-2351] Extract common FS and IO utils for marker mechanism (#3529) [MINOR] Correct the comment for the parallelism of tasks in FlinkOptions (#3634) [HUDI-2411] Remove unnecessary method overriden and note (#3636) [HUDI-2393] Add yamls for large scale testing (#3594) [MINOR] Add avro schema evolution test with 
(non)nullable column and with(out) default value (#3639) [HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592) - Fixing packaging, naming of classes - Use of log4j over slf4j for uniformity - More follow-on fixes - Added a version to control/coordinator events. - Eliminated the config added to write config - Fixed fetching of checkpoints based on table type - Clean up of naming, code placement Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2354] Fix TimelineServer error because of replacecommit archive (#3536) * bug fixed * done * done * travis fix * code reviewed * code review * done * code reviewed Co-authored-by: yuezhang <[email protected]> [HUDI-2412] Add timestamp based partitioning for flink writer (#3638) [MINOR] fix typo (#3640) [MINOR] Fix typo, 'requried' corrected to 'required' (#3643) [HUDI-2415] Add more info log for flink streaming reader (#3642) [HUDI-2398] Collect event time for inserts in DefaultHoodieRecordPayload (#3602) [MINOR] Fix the default parallelism of write task (#3649) [HUDI-2397] Add `--enable-sync` parameter (#3608) * add meta-sync config * update test * keep enableMetaSync same with enableHiveSync * Switch check logic to use `enableMetaSync` [HUDI-2410] Fix getDefaultBootstrapIndexClass logical error (#3633) [HUDI-2421] Catch the throwable when scheduling the cleaning task for flink writer (#3650) [HUDI-2425] TestHoodieMultiTableDeltaStreamer CI failed due to exception (#3654) [HUDI-2388] Add DAG nodes for Spark SQL in integration test suite (#3583) - Fixed validation in integ test suite for both deltastreamer and write client path. 
Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2428] Fix protocol and other issues after stress testing Hudi Kafka Connect (#3656) * Fixes based on tests and some improvements * Fix the issues after running stress tests * Fixing checkstyle issues and updating README Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Update Kafka connect sink readme [HUDI-2430] Make decimal compatible with hudi for flink writer (#3658) [MINOR] Add document for DataSourceReadOptions (#3653) [MINOR] Delete Redundant code (#3661) [HUDI-2433] Refactor rollback actions in hudi-client module (#3664) [HUDI-2355][Bug]Archive service executed after cleaner finished. (#3545) Co-authored-by: yuezhang <[email protected]> [HUDI-2423] Separate some config logic from HoodieMetricsConfig into HoodieMetricsGraphiteConfig HoodieMetricsJmxConfig (#3652) [HUDI-2404] Add metrics-jmx to spark and flink bundles (#3632) [HUDI-2422] Adding rollback plan and rollback requested instant (#3651) - This patch introduces rollback plan and rollback.requested instant. Rollback will be done in two phases, namely rollback plan and rollback action. In planning, we prepare the rollback plan and serialize it to rollback.requested. In the rollback action phase, we fetch details from the plan and just delete the files as per the plan. This will ensure final rollback commit metadata will contain all files that got rolled back even if rollback failed midway and retried again. 
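The two-phase rollback described for HUDI-2422 can be sketched roughly as below. This is only an illustration under stated assumptions, not Hudi's implementation: `RollbackSketch`, `planRollback` and `executeRollback` are hypothetical names, storage is modelled as an in-memory set of file names, and the plan is a plain list rather than something serialized to rollback.requested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "plan first, then act"; not Hudi code.
class RollbackSketch {

    // Phase 1 (planning): record every file the rollback intends to delete.
    // Hudi persists this plan; here it is simply returned as a sorted list.
    static List<String> planRollback(Set<String> filesOfFailedCommit) {
        List<String> plan = new ArrayList<>(filesOfFailedCommit);
        Collections.sort(plan);
        return plan;
    }

    // Phase 2 (action): delete strictly according to the plan. A file that an
    // earlier, interrupted attempt already removed is still reported, so the
    // result lists every planned file regardless of retries.
    static List<String> executeRollback(List<String> plan, Set<String> storage) {
        List<String> rolledBack = new ArrayList<>();
        for (String file : plan) {
            storage.remove(file); // no-op if a previous attempt deleted it
            rolledBack.add(file);
        }
        return rolledBack;
    }

    public static void main(String[] args) {
        Set<String> storage = new HashSet<>(Arrays.asList("f1.parquet", "f2.parquet", "keep.parquet"));
        List<String> plan = planRollback(new HashSet<>(Arrays.asList("f1.parquet", "f2.parquet")));

        storage.remove("f1.parquet"); // simulate a first attempt that failed midway

        // The retry reads the same plan, so its result still covers f1.parquet.
        System.out.println(executeRollback(plan, storage));
        System.out.println(storage);
    }
}
```

Because the retry works from the stored plan instead of re-listing storage, the files removed by the interrupted attempt are not lost from the final result.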
[HUDI-2330][HUDI-2335] Adding support for merge-on-read tables (#3679) - Inserts go into logs, hashed by Kafka and Hudi partitions - Fixed issues with the setupKafka script - Bumped up the default commit interval to 300 seconds - Minor renaming [MINOR] Fix typo,'compatiblity' corrected to 'compatibility' (#3675) [HUDI-2434] Make periodSeconds of GraphiteReporter configurable (#3667) [HUDI-2447] Extract common business logic & Fix typo (#3683) [HUDI-2267] Update docs and infra test configs, add support for graphite (#3482) Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2449] Incremental read for Flink (#3686) [HUDI-2444] Fixing delete files corner cases wrt cleaning and rollback when applying changes to metadata (#3678) [HUDI-2343]Fix the exception for mergeInto when the primaryKey and preCombineField of source table and target table differ in case only (#3517) [MINOR] Fix typo."funcitons" corrected to "functions" (#3681) [MINOR] Cosmetic changes for flink (#3701) [HUDI-2479] HoodieFileIndex throws NPE for FileSlice with pure log files (#3702) [HUDI-2395] Metadata tests rewrite (#3695) - Added commit metadata infra to test table so that we can test entire metadata using test table itself. These tests don't care about the contents of files as such and hence we should be able to test all code paths for metadata using test table. 
Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2383] Clean the marker files after compaction (#3576) [HUDI-2248] Fixing the closing of hms client (#3364) * [HUDI-2248] Fixing the closing of hms client * [HUDI-2248] Using Hive.closeCurrent() over client.close() [HUDI-2385] Make parquet dictionary encoding configurable (#3578) Co-authored-by: leesf <[email protected]> [HUDI-2483] Infer changelog mode for flink compactor (#3706) [HUDI-2485] Consume as mini-batch for flink stream reader (#3710) [HUDI-2484] Fix hive sync mode setting in Deltastreamer (#3712) [HUDI-2451] On windows client with hdfs server for wrong file separator (#3687) Co-authored-by: yao.zhou <[email protected]> [MINOR] fix typo,'SPAKR' corrected to 'SPARK' (#3721) [MINOR] Fix typo,'Kakfa' corrected to 'Kafka' & 'parquest' corrected to 'parquet' (#3717) [HUDI-2487] Fix JsonKafkaSource cannot filter empty messages from kafka (#3715) [HUDI-2474] Refreshing timeline for every operation in Hudi when metadata is enabled (#3698) [MINOR] Add a RFC template and folder (#3726) [HUDI-2277] HoodieDeltaStreamer reading ORC files directly using ORCDFSSource (#3413) * add ORCDFSSource to support reading orc file into hudi format && add UTs * remove ununsed import * simplify tes * code review * code review * code review * code review * code review * code review Co-authored-by: yuezhang <[email protected]> [MINOR] Fix typo Hooodie corrected to Hoodie & reuqired corrected to required (#3730) [MINOR] Support JuiceFileSystem (#3729) [HUDI-2440] Add dependency change diff script for dependency governace (#3674) [HUDI-2499] Making jdbc-url, user and pass as non-required field for other sync modes (#3732) [HUDI-2497] Refactor clean and restore actions in hudi-client module (#3734) [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timelime. 
- This patch adds synchronous updates to the metadata table: every write is committed to the metadata table first and then to the data table. When reading the metadata table, we ignore any delta commits that are present only in the metadata table and not in the data table timeline.
- Compaction of the metadata table is fenced: we trigger it only when there are no inflight requests in the data table. This ensures that all base files in the metadata table are always in sync with the data table (without any holes); the only possible discrepancy is some extra invalid commits among the delta log files in the metadata table.
- Because of this, archival of the data table also fences itself up to the compacted instant in the metadata table. All writes to the metadata table happen within the data table lock, so the metadata table works in single-writer mode only. This might be tough to loosen, since all writers write to the same FILES partition and would conflict anyway.
- As part of this, commits for operations that did not previously acquire the data table lock (rollback, clean, compaction, cluster) now acquire it. Note that we are not doing any conflict resolution; we only take a lock while committing, so that the metadata table always has a single writer.
- Also added a building block for adding buckets to partitions, which will be leveraged by other indexes such as the record-level index. For now, the FILES partition has only one bucket. In general, any number of buckets per partition is allowed, and each partition has a fixed fileId prefix with an incremental suffix for each bucket within the partition. Fixed [HUDI-2476]: a compaction that succeeded in the metadata table the first time but failed in the data table is now retried.
- Enabling metadata table by default.
- Adding more tests for metadata table Co-authored-by: Prashant Wason <[email protected]> [HUDI-2456] support 'show partitions' sql (#3693) [HUDI-2513] Refactor table upgrade and downgrade actions in hudi-client module (#3743) [MINOR] Fix typo,'properites' corrected to 'properties' (#3738) [HUDI-2530] Adding async compaction support to integ test suite framework (#3750) [HUDI-2534] Remove the sort operation when bulk_insert in batch mode (#3772) [HUDI-2537] Fix metadata table for flink (#3774) [HUDI-2496] Insert duplicate records when precombined is deactivated for "insert" operation (#3740) [HUDI-2542] AppendWriteFunction throws NPE when checkpointing without written data (#3777) [HUDI-2540] Fixed wrong validation for metadataTableEnabled in HoodieTable (#3781) [MINOR] Fix typo,'paritition' corrected to 'partition' (#3764) [HUDI-2532] Metadata table compaction trigger max delta commits (#3784) - Setting the max delta commits default value from 24 to 10 to trigger the compaction in metadata table. [HUDI-2494] Fixing glob pattern to skip all hoodie meta paths (#3768) [HUDI-2435][BUG]Fix clustering handle errors (#3666) * done * remove unused imports * code reviewed * code reviewed Co-authored-by: yuezhang <[email protected]> Reviewers: O955 Project Hoodie Project Reviewer: Add blocking …

…rom OSS master. Summary: [HUDI-1731] Rename UpsertPartitioner in hudi-java-client (#2734) Co-authored-by: lei.zhu <[email protected]> Preparation for Avro update (#2650) [MINOR] Delete useless UpsertPartitioner for flink integration (#2746) [HUDI-1738] Emit deletes for flink MOR table streaming read (#2742) Currently we do a soft delete for DELETE row data when writing into a hoodie table. For streaming read of a MOR table, the Flink reader detects the delete records and still emits them if the record key semantics are kept. This is useful, and in fact required, for incremental computation in streaming ETL pipelines. [HUDI-1591] Implement Spark's FileIndex for Hudi to support queries via Hudi DataSource using non-globbed table path and partition pruning (#2651) [HUDI-1737][hudi-client] Code Cleanup: Extract common method in HoodieCreateHandle & FlinkCreateHandle (#2745) [HUDI-1696] add apache commons-codec dependency to flink-bundle explicitly (#2758) [HUDI-1749] Clean/Compaction/Rollback command maybe never exit when operation fail (#2752) [HUDI-1757] Assigns the buckets by record key for Flink writer (#2757) Currently we assign the buckets by record partition path, which can cause hotspots if the partition field is of datetime type. This change assigns buckets by first grouping the records by their key; the assignment is valid only if there is no conflict (two tasks writing to the same bucket). This patch also changes the coordinator execution to be asynchronous.
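To illustrate the hotspot problem that HUDI-1757 describes, here is a small sketch. It assumes, for illustration only, that records are routed to a write task by hashing a shuffle key modulo the task count; `BucketAssignSketch`, `taskFor` and `tasksUsed` are hypothetical names, and this is not how the Flink writer is actually wired.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: compare shuffling by partition path vs. by record key.
class BucketAssignSketch {

    // Route a record to a task by hashing its shuffle key (assumed scheme).
    static int taskFor(String shuffleKey, int numTasks) {
        return Math.floorMod(shuffleKey.hashCode(), numTasks);
    }

    // Collect the distinct tasks that receive at least one record.
    static Set<Integer> tasksUsed(String[] shuffleKeys, int numTasks) {
        Set<Integer> tasks = new HashSet<>();
        for (String key : shuffleKeys) {
            tasks.add(taskFor(key, numTasks));
        }
        return tasks;
    }

    public static void main(String[] args) {
        int numTasks = 4;
        String[] recordKeys = new String[10];
        String[] partitionPaths = new String[10];
        for (int i = 0; i < 10; i++) {
            recordKeys[i] = "id-" + i;
            partitionPaths[i] = "2021-04-01"; // datetime partition: all records land in one day
        }
        // Shuffling by partition path sends the whole day to a single task.
        System.out.println("tasks used by partition path: " + tasksUsed(partitionPaths, numTasks).size());
        // Shuffling by record key spreads the same records over several tasks,
        // while a given key always maps to the same task.
        System.out.println("tasks used by record key: " + tasksUsed(recordKeys, numTasks).size());
    }
}
```

Since a given key always hashes to the same task, two tasks never handle the same record key, which is the no-conflict condition the commit message refers to.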
[MINOR] Fix deprecated build link for travis (#2778) [HUDI-1750] Fail to load user's class if user move hudi-spark-bundle jar into spark classpath (#2753) [HUDI-1767] Add setter to HoodieKey and HoodieRecordLocation to have better SE/DE performance for Flink (#2779) [HUDI-1751] DeltaStreamer print many unnecessary warn log (#2754) [HUDI-1772] HoodieFileGroupId compareTo logical error(fileId self compare) (#2780) [HUDI-1773] HoodieFileGroup code optimize (#2781) [MINOR] Some unit test code optimize (#2782) * Optimized code * Optimized code [HUDI-699] Fix CompactionCommand and add unit test for CompactionCommand (#2325) [HUDI-1778] Add setter to CompactionPlanEvent and CompactionCommitEvent to have better SE/DE performance for Flink (#2789) [MINOR] Update doap with 0.8.0 release (#2772) [HUDI-1775] Add option for compaction parallelism (#2785) [HUDI-1783] Support Huawei Cloud Object Storage (#2796) [MINOR] fix typo. (#2804) [MINOR] Remove unused imports and some other checkstyle issues (#2800) [HUDI-1784] Added print detailed stack log when hbase connection error (#2799) [HUDI-1785] Move OperationConverter to hudi-client-common for code reuse (#2798) [HUDI-1786] Add option for merge max memory (#2805) [HUDI-1787] Remove the rocksdb jar from hudi-flink-bundle (#2807) Remove the RocksDB jar from hudi-flink-bundle to avoid conflicts. [HUDI-1720] Fix RealtimeCompactedRecordReader StackOverflowError (#2721) [HUDI-1788] Insert overwrite (table) for Flink writer (#2808) Supports `INSERT OVERWRITE` and `INSERT OVERWRITE TABLE` for Flink writer. [HUDI-1615] Fixing usage of NULL schema for delete operation in HoodieSparkSqlWriter (#2777) [Hotfix][utilities] Optimized codes (#2821) [HUDI-1798] Flink streaming reader should always monitor the delta commits files (#2825) The streaming reader should only monitor the delta log files, if there are parquet commits but we recognize as logs, the reader would report FileNotFound exception. 
[HUDI-1797] Remove the com.google.guave jar from hudi-flink-bundle to avoid conflicts. (#2828) Co-authored-by: wangminchao <[email protected]> [HUDI-1801] FlinkMergeHandle rolling over may miss to rename the latest file handle (#2831) The FlinkMergeHandle may rename the (N-1)th file handle instead of the latest one, causing data duplication. [HUDI-1792] flink-client query error when processing files larger than 128mb (#2814) Co-authored-by: huangjing <[email protected]> [HUDI-1803] Support BAIDU AFS storage format in hudi (#2836) [MINOR] Add jackson module to presto bundle (#2816) [MINOR][hudi-sync] Fix typos (#2844) [HUDI-1804] Continue to write when Flink write task restart because of container killing (#2843) The `FlinkMergeHandle` creates a marker file under the metadata path each time it initializes. When a write task restarts after being killed, it tries to create the already existing file and reports an error. To solve this problem, skip the creation and use the original data file as the base file to merge. [HUDI-1716]: Resolving default values for schema from dataframe (#2765) - Adding default values and setting null as first entry in UNION data types in avro schema.
Co-authored-by: Aditya Tiwari <[email protected]> [HUDI-1802] Timeline Server Bundle need to include com.esotericsoftware package (#2835) [HUDI-1744] rollback fails on mor table when the partition path hasn't any files (#2749) Co-authored-by: lrz <[email protected]> [MINOR] Added metric reporter Prometheus to HoodieBackedTableMetadataWriter (#2842) [HUDI-1809] Flink merge on read input split uses wrong base file path for default merge type (#2846) [HUDI-1764] Add Hudi-CLI support for clustering (#2773) * tmp base * update * update unit test * update * update * update CLI parameters * linting * update doSchedule in HoodieClusteringJob * update * update diff according to comments [HUDI-1415] Read Hoodie Table As Spark DataSource Table (#2283) [HUDI-1814] Non partitioned table for Flink writer (#2859) [HUDI-1812] Add explicit index state TTL option for Flink writer (#2853) [MINOR] Expose the detailed exception object (#2861) [HUDI-1714] Added tests to TestHoodieTimelineArchiveLog for the archival of compl… (#2677) * Added tests to TestHoodieTimelineArchiveLog for the archival of completed clean and rollback actions. * Adding code review changes * [HUDI-1714] Minor Fixes [HUDI-1746] Added support for replace commits in commit showpartitions, commit show_write_stats, commit showfiles (#2678) * Added support for replace commits in commit showpartitions, commit show_write_stats, commit showfiles * Adding CR changes * [HUDI-1746] Code review changes [HUDI-1551] Add support for BigDecimal and Integer when partitioning based on time. (#2851) Co-authored-by: trungchanh.le <[email protected]> [HUDI-1829] Use while loop instead of recursive call in MergeOnReadInputFormat to avoid StackOverflow (#2862) Recursive calls risk a StackOverflowError when the recursion gets too deep.
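The general idea behind HUDI-1829 (a loop instead of a recursive call when skipping records) can be sketched as follows. This is a generic illustration, not the actual MergeOnReadInputFormat code: `SkipLoopSketch`, the `DELETED` marker and both methods are hypothetical names introduced here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: find the next record that is not marked as deleted.
class SkipLoopSketch {
    static final String DELETED = "<deleted>";

    // Recursive form: every skipped record adds one stack frame, so a long
    // run of skipped records can exhaust the stack.
    static String nextLiveRecursive(Iterator<String> it) {
        if (!it.hasNext()) {
            return null;
        }
        String record = it.next();
        return DELETED.equals(record) ? nextLiveRecursive(it) : record;
    }

    // Iterative form: same result with constant stack depth.
    static String nextLiveIterative(Iterator<String> it) {
        while (it.hasNext()) {
            String record = it.next();
            if (!DELETED.equals(record)) {
                return record;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // One million skipped records followed by a single live one.
        List<String> records = new ArrayList<>(Collections.nCopies(1_000_000, DELETED));
        records.add("live");

        System.out.println(nextLiveIterative(records.iterator()));
        try {
            // Depending on the JVM stack size, this may overflow the stack.
            System.out.println(nextLiveRecursive(records.iterator()));
        } catch (StackOverflowError e) {
            System.out.println("recursive version overflowed the stack");
        }
    }
}
```

Both forms return the same record; only the stack usage differs, which is why the loop is the safer choice when the number of skipped records is unbounded.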
[HUDI-1844] Add option to flush when total buckets memory exceeds the threshold (#2877) Current code supports flushing as per-bucket memory usage, while the buckets may still take too much memory for bootstrap from history data. When the threshold hits, flush out half of the buckets with bigger buffer size. [HUDI-1835] Fixing kafka native config param for auto offset reset (#2864) [HUDI-1837] Add optional instant range to log record scanner for log (#2870) [HUDI-1742] Improve table level config priority for HoodieMultiTableDeltaStreamer (#2744) [MINOR] Remove redundant method-calling. (#2881) [HUDI-1841] Tweak the min max commits to keep when setting up cleaning retain commits for Flink (#2875) [HUDI-1836] Logging consuming instant to StreamReadOperator#processSplits (#2867) [HUDI-1690] use jsc union instead of rdd union (#2872) [MINOR] Refactor method up to parent-class (#2822) [HUDI-1833] rollback pending clustering even if there is greater commit (#2863) * [HUDI-1833] rollback pending clustering even if there are greater commits [HUDI-1858] Fix cannot create table due to jar conflict (#2886) Co-authored-by: 狄杰 <[email protected]> [HUDI-1845] Exception Throws When Sync Non-Partitioned Table To Hive With MultiPartKeysValueExtractor (#2876) [HUDI-1863] Add rate limiter to Flink writer to avoid OOM for bootstrap (#2891) [HUDI-1867] Streaming read for Flink COW table (#2895) Supports streaming read for Copy On Write table. 
[HUDI-1817] Fix getting incorrect partition path while using incr query by spark-sql (#2858) [HUDI-1811] Fix TestHoodieRealtimeRecordReader (#2873) Pass basePath with scheme 'file://' to HoodieRealtimeFileSplit [HUDI-1810] Fix azure setting for integ tests (#2889) [HUDI-1620] Fix Metrics UT (#2894) Make sure shutdown Metrics between unit test cases to ensure isolation [HUDI-1852] Add SCHEMA_REGISTRY_SOURCE_URL_SUFFIX and SCHEMA_REGISTRY_TARGET_URL_SUFFIX property (#2884) [HUDI-1781] Fix Flink streaming reader throws ClassCastException (#2900) [HUDI-1718] When query incr view of mor table which has Multi level partitions, the query failed (#2716) [HUDI-1876] wiring in Hadoop Conf with AvroSchemaConverters instantiation (#2914) [HUDI-1821] Remove legacy code for Flink writer (#2868) [HUDI-1880] Support streaming read with compaction and cleaning (#2921) [HUDI-1759] Save one connection retry to hive metastore when hiveSyncTool run with useJdbc=false (#2759) * [HUDI-1759] Save one connection retry to hive metastore when hiveSyncTool run with useJdbc=false * Fix review comment [HUDI-1878] Add max memory option for flink writer task (#2920) Also removes the rate limiter because it has the similar functionality, modify the create and merge handle cleans the retry files automatically. [HUDI-1886] Avoid to generates corrupted files for flink sink (#2929) [MINOR] optimize FilePathUtils (#2931) [HUDI-1707] Reduces log level for too verbose messages from info to debug level. (#2714) * Reduces log level for too verbose messages from info to debug level. * Sort config output. 
* Code Review : Small restructuring + rebasing to master - Fixing flaky multi delta streamer test - Using isDebugEnabled() checks - Some changes to shorten log message without moving to DEBUG Co-authored-by: volodymyr.burenin <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-1789] Support reading older snapshots (#2809) * [HUDI-1789] In HoodieParquetInoutFormat we currently default to the latest version of base files. This PR attempts to add a new jobConf `hoodie.%s.consume.snapshot.time` This new config will allow us to read older snapshots. - Reusing hoodie.%s.consume.commit for point in time snapshot queries as well. - Adding javadocs and some more tests [HUDI-1890] FlinkCreateHandle and FlinkAppendHandle canWrite should always return true (#2933) The method #canWrite should always return true because they can already write based on file size, e.g. the BucketAssigner. [HUDI-1818] Validate required fields for Flink HoodieTable (#2930) [HUDI-1851] Adding test suite long running automate scripts for docker (#2880) [HUDI-1055] Remove hardcoded parquet in tests (#2740) * Remove hardcoded parquet in tests * Use DataFileUtils.getInstance * Renaming DataFileUtils to BaseFileUtils Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-1768] add spark datasource unit test for schema validate add column (#2776) [HUDI-1895] Close the file handles gracefully for flink write function to avoid corrupted files (#2938) [HUDI-1722]Fix hive beeline/spark-sql query specified field on mor table occur NPE (#2722) [HUDI-1900] Always close the file handle for a flink mini-batch write (#2943) Close the file handle eagerly to avoid corrupted files as much as possible. 
[HUDI-1446] Support skip bootstrapIndex's init in abstract fs view init (#2520) Co-authored-by: zhongliang <[email protected]> Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-1902] Clean the corrupted files generated by FlinkMergeAndReplaceHandle (#2949) Make the intermediate files of FlinkMergeAndReplaceHandle hidden, when committing the instant, clean these files in case there was some corrupted files left(in normal case, the intermediate files should be cleaned by the FlinkMergeAndReplaceHandle itself). [MINOR][hudi-client] Code-cleanup,remove redundant variable declarations (#2956) [HUDI-1902] Global index for flink writer (#2958) Supports deduplication for record keys with different partition path. [HUDI-1911] Reuse the partition path and file group id for flink write data buffer (#2961) Reuse to reduce memory footprint. [HUDI-1806] Honoring skipROSuffix in spark ds (#2882) * Honoring skipROSuffix in spark ds * Adding tests * fixing scala checkstype issue [HUDI-1913] Using streams instead of loops for input/output (#2962) [MINOR] Remove unused method in BaseSparkCommitActionExecutor (#2965) [HUDI-1915] Fix the file id for write data buffer before flushing (#2966) [HUDI-1871] Fix hive conf for Flink writer hive meta sync (#2968) [HUDI-1719] hive on spark/mr,Incremental query of the mor table, the partition field is incorrect (#2720) [HUDI-1917] Remove the metadata sync logic in HoodieFlinkWriteClient#preWrite because it is not thread safe (#2971) [HUDI-1888] Fix NPE when the nested partition path field has null value (#2957) [HUDI-1918] Fix incorrect keyBy field cause serious data skew, to avoid multiple subtasks write to a partition at the same time (#2972) [HUDI-1740] Fix insert-overwrite API archival (#2784) - fix problem of archiving replace commits - Fix problem when getting empty replacecommit.requested - Improved the logic of handling empty and non-empty requested/inflight commit files. 
Added unit tests to cover both empty and non-empty inflight files cases and cleaned up some unused test util methods Co-authored-by: yorkzero831 <[email protected]> Co-authored-by: zheren.yu <[email protected]> [MINOR] Update the javadoc of EngineType (#2979) [HUDI-1873] collect() call causing issues with very large upserts (#2907) Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-1919] Type mismatch when streaming read copy_on_write table using flink (#2986) * [HUDI-1919] Type mismatch when streaming read copy_on_write table using flink #2976 * Update ParquetSplitReaderUtil.java [HUDI-1920] Set archived as the default value of HOODIE_ARCHIVELOG_FOLDER_PROP_NAME (#2978) [HUDI-1723] Fix path selector listing files with the same mod date (#2845) [HUDI-1922] Bulk insert with row writer supports mor table (#2981) [HUDI-1935] Updated Logger statement (#2996) Co-authored-by: veenaypatil <[email protected]> [HUDI-1865] Make embedded time line service singleton (#2899) [FLINK-1923] Exactly-once write for flink writer (#3002) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1940] Add SqlQueryBasedTransformer unit test (#3004) [HUDI-1800] Exclude file slices in pending compaction when performing small file sizing (#2902) Co-authored-by: Ryan Pifer <[email protected]> [HUDI-1879] Support Partition Prune For MergeOnRead Snapshot Table (#2926) [MINOR] 'return' is unnecessary as the last statement in a 'void' method (#3012) fix the grammer err of the comment (#3013) Co-authored-by: ywang46 <[email protected]> [HUDI-1948] Shade kryo-shaded jar for hudi flink bundle (#3014) [MINOR] The collection can use forEach() directly (#3016) [MINOR] Access the static member getLastHeartbeatTime via the class instead (#3015) [HUDI-1943] Lose properties when hoodieWriteConfig initializtion (#3006) * [hudi-flink]fix lose properties problem Co-authored-by: haoke <[email protected]> [HUDI-1927] Improve HoodieFlinkStreamer (#3019) Co-authored-by: enter58xuan <[email protected]> 
[HUDI-1949] Refactor BucketAssigner to make it more efficient (#3017) Add a process single class WriteProfile, the record and small files profile re-construction can be more efficient if we reuse by same checkpoint id. [HUDI-1921] Add target io option for flink compaction (#2980) [HUDI-1952] Fix hive3 meta sync for flink writer (#3021) [HUDI-1953] Fix NPE due to not set the output type of the operator (#3023) Co-authored-by: enter58xuan <[email protected]> [HUDI-1957] Fix flink timeline service lack jetty dependency (#3028) [MINOR] Remove the implementation of Serializable from HoodieException (#3020) [MINOR] Remove unused method in DataSourceUtils (#3031) [HUDI-1961] Add a debezium json integration test case for flink (#3030) [MINOR] Resolve build issue arising from inaccessible pentaho jar (#3034) - Fixes #160 #2479 [HUDI-1954] only reset bucket when flush bucket success (#3029) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1281] Add deltacommit to ActionType (#3018) Co-authored-by: veenaypatil <[email protected]> [HUDI-1967] Fix the NPE for MOR Hive rt table query (#3032) The HoodieInputFormatUtils.getTableMetaClientByBasePath returns the map with table base path as keys while the HoodieRealtimeInputFormatUtils query it with the partition path. [HUDI-1979] Optimize logic to improve code readability (#3037) Co-authored-by: wei.zhang2 <[email protected]> [HUDI-1942] Add Default value for HIVE_AUTO_CREATE_DATABASE_OPT_KEY in HoodieSparkSqlWriter (#3036) [HUDI-1931] BucketAssignFunction use ValueState instead of MapState (#3026) Co-authored-by: [email protected] <loukey_7821> [HUDI-1909] Skip Commits with empty files (#3045) [HUDI-1148] Remove Hadoop Conf Logs (#3040) [HUDI-1950] Move TestHiveMetastoreBasedLockProvider to functional (#3043) HiveTestUtil static setup mini servers caused connection refused issue in Azure CI environment, as TestHiveSyncTool and TestHiveMetastoreBasedLockProvider share the same test facilities. 
Moving TestHiveMetastoreBasedLockProvider (the easier one) to a functional test with a separate and improved mini server setup resolved the issue. Also cleaned up the dfs cluster from HiveTestUtil. The next step is to move TestHiveSyncTool to functional as well. [HUDI-1914] Add fetching latest schema to table command in hudi-cli (#2964) add BootstrapFunction to support index bootstrap (#3024) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1659] Basic Implement Of Spark Sql Support For Hoodie (#2645)
Main functions:
- Support create table for hoodie.
- Support CTAS.
- Support Insert for hoodie, including dynamic partition and static partition insert.
- Support MergeInto for hoodie.
- Support DELETE.
- Support UPDATE.
All of the above support both spark2 and spark3, based on DataSourceV1.
Main changes:
- Add sql parser for spark2.
- Add HoodieAnalysis for sql resolution and logical plan rewrite.
- Add command implementations for CREATE TABLE, INSERT, MERGE INTO & CTAS.
To push the update & insert logic down to the HoodieRecordPayload for MergeInto, some changes were made to the HoodieWriteHandler and other related classes:
1. Add the inputSchema for parsing the incoming record. This is because the inputSchema for MergeInto differs from the writeSchema, as there are some transforms in the update & insert expressions.
2. Add WRITE_SCHEMA to HoodieWriteConfig to pass the write schema for merge into.
3. Pass properties to HoodieRecordPayload#getInsertValue to pass the insert expression and table schema.
Verify this pull request
- Add TestCreateTable to test creating hoodie tables and CTAS.
- Add TestInsertTable to test inserting into hoodie tables.
- Add TestMergeIntoTable to test merging into hoodie tables.
- Add TestUpdateTable to test updating hoodie tables.
- Add TestDeleteTable to test deleting from hoodie tables.
- Add TestSqlStatement to test the currently supported ddl/dml.
[HUDI-1929] Support configure KeyGenerator by type (#2993) [HUDI-1980] Optimize the code to prevent other exceptions from causing resources not to be closed (#3038) Co-authored-by: wei.zhang2 <[email protected]> [HUDI-1892] Fix NPE when avro field value is null (#3051) [HUDI-1986] Skip creating marker files for flink merge handle (#3047) [HUDI-1987] Fix non partition table hive meta sync for flink writer (#3049) delete duplicate bootstrap function (#3052) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1992] Release the new records map for merge handle #close (#3056) [MINOR] Remove boxing (#3062) [MINOR] Add Baidu BOS storage support for hudi (#3061) Co-authored-by: zhangjun30 <[email protected]> [HUDI-1994] Release the new records iterator for append handle #close (#3058) [HUDI-1790] Added SqlSource to fetch data from any partitions for backfill use case (#2896) [MINOR] Add Tencent Cloud HDFS storage support for hudi (#3064) [HUDI-2002] Modify HiveIncrementalPuller log level to ERROR (#3070) Co-authored-by: wei.zhang2 <[email protected]> [HUDI-1984] Support independent flink hudi compaction function (#3046) [HUDI-2000] Release file writer for merge handle #close (#3068) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1991] Fixing drop dups exception in bulk insert row writer path (#3055) [HUDI-2004] Move CheckpointUtils test cases to independant class (#3072) [MINOR] Fixed the log which should only be printed when the Metadata Table is disabled. (#3080) [HUDI-1950] Fix Azure CI failure in TestParquetUtils (#2984) * fix azure pipeline configs * add pentaho.org in maven repositories * Make sure file paths with scheme in TestParquetUtils * add azure build status to README [HUDI-1999] Refresh the base file view cache for WriteProfile (#3067) Refresh the view to discover new small files. 
[HUDI-764] [HUDI-765] ORC reader writer Implementation (#2999) Co-authored-by: Qingyun (Teresa) Kang <[email protected]> [MINOR] Rename broken codecov file (#3088) - Stop polluting PRs with wrong coverage info - Retaining the file, so someone can try digging in [HUDI-2022] Release writer for append handle #close (#3087) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2014] Support flink hive sync in batch mode (#3081) [HUDI-2008] Avoid the raw type usage in some classes under hudi-utilities module (#3076) Fix the filter condition is missing in the judgment condition of compaction instance (#3025) [HUDI-2015] Fix flink operator uid to allow multiple pipelines in one job (#3091) [HUDI-2030] Add metadata cache to WriteProfile to reduce IO (#3090) Keeps same number of instant metadata cache and refresh the cache on new commits. [HUDI-1879] Fix RO Tables Returning Snapshot Result (#2925) [HUDI-2019] Set up the file system view storage config for singleton embedded server write config every time (#3102) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2032] Make keygen class and keygen type optional for FlinkStreamerConfig (#3104) * [HUDI-2032] Make keygen class and keygen type optional for FlinkStreamerConfig * Address the review suggestion [HUDI-2033] ClassCastException Throw When PreCombineField Is String Type (#3099) [HUDI-2036] Move the compaction plan scheduling out of flink writer coordinator (#3101) Since HUDI-1955 was fixed, we can move the scheduling out if the coordinator to make the coordinator more lightweight. [HUDI-2040] Make flink writer as exactly-once by default (#3106) [MINOR] Fix wrong package name (#3114) [MINOR] Fix Javadoc wrong references (#3115) [HUDI-251] Adds JDBC source support for DeltaStreamer (#2915) As discussed in RFC-14, this change implements the first phase of JDBC incremental puller. 
It consists following changes: - JdbcSource: This class extends RowSource and implements fetchNextBatch(Option<String> lastCkptStr, long sourceLimit) - SqlQueryBuilder: A simple utility class to build sql queries fluently. - Implements two modes of fetching: full and incremental. Full is a complete scan of RDBMS table. Incremental is delta since last checkpoint. Incremental mode falls back to full fetch in case of any exception. [MINOR] Remove unused module (#3116) [MINOR] Put Azure cache tasks first (#3118) [HUDI-1248] Increase timeout for deltaStreamerTestRunner in TestHoodieDeltaStreamer (#3110) [HUDI-2049] StreamWriteFunction should wait for the next inflight instant time before flushing (#3123) [HUDI-2050] Support rollback inflight compaction instances for batch flink compactor (#3124) [HUDI-1776] Support AlterCommand For Hoodie (#3086) [HUDI-2043] HoodieDefaultTimeline$filterPendingCompactionTImeline() method have wrong filter condition (#3109) [HUDI-2031] JVM occasionally crashes during compaction when spark speculative execution is enabled (#3093) * unit tests added [HUDI-2047] Ignore FileNotFoundException in WriteProfiles #getWritePathsOfInstant (#3125) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1883] Support Truncate Table For Hoodie (#3098) [HUDI-2013] Removed option to fallback to file listing when Metadata Table is enabled. (#3079) [HUDI-1717] Metadata Reader should merge all the un-synced but complete instants from the dataset timeline. 
(#3082) [HUDI-1988] FinalizeWrite() been executed twice in AbstractHoodieWriteClient$commitstats (#3050) [HUDI-2054] Remove the duplicate name for flink write pipeline (#3135) [HUDI-1826] Add ORC support in HoodieSnapshotExporter (#3130) [HUDI-2038] Support rollback inflight compaction instances for CompactionPlanOperator (#3105) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2064] Fix TestHoodieBackedMetadata#testOnlyValidPartitionsAdded (#3141) [HUDI-2061] Incorrect Schema Inference For Schema Evolved Table (#3137) [HUDI-2053] Insert Static Partition With DateType Return Incorrect Partition Value (#3133) [HUDI-2069] Fix KafkaAvroSchemaDeserializer to not rely on reflection (#3111) [HUDI-2069] KafkaAvroSchemaDeserializer should get sourceSchema passed instead using Reflection [HUDI-2062] Catch FileNotFoundException in WriteProfiles #getCommitMetadata Safely (#3138) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2068] Skip the assign state for SmallFileAssign when the state can not assign initially (#3148) Add ability to provide multi-region (global) data consistency across HMS in different regions (#2542) [global-hive-sync-tool] Add a global hive sync tool to sync hudi table across clusters. 
Add a way to rollback the replicated time stamp if we fail to sync or if we partly sync Co-authored-by: Jagmeet Bali <[email protected]> [MINOR] Removing un-used files and references (#3150) [HUDI-2060] Added tests for KafkaOffsetGen (#3136) [MINOR] Remove unused methods (#3152) [HUDI-2073] Fix the bug of hoodieClusteringJob never quit (#3157) Co-authored-by: yuezhang <[email protected]> [HUDI-2074] Use while loop instead of recursive call in MergeOnReadInputFormat#MergeIterator to avoid StackOverflow (#3159) [MINOR] Drop duplicate keygenerator class configuration setting (#3167) [HUDI-2067] Sync FlinkOptions config to FlinkStreamerConfig (#3151) [HUDI-1910] Commit Offset to Kafka after successful Hudi commit (#3092) [HUDI-2084] Resend the uncommitted write metadata when start up (#3168) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2081] Move schema util tests out from TestHiveSyncTool (#3166) [HUDI-2094] Supports hive style partitioning for flink writer (#3178) [HUDI-2097] Fix Flink unable to read commit metadata error (#3180) [HUDI-2085] Support specify compaction paralleism and compaction target io for flink batch compaction (#3169) [HUDI-2092] Fix NPE caused by FlinkStreamerConfig#writePartitionUrlEncode null value (#3176) [HUDI-2006] Adding more yaml templates to test suite (#3073) [HUDI-2103] Add rebalance before index bootstrap (#3185) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-1944] Support Hudi to read from committed offset (#3175) * [HUDI-1944] Support Hudi to read from committed offset * [HUDI-1944] Adding group option to KafkaResetOffsetStrategies * [HUDI-1944] Update Exception msg [HUDI-2052] Support load logFile in BootstrapFunction (#3134) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-89] Add configOption & refactor all configs based on that (#2833) Co-authored-by: Wenning Ding <[email protected]> [MINOR] Update .asf.yaml to codify notification settings, turn on jira comments, gh discussions (#3164) - Turn on comment for jira, so we can track PR 
activity better - Create a notification settings that match https://gitbox.apache.org/schemes.cgi?hudi - Try and turn on "discussions" on Github, to experiment [MINOR] Fix broken build due to FlinkOptions (#3198) [HUDI-2088] Missing Partition Fields And PreCombineField In Hoodie Properties For Table Written By Flink (#3171) [MINOR] Add Documentation to KEYGENERATOR_TYPE_PROP (#3196) [HUDI-2105] Compaction Failed For MergeInto MOR Table (#3190) [HUDI-2051] Enable Hive Sync When Spark Enable Hive Meta For Spark Sql (#3126) [HUDI-2112] Support reading pure logs file group for flink batch reader after compaction (#3202) [HUDI-2114] Spark Query MOR Table Written By Flink Return Incorrect Timestamp Value (#3208) [HUDI-2121] Add operator uid for flink stateful operators (#3212) [HUDI-2123] Exception When Merge With Null-Value Field (#3214) [HUDI-2124] A Grafana dashboard for HUDI. (#3216) [HUDI-2057] CTAS Generate An External Table When Create Managed Table (#3146) [HUDI-1930] Bootstrap support configure KeyGenerator by type (#3170) * [HUDI-1930] Bootstrap support configure KeyGenerator by type [HUDI-2116] Support batch synchronization of partition datas to hive metastore to avoid oom problem (#3209) [HUDI-2126] The coordinator send events to write function when there are no data for the checkpoint (#3219) [HUDI-2127] Initialize the maxMemorySizeInBytes in log scanner (#3220) [HUDI-2058]support incremental query for insert_overwrite_table/insert_overwrite operation on cow table (#3139) [HUDI-2129] StreamerUtil.medianInstantTime should return a valid date time string (#3221) [HUDI-2131] Exception Throw Out When MergeInto With Decimal Type Field (#3224) [HUDI-2122] Improvement in packaging insert into smallfiles (#3213) [HUDI-2132] Make coordinator events as POJO for efficient serialization (#3223) [HUDI-2106] Fix flink batch compaction bug while user don't set compaction tasks (#3192) [HUDI-2133] Support hive1 metadata sync for flink writer (#3225) [HUDI-2089]fix the bug 
that metatable cannot support non_partition table (#3182) [HUDI-2028] Implement RockDbBasedMap as an alternate to DiskBasedMap in ExternalSpillableMap (#3194) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2135] Add compaction schedule option for flink (#3226) [HUDI-2055] Added deltastreamer metric for time of lastSync (#3129) [HUDI-2046] Loaded too many classes like sun/reflect/GeneratedSerializationConstructorAccessor in JVM metaspace (#3121) Loaded too many classes when use kryo of spark to hudi Co-authored-by: weiwei.duan <[email protected]> [HUDI-1996] Adding functionality to allow the providing of basic auth creds for confluent cloud schema registry (#3097) * adding support for basic auth with confluent cloud schema registry [HUDI-2093] Fix empty avro schema path caused by duplicate parameters (#3177) * [HUDI-2093] Fix empty avro schema path caused by duplicate parameters * rename shcmea option key * fix doc * rename var name [HUDI-2113] Fix integration testing failure caused by sql results out of order (#3204) [HUDI-2016] Fixed bootstrap of Metadata Table when some actions are in progress. (#3083) Metadata Table cannot be bootstrapped when any action is in progress. This is detected by the presence of inflight or requested instants. The bootstrapping is initiated in preWrite and postWrite of each commit. So bootstrapping will be retried again until it succeeds. Also added metrics for when the bootstrapping fails or a table is re-bootstrapped. This will help detect tables which are not getting bootstrapped. [HUDI-2140] Fixed the unit test TestHoodieBackedMetadata.testOnlyValidPartitionsAdded. 
(#3234) [HUDI-2115] FileSlices in the filegroup is not descending by timestamp (#3206) [HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (#3149) [HUDI-2069] Refactored String constants (#3172) [HUDI-1105] Adding dedup support for Bulk Insert w/ Rows (#2206) [HUDI-2134]Add generics to avoif forced conversion in BaseSparkCommitActionExecutor#partition (#3232) [HUDI-2009] Fixing extra commit metadata in row writer path (#3075) [HUDI-2099]hive lock which state is WATING should be released, otherwise this hive lock will be locked forever (#3186) [MINOR] Fix build broken from #3186 (#3245) [HUDI-2136] Fix conflict when flink-sql-connector-hive and hudi-flink-bundle are both in flink lib (#3227) [HUDI-2087] Support Append only in Flink stream (#3174) Co-authored-by: 喻兆靖 <[email protected]> Revert "[HUDI-2087] Support Append only in Flink stream (#3174)" (#3251) This reverts commit 371526789d663dee85041eb31c27c52c81ef87ef. [HUDI-2147] Remove unused class AvroConvertor in hudi-flink (#3243) [MINOR] Fix some wrong assert reasons (#3248) [HUDI-2087] Support Append only in Flink stream (#3252) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2143] Tweak the default compaction target IO to 500GB when flink async compaction is off (#3238) [HUDI-2142] Support setting bucket assign parallelism for flink write task (#3239) [HUDI-1483] Support async clustering for deltastreamer and Spark streaming (#3142) - Integrate async clustering service with HoodieDeltaStreamer and HoodieStreamingSink - Added methods in HoodieAsyncService to reuse code [HUDI-2045] Support Read Hoodie As DataSource Table For Flink And DeltaStreamer [HUDI-2107] Support Read Log Only MOR Table For Spark (#3193) [HUDI-2144]Bug-Fix:Offline clustering(HoodieClusteringJob) will cause insert action losing data (#3240) * fixed * add testUpsertPartitionerWithSmallFileHandlingAndClusteringPlan ut * fix CheckStyle Co-authored-by: yuezhang <[email protected]> [MINOR] Fix 
EXTERNAL_RECORD_AND_SCHEMA_TRANSFORMATION config (#3250) [HUDI-2171] Add parallelism conf for bootstrap operator [HUDI-2168] Fix for AccessControlException for anonymous user (#3264) [HUDI-1969] Support reading logs for MOR Hive rt table (#3033) [HUDI-2165] Support Transformer for HoodieFlinkStreamer (#3270) * [HUDI-2165] Support Transformer for HoodieFlinkStreamer [HUDI-2180] Fix Compile Error For Spark3 (#3274) [HUDI-1828] Update unit tests to support ORC as the base file format (#3237) [MINOR] Correct the logs of enable/not-enable async cleaner service. (#3271) Co-authored-by: yuezhang <[email protected]> [HUDI-2149] Ensure and Audit docs for every configuration class in the codebase (#3272) - Added docs when missing - Rewrote, reworded as needed - Made couple more classes extend HoodieConfig [HUDI-2029] Implement compression for DiskBasedMap in Spillable Map (#3128) [HUDI-2153] Fix BucketAssignFunction Context NullPointerException [MINOR] Refactor hive sync tool to reduce duplicate code (#3276) * [MINOR] Refactor hive sync tool to reduce duplicate code [MINOR] Allow users to choose ORC as base file format in Spark SQL (#3279) [HUDI-1633] Make callback return HoodieWriteStat (#2445) * CALLBACK add partitionPath * callback can send hoodieWriteStat * add ApiMaturityLevel [HUDI-2185] Remove the default parallelism of index bootstrap and bucket assigner Revert "[HUDI-2087] Support Append only in Flink stream (#3252)" This reverts commit 783c9cb3 [HUDI-1447] DeltaStreamer kafka source supports consuming from specified timestamp (#2438) [HUDI-1884] MergeInto Support Partial Update For COW (#3154) [HUDI-2193] Remove state in BootstrapFunction [HUDI-2161] Adding support to disable meta columns with bulk insert operation (#3247) [HUDI-1860] Add INSERT_OVERWRITE and INSERT_OVERWRITE_TABLE support to DeltaStreamer (#3184) [HUDI-2145] Create new bucket when NewFileAssignState filled (#3258) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2198] Clean and reset the bootstrap 
events for coordinator when task failover (#3304) [HUDI-2007] Fixing hudi_test_suite for spark nodes and adding spark bulk_insert node (#3074) [MINOR] Disable codecov (#3314) [HUDI-2192] Clean up Multiple versions of scala libraries detected Warning (#3292) [HUDI-2204] Add marker files for flink writer (#3316) [HUDI-2195] Sync Hive Failed When Execute CTAS In Spark2 And Spark3 (#3299) [HUDI-2206] Fix checkpoint blocked because getLastPendingInstant() action after than restoreWriteMetadata() action (#3326) [HUDI-2205] Rollback inflight compaction for flink writer (#3320) [HUDI-2139] MergeInto MOR Table May Result InCorrect Result (#3230) [HUDI-2211] Fix NullPointerException in TestHoodieConsoleMetrics (#3331) [HUDI-2212] Missing PrimaryKey In Hoodie Properties For CTAS Table (#3332) [HUDI-2213] Remove unnecessary parameter for HoodieMetrics constructor and fix NPE in UT (#3333) [HUDI-1848] Adding support for HMS for running DDL queries in hive-sy… (#2879) * [HUDI-1848] Adding support for HMS for running DDL queries in hive-sync-tool * [HUDI-1848] Fixing test cases * [HUDI-1848] CR changes * [HUDI-1848] Fix checkstyle violations * [HUDI-1848] Fixed a bug when metastore api fails for complex schemas with multiple levels. 
* [HUDI-1848] Adding the complex schema and resolving merge conflicts * [HUDI-1848] Adding some more javadocs * [HUDI-1848] Added javadocs for DDLExecutor impls * [HUDI-1848] Fixed style issue [MINOR] Replace deprecated method isDir with isDirectory (#3319) [HUDI-1241] Automate the generation of configs webpage as configs are added to Hudi repo (#3302) [HUDI-2216] Correct the words fiels in the comments to fields (#3339) [MINOR] Close log scanner after compaction completed (#3294) [HUDI-2214]residual temporary files after clustering are not cleaned up (#3335) [HUDI-2176, 2178, 2179] Adding virtual key support to COW table (#3306) [MINOR] Correct the words accroding in the comments to according (#3343) Correct the words 'accroding' in the comments to 'according' [HUDI-2209] Bulk insert for flink writer (#3334) [HUDI-2219] Fix NPE of HoodieConfig (#3342) [HUDI-2217] Fix no value present in incremental query on MOR (#3340) [HUDI-2223] Fix Alter Partitioned Table Failed (#3350) [HUDI-2227] Only sync hive meta on successful commit for flink batch writer (#3351) [HUDI-2215] Add rateLimiter when Flink writes to hudi. 
(#3338) Co-authored-by: wangminchao <[email protected]> [HUDI-2044] Integrate consumers with rocksDB and compression within External Spillable Map (#3318) [HUDI-2230] Make codahale times transient to avoid serializable exceptions (#3345) [HUDI-2245] BucketAssigner generates the fileId evenly to avoid data skew (#3362) [HUDI-2244] Fix database alreadyExists exception while hive sync (#3361) [HUDI-2228] Add option 'hive_sync.mode' for flink writer (#3352) [HUDI-2241] Explicit parallelism for flink bulk insert (#3357) [HUDI-1425] Performance loss with the additional hoodieRecords.isEmpty() in HoodieSparkSqlWriter#write (#2296) [MINOR] fix check style error (#3365) [HUDI-2117] Unpersist the input rdd after the commit is completed to … (#3207) Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2251] Fix Exception Cause By Table Name Case Sensitivity For Append Mode Write (#3367) [HUDI-2253] Refactoring few tests to reduce runningtime. DeltaStreamer and MultiDeltaStreamer tests. Bulk insert row writer tests (#3371) Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2252] Default consumes from the latest instant for flink streaming reader (#3368) [HUDI-2254] Builtin sort operator for flink bulk insert (#3372) [HUDI-2184] Support setting hive sync partition extractor class based on flink configuration (#3284) [HUDI-2218] Fix missing HoodieWriteStat in HoodieCreateHandle (#3341) [HUDI-2164] Let users build cluster plan and execute this plan at once using HoodieClusteringJob for async clustering (#3259) * add --mode schedule/execute/scheduleandexecute * fix checkstyle * add UT testHoodieAsyncClusteringJobWithScheduleAndExecute * log changed * try to make ut success * try to fix ut * modify ut * review changed * code review * code review * code review * code review Co-authored-by: yuezhang <[email protected]> [HUDI-2177][HUDI-2200] Adding virtual keys support for MOR table (#3315) [MINOR] Improving runtime of TestStructuredStreaming by 2 mins (#3382) 
[HUDI-2225] Add a compaction job in hudi-examples (#3347) [HUDI-2269] Release the disk map resource for flink streaming reader (#3384) [HUDI-2072] Add pre-commit validator framework (#3153) * [HUDI-2072] Add pre-commit validator framework * trigger Travis rebuild [HUDI-2272] Pass base file format to sync clients (#3397) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-1371] [HUDI-1893] Support metadata based listing for Spark DataSource and Spark SQL (#2893) [HUDI-2255] Refactor Datasource options (#3373) Co-authored-by: Wenning Ding <[email protected]> [HUDI-2090] Ensure Disk Maps create a subfolder with appropriate prefixes and cleans them up on close (#3329) * Add UUID to the folder name for External Spillable File System * Fix to ensure that Disk maps folders do not interefere across users * Fix test * Fix test * Rebase with latest mater and address comments * Add Shutdown Hooks for the Disk Map Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2258] Metadata table for flink (#3381) [HUDI-2087] Support Append only in Flink stream (#3390) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2232] [SQL] MERGE INTO fails with table having nested struct (#3379) [HUDI-2273] Migrating some long running tests to functional test profile (#3398) [HUDI-2233] Use HMS To Sync Hive Meta For Spark Sql (#3387) [HUDI-2274] Allows INSERT duplicates for Flink MOR table (#3403) [HUDI-2278] Use INT64 timestamp with precision 3 for flink parquet writer (#3414) [HUDI-2182] Support Compaction Command For Spark Sql (#3277) [MINOR] fix compile error in compaction command (#3421) [HUDI-1468] Support custom clustering strategies and preserve commit metadata as part of clustering (#3419) Co-authored-by: Satish Kotha <[email protected]> [HUDI-1842] Spark Sql Support For pre-existing Hoodie Table (#3393) [HUDI-2243] Support Time Travel Query For Hoodie Table (#3360) [HUDI-2247] Filter file where length less than parquet MAGIC length (#3363) Co-authored-by: 喻兆靖 <[email protected]> 
[HUDI-2208] Support Bulk Insert For Spark Sql (#3328) [HUDI-2194] Skip the latest N partitions when choosing partitions to create ClusteringPlan (#3300) * skip from latest partitions based on hoodie.clustering.plan.strategy.daybased.skipfromlatest.partitions && 0(default means skip nothing) * change config verison * add ut Co-authored-by: yuezhang <[email protected]> [HUDI-1771] Propagate CDC format for hoodie (#3285) [HUDI-2288] Support storage on ks3 for hudi (#3434) Co-authored-by: xuzifu <xuzifu.com> [MINOR] Fix travis from errors (#3432) [HUDI-1129] Improving schema evolution support in hudi (#2927) * Adding support to ingest records with old schema after table's schema is evolved * Rebasing against latest master - Trimming test file to be < 800 lines - Renaming config names * Addressing feedback Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Delete useless com.uber.hoodie.hadoop.hive.HoodieCombineHiveInputFormat (#3298) [MINOR] Fix contribution link in PULL_REQUEST_TEMPLATE (#3425) [HUDI-2042] Compare the field object directly in OverwriteWithLatestAvroPayload (#3108) [HUDI-2170] [HUDI-1763] Always choose the latest record for HoodieRecordPayload (#3401) [HUDI-1939] remove joda time in hivesync module (#3430) [HUDI-2292] MOR should not predicate pushdown when reading with payload_combine type (#3443) [HUDI-1774] Adding support for delete_partitions to spark data source (#3437) [HUDI-2286] Handle the case of failed deltacommit on the metadata table. (#3428) A failed deltacommit on the metadata table will be automatically rolled back. Assuming the failed commit was "t10", the rollback will happen the next time at "t11". Post rollback, when we try to sync the dataset to the metadata table, we should look for all unsynched instants including t11. Current code ignores t11 since the latest commit timestamp on metadata table is t11 (due to rollback). 
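The t10/t11 scenario described for HUDI-2286 can be illustrated with a small, self-contained sketch. It is not the code from the patch: `TimelineEntry`, `UnsyncedInstantsSketch`, and both high-water-mark helpers are hypothetical names, instant times are compared as plain strings, and the dataset timeline (t09, t10, t11) is an assumed example. It only shows why taking the newest timestamp of any action on the metadata timeline hides t11 once a rollback has been written at t11, and how counting deltacommits only would avoid that.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical, simplified timeline entry (not a Hudi class). */
final class TimelineEntry {
  final String action;    // e.g. "deltacommit" or "rollback"
  final String timestamp; // instant time, compared as a plain string in this sketch

  TimelineEntry(String action, String timestamp) {
    this.action = action;
    this.timestamp = timestamp;
  }
}

final class UnsyncedInstantsSketch {
  /** High-water mark taken from the newest instant of any action (the problematic variant). */
  static String latestOfAnyAction(List<TimelineEntry> metadataTimeline) {
    return metadataTimeline.stream()
        .map(e -> e.timestamp)
        .max(String::compareTo)
        .orElse("");
  }

  /** High-water mark taken from deltacommits only, so a rollback does not advance it. */
  static String latestDeltaCommit(List<TimelineEntry> metadataTimeline) {
    return metadataTimeline.stream()
        .filter(e -> e.action.equals("deltacommit"))
        .map(e -> e.timestamp)
        .max(String::compareTo)
        .orElse("");
  }

  /** Dataset instants strictly newer than the high-water mark, in ascending order. */
  static List<String> unsyncedAfter(List<String> datasetInstants, String highWaterMark) {
    return datasetInstants.stream()
        .filter(ts -> ts.compareTo(highWaterMark) > 0)
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Assumed state: deltacommit t10 failed and was rolled back by a rollback instant at t11.
    List<TimelineEntry> metadataTimeline = List.of(
        new TimelineEntry("deltacommit", "t09"),
        new TimelineEntry("rollback", "t11"));
    List<String> datasetInstants = List.of("t09", "t10", "t11");

    // The rollback at t11 masks t10 and t11: prints []
    System.out.println(unsyncedAfter(datasetInstants, latestOfAnyAction(metadataTimeline)));
    // Counting deltacommits only keeps them visible: prints [t10, t11]
    System.out.println(unsyncedAfter(datasetInstants, latestDeltaCommit(metadataTimeline)));
  }
}
```

Filtering by action is only one way to keep the rollback instant from advancing the sync point; the actual change in #3428 may take a different approach.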
[HUDI-2298] The HoodieMergedLogRecordScanner should set up the operation of the chosen record (#3456) [HUDI-1138] Add timeline-server-based marker file strategy for improving marker-related latency (#3233) - Can be enabled for cloud stores like S3. Not supported for hdfs yet, due to partial write failures. [HUDI-1518] Remove the logic that delete replaced file when archive (#3310) * remove delete replaced file when archive * done * remove unused import * remove delete replaced files when archive related UT * code reviewed Co-authored-by: yuezhang <[email protected]> [HUDI-2017] Add API to set a metric in the registry. (#3084) The Registry.add() API adds the new value to the existing metric value. For some use cases we need an API to set/replace the existing value. The Metadata Table is synced in the preWrite() and postWrite() functions of a commit. As part of the sync, the current sizes and basefile/logfile counts are published as metrics. If we use the Registry.add() API, the counts and sizes are incorrectly published as the sum of the two values. This is corrected by using the Registry.set() API instead. [MINOR] Correct TestKafkaSource class and comment (#3451) MINOR (#3459) Move hoodie DeltaStreamer to hudi-utilities [HUDI-2294] Adding virtual keys support to deltastreamer (#3450) [HUDI-1292] Created a config to enable/disable syncing of metadata table. (#3427) * [HUDI-1292] Created a config to enable/disable syncing of metadata table. - Metadata Table should only be synced from a single pipeline to prevent conflicts.
- Skip syncing metadata table for clustering and compaction - Renamed useFileListingMetadata Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Deprecate older configs (#3464) Rename and deprecate props in HoodieWriteConfig Rename and deprecate older props [HUDI-2279]Support column name matching for insert * and update set * in merge into (#3415) [MINOR] Tweak change log more as FULL for flink streaming source (#3466) MINOR fix method use error (#3467) [HUDI-1363] Provide option to drop partition columns (#3465) - Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2151] Flipping defaults (#3452) [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. (#3210) * [HUDI-2119] Ensure the rolled-back instance was previously synced to the Metadata Table when syncing a Rollback Instant. If the rolled-back instant was synced to the Metadata Table, a corresponding deltacommit with the same timestamp should have been created on the Metadata Table timeline. To ensure we can always perform this check, the Metadata Table instants should not be archived until their corresponding instants are present in the dataset timeline. But ensuring this requires a large number of instants to be kept on the metadata table. In this change, the metadata table will keep at least as many instants as the main dataset is keeping. If the instant being rolled back is older than the metadata table timeline, the code will throw an exception and the metadata table will have to be re-bootstrapped. This should be a very rare occurrence and should happen only when the dataset is being repaired by rolling back multiple commits or restoring to a much older time. * Fixed checkstyle * Improvements from review comments. Fixed checkstyle Replaced explicit null check with Option.ofNullable Removed redundant function getSynedInstantTime * Renamed getSyncedInstantTime and getSyncedInstantTimeForReader.
Sync is confusing so renamed to getUpdateTime() and getReaderTime(). * Removed getReaderTime which is only for testing as the same method can be accessed during testing differently without making it part of the public interface. * Fix compilation error * Reverting changes to HoodieMetadataFileSystemView Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2307] When using delete_partition with ds should not rely on the primary key (#3469) - Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2305] Add MARKERS.type and fix marker-based rollback (#3472) - Rollback infers the directory structure and does rollback based on the strategy used while markers were written. "write markers type" in write config is used to determine marker strategy only for new writes. [HUDI-1897] Deltastreamer source for AWS S3 (#3433) - Added two sources for two stage pipeline. a. S3EventsSource that fetches events from SQS and ingests to a meta hoodie table. b. S3EventsHoodieIncrSource reads S3 events from this meta hoodie table, fetches actual objects from S3 and ingests to sink hoodie table. - Added selectors to assist in S3EventsSource. Co-authored-by: Satish M <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Adding back all old default val members to DataSourceOptions (#3474) - Added @Deprecated - Added @deprecated javadoc to keys and defaults suggested how to migrate - Moved all deprecated members to bottom to improve readability [HUDI-2268] Add upgrade and downgrade to and from 0.9.0 (#3470) - Added upgrade and downgrade step to and from 0.9.0. Upgrade adds few table properties. Downgrade recreates timeline server based marker files if any. Moving to 0.10.0-SNAPSHOT on master branch. 
[HOT-FIX] Add apache license to spark_command.txt.template (#3477) [MINOR] Fix SelectPackages in HoodieSparkFunctionalTestSuite (#3476) [HUDI-2191] Bump flink version to 1.13.1 (#3291) [HUDI-2301] fix FileSliceMetrics utils bug (#3487) HUDI-1674 (#3488) [HUDI-2167] HoodieCompactionConfig get HoodieCleaningPolicy NullPointerException close apache/hudi#3402 [HUDI-2316] Support Flink batch upsert (#3494) [HUDI-1363] Include _hoodie_operation meta column in removeMetadataFields (#3501) [MINOR] Fixing release validation script (#3493) [MINOR] Some cosmetic changes for Flink (#3503) [HUDI-2322] Use correct meta columns while preparing dataset for bulk insert (#3504) Restore 0.8.0 config keys with deprecated annotation (#3506) Co-authored-by: Sagar Sumit <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2339] Create Table If Not Exists Failed After Alter Table (#3510) Keep non-conflicting names for common configs between DataSourceOptions and HoodieWriteConfig (#3511) [HUDI-2340] Merge the data set for flink bounded source when changelog mode turns off (#3513) [HUDI-2342] Optimize Bootstrap operator (#3516) Co-authored-by: 喻兆靖 <[email protected]> Support referencing subquery with column aliases by table alias in merge into (#3380) [MINOR] Fix BatchBootstrapOperator initialization (#3520) [HUDI-2345] Hoodie columns sort partitioner for bulk insert (#3523) Co-authored-by: yuezhang <[email protected]> [HUDI-2349] Adding spark delete node to integ test suite (#3528) [HUDI-2262] reduce build warnings (#3481) [HUDI-2352] The upgrade downgrade action of flink writer should be singleton (#3531) [MINOR] Update DOAP with 0.9.0 Release (#3537) [HUDI-2359] Add basic "hoodie_is_deleted" unit tests to TestDataSource classes [HUDI-2366] fix too many logs (#3543) [HUDI-2357] MERGE INTO doesn't work for tables created using CTAS (#3534) [HUDI-2368] Catch Throwable in BoundedInMemoryExecutor (#3546) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2321] Use the 
caller classloader for ReflectionUtils (#3535) Based on the discussion on stackoverflow: https://stackoverflow.com/questions/1771679/difference-between-threads-context-class-loader-and-normal-classloader The Thread.currentThread().getContextClassLoader() should never be used because the context classloader is not immutable, user can overwrite it when thread switches, it is also nullable. The objection here: https://stackoverflow.com/a/36228195 says the Thread.currentThread().getContextClassLoader() is a JDK design error and the context classloader is never suggested to be used. The API that needs classloader should ask the user to set up the right classloader. [HUDI-2264] Refactor HoodieSparkSqlWriterSuite to add setup and teardown (#3544) [HUDI-2229] Refact HoodieFlinkStreamer to reuse the pipeline of HoodieTableSink (#3495) Co-authored-by: mikewu <[email protected]> [HUDI-2365]Optimizing overwriteField method with Objects.equals (#3542) Optimizing overwriteField method with Objects.equals [HUDI-2371] Improvement flink streaming reader (#3552) - Support reading empty table - Fix filtering by partition path - Support reading from earliest commit [HUDI-2320] Add support ByteArrayDeserializer in AvroKafkaSource (#3502) [HUDI-2378] Add configs for common and pre validate (#3564) Co-authored-by: Rajesh Mahindra <[email protected]> [HUDI-2379] Include the pending compaction file groups for flink (#3567) streaming reader [HUDI-2280] Use GitHub Actions to build different scala spark versions (#3556) [HUDI-2384] Change log file size config to long (#3577) [HUDI-2376] Add pipeline for Append mode (#3573) Co-authored-by: 喻兆靖 <[email protected]> [HUDI-2392] Do not send partition delete record when changelog mode enabled (#3586) [MINOR] Skip checkstyle and rat in Azure (#3593) - make tests run through without being blocked by style issues - let GitHub Actions tasks give quick feedback on build, style and other checks [HUDI-1989] Fix flakiness in TestHoodieMergeOnReadTable 
(#3574) * [HUDI-1989] Refactor clustering tests for MoR table * refactor assertion helper * add CheckedFunction * SparkClientFunctionalTestHarness.java * put back original test case * move testcases out from TestHoodieMergeOnReadTable.java * add TestHoodieSparkMergeOnReadTableRollback.java * use SparkClientFunctionalTestHarness * add tag [HUDI-1989] Disable HDFSParquetImporter related tests (#3597) Also mark HDFSParquetImportCommand and HDFSParquetImporter as deprecated. [HUDI-2380] The default archive folder should be 'archived' (#3568) [HUDI-2399] Rebalance CI jobs for shorter wait time (#3604) [MINOR] Fixing some functional tests by moving to right packages (#3596) [HUDI-2079] Make CLI command tests functional (#3601) Make all tests in org.apache.hudi.cli.commands extend org.apache.hudi.cli.functional.CLIFunctionalTestHarness and tag as "functional". This also resolves a blocker where DFS init consistently failed when moving to ubuntu 18.04 MINOR_CHECKSTYLE (#3616) Fix checkstyle [HUDI-2080] Move to ubuntu-18.04 for Azure CI (#3409) Update Azure CI ubuntu from 16.04 to 18.04 due to 16.04 will be removed soon Fixed some consistently failed tests * fix TestCOWDataSourceStorage TestMORDataSourceStorage * reset mocks Also update readme badge Co-authored-by: Raymond Xu <[email protected]> [HUDI-2401] Load archived instants for flink streaming reader (#3610) [MINOR] Remove commenting from Github, JIRA bridge (#3620) [HUDI-2403] Add metadata table listing for flink query source (#3618) Add the document to the PUSHGATEWAY configuration item (#3627) [MINOR] Remove unused variables (#3631) [HUDI-2408] Deprecate FunctionalTestHarness to avoid init DFS (#3628) [HUDI-2351] Extract common FS and IO utils for marker mechanism (#3529) [MINOR] Correct the comment for the parallelism of tasks in FlinkOptions (#3634) [HUDI-2411] Remove unnecessary method overriden and note (#3636) [HUDI-2393] Add yamls for large scale testing (#3594) [MINOR] Add avro schema evolution test with 
(non)nullable column and with(out) default value (#3639) [HUDI-2394] Implement Kafka Sink Protocol for Hudi for Ingesting Immutable Data (#3592) - Fixing packaging, naming of classes - Use of log4j over slf4j for uniformity - More follow-on fixes - Added a version to control/coordinator events. - Eliminated the config added to write config - Fixed fetching of checkpoints based on table type - Clean up of naming, code placement Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [HUDI-2354] Fix TimelineServer error because of replacecommit archive (#3536) * bug fixed * done * done * travis fix * code reviewed * code review * done * code reviewed Co-authored-by: yuezhang <[email protected]> [HUDI-2412] Add timestamp based partitioning for flink writer (#3638) [MINOR] fix typo (#3640) [MINOR] Fix typo, 'requried' corrected to 'required' (#3643) [HUDI-2415] Add more info log for flink streaming reader (#3642) [HUDI-2398] Collect event time for inserts in DefaultHoodieRecordPayload (#3602) [MINOR] Fix the default parallelism of write task (#3649) [HUDI-2397] Add `--enable-sync` parameter (#3608) * add meta-sync config * update test * keep enableMetaSync same with enableHiveSync * Switch check logic to use `enableMetaSync` [HUDI-2410] Fix getDefaultBootstrapIndexClass logical error (#3633) [HUDI-2421] Catch the throwable when scheduling the cleaning task for flink writer (#3650) [HUDI-2425] TestHoodieMultiTableDeltaStreamer CI failed due to exception (#3654) [HUDI-2388] Add DAG nodes for Spark SQL in integration test suite (#3583) - Fixed validation in integ test suite for both deltastreamer and write client path. 
Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2428] Fix protocol and other issues after stress testing Hudi Kafka Connect (#3656) * Fixes based on tests and some improvements * Fix the issues after running stress tests * Fixing checkstyle issues and updating README Co-authored-by: Rajesh Mahindra <[email protected]> Co-authored-by: Vinoth Chandar <[email protected]> [MINOR] Update Kafka connect sink readme [HUDI-2430] Make decimal compatible with hudi for flink writer (#3658) [MINOR] Add document for DataSourceReadOptions (#3653) [MINOR] Delete Redundant code (#3661) [HUDI-2433] Refactor rollback actions in hudi-client module (#3664) [HUDI-2355][Bug]Archive service executed after cleaner finished. (#3545) Co-authored-by: yuezhang <[email protected]> [HUDI-2423] Separate some config logic from HoodieMetricsConfig into HoodieMetricsGraphiteConfig HoodieMetricsJmxConfig (#3652) [HUDI-2404] Add metrics-jmx to spark and flink bundles (#3632) [HUDI-2422] Adding rollback plan and rollback requested instant (#3651) - This patch introduces rollback plan and rollback.requested instant. Rollback will be done in two phases, namely rollback plan and rollback action. In planning, we prepare the rollback plan and serialize it to rollback.requested. In the rollback action phase, we fetch details from the plan and just delete the files as per the plan. This will ensure final rollback commit metadata will contain all files that got rolled back even if rollback failed midway and retried again. 
[HUDI-2330][HUDI-2335] Adding support for merge-on-read tables (#3679) - Inserts go into logs, hashed by Kafka and Hudi partitions - Fixed issues with the setupKafka script - Bumped up the default commit interval to 300 seconds - Minor renaming [MINOR] Fix typo,'compatiblity' corrected to 'compatibility' (#3675) [HUDI-2434] Make periodSeconds of GraphiteReporter configurable (#3667) [HUDI-2447] Extract common business logic & Fix typo (#3683) [HUDI-2267] Update docs and infra test configs, add support for graphite (#3482) Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2449] Incremental read for Flink (#3686) [HUDI-2444] Fixing delete files corner cases wrt cleaning and rollback when applying changes to metadata (#3678) [HUDI-2343]Fix the exception for mergeInto when the primaryKey and preCombineField of source table and target table differ in case only (#3517) [MINOR] Fix typo."funcitons" corrected to "functions" (#3681) [MINOR] Cosmetic changes for flink (#3701) [HUDI-2479] HoodieFileIndex throws NPE for FileSlice with pure log files (#3702) [HUDI-2395] Metadata tests rewrite (#3695) - Added commit metadata infra to test table so that we can test entire metadata using test table itself. These tests don't care about the contents of files as such and hence we should be able to test all code paths for metadata using test table. 
Co-authored-by: Sivabalan Narayanan <[email protected]> [HUDI-2383] Clean the marker files after compaction (#3576) [HUDI-2248] Fixing the closing of hms client (#3364) * [HUDI-2248] Fixing the closing of hms client * [HUDI-2248] Using Hive.closeCurrent() over client.close() [HUDI-2385] Make parquet dictionary encoding configurable (#3578) Co-authored-by: leesf <[email protected]> [HUDI-2483] Infer changelog mode for flink compactor (#3706) [HUDI-2485] Consume as mini-batch for flink stream reader (#3710) [HUDI-2484] Fix hive sync mode setting in Deltastreamer (#3712) [HUDI-2451] On windows client with hdfs server for wrong file separator (#3687) Co-authored-by: yao.zhou <[email protected]> [MINOR] fix typo,'SPAKR' corrected to 'SPARK' (#3721) [MINOR] Fix typo,'Kakfa' corrected to 'Kafka' & 'parquest' corrected to 'parquet' (#3717) [HUDI-2487] Fix JsonKafkaSource cannot filter empty messages from kafka (#3715) [HUDI-2474] Refreshing timeline for every operation in Hudi when metadata is enabled (#3698) [MINOR] Add a RFC template and folder (#3726) [HUDI-2277] HoodieDeltaStreamer reading ORC files directly using ORCDFSSource (#3413) * add ORCDFSSource to support reading orc file into hudi format && add UTs * remove ununsed import * simplify tes * code review * code review * code review * code review * code review * code review Co-authored-by: yuezhang <[email protected]> [MINOR] Fix typo Hooodie corrected to Hoodie & reuqired corrected to required (#3730) [MINOR] Support JuiceFileSystem (#3729) [HUDI-2440] Add dependency change diff script for dependency governace (#3674) [HUDI-2499] Making jdbc-url, user and pass as non-required field for other sync modes (#3732) [HUDI-2497] Refactor clean and restore actions in hudi-client module (#3734) [HUDI-2285][HUDI-2476] Metadata table synchronous design. Rebased and Squashed from pull/3426 (#3590) * [HUDI-2285] Adding Synchronous updates to metadata before completion of commits in data timelime. 
- This patch adds synchronous updates to metadata table. In other words, every write is first committed to metadata table followed by data table. While reading metadata table, we ignore any delta commits that are present only in metadata table and not in data table timeline. - Compaction of metadata table is fenced by the condition that we trigger compaction only when there are no inflight requests in datatable. This ensures that all base files in metadata table is always in sync with data table(w/o any holes) and only there could be some extra invalid commits among delta log files in metadata table. - Due to this, archival of data table also fences itself up until compacted instant in metadata table. All writes to metadata table happens within the datatable lock. So, metadata table works in one writer mode only. This might be tough to loosen since all writers write to same FILES partition and so, will result in a conflict anyways. - As part of this, have added acquiring locks in data table for those operations which were not before while committing (rollback, clean, compaction, cluster). To note, we were not doing any conflict resolution. All we are doing here is to commit by taking a lock. So that all writes to metadata table is always a single writer. - Also added building block to add buckets for partitions, which will be leveraged by other indexes like record level index, etc. For now, FILES partition has only one bucket. In general, any number of buckets per partition is allowed and each partition has a fixed fileId prefix with incremental suffix for each bucket within each partition. Have fixed [HUDI-2476]. This fix is about retrying a failed compaction if it succeeded in metadata for first time, but failed w/ data table. - Enabling metadata table by default. 
- Adding more tests for metadata table Co-authored-by: Prashant Wason <[email protected]> [HUDI-2456] support 'show partitions' sql (#3693) [HUDI-2513] Refactor table upgrade and downgrade actions in hudi-client module (#3743) [MINOR] Fix typo,'properites' corrected to 'properties' (#3738) [HUDI-2530] Adding async compaction support to integ test suite framework (#3750) [HUDI-2534] Remove the sort operation when bulk_insert in batch mode (#3772) [HUDI-2537] Fix metadata table for flink (#3774) [HUDI-2496] Insert duplicate records when precombined is deactivated for "insert" operation (#3740) [HUDI-2542] AppendWriteFunction throws NPE when checkpointing without written data (#3777) [HUDI-2540] Fixed wrong validation for metadataTableEnabled in HoodieTable (#3781) [MINOR] Fix typo,'paritition' corrected to 'partition' (#3764) [HUDI-2532] Metadata table compaction trigger max delta commits (#3784) - Setting the max delta commits default value from 24 to 10 to trigger the compaction in metadata table. [HUDI-2494] Fixing glob pattern to skip all hoodie meta paths (#3768) [HUDI-2435][BUG]Fix clustering handle errors (#3666) * done * remove unused imports * code reviewed * code reviewed Co-authored-by: yuezhang <[email protected]> Reviewers: #ldap_hudi, jsbali, balajee Reviewed By: #ldap_hud…

nsivabalan mentioned this pull request Jun 24, 2021

[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows #2049

Closed

4 tasks

nsivabalan force-pushed the bulk_insert_simplified_prep_internal_custom_ds branch from b12f427 to bd1ddae Compare June 25, 2021 17:10

nsivabalan added the priority:critical Production degraded; pipelines stalled label Jun 25, 2021

nsivabalan force-pushed the bulk_insert_simplified_prep_internal_custom_ds branch from bd1ddae to 86705be Compare June 28, 2021 15:39

vinothchandar added the priority:blocker Production down; release blocker label Jul 5, 2021

vinothchandar requested changes Jul 6, 2021

nsivabalan commented Jul 6, 2021

...-spark-common/src/main/java/org/apache/hudi/internal/BulkInsertDataInternalWriterHelper.java (Outdated)

lamberken and others added 7 commits July 6, 2021 10:37

trigger rebuild

212523a

[HUDI-1156] Remove unused dependencies from HoodieDeltaStreamerWrapper Class (apache#1927)

8f46bff

Adding bulk insert sort modes and user defined bulk insert partitioner to bulk insert of Rows

3fda2ee

Rebasing and adding tests for BulkInsertPartitioners with Rows

1752f5e

Fixing build issues

a304570

fetching and rebasing with master

82f373c

Caching row create handles

af3e97a

nsivabalan force-pushed the bulk_insert_simplified_prep_internal_custom_ds branch from 86705be to f5ad980 Compare July 6, 2021 15:05

Addressing feedback

bbf3285

nsivabalan force-pushed the bulk_insert_simplified_prep_internal_custom_ds branch from f5ad980 to bbf3285 Compare July 6, 2021 22:45

nsivabalan merged commit ea9e5d0 into apache:master Jul 7, 2021

ghost pushed a commit to shivagowda/hudi that referenced this pull request Jul 15, 2021

[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (apache#3149)

d1385c9

ghost pushed a commit to shivagowda/hudi that referenced this pull request Aug 1, 2021

[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (apache#3149)

e41b03d

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Jan 5, 2026

[HUDI-1104] Adding support for UserDefinedPartitioners and SortModes to BulkInsert with Rows (apache#3149)

e67b770

6 participants
