[HUDI-1526] Translate the api partitionBy to hoodie.datasource.write.partitionpath.field #2431

teeyog · 2021-01-11T11:21:21Z

What is the purpose of the pull request

Currently, to set the partition field of a Hudi table you must configure the parameter hoodie.datasource.write.partitionpath.field; the Spark DataFrameWriter API partitionBy has no effect. This change automatically translates the columns passed to partitionBy into the Hudi partition path field.
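
The translation described here can be sketched as a pure function over the writer's option map. This is an illustrative sketch, not the code in this PR: the names `PartitionByTranslation` and `translate` are hypothetical, and letting the partitionBy columns replace an explicitly set value is an assumption. Only the config key comes from the description above.

```scala
object PartitionByTranslation {
  // Config key named in the PR description.
  val PartitionPathField = "hoodie.datasource.write.partitionpath.field"

  // Hypothetical helper: copy the partitionBy columns into the Hudi option.
  // Assumption: when partitionBy columns are present they replace any value
  // already set for the key; with no columns the options are left untouched.
  def translate(options: Map[String, String], partitionBy: Seq[String]): Map[String, String] =
    if (partitionBy.isEmpty) options
    else options + (PartitionPathField -> partitionBy.mkString(","))

  def main(args: Array[String]): Unit = {
    val opts = Map("hoodie.table.name" -> "trips")
    // Shows the option map after the partitionBy columns are folded in.
    println(translate(opts, Seq("region", "day")))
  }
}
```

Joining the columns with a comma follows the multi-field form of the partition path option; whether each key generator accepts that form is a separate question discussed later in the review.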

codecov-io · 2021-01-11T12:45:47Z

Codecov Report

Merging #2431 (e0ac169) into master (4a5683d) will increase coverage by 0.02%.
The diff coverage is 90.00%.

@@             Coverage Diff              @@
##             master    #2431      +/-   ##
============================================
+ Coverage     50.69%   50.71%   +0.02%     
+ Complexity     3132     3131       -1     
============================================
  Files           430      430              
  Lines         19596    19602       +6     
  Branches       2007     2008       +1     
============================================
+ Hits           9934     9941       +7     
+ Misses         8853     8850       -3     
- Partials        809      811       +2

Flag	Coverage Δ	Complexity Δ
hudicli	`37.21% <ø> (ø)`	`0.00 <ø> (ø)`
hudiclient	`100.00% <ø> (ø)`	`0.00 <ø> (ø)`
hudicommon	`51.43% <ø> (ø)`	`0.00 <ø> (ø)`
hudiflink	`36.30% <ø> (ø)`	`0.00 <ø> (ø)`
hudihadoopmr	`33.16% <ø> (ø)`	`0.00 <ø> (ø)`
hudisparkdatasource	`69.73% <90.00%> (+0.26%)`	`0.00 <0.00> (ø)`
hudisync	`48.61% <ø> (ø)`	`0.00 <ø> (ø)`
huditimelineservice	`66.49% <ø> (ø)`	`0.00 <ø> (ø)`
hudiutilities	`69.46% <ø> (-0.06%)`	`0.00 <ø> (ø)`

Impacted Files	Coverage Δ	Complexity Δ
...src/main/scala/org/apache/hudi/DefaultSource.scala	`88.23% <90.00%> (-0.48%)`	`15.00 <0.00> (ø)`
...apache/hudi/utilities/deltastreamer/DeltaSync.java	`70.50% <0.00%> (-0.36%)`	`50.00% <0.00%> (-1.00%)`
...n/scala/org/apache/hudi/HoodieSparkSqlWriter.scala	`49.82% <0.00%> (+1.06%)`	`0.00% <0.00%> (ø%)`

yanghua · 2021-01-19T09:52:17Z

@wangxianghu Please help review, thanks.

wangxianghu · 2021-01-20T01:18:02Z

Hi @teeyog, thanks for your contribution!
Can you add some tests to verify this change?

teeyog · 2021-01-20T10:36:26Z

@wangxianghu A test has been added.

zhedoubushishi · 2021-01-20T21:58:19Z

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

Could you also add a TODO comment here to indicate that we can remove this line after upgrading to Spark 3? I think in the future, Hudi will move to Spark 3.

Thank you for your review; the TODO has been added.

zhedoubushishi · 2021-01-21T01:58:42Z

LGTM! This is also something I plan to do. Let's wait for others' review.

wangxianghu · 2021-01-23T04:38:41Z

Thanks @teeyog, I will review soon.

wangxianghu · 2021-01-23T06:04:17Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

@teeyog Please take TimestampBasedKeyGenerator, CustomKeyGenerator (configured with a timestamp partition path), ComplexKeyGenerator, and so on into consideration.

@wangxianghu All KeyGenerators are considered. Only CustomKeyGenerator is special: it requires the user to specify the partition path in the form field1:PartitionKeyType1,field2:PartitionKeyType2.

wangxianghu · 2021-01-25T08:30:24Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

We cannot simply combine SIMPLE with the partitionBy field. When the user uses CustomKeyGenerator and the partition path field is of timestamp type, the string after the partitionBy field should be TIMESTAMP.

@wangxianghu Thank you for your review. My opinion is this: following the usual Spark convention, the partition value produced by partitionBy is the original field value, so the default is SIMPLE. If we tried to infer TIMESTAMP automatically from the field type, the rules would be hard to define. For example, if a field is a long, should we convert it to TIMESTAMP? If we convert it but the value is not actually a timestamp, an error will be reported. So SIMPLE is used by default. Users who want TIMESTAMP can specify it directly with hoodie.datasource.write.partitionpath.field.

Yes, I get your point. We'd better support both SIMPLE and TIMESTAMP partition path types in a unified way.

Yes, now if the parameters include TIMESTAMP_TYPE_FIELD_PROP and TIMESTAMP_OUTPUT_DATE_FORMAT_PROP, TIMESTAMP is used by default, otherwise SIMPLE
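
The rule described in this reply can be sketched as follows. This is a sketch under stated assumptions, not the PR's implementation: `CustomKeyGenFields`, `defaultType`, and `typed` are hypothetical names, the two property strings are assumed values for the constants mentioned above (check the key generator constants in your Hudi version), and keeping a column that already carries a `:Type` suffix is also an assumption.

```scala
object CustomKeyGenFields {
  // Assumed property strings behind TIMESTAMP_TYPE_FIELD_PROP and
  // TIMESTAMP_OUTPUT_DATE_FORMAT_PROP; verify against your Hudi version.
  val TimestampTypeProp = "hoodie.deltastreamer.keygen.timebased.timestamp.type"
  val OutputDateFormatProp = "hoodie.deltastreamer.keygen.timebased.output.dateformat"

  // TIMESTAMP only when both timestamp properties are set, otherwise SIMPLE.
  def defaultType(options: Map[String, String]): String =
    if (options.contains(TimestampTypeProp) && options.contains(OutputDateFormatProp)) "TIMESTAMP"
    else "SIMPLE"

  // Build "field1:Type,field2:Type". Columns that already contain ':' are
  // passed through unchanged (an assumption of this sketch).
  def typed(columns: Seq[String], options: Map[String, String]): String = {
    val tpe = defaultType(options)
    columns.map(c => if (c.contains(":")) c else s"$c:$tpe").mkString(",")
  }
}
```

For example, `CustomKeyGenFields.typed(Seq("region", "day"), Map.empty)` yields `region:SIMPLE,day:SIMPLE`.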

wangxianghu

@teeyog thanks for addressing my comments, LGTM now !
cc @yanghua

yanghua

LGTM, cc @vinothchandar May want to do a double-check?

vinothchandar · 2021-01-29T01:49:55Z

@nsivabalan @zhedoubushishi also to review.

nsivabalan · 2021-01-31T18:25:55Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

My 2 cents: maybe we can introduce a new method here that is invoked directly. As of now, this is called within parametersWithWriteDefaults(), which does not sound right. So maybe make an explicit call after parametersWithWriteDefaults() returns, something like translateSqlOptions(), which takes care of these translations. The existing method only handles some deprecated options, whereas here we are translating a SQL option into a Hudi option/config param.
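
The separation suggested here might look like the sketch below. Both functions are stand-ins written for illustration: the default filled in by `parametersWithWriteDefaults` and the `partition.columns` key read by `translateSqlOptions` are made up for this sketch and are not Hudi's or Spark's real behavior. The point is only the explicit call order.

```scala
object WriteOptionPipeline {
  type Options = Map[String, String]

  // Stand-in for parametersWithWriteDefaults: fills in defaults only.
  // The single default shown is illustrative, not Hudi's full default set.
  def parametersWithWriteDefaults(params: Options): Options =
    Map("hoodie.datasource.write.operation" -> "upsert") ++ params

  // Stand-in for the proposed translateSqlOptions: maps a Spark-side option
  // to a Hudi config. "partition.columns" is a made-up key for this sketch.
  def translateSqlOptions(params: Options): Options =
    params.get("partition.columns") match {
      case Some(cols) => params + ("hoodie.datasource.write.partitionpath.field" -> cols)
      case None       => params
    }

  // Explicit call order: apply defaults first, then translate.
  def prepare(params: Options): Options =
    translateSqlOptions(parametersWithWriteDefaults(params))
}
```

Keeping the two steps separate means the defaults method stays responsible for defaults and deprecated options, while the translation can be tested on its own.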

nsivabalan · 2021-01-31T18:44:15Z

hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/DefaultSource.scala

From what I understand, this is not required in Spark 3. If that's the case, can you fetch the Spark version and set this only for Spark 2 or lower? Even today one can run Hudi with Spark 3.
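
A version gate of the kind requested here could be sketched as below. `SparkVersionGate`, `majorVersion`, and `needsWorkaround` are hypothetical names; in a real job the version string would come from Spark itself (for example `SparkSession.version`), which is left out so the sketch runs without Spark on the classpath.

```scala
import scala.util.Try

object SparkVersionGate {
  // Parse the leading major component of a version string such as "2.4.4".
  // Returns None when the string does not start with a number.
  def majorVersion(version: String): Option[Int] =
    Try(version.takeWhile(_.isDigit).toInt).toOption

  // True only for Spark 2 or lower; unparseable versions are treated as
  // not needing the workaround (a choice made for this sketch).
  def needsWorkaround(sparkVersion: String): Boolean =
    majorVersion(sparkVersion).exists(_ <= 2)
}
```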

nsivabalan · 2021-01-31T18:49:35Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

Can you please add a one-line Javadoc here describing the format expected by CustomKeyGenerator?

nsivabalan · 2021-01-31T18:49:55Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

Can we use classOf[CustomKeyGenerator].getName rather than hardcoding the full path?
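
A small sketch of the suggestion, using a local stand-in class so that it compiles without Hudi on the classpath. `KeyGenNames` and `isCustomKeyGenerator` are hypothetical names; with the real class, `getName` would return its fully qualified name.

```scala
object KeyGenNames {
  // Local stand-in for the real key generator class.
  class CustomKeyGenerator

  // Derived from the class itself, so a rename or package move is caught
  // by the compiler instead of silently breaking a string comparison.
  val derivedName: String = classOf[CustomKeyGenerator].getName

  // Compare a configured class name against the derived one.
  def isCustomKeyGenerator(configured: String): Boolean =
    configured == derivedName
}
```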

nsivabalan · 2021-01-31T18:52:26Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

Maybe "translatedOptParams"?

nsivabalan · 2021-01-31T18:53:52Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

Can we create separate tests for the different key generators? Also, can we please add a private method and reuse the code in every test where possible?

nsivabalan · 2021-01-31T18:57:54Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

Can you please add one more test for CustomKeyGenerator covering the format "field1:simple,field2:timestamp"? This is the only key generator that has special handling, so it would be nice to have more tests around it.

Also, can we have one failure test, checking that an invalid format results in a failure? TestCustomKeyGenerator has tests you can use for reference.
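
To make the two cases concrete, here is a sketch of a validator for that format together with one passing and one failing input. It is not CustomKeyGenerator's parsing code: `PartitionSpec` and `parse` are hypothetical, the accepted types are limited to the two named in this thread, and case-insensitive matching is an assumption.

```scala
object PartitionSpec {
  // Partition key types accepted by this sketch (names taken from the thread).
  val KnownTypes = Set("SIMPLE", "TIMESTAMP")

  // Parse "field1:simple,field2:timestamp" into (field, TYPE) pairs.
  // Returns Left with a message for an entry that is not exactly
  // "field:type" or that names an unknown type.
  def parse(spec: String): Either[String, Seq[(String, String)]] = {
    val parsed: Seq[Either[String, (String, String)]] =
      spec.split(",").toSeq.map(_.trim).map { entry =>
        entry.split(":") match {
          case Array(field, tpe) if field.nonEmpty && KnownTypes(tpe.toUpperCase) =>
            Right(field -> tpe.toUpperCase)
          case _ =>
            Left(s"invalid partition path entry: '$entry'")
        }
      }
    // Report the first invalid entry, otherwise return all parsed pairs.
    parsed.collectFirst { case Left(err) => Left(err) }
      .getOrElse(Right(parsed.collect { case Right(pair) => pair }))
  }

  def main(args: Array[String]): Unit = {
    println(parse("field1:simple,field2:timestamp")) // valid format
    println(parse("field1"))                         // missing type, reported as Left
  }
}
```

A success test would assert on the `Right` value and a failure test on `isLeft`; the real tests would instead assert on whatever exception the key generator raises.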

nsivabalan

@vinothchandar: a quick question. How do we document this feature in general, or how do we let users know that they can leverage Spark's partitionBy? Do we fix the quick start utils, or somewhere else?
@teeyog: a few minor comments. Almost there; once they are addressed we can land.

nsivabalan · 2021-02-04T10:58:51Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

Maybe we could name the test "testSparkPartitonByWithCustomKeyGen()". If this looks OK, you can rename all the methods this way; it is succinct and conveys the meaning too.

nsivabalan · 2021-02-04T11:02:07Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

I see we have a lot of different options with the timestamp-based key generator. Can you create a follow-up ticket? Even if you don't take it, someone will pick it up and add more tests.

https://issues.apache.org/jira/browse/HUDI-1610

nsivabalan · 2021-02-04T11:03:55Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

Just for my understanding: with NonpartitionedKeyGenerator, what happens in the following cases?
a. writer.partitionBy("")
b. writer.partitionBy("non_existant_column")

@teeyog : can you please help(/respond) me with this.

nsivabalan · 2021-02-08T12:24:06Z

@teeyog: please ping me here once you have addressed all the feedback and the PR is ready for review again.

teeyog · 2021-02-10T07:44:22Z

@nsivabalan Updated according to your feedback; please review again, thanks.

nsivabalan · 2021-02-10T17:01:43Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala

+      val partitionPathField =
+        keyGeneratorClass match {
+          // Only CustomKeyGenerator needs special treatment, because it needs to be specified in a way
+          // such as "field1:PartitionKeyType1,field2:PartitionKeyType2".


nsivabalan · 2021-02-10T17:04:44Z

...park-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestCOWDataSource.scala

https://issues.apache.org/jira/browse/HUDI-1610

… to hoodie.datasource.write.partitionpath.field (apache#2431)" This reverts commit 26da4f5.

…ie.datasource.write.partitionpath.field (apache#2431)

teeyog changed the title ~~translate the api partitionBy to `hoodie.dat…~~ translate the api partitionBy to hoodie.datasource.write.partitionpath.field Jan 11, 2021

teeyog changed the title ~~translate the api partitionBy to hoodie.datasource.write.partitionpath.field~~ translate the api partitionBy to hoodie.datasource.write.partitionpath.field Jan 11, 2021

teeyog changed the title ~~translate the api partitionBy to hoodie.datasource.write.partitionpath.field~~ [HUDI-2431]translate the api partitionBy to hoodie.datasource.write.partitionpath.field Jan 11, 2021

teeyog changed the title ~~[HUDI-2431]translate the api partitionBy to hoodie.datasource.write.partitionpath.field~~ [HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field Jan 13, 2021

zhedoubushishi reviewed Jan 20, 2021

wangxianghu requested changes Jan 23, 2021

wangxianghu reviewed Jan 25, 2021

teeyog force-pushed the translate_param branch from f1d0fda to 6ad41e4 Compare January 25, 2021 11:48

wangxianghu approved these changes Jan 26, 2021

yanghua approved these changes Jan 26, 2021

yanghua changed the title ~~[HUDI-1526]translate the api partitionBy to hoodie.datasource.write.partitionpath.field~~ [HUDI-1526] Translate the api partitionBy to hoodie.datasource.write.partitionpath.field Jan 26, 2021

vinothchandar assigned wangxianghu and nsivabalan Jan 29, 2021

nsivabalan requested changes Jan 31, 2021

teeyog force-pushed the translate_param branch from e771789 to 9d3fea0 Compare February 2, 2021 05:13

teeyog requested a review from nsivabalan February 2, 2021 09:08

nsivabalan reviewed Feb 4, 2021

teeyog added 6 commits February 5, 2021 10:49

translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

b3484c2

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

a1ea754

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

518b5df

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

0c7e5b8

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

90389a0

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

83c7139

teeyog added 2 commits February 5, 2021 10:49

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

371028d

[HUDI-1526] translate the api partitionBy of spark DataFrameWriter to `hoodie.datasource.write.partitionpath.field`

e0ac169

teeyog force-pushed the translate_param branch from 9d3fea0 to e0ac169 Compare February 5, 2021 02:49

teeyog requested a review from nsivabalan February 5, 2021 04:08

nsivabalan approved these changes Feb 10, 2021

nsivabalan merged commit 26da4f5 into apache:master Feb 10, 2021

nsivabalan added a commit to nsivabalan/hudi that referenced this pull request Mar 31, 2021

Revert "[HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431)"

This reverts commit 26da4f5.

7b36a21

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021

[HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431)

9249a76

hudi-bot mentioned this pull request Nov 30, 2025

Add more tests to TestCOWDataSource for TimestampbasedKeyGen #14753

Open

yanghua commented Jan 19, 2021

wangxianghu commented Jan 20, 2021

teeyog commented Jan 20, 2021

zhedoubushishi commented Jan 21, 2021

wangxianghu commented Jan 23, 2021

wangxianghu left a comment

yanghua left a comment

vinothchandar commented Jan 29, 2021

nsivabalan left a comment
