
Conversation

@pengzhiwei2018 commented Jan 25, 2021


What is the purpose of the pull request

Support Spark Structured Streaming read from Hudi COW & MOR tables.

Brief change log

  • Implement a HoodieSource to read a Hudi table for Spark Structured Streaming (see the usage sketch after this list).
  • Implement a HoodieSourceOffset to store the latest commit time that has been read.
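
A minimal usage sketch of the streaming read, assuming the source is exposed through the existing hudi format; the table path, checkpoint directory, and trigger interval below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object HudiStreamReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-stream-read")
      .master("local[2]")
      .getOrCreate()

    // Read the Hudi table (COW or MOR) as a streaming source.
    val streamDf = spark.readStream
      .format("hudi")
      .load("/tmp/hudi/source_table") // hypothetical table path

    // Print each micro-batch of incremental data to the console.
    val query = streamDf.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/hudi/source_table_ckpt") // hypothetical
      .trigger(Trigger.ProcessingTime("1 minute"))
      .start()

    query.awaitTermination()
  }
}
```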

Verify this pull request


This change added tests and can be verified as follows:

  • Add TestStreamingSource to test reading data from Hudi COW & MOR tables.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@codecov-io commented Jan 25, 2021

Codecov Report

Merging #2485 (56a0ae1) into master (5d053b4) will increase coverage by 1.22%.
The diff coverage is 64.51%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #2485      +/-   ##
============================================
+ Coverage     49.78%   51.01%   +1.22%     
- Complexity     3089     3197     +108     
============================================
  Files           430      435       +5     
  Lines         19566    19925     +359     
  Branches       2004     2047      +43     
============================================
+ Hits           9741    10164     +423     
+ Misses         9033     8925     -108     
- Partials        792      836      +44     
| Flag | Coverage Δ | Complexity Δ |
|------|------------|--------------|
| hudicli | 36.90% <ø> (-0.31%) | 0.00 <ø> (ø) |
| hudiclient | 100.00% <ø> (ø) | 0.00 <ø> (ø) |
| hudicommon | 51.38% <ø> (-0.12%) | 0.00 <ø> (ø) |
| hudiflink | 43.21% <ø> (+10.17%) | 0.00 <ø> (ø) |
| hudihadoopmr | 33.16% <ø> (ø) | 0.00 <ø> (ø) |
| hudisparkdatasource | 69.48% <64.51%> (-0.03%) | 0.00 <27.00> (ø) |
| hudisync | 48.61% <ø> (ø) | 0.00 <ø> (ø) |
| huditimelineservice | 66.49% <ø> (ø) | 0.00 <ø> (ø) |
| hudiutilities | 69.46% <ø> (+7.60%) | 0.00 <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | Complexity Δ |
|----------------|------------|--------------|
| ...rc/main/scala/org/apache/hudi/Spark2RowSerDe.scala | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| ...rc/main/scala/org/apache/hudi/Spark3RowSerDe.scala | 0.00% <0.00%> (ø) | 0.00 <0.00> (?) |
| .../spark/sql/hudi/streaming/HoodieSourceOffset.scala | 63.15% <63.15%> (ø) | 4.00 <4.00> (?) |
| ...src/main/scala/org/apache/hudi/DefaultSource.scala | 84.21% <64.28%> (-4.50%) | 17.00 <2.00> (+2.00) ⬇️ |
| .../main/scala/org/apache/hudi/HoodieSparkUtils.scala | 88.88% <66.66%> (ø) | 0.00 <0.00> (ø) |
| .../spark/sql/hudi/streaming/HoodieStreamSource.scala | 68.67% <68.67%> (ø) | 21.00 <21.00> (?) |
| ...ache/hudi/common/fs/inline/InMemoryFileSystem.java | 79.31% <0.00%> (-10.35%) | 15.00% <0.00%> (-1.00%) |
| ...a/org/apache/hudi/cli/commands/CommitsCommand.java | 53.50% <0.00%> (-5.30%) | 15.00% <0.00%> (ø%) |
| ...pache/hudi/common/table/HoodieTableMetaClient.java | 67.42% <0.00%> (-3.27%) | 45.00% <0.00%> (ø%) |
| .../org/apache/hudi/MergeOnReadSnapshotRelation.scala | 89.13% <0.00%> (-1.46%) | 17.00% <0.00%> (+1.00%) ⬇️ |

... and 29 more

@vinothchandar (Member)

cc @garyli1019 mind taking a first pass at this PR? :)

@garyli1019 (Member) left a comment

@pengzhiwei2018 thanks for your contribution. I left some comments, but I am not very familiar with Structured Streaming. @zhedoubushishi, mind taking a pass as well?

Member

More specific name? like StructuredStreamingHoodieSource?

@pengzhiwei2018 (Author), Jan 27, 2021

Thanks for your suggestion. Maybe HoodieStreamSource is suitable.

Member

I think we should use the hudi internal converter from AvroConversionUtils, to be consistent with others.

Author

That's all right. I will give it a try.

Member

Initial version?

Author

Yes, it is the first version number of this source.

Member

We could add more docs. This sounds like using structured streaming to sink to Hudi.

Author

All right! I will add more documentation for this.

@zhedoubushishi (Contributor)

@pengzhiwei2018 thanks for your contribution. I left some comments, but I am not very familiar with Structured Streaming. @zhedoubushishi, mind taking a pass as well?

Sure, will take a look.

@zhedoubushishi (Contributor)

Can you check if this change is compatible with Spark 3.0.0?

@pengzhiwei2018 (Author) commented Jan 27, 2021

Can you check if this change is compatible with Spark 3.0.0?

Hi @zhedoubushishi, the Source and Offset implementation still works with Spark 3.0.0, but it needs to be integrated into the new SourceProvider before it can be used with 3.0.0.

Contributor

[nit] duplicated space

Author

Thanks for the correction!

Contributor

This class is not compatible with Spark 3. I am wondering whether we could move this class to the hudi-spark2 module (along with any other class that is not compatible with Spark 3). In the long term, we should make this change compatible with both Spark 2 and Spark 3, but I think you can create a JIRA to track that for now.

You may run mvn clean install -Dspark3 -DskipTests to see if your change works with Spark 3.

@pengzhiwei2018 (Author), Jan 28, 2021

Thanks for your suggestion. I have tested with -Dspark3, and HoodieSourceOffset does not work with Spark 3.
I am unsure whether we should leave it in hudi-spark or hudi-spark2. Currently most of the Spark code stays in hudi-spark; if we moved this to hudi-spark2 now, a lot of other related code would have to move as well. So I think we can create a JIRA to track this for now, and in the long term I will provide an implementation for Spark 3.
Here is the JIRA: https://issues.apache.org/jira/browse/HUDI-1558

Contributor

If it's only compatible with Spark 2, we should leave it in hudi-spark2. If it's compatible with both Spark 2 and Spark 3, we should leave it in hudi-spark.
My understanding is that, at a minimum, you need to make this change compile with mvn package -DskipTests -Dspark3; otherwise Hudi can no longer be compiled with Spark 3.

Author

Thanks @zhedoubushishi for your advice. I will fix this problem later.

Author

Done!

@pengzhiwei2018 force-pushed the dev_1109 branch 5 times, most recently from 0063431 to 9293bd5 (February 3, 2021 16:15):
update test for mor table

code format

remove duplicate white space

remove empty line

Empty

fix compile error for spark3

trigger check
@vinothchandar (Member) commented Feb 3, 2021

@pengzhiwei2018 I am planning to spend some time on this as well.

High-level question: does the offset for the streaming read map to _hoodie_commit_seq_no in this implementation? That way we could actually do record-level streams and even resume where we left off.

@pengzhiwei2018 (Author) commented Feb 4, 2021

@pengzhiwei2018 I am planning to spend some time on this as well.

High-level question: does the offset for the streaming read map to _hoodie_commit_seq_no in this implementation? That way we could actually do record-level streams and even resume where we left off.

Hi @vinothchandar, welcome to join this.
Currently the HoodieSourceOffset just keeps the commitTime. In every micro-batch we consume the incremental data between (lastCommitTime, currentCommitTime]. If the job fails during consumption, it recovers from the offset state and continues consuming the data between (lastCommitTime, currentCommitTime]. It is commit-level recovery.
Introducing _hoodie_commit_seq_no into the offset could make recovery more fine-grained, down to the record level. But the problem is how we can know the max commit_seq_no in a commit. In the getOffset method, we must tell Spark which commit_seq_no we will read up to in the micro-batch. Currently the Hudi metadata only records the commit time for each commit, so slicing the offset this way is a problem.
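
For illustration, a rough sketch of how one micro-batch could be served as an incremental read between the two offsets. The option keys are Hudi's existing incremental-query options; the method shape is an assumption, not the PR's actual getBatch:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object IncrementalBatchSketch {

  // Assumed offset shape from the discussion above: it carries only a commit time.
  case class HoodieSourceOffset(commitTime: String)

  // Serve the data in (start.commitTime, end.commitTime] as an incremental query.
  def batchBetween(spark: SparkSession,
                   tablePath: String,
                   start: HoodieSourceOffset,
                   end: HoodieSourceOffset): DataFrame = {
    spark.read
      .format("hudi")
      .option("hoodie.datasource.query.type", "incremental")
      .option("hoodie.datasource.read.begin.instanttime", start.commitTime) // exclusive
      .option("hoodie.datasource.read.end.instanttime", end.commitTime)     // inclusive
      .load(tablePath)
  }
}
```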

@vinothchandar (Member)

how can we know the max commit_seq_no in the commit

I think we should do what would be done in Kafka's case, or just use an accumulator to obtain this on each commit. Either way, let's file a follow-up JIRA to allow record-level streaming; we can do it as a follow-up.

@vinothchandar (Member) left a comment

A few comments, but overall LGTM. Let's land this soon! :)

import java.io.Serializable;

public interface SparkRowDeserializer extends Serializable {
public interface SparkRowEncoder extends Serializable {
Member

Maybe SparkRowSerDe is an apt name?

Author

Agreed!

}
}

val INIT_OFFSET = HoodieSourceOffset("000")
Member

Should/can we reuse HoodieTimeline#INIT_INSTANT_TS?

Author

Agreed!
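
A one-line sketch of the suggested reuse, as the line would look inside the HoodieSourceOffset companion object (HoodieSourceOffset as defined in this PR; the constant is the timeline's initial instant timestamp):

```scala
import org.apache.hudi.common.table.timeline.HoodieTimeline

// Reuse the timeline's initial instant timestamp instead of a hand-written "000".
val INIT_OFFSET = HoodieSourceOffset(HoodieTimeline.INIT_INSTANT_TS)
```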


package org.apache.spark.sql.hudi.streaming

import com.fasterxml.jackson.annotation.JsonInclude.Include
Member

This is very simple, right? Can we just hand-format the JSON without the Jackson dependency? Just a thought; I'll leave it to you.

Author

Thanks for your advice! Yes, currently it is quite simple. But in the long term, if we introduce the commit_seq_no or other info into the offset, we may need the JSON parser. And the Jackson dependency is already among Spark's dependencies, so I prefer to keep it.
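
For illustration, a rough sketch of a commit-time-only offset (de)serialized with Jackson, along the lines discussed above; class and field names mirror the discussion, and the real implementation in the PR may differ:

```scala
import com.fasterxml.jackson.annotation.JsonInclude.Include
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class HoodieSourceOffset(commitTime: String)

object HoodieSourceOffset {
  private val mapper = {
    val m = new ObjectMapper()
    m.setSerializationInclusion(Include.NON_ABSENT)
    m.registerModule(DefaultScalaModule)
    m
  }

  // Serialize the offset to the JSON string Spark stores in the checkpoint.
  def toJson(offset: HoodieSourceOffset): String = mapper.writeValueAsString(offset)

  // Restore the offset from the checkpointed JSON string.
  def fromJson(json: String): HoodieSourceOffset =
    mapper.readValue(json, classOf[HoodieSourceOffset])
}
```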

val fs = path.getFileSystem(hadoopConf)
TablePathUtils.getTablePath(fs, path).get()
}
@transient private lazy val metaClient = new HoodieTableMetaClient(hadoopConf, tablePath.toString)
Member

this does serialize well actually

Author

ok!

import java.nio.charset.StandardCharsets
import java.util.Date

import org.apache.commons.io.IOUtils
Member

Please use the Hudi version of IOUtils. We need the same checkstyle rules applied for Scala; apache-commons is an illegal import in Java code.

Author

Thanks for the advice!

}

override def deserialize(in: InputStream): HoodieSourceOffset = {
val content = IOUtils.toString(new InputStreamReader(in, StandardCharsets.UTF_8))
Member

FileIOUtils etc. have a similar method.

Author

Thank you for the advice!

}

override def getOffset: Option[Offset] = {
initialPartitionOffsets
Member

Couldn't this be done lazily in the else block only, i.e. remove this line?

Author

Yes, it can be removed.

}
}

override def getOffset: Option[Offset] = {
Member

Just a rant: Source#getOffset() is such a bad name. It's actually the latest offset. :(

Author

Yes, getOffset is not a meaningful method name. However, it is defined in the Spark Source interface, so we cannot rename it, but we can add some comments for it.
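
Since the interface cannot be renamed, here is a skeleton sketch of the kind of clarifying comment that could sit on the override (the bodies below are placeholders, not the PR's actual logic):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.{Offset, Source}
import org.apache.spark.sql.types.StructType

// Skeleton only: the real HoodieStreamSource implements these methods properly.
class CommentedStreamSource(tableSchema: StructType) extends Source {

  override def schema: StructType = tableSchema

  /**
   * Despite its name, this returns the *latest* offset currently available in the
   * table (the most recent completed commit time), not a previously saved offset.
   * Spark calls it to pick the end offset of the next micro-batch.
   */
  override def getOffset: Option[Offset] = None // placeholder

  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    throw new UnsupportedOperationException("skeleton only")

  override def stop(): Unit = ()
}
```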

val endOffset = HoodieSourceOffset(end)

if (startOffset == endOffset) {
sqlContext.internalCreateDataFrame(
Member

nice.

@pengzhiwei2018 (Author)

how can we know the max commit_seq_no in the commit

I think we should do what would be done in Kafka's case, or just use an accumulator to obtain this on each commit. Either way, let's file a follow-up JIRA to allow record-level streaming; we can do it as a follow-up.

Yes, I agree to file a JIRA to allow record-level streaming!

@pengzhiwei2018 (Author) commented Feb 9, 2021

Hi @vinothchandar, all the comments have been addressed. Please take another look when you have time. I have also filed a JIRA to support record-level streaming consumption: HUDI-1601.

@vinothchandar (Member) left a comment

LGTM!

@vinothchandar vinothchandar merged commit 3797207 into apache:master Feb 17, 2021
@rubenssoto

I'm curious: what will happen if I recreate the source table of a stream?

For example, I have a tableA, and a streaming job using tableA as the source and tableB as the sink. If for any reason I reprocess tableA, what happens when Spark streaming tries to read tableA?

I will try it.

@pengzhiwei2018

@pengzhiwei2018 (Author)

I'm curious: what will happen if I recreate the source table of a stream?

For example, I have a tableA, and a streaming job using tableA as the source and tableB as the sink. If for any reason I reprocess tableA, what happens when Spark streaming tries to read tableA?

I will try it.

@pengzhiwei2018

Hi @rubenssoto, do you mean multiple stream jobs consuming the same table simultaneously? That is not a problem: each stream job keeps its own offset state, so the consumers do not influence each other.

@rubenssoto

@pengzhiwei2018

No. For example, today with Spark Structured Streaming on a regular Parquet table, if tableA is the source of my stream and I reprocess/recreate tableA, Spark streaming will process all the new files of the reprocessed tableA.

If for any reason I need to recreate my tableA, what will happen to my streams?

I don't know if I'm making myself clear.

@pengzhiwei2018 (Author)

@pengzhiwei2018

No. For example, today with Spark Structured Streaming on a regular Parquet table, if tableA is the source of my stream and I reprocess/recreate tableA, Spark streaming will process all the new files of the reprocessed tableA.

If for any reason I need to recreate my tableA, what will happen to my streams?

I don't know if I'm making myself clear.

Hi @rubenssoto.
If the table has been recreated, the offset of the stream source should be reset (e.g., use another checkpoint directory or delete the old checkpoint directory). Otherwise, the old offset may not match the newly recreated table and we cannot read data correctly.
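
A small, spark-shell-style sketch of the point above: the stream's offset lives in the checkpoint directory, so after recreating the source table you would point the query at a fresh checkpoint location (all paths below are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hudi-stream-restart").getOrCreate()

// Read the recreated source table as a stream (the path is illustrative).
val stream = spark.readStream
  .format("hudi")
  .load("/data/tableA")

// Using a *new* checkpoint directory discards the old offsets, so the stream
// starts over against the recreated table instead of resuming from a stale commit time.
stream.writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/tableA_console_v2")
  .start()
  .awaitTermination()
```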

@rubenssoto

good to know...

thank you @pengzhiwei2018 !!!

prashantwason pushed a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021
prashantwason added a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021
…OSS master

Summary:
[HUDI-1509]: Reverting LinkedHashSet changes to combine fields from oldSchema and newSchema in favor of using only new schema for record rewriting (apache#2424)
[MINOR] Bumping snapshot version to 0.7.0 (apache#2435)
[HUDI-1533] Make SerializableSchema work for large schemas and add ability to sortBy numeric values (apache#2453)
[HUDI-1529] Add block size to the FileStatus objects returned from metadata table to avoid too many file splits (apache#2451)
[HUDI-1532] Fixed suboptimal implementation of a magic sequence search  (apache#2440)
[HUDI-1535] Fix 0.7.0 snapshot (apache#2456)
[MINOR] Fixing setting defaults for index config (apache#2457)
[HUDI-1540] Fixing commons codec shading in spark bundle (apache#2460)
[HUDI 1308] Harden RFC-15 Implementation based on production testing (apache#2441)
[MINOR] Remove redundant judgments (apache#2466)
[MINOR] Fix dataSource cannot use hoodie.datasource.hive_sync.auto_create_database (apache#2444)
[MINOR] Disabling problematic tests temporarily to stabilize CI (apache#2468)
[MINOR] Make a separate travis CI job for hudi-utilities (apache#2469)
[HUDI-1512] Fix spark 2 unit tests failure with Spark 3 (apache#2412)
[HUDI-1511] InstantGenerateOperator support multiple parallelism (apache#2434)
[HUDI-1332] Introduce FlinkHoodieBloomIndex to hudi-flink-client (apache#2375)
[HUDI] Add bloom index for hudi-flink-client
[MINOR] Remove InstantGeneratorOperator parallelism limit in HoodieFlinkStreamer and update docs (apache#2471)
[MINOR] Improve code readability,remove the continue keyword (apache#2459)
[HOTFIX] Revert upgrade flink verison to 1.12.0 (apache#2473)
[HUDI-1453] Fix NPE using HoodieFlinkStreamer to etl data from kafka to hudi (apache#2474)
[MINOR] Use skipTests flag for skip.hudi-spark2.unit.tests property (apache#2477)
[HUDI-1476] Introduce unit test infra for java client (apache#2478)
[MINOR] Update doap with 0.7.0 release (apache#2491)
[MINOR]Fix NPE when using HoodieFlinkStreamer with multi parallelism (apache#2492)
[HUDI-1234] Insert new records to data files without merging for "Insert" operation.  (apache#2111)
[MINOR] Add Jira URL and Mailing List (apache#2404)
[HUDI-1522] Add a new pipeline for Flink writer (apache#2430)
[HUDI-1522] Add a new pipeline for Flink writer
[HUDI-623] Remove UpgradePayloadFromUberToApache (apache#2455)
[HUDI-1555] Remove isEmpty to improve clustering execution performance (apache#2502)
[HUDI-1266] Add unit test for validating replacecommit rollback (apache#2418)
[MINOR] Quickstart.generateUpdates method add check (apache#2505)
[HUDI-1519] Improve minKey/maxKey computation in HoodieHFileWriter (apache#2427)
[HUDI-1550] Honor ordering field for MOR Spark datasource reader (apache#2497)
[MINOR] Fix method comment typo (apache#2518)
[MINOR] Rename FileSystemViewHandler to RequestHandler and corrected the class comment (apache#2458)
[HUDI-1335] Introduce FlinkHoodieSimpleIndex to hudi-flink-client (apache#2271)
[HUDI-1523] Call mkdir(partition) only if not exists (apache#2501)
[HUDI-1538] Try to init class trying different signatures instead of checking its name (apache#2476)
[HUDI-1538] Try to init class trying different signatures instead of checking its name.
[HUDI-1547] CI intermittent failure: TestJsonStringToHoodieRecordMapF… (apache#2521)
[MINOR] Fixing the default value for source ordering field for payload config (apache#2516)
[HUDI-1420] HoodieTableMetaClient.getMarkerFolderPath works incorrectly on windows client with hdfs server for wrong file seperator (apache#2526)
[HUDI-1571] Adding commit_show_records_info to display record sizes for commit (apache#2514)
[HUDI-1589] Fix Rollback Metadata AVRO backwards incompatiblity (apache#2543)
[MINOR] Fix wrong logic for checking state condition (apache#2524)
[HUDI-1557] Make Flink write pipeline write task scalable (apache#2506)
[HUDI-1545] Add test cases for INSERT_OVERWRITE Operation (apache#2483)
[HUDI-1603] fix DefaultHoodieRecordPayload serialization failure (apache#2556)
[MINOR] Fix the wrong comment for HoodieJavaWriteClientExample (apache#2559)
[HUDI-1526] Translate the api partitionBy in spark datasource to hoodie.datasource.write.partitionpath.field (apache#2431)
[HUDI-1612] Fix write test flakiness in StreamWriteITCase (apache#2567)
[HUDI-1612] Fix write test flakiness in StreamWriteITCase
[MINOR] Default to empty list for unset datadog tags property (apache#2574)
[MINOR] Add clustering to feature list (apache#2568)
[HUDI-1598] Write as minor batches during one checkpoint interval for the new writer (apache#2553)
[HUDI-1109] Support Spark Structured Streaming read from Hudi table (apache#2485)
[HUDI-1621] Gets the parallelism from context when init StreamWriteOperatorCoordinator (apache#2579)
[HUDI-1381] Schedule compaction based on time elapsed (apache#2260)
[HUDI-1582] Throw an exception when syncHoodieTable() fails, with RuntimeException (apache#2536)
[HUDI-1539] Fix bug in HoodieCombineRealtimeRecordReader with reading empty iterators (apache#2583)
[HUDI-1315] Adding builder for HoodieTableMetaClient initialization (apache#2534)
[HUDI-1486] Remove inline inflight rollback in hoodie writer (apache#2359)
[HUDI-1586] [Common Core] [Flink Integration] Reduce the coupling of hadoop. (apache#2540)
[HUDI-1624] The state based index should bootstrap from existing base files (apache#2581)
[HUDI-1477] Support copyOnWriteTable in java client (apache#2382)
[MINOR] Ensure directory exists before listing all marker files. (apache#2594)
[MINOR] hive sync checks for table after creating db if auto create is true (apache#2591)
[HUDI-1620] Add azure pipelines configs (apache#2582)
[HUDI-1347] Fix Hbase index to make rollback synchronous (via config) (apache#2188)
[HUDI-1637] Avoid to rename for bucket update when there is only one flush action during a checkpoint (apache#2599)
[HUDI-1638] Some improvements to BucketAssignFunction (apache#2600)
[HUDI-1367] Make deltaStreamer transition from dfsSouce to kafkasouce (apache#2227)
[HUDI-1269] Make whether the failure of connect hive affects hudi ingest process configurable (apache#2443)
[HUDI-1611] Added a configuration to allow specific directories to be filtered out during Metadata Table bootstrap. (apache#2565)
[Hudi-1583]: Fix bug that Hudi will skip remaining log files if there is logFile with zero size in logFileList when merge on read. (apache#2584)
[HUDI-1632] Supports merge on read write mode for Flink writer (apache#2593)
[HUDI-1540] Fixing commons codec dependency in bundle jars (apache#2562)
[HUDI-1644] Do not delete older rollback instants as part of rollback. Archival can take care of removing old instants cleanly (apache#2610)
[HUDI-1634] Re-bootstrap metadata table when un-synced instants have been archived. (apache#2595)
[HUDI-1584] Modify maker file path, which should start with the target base path. (apache#2539)
[MINOR] Fix default value for hoodie.deltastreamer.source.kafka.auto.reset.offsets (apache#2617)
[HUDI-1553] Configuration and metrics for the TimelineService. (apache#2495)
[HUDI-1587] Add latency and freshness support (apache#2541)
[HUDI-1647] Supports snapshot read for Flink (apache#2613)
[HUDI-1646] Provide mechanism to read uncommitted data through InputFormat (apache#2611)
[HUDI-1655] Support custom date format and fix unsupported exception in DatePartitionPathSelector (apache#2621)
[HUDI-1636] Support Builder Pattern To Build Table Properties For HoodieTableConfig (apache#2596)
[HUDI-1660] Excluding compaction and clustering instants from inflight rollback (apache#2631)
[HUDI-1661] Exclude clustering commits from getExtraMetadataFromLatest API (apache#2632)
[MINOR] Fix import in StreamerUtil.java (apache#2638)
[HUDI-1618] Fixing NPE with Parquet src in multi table delta streamer (apache#2577)
[HUDI-1662] Fix hive date type conversion for mor table (apache#2634)
[HUDI-1673] Replace scala.Tule2 to Pair in FlinkHoodieBloomIndex (apache#2642)
[MINOR] HoodieClientTestHarness close resources in AfterAll phase (apache#2646)
[HUDI-1635] Improvements to Hudi Test Suite (apache#2628)
[HUDI-1651] Fix archival of requested replacecommit (apache#2622)
[HUDI-1663] Streaming read for Flink MOR table (apache#2640)
[HUDI-1678] Row level delete for Flink sink (apache#2659)
[HUDI-1664] Avro schema inference for Flink SQL table (apache#2658)
[HUDI-1681] Support object storage for Flink writer (apache#2662)
[HUDI-1685] keep updating current date for every batch (apache#2671)
[HUDI-1496] Fixing input stream detection of GCS FileSystem (apache#2500)
[HUDI-1684] Tweak hudi-flink-bundle module pom and reorganize the pacakges for hudi-flink module (apache#2669)
[HUDI-1692] Bounded source for stream writer (apache#2674)
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table. (apache#2494)
[HUDI-1552] Improve performance of key lookups from base file in Metadata Table.
[HUDI-1695] Fixed the error messaging (apache#2679)
[HUDI 1615] Fixing null schema in bulk_insert row writer path  (apache#2653)
[HUDI-845] Added locking capability to allow multiple writers (apache#2374)
[HUDI-1701] Implement HoodieTableSource.explainSource for all kinds of pushing down (apache#2690)
[HUDI-1704] Use PRIMARY KEY syntax to define record keys for Flink Hudi table (apache#2694)
[HUDI-1688]hudi write should uncache rdd, when the write operation is finnished (apache#2673)
[MINOR] Remove unused var in AbstractHoodieWriteClient (apache#2693)
[HUDI-1653] Add support for composite keys in NonpartitionedKeyGenerator (apache#2627)
[HUDI-1705] Flush as per data bucket for mini-batch write (apache#2695)
[1568] Fixing spark3 bundles (apache#2625)
[HUDI-1650] Custom avro kafka deserializer. (apache#2619)
[HUDI-1667]: Fix a null value related bug for spark vectorized reader. (apache#2636)
[HUDI-1709] Improving config names and adding hive metastore uri config (apache#2699)
[MINOR][DOCUMENT] Update README doc for integ test (apache#2703)
[HUDI-1710] Read optimized query type for Flink batch reader (apache#2702)
[HUDI-1712] Rename & standardize config to match other configs (apache#2708)
[hotfix] Log the error message for creating table source first (apache#2711)
[HUDI-1495] Bump Flink version to 1.12.2 (apache#2718)
[HUDI-1728] Fix MethodNotFound for HiveMetastore Locks (apache#2731)
[HUDI-1478] Introduce HoodieBloomIndex to hudi-java-client (apache#2608)
[HUDI-1729] Asynchronous Hive sync and commits cleaning for Flink writer (apache#2732)
[HOTFIX] close spark session in functional test suite and disable spark3 test for spark2 (apache#2727)
[HOTFIX] Disable ITs for Spark3 and scala2.12 (apache#2733)
[HOTFIX] fix deploy staging jars script
[MINOR] Add Missing Apache License to test files (apache#2736)
[UBER] Fixed creation of HoodieMetadataClient which now uses a Builder pattern instead of a constructor.

Reviewers: balajee, O955 Project Hoodie Project Reviewer: Add blocking reviewers!, PHID-PROJ-pxfpotkfgkanblb3detq!

JIRA Issues: HUDI-593

Differential Revision: https://code.uberinternal.com/D5867129
prashantwason added a commit to prashantwason/incubator-hudi that referenced this pull request Aug 5, 2021
@endgab commented Mar 8, 2022

Hello, I haven't found any official documentation about this feature. Is there any reason for that? Can you confirm that it is going to be maintained in the future?

