
Conversation

@umehrot2 umehrot2 (Contributor) commented Jun 3, 2020

What is the purpose of the pull request

This PR consolidates changes related to Hudi data source and hive sync integration.

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@garyli1019 garyli1019 (Member) left a comment

Hi @umehrot2, very clean work 👍! I walked through this PR and found some common pieces we can share.

  • Path filtering.
  • User input path handling and glob patterns.
  • Schema provider.

I have a few questions.

How should we define the user interface?
Soon we will have the bootstrap view, read-optimized view, snapshot (real-time) view, and incremental view. I am wondering if we should unify the query interface and handle all the file formats internally. How about this:
Snapshot view: Bootstrap files + non-hudi files + hudi files + hudi log
Read optimized: Bootstrap files + non-hudi files + hudi files
Incremental: incremental view on top of snapshot

How should we split the filegroups?
Right now we already have 4 different filegroups. Once we add ORC support, there will be more. One of the cleanest ways I could find is to read each filegroup into an RDD independently and then union them together. In the current version of this PR, we handle regular parquet in HudiBootstrapRDD. The two disadvantages I can see:

  • After we add ORC support, the complexity of this RDD would increase if we handle the ORC reading here too.
  • IIUC, we don't take full advantage of the vectorized reader by using ColumnarBatch directly. Merging probably requires reading row by row, but for regular parquet files we can use the default parquet reader.

If we can find a way to efficiently list files in the driver, I think we can separate the bootstrap files from the regular parquet files and only use the BootstrapRDD to handle the files that need to be merged. Happy to discuss more here.
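A minimal sketch of that "read each part independently, then union" idea, assuming hypothetical path lists from a driver-side file listing and a placeholder readBootstrapFiles function (neither is code from this PR):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Hypothetical sketch: regular Hudi parquet files go through Spark's default reader,
    // while only the file groups that need skeleton + source merging use the bootstrap reader.
    def loadSnapshot(spark: SparkSession,
                     regularParquetPaths: Seq[String],
                     bootstrapFilePaths: Seq[String],
                     readBootstrapFiles: Seq[String] => DataFrame): DataFrame = {
      val regularDf = spark.read.parquet(regularParquetPaths: _*)
      val bootstrapDf = readBootstrapFiles(bootstrapFilePaths)
      // Union by name so the two sides do not depend on column ordering.
      regularDf.unionByName(bootstrapDf)
    }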


val rows = fileIterator.flatMap(_ match {
case r: InternalRow => Seq(r)
case b: ColumnarBatch => b.rowIterator().asScala
Member:

If we use the vectorized reader this way, does it still have a huge performance boost?
From my understanding, the regular reader iterator reads the whole row as an UnsafeRow and then does the column pruning before loading it into memory. The vectorized reader does the column pruning and data loading in one step. So theoretically the vectorized reader would still be faster even if we read it as InternalRow.
The description I found in the Spark code:
This class can either return InternalRows or ColumnarBatches. With whole stage codegen enabled, this class returns ColumnarBatches which offers significant performance gains.
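As a self-contained illustration of the pattern in the snippet above (a sketch only; the reader setup that produces fileIterator is assumed), flattening the mixed output into plain rows could look like:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.vectorized.ColumnarBatch

    // With the vectorized reader enabled the iterator yields ColumnarBatch instances,
    // otherwise it yields InternalRow; either way we end up with a row iterator.
    def toRowIterator(fileIterator: Iterator[Any]): Iterator[InternalRow] =
      fileIterator.flatMap {
        case batch: ColumnarBatch => batch.rowIterator().asScala
        case row: InternalRow     => Iterator.single(row)
      }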

Contributor Author:

As per my understanding, column pruning is independent of the vectorized reader. The vectorized reader basically reads a batch of rows into a columnar batch, and that is what happens here as well. The only difference is that we are not passing it all the way down as a columnar batch. However, even if I use the regular parquet reader, I guess it must be converting the columnar batch to rows at some point. Right now I am not fully sure whether I am able to use 100% of the benefits of vectorized reading with this method, but at least it reads the data as a batch.

Contributor Author:

Will do some more research on this.

Member:

We probably have to use rowIterator since we will need to merge at the row level anyway; the same applies to the MOR table. Agree that Spark will convert the ColumnarBatch to rows at some point, and it is very difficult to locate where.

Member:

For the MOR table, I have some ideas to speed things up by pre-reading the delete/rollback blocks and simply "skipping" rows as long as OverwriteWithLatestPayload is used. If the user does specify a merge function, then it's hard to get away from. We can take this discussion to a separate forum.
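A hypothetical sketch of that "skip" idea; deletedKeys (built by pre-reading the delete/rollback blocks) and recordKeyOrdinal are assumptions, not existing Hudi APIs:

    import org.apache.spark.sql.catalyst.InternalRow

    // If the overwrite-with-latest payload is in effect, a row whose key appears in a
    // delete block can simply be dropped instead of going through a full merge.
    def skipDeleted(rows: Iterator[InternalRow],
                    deletedKeys: Set[String],
                    recordKeyOrdinal: Int): Iterator[InternalRow] =
      rows.filterNot(row => deletedKeys.contains(row.getString(recordKeyOrdinal)))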

@vinothchandar (Member):

@umehrot2 does this PR have some of @bvaradar's changes included?

@umehrot2 (Contributor Author):

@umehrot2 does this PR have some of @bvaradar's changes included?

@vinothchandar yes it does. I had put some stuff in just for ease of reviewing, because this utilizes some of the core changes that @bvaradar has done. If that is creating confusion I can get rid of it.

@umehrot2 (Contributor Author):

@garyli1019 thank you for your inputs. Sorry, I have been busy with on-call and other projects. Let me try to catch up and process your comments.

@umehrot2 (Contributor Author):

Hi @umehrot2, very clean work 👍! I walked through this PR and found some common pieces we can share. […]

Thanks @garyli1019 for your review and for bringing up some interesting points.

Yes, I think the pieces you mentioned can be used by you later for the MOR datasource work.

Regarding the user interface for queries, your proposal makes sense to me in general. We can maybe flesh it out in more detail once our PRs are merged; happy to collaborate on that.

Regarding your suggestion about using Spark's regular parquet reader for regular Hudi files and doing a union with bootstrapped files:

  • Complexity after ORC comes in: The current implementation is not very tightly coupled with parquet. IIUC, for this implementation it should just be a matter of initializing the readers with OrcFileFormat instead of ParquetFileFormat, which shouldn't make life difficult (see the sketch after this list). Happy to hear your thoughts.

  • Full advantage of the vectorized reader: I think I answered this in another comment you posted. At this point I need to do more research and gather data points on whether it is utilizing 100% of the advantages of vectorized reading. What I know for sure is that the data from the file is read in a batch. Whether I am losing some performance by doing a row iteration over that batch, I am not sure. But I believe Spark's regular readers must be doing the batch-to-row conversion at some point as well. If you have more details on how Spark does this, do let me know, as it would be of great help. I will do some more research on this as well.
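To illustrate the point about the implementation not being tightly coupled to parquet, here is a sketch of a format-agnostic reader setup built on Spark's FileFormat interface (the parameter wiring below is an assumption, not this PR's actual code):

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.execution.datasources.{FileFormat, PartitionedFile}
    import org.apache.spark.sql.execution.datasources.orc.OrcFileFormat
    import org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat
    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.types.StructType

    // Everything below the FileFormat interface is identical for parquet and ORC, so
    // switching formats is largely a matter of which FileFormat gets instantiated.
    def buildReader(spark: SparkSession,
                    useOrc: Boolean,
                    dataSchema: StructType,
                    requiredSchema: StructType,
                    filters: Seq[Filter],
                    options: Map[String, String],
                    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow] = {
      val fileFormat: FileFormat = if (useOrc) new OrcFileFormat() else new ParquetFileFormat()
      fileFormat.buildReaderWithPartitionValues(
        sparkSession = spark,
        dataSchema = dataSchema,
        partitionSchema = StructType(Nil), // partition columns are ignored in this sketch
        requiredSchema = requiredSchema,
        filters = filters,
        options = options,
        hadoopConf = hadoopConf)
    }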

className = "parquet",
options = parameters)
.resolveRelation()
val readPathsStr = parameters.get(DataSourceReadOptions.READ_PATHS_OPT_KEY)
Member:

Are these additional paths on top of the path? Any examples of the use cases?

Contributor Author:

These additional paths are being used in the incremental query code to make it work for bootstrapped tables. I need to pass a list of bootstrapped files to read, and that is why I had to add support for reading from multiple paths. spark.read.parquet already has that kind of support and is already being used in the incremental relation to read a list of files.
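A small sketch of how such a list of paths could be consumed; the comma separator for READ_PATHS_OPT_KEY is an assumption here, while spark.read.parquet genuinely accepts multiple paths:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Read an explicit list of bootstrapped source files in one go.
    def readPaths(spark: SparkSession, readPathsStr: String): DataFrame = {
      val paths = readPathsStr.split(",").map(_.trim).filter(_.nonEmpty)
      spark.read.parquet(paths: _*)
    }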

Member:

the bootstrap.base.path is now in hoodie.properties. Can we make this transparent for the user?

Contributor Author:

Well, right now I added it only for our internal logic to support incremental queries on bootstrapped tables.

Would you want customers to use this otherwise as well, to be able to provide multiple read paths for querying? Is that the ask here?



@vinothchandar vinothchandar (Member) left a comment

@umehrot2 Some of these are very good optimizations in the general sense as well.

}
}

def mergeInternalRow(skeletonRow: InternalRow, dataRow: InternalRow): InternalRow = {
Member:

On avoiding the merge cost, my understanding is that it's hard for this case, where you need to actually merge these two values, re-order, etc.
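As an illustration of why that merge is costly, here is a sketch of rebuilding each output row by copying values from both sides (the schemas and field order are assumptions, not the PR's mergeInternalRow implementation):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    import org.apache.spark.sql.types.StructType

    // Every output row is materialized by copying each value out of the skeleton
    // (metadata) row and the external data row into a combined row.
    def mergeRows(skeletonRow: InternalRow, skeletonSchema: StructType,
                  dataRow: InternalRow, dataSchema: StructType): InternalRow = {
      val values = new Array[Any](skeletonSchema.length + dataSchema.length)
      skeletonSchema.fields.zipWithIndex.foreach { case (field, i) =>
        values(i) = skeletonRow.get(i, field.dataType)
      }
      dataSchema.fields.zipWithIndex.foreach { case (field, i) =>
        values(skeletonSchema.length + i) = dataRow.get(i, field.dataType)
      }
      new GenericInternalRow(values)
    }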



@vinothchandar vinothchandar changed the title from "Bootstrap datasource changes" to "[HUDI-242] Bootstrap datasource changes" on Jul 29, 2020
@vinothchandar vinothchandar changed the title from "[HUDI-242] Bootstrap datasource changes" to "[HUDI-426] Bootstrap datasource changes" on Jul 29, 2020
@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch from 2af6913 to c8295f2 on August 4, 2020
@umehrot2 umehrot2 changed the title from "[HUDI-426] Bootstrap datasource changes" to "[HUDI-426] Bootstrap datasource integration" on Aug 4, 2020
@vinothchandar (Member):

@umehrot2 heads up, we could be landing #1848 before this (CI willing). How hard would the rebase be? I assume there would be some extra work to integrate?

@umehrot2 umehrot2 (Contributor Author) commented Aug 4, 2020

@umehrot2 heads up, we could be landing #1848 before this (CI willing). How hard would the rebase be? I assume there would be some extra work to integrate?

I guess one of us will have to rebase. While most of the work seems isolated between the two PRs, some common files and code areas have been touched. I am fine with rebasing again if that PR gets in first.

@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch from c8295f2 to 9d21da8 on August 5, 2020
@umehrot2 umehrot2 (Contributor Author) commented Aug 5, 2020

@vinothchandar the tests are passing, so it's ready for review from my side.

@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch 2 times, most recently from 313385d to 08e8481 on August 6, 2020
@vinothchandar vinothchandar (Member) left a comment

A few high-level comments. Took a pass at the code; LGTM at a high level.
Doing an in-depth review while we hash out the high-level comments.


@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch from 08e8481 to 4fcd7fe on August 6, 2020
@vinothchandar vinothchandar force-pushed the umehrot2_hudi_rfc12_code_review branch from 4fcd7fe to 923a678 on August 7, 2020
@vinothchandar (Member):

@umehrot2 I rebased this after landing @garyli1019's PR. Please take a look at DefaultSource again to make sure things are OK.

@vinothchandar (Member):

@umehrot2 some tests are failing. Looking at them later today.

Before we head into the weekend, is this PR ready from your perspective? If so, I will take care of making the final changes and land it.

@umehrot2 umehrot2 (Contributor Author) commented Aug 7, 2020

@umehrot2 some tests are failing. Looking at them later today.

Before we head into the weekend, is this PR ready from your perspective? If so, I will take care of making the final changes and land it.

@vinothchandar the rebase has some issues. With the introduction of Spark datasource support for real-time queries, we need to handle the bootstrap case there. For bootstrapped tables, real-time queries are still not supported; only read-optimized queries will work for the MOR case with bootstrapped tables for now. I will fix this, and hopefully that should fix at least the unit test failures.
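For illustration only, a hypothetical guard expressing that constraint (the option values and checks are assumptions, not the actual DefaultSource code):

    // Snapshot (real-time) queries are not yet supported on bootstrapped MOR tables;
    // only read-optimized queries work for that combination for now.
    def validateQueryType(isBootstrappedTable: Boolean, isMorTable: Boolean, queryType: String): Unit = {
      if (isBootstrappedTable && isMorTable && queryType == "snapshot") {
        throw new UnsupportedOperationException(
          "Snapshot queries are not yet supported on bootstrapped MOR tables; use a read-optimized query")
      }
    }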

@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch 2 times, most recently from 7fe1bfa to 3f7ecde on August 7, 2020
@umehrot2 umehrot2 (Contributor Author) commented Aug 8, 2020

@vinothchandar I fixed the rebase issue, and resolved the bootstrap-related test failures. I still see MOR data source related unit test failures because of the Spark context. Is this something you are already aware of?

@garyli1019 (Member):

@vinothchandar I fixed the rebase issue, and resolved the bootstrap-related test failures. I still see MOR data source related unit test failures because of the Spark context. Is this something you are already aware of?

Hi @umehrot2, the datasource test will initialize the Spark context before each run. If the previous run didn't close Spark properly, this error will come out. See 4f74a84#diff-b9deb8bdc09b0440cafdf6354fe9068dR104
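For context, a minimal sketch of that per-test lifecycle (the class and method names are made up for illustration):

    import org.apache.spark.sql.SparkSession
    import org.junit.jupiter.api.{AfterEach, BeforeEach}

    class ExampleDataSourceTest {
      private var spark: SparkSession = _

      @BeforeEach
      def setUp(): Unit = {
        // A fresh session/context is created before each test.
        spark = SparkSession.builder()
          .appName("example-datasource-test")
          .master("local[2]")
          .getOrCreate()
      }

      @AfterEach
      def tearDown(): Unit = {
        // Without stopping Spark here, the next test's context initialization can fail.
        if (spark != null) spark.stop()
      }
    }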

@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch from 3f7ecde to caa597a on August 8, 2020
@umehrot2 umehrot2 force-pushed the umehrot2_hudi_rfc12_code_review branch from caa597a to 952a499 on August 8, 2020
@umehrot2 umehrot2 (Contributor Author) commented Aug 8, 2020

@vinothchandar the unit test issues are resolved now. But the integration tests are behaving erratically: they passed last time, and failed now even though I didn't make any code change. They are getting stuck for some reason. I think you mentioned this issue to me.

@garyli1019 (Member):

The integration test fails sometimes for no reason. I have seen this a few times. Maybe a rerun will fix it, if we're lucky.

@bvaradar bvaradar (Contributor) commented Aug 8, 2020

@umehrot2 : Thanks for the update. Yeah, the integration test flakiness is a known issue and the logs show the same pattern. Let me do one pass of it along with the other bootstrap PRs from @zhedoubushishi and land them. If there are any minor review comments, I will update the PRs myself to speed up landing.

@bvaradar bvaradar (Contributor) commented Aug 8, 2020

@umehrot2 : Can you confirm if all review comments are resolved and the PR is otherwise ready?

@umehrot2 umehrot2 (Contributor Author) commented Aug 8, 2020

@umehrot2 : Can you confirm if all review comments are resolved and the PR is otherwise ready?

@bvaradar Thanks for taking a look. Yes, the other PR comments are resolved, so it is otherwise ready.

@bvaradar bvaradar (Contributor) left a comment

Awesome work @umehrot2. Looks good overall. I have addressed the conflicts. Will land this tomorrow after the tests finish.

Boolean.parseBoolean(SQLConf.PARQUET_INT96_AS_TIMESTAMP().defaultValueString()));
StructType sparkSchema = converter.convert(parquetSchema);
String tableName = writeConfig.getTableName();
String structName = tableName + "_record";
Contributor:

@umehrot2 : ITTestBootstrapCommand is failing with the below exception. Adding a sanitization API to remove illegal characters from Avro field names.

Exception in thread "main" org.apache.avro.SchemaParseException: Illegal character in: test-table_record
    at org.apache.avro.Schema.validateName(Schema.java:1151)
    at org.apache.avro.Schema.access$200(Schema.java:81)
    at org.apache.avro.Schema$Name.<init>(Schema.java:489)
    at org.apache.avro.Schema.createRecord(Schema.java:161)
    at org.apache.avro.SchemaBuilder$RecordBuilder.fields(SchemaBuilder.java:1732)
    at org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:173)
    at org.apache.spark.sql.avro.SchemaConverters.toAvroType(SchemaConverters.scala)
    at org.apache.hudi.client.bootstrap.BootstrapSchemaProvider.getBootstrapSourceSchema(BootstrapSchemaProvider.java:97)
    at org.apache.hudi.client.bootstrap.BootstrapSchemaProvider.getBootstrapSchema(BootstrapSchemaProvider.java:66)
    at org.apache.hudi.table.action.bootstrap.BootstrapCommitActionExecutor.listAndProcessSourcePartitions(BootstrapCommitActionExecutor.java:288)
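A minimal sketch of such a sanitization helper (the helper name and replacement character are assumptions): Avro record names must match [A-Za-z_][A-Za-z0-9_]*, so a table name like "test-table" has to be rewritten before being used as a record name.

    // Replace characters that are illegal in Avro names and guard against a leading digit.
    def sanitizeAvroName(name: String): String = {
      val cleaned = name.map { c =>
        if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_') c
        else '_'
      }
      if (cleaned.nonEmpty && cleaned.head.isDigit) "_" + cleaned else cleaned
    }

    // e.g. sanitizeAvroName("test-table") + "_record" == "test_table_record"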
