Skip to content

Comments

[HUDI-3838] Implemented drop partition column feature for delta streamer code path#5294

Merged
codope merged 4 commits intoapache:masterfrom
vingov:HUDI-3838_delta_streamer_drop_partition_column
Apr 12, 2022
Merged

[HUDI-3838] Implemented drop partition column feature for delta streamer code path#5294
codope merged 4 commits intoapache:masterfrom
vingov:HUDI-3838_delta_streamer_drop_partition_column

Conversation

@vingov
Copy link

@vingov vingov commented Apr 12, 2022

What is the purpose of the pull request

Add support for hoodie.datasource.write.drop.partition.columns config for DeltaStreamer code path.

Brief change log

  • Add support for hoodie.datasource.write.drop.partition.columns config for DeltaStreamer code path.

Verify this pull request

(Please pick either of the following options)

This change added tests and can be verified as follows:

  • Added method to TestHoodieAvroUtils class to verify the change.
  • Manually verified the change by running a job locally using the local docker setup, the BigQuery tables were created properly as expected.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

* <p>
* To better understand how it removes please check {@link #rewriteRecord(GenericRecord, Schema)}
*/
public static GenericRecord removeFields(GenericRecord record, List<String> columnsToRemove) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason for not invoking rewriteRecord method directly from DeltaStreamer code is because I was getting serialization error org.apache.spark.SparkException: Task not serializable when passing the targetSchema to that method from the map lambda expression.

@nsivabalan nsivabalan added the priority:blocker Production down; release blocker label Apr 12, 2022
Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @vingov for quickly fixing this. Looks good. Left a few minor comments.

CI is failing. I think that's because we're not passing default value. Please check my comments below.

String partitionColumns = HoodieSparkUtils.getPartitionColumns(keyGenerator, props);
List<String> listOfPartitionColumns = Arrays.asList(partitionColumns.split(","));
JavaRDD<GenericRecord> avroRDD = avroRDDOptional.get();
HoodieKey key = keyGenerator.getKey(avroRDD.first());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is right. we have to move this within map (). for each genRec, we have to first run through key generator and then modify the genRec if need be (drop partition cols).
Also, can we move lines 482 to 494 to separate method and keep this method cleaner/leaner.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it cleaner I picked the first record for generating the key, I see your point, updated it.

@codope
Copy link
Member

codope commented Apr 12, 2022

@hudi-bot run azure

@apache apache deleted a comment from hudi-bot Apr 12, 2022
Copy link
Member

@codope codope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Verified the patch. Below is table config and schema in a commit file:

#Updated at 2022-04-12T10:02:39.874Z
#Tue Apr 12 15:32:39 IST 2022
hoodie.table.precombine.field=tpep_dropoff_datetime
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.partition.fields=date_col
hoodie.table.type=COPY_ON_WRITE
hoodie.archivelog.folder=archived
hoodie.populate.meta.fields=true
hoodie.partition.metafile.use.base.format=false
hoodie.timeline.layout.version=1
hoodie.table.version=4
hoodie.table.metadata.partitions=column_stats,files
hoodie.table.recordkey.fields=tpep_pickup_datetime
hoodie.table.base.file.format=PARQUET
hoodie.table.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.table.name=ny_hudi_tbl
hoodie.table.metadata.partitions.inflight=
hoodie.table.checksum=1182839966

* Schema (note that date_col is the partition column which is not written as part of schema) *

"extraMetadata" : {
    "schema" : "{\"type\":\"record\",\"name\":\"test_struct_name\",\"namespace\":\"test_record_namespace\",\"fields\":[{\"name\":\"VendorID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"tpep_pickup_datetime\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"tpep_dropoff_datetime\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"passenger_count\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"trip_distance\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"RatecodeID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"store_and_fwd_flag\",\"type\":[\"null\",\"string\"],\"default\":null},{\"name\":\"PULocationID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"DOLocationID\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"payment_type\",\"type\":[\"null\",\"int\"],\"default\":null},{\"name\":\"fare_amount\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"extra\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"mta_tax\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"tip_amount\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"tolls_amount\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"improvement_surcharge\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"total_amount\",\"type\":[\"null\",\"double\"],\"default\":null},{\"name\":\"congestion_surcharge\",\"type\":[\"null\",\"double\"],\"default\":null}]}",
    "deltastreamer.checkpoint.key" : "1634021149000"
  }

Would be good to add a UT in TestHoodieDeltastreamer as a followup. Tracking in HUDI-3863

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@codope codope merged commit d167409 into apache:master Apr 12, 2022
JavaRDD<GenericRecord> avroRDD = avroRDDOptional.get();
JavaRDD<HoodieRecord> records = avroRDD.map(gr -> {
JavaRDD<HoodieRecord> records = avroRDD.map(record -> {
GenericRecord gr = isDropPartitionColumns() ? HoodieAvroUtils.removeFields(record, getPartitionColumns(keyGenerator, props)) : record;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getPartitionColumns(keyGenerator, props) could have been done once in the driver.

Copy link
Member

@codope codope Apr 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense.. please land #5302 if it looks good.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created the same PR #5303 as well.

xushiyan pushed a commit that referenced this pull request Apr 14, 2022
…mer code path (#5294)

* [HUDI-3838] Implemented drop partition column feature for delta streamer code path

* Ensure drop partition table config is updated in hoodie.props

Co-authored-by: Sagar Sumit <sagarsumit09@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

priority:blocker Production down; release blocker

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants