[HUDI-7040] Handle dropping of partition columns with populate meta fields disabled #10272
Conversation
codope left a comment
Thanks for the fix! Left some comments. Can you also check the CI failures?
Resolved review threads on:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java
...nt/src/main/java/org/apache/hudi/table/action/commit/BulkInsertDataInternalWriterHelper.java (outdated)
...-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala (outdated)
...-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala
Force-pushed from 70388e7 to 9e4cc4f.
Thanks for the comments @codope. Addressed comments.
Resolved review threads on:
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/TestHoodieSparkSqlWriter.scala (outdated)
...-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala (outdated)
…ernalWriterHelper::write(...)
Force-pushed from 9e4cc4f to 3827b7b.
codope left a comment
Looks good. Can land once the CI is green.
…ernalWriterHelper::write(...) (#10272)

Issue: There are two configs which, when set together, cause exceptions or asserts:
1. Config to disable populating metadata fields (for each row)
2. Config to drop partition columns (to save storage space) from a row

With #1 and #2, partition paths cannot be deduced using partition columns (as the partition columns are dropped higher up the stack). BulkInsertDataInternalWriterHelper::write(...) relied on metadata fields to extract the partition path in such cases, but with #1 that is not possible, resulting in asserts/exceptions.

The fix is to push the dropping of partition columns down the stack, after the partition path is computed. It manipulates the raw 'InternalRow' structure by copying only the relevant fields into a new 'InternalRow'. Each row is processed individually to drop the partition columns and copy it to a new 'InternalRow'.

Co-authored-by: Vinaykumar Bhat <[email protected]>
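For illustration only, here is a minimal sketch of the per-row projection the commit message describes, under stated assumptions: the helper name, its signature, and the use of GenericInternalRow are mine and are not code from this PR.

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
    import org.apache.spark.sql.types.StructType

    object DropPartitionColumnsSketch {
      // Copies every non-partition field of 'row' into a fresh InternalRow.
      // Intended to run per row, after the partition path has already been
      // computed from the still-complete row.
      def dropPartitionColumns(row: InternalRow,
                               schema: StructType,
                               partitionCols: Set[String]): InternalRow = {
        // Ordinals of the fields to keep (everything except partition columns).
        val keptOrdinals = schema.fields.zipWithIndex
          .collect { case (f, i) if !partitionCols.contains(f.name) => i }
        val values = new Array[Any](keptOrdinals.length)
        var i = 0
        while (i < keptOrdinals.length) {
          val src = keptOrdinals(i)
          values(i) = row.get(src, schema.fields(src).dataType)
          i += 1
        }
        new GenericInternalRow(values)
      }
    }

In the write path such a projection would sit after the partition path is extracted, so the writer still sees the partition columns for path computation while the persisted row does not carry them.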
    def testBulkInsertForDropPartitionColumn(): Unit = {
      //create a new table
      val tableName = "trips_table"
      val basePath = "file:///tmp/trips_table"
cannot do this
bad miss! Fixing in #11912
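As an aside, a small sketch of one way to avoid the hardcoded path flagged in the comments above; this is only an assumption for illustration, not necessarily how #11912 fixes it.

    import java.nio.file.Files

    // Create a throwaway directory for this test run instead of hardcoding /tmp.
    val basePath = Files.createTempDirectory("trips_table").toUri.toString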
Change Logs
Fixes #9991
Issue:
There are two configs which, when set together, cause exceptions or asserts:
1. Config to disable populating metadata fields (for each row)
2. Config to drop partition columns (to save storage space) from a row

With 1 and 2 above, partition paths cannot be deduced using partition columns (as the partition columns are dropped higher up the stack). BulkInsertDataInternalWriterHelper::write(...) relied on metadata fields to extract the partition path in such cases, but with 1 alone (meta fields disabled) that is not possible, resulting in asserts/exceptions.

The fix is to push the dropping of partition columns down the stack, after the partition path is computed. It manipulates the raw 'InternalRow' structure by copying only the relevant fields into a new 'InternalRow'. Each row is processed individually to drop the partition columns and copy it to a new 'InternalRow'.
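For concreteness, a hedged sketch of the config combination described above on the Spark datasource write path; df and basePath are assumed to already exist, and the record key and partition column names are placeholders.

    import org.apache.spark.sql.SaveMode

    df.write.format("hudi")
      .option("hoodie.table.name", "trips_table")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.partitionpath.field", "city")
      .option("hoodie.datasource.write.operation", "bulk_insert")
      // 1. do not populate Hudi meta fields on each row
      .option("hoodie.populate.meta.fields", "false")
      // 2. drop the partition columns from the stored rows
      .option("hoodie.datasource.write.drop.partition.columns", "true")
      .mode(SaveMode.Overwrite)
      .save(basePath)

Before this fix, the bulk insert row writer would fail on this combination because it could read the partition path neither from meta fields nor from the already-dropped partition columns.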
Impact
No public API or user-facing changes. However, each InternalRow structure is processed and copied individually when the config is set.
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If so, create a Jira ticket, attach the ticket number here and follow the instruction to make changes to the website.
Contributor's checklist