@bhat-vinay bhat-vinay commented Dec 7, 2023

Change Logs

Fixes #9991

Issue:
There are two configs which, when set in a certain manner, cause exceptions or assertion failures:

  1. Configs to disable populating metadata fields (for each row)
  2. Configs to drop partition columns (to save storage space) from a row

With both 1 and 2 set, partition paths cannot be deduced from the partition columns (as the partition columns are dropped higher up the stack). BulkInsertDataInternalWriterHelper::write(...) relied on metadata fields to extract the partition path in such cases. But with 1 set, that is not possible, resulting in asserts/exceptions.

The fix is to push the dropping of partition columns down the stack, to after the partition path is computed. The fix manipulates the raw 'InternalRow' structure by copying only the relevant fields into a new 'InternalRow' structure. Each row is processed individually to drop the partition columns and copy it to a new 'InternalRow'.
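
For readers who want the ordering spelled out, here is a minimal sketch of the idea, not the code in this PR. It assumes rows can be modelled as plain `Object[]` arrays instead of Spark's `InternalRow`, and the names `DropPartitionColumnsSketch`, `partitionPath`, `dropPartitionColumns` and `partitionOrdinals` are hypothetical. The point is only that the partition path is computed first, while the partition columns are still present, and the row is trimmed afterwards.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in only: rows are Object[] here, not Spark InternalRow,
// and none of these names exist in Hudi.
final class DropPartitionColumnsSketch {

    // Step 1: build the partition path from the partition column values
    // while they are still part of the row.
    public static String partitionPath(Object[] row, List<Integer> partitionOrdinals) {
        StringBuilder path = new StringBuilder();
        for (int ordinal : partitionOrdinals) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(row[ordinal]);
        }
        return path.toString();
    }

    // Step 2: copy only the non-partition fields into a new row.
    // Assumes the ordinals are distinct and within range.
    public static Object[] dropPartitionColumns(Object[] row, List<Integer> partitionOrdinals) {
        Object[] trimmed = new Object[row.length - partitionOrdinals.size()];
        int next = 0;
        for (int i = 0; i < row.length; i++) {
            if (!partitionOrdinals.contains(i)) {
                trimmed[next++] = row[i];
            }
        }
        return trimmed;
    }

    public static void main(String[] args) {
        Object[] row = {"trip-1", "2023-12-07", 17.5};
        List<Integer> partitionOrdinals = Arrays.asList(1);

        // The path is resolved before the partition column is dropped.
        String path = partitionPath(row, partitionOrdinals);
        Object[] toWrite = dropPartitionColumns(row, partitionOrdinals);

        System.out.println(path + " -> " + Arrays.toString(toWrite));
    }
}
```

Because a new array is allocated for every row, the sketch also shows where the per-row copy cost comes from when partition columns are dropped.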

Impact

No public API or user-facing changes. However, each InternalRow structure is processed and copied individually when the config to drop partition columns is set.

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@bhat-vinay bhat-vinay marked this pull request as draft December 7, 2023 12:45
@codope codope left a comment

Thanks for the fix! Left some comments. Can you also check the CI failures?

@bhat-vinay bhat-vinay force-pushed the HUDI-7040-fix-drop-partition-columns branch from 70388e7 to 9e4cc4f Compare December 8, 2023 06:14
@bhat-vinay bhat-vinay marked this pull request as ready for review December 11, 2023 04:09
@bhat-vinay

Thanks for the comments @codope. Addressed comments.

@codope codope changed the title from "[WIP] [HUDI-7040] Handle dropping of partition columns" to "[HUDI-7040] Handle dropping of partition columns with populate meta fields disabled" Dec 11, 2023
…ernalWriterHelper::write(...)

@bhat-vinay bhat-vinay force-pushed the HUDI-7040-fix-drop-partition-columns branch from 9e4cc4f to 3827b7b Compare December 11, 2023 07:07
@codope codope left a comment

Looks good. Can land once the CI is green.

@apache apache deleted a comment from hudi-bot Dec 11, 2023
@codope codope merged commit b181063 into apache:master Dec 11, 2023
nsivabalan pushed a commit that referenced this pull request Dec 18, 2023
…ernalWriterHelper::write(...) (#10272)

Co-authored-by: Vinaykumar Bhat <[email protected]>
@bhat-vinay bhat-vinay deleted the HUDI-7040-fix-drop-partition-columns branch April 4, 2024 02:48
def testBulkInsertForDropPartitionColumn(): Unit = {
  // create a new table
  val tableName = "trips_table"
  val basePath = "file:///tmp/trips_table"
Member

cannot do this

@codope codope Sep 7, 2024

bad miss! Fixing in #11912

Development

Successfully merging this pull request may close these issues.

[SUPPORT] Can not extract Partition Path with conf populateMetaFields set false and dropPartitionColumns set true

4 participants