@parisni
Contributor

@parisni parisni commented Nov 7, 2025

feat: support nested docs in the datasource. Improves on #8683.

Goal:

Full comment support for Hudi, from source to target.

Context:

As of 1.x and earlier, field comment support is partial:

  • hive-sync and related sync tools only sync comments for first-level fields.
  • Comments are retrieved through the schema resolver. Although they are stored in the Hudi Avro schema, they cannot yet be extracted, because the resolver ignores the Avro docs.
  • Avro supports nested comments, while the Hive metastore and similar catalogs only support first-level comments. However, Spark can store the schema structure in a table property and, if that property exists, uses it in priority.
  • Hudi can already optionally write the Spark structure, but it does not store comments at all: it relies on the Parquet schema, which does not capture comments anyway.
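
To make the "nested comments" point concrete, here is a minimal sketch in plain Python (standard library only) of walking an Avro schema and collecting the `doc` attribute of every field, at any depth. The schema, the field names and the `collect_docs` helper are hypothetical illustrations, not code from this PR or from Hudi.

```python
import json

# Hypothetical Avro record schema with a documented nested field.
AVRO_SCHEMA = json.loads("""
{
  "type": "record", "name": "trip",
  "fields": [
    {"name": "id", "type": "string", "doc": "trip identifier"},
    {"name": "driver", "doc": "driver details", "type": {
      "type": "record", "name": "driver_t",
      "fields": [
        {"name": "name", "type": ["null", "string"], "doc": "full name"},
        {"name": "rating", "type": "double"}
      ]}}
  ]
}
""")


def collect_docs(schema, prefix=""):
    """Return {dotted.field.path: doc} for every documented field."""
    docs = {}
    if isinstance(schema, list):
        # Union: look inside each branch.
        for branch in schema:
            docs.update(collect_docs(branch, prefix))
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "record":
            for field in schema["fields"]:
                path = prefix + field["name"]
                if "doc" in field:
                    docs[path] = field["doc"]
                # Recurse so that nested records contribute their docs too.
                docs.update(collect_docs(field["type"], path + "."))
        elif kind == "array":
            docs.update(collect_docs(schema["items"], prefix))
        elif kind == "map":
            docs.update(collect_docs(schema["values"], prefix))
        elif isinstance(kind, (dict, list)):
            # Wrapped type such as {"type": {"type": "record", ...}}.
            docs.update(collect_docs(kind, prefix))
    return docs


all_docs = collect_docs(AVRO_SCHEMA)
# Keeping only undotted paths mimics a first-level-only sync.
first_level_docs = {path: doc for path, doc in all_docs.items() if "." not in path}
```

In this sketch `first_level_docs` loses the comment on `driver.name`, which is the kind of gap described in the list above.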

What this PR covers:

  1. enables extracting nested comments from the Avro schema
  2. replaces the generation of the Spark structure with a new util that leverages the Avro schema directly instead of the Parquet schema
  3. provides full support for storing comments in the Spark schema
  4. enables first-level comments in the metastore (they were only partially working because of the lack of support for item 1)
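
As a rough illustration of items 2 and 3, the sketch below converts a small Avro schema into the JSON layout Spark uses for a struct type, carrying each Avro `doc` over as a `comment` entry in the field metadata. It is plain Python and is not the util added by this PR: the `spark_type` and `spark_field` helpers and the sample schema are hypothetical, and only records, arrays, primitives and `[null, T]` unions are handled.

```python
import json

# Avro primitive name -> Spark JSON type name (subset).
AVRO_TO_SPARK = {
    "string": "string", "int": "integer", "long": "long",
    "float": "float", "double": "double",
    "boolean": "boolean", "bytes": "binary",
}


def spark_type(avro_type):
    """Return (spark_json_type, nullable) for a subset of Avro types."""
    if isinstance(avro_type, list):
        branches = [t for t in avro_type if t != "null"]
        if len(branches) != 1:
            raise ValueError("this sketch only handles [null, T] unions")
        converted, _ = spark_type(branches[0])
        return converted, "null" in avro_type
    if isinstance(avro_type, str):
        return AVRO_TO_SPARK[avro_type], False
    kind = avro_type["type"]
    if kind == "record":
        fields = [spark_field(f) for f in avro_type["fields"]]
        return {"type": "struct", "fields": fields}, False
    if kind == "array":
        element, nullable = spark_type(avro_type["items"])
        return {"type": "array", "elementType": element,
                "containsNull": nullable}, False
    raise ValueError(f"unsupported Avro type in this sketch: {kind!r}")


def spark_field(field):
    """Convert one Avro field, keeping its doc as a Spark comment."""
    converted, nullable = spark_type(field["type"])
    metadata = {"comment": field["doc"]} if "doc" in field else {}
    return {"name": field["name"], "type": converted,
            "nullable": nullable, "metadata": metadata}


# Hypothetical schema, for illustration only.
avro_schema = json.loads("""
{
  "type": "record", "name": "trip",
  "fields": [
    {"name": "id", "type": "string", "doc": "trip identifier"},
    {"name": "driver", "doc": "driver details", "type": {
      "type": "record", "name": "driver_t",
      "fields": [
        {"name": "name", "type": ["null", "string"], "doc": "full name"},
        {"name": "rating", "type": "double"}
      ]}}
  ]
}
""")

spark_schema, _ = spark_type(avro_schema)
print(json.dumps(spark_schema, indent=2))
```

Starting from Avro rather than Parquet matters here because the `doc` attributes are only available on the Avro side, as noted in the context section.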

Reminder:

  • to activate metastore comments, set: hoodie.datasource.hive_sync.sync_comment
  • to activate Spark schema structure comments, set: [hoodie.datasource.hive_sync.sync_as_datasource](https://hudi.apache.org/docs/0.15.0/configurations#hoodiedatasourcehive_syncsync_as_datasource)
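
For reference, a minimal sketch of how the two options above could be grouped for a PySpark write. Only the two option keys come from this description; the `df` and `base_path` names in the commented usage are hypothetical, and the other Hudi and hive-sync options a real job needs are omitted.

```python
# Options named in the reminder above; values are passed as strings,
# like other Spark datasource options.
comment_sync_options = {
    "hoodie.datasource.hive_sync.sync_comment": "true",
    "hoodie.datasource.hive_sync.sync_as_datasource": "true",
}

# Hypothetical usage with an existing DataFrame `df` and table path `base_path`:
# df.write.format("hudi").options(**comment_sync_options).mode("append").save(base_path)
```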

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label Nov 7, 2025
@parisni parisni force-pushed the feat-e2e-support-schema-comment branch 5 times, most recently from de214ce to 37809ac Compare November 11, 2025 14:49
@parisni parisni changed the title from "[HUDI-5533] Support spark columns comments" to "[HUDI-5533] Support spark nested columns comments" Nov 11, 2025
@parisni parisni force-pushed the feat-e2e-support-schema-comment branch 2 times, most recently from f9f84ee to 864a9ad Compare November 11, 2025 15:47
feat: support nested doc in hms datasource

fix

fix

wip

wip
@parisni parisni force-pushed the feat-e2e-support-schema-comment branch 4 times, most recently from 72ab191 to 236b589 Compare November 11, 2025 16:55
@parisni parisni force-pushed the feat-e2e-support-schema-comment branch from 236b589 to 86ffe68 Compare November 11, 2025 17:36
@parisni parisni marked this pull request as ready for review November 12, 2025 12:50
@parisni
Contributor Author

parisni commented Nov 12, 2025

@yihua what about also landing this in 1.1?

@parisni
Contributor Author

parisni commented Nov 12, 2025

Also maybe @danny0405, since you followed this topic previously and this touches Flink a bit too.

@danny0405
Contributor

@parisni 1.1 RC2 is already out, and I don't think this is a blocker for it. I can help with the review, though. Can you check the test failures?

@parisni parisni mentioned this pull request Nov 13, 2025
3 tasks
The Avro schema now provides the docs, which is not the case for the method
tested here, so ignore metadata to resolve the schema mismatch.
@parisni
Contributor Author

parisni commented Nov 13, 2025

@danny0405 I fixed the tests. Some are still failing, but they look like flaky tests.
Confirmed flaky:

  • TestJavaHoodieBackedMetadata.testReattemptOfFailedClusteringCommit
  • TestNewHoodieParquetFileFormat.testNewParquetFileFormat

Contributor

@yihua yihua left a comment

@parisni Could you rebase your branch on top of the latest master? A few fixes for CI flakiness have been merged.

@parisni parisni requested a review from yihua December 1, 2025 08:35
@parisni
Contributor Author

parisni commented Dec 22, 2025

Hi @yihua @danny0405, I rebased on master again.
