Conversation

@jonvex (Contributor) commented Jul 30, 2025

Change Logs

Add support by implementing full projection, as we already have for Avro: take the data schema and prune it down to the requested columns, use the pruned schema to read the files, and then project the records to the requested schema (see the sketch below).
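
As a rough illustration of that flow, here is a minimal sketch. The pruneDataSchema call mirrors the one used later in this PR; readAvroSchema, readRecords, and projectRecord are hypothetical placeholders for the file-format utils, the reader, and the new projection logic:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.AvroSchemaUtils;

// Minimal sketch of the prune-then-project read flow described above.
// readAvroSchema, readRecords, and projectRecord are hypothetical placeholders.
List<GenericRecord> readProjected(Schema requiredSchema) {
  Schema fileSchema = readAvroSchema(storage, filePath);     // schema the file was written with
  Schema prunedSchema = AvroSchemaUtils.pruneDataSchema(
      fileSchema, requiredSchema, Collections.emptySet());   // keep only the requested columns
  List<GenericRecord> out = new ArrayList<>();
  for (GenericRecord record : readRecords(filePath, prunedSchema)) {
    out.add(projectRecord(record, requiredSchema));          // realign fields to the requested schema
  }
  return out;
}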

Impact

Hive supports reading tables with schema on write.

Risk level (write none, low, medium, or high below)

Medium. Covered by the schema-on-write file group reader tests.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jul 30, 2025
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Aug 2, 2025
-      return generateTypeInfo(
-          AvroSerdeUtils.getOtherTypeFromNullableType(schema), seenSchemas);
+    if (AvroSchemaUtils.isNullable(schema)) {
+      return generateTypeInfo(AvroSchemaUtils.resolveNullableSchema(schema), seenSchemas);
@jonvex (Contributor, Author):

This is for Hive version compatibility.
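
For context on what the replacement call does: a nullable Avro schema is a union of null with the actual type, and "resolving" it unwraps that union. A minimal standalone sketch of the idea (resolveNullable below is a hypothetical helper; Hudi's real AvroSchemaUtils.resolveNullableSchema may handle more cases):

import java.util.List;
import org.apache.avro.Schema;

// Hypothetical sketch: unwrap a nullable union like ["null", "string"] to "string".
static Schema resolveNullable(Schema schema) {
  if (schema.getType() == Schema.Type.UNION) {
    List<Schema> branches = schema.getTypes();
    if (branches.size() == 2) {
      // Return the non-null branch of the two-branch union.
      return branches.get(0).getType() == Schema.Type.NULL ? branches.get(1) : branches.get(0);
    }
  }
  return schema;
}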

boolean isParquet = filePath.getFileExtension().equals(HoodieFileFormat.PARQUET.getFileExtension());
Schema avroFileSchema = isParquet
    ? HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(filePath).readAvroSchema(storage, filePath)
    : dataSchema;
Schema actualRequiredSchema = isParquet
    ? AvroSchemaUtils.pruneDataSchema(avroFileSchema, requiredSchema, Collections.emptySet())
    : requiredSchema;
A reviewer (Contributor):

Can you add an inline comment explaining why the pruning is required for parquet only?

@jonvex (Contributor, Author):

It's actually that we don't want HFile, so I flipped the condition. For the metadata table (MDT), the schema read from the file is different from the table schema, and things fail if we try to use it.
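
In other words, the check is meant to exclude HFile base files rather than to single out Parquet. A hedged sketch of the flipped condition (variable names are illustrative, not the final code):

// Illustrative sketch: skip reading the file schema for HFile base files (e.g. the
// metadata table), where the stored schema may not match the table schema.
boolean isHFile = filePath.getFileExtension().equals(HoodieFileFormat.HFILE.getFileExtension());
Schema avroFileSchema = isHFile
    ? dataSchema  // fall back to the table's data schema
    : HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(filePath).readAvroSchema(storage, filePath);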

@the-other-tim-brown (Contributor) left a comment:

LGTM, @yihua can you take a look as well?

   Schema schema = getSchemaFromBufferRecord(bufferedRecord);
   ArrayWritable writable = bufferedRecord.getRecord();
-  return new HoodieHiveRecord(key, writable, schema, objectInspectorCache, bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
+  return new HoodieHiveRecord(key, writable, schema, getHiveAvroSerializer(schema), bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
A reviewer (Contributor):

The hash code of an Avro schema is cached, so the extra lookup should be negligible in computation cost.
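
For background, org.apache.avro.Schema computes its hash code lazily and caches it, so schema-keyed map lookups (which a serializer cache like getHiveAvroSerializer presumably performs) stay cheap. A minimal sketch of such a cache; nothing here is taken from the actual implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.apache.avro.Schema;

// Illustrative schema-keyed cache: Schema#hashCode is computed once and cached,
// so repeated lookups avoid re-hashing the full schema tree.
class SerializerCache<V> {
  private final Map<Schema, V> cache = new ConcurrentHashMap<>();

  V getOrCreate(Schema schema, Function<Schema, V> factory) {
    return cache.computeIfAbsent(schema, factory);
  }
}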

@danny0405 (Contributor) left a comment:

+1, overall looks good. @jonvex, can you rebase on master to resolve the conflicts?

@github-project-automation github-project-automation bot moved this from 🆕 New to 🛬 Near landing in Hudi PR Support Aug 11, 2025
private JobConf getJobConf() {
  JobConf jobConf = new JobConf(storageConfiguration.unwrapAs(Configuration.class));
  // Hive passes the projected column names and their types to the reader via these confs.
  jobConf.set("columns", "field_1,field_2,field_3,datestr");
  jobConf.set("columns.types", "string,string,struct<nested_field:string>,string");
A reviewer (Contributor):

Have you tried querying a Hudi table with schema evolution on the Hive engine (not just in the unit tests) to make sure everything still works without relying on this conf provided by Hive (has this behavior changed now)?

@yihua (Contributor) left a comment:

It would be good to run queries on a large Hudi table on the Hive engine to make sure there is no noticeable performance difference.

@hudi-bot (Collaborator):

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment:

LGTM. We can land this first to unblock other PRs.

 * The names of records, namespaces, or docs do not need to match. Nullability is ignored.
 */
public static boolean areSchemasProjectionEquivalent(Schema schema1, Schema schema2) {
  return AvroSchemaComparatorForRecordProjection.areSchemasProjectionEquivalent(schema1, schema2);
A reviewer (Contributor):

nit: AvroSchemaComparatorForRecordProjection#areSchemasProjectionEquivalent can be used directly without adding the AvroSchemaUtils#areSchemasProjectionEquivalent wrapper.
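
To make the documented semantics concrete, here is a hedged example of two schemas that should compare as projection-equivalent, given that record names, namespaces, and nullability are ignored (the expected result is an assumption based on the javadoc above):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ProjectionEquivalenceExample {
  public static void main(String[] args) {
    // Different record names/namespaces, and "b" is nullable in one schema only.
    Schema s1 = SchemaBuilder.record("RecA").fields()
        .requiredString("a")
        .optionalInt("b")   // union of null and int
        .endRecord();
    Schema s2 = SchemaBuilder.record("RecB").namespace("other.ns").fields()
        .requiredString("a")
        .requiredInt("b")   // plain int; nullability is ignored by the comparison
        .endRecord();
    // Per the javadoc, AvroSchemaUtils.areSchemasProjectionEquivalent(s1, s2)
    // should return true for these two schemas.
  }
}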


@Override
protected boolean validateField(Schema.Field f1, Schema.Field f2) {
  return f1.name().equalsIgnoreCase(f2.name());
@yihua (Contributor) commented Aug 13, 2025:

Is this intended for case insensitivity of column names?

@yihua yihua merged commit e7aae2e into apache:master Aug 13, 2025
61 checks passed
@github-project-automation github-project-automation bot moved this from 🛬 Near landing to ✅ Done in Hudi PR Support Aug 13, 2025

Labels

  • engine:hive: Hive integration
  • size:XL: PR with lines of changes > 1000

Projects

Status: ✅ Done

5 participants