Conversation

@jonvex (Contributor) commented Jul 30, 2025

Change Logs

Add support by implementing full projection, as we already have for Avro: take the data schema and prune it down to the requested columns, use the pruned schema to read the files, and then project the records to the requested schema (see the sketch below).
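
As a rough illustration of that flow, here is a minimal sketch. The pruneDataSchema call mirrors the one used later in this PR; readAvroSchema, readRecords, and projectRecord are hypothetical placeholders for the file-format utils, the reader, and the new projection logic:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hudi.avro.AvroSchemaUtils;

// Minimal sketch of the prune-then-project read flow described above.
// readAvroSchema, readRecords, and projectRecord are hypothetical placeholders.
List<GenericRecord> readProjected(Schema requiredSchema) {
  Schema fileSchema = readAvroSchema(storage, filePath);     // schema the file was written with
  Schema prunedSchema = AvroSchemaUtils.pruneDataSchema(
      fileSchema, requiredSchema, Collections.emptySet());   // keep only the requested columns
  List<GenericRecord> out = new ArrayList<>();
  for (GenericRecord record : readRecords(filePath, prunedSchema)) {
    out.add(projectRecord(record, requiredSchema));          // realign fields to the requested schema
  }
  return out;
}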

Impact

Hive supports reading tables with schema on write.

Risk level (write none, low, medium, or high below)

Medium. Covered by the schema-on-write file group reader tests.

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Jul 30, 2025
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Aug 2, 2025
-      return generateTypeInfo(
-          AvroSerdeUtils.getOtherTypeFromNullableType(schema), seenSchemas);
+    if (AvroSchemaUtils.isNullable(schema)) {
+      return generateTypeInfo(AvroSchemaUtils.resolveNullableSchema(schema), seenSchemas);
@jonvex (Contributor, Author):

This is for Hive version compatibility.
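
For context on what the replacement call does: a nullable Avro schema is a union of null with the actual type, and "resolving" it unwraps that union. A minimal standalone sketch of the idea (resolveNullable below is a hypothetical helper; Hudi's real AvroSchemaUtils.resolveNullableSchema may handle more cases):

import java.util.List;
import org.apache.avro.Schema;

// Hypothetical sketch: unwrap a nullable union like ["null", "string"] to "string".
static Schema resolveNullable(Schema schema) {
  if (schema.getType() == Schema.Type.UNION) {
    List<Schema> branches = schema.getTypes();
    if (branches.size() == 2) {
      // Return the non-null branch of the two-branch union.
      return branches.get(0).getType() == Schema.Type.NULL ? branches.get(1) : branches.get(0);
    }
  }
  return schema;
}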

boolean isParquet = filePath.getFileExtension().equals(HoodieFileFormat.PARQUET.getFileExtension());
Schema avroFileSchema = isParquet
    ? HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(filePath).readAvroSchema(storage, filePath)
    : dataSchema;
Schema actualRequiredSchema = isParquet
    ? AvroSchemaUtils.pruneDataSchema(avroFileSchema, requiredSchema, Collections.emptySet())
    : requiredSchema;
A reviewer (Contributor):

Can you add an inline comment explaining why the pruning is required for parquet only?

@jonvex (Contributor, Author):

It's actually that we don't want HFile, so I flipped the condition. For the metadata table (MDT), the schema read from the file is different from the table schema, and things fail if we try to use it.
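
In other words, the check is meant to exclude HFile base files rather than to single out Parquet. A hedged sketch of the flipped condition (variable names are illustrative, not the final code):

// Illustrative sketch: skip reading the file schema for HFile base files (e.g. the
// metadata table), where the stored schema may not match the table schema.
boolean isHFile = filePath.getFileExtension().equals(HoodieFileFormat.HFILE.getFileExtension());
Schema avroFileSchema = isHFile
    ? dataSchema  // fall back to the table's data schema
    : HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(filePath).readAvroSchema(storage, filePath);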

@the-other-tim-brown (Contributor) left a comment:

LGTM, @yihua can you take a look as well?

   Schema schema = getSchemaFromBufferRecord(bufferedRecord);
   ArrayWritable writable = bufferedRecord.getRecord();
-  return new HoodieHiveRecord(key, writable, schema, objectInspectorCache, bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
+  return new HoodieHiveRecord(key, writable, schema, getHiveAvroSerializer(schema), bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
A reviewer (Contributor):

The hash code of an Avro schema is cached, so the extra lookup should be negligible in computation cost.
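
For background, org.apache.avro.Schema computes its hash code lazily and caches it, so schema-keyed map lookups (which a serializer cache like getHiveAvroSerializer presumably performs) stay cheap. A minimal sketch of such a cache; nothing here is taken from the actual implementation:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import org.apache.avro.Schema;

// Illustrative schema-keyed cache: Schema#hashCode is computed once and cached,
// so repeated lookups avoid re-hashing the full schema tree.
class SerializerCache<V> {
  private final Map<Schema, V> cache = new ConcurrentHashMap<>();

  V getOrCreate(Schema schema, Function<Schema, V> factory) {
    return cache.computeIfAbsent(schema, factory);
  }
}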

@danny0405 (Contributor) left a comment:

+1, overall looks good. @jonvex, can you rebase on master to resolve the conflicts?

@github-project-automation github-project-automation bot moved this from 🆕 New to 🛬 Near landing in Hudi PR Support Aug 11, 2025
private JobConf getJobConf() {
  JobConf jobConf = new JobConf(storageConfiguration.unwrapAs(Configuration.class));
  // Hive passes the projected column names and their types to the reader via these confs.
  jobConf.set("columns", "field_1,field_2,field_3,datestr");
  jobConf.set("columns.types", "string,string,struct<nested_field:string>,string");
A reviewer (Contributor):

Have you tried querying a Hudi table with schema evolution on the Hive engine (not just in the unit tests) to make sure everything still works without relying on this conf provided by Hive (has this behavior changed now)?

@yihua (Contributor) left a comment:

It would be good to run queries on a large Hudi table on the Hive engine to make sure there is no noticeable performance difference.

@hudi-bot (Collaborator):

CI report:

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua (Contributor) left a comment:

LGTM. We can land this first to unblock other PRs.

 * The names of records, namespaces, or docs do not need to match. Nullability is ignored.
 */
public static boolean areSchemasProjectionEquivalent(Schema schema1, Schema schema2) {
  return AvroSchemaComparatorForRecordProjection.areSchemasProjectionEquivalent(schema1, schema2);
A reviewer (Contributor):

nit: AvroSchemaComparatorForRecordProjection#areSchemasProjectionEquivalent can be used directly without adding the AvroSchemaUtils#areSchemasProjectionEquivalent wrapper.
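
To make the documented semantics concrete, here is a hedged example of two schemas that should compare as projection-equivalent, given that record names, namespaces, and nullability are ignored (the expected result is an assumption based on the javadoc above):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ProjectionEquivalenceExample {
  public static void main(String[] args) {
    // Different record names/namespaces, and "b" is nullable in one schema only.
    Schema s1 = SchemaBuilder.record("RecA").fields()
        .requiredString("a")
        .optionalInt("b")   // union of null and int
        .endRecord();
    Schema s2 = SchemaBuilder.record("RecB").namespace("other.ns").fields()
        .requiredString("a")
        .requiredInt("b")   // plain int; nullability is ignored by the comparison
        .endRecord();
    // Per the javadoc, AvroSchemaUtils.areSchemasProjectionEquivalent(s1, s2)
    // should return true for these two schemas.
  }
}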


@Override
protected boolean validateField(Schema.Field f1, Schema.Field f2) {
  return f1.name().equalsIgnoreCase(f2.name());
@yihua (Contributor) commented Aug 13, 2025:

Is this intended for case insensitivity of column names?

@yihua yihua merged commit e7aae2e into apache:master Aug 13, 2025
61 checks passed
@github-project-automation github-project-automation bot moved this from 🛬 Near landing to ✅ Done in Hudi PR Support Aug 13, 2025

Labels

  • engine:hive: Hive integration
  • size:XL: PR with lines of changes > 1000

Projects

Status: ✅ Done

5 participants