Timestamp millis repair #14120
Conversation
case LONG:
  if (oldSchema.getLogicalType() != newSchema.getLogicalType()) {
    if (oldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
    if (skipLogicalTimestampEvolution || oldSchema.getLogicalType() == null || newSchema.getLogicalType() == null) {
I didn't get why we need this flag skipLogicalTimestampEvolution; shouldn't we always rewrite the field if the logical types mismatch?
Based on my understanding, AvroSchemaCompatibility#calculateCompatibility did not validate logical timestamp evolution before, so a timestamp-micros to timestamp-millis change could happen, which leads to precision loss; such schema evolution should not be allowed.
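To make the precision loss concrete, here is a minimal, self-contained illustration (not code from the PR): converting an epoch value from microseconds to milliseconds truncates the sub-millisecond digits, and converting back cannot recover them.

public class PrecisionLossDemo {
  public static void main(String[] args) {
    long micros = 1_700_000_000_123_456L;       // timestamp-micros value with sub-millisecond digits
    long millis = micros / 1_000L;              // micros -> millis drops the trailing 456 microseconds
    long roundTripped = millis * 1_000L;        // millis -> micros cannot restore them
    System.out.println(micros == roundTripped); // prints false: 1700000000123456 vs 1700000000123000
  }
}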
However, for handling the timestamp issue this PR addresses, the ingestion writer needs to rewrite the schema from timestamp micros to millis.
Yes, this is true, but there is actually another case where we do need to support micros -> millis:
if the user has a transformer but is using the Avro writer (which is still the standard path), then while the data is in Spark it will always be in micros, so when we convert the Spark rows back to Avro they will be in micros. But if the target schema specifies millis, we then need to convert micros to millis.
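A minimal sketch of that writer-side alignment, assuming the field schemas are plain (non-union) Avro longs; the helper name here is hypothetical, not something from the PR:

import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class TimestampAlign {
  // Hypothetical helper: rescale a long produced by Spark (timestamp-micros) when the
  // target Avro field declares timestamp-millis.
  static long alignToTarget(long value, Schema sourceField, Schema targetField) {
    boolean sourceMicros = sourceField.getLogicalType() instanceof LogicalTypes.TimestampMicros;
    boolean targetMillis = targetField.getLogicalType() instanceof LogicalTypes.TimestampMillis;
    if (sourceMicros && targetMillis) {
      return value / 1_000L; // truncates sub-millisecond precision, as noted above
    }
    return value;
  }
}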
"so timestamp micros to timestamp millis can happen which leads to precision loss, and such schema evolution should not be allowed"
This may be right, but at least a type promotion like ts(3) to ts(6) should always be allowed? I didn't really get the flag skipLogicalTimestampEvolution, or in which case we can skip the evolution. And the code here does handle the micros -> millis conversion; if it is disallowed, shouldn't we forbid it?
.withDocumentation("Enables support for Schema Evolution feature");

public static final ConfigProperty<Boolean> SCHEMA_EVOLUTION_ALLOW_LOGICAL_EVOLUTION = ConfigProperty
    .key("hoodie.schema.evolution.allow.logical.evolution")
Why do we need this flag? timestamp-millis to/from timestamp-micros should always be feasible in schema evolution.
As mentioned in the other thread, timestamp-micros to timestamp-millis should not be allowed as it loses precision.
Then why do we need this flag if the precision loss is never allowed in schema evolution? Just abandon the conversion in the code?
BTW, the timestamp type is not the only logical type in an Avro schema, right? And from this name I can't tell that it is about preventing precision loss. At least type promotion should be allowed.
I think this flag can be added in an independent PR, not coupled with the logical timestamp fix, to avoid confusion.
On master, the timestamp-micros to timestamp-millis type change is allowed although it should not be, given that it incurs precision loss. Thus, for new tables on table version 9, such a type change in the schema should be validated and disallowed.
However, for the logical timestamp fix to work, the Hudi streamer needs to automatically change the field type from timestamp-micros (introduced by the regression) back to timestamp-millis based on the target schema in the schema provider. So such a field type change needs to be allowed for table version 8 and below, where the regression incorrectly changed the field from timestamp-millis in the schema provider's target schema to timestamp-micros in the table schema.
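A minimal sketch of that gating, assuming the table version is available as an integer; the threshold constant and method name are illustrative, not the PR's:

public class LogicalEvolutionGate {
  // Assumption: tables on version 9 or later validate and reject lossy timestamp-micros ->
  // timestamp-millis changes, while older tables allow them so the repair can be applied.
  private static final int FIRST_VERSION_WITH_VALIDATION = 9;

  static boolean allowLossyTimestampEvolution(int tableVersion) {
    return tableVersion < FIRST_VERSION_WITH_VALIDATION;
  }
}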
this.readRecords++;
if (this.promotedSchema.isPresent()) {
  return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get());
  return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get(), skipLogicalTimestampEvolution);
I thought we only had problems with Parquet files; do Avro logs also have a precision mismatch between the timestamp type and its values? The Avro schema in the log block header comes from the table schema, which should be correct, right?
It's not a Parquet problem; it affects anything ingested with Deltastreamer. When the issue happens, the table schema still matches the data schema.
if (isTimestampMicros(fileType) && isTimestampMillis(tableType)) {
  columnsToMultiply.add(path);
} else if (isLong(fileType) && isLocalTimestampMillis(tableType)) {
  columnsToMultiply.add(path);
Is this a new breaking case to handle?
It's not new; I brought it up a bunch of times. In the doc we have rows showing what is happening.
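For reference, one plausible shape of the isTimestampMicros/isTimestampMillis helpers used in the diff above, assuming the column types are parquet-java Types (this is an illustration, not the PR's ParquetTimestampUtils):

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation;
import org.apache.parquet.schema.Type;

public class ParquetTimestampPredicates {
  // True when the Parquet column carries a TIMESTAMP logical type annotation with the given unit.
  static boolean isTimestampWithUnit(Type parquetType, TimeUnit unit) {
    LogicalTypeAnnotation annotation = parquetType.getLogicalTypeAnnotation();
    return annotation instanceof TimestampLogicalTypeAnnotation
        && ((TimestampLogicalTypeAnnotation) annotation).getUnit() == unit;
  }

  static boolean isTimestampMicros(Type parquetType) {
    return isTimestampWithUnit(parquetType, TimeUnit.MICROS);
  }

  static boolean isTimestampMillis(Type parquetType) {
    return isTimestampWithUnit(parquetType, TimeUnit.MILLIS);
  }
}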
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/common/ParquetTimestampUtils.java (resolved, outdated)
Cast(expr, dec, if (needTimeZone) timeZoneId else None)
case (StringType, DateType) =>
  Cast(expr, DateType, if (needTimeZone) timeZoneId else None)
case (LongType, TimestampNTZType) => expr // @ethan I think we just want a no-op here?
Now I kind of get it. Is this because the local timestamp or TimestampNTZType was written as a Long type in Parquet before? Also, there is no regression (micros in the schema vs millis in the values) for TimestampNTZType in published Hudi releases, correct? If so, there is no need for conversion.
If you go to the codegen for casting, it doesn't support long -> TimestampNTZ, but Spark has natural handling. I got rid of this, though, because now we will repair the data in this case. If we want to protect the case where repair is disabled, we can actually make a better change: in org.apache.spark.sql.execution.datasources.parquet.HoodieParquetFileFormatHelper.isDataTypeEqual we can add a case for (TimestampNTZType, LongType) => true, so that we won't even need to use the schema evolution path for this column.
I think avoiding the schema evolution read path would be preferred, to avoid its additional overhead.
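A sketch of that shortcut; the real isDataTypeEqual helper is Scala, but the idea translates directly (Java shown for consistency with the other sketches, assuming a Spark version that ships TimestampNTZType, i.e. 3.3+):

import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.LongType;
import org.apache.spark.sql.types.TimestampNTZType;

public class NtzCompatibilityCheck {
  // Treat a file-side LONG under a table-side TimestampNTZ as already compatible, so the
  // column does not need to go through the schema-evolution read path at all.
  static boolean isDataTypeEqual(DataType fileType, DataType tableType) {
    if (fileType.equals(tableType)) {
      return true;
    }
    return fileType instanceof LongType && tableType instanceof TimestampNTZType;
  }
}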
...scala/org/apache/spark/sql/execution/datasources/parquet/HoodieParquetFileFormatHelper.scala (resolved, outdated)
})
}

def recursivelyApplyMultiplication(expr: Expression, columnPath: String, dataType: DataType): Expression = {
I'm wondering if we can change HoodieParquetReadSupport and add a read support implementation for the Avro Parquet reader to handle the millis interpretation, which is one layer below the current approach? Would that incur less overhead than the projection?
Yes, I was able to do so, and it will in fact incur less overhead. My original approach was to spoof the schema read from the Parquet footer, but I thought that was too deep into the parquet-java Hadoop stack to work. After trying it out, it seems the ReadSupport requested schema is all that needs to change for it to work.
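A sketch of what repairing one requested column could look like with parquet-java's schema builders; the SchemaRepair class in the diff further down is the PR's own, and this is only an illustration of the idea, under the assumption that the regressed files store millisecond values under a MICROS annotation:

import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Type;
import org.apache.parquet.schema.Types;

public class RequestedSchemaRepairSketch {
  // If the requested column is an INT64 annotated as timestamp(MICROS) while the table schema
  // declares millis, rebuild the column with a MILLIS annotation so the reader interprets the
  // stored longs as milliseconds rather than microseconds.
  static Type repairTimestampColumn(Type requestedColumn, boolean tableDeclaresMillis) {
    LogicalTypeAnnotation annotation = requestedColumn.getLogicalTypeAnnotation();
    if (!tableDeclaresMillis
        || !requestedColumn.isPrimitive()
        || !(annotation instanceof TimestampLogicalTypeAnnotation)
        || ((TimestampLogicalTypeAnnotation) annotation).getUnit() != TimeUnit.MICROS) {
      return requestedColumn;
    }
    boolean utcAdjusted = ((TimestampLogicalTypeAnnotation) annotation).isAdjustedToUTC();
    return Types.primitive(PrimitiveTypeName.INT64, requestedColumn.getRepetition())
        .as(LogicalTypeAnnotation.timestampType(utcAdjusted, TimeUnit.MILLIS))
        .named(requestedColumn.getName());
  }
}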
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java (resolved, outdated)
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java (resolved, outdated)
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieAvroDataBlock.java (resolved, outdated)
hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieDataBlock.java (resolved, outdated)
override def init(context: InitContext): ReadContext = {
  val readContext = super.init(context)
  val requestedParquetSchema = readContext.getRequestedSchema
  val requestedParquetSchema = SchemaRepair.repairLogicalTypes(readContext.getRequestedSchema, tableSchemaOpt)
Where is the fix for the Avro Parquet reader? Also, the Hive reader needs a fix too.
Can it be ensured that readContext.getRequestedSchema comes from the Parquet footer?
if (!existingTableSchema.isPresent()) {
  return;
}
boolean allowLogicalEvolutions = config.shouldAllowLogicalEvolutions();
Like we discussed (https://github.com/apache/hudi/pull/14120/files#r2448965924): for a V9 table, the flag allowLogicalEvolutions should always be false, while for V8 and below it should be true to allow the fix to work.
So can we get rid of the option config.shouldAllowLogicalEvolutions() and just decide by table version?
if (recordNeedsRewriteForExtendedAvroTypePromotion(writerSchema, readerSchema)) {
  this.reader = new GenericDatumReader<>(writerSchema, writerSchema);
  Schema repairedWriterSchema = AvroSchemaRepair.repairLogicalTypes(writerSchema, readerSchema);
The writer schema and reader schema both come from the log block header, which is actually the same schema, so the fix does not seem to work as expected.
Then how do the existing evolution fixes like recordNeedsRewriteForExtendedAvroTypePromotion work? I will validate whether the fix is working, but the schemas should be able to differ.
Yeah, looks like if there is no schema evolution, the reader schema is right: see private Option<Schema> getTargetReaderSchemaForBlock() in hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFileReader.java (line 223 in 5fe72cb) and the HoodieDataBlock constructors.
And we should invoke AvroSchemaRepair.repairLogicalTypes() after recordNeedsRewriteForExtendedAvroTypePromotion returns true, because the micros-to-millis check is added there?
hudi-hadoop-common/src/main/java/org/apache/hudi/io/hadoop/HoodieAvroParquetReaderIterator.java (resolved, outdated)
}

// if exists read columns, we need to filter columns.
List<String> readColNames = Arrays.asList(HoodieColumnProjectionUtils.getReadColumnNames(conf));
So this code never worked before? It looks like it serves Hive queries.
This works; it's just redundant because we already read the footer in the file group reader. So we were doing the repair in the file group reader, and then the repair was getting undone because here we read the schema from the footer again.
|| filePath.getFileExtension().equals(HoodieFileFormat.ORC.getFileExtension());
Schema avroFileSchema = isParquetOrOrc ? HoodieIOFactory.getIOFactory(storage)
    .getFileFormatUtils(filePath).readAvroSchema(storage, filePath) : dataSchema;
Schema avroFileSchema = AvroSchemaRepair.repairLogicalTypes(isParquetOrOrc ? HoodieIOFactory.getIOFactory(storage)
If it isn't Parquet or ORC, there is no need to call AvroSchemaRepair.repairLogicalTypes at all.
repairedFields.add(repaired);
}

return Schema.createRecord(
- There is no need to instantiate a new schema if the timestamp type already matches in precision;
- we had better intern the schema if a new one is generated (a sketch of both points follows the attached patch).
Attachment: create_schema_only_when_necessary_for_schema_repair.patch
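A minimal sketch of those two suggestions (skip the rebuild when the logical types already agree, and reuse repaired schemas instead of regenerating them); this is only an illustration, not the attached patch:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.avro.LogicalType;
import org.apache.avro.Schema;

public class SchemaRepairCacheSketch {
  private static final Map<Schema, Schema> REPAIRED = new ConcurrentHashMap<>();

  static Schema repairIfNeeded(Schema fileFieldSchema, Schema tableFieldSchema) {
    LogicalType fileType = fileFieldSchema.getLogicalType();
    LogicalType tableType = tableFieldSchema.getLogicalType();
    if (fileType == null || tableType == null || fileType.getName().equals(tableType.getName())) {
      return fileFieldSchema; // precision already matches (or no logical type): keep the original object
    }
    // Intern the repaired schema so repeated repairs of the same field reuse one instance.
    return REPAIRED.computeIfAbsent(fileFieldSchema, ignored -> buildRepairedSchema(tableFieldSchema));
  }

  private static Schema buildRepairedSchema(Schema tableFieldSchema) {
    // The real repair would rebuild the field with the table's timestamp precision; for this
    // sketch we simply adopt the table-side field schema.
    return tableFieldSchema;
  }
}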
switching to #14161 |
Describe the issue this Pull Request addresses
Summary and Changelog
Impact
Risk Level
Documentation Update
Contributor's checklist