@jonvex
Contributor

@jonvex jonvex commented May 12, 2025

Change Logs

  • Add new types TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema.
  • Add validation for the row writer that we don't use timestamp millis.
  • Testing: added the timestamp and time types to the schema so that TestAvroSchemaEvolutionUtils.testFixNullOrdering() validates that the logical types are preserved.
  • When writing to col stats / partition stats, convert timestamp millis to micros before writing to the MDT, because we only have a timestamp-micros wrapper.
  • When reading with Spark, base the Avro schema on the table schema.
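
The millis-to-micros conversion mentioned for col stats / partition stats can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from the Hudi code base; the only assumption is that the stats wrapper stores epoch microseconds as a long.

```java
/** Hypothetical helper (not part of Hudi): converts between epoch millis and epoch micros. */
final class TimestampPrecisionSketch {
    private TimestampPrecisionSketch() {}

    /** Widens epoch milliseconds to epoch microseconds; throws ArithmeticException on overflow. */
    static long millisToMicros(long epochMillis) {
        return Math.multiplyExact(epochMillis, 1_000L);
    }

    /** Narrows epoch microseconds to epoch milliseconds; floorDiv rounds toward negative infinity for pre-1970 values. */
    static long microsToMillis(long epochMicros) {
        return Math.floorDiv(epochMicros, 1_000L);
    }

    public static void main(String[] args) {
        long millis = 1_747_008_000_123L;
        long micros = millisToMicros(millis);
        // Widening and then narrowing returns the original millisecond value.
        System.out.println(microsToMillis(micros) == millis);
    }
}
```

Widening is lossless, so a min/max statistic recorded in micros can be narrowed back to millis on read without changing the value.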

Impact

Prevent information loss when converting avro schema to internal schema

Risk level (write none, low medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label May 12, 2025
@github-actions github-actions bot added size:M PR with lines of changes in (100, 300] and removed size:S PR with lines of changes in (10, 100] labels May 13, 2025
@jonvex jonvex marked this pull request as ready for review May 13, 2025 18:28

@rkwagner rkwagner left a comment

👍 LGTM, hopefully this goes in soon!


def canUseRowWriter(schema: Schema, conf: Configuration): Boolean = {
if (conf.getBoolean(AvroWriteSupport.WRITE_OLD_LIST_STRUCTURE, true)) {
if (HoodieAvroUtils.hasTimestampMillisField(schema)) {
Contributor

Do we have a UT or integration test for Spark already?

Contributor Author

I tested HoodieAvroUtils.hasTimestampMillisField but not canUseRowWriter. Tbh I'm not sure if I should just get rid of it. I think there is a setting in Spark that allows using timestamp millis, so I think maybe it should be checking if there are both millis and micros used at the same time. And also checking that config. But we also have a hudi config that sets that config. I'm not sure how much value we would get for the time it would take to validate all of that. So maybe for now we just remove it?
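
To make the "both millis and micros used at the same time" check concrete, here is a rough sketch. It uses a hypothetical FieldNode tree as a stand-in for an Avro schema and is not the implementation of HoodieAvroUtils.hasTimestampMillisField; the logical type names are the standard Avro ones.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a schema node: a name, an optional logical type, and nested fields. */
final class FieldNode {
    final String name;
    final String logicalType; // e.g. "timestamp-millis"; null when the field has no logical type
    final List<FieldNode> children;

    FieldNode(String name, String logicalType, FieldNode... children) {
        this.name = name;
        this.logicalType = logicalType;
        this.children = Arrays.asList(children);
    }
}

/** Hypothetical checks over the FieldNode tree. */
final class PrecisionCheckSketch {
    private PrecisionCheckSketch() {}

    /** Returns true if this node or any nested node carries the given logical type. */
    static boolean hasLogicalType(FieldNode node, String logicalType) {
        if (logicalType.equals(node.logicalType)) {
            return true;
        }
        for (FieldNode child : node.children) {
            if (hasLogicalType(child, logicalType)) {
                return true;
            }
        }
        return false;
    }

    /** Returns true only when the schema contains both timestamp precisions. */
    static boolean mixesMillisAndMicros(FieldNode root) {
        return hasLogicalType(root, "timestamp-millis")
            && hasLogicalType(root, "timestamp-micros");
    }

    public static void main(String[] args) {
        FieldNode record = new FieldNode("record", null,
            new FieldNode("created_at", "timestamp-millis"),
            new FieldNode("nested", null, new FieldNode("updated_at", "timestamp-micros")));
        System.out.println(mixesMillisAndMicros(record));
    }
}
```

A real check would also have to consult the relevant Spark and Hudi configs mentioned above, which this sketch deliberately leaves out.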

Contributor

> if there are both millis and micros used at the same time. And also checking that config. But we also have a hudi config that sets that config

The default timestamp precision in Spark is 6. Are you saying that users explicitly specify a timestamp precision of 3 for some columns? I guess most cases would use either the default precision 6 or precision 3 throughout; the mixed case should be very rare.

Is the patch to fix the schema evolution use case for timestamp(3)?

The fix is for this issue:
#13233

In that issue, Hudi streamers always force timestamps into micros for a new table, no matter what the user specifies in the output schema. As you can see in the internal converter, whichever timestamp variant is used in the output schema (millis or micros), you always end up with micros.

The OR clause here makes that clear:
https://github.com/apache/hudi/pull/13291/files#diff-2d823101c425b4f9fbc444d1def5b6ebe1607bf19b532c80f5b0851cfd27a292

The behavior is reproducible with the script in the linked issue.
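
The effect of the OR clause can be sketched in isolation. The enums and method names below are hypothetical simplifications and not the actual Hudi converter; they only mirror the behavior described above, where both timestamp variants used to map to one internal type, and the split into two internal types that this PR introduces.

```java
/** Hypothetical model of the Avro-to-internal timestamp mapping, before and after the change. */
final class TimestampMappingSketch {
    private TimestampMappingSketch() {}

    enum AvroLogical { TIMESTAMP_MILLIS, TIMESTAMP_MICROS }

    /** TIMESTAMP stands for the existing micros-based internal type. */
    enum Internal { TIMESTAMP, TIMESTAMP_MILLIS }

    /** Old behavior: the OR clause sends both precisions to the same internal type. */
    static Internal toInternalBefore(AvroLogical logical) {
        if (logical == AvroLogical.TIMESTAMP_MILLIS || logical == AvroLogical.TIMESTAMP_MICROS) {
            return Internal.TIMESTAMP;
        }
        throw new IllegalArgumentException("Unsupported logical type: " + logical);
    }

    /** New behavior: each precision gets its own internal type. */
    static Internal toInternalAfter(AvroLogical logical) {
        switch (logical) {
            case TIMESTAMP_MILLIS:
                return Internal.TIMESTAMP_MILLIS;
            case TIMESTAMP_MICROS:
                return Internal.TIMESTAMP;
            default:
                throw new IllegalArgumentException("Unsupported logical type: " + logical);
        }
    }

    /** Converting back: the micros-based internal type can only produce timestamp-micros. */
    static AvroLogical toAvro(Internal type) {
        return type == Internal.TIMESTAMP_MILLIS
            ? AvroLogical.TIMESTAMP_MILLIS
            : AvroLogical.TIMESTAMP_MICROS;
    }

    public static void main(String[] args) {
        // Round trip with the old mapping loses the millis precision.
        System.out.println(toAvro(toInternalBefore(AvroLogical.TIMESTAMP_MILLIS)));
        // Round trip with the new mapping preserves it.
        System.out.println(toAvro(toInternalAfter(AvroLogical.TIMESTAMP_MILLIS)));
    }
}
```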

Contributor

Hmm, then I think we should support timestamp(3) for Spark, @jonvex do you think it is feasible?

DATE(Integer.class),
BOOLEAN(Boolean.class),
TIME(Long.class),
TIME_MILLIS(Integer.class),
Contributor

Should the primitive type for the logical type of time-millis also use Long type?

Contributor

Looks like Integer type is aligned with Avro. Could you add a note here in a comment?
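
For context on why Integer is sufficient here: Avro's time-millis logical type annotates an int (milliseconds after midnight), while time-micros annotates a long. A small sketch with hypothetical helper names shows that a day's worth of milliseconds fits in an int but a day's worth of microseconds does not.

```java
import java.time.LocalTime;

/** Hypothetical helpers (not Hudi or Avro APIs) showing the value ranges of time-of-day encodings. */
final class TimeOfDaySketch {
    private TimeOfDaySketch() {}

    /** Milliseconds after midnight; at most 86,399,999, so the cast to int is safe. */
    static int toTimeMillis(LocalTime time) {
        return (int) (time.toNanoOfDay() / 1_000_000L);
    }

    /** Microseconds after midnight; can reach 86,399,999,999, which exceeds Integer.MAX_VALUE. */
    static long toTimeMicros(LocalTime time) {
        return time.toNanoOfDay() / 1_000L;
    }

    /** Rebuilds a LocalTime from milliseconds after midnight. */
    static LocalTime fromTimeMillis(int millisOfDay) {
        return LocalTime.ofNanoOfDay(millisOfDay * 1_000_000L);
    }

    public static void main(String[] args) {
        LocalTime lastInstant = LocalTime.MAX; // 23:59:59.999999999
        System.out.println(toTimeMillis(lastInstant) <= Integer.MAX_VALUE);
        System.out.println(toTimeMicros(lastInstant) > Integer.MAX_VALUE);
    }
}
```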

Comment on lines 348 to 354
} else if (logical instanceof LogicalTypes.TimeMillis) {
return Types.TimeMillisType.get();
} else if (logical instanceof LogicalTypes.TimeMicros) {
return Types.TimeType.get();

} else if (
logical instanceof LogicalTypes.TimestampMillis
|| logical instanceof LogicalTypes.TimestampMicros) {
} else if (logical instanceof LogicalTypes.TimestampMillis) {
return Types.TimestampMillisType.get();
} else if (logical instanceof LogicalTypes.TimestampMicros) {
Contributor

Why would type handling go through the schema-evolution-on-read logic here? That logic is not supposed to be invoked by default. Is it leaking into the non-schema-on-read code path?

@github-actions github-actions bot added size:L PR with lines of changes in (300, 1000] and removed size:M PR with lines of changes in (100, 300] labels Jun 11, 2025
@jonvex jonvex changed the title [HUDI-9359] add TIMESTAMP_MILLIS and TIME_MILLIS to InternalSchema [HUDI-9359] Support Timestamp-millis Jul 21, 2025
@yihua
Contributor

yihua commented Oct 7, 2025

Given that the timestamp-millis type support is added by #13711 on master, we should get this fix into branch-0.x and disable col stats index on timestamp millis columns for 0.x releases.

6 participants