
Conversation

@jonvex
Contributor

@jonvex jonvex commented Aug 12, 2025

Describe the issue this Pull Request addresses

Column stats uses Avro wrappers to store the different value types. For a type like decimal, this causes problems because precision and scale are part of the schema, so we would need a wrapper for every precision/scale combination to cover all column types.

Summary and Changelog

Instead, we introduce column stats v2 (along with v2 partition stats and expression index). In this spec, we store only the primitive type. The primitive types are: bool, int, long, float, double, string, and bytes.
We also add an additional field to each column stats metadata record, valueType, of type HoodieValueTypeInfo. This holds an int that corresponds to a type and an optional string that can store additional info. Currently that field is only used for decimal precision and scale.
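As a hedged sketch (class and method names here are hypothetical), the decimal precision and scale could round-trip through that optional string as a simple "precision,scale" encoding, which matches how a review snippet later in this thread parses it:

```java
// Hypothetical sketch of encoding/decoding decimal info in the optional
// additional-info string of HoodieValueTypeInfo. Not the actual Hudi API.
public class DecimalInfoSketch {
    // Encode precision and scale as "precision,scale".
    static String encodeDecimalInfo(int precision, int scale) {
        return precision + "," + scale;
    }

    // Decode the string back into [precision, scale].
    static int[] decodeDecimalInfo(String data) {
        String[] splits = data.split(",");
        return new int[] {Integer.parseInt(splits[0]), Integer.parseInt(splits[1])};
    }

    public static void main(String[] args) {
        String info = encodeDecimalInfo(10, 2);
        int[] decoded = decodeDecimalInfo(info);
        System.out.println(info + " -> precision=" + decoded[0] + ", scale=" + decoded[1]);
    }
}
```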

We introduce an enum ValueType and a class ValueMetadata. ValueMetadata holds a ValueType as well as any additional info.
ValueType is an enum that holds info on:

  1. underlying primitive type
  2. java class to represent it in memory
  3. method to convert from primitive to java class
  4. method to convert from java class to primitive
  5. method to type promote and standardize values

If any new logical types are introduced, we can just add new enum values.
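The shape of such an enum might look like the following minimal sketch. This is illustrative only: the real ValueType enum in the PR carries more entries plus promotion logic, and the DATE entry and field names here are hypothetical stand-ins.

```java
import java.time.LocalDate;
import java.util.function.Function;

// Sketch of an enum pairing a primitive storage type with a Java in-memory
// class and conversion functions in both directions (names hypothetical).
public class ValueTypeSketch {
    enum ValueType {
        // (primitive type, Java class, primitive -> Java, Java -> primitive)
        DATE("int", LocalDate.class,
            p -> LocalDate.ofEpochDay(((Number) p).longValue()),
            j -> (int) ((LocalDate) j).toEpochDay());

        final String primitiveType;
        final Class<?> javaClass;
        final Function<Object, Object> fromPrimitive;
        final Function<Object, Object> toPrimitive;

        ValueType(String primitiveType, Class<?> javaClass,
                  Function<Object, Object> fromPrimitive,
                  Function<Object, Object> toPrimitive) {
            this.primitiveType = primitiveType;
            this.javaClass = javaClass;
            this.fromPrimitive = fromPrimitive;
            this.toPrimitive = toPrimitive;
        }
    }

    public static void main(String[] args) {
        // a date stored as an int (days since epoch) converts to LocalDate and back
        Object inMemory = ValueType.DATE.fromPrimitive.apply(19000);
        System.out.println(inMemory + " -> " + ValueType.DATE.toPrimitive.apply(inMemory));
    }
}
```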

For backwards compatibility, we have a V1 type that calls the legacy methods. So when the v1 col stats index is used, every value will have a type of V1.

ValueMetadata is essentially the in-memory representation of HoodieValueTypeInfo. It has a method getValueTypeInfo() that builds a HoodieValueTypeInfo that can be written to the MDT. It also has a few other very useful methods:

  1. standardizeJavaTypeAndPromote
  2. wrapValue()
  3. unwrapValue()

wrapValue will take the Java in-memory representation of a type, convert it into a primitive representation, and then wrap it with the appropriate Avro wrapper.

unwrapValue will do the inverse operation: unwrap an Avro wrapper and convert the primitive value back into its Java representation.
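A minimal sketch of that round trip for a decimal value, assuming a hypothetical BytesWrapper in place of the generated Avro wrapper record (the fixed-point byte encoding shown is illustrative, not necessarily the one the PR uses):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Sketch of wrapValue/unwrapValue for decimals. BytesWrapper stands in for
// the actual generated Avro wrapper class.
public class WrapSketch {
    static class BytesWrapper {
        final ByteBuffer value;
        BytesWrapper(ByteBuffer value) { this.value = value; }
    }

    // wrapValue: Java type (BigDecimal) -> primitive (bytes) -> Avro wrapper
    static BytesWrapper wrapValue(BigDecimal decimal) {
        return new BytesWrapper(ByteBuffer.wrap(decimal.unscaledValue().toByteArray()));
    }

    // unwrapValue: Avro wrapper -> primitive (bytes) -> Java type, using the
    // scale carried in the valueType metadata
    static BigDecimal unwrapValue(BytesWrapper wrapper, int scale) {
        byte[] bytes = new byte[wrapper.value.remaining()];
        wrapper.value.duplicate().get(bytes);
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("123.45");
        BigDecimal roundTripped = unwrapValue(wrapValue(original), original.scale());
        System.out.println(roundTripped);  // prints 123.45
    }
}
```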

This PR also fixes up HoodieRecord.getColumnValueAsJava so that all types are supported and work correctly.

Additionally, this PR fixes our handling of some logical types. InternalSchema didn't have some types like timestamp_millis, so that info would be dropped.

For the future: if you want to add array support, here is the commit where I removed it from this PR: 357675c

Impact

Column stats works correctly with all the logical types that are supported in regular Hudi tables.

It is also easier to add support for more types in col stats.

Risk Level

high

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// we simply fall back, in lieu of failing the whole task
LOG.error("Failed to fetch column range metadata for: {}", partitionPathFileName);
return Collections.emptyList();
throw new HoodieException("Failed to fetch column range metadata for: " + partitionPathFileName, e);
Contributor Author

Filed ticket https://issues.apache.org/jira/browse/HUDI-9722. We need to fail the task in some cases.

Contributor

The column stats index and data skipping are designed so that if the column stats are missing for a particular file or column, data skipping does not prune the file, so query correctness is still guaranteed. We can revisit this case to see if we want to throw an error when the column range metadata cannot be read, for a more restricted set of exceptions.

: new ExpressionIndexComputationMetadata(colStatRecords);
}

private static SparkValueMetadata getValueMetadataFromColumnRangeDataset(Dataset<Row> dataset, HoodieIndexVersion indexVersion) {
Contributor

Not following. This method only needs the schema of the dataset, i.e., dataset.schema(), correct? So why pass in the dataset itself, which contains data the method does not need?

Comment on lines +198 to +203
val exclusionFields = new java.util.HashSet[String]()
exclusionFields.add("op")
partitionSchema.fields.foreach(f => exclusionFields.add(f.name))
val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName)
val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName)
val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)
Contributor

I still think the logic here around the op field should not be here. @jonvex, could you follow up on why this is needed here?

Comment on lines 111 to 113
//TODO: decide if we want to store things in a better way
String[] splits = data.split(",");
return Pair.of(Integer.parseInt(splits[0]), Integer.parseInt(splits[1]));
Contributor

Reminder on this

"hoodie.upsert.shuffle.parallelism" -> "4",
HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString,
HoodieWriteConfig.WRITE_TABLE_VERSION.key() -> testCase.tableVersion.toString,
Contributor

Reminder on this

@yihua yihua changed the title Improve Logical Type Handling on Col Stats feat(metadata): Improve Logical Type Handling on Col Stats Sep 23, 2025
@yihua yihua marked this pull request as ready for review September 23, 2025 09:09
Contributor

@yihua yihua left a comment

LGTM. Thanks for pushing this through

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit be6f08d into apache:master Sep 23, 2025
71 of 72 checks passed
@yihua yihua mentioned this pull request Oct 7, 2025
4 tasks

Labels

release-1.1.0 size:XL PR with lines of changes > 1000
