
Conversation

@jonvex
Contributor

@jonvex jonvex commented Aug 12, 2025

Describe the issue this Pull Request addresses

Column stats uses Avro wrappers to store the different value types. For a type like decimal, this causes problems because precision and scale are part of the schema, so we would need a wrapper for every precision/scale combination to cover all column types.

Summary and Changelog

Instead, we introduce column stats v2 (along with v2 partition stats and expression index). In this spec, we store only the primitive type. The primitive types are: bool, int, long, float, double, string, and bytes.
We also add an additional field to each column stats metadata record, valueType, of type HoodieValueTypeInfo. This holds an int that corresponds to a type and an optional string that can store additional info. Currently that field is only used for decimal precision and scale.
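As a hedged sketch (class and method names here are hypothetical), the decimal precision and scale could round-trip through that optional string as a simple "precision,scale" encoding, which matches how a review snippet later in this thread parses it:

```java
// Hypothetical sketch of encoding/decoding decimal info in the optional
// additional-info string of HoodieValueTypeInfo. Not the actual Hudi API.
public class DecimalInfoSketch {
    // Encode precision and scale as "precision,scale".
    static String encodeDecimalInfo(int precision, int scale) {
        return precision + "," + scale;
    }

    // Decode the string back into [precision, scale].
    static int[] decodeDecimalInfo(String data) {
        String[] splits = data.split(",");
        return new int[] {Integer.parseInt(splits[0]), Integer.parseInt(splits[1])};
    }

    public static void main(String[] args) {
        String info = encodeDecimalInfo(10, 2);
        int[] decoded = decodeDecimalInfo(info);
        System.out.println(info + " -> precision=" + decoded[0] + ", scale=" + decoded[1]);
    }
}
```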

We introduce an enum ValueType and a class ValueMetadata. ValueMetadata holds a ValueType as well as any additional info.
ValueType is an enum that holds info on:

  1. underlying primitive type
  2. java class to represent it in memory
  3. method to convert from primitive to java class
  4. method to convert from java class to primitive
  5. method to type promote and standardize values

If any new logical types are introduced, we can just add new enum values.
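The shape of such an enum might look like the following minimal sketch. This is illustrative only: the real ValueType enum in the PR carries more entries plus promotion logic, and the DATE entry and field names here are hypothetical stand-ins.

```java
import java.time.LocalDate;
import java.util.function.Function;

// Sketch of an enum pairing a primitive storage type with a Java in-memory
// class and conversion functions in both directions (names hypothetical).
public class ValueTypeSketch {
    enum ValueType {
        // (primitive type, Java class, primitive -> Java, Java -> primitive)
        DATE("int", LocalDate.class,
            p -> LocalDate.ofEpochDay(((Number) p).longValue()),
            j -> (int) ((LocalDate) j).toEpochDay());

        final String primitiveType;
        final Class<?> javaClass;
        final Function<Object, Object> fromPrimitive;
        final Function<Object, Object> toPrimitive;

        ValueType(String primitiveType, Class<?> javaClass,
                  Function<Object, Object> fromPrimitive,
                  Function<Object, Object> toPrimitive) {
            this.primitiveType = primitiveType;
            this.javaClass = javaClass;
            this.fromPrimitive = fromPrimitive;
            this.toPrimitive = toPrimitive;
        }
    }

    public static void main(String[] args) {
        // a date stored as an int (days since epoch) converts to LocalDate and back
        Object inMemory = ValueType.DATE.fromPrimitive.apply(19000);
        System.out.println(inMemory + " -> " + ValueType.DATE.toPrimitive.apply(inMemory));
    }
}
```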

For backwards compatibility, we have a V1 type that calls the legacy methods. So when the v1 col stats index is used, every value will have a type of V1.

ValueMetadata is essentially the in-memory representation of HoodieValueTypeInfo. It has a method getValueTypeInfo() that builds a HoodieValueTypeInfo that can be written to the MDT. It also has a few other very useful methods:

  1. standardizeJavaTypeAndPromote
  2. wrapValue()
  3. unwrapValue()

wrapValue will take the Java in-memory representation of a type, convert it into a primitive representation, and then wrap it with the appropriate Avro wrapper.

unwrapValue will do the inverse operation: unwrap an Avro wrapper and convert the primitive value back into its Java representation.
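A minimal sketch of that round trip for a decimal value, assuming a hypothetical BytesWrapper in place of the generated Avro wrapper record (the fixed-point byte encoding shown is illustrative, not necessarily the one the PR uses):

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Sketch of wrapValue/unwrapValue for decimals. BytesWrapper stands in for
// the actual generated Avro wrapper class.
public class WrapSketch {
    static class BytesWrapper {
        final ByteBuffer value;
        BytesWrapper(ByteBuffer value) { this.value = value; }
    }

    // wrapValue: Java type (BigDecimal) -> primitive (bytes) -> Avro wrapper
    static BytesWrapper wrapValue(BigDecimal decimal) {
        return new BytesWrapper(ByteBuffer.wrap(decimal.unscaledValue().toByteArray()));
    }

    // unwrapValue: Avro wrapper -> primitive (bytes) -> Java type, using the
    // scale carried in the valueType metadata
    static BigDecimal unwrapValue(BytesWrapper wrapper, int scale) {
        byte[] bytes = new byte[wrapper.value.remaining()];
        wrapper.value.duplicate().get(bytes);
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    public static void main(String[] args) {
        BigDecimal original = new BigDecimal("123.45");
        BigDecimal roundTripped = unwrapValue(wrapValue(original), original.scale());
        System.out.println(roundTripped);  // prints 123.45
    }
}
```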

This PR also fixes up HoodieRecord.getColumnValueAsJava so that all types are supported and work correctly.

Additionally, this PR fixes our handling of some logical types. InternalSchema didn't have some types like timestamp_millis, so that info would be dropped.

For the future: if you want to add array support, here is the commit where I removed it from this PR: 357675c

Impact

Column stats works correctly with all the logical types that are supported in regular Hudi tables.

It is also easier to add support for more types in col stats.

Risk Level

high

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

// we simply fall back, in lieu of failing the whole task
LOG.error("Failed to fetch column range metadata for: {}", partitionPathFileName);
return Collections.emptyList();
throw new HoodieException("Failed to fetch column range metadata for: " + partitionPathFileName, e);
Contributor Author

Filed ticket https://issues.apache.org/jira/browse/HUDI-9722. We need to fail the task in some cases.

Contributor

The column stats index and data skipping are designed so that if the column stats are missing for a particular file or column, data skipping does not prune the file, so query correctness is still guaranteed. We can revisit this case to see if we want to throw an error when the column range metadata cannot be read, for a more restricted set of exceptions.

: new ExpressionIndexComputationMetadata(colStatRecords);
}

private static SparkValueMetadata getValueMetadataFromColumnRangeDataset(Dataset<Row> dataset, HoodieIndexVersion indexVersion) {
Contributor

Not following. This method only needs the schema of the dataset, i.e., dataset.schema(), correct? So why pass in the dataset itself, which contains data the method does not need?

Comment on lines +198 to +203
val exclusionFields = new java.util.HashSet[String]()
exclusionFields.add("op")
partitionSchema.fields.foreach(f => exclusionFields.add(f.name))
val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName)
val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName)
val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)
Contributor

I still think the logic here around the op field should not be here. @jonvex, could you follow up on why this is needed here?

Comment on lines 111 to 113
//TODO: decide if we want to store things in a better way
String[] splits = data.split(",");
return Pair.of(Integer.parseInt(splits[0]), Integer.parseInt(splits[1]));
Contributor

Reminder on this

"hoodie.upsert.shuffle.parallelism" -> "4",
HoodieWriteConfig.TBL_NAME.key -> "hoodie_test",
DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString,
HoodieWriteConfig.WRITE_TABLE_VERSION.key() -> testCase.tableVersion.toString,
Contributor

Reminder on this

@yihua yihua changed the title Improve Logical Type Handling on Col Stats feat(metadata): Improve Logical Type Handling on Col Stats Sep 23, 2025
@yihua yihua marked this pull request as ready for review September 23, 2025 09:09
Contributor

@yihua yihua left a comment

LGTM. Thanks for pushing this through

@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit be6f08d into apache:master Sep 23, 2025
71 of 72 checks passed
@yihua yihua mentioned this pull request Oct 7, 2025
4 tasks

Labels

release-1.1.0 size:XL PR with lines of changes > 1000
