feat(metadata): Improve Logical Type Handling on Col Stats #13711
Conversation
```diff
- // we simply fall back, in lieu of failing the whole task
- LOG.error("Failed to fetch column range metadata for: {}", partitionPathFileName);
- return Collections.emptyList();
+ throw new HoodieException("Failed to fetch column range metadata for: " + partitionPathFileName, e);
```
Filed https://issues.apache.org/jira/browse/HUDI-9722 to track this. We need to fail the task in some cases.
The column stats index and data skipping are designed so that if the column stats are missing for a particular file or column, data skipping does not prune that file, and query correctness is still guaranteed. We can revisit this case to see whether we want to throw an error when the column range metadata cannot be read due to a more restricted set of exceptions.
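For illustration, here is a minimal, self-contained sketch of the fail-open behavior described above. This is not Hudi's actual pruning code; `ColumnRange` and `canPrune` are hypothetical names. A file is pruned only when stats exist and prove the predicate cannot match; missing stats always keep the file.

```java
import java.util.Optional;

final class FailOpenPruning {

  // Hypothetical per-file, per-column stats holder.
  record ColumnRange(long min, long max) {}

  // Returns true only when stats exist AND prove that `value` cannot occur in the file.
  static boolean canPrune(Optional<ColumnRange> stats, long value) {
    if (stats.isEmpty()) {
      return false; // fail open: no stats, so we must keep the file
    }
    ColumnRange r = stats.get();
    return value < r.min() || value > r.max();
  }

  public static void main(String[] args) {
    System.out.println(canPrune(Optional.empty(), 42L));                     // false: file kept
    System.out.println(canPrune(Optional.of(new ColumnRange(0, 10)), 42L));  // true: file pruned
    System.out.println(canPrune(Optional.of(new ColumnRange(0, 100)), 42L)); // false: file kept
  }
}
```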
```java
    : new ExpressionIndexComputationMetadata(colStatRecords);
}

private static SparkValueMetadata getValueMetadataFromColumnRangeDataset(Dataset<Row> dataset, HoodieIndexVersion indexVersion) {
```
Not following. This method only needs the schema of the dataset, i.e., dataset.schema(), correct? So why pass in the dataset itself, which contains data that the method does not need?
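A sketch of the narrowing this comment suggests, assuming the method only inspects field types. The name `getValueMetadataFromSchema` is hypothetical and the body is elided; the types are the ones appearing in the diff above.

```java
// Hypothetical narrowed signature: accept only what the method actually uses.
private static SparkValueMetadata getValueMetadataFromSchema(StructType schema, HoodieIndexVersion indexVersion) {
  // ... derive the value metadata purely from schema.fields() ...
}

// Call sites would then pass dataset.schema() instead of the full dataset.
```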
```diff
+ val exclusionFields = new java.util.HashSet[String]()
+ exclusionFields.add("op")
+ partitionSchema.fields.foreach(f => exclusionFields.add(f.name))
  val requestedSchema = StructType(requiredSchema.fields ++ partitionSchema.fields.filter(f => mandatoryFields.contains(f.name)))
- val requestedAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName)
- val dataAvroSchema = AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName)
+ val requestedAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(requestedSchema, sanitizedTableName), exclusionFields)
+ val dataAvroSchema = AvroSchemaUtils.pruneDataSchema(avroTableSchema, AvroConversionUtils.convertStructTypeToAvroSchema(dataSchema, sanitizedTableName), exclusionFields)
```
I still think the logic here around the op field should not be here. @jonvex, could you follow up on why this is needed?
```java
// TODO: decide if we want to store things in a better way
String[] splits = data.split(",");
return Pair.of(Integer.parseInt(splits[0]), Integer.parseInt(splits[1]));
```
Reminder on this
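For context, a self-contained sketch of the encoding the TODO refers to: two ints (here, decimal precision and scale) round-tripped through a comma-separated string. The class and method names are hypothetical; only the split-and-parse logic mirrors the diff above.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

final class PrecisionScaleCodec {

  // Serialize precision and scale as "precision,scale".
  static String encode(int precision, int scale) {
    return precision + "," + scale;
  }

  // Parse the string back; matches the diff above and throws NumberFormatException on bad input.
  static Map.Entry<Integer, Integer> decode(String data) {
    String[] splits = data.split(",");
    return new SimpleImmutableEntry<>(Integer.parseInt(splits[0]), Integer.parseInt(splits[1]));
  }

  public static void main(String[] args) {
    Map.Entry<Integer, Integer> ps = decode(encode(10, 2));
    System.out.println(ps.getKey() + " / " + ps.getValue()); // 10 / 2
  }
}
```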
| "hoodie.upsert.shuffle.parallelism" -> "4", | ||
| HoodieWriteConfig.TBL_NAME.key -> "hoodie_test", | ||
| DataSourceWriteOptions.TABLE_TYPE.key -> testCase.tableType.toString, | ||
| HoodieWriteConfig.WRITE_TABLE_VERSION.key() -> testCase.tableVersion.toString, |
Reminder on this
yihua left a comment:
LGTM. Thanks for pushing this through
Describe the issue this Pull Request addresses
Column stats uses Avro wrappers to store the different value types. For a type like decimal, this causes problems because precision and scale are part of the schema, so we would need a wrapper for every precision/scale combination to cover all column types. The primitive types we wrap are: bool, int, long, float, double, string, and bytes.
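A small plain-Java illustration of why a bare bytes wrapper is not enough for decimals: the same unscaled bytes decode to different values depending on the scale, and the scale lives in the schema, not in the bytes.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class DecimalScaleDemo {
  public static void main(String[] args) {
    // The unscaled value 12345, as it would be stored in a bytes column.
    byte[] unscaled = BigInteger.valueOf(12345).toByteArray();

    // The same bytes, interpreted under two different scales:
    System.out.println(new BigDecimal(new BigInteger(unscaled), 2)); // 123.45
    System.out.println(new BigDecimal(new BigInteger(unscaled), 4)); // 1.2345
  }
}
```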
Summary and Changelog
Instead, we introduce column stats v2 (and likewise v2 partition stats and expression index). In this spec, we only store the primitive type. The primitive types are: bool, int, long, float, double, string, and bytes.
We also add a field to each column stat metadata record: valueType, of type HoodieValueTypeInfo. This holds an int that corresponds to a type and an optional string that can store additional info. Currently that string is only used for decimal precision and scale.
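A rough sketch of the shape described above; the field names are illustrative, and the real HoodieValueTypeInfo is an Avro-generated record rather than this hand-written class.

```java
// Illustrative shape only; not the actual generated Avro class.
final class ValueTypeInfoSketch {
  final int typeOrdinal;       // int that corresponds to a ValueType
  final String additionalInfo; // optional; today only decimal precision/scale

  ValueTypeInfoSketch(int typeOrdinal, String additionalInfo) {
    this.typeOrdinal = typeOrdinal;
    this.additionalInfo = additionalInfo; // may be null for types without extra info
  }
}
```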
We introduce an enum, ValueType, and a class, ValueMetadata. ValueMetadata holds a ValueType as well as any additional info.
ValueType is an enum covering each supported type; if any new logical types are introduced, we can just add new enum values. For backwards compatibility we have a V1 type that calls the legacy methods, so when the v1 col stats index is used, every value has a type of V1.
ValueMetadata is essentially the in-memory representation of HoodieValueTypeInfo. It has a method getValueTypeInfo() that builds a HoodieValueTypeInfo that can be written to the MDT. It also has a few other very useful methods:
wrapValue takes the Java in-memory representation of a type, converts it into a primitive representation, and then wraps it with the appropriate Avro wrapper.
unwrapValue does the inverse operation: it unwraps an Avro wrapper and converts the primitive value back into its Java representation (see the sketch below).
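As a conceptual example, here is what the wrap/unwrap round trip looks like for a decimal in plain Java. Hudi's actual wrappers are Avro records; the method names here mirror the description above rather than the exact API, and the scale would come from the stored ValueMetadata.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class DecimalWrapDemo {

  // "wrap": BigDecimal -> primitive bytes (the unscaled value).
  static byte[] wrapValue(BigDecimal value, int scale) {
    return value.setScale(scale).unscaledValue().toByteArray();
  }

  // "unwrap": primitive bytes + scale from the stored metadata -> BigDecimal.
  static BigDecimal unwrapValue(byte[] bytes, int scale) {
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  public static void main(String[] args) {
    BigDecimal original = new BigDecimal("123.45");
    byte[] primitive = wrapValue(original, 2);
    System.out.println(unwrapValue(primitive, 2)); // 123.45
  }
}
```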
This PR also fixes HoodieRecord.getColumnValueAsJava so that all types are supported and work correctly.
Additionally, this PR fixes our handling of some logical types: InternalSchema didn't have some types, such as timestamp_millis, so that info would previously be dropped.
For the future: if you want to add array support, here is the commit where I removed it from this PR: 357675c
Impact
Column stats now works correctly with all the logical types that are supported in regular Hudi tables.
It is also easier to add support for more types in col stats.
Risk Level
High
Documentation Update
N/A
Contributor's checklist