[HUDI-9669] Add schema on write support to hive #13654
Conversation
```diff
-      return generateTypeInfo(
-          AvroSerdeUtils.getOtherTypeFromNullableType(schema), seenSchemas);
+      if (AvroSchemaUtils.isNullable(schema)) {
+        return generateTypeInfo(AvroSchemaUtils.resolveNullableSchema(schema), seenSchemas);
```
This is for Hive version compatibility.
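For context on what resolving the nullable schema does, here is a minimal standalone sketch (the `resolveNullable` helper below is a hypothetical stand-in for `AvroSchemaUtils.resolveNullableSchema`, not Hudi's actual implementation):

```java
import org.apache.avro.Schema;
import java.util.List;

public class NullableSchemaExample {
  // A nullable Avro schema is a UNION of NULL with exactly one other branch;
  // "resolving" it means returning the non-null branch.
  static Schema resolveNullable(Schema schema) {
    if (schema.getType() == Schema.Type.UNION) {
      List<Schema> branches = schema.getTypes();
      if (branches.size() == 2 && branches.get(0).getType() == Schema.Type.NULL) {
        return branches.get(1);
      }
      if (branches.size() == 2 && branches.get(1).getType() == Schema.Type.NULL) {
        return branches.get(0);
      }
    }
    return schema; // already non-nullable, return as-is
  }

  public static void main(String[] args) {
    Schema nullableString = Schema.createUnion(
        Schema.create(Schema.Type.NULL), Schema.create(Schema.Type.STRING));
    System.out.println(resolveNullable(nullableString)); // prints "string"
  }
}
```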
```java
boolean isParquet = filePath.getFileExtension().equals(HoodieFileFormat.PARQUET.getFileExtension());
Schema avroFileSchema = isParquet ? HoodieIOFactory.getIOFactory(storage)
    .getFileFormatUtils(filePath).readAvroSchema(storage, filePath) : dataSchema;
Schema actualRequiredSchema = isParquet ? AvroSchemaUtils.pruneDataSchema(avroFileSchema, requiredSchema, Collections.emptySet()) : requiredSchema;
```
Can you add an inline comment explaining why the pruning is required for parquet only?
It's actually that we don't want HFile, so I flipped the check. For the metadata table (MDT), the schema from the file is different, and things fail if we try to use it.
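In other words, the flipped condition reads roughly like the sketch below (illustrative, based on the snippet above rather than the exact merged code):

```java
// Illustrative: prefer the schema embedded in the data file, except for HFiles
// (e.g. metadata table base files), whose embedded schema differs from the
// table schema and fails if used for reading.
boolean isHFile = filePath.getFileExtension().equals(HoodieFileFormat.HFILE.getFileExtension());
Schema avroFileSchema = isHFile
    ? dataSchema
    : HoodieIOFactory.getIOFactory(storage).getFileFormatUtils(filePath).readAvroSchema(storage, filePath);
Schema actualRequiredSchema = isHFile
    ? requiredSchema
    : AvroSchemaUtils.pruneDataSchema(avroFileSchema, requiredSchema, Collections.emptySet());
```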
…ck for avro and parquet log files
the-other-tim-brown left a comment:
LGTM, @yihua can you take a look as well?
```diff
       Schema schema = getSchemaFromBufferRecord(bufferedRecord);
       ArrayWritable writable = bufferedRecord.getRecord();
-      return new HoodieHiveRecord(key, writable, schema, objectInspectorCache, bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
+      return new HoodieHiveRecord(key, writable, schema, getHiveAvroSerializer(schema), bufferedRecord.getHoodieOperation(), bufferedRecord.isDelete());
```
The hashcode of an Avro schema is cached, so the computation cost should be negligible.
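A hedged sketch of the caching pattern being discussed (illustrative only; the real `HiveAvroSerializer` construction takes more arguments, elided here behind a hypothetical factory):

```java
import org.apache.avro.Schema;
import org.apache.hudi.hadoop.utils.HiveAvroSerializer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class SerializerCacheSketch {
  // One serializer per schema, created lazily. Avro's Schema caches its
  // hashCode after the first computation, so repeated lookups keyed by
  // Schema stay cheap even for wide schemas.
  private final Map<Schema, HiveAvroSerializer> serializers = new ConcurrentHashMap<>();

  HiveAvroSerializer getHiveAvroSerializer(Schema schema) {
    return serializers.computeIfAbsent(schema, this::newSerializer);
  }

  // Hypothetical factory standing in for the real constructor call.
  private HiveAvroSerializer newSerializer(Schema schema) {
    throw new UnsupportedOperationException("illustration only");
  }
}
```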
danny0405 left a comment:
+1, overall looks good. @jonvex, can you rebase on master to resolve conflicts?
```java
private JobConf getJobConf() {
  JobConf jobConf = new JobConf(storageConfiguration.unwrapAs(Configuration.class));
  jobConf.set("columns", "field_1,field_2,field_3,datestr");
  jobConf.set("columns.types", "string,string,struct<nested_field:string>,string");
```
Have you tried querying a Hudi table with schema evolution using the Hive engine (not just the unit tests) to make sure everything still works without leveraging this conf provided by Hive (has this changed now)?
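For reference, a small sketch of what those two Hive-provided properties encode (`TypeInfoUtils` is from Hive's serde2 package; this just decodes the test values set above):

```java
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;
import java.util.Arrays;
import java.util.List;

public class ColumnsConfExample {
  public static void main(String[] args) {
    // Same values the test sets on the JobConf above.
    List<String> names = Arrays.asList("field_1,field_2,field_3,datestr".split(","));
    List<TypeInfo> types =
        TypeInfoUtils.getTypeInfosFromTypeString("string,string,struct<nested_field:string>,string");
    for (int i = 0; i < names.size(); i++) {
      System.out.println(names.get(i) + " : " + types.get(i));
    }
  }
}
```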
yihua left a comment:
It would be good to run queries on a large Hudi table on the Hive engine to make sure there is no noticeable performance difference.
yihua left a comment:
LGTM. We can land this first to unblock other PRs.
```java
 * The names of records, namespaces, or docs do not need to match. Nullability is ignored.
 */
public static boolean areSchemasProjectionEquivalent(Schema schema1, Schema schema2) {
  return AvroSchemaComparatorForRecordProjection.areSchemasProjectionEquivalent(schema1, schema2);
}
```
nit: this can be directly used without adding AvroSchemaUtils#areSchemasProjectionEquivalent
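To make the comparator's rules concrete, a small illustrative example (the expected result is inferred from the javadoc above, not verified against the comparator):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class ProjectionEquivalenceExample {
  public static void main(String[] args) {
    // Same field names and types, but different record names/namespaces and one
    // side nullable: per the javadoc, these should compare as equivalent.
    Schema s1 = SchemaBuilder.record("RecordA").fields()
        .requiredString("id").requiredInt("count").endRecord();
    Schema s2 = SchemaBuilder.record("RecordB").namespace("other.ns").fields()
        .optionalString("id").requiredInt("count").endRecord();
    System.out.println(s1);
    System.out.println(s2);
    // Assumed usage per the snippet above:
    // AvroSchemaUtils.areSchemasProjectionEquivalent(s1, s2); // expected: true
  }
}
```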
```java
@Override
protected boolean validateField(Schema.Field f1, Schema.Field f2) {
  return f1.name().equalsIgnoreCase(f2.name());
}
```
Is this intended for case insensitivity of column names?
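Context for the question: Hive treats column names as case-insensitive (the metastore lowercases them), so a strict equality check would presumably reject fields that differ only in case. A tiny illustration:

```java
public class CaseInsensitiveNameExample {
  public static void main(String[] args) {
    String avroName = "userId";  // name as declared in the Avro schema
    String hiveName = "userid";  // name as Hive reports it (lowercased)
    System.out.println(avroName.equals(hiveName));           // false
    System.out.println(avroName.equalsIgnoreCase(hiveName)); // true
  }
}
```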
Change Logs
Add schema-on-write support to Hive by implementing full projection, as we already have for Avro: take the table's data schema and prune it to the requested fields, read the files with that pruned schema, and then project the result onto the requested schema.
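A rough sketch of the prune step described above (simplified to top-level fields, assuming every requested field exists in the data schema; the real pruning logic handles more cases such as nested fields):

```java
import org.apache.avro.Schema;
import java.util.List;
import java.util.stream.Collectors;

public class PruneSketch {
  // Keep only the requested fields, with types resolved from the data schema,
  // so the file can be read with a schema it actually matches. (Simplified:
  // this combines the prune and reorder steps in one pass.)
  static Schema pruneToRequested(Schema dataSchema, Schema requestedSchema) {
    List<Schema.Field> kept = requestedSchema.getFields().stream()
        .map(f -> dataSchema.getField(f.name()))
        .map(f -> new Schema.Field(f.name(), f.schema(), f.doc(), f.defaultVal()))
        .collect(Collectors.toList());
    return Schema.createRecord(dataSchema.getName(), dataSchema.getDoc(),
        dataSchema.getNamespace(), dataSchema.isError(), kept);
  }
}
```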
Impact
Hive now supports reading tables that use schema-on-write evolution.
Risk level (write none, low medium or high below)
Medium; covered by the schema-on-write file group reader tests.
Documentation Update
N/A
Contributor's checklist