-
Notifications
You must be signed in to change notification settings - Fork 36
Hive Metastore: Merge hive and avro schema if not consistent #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| cols.addAll(getPartitionCols(table)); | ||
| } | ||
|
|
||
| return convertFieldSchemaToAvroSchema(recordName, recordNamespace, true, cols); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing cols which is a List can we create a struct field schema here and call HiveTypeToAvroType directly? We can get rid of convertFieldSchemaToAvroSchema
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is a tail reference/call, we can just move the code in convertFieldSchemaToAvroSchema inside convertHiveSchemaToAvro. But aside from that, what do you mean by "create a struct field schema"? I see the parseSchemaFromStruct method from HiveTypeToAvroType, but it is a private method, are you referring to this method?
| for (int i = 0; i < fieldNames.size(); ++i) { | ||
| final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i); | ||
| String fieldName = fieldNames.get(i); | ||
| fieldName = removePrefix(fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? I think field names being passed here are relative, they come from StructTypeInfo.getAllStructFieldNames() so I don't think they are qualified from the root. A . in the field name is probably the actual name of the field here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it can be removed? also since we are dealing with hive, I feel the field names won't contain . anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be easy to verify.
| // We don't cache the structType because otherwise it could be possible that a field | ||
| // "lastname" is of type "firstname", where firstname is a compiled class. | ||
| // This will lead to ambiguity. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what this comment means. Which cache are we referring to?
| if (schemaStr != null) { | ||
| schema = AvroSchemaUtil.toIceberg(new org.apache.avro.Schema.Parser().parse(schemaStr)); | ||
| org.apache.avro.Schema avroSchema = new org.apache.avro.Schema.Parser().parse(schemaStr); | ||
| org.apache.avro.Schema hiveSchema = LegacyHiveSchemaUtils.convertHiveSchemaToAvro(table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this step during the conversion we pass mkFieldsOptional as true to make fields nullable, but in the very next line we remove nullables from the schema. Can we just mark fields as non-nullable to being with and remove LegacyHiveSchemaUtils.extractActualTypeIfFieldIsNullableTypeRecord?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, I think I can change the signature of convertHiveSchemaToAvro to add this boolean flag parameter to `convertHiveSchemaToAvro(Table table, boolean mkFieldsOptional), so that this function directly return a non-null version of the schema.
| import org.slf4j.LoggerFactory; | ||
|
|
||
|
|
||
| public class LegacyHiveSchemaUtils { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in this class is way too verbose. This can be heavily simplified by using Visitors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel the same, but I feel this will require non-trivial refactor of the code. I think we also want to publish this code soon, so there is this trade-off.
| import org.codehaus.jackson.node.JsonNodeFactory; | ||
|
|
||
|
|
||
| public class HiveTypeToAvroType { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do later, but the integration test I ran already passed all the tables.
|
|
||
| org.apache.avro.Schema tableSchema = avroSchema; | ||
| boolean isHiveSchemaEvolved = | ||
| LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like isRecordSchemaEvolved has to traverse the whole Schema tree. What do we gain by checking this first rather than just merging directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is actually a good point. I think the 2 logic can be combined into just one pass.
wmoustafa
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still going through the patch. Currently, there are many ways conversions take place and seems they could be simplified. For example, we can go from Hive type string to Hive TypeInfo (one type info to represent the whole schema) then to Avro schema.
| for (int i = 0; i < fieldNames.size(); ++i) { | ||
| final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i); | ||
| String fieldName = fieldNames.get(i); | ||
| fieldName = removePrefix(fieldName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be easy to verify.
| schema = parseSchemaFromUnion((UnionTypeInfo) typeInfo, recordNamespace, recordName); | ||
| break; | ||
| default: | ||
| throw new UnsupportedOperationException("Conversion from " + category + " not supported"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is not supported
| // For example, in tracking.CommunicationRequestEvent.specificRequest, | ||
| // PropGenerated and PropExternalCommunication have the same structure. In case of duplicate typeinfos, we generate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best not to mention actual table and field names.
|
|
||
| final List<FieldSchema> cols = new ArrayList<>(); | ||
|
|
||
| cols.addAll(table.getSd().getCols()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There were concerns around using getSd().getCols(). Could you check if we should use HiveMetastoreClient.getSchema()?
| for (TypeInfo ti : typeInfos) { | ||
| Schema candidate; | ||
| if (ti instanceof StructTypeInfo) { | ||
| StructTypeInfo sti = (StructTypeInfo) ti; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sti --> structTypeInfo
| // a new record type for the duplicates. | ||
| List<Schema> schemas = new ArrayList<>(); | ||
|
|
||
| for (TypeInfo ti : typeInfos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ti --> typeInfo
| private static final String SHORT_TYPE_NAME = "short"; | ||
| private static final String BYTE_TYPE_NAME = "byte"; | ||
|
|
||
| public HiveTypeToAvroType(String namespace, boolean mkFieldsOptional) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to convert this to a utility class and move those parameters to convertFieldsTypeInfoToAvroSchema?
| Schema convertFieldsTypeInfoToAvroSchema(String recordNamespace, String recordName, List<String> fieldNames, | ||
| List<TypeInfo> fieldTypeInfos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid using list of field names and list of field types throughout the PR? For example instead of of List<String> fieldNames and List<TypeInfo> fieldTypeInfos, we just pass StructTypeInfo. Usually the input is already a StructTypeInfo and then it is broken down to two lists then dealt with here. In cases where the original input comes as two lists, we can combine them using TypeInfoFactory. getStructTypeInfo() from Hive.
| // We don't cache the structType because otherwise it could be possible that a field | ||
| // "lastname" is of type "firstname", where firstname is a compiled class. | ||
| // This will lead to ambiguity. | ||
| schema = parseSchemaFromStruct((StructTypeInfo) typeInfo, recordNamespace, recordName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename this and other methods to something like convertStructTypeInfoToAvroSchema?
No description provided.