Add persistent IDs to partition fields (WIP) #499
|
The build failed because of So, possibly an intermittent error, because the Metastore didn't start. Locally, all tests ran fine. |
|
finally got the good PR build. |
|
@rdblue can you please check this one, thanks ! |
| private static final String FIELD_ID = "field-id"; | ||
|
|
||
| private PartitionSpecParser() { | ||
| } |
Please remove any non-functional changes, like this one that moves the private constructor.
| return builder.build(); | ||
| } | ||
|
|
||
|
|
Please avoid adding extra newlines. These can cause avoidable commit conflicts.
|
It looks like this is trying to assign the same IDs for a spec each time it is created, but I think the approach should be to assign IDs to each field in a spec. The JSON serialization should be updated to parse an ID for each field. That's a good place to start, just adding the ability to track an ID for each partition field |
|
@rdblue thanks.
I believe you meant when Avro file is created using |
This is assigning an ID. Those IDs should be statically assigned when the partition spec is created the first time and stored with the field information when it is serialized. The highest assigned ID in any spec should be kept in table-level metadata, like the lastColumnId property.
No; JSON serialization of a partition spec should encode the IDs. That gets put into file metadata, but the main thing is to add field IDs to partition fields. |
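As a rough illustration of the point above, here is a minimal sketch of a spec serialized with an ID on each field. `FieldSketch` and `SpecJsonSketch` are hypothetical stand-ins, not the classes in this PR; the keys `field-id`, `source-id`, `spec-id`, and `fields` match the constants in the diff, while the hand-rolled JSON and the absence of string escaping are simplifying assumptions.

```java
import java.util.List;

// Hypothetical stand-in for a partition field; the transform is kept as a plain String.
final class FieldSketch {
  final int sourceId;
  final int fieldId;
  final String name;
  final String transform;

  FieldSketch(int sourceId, int fieldId, String name, String transform) {
    this.sourceId = sourceId;
    this.fieldId = fieldId;
    this.name = name;
    this.transform = transform;
  }

  // Writes the partition field ID next to the existing source field ID.
  // Assumption: names and transforms contain no characters that need JSON escaping.
  String toJson() {
    return "{\"source-id\":" + sourceId
        + ",\"field-id\":" + fieldId
        + ",\"name\":\"" + name + "\""
        + ",\"transform\":\"" + transform + "\"}";
  }
}

// Hypothetical serializer for a whole spec: a spec ID plus its list of fields.
final class SpecJsonSketch {
  static String toJson(int specId, List<FieldSketch> fields) {
    StringBuilder sb = new StringBuilder();
    sb.append("{\"spec-id\":").append(specId).append(",\"fields\":[");
    for (int i = 0; i < fields.size(); i++) {
      if (i > 0) {
        sb.append(',');
      }
      sb.append(fields.get(i).toJson());
    }
    return sb.append("]}").toString();
  }
}
```

Because the ID travels with the serialized field, a reader of this JSON does not need to re-derive it from the field's position in the spec.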
OK, I started that way, but currently So, in the case of partitionFieldId as well, we can assign it the first time newMetadata is created, but after that it should be part of the PartitionSpec object. Just to verify the partitionFieldIds, this is the manifest file schema (partial, up to where it has the partition field-id) |
|
Yeah, the first step is to add a partition field ID in addition to the existing source field ID. |
Cool. Then we also have to handle cases where the table was created the old way (partition-field if in ) and no partitionSpec was added to that table? |
Yes. In that case, we can get IDs by assigning the same way they would be in the method that returns the partition schema. |
|
Thanks @rdblue for giving more details! Please see the last few commits. I'm trying to add partitionFieldId to the table metadata.
|
| Type sourceType = schema.findType(field.sourceId()); | ||
| Type resultType = field.transform().getResultType(sourceType); | ||
| // assign ids for partition fields starting at PARTITION_DATA_ID_START to leave room for data file's other fields | ||
| // assign ids for partition fields starting at 1000 to leave room for data file's other fields |
This is no longer assigning IDs, so the comment can be removed.
let me take care.
| private final Transform<?, ?> transform; | ||
|
|
||
| PartitionField(int sourceId, String name, Transform<?, ?> transform) { | ||
| PartitionField(int sourceId, int partitionFieldId, String name, Transform<?, ?> transform) { |
The name partitionFieldId is redundant because the class is PartitionField.
Let's use the same convention that is used in types. The PartitionField should have an id instance variable that is accessed by a fieldId method.
sure.
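For reference, the naming convention requested above might look roughly like the following. `PartitionFieldSketch` is a hypothetical, trimmed-down shape, and the transform is a plain `String` only to keep the sketch self-contained.

```java
// Hypothetical shape: an `id` instance variable exposed through a `fieldId()` accessor.
final class PartitionFieldSketch {
  private final int sourceId;
  private final int id;
  private final String name;
  private final String transform;

  PartitionFieldSketch(int sourceId, int id, String name, String transform) {
    this.sourceId = sourceId;
    this.id = id;
    this.name = name;
    this.transform = transform;
  }

  // ID of the source column in the table schema.
  int sourceId() {
    return sourceId;
  }

  // ID of this partition field; "partition" is implied by the class name.
  int fieldId() {
    return id;
  }

  String name() {
    return name;
  }

  String transform() {
    return transform;
  }
}
```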
| .add("value_counts", "null_value_counts", "lower_bounds", "upper_bounds") | ||
| .build(); | ||
| private static final String PARTITION_SPEC = "partition-spec"; | ||
| private static final String SCHEMA = "schema"; |
These changes aren't functional. Can you please remove them?
| private static final String SPEC_ID = "spec-id"; | ||
| private static final String FIELDS = "fields"; | ||
| private static final String SOURCE_ID = "source-id"; | ||
| private static final String PARTITION_FIELD_ID = "partition-field-id"; |
I think "field-id" is fine here. The "partition" part is clear from context.
| @Override | ||
| public void commit() { | ||
| TableMetadata update = applyChangesToMapping(base.updateSchema(apply(), lastColumnId)); | ||
| TableMetadata update = applyChangesToMapping(base.updateSchema(apply(), lastColumnId, lastPartitionFieldId)); |
Why is the last field ID passed here?
Yeah, it's not required.
As it's a schema update, ideally we don't need to keep lastPartitionFieldId at the schemaUpdate level.
| } | ||
|
|
||
| public int lastPartitionFieldId() { | ||
| return lastPartitionFieldId; |
These names are good.
|
thanks @rdblue. |
|
Yes, we do want to reuse the fields across specs. We might want to make equality ignore the field ID for this purpose. |
Force-pushed from c330531 to 9f15a3a
|
@rdblue please see, I updated the PR to reuse field IDs. |
Force-pushed from 0229751 to 8e6d147
|
@rdblue it would be helpful, if you check this. thanks ! |
|
@rdblue it would be helpful, if you review this. thanks ! |
| public class PartitionSpec implements Serializable { | ||
| // start assigning IDs for partition fields at 1000 | ||
| private static final int PARTITION_DATA_ID_START = 1000; | ||
| public static final int PARTITION_DATA_ID_START = 1000; |
Does this need to be public or can it be package-private?
Also, why remove the comment?
| return false; | ||
| } | ||
|
|
||
| // not considering field id, as field-id will be reused. |
ID will be reused, but assignment is consistent because we assume that partition specs are not modified before the addition of partition field IDs. That means that tables start with only one spec that might not have IDs. Because we assign incrementally, IDs will always match when assigned using the default (1000, 1001, etc.).
Because we do have consistent IDs, I think this should check field ID here.
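A minimal sketch of what checking the field ID in equality could look like, assuming a simplified field class. `FieldEqualitySketch` is hypothetical and keeps the transform as a `String`; the real class compares a `Transform` object.

```java
import java.util.Objects;

// Hypothetical field class whose equality includes the partition field ID.
final class FieldEqualitySketch {
  private final int sourceId;
  private final int id;
  private final String name;
  private final String transform;

  FieldEqualitySketch(int sourceId, int id, String name, String transform) {
    this.sourceId = sourceId;
    this.id = id;
    this.name = name;
    this.transform = transform;
  }

  @Override
  public boolean equals(Object other) {
    if (this == other) {
      return true;
    }
    if (!(other instanceof FieldEqualitySketch)) {
      return false;
    }
    FieldEqualitySketch that = (FieldEqualitySketch) other;
    // The field ID is compared along with the source ID, name, and transform.
    return sourceId == that.sourceId
        && id == that.id
        && name.equals(that.name)
        && transform.equals(that.transform);
  }

  @Override
  public int hashCode() {
    return Objects.hash(sourceId, id, name, transform);
  }
}
```

Under the assumption stated in the comment above (default IDs are always assigned incrementally from 1000), two copies of an old spec that both receive default IDs still compare equal, so including the ID does not break reuse.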
| return false; | ||
| } | ||
|
|
||
| // not considering field id, as field-id will be reused. |
Nit: adding this comment removed a spacing line. Could you add it back?
| this.schema = schema; | ||
| } | ||
|
|
||
| private int incrementAndGetPartitionFieldId() { |
How about nextFieldId? That's a much shorter name, but is still descriptive.
| } else { | ||
| partitionFieldId = partitionFieldId + 1; | ||
| } | ||
| builder.add(sourceId, partitionFieldId, name, transform); |
It seems odd that partitionFieldId is incremented from the last value when missing. If a spec has one missing field ID, then it will be assigned based on the previous field's ID. I don't think this would cause problems because we expect either all fields to have assigned IDs, or no fields to have them.
I'd prefer to keep the logic for those cases separate to make this easier to follow. It isn't a good practice to rely on a hidden assumption that either all fields have ids or none do.
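One way to keep the two cases separate, sketched under assumptions: `resolveIds` is a hypothetical helper that takes one parsed ID per field, with `null` where the serialized field had none. Rejecting a spec that mixes the two is my assumption about the desired behavior, not something the review states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that makes the "all IDs present" and "no IDs present" cases explicit.
final class FieldIdAssignmentSketch {
  static final int PARTITION_DATA_ID_START = 1000;

  static List<Integer> resolveIds(List<Integer> parsedIds) {
    int withIds = 0;
    for (Integer id : parsedIds) {
      if (id != null) {
        withIds++;
      }
    }

    if (withIds == parsedIds.size()) {
      // Case 1: every field carries an ID, so use the IDs exactly as written.
      return new ArrayList<>(parsedIds);
    }

    if (withIds == 0) {
      // Case 2: an old spec with no IDs, so assign defaults 1000, 1001, ... by position.
      List<Integer> assigned = new ArrayList<>(parsedIds.size());
      for (int i = 0; i < parsedIds.size(); i++) {
        assigned.add(PARTITION_DATA_ID_START + i);
      }
      return assigned;
    }

    // Assumed behavior: a mix of present and missing IDs is treated as invalid input.
    throw new IllegalArgumentException(
        "Cannot resolve field IDs: " + (parsedIds.size() - withIds)
            + " of " + parsedIds.size() + " fields have no ID");
  }
}
```

With the branches split this way, no ID is ever derived from the previous field's ID, so the all-or-nothing expectation is enforced rather than assumed.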
| } | ||
|
|
||
| private static Map<String, Integer> indexPartitionFieldIdByColumnName(List<PartitionSpec> specs) { | ||
| Map<String, Integer> result = new HashMap<>(); |
Please use Maps.newHashMap instead of instantiating one directly.
|
|
||
| executorService.shutdown(); | ||
| Assert.assertTrue("Timeout", executorService.awaitTermination(2, TimeUnit.MINUTES)); | ||
| Assert.assertTrue("Timeout", executorService.awaitTermination(5, TimeUnit.MINUTES)); |
I don't think these changes are still needed.
| static final String SNAPSHOT_ID = "snapshot-id"; | ||
| static final String TIMESTAMP_MS = "timestamp-ms"; | ||
| static final String SNAPSHOT_LOG = "snapshot-log"; | ||
| static final String FIELDS = "fields"; |
Is this used?
| List<PartitionField> fields = specs.get(specs.size() - 1).fields(); | ||
| if (fields.size() > 0) { | ||
| // get the last lastPartitionFieldId | ||
| lastAssignedPartitionFieldId = fields.get(fields.size() - 1).fieldId(); |
Instead of getting the last ID, I think this should just keep track of the last assigned partition field ID, like the last-column-id. How about storing it as last-partition-id?
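To illustrate why tracking the value explicitly differs from reading the last field of the last spec, here is a small sketch. `LastPartitionIdSketch` is a hypothetical tracker, not part of `TableMetadata`; it only shows the bookkeeping idea.

```java
import java.util.List;

// Hypothetical tracker for the highest partition field ID assigned in any spec.
final class LastPartitionIdSketch {
  private int lastAssignedPartitionId;

  LastPartitionIdSketch(int lastAssignedPartitionId) {
    this.lastAssignedPartitionId = lastAssignedPartitionId;
  }

  // Records the field IDs of a newly added spec; the tracked value never decreases.
  void addSpec(List<Integer> fieldIds) {
    for (int id : fieldIds) {
      lastAssignedPartitionId = Math.max(lastAssignedPartitionId, id);
    }
  }

  int lastAssignedPartitionId() {
    return lastAssignedPartitionId;
  }
}
```

If a first spec uses IDs 1000 through 1002 and a later spec reuses only field 1000, the last field of the last spec reports 1000, while the tracked value stays at 1002, so the next new field would not collide with an ID that was already handed out.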
| // increment and assign new id, if this column_transform has not used in partition yet. | ||
| (partitionFieldIdByColumnName == null) ? nextPartitionFieldId.incrementAndGet() | ||
| : ((partitionFieldIdByColumnName.containsKey(field.name())) ? partitionFieldIdByColumnName.get(field.name()) | ||
| : nextPartitionFieldId.incrementAndGet()), |
It is difficult to read nested ternary expressions. I recommend avoiding that pattern.
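The nested ternary in the diff could be unrolled into a small helper along these lines. `fieldIdFor` is a hypothetical name; the parameter names are taken from the diff, and the sketch assumes the map never stores `null` values.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper replacing the nested ternary with plain if statements.
final class ReuseOrAssignSketch {
  static int fieldIdFor(
      String fieldName,
      Map<String, Integer> partitionFieldIdByColumnName,
      AtomicInteger nextPartitionFieldId) {
    if (partitionFieldIdByColumnName != null) {
      // Reuse the ID if this field name was already used in an earlier spec.
      Integer existing = partitionFieldIdByColumnName.get(fieldName);
      if (existing != null) {
        return existing;
      }
    }
    // No index at all, or the name is new: increment the counter and assign.
    return nextPartitionFieldId.incrementAndGet();
  }
}
```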
|
|
||
| return specBuilder.build(); | ||
| PartitionSpec freshSpec = specBuilder.build(); | ||
| return freshSpec; |
Looks like this change is unnecessary.
| // get a fresh spec to ensure the spec ID is set to the new default | ||
| builder.add(freshSpec(newDefaultSpecId, schema, newPartitionSpec)); | ||
| PartitionSpec freshSpec = freshSpecWithAssignIds(newDefaultSpecId, schema, schema, newPartitionSpec, | ||
| nextPartitionFieldId, partitionFieldIdByColumnName); |
I think this should work like updateSchema, where a different class is responsible for reassigning IDs. The TableMetadata class should validate consistency and help with tracking (like the snapshot log) but it shouldn't modify other objects that are passed in, like schemas, snapshots, and partition specs.
@rdblue
Are we referring to this:
Schema freshSchema = TypeUtil.assignFreshIds(schema, lastColumnId::incrementAndGet);
since TypeUtil.assignFreshIds assigns IDs to the schema?
| PartitionSpec.Builder specBuilder = PartitionSpec.builderFor(schema) | ||
| private static PartitionSpec freshSpecWithAssignIds(int specId, Schema newSchema, Schema oldSchema, | ||
| PartitionSpec partitionSpec, AtomicInteger nextPartitionFieldId, | ||
| Map<String, Integer> partitionFieldIdByColumnName) { |
I don't think that it is necessary to have this method here. The freshSpec method just ensures that all specs have the correct schema associated.
| PartitionSpec freshSpec = specBuilder.build(); | ||
| AtomicInteger lastPartitionFieldId = new AtomicInteger(PartitionSpec.PARTITION_DATA_ID_START - 1); | ||
| PartitionSpec freshSpec = freshSpecWithAssignIds(INITIAL_SPEC_ID, freshSchema, schema, spec, lastPartitionFieldId, | ||
| null); |
I think that this should assign fresh IDs to the partition spec fields, but you can just add the ID to the existing code.
Also, if you are using an AtomicInteger, you can use getAndIncrement to avoid needing to subtract 1 from the ID starting point.
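The two counter styles can be compared side by side. `CounterStartSketch` is a hypothetical class for illustration; only `PARTITION_DATA_ID_START = 1000` comes from the diff.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical comparison of the two ways to start an ID counter at 1000.
final class CounterStartSketch {
  static final int PARTITION_DATA_ID_START = 1000;

  // Style in the diff: start one below and pre-increment.
  static int[] firstTwoWithIncrementAndGet() {
    AtomicInteger last = new AtomicInteger(PARTITION_DATA_ID_START - 1);
    return new int[] {last.incrementAndGet(), last.incrementAndGet()};
  }

  // Suggested style: start at the first ID and post-increment.
  static int[] firstTwoWithGetAndIncrement() {
    AtomicInteger next = new AtomicInteger(PARTITION_DATA_ID_START);
    return new int[] {next.getAndIncrement(), next.getAndIncrement()};
  }
}
```

Both methods yield 1000 and then 1001; the second avoids the `- 1` at the starting point, at the cost of the counter holding the next ID rather than the last assigned one.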
|
@manishmalhotrawork, I've added review comments. Sorry I wasn't able to get back to this sooner! |
|
@rdblue np, thanks for reviewing! Maybe I'll raise a new PR with the required changes; it would be cleaner. |
|
Thanks for working on it, @manishmalhotrawork. If you do open a new PR, please remember to close this one. Up to you which one you want to do. |
|
@manishmalhotrawork |
|
I'm closing this because it has been picked up as #845. |
for #280.
@rdblue can you please review.
Raising as WIP PR, as this might need some changes.
Summary: parse the manifest-schema JSON to find the partition field IDs and initialize PartitionSpec based on the last ID (if available), otherwise 1000.