Skip to content

Conversation

@RussellSpitzer
Copy link
Member

Previously getProjectedIds would only return leaf nodes and primitives that
were selected. This made it impossible to return empty structs. To fix this
we change the behavior to return all id's of required fields including structs.
This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes. Previous use cases of TypeUtil.select are converted to
TypeUtil.project, which inverses this new getProjecetedIds code.

if (fieldResult == null) {
fieldIds.add(field.fieldId());
}
fieldIds.add(field.fieldId());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to my comment on the other PR. I don't think that we should return inner IDs other than struct IDs or else it isn't clear how lists and maps should be handled.

@kbendick
Copy link
Contributor

This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes.

Do you know if anything needs to be done for ORC @RussellSpitzer? I'm helping out with ORC more going forward and if you're aware of anything that needs to be updated there, if you don't have time to update it, if you make an issue I'll see if I can grab it (or at the least we'll have the issue to track).

If you're not sure then possibly you can add a unit test for ORC as well? If it fails, open an issue (or I will) and then we can follow up on it after. =)

Set<Integer> projectedIds = getIdsInternal(struct);
projectedIds.removeAll(fieldIds);
return select(struct, projectedIds);
return project(struct, projectedIds);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue here is selectNot doest not actually deselect children when a parent ID is not selected. Previously this is because getProjectedIDs (behind getIdsInternal) would not return parent struct ids, so removing it from the set of projectedIds would not do anything.

Now It will not work because removing a parentID still leaves all child IDs. We could fix this but it would be a change in behavior from the previous code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I agree with the decision to not change the behavior of this method, even though the opposite of "select" behavior would be to fully remove a struct when its ID is passed in fieldIds.

But I don't think that project is quite correct either. Consider the example schema 1: id bigint, 2: location struct<3: lat double, 4: long double>. Previously, selectNot(t, set(3, 4)) would produce 1: id bigint and omit the location entirely. Using project with the updated GetProjectedIds, the projected ID set will be {1, 2, 3, 4} and not {1, 3, 4}. That would result in the same call producing 1: id bigint, 2: location struct<>, which introduces a new bug because now there is an unexpected extra field.

To clean this up, I think we need a version of GetProjectedIds that doesn't select structs and uses the old behavior.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like the right behavior to me? Shouldn't you be required to explicitly omit the parent if you don't want the that element? Otherwise there would be no way to "selectNot" and only get back the empty struct.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrote up these test cases, i'll run the full test suite to make sure this works with our other usages

    Schema schema = new Schema(
        Lists.newArrayList(
            required(1, "id", Types.LongType.get()),
            required(2, "location", Types.StructType.of(
                required(3, "lat", Types.DoubleType.get()),
                required(4, "long", Types.DoubleType.get())
            ))));

    Schema expectedNoPrimitive = new Schema(
        Lists.newArrayList(
            required(2, "location", Types.StructType.of(
                required(3, "lat", Types.DoubleType.get()),
                required(4, "long", Types.DoubleType.get())
            ))));

    Schema actualNoPrimitve = TypeUtil.selectNot(schema, Sets.newHashSet(1));
    Assert.assertEquals(expectedNoPrimitive.asStruct(), actualNoPrimitve.asStruct());

    // Expected legacy behavior is to completely remove structs if their elements are removed
    Schema expectedNoStructElements = new Schema(required(1, "id", Types.LongType.get()));
    Schema actualNoStructElements = TypeUtil.selectNot(schema, Sets.newHashSet(3, 4));
    Assert.assertEquals(expectedNoStructElements.asStruct(), actualNoStructElements.asStruct());

    // Expected legacy behavior is to ignore selectNot on struct elements.
    Schema actualNoStruct = TypeUtil.selectNot(schema, Sets.newHashSet(2));
    Assert.assertEquals(schema.asStruct(), actualNoStruct.asStruct());
    ```

@RussellSpitzer
Copy link
Member Author

This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primitive nodes.

Do you know if anything needs to be done for ORC @RussellSpitzer? I'm helping out with ORC more going forward and if you're aware of anything that needs to be updated there, if you don't have time to update it, if you make an issue I'll see if I can grab it (or at the least we'll have the issue to track).

If you're not sure then possibly you can add a unit test for ORC as well? If it fails, open an issue (or I will) and then we can follow up on it after. =)

I don't think so, we don't have any custom ORC projection code as far as I can tell. It just uses the output of a TypeUtil.selectNot()

Schema readSchemaWithoutConstantAndMetadataFields = TypeUtil.selectNot(readSchema,
Sets.union(idToConstant.keySet(), MetadataColumns.metadataFieldIds()));

against the ORC schema to determine what to read and then uses that directly here :

public <D> CloseableIterable<D> build() {
Preconditions.checkNotNull(schema, "Schema is required");
return new OrcIterable<>(file, conf, schema, nameMapping, start, length, readerFunc, caseSensitive, filter,
batchedReaderFunc, recordsPerBatch);
}

Previously getProjectedIds would only return leaf nodes and primitives that
were selected. This made it impossible to return empty structs. To fix this
we change the behavior to return all id's of required fields including structs.
This in turn requires fixing the alternate PruneColumn methods for Avro
and Parquet to respect that they will now have selected field ID's for non
primtiive nodes. Previous use cases of TypeUtil.select are converted to
TypeUtil.project, which inverses this new getProjecetedIds code.
I was working on this too late at night.
The changed behavior of getProjectedIds and "select" means that
selectNot needs to be implemented with Project.
public static StructProjection create(Schema schema, Set<Integer> ids) {
StructType structType = schema.asStruct();
return new StructProjection(structType, TypeUtil.select(structType, ids));
return new StructProjection(structType, TypeUtil.project(structType, ids));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there aren't any uses of this call, which is good. I agree that we probably want this to use project instead of select.

requiredFieldIds.addAll(selectedIds);

return TypeUtil.select(schema, requiredFieldIds);
return TypeUtil.project(schema, requiredFieldIds);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this because it is the opposite of GetProjectedIds used above.

if (selectedIds.contains(fieldId)) {
filteredFields.add(copyField(field, field.schema(), fieldId));
if (fieldSchema != null) {
filteredFields.add(copyField(field, fieldSchema, fieldId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that I understand the reason for this change. Is this implementing the same change as the previous PR, but in the Avro PruneColumns?

It looks like if a struct field is selected and a sub-field is selected, then the selection for the struct isn't a full selection. But if a sub-field is not selected then the selection for the struct is a full selection. That doesn't make sense to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I'm thinking about this more, I think that the behavior in this class should always match project. I doubt there's a case where we want select behavior, right? In that case, shouldn't the else case check whether the type is a record and create an empty record?

public void testPartitionsTable() {
TableIdentifier tableIdentifier = TableIdentifier.of("db", "partitions_test");
Table table = createTable(tableIdentifier, SCHEMA, PartitionSpec.builderFor(SCHEMA).identity("id").build());
Table table = createTable(tableIdentifier, SCHEMA, SPEC);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes needed? I don't have a problem with them, but I don't think they are required right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, just a clean up I had in the original PR, I can split this into another pull request if needed

Have PruneColumns Avro mimic PruneColumns Iceberg
Adjust TypeUtil.selectNot to better mimic the old behavior and added tests
@RussellSpitzer
Copy link
Member Author

@rdblue All fixed up

hasChange = true;
builder.addField(field);
} else {
builder.addField(originalField);
Copy link
Member Author

@RussellSpitzer RussellSpitzer Sep 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I do the empty message only thing here as well? where we copy the struct to be empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?


private static Set<Integer> getIdsInternal(Type type) {
return visit(type, new GetProjectedIds());
return getIdsInternal(type, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a lot of changes to remove this method? I wouldn't expect it since this is private. I'd probably prefer removing it, but up to you if it touches a bunch of unrelated code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are pretty safe here, removing it. Only two changes in this file

if (isRecord(field.schema())) {
filteredFields.add(copyField(field, makeEmptyCopy(field.schema()), fieldId));
} else {
filteredFields.add(copyField(field, field.schema(), fieldId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so in this case the field is a map, list, or primitive. Then we just follow the old behavior. Looks good to me since the Iceberg schema selection would fail.

if (selectedIds.contains(fieldId)) {
filteredFields.add(copyField(field, field.schema(), fieldId));
if (fieldSchema != null) {
filteredFields.add(copyField(field, fieldSchema, fieldId));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to set hasChange in the cases where we don't return field.schema() for the field, right?

Copy link
Member Author

@RussellSpitzer RussellSpitzer Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are actually fine here unless every field is selected because the logic for has change is a bit confusing.

You either

  1. Have a change (Make a new record using the filtered fields) ( Return Changed Records)
  2. Have no change and filtered fields size is the same as the original number of fields ( Return Original Record)
  3. Have no change and filtered field size is not empty (Make a new record using the filtered fields) (Return changed record)

Currently we have tests hitting 1 and 3 but not 2 :/
I'll add the "hasChange" flag

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good now.

builder.addField(field);
} else {
if (isStruct(originalField)) {
builder.addField(originalField.asGroupType().withNewFields(Collections.emptyList()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should hasChange be true here as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the ordering of the statements under (field != null) so that they match the usages here

@rdblue
Copy link
Contributor

rdblue commented Sep 19, 2021

@RussellSpitzer, looking good! My one remaining concern is the hasChange behavior. That may also be causing tests to not do what you expect.

@RussellSpitzer
Copy link
Member Author

@rdblue patched up, I think I got that covered. Should be set now. I also removed that private function as recommended

record.put("id", 34L);
Record location = new Record(record.getSchema().getField("location").schema());
location.put("lat", 52.995143f);
location.put("long", -1.539054f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Odd location to choose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was already in the test suite and came in the Netflix original commit :) So you'll have to ask whoever wrote the first version.

https://github.com/apache/iceberg/blame/master/core/src/test/java/org/apache/iceberg/avro/TestReadProjection.java#L205-L206

}

@Test
public void testEmptyStructRequiredProjection() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this identical to the test case above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, I see that the write schema has the struct as optional.

@rdblue rdblue merged commit fafe33a into apache:master Oct 3, 2021
@rdblue
Copy link
Contributor

rdblue commented Oct 3, 2021

Thanks, @RussellSpitzer!

@RussellSpitzer
Copy link
Member Author

Thanks @rdblue I know this one was a bit of a pain, but we should hopefully have all of our projection issues fixed now 🤞

@RussellSpitzer RussellSpitzer deleted the GetAllProjectedIds branch October 3, 2021 23:57
@szehon-ho
Copy link
Member

🎉 Great work !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants