Skip to content

Conversation

@aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Jan 16, 2020

This PR extends the information stored in the manifest list with row stats to avoid touching manifests in #675. This change is backward and forward-compatible.

This addresses #733.

import org.junit.Assert;
import org.junit.Test;

public class TestManifestWriter extends TableTestBase {
Copy link
Contributor Author

@aokolnychyi aokolnychyi Jan 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, this doesn't have to extend TableTestBase but we need writeManifest. I think we can refactor a bit and introduce a separate parent test class with the logic for writing/checking manifests/snapshots.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to extract out base classes for just manifest and snapshots.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way is fine with me. I usually prefer to keep things simple.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to refactor but it looks the changes won't be worth the effort after a closer look. For example, we will need to pass a partition spec and FileIO to write manifests that we can simply take from table in TableTestBase. Let's keep it as is for now.

))),
optional(512, "added_rows_count", Types.LongType.get()),
optional(513, "existing_rows_count", Types.LongType.get()),
optional(514, "deleted_rows_count", Types.LongType.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add these up by the data files counts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, not sure I got. Do you mean whether we can avoid storing these and add them up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant that we don't need to add these at the end of the schema. We can add them up by the file count columns, like you do with all of the accessor methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let me update this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

@aokolnychyi aokolnychyi Jan 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, reordering actually led to failures while writing manifest lists as GenericAvroWriter doesn't take into account field ids.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, do we want to build something as ProjectionDatumReader for the write side?

Copy link
Contributor Author

@aokolnychyi aokolnychyi Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, I reverted this change. I propose to merge this PR as is and create a follow-up issue to implement writers that take into account field ids.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Writers don't use field IDs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, yes, GenericAvroWriter doesn't. I think SparkAvroWriter does respect field ids.

@rdblue
Copy link
Contributor

rdblue commented Jan 17, 2020

Looks good to me.

This reverts commit 873ae91
@rdblue rdblue merged commit fb1eb3a into apache:master Jan 23, 2020
@rdblue
Copy link
Contributor

rdblue commented Jan 23, 2020

Thanks @aokolnychyi! I'll merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants