Skip to content

Conversation

@ajantha-bhat
Copy link
Member

@ajantha-bhat ajantha-bhat commented Nov 24, 2022

Background:

Spec has a snapshot id in two places, one in StatisticsFile and another in its blob metadata.
To support the reuse of statistics files, we should have the referenced snapshot id in StatisticsFile, not the computed-from snapshot id. Hence, updated the spec.

Note that PR #6090 is stuck because of confusion around stats file reuse.

@ajantha-bhat ajantha-bhat marked this pull request as draft November 24, 2022 17:27
@github-actions github-actions bot added the core label Nov 24, 2022
@ajantha-bhat
Copy link
Member Author

@rdblue, @findepi: Please let me know your thoughts on this.

@ajantha-bhat ajantha-bhat force-pushed the stats branch 4 times, most recently from 8537ce6 to 1e0906e Compare December 3, 2022 09:39
private List<Snapshot> snapshots;
private final Map<String, SnapshotRef> refs;
private final Map<Long, List<StatisticsFile>> statisticsFiles;
private final Map<Long, List<StatisticsFile>> statisticsFilesById;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought better to rename this similar to snapshotsById

"snapshotId does not match: %s vs %s",
snapshotId,
statisticsFile.snapshotId());
statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface supports returning multiple stats file for one snapshot. But was always overwriting instead of appending. So, fixed it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not correct. There should be only one stats file for a snapshot that contains all information. Allowing multiple stats files for a snapshot is going to lead to lazy implementations that don't merge the files and slower job planning.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdblue, @findepi:
It is just as per the previous existing interface in the builder.
we had Map<Long, List<StatisticsFile>> statisticsFiles in builder.
If we wanted one stats file per interface then no need to have list for map value?

Also is it possible that there will be one partition level stats (avro/parquet file) and there is one table level NDV file per snapshot. In this case, no need to merge them but can be used based on the need. So, still ok to have multiple stats per snapshot id?

Assert.assertEquals("Statistics file path", "/some/path/to/stats/file2", statisticsFile.path());
Assertions.assertThat(withStatisticsAppended.statisticsFiles())
.as("There should be two statistics files registered")
.hasSize(2)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I fixed the overwrite to append. There will be two stats file for this snapshot id. Hence, updated the testcases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect.

@ajantha-bhat ajantha-bhat marked this pull request as ready for review December 3, 2022 10:34
@ajantha-bhat
Copy link
Member Author

@rdblue, @findepi: PR is ready for review. This should help in unblocking #6090, #6091

@rdblue
Copy link
Contributor

rdblue commented Dec 8, 2022

@ajantha-bhat, I'm fine with the update to the spec wording, but I don't think the rest of these changes are needed. There should be only one stats file per snapshot in metadata. There is no value in having multiple stats files and implementations should merge them.

@ajantha-bhat
Copy link
Member Author

@ajantha-bhat, I'm fine with the update to the spec wording, but I don't think the rest of these changes are needed.

I think we still need an interface to get the current stats file for snapshot id currentStatisticsFiles()

There should be only one stats file per snapshot in metadata.
There is no value in having multiple stats files and implementations should merge them.

Maybe true. But It was just as per the previous existing interface in the builder.
we had Map<Long, List<StatisticsFile>> statisticsFiles in builder.
If we wanted one stats file per interface then no need to have list for map value?

Also is it possible that there will be one partition level stats (avro/parquet file) and there is one table level NDV file per snapshot. In this case, no need to merge them but can be used based on the need. So, still ok to have multiple stats per snapshot id?

@rdblue
Copy link
Contributor

rdblue commented Dec 12, 2022

I think we still need an interface to get the current stats file for snapshot id currentStatisticsFiles()

No, I don't think this is valuable. This depends too much on some definition of "current", which is about as useful as asking for any stats file. This should be more specific. I agree that we want some way to get a stats file, but I don't think there's much value in the "current" idea.

It was just as per the previous existing interface in the builder.

I wouldn't worry too much about intermediate representations. What we have in the current API is fine for now. I don't think that we should make the changes in this PR.

@ajantha-bhat ajantha-bhat changed the title Core: Update StatisticsFile interface in TableMetadata spec Docs: Update spec about statistics file snapshot id Dec 12, 2022
@ajantha-bhat
Copy link
Member Author

@rdblue: Thanks for the review and suggestions. I have kept it as just the document (spec) update now.

@rdblue
Copy link
Contributor

rdblue commented Dec 19, 2022

Thanks, @ajantha-bhat! Merging this.

@rdblue rdblue merged commit bd526a1 into apache:master Dec 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants