Docs: Update spec about statistics file snapshot id #6267
private List<Snapshot> snapshots;
private final Map<String, SnapshotRef> refs;
private final Map<Long, List<StatisticsFile>> statisticsFiles;
private final Map<Long, List<StatisticsFile>> statisticsFilesById;
I thought it better to rename this to match snapshotsById.
"snapshotId does not match: %s vs %s",
snapshotId,
statisticsFile.snapshotId());
statisticsFiles.put(statisticsFile.snapshotId(), ImmutableList.of(statisticsFile));
The interface supports returning multiple stats files for one snapshot, but this was always overwriting instead of appending. So I fixed it.
This is not correct. There should be only one stats file for a snapshot that contains all information. Allowing multiple stats files for a snapshot is going to lead to lazy implementations that don't merge the files and slower job planning.
@rdblue, @findepi:
This just follows the existing interface in the builder, which already had Map<Long, List<StatisticsFile>> statisticsFiles.
If we wanted one stats file per snapshot, there would be no need for a list as the map value.
Also, could there be one partition-level stats file (Avro/Parquet) and one table-level NDV file per snapshot? In that case they would not need to be merged and could each be used as needed. So is it still OK to have multiple stats files per snapshot id?
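For readers following the thread, the overwrite-versus-append behavior under discussion can be sketched as below. This is only an illustration: StatsFileMapSketch, StatsFile, putOverwriting, and putAppending are hypothetical names, not the Iceberg builder code, and the sketch takes no position on which behavior the spec should allow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StatsFileMapSketch {
    // Hypothetical stand-in for a statistics file entry; not the Iceberg interface.
    static final class StatsFile {
        final long snapshotId;
        final String path;

        StatsFile(long snapshotId, String path) {
            this.snapshotId = snapshotId;
            this.path = path;
        }
    }

    // Overwrite: the new file replaces whatever was registered for the snapshot,
    // so each snapshot id maps to at most one file.
    static void putOverwriting(Map<Long, List<StatsFile>> byId, StatsFile file) {
        List<StatsFile> single = new ArrayList<>();
        single.add(file);
        byId.put(file.snapshotId, single);
    }

    // Append: files already registered for the snapshot are kept and the new one is added.
    static void putAppending(Map<Long, List<StatsFile>> byId, StatsFile file) {
        byId.computeIfAbsent(file.snapshotId, id -> new ArrayList<>()).add(file);
    }

    public static void main(String[] args) {
        Map<Long, List<StatsFile>> overwritten = new HashMap<>();
        Map<Long, List<StatsFile>> appended = new HashMap<>();
        StatsFile first = new StatsFile(1L, "/some/path/to/stats/file1");
        StatsFile second = new StatsFile(1L, "/some/path/to/stats/file2");

        putOverwriting(overwritten, first);
        putOverwriting(overwritten, second);
        putAppending(appended, first);
        putAppending(appended, second);

        System.out.println("overwrite keeps " + overwritten.get(1L).size() + " file(s)");
        System.out.println("append keeps " + appended.get(1L).size() + " file(s)");
    }
}
```

With the overwrite variant the list value never holds more than one element, which is the point raised above about the list type being unnecessary if only one stats file per snapshot is intended.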
Assert.assertEquals("Statistics file path", "/some/path/to/stats/file2", statisticsFile.path());
Assertions.assertThat(withStatisticsAppended.statisticsFiles())
.as("There should be two statistics files registered")
.hasSize(2)
Since I changed the overwrite to an append, there will be two stats files for this snapshot id. Hence, I updated the test cases.
I think this is incorrect.
@ajantha-bhat, I'm fine with the update to the spec wording, but I don't think the rest of these changes are needed. There should be only one stats file per snapshot in metadata. There is no value in having multiple stats files and implementations should merge them.
I think we still need an interface to get the current stats file for snapshot id
Maybe true, but it just follows the existing interface in the builder. Also, could there be one partition-level stats file (Avro/Parquet) and one table-level NDV file per snapshot? In that case they would not need to be merged and could each be used as needed. So is it still OK to have multiple stats files per snapshot id?
No, I don't think this is valuable. This depends too much on some definition of "current", which is about as useful as asking for any stats file. This should be more specific. I agree that we want some way to get a stats file, but I don't think there's much value in the "current" idea.
I wouldn't worry too much about intermediate representations. What we have in the current API is fine for now. I don't think that we should make the changes in this PR.
@rdblue: Thanks for the review and suggestions. I have kept it as just the document (spec) update now.
Thanks, @ajantha-bhat! Merging this.
Background:
The spec has a snapshot id in two places: one in StatisticsFile and another in its blob metadata.
To support the reuse of statistics files, we should have the referenced snapshot id in StatisticsFile, not the computed-from snapshot id. Hence, updated the spec.
Note that PR #6090 is stuck because of confusion around stats file reuse.
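One way to picture the distinction between the two snapshot ids is the sketch below. It is a rough model under stated assumptions, not the Iceberg API: StatsReuseSketch, StatisticsFileEntry, BlobMetadata, reuseFor, and the blob type string are all hypothetical names chosen for illustration. The file-level id says which snapshot the entry is registered for, while the blob-level id keeps the snapshot the statistics were computed from.

```java
import java.util.ArrayList;
import java.util.List;

public class StatsReuseSketch {
    // Hypothetical model of one blob's metadata inside a statistics file.
    static final class BlobMetadata {
        final String type;
        final long computedFromSnapshotId; // snapshot the blob was computed from

        BlobMetadata(String type, long computedFromSnapshotId) {
            this.type = type;
            this.computedFromSnapshotId = computedFromSnapshotId;
        }
    }

    // Hypothetical model of a statistics file entry in table metadata.
    static final class StatisticsFileEntry {
        final long snapshotId; // snapshot this entry is registered for (the referenced id)
        final String path;
        final List<BlobMetadata> blobs;

        StatisticsFileEntry(long snapshotId, String path, List<BlobMetadata> blobs) {
            this.snapshotId = snapshotId;
            this.path = path;
            this.blobs = blobs;
        }
    }

    // Reuse: register the same physical file for a newer snapshot.
    // Only the file-level id changes; blob metadata is left untouched.
    static StatisticsFileEntry reuseFor(long newSnapshotId, StatisticsFileEntry existing) {
        return new StatisticsFileEntry(newSnapshotId, existing.path, existing.blobs);
    }

    public static void main(String[] args) {
        List<BlobMetadata> blobs = new ArrayList<>();
        blobs.add(new BlobMetadata("example-blob-type", 10L)); // hypothetical type name
        StatisticsFileEntry original = new StatisticsFileEntry(10L, "/some/path/to/stats/file1", blobs);
        StatisticsFileEntry reused = reuseFor(11L, original);

        System.out.println("file-level snapshot id: " + reused.snapshotId);
        System.out.println("blob-level snapshot id: " + reused.blobs.get(0).computedFromSnapshotId);
    }
}
```

If the file-level field instead held the computed-from id, the reused entry could not be told apart from the original, which appears to be the confusion the spec wording is meant to remove.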