-
Notifications
You must be signed in to change notification settings - Fork 3k
Spec - Add -1 to Manifest Entry's data_file.record_count to indicate unknown #3284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Spec - Add -1 to Manifest Entry's data_file.record_count to indicate unknown #3284
Conversation
…unt, such as files imported from Avro tables
|
This is in reference to #3273, but the base implementation also does this and thus we should likely add this to the spec. Note - Marking this as a draft as we might want to also call out that 0 indicates we should skip the file entirely (though that's perhaps overkill). Also, marking as draft as I'm open to formatting this differently. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me to add -1 to spec (as @kbendick mentioned, my change is not the first to set it to this value)
site/docs/spec.md
Outdated
| | _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet | | ||
| | _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | | ||
| | _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | | ||
| | _required_ | _required_ | **`103 record_count`** | `long` with special value: `-1: Record count unknown` | Number of records in this file. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be added to the last column? (description)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did that initially, but there are a few places that place it under description.
However, those are arguably placed under Type and not Description because they're enums, whereas this is just one possible magic value (so I personally agree that description is the better column for this).
That's mostly why I made this a draft actually 😅
Here's an example of the values being in Type:
Line 488 in 86350db
| | | _required_ | **`517 content`** | `int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, either data or delete files; 0 for all v1 manifests | |
As well as here:
Lines 374 to 378 in 86350db
| `data_file` is a struct with the following fields: | |
| | v1 | v2 | Field id, name | Type | Description | | |
| | ---------- | ---------- |-----------------------------------|------------------------------|-------------| | |
| | | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But since this is just a caveat on one magic value, I do tend to agree with you.
I'm open to either. Given it's the spec, I figured I'd let others weigh in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated it to be in Description @szehon-ho. I copied language similar to an entry a few lines down.
…ormat (Avro) used a few lines down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, only style thing is I am not sure if we need the example here as it gets a bit wordy (and might be a bit too specific to a spark-operation, for a generic spec)
I thought that too. One thing that exists is number marks that drop to a sentence with further detail, where necessary. Like Here's an example: Lines 397 to 403 in 86350db
However, I agree this one reads very particular to this case. My desire was to indicate that Will update to be more generic for now. |
szehon-ho
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @kbendick looks good from my end
|
@rdblue are you ok with this minor update to the spec? It reflects current behavior, even in the base classes (mentioned above). I had considered adding a footnote that this is a rare occurrence (i.e. it shouldn't be common or something people expect during normal operations), but seemed possibly too much information for the spec. |
|
Based on this PR, is this no longer needed (at least for the Avro import path)? https://github.com/apache/iceberg/pull/3273/files @szehon-ho |
|
Yeah, I think that fixing the Avro bug was the important thing. Let's close this. Thanks @kbendick! |
In some cases, we don't know the count of records in a data file when we write the manifest entry.
This can happen particularly when importing files from Avro tables, as we would otherwise need to scan the entire file due to lack of built-in metrics to Avro files.
As we haven't implemented parsing record count from avro files, we set record count to -1 to indicate to the metrics evaluators that the file has rows which might match.
See an example here:
iceberg/api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
Lines 90 to 99 in 7aef02b