@kbendick
Contributor

In some cases, we don't know the count of records in a data file when we write the manifest entry.

This happens in particular when importing files from Avro tables: Avro files carry no built-in metrics, so we would otherwise need to scan the entire file to count its records.

Because we haven't implemented parsing the record count from Avro files, we set the record count to -1, which tells the metrics evaluators that the file has rows that might match.

See an example here:

    if (file.recordCount() == 0) {
      return ROWS_CANNOT_MATCH;
    }
    if (file.recordCount() < 0) {
      // we haven't implemented parsing record count from avro file and thus set record count -1
      // when importing avro tables to iceberg tables. This should be updated once we implemented
      // and set correct record count.
      return ROWS_MIGHT_MATCH;
    }
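
For anyone who wants to try the check in isolation, here is a minimal, self-contained sketch of the same short-circuit. The class `RecordCountCheck`, the `Decision` enum, and the constant `UNKNOWN_RECORD_COUNT` are hypothetical names made up for this illustration and are not part of the Iceberg API. The only behavior taken from the snippet above is that a count of 0 means no rows can match and a negative count means rows might match.

```java
// Illustrative sketch only: every name in this file is hypothetical,
// not an Iceberg class or constant.
public class RecordCountCheck {

    /** Outcome of looking at the record count alone. */
    public enum Decision {
        CANNOT_MATCH,   // file is known to be empty, so it can be skipped
        MIGHT_MATCH,    // count is unknown, so the file must be kept
        CHECK_METRICS   // count is positive, so column metrics decide
    }

    /** Assumed sentinel meaning "record count unknown". */
    public static final long UNKNOWN_RECORD_COUNT = -1L;

    /** Mirrors the two early returns in the evaluator snippet above. */
    public static Decision evaluate(long recordCount) {
        if (recordCount == 0) {
            return Decision.CANNOT_MATCH;
        }
        if (recordCount < 0) {
            // Unknown count: be conservative and never prune the file.
            return Decision.MIGHT_MATCH;
        }
        return Decision.CHECK_METRICS;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(0));
        System.out.println(evaluate(UNKNOWN_RECORD_COUNT));
        System.out.println(evaluate(42));
    }
}
```

The point of the sketch is that an unknown count must be treated conservatively: pruning a file whose count is unknown could silently drop rows, whereas keeping it only costs an extra read.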

…unt, such as files imported from Avro tables
@github-actions github-actions bot added the docs label Oct 12, 2021
@kbendick
Contributor Author

kbendick commented Oct 12, 2021

This is in reference to #3273, but the base implementation also does this and thus we should likely add this to the spec.

Note: I'm marking this as a draft because we might also want to call out that 0 indicates the file should be skipped entirely (though that's perhaps overkill). I'm also open to formatting this differently.

cc @RussellSpitzer @szehon-ho @aokolnychyi @rdblue

Member

@szehon-ho szehon-ho left a comment

Makes sense to me to add -1 to spec (as @kbendick mentioned, my change is not the first to set it to this value)

| _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet |
| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
-| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file |
+| _required_ | _required_ | **`103 record_count`** | `long` with special value: `-1: Record count unknown` | Number of records in this file. |
Member

Shouldn't it be added to the last column? (description)

Contributor Author

I did that initially, but there are a few places in the spec that put such values under Type.

However, those are arguably placed under Type and not Description because they're enums, whereas this is just one possible magic value (so I personally agree that description is the better column for this).

That's mostly why I made this a draft actually 😅

Here's an example of the values being in Type:

| | _required_ | **`517 content`** | `int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, either data or delete files; 0 for all v1 manifests |

As well as here:

iceberg/site/docs/spec.md

Lines 374 to 378 in 86350db

`data_file` is a struct with the following fields:
| v1 | v2 | Field id, name | Type | Description |
| ---------- | ---------- |-----------------------------------|------------------------------|-------------|
| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |

Contributor Author

But since this is just a caveat on one magic value, I do tend to agree with you.

I'm open to either. Given it's the spec, I figured I'd let others weigh in.

Contributor Author

I updated it to be in Description @szehon-ho. I copied language similar to an entry a few lines down.

@kbendick kbendick marked this pull request as ready for review October 12, 2021 20:07
Member

@szehon-ho szehon-ho left a comment

Looks good to me. My only style concern is that I'm not sure we need the example here, as it gets a bit wordy (and might be too specific to a Spark operation for a generic spec).

@kbendick
Contributor Author

> Looks good to me. My only style concern is that I'm not sure we need the example here, as it gets a bit wordy (and might be too specific to a Spark operation for a generic spec).

I thought that too. The spec already uses numbered note markers, like [1], that point to a sentence with further detail where necessary.

Here's an example:

iceberg/site/docs/spec.md

Lines 397 to 403 in 86350db

| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. |
Notes:
1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.

However, I agree this one reads very particular to this case.

My desire was to indicate that -1 shouldn't be common, under normal situations. Though I'm not sure how true that is or if it matters.

Will update to be more generic for now.

Member

@szehon-ho szehon-ho left a comment

Thanks @kbendick looks good from my end

@kbendick
Contributor Author

@rdblue are you ok with this minor update to the spec? It reflects current behavior, even in the base classes (mentioned above).

I had considered adding a footnote saying that this is a rare occurrence (i.e. it shouldn't be common or something people expect during normal operations), but that seemed like possibly too much information for the spec.

@kbendick
Contributor Author

Based on this PR, is this no longer needed (at least for the Avro import path)? https://github.com/apache/iceberg/pull/3273/files @szehon-ho

@rdblue
Contributor

rdblue commented Oct 20, 2021

Yeah, I think that fixing the Avro bug was the important thing. Let's close this. Thanks @kbendick!

@rdblue rdblue closed this Oct 20, 2021