Spec - Add -1 to Manifest Entry's data_file.record_count to indicate unknown #3284

kbendick · 2021-10-12T17:54:37Z

In some cases, we don't know the count of records in a data file when we write the manifest entry.

This happens in particular when importing files from Avro tables: Avro files carry no built-in metrics, so determining the count would otherwise require scanning the entire file.

Because parsing the record count from Avro files is not yet implemented, we set the record count to -1, which tells the metrics evaluators that the file might contain matching rows.

See an example here:

iceberg/api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java

Lines 90 to 99 in 7aef02b

    
    if (file.recordCount() == 0) {
      return ROWS_CANNOT_MATCH;
    }

    if (file.recordCount() < 0) {
      // we haven't implemented parsing record count from avro file and thus set record count -1
      // when importing avro tables to iceberg tables. This should be updated once we implemented
      // and set correct record count.
      return ROWS_MIGHT_MATCH;
    }
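To make the three cases concrete, here is a minimal, self-contained sketch of how a reader could treat `record_count`. It is not the Iceberg implementation: the class `RecordCountCheck`, the method `mightMatch`, the constant `UNKNOWN_RECORD_COUNT`, and the `metricsSayMightMatch` parameter are all hypothetical names used only for illustration, and the two boolean constants merely stand in for the evaluator's results.

```java
// Illustrative sketch only; every name here is hypothetical, not Iceberg API.
class RecordCountCheck {
    // Stand-ins for the evaluator's two possible answers.
    static final boolean ROWS_MIGHT_MATCH = true;
    static final boolean ROWS_CANNOT_MATCH = false;

    // Assumed sentinel for "record count unknown", as proposed in this PR.
    static final long UNKNOWN_RECORD_COUNT = -1L;

    /**
     * Decides whether a file has to be read.
     *
     * @param recordCount          value of data_file.record_count
     * @param metricsSayMightMatch result of the normal column-metrics check,
     *                             consulted only when the count is positive
     */
    static boolean mightMatch(long recordCount, boolean metricsSayMightMatch) {
        if (recordCount == 0) {
            // An empty file can never contain a matching row.
            return ROWS_CANNOT_MATCH;
        }
        if (recordCount < 0) {
            // Unknown count: be conservative and keep the file.
            return ROWS_MIGHT_MATCH;
        }
        // Known, positive count: defer to the column metrics.
        return metricsSayMightMatch;
    }

    public static void main(String[] args) {
        System.out.println(mightMatch(0L, true));
        System.out.println(mightMatch(UNKNOWN_RECORD_COUNT, false));
        System.out.println(mightMatch(42L, false));
    }
}
```

The point of the sketch is that a negative count never prunes a file. Treating unknown as "might match" costs extra reads but cannot drop rows, whereas treating it as 0 would silently skip data.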

kbendick · 2021-10-12T17:55:14Z

This is in reference to #3273, but the base implementation also does this and thus we should likely add this to the spec.

Note - Marking this as a draft as we might want to also call out that 0 indicates we should skip the file entirely (though that's perhaps overkill). Also, marking as draft as I'm open to formatting this differently.

cc @RussellSpitzer @szehon-ho @aokolnychyi @rdblue

szehon-ho

Makes sense to me to add -1 to spec (as @kbendick mentioned, my change is not the first to set it to this value)

szehon-ho · 2021-10-12T19:10:52Z

site/docs/spec.md

 | _required_ | _required_ | **`101  file_format`**            | `string`                     | String file format name, avro, orc or parquet |
 | _required_ | _required_ | **`102  partition`**              | `struct<...>`                | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids |
-| _required_ | _required_ | **`103  record_count`**           | `long`                       | Number of records in this file |
+| _required_ | _required_ | **`103  record_count`**           | `long` with special value: `-1: Record count unknown` | Number of records in this file. |


Shouldn't it be added to the last column? (description)

I did that initially, but there are a few places that put such values under Type.

However, those are arguably placed under Type and not Description because they're enums, whereas this is just one possible magic value (so I personally agree that description is the better column for this).

That's mostly why I made this a draft actually 😅

Here's an example of the values being in Type:

iceberg/site/docs/spec.md

Line 488 in 86350db

| | _required_ | **`517 content`** | `int` with meaning: `0: data`, `1: deletes` | The type of files tracked by the manifest, either data or delete files; 0 for all v1 manifests |

As well as here:

iceberg/site/docs/spec.md

Lines 374 to 378 in 86350db

`data_file` is a struct with the following fields:

| v1 | v2 | Field id, name | Type | Description |
| ---------- | ---------- |-----------------------------------|------------------------------|-------------|
| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) |

But since this is just a caveat on one magic value, I do tend to agree with you.

I'm open to either. Given it's the spec, I figured I'd let others weigh in.

I updated it to be in Description @szehon-ho. I copied language similar to an entry a few lines down.

szehon-ho

Looks good to me, only style thing is I am not sure if we need the example here as it gets a bit wordy (and might be a bit too specific to a spark-operation, for a generic spec)

kbendick · 2021-10-13T18:39:14Z

> Looks good to me, only style thing is I am not sure if we need the example here as it gets a bit wordy (and might be a bit too specific to a spark-operation, for a generic spec)

I thought that too. The spec already uses numbered note markers, like [1], that point to a sentence with further detail where necessary.

Here's an example:

iceberg/site/docs/spec.md

Lines 397 to 403 in 86350db

    
    | _optional_ | _optional_ | **`140  sort_order_id`**          | `int`                        | ID representing sort order for this file [3]. |

    Notes:

    1. Single-value serialization for lower and upper bounds is detailed in Appendix D.
    2. For `float` and `double`, the value `-0.0` must precede `+0.0`, as in the IEEE 754 `totalOrder` predicate.
    3. If sort order ID is missing or unknown, then the order is assumed to be unsorted. Only data files and equality delete files should be written with a non-null order id. [Position deletes](#position-delete-files) are required to be sorted by file and position, not a table order, and should set sort order id to null. Readers must ignore sort order id for position delete files.

However, I agree this one reads very particular to this case.

My desire was to indicate that -1 shouldn't be common, under normal situations. Though I'm not sure how true that is or if it matters.

Will update to be more generic for now.

szehon-ho

Thanks @kbendick looks good from my end

kbendick · 2021-10-14T18:28:44Z

@rdblue are you ok with this minor update to the spec? It reflects current behavior, even in the base classes (mentioned above).

I had considered adding a footnote that this is a rare occurrence (i.e. it shouldn't be common or something people expect during normal operations), but that seemed like possibly too much information for the spec.

kbendick · 2021-10-20T18:15:23Z

Based on this PR, is this no longer needed (at least for the Avro import path)? https://github.com/apache/iceberg/pull/3273/files @szehon-ho

rdblue · 2021-10-20T18:28:43Z

Yeah, I think that fixing the Avro bug was the important thing. Let's close this. Thanks @kbendick!

Spec - Add -1 to data_file.record_count to indicate unknown record count, such as files imported from Avro tables

a204f31

github-actions bot added the docs label Oct 12, 2021

szehon-ho reviewed Oct 12, 2021

kbendick added 2 commits October 12, 2021 12:58

Remove added period

4924662

Move -1 to be in description and match the language of row-oriented format (Avro) used a few lines down

fd501ab

kbendick marked this pull request as ready for review October 12, 2021 20:07

szehon-ho reviewed Oct 13, 2021

Simply mention that -1 is unknown without mentioning details

20b9cc3

szehon-ho approved these changes Oct 13, 2021

szehon-ho mentioned this pull request Oct 14, 2021

Add File for Avro files throws PreconditionException #3273

Merged

rdblue closed this Oct 20, 2021
