Conversation

@szehon-ho (Member) commented Oct 11, 2021

#3263

As mentioned, this is one of the proposed solutions: set 0 rows for added Avro files, which fixes the immediate problem, as I am not sure of any quick way to get the row count of external Avro files. Was this the original intent?

I wonder if we should run a Spark job that reads the Avro files to compute the real row count, but it might be self-defeating, as add_files is supposed to have a performance benefit over an Iceberg insert.

Also adds a test case to demonstrate the problem.
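For reference, the Spark-job idea above would look roughly like this (assuming the spark-avro data source is on the classpath; avroPath is just a placeholder):

// Hypothetical, not part of this PR: scan the external Avro files with Spark
// to compute the real row count before registering them.
long rowCount = spark.read().format("avro").load(avroPath).count();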

@szehon-ho (Member Author)

@RussellSpitzer @aokolnychyi fyi

@RussellSpitzer (Member) commented Oct 11, 2021

Checking the evaluators, it looks like we really do need a negative 1 here; see:

if (file.recordCount() == 0) {
  return ROWS_CANNOT_MATCH;
}

if (file.recordCount() < 0) {
  // we haven't implemented parsing record count from avro file and thus set record count -1
  // when importing avro tables to iceberg tables. This should be updated once we implemented
  // and set correct record count.
  return ROWS_MIGHT_MATCH;
}

If we set it to 0, then files will always be ignored; if we set it to -1, then that is currently used as a signal that the file has no metric information and must be scanned.

This has been broken ever since we switched from using a direct DataFile constructor.

So I think we only have a few options:

  1. Change the builder so that it allows -1
  2. Go back to using the raw DataFile constructor

I think approach 1 is the safest thing to do here for future-proofing behavior. Specifically, I think we should allow explicitly setting the rowCount to -1 if and only if the rest of the metrics are empty. This should preserve the behavior of allowing a file without metrics but also ensure that we don't have a rowCount of -1 when the metrics are set.
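As a rough illustration of the allowed case (DataFiles.Builder method names here are assumptions, not code from this PR):

// An imported Avro file with no metrics at all; -1 signals that the record count
// is unknown and the file must be scanned.
DataFile placeholder = DataFiles.builder(spec)
    .withPath(path)
    .withFileSizeInBytes(fileSize)
    .withFormat(FileFormat.AVRO)
    .withRecordCount(-1L)
    .withPartitionPath(partitionKey)
    .build();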

@github-actions github-actions bot added the core label Oct 11, 2021
@szehon-ho (Member Author)

Thanks for catching that. Done with approach 1, and also fixed checkstyle. Watching the tests.

@szehon-ho (Member Author)

Added the check mentioned in the last comment:

Specifically I think we should allow for explicitly setting the rowCount to -1 if and only if the rest of the metrics are empty.

    .withMetrics(metrics)
    .withPartitionPath(partitionKey)
    .build();

Member

I was saving this white-space for my retirement :nit:

Member Author

Done

@RussellSpitzer (Member) left a comment

I think this is the right thing to do. @rdblue, can I ask you for a quick sanity check?

}

@Test
public void addDataUnpartitionedAvroFile() throws Exception {
Contributor

nit: rename the method; there seems to be another method with almost the same name.

Member Author

Done

);
assertEquals("Iceberg table contains correct data",
expected,
sql("SELECT * FROM %s ORDER BY id", tableName));
Contributor

Not related to this change.
For a COUNT query, we currently do not rely on the record count metrics, since Spark does not push down the count expression.
When Spark supports count pushdown, using -1 as a literal record count will give incorrect results.
Can we also add a COUNT assertion to the test?

Also, since users can read the metrics from manifests and compute the count, it might be a good idea to document the meaning of -1 for the record count metric.

Member Author

Added count test.

I think @kbendick is making the doc change: #3284

    expected,
    sql("SELECT * FROM %s ORDER BY id", tableName));

List<Object[]> expectedCount = Lists.newArrayList();
Member Author

Lists.newArrayList(new Object[]{2L}) is not possible, as the compiler cannot disambiguate between a single Object[] argument and varargs Object.
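Roughly, the added count assertion looks like this (assuming the test class's sql and assertEquals helpers; the two-step construction sidesteps the varargs ambiguity):

// Build the expected result in two steps to avoid the Object[] vs varargs issue.
List<Object[]> expectedCount = Lists.newArrayList();
expectedCount.add(new Object[]{2L});

assertEquals("Iceberg table contains correct record count",
    expectedCount,
    sql("SELECT COUNT(*) FROM %s", tableName));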

Preconditions.checkArgument(recordCount >= 0 ||
        (recordCount == -1 && valueCounts == null && columnSizes == null && nanValueCounts == null &&
            lowerBounds == null && upperBounds == null),
    "Metrics cannot be set if record count is -1.");
Contributor

Why is this change necessary? I don't think that we should allow writing a file without a record count.

@RussellSpitzer (Member) Oct 15, 2021

Because our importAvroPartitions code expects to be able to do this, and our metrics evaluation code assumes the record count can be -1 as well. We could forbid this, but that would require changing the Avro import code to fully scan Avro files before importing. As is, importAvro has been broken since we switched to the builder.

Contributor

Got it. What about getting the correct count in the import code? We already open up Parquet files and read the footer. It wouldn't be too bad to skip through Avro files, like we do to find the count for a specific starting offset: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/AvroIO.java#L149

Member

I don't think there is a problem with that; we were just keeping the past behavior, which was to include a row count of -1. I think writing new code to skip through Avro blocks and sum up row counts is fine too.

Contributor

If we leave this (even temporarily), should we change the exception message to something more along the lines of "Metrics shouldn't be set when the record count is -1; -1 is only valid for files which haven't been read yet", or similar?

I don't mind the -1 for now as it's consistent with some current behavior, but I feel like users (or at least framework developers) should be made aware this is not a relatively normal circumstance and that this means the file hasn't been read.

Ideally, we remove the -1 as soon as possible, but I can't think of any valid scenarios presently where we have metrics but not record count.

Member Author

OK, I will give this a try. @kbendick, I guess we won't have -1 anymore in this case?

Member Author

I took @rdblue's suggestion and made an attempt to use the AvroIO method to get the row count, which internally just visits each block once. A potential follow-up could be making this (and even the Parquet/ORC footer reading) into distributed Spark jobs. Added a test.

Need to rebase following the Spark directory refactor.
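A rough sketch of what that import-path change amounts to (class and builder method names here are assumptions from memory, not the exact code in this PR):

// Hypothetical sketch: compute a real record count for an imported Avro file,
// instead of the -1 placeholder, by visiting each Avro block once.
InputFile inFile = HadoopInputFile.fromLocation(avroPath, conf);
long fileLength = inFile.getLength();
long recordCount = AvroIO.findStartingRowPos(inFile::newStream, fileLength);

DataFile dataFile = DataFiles.builder(spec)
    .withPath(inFile.location())
    .withFileSizeInBytes(fileLength)
    .withFormat(FileFormat.AVRO)
    .withRecordCount(recordCount)
    .withPartitionPath(partitionKey)
    .build();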

build.gradle Outdated
}

testImplementation project(path: ':iceberg-hive-metastore', configuration: 'testArtifacts')

Contributor

Can you revert this change? It could cause conflicts.

Member Author

Yeah, I just rebased and moved it to the appropriate build.gradle; hope it's ok.

@szehon-ho force-pushed the add_file_avro_master branch from 9a5cd9d to bd71a89 on October 19, 2021 18:56

@Test
public void addAvroFile() throws Exception {
// Spark Session Catalog cannot load metadata tables
Member Author

This runs on the two other test parameters (Hive catalog, Hadoop catalog).


DataWriter<Record> dataWriter = Avro.writeData(file)
    .schema(schema)
    .createWriterFunc(org.apache.iceberg.data.avro.DataWriter::create)
Contributor

Does this need to be fully-qualified?

sql(createIceberg, tableName);

Object result = scalarSql("CALL %s.system.add_files('%s', '`avro`.`%s`')",
    catalogName, tableName, path);
Contributor

This is actually dangerous and we probably want to disallow it. We should only import data files that do not have field IDs. Otherwise, the field IDs may not match and you could get strange behavior. I'd prefer if the test used an Avro file written without Iceberg support to ensure it doesn't have field IDs. Not a huge problem, but eventually I think we should catch that there were IDs in the imported file and fail if they don't match the table schema's IDs.

@szehon-ho (Member Author) Oct 20, 2021

Got it; I was not familiar with this. I changed to using native Avro writers; does that do the trick?

Re: the fieldId check, sounds good; we can probably look at that and also add general schema validation while importing the file.
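For reference, writing the test file with the vanilla Avro API (so no Iceberg field IDs end up in the file) looks roughly like this sketch (the schema, file handling, and temp rule are assumptions, not the exact test code):

// Hypothetical sketch: write an Avro file with plain Avro writers so the file
// carries no Iceberg field IDs.
org.apache.avro.Schema avroSchema = SchemaBuilder.record("record")
    .fields()
    .requiredLong("id")
    .requiredString("data")
    .endRecord();

File avroFile = temp.newFile("test.avro");  // temp: an assumed JUnit TemporaryFolder rule
try (DataFileWriter<GenericRecord> writer =
         new DataFileWriter<>(new GenericDatumWriter<>(avroSchema))) {
  writer.create(avroSchema, avroFile);

  GenericRecord record = new GenericData.Record(avroSchema);
  record.put("id", 1L);
  record.put("data", "a");
  writer.append(record);
}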

}

- static long findStartingRowPos(Supplier<SeekableInputStream> open, long start) {
+ public static long findStartingRowPos(Supplier<SeekableInputStream> open, long start) {
Contributor

Rather than exposing this, could you add a helper method to the already public Avro class?

@RussellSpitzer (Member) Oct 19, 2021

+1 on something like "public static long readRowCount(path)"

long length = inFile.getLength();

// Seeking to the end will count all the rows.
long rowCount = AvroIO.findStartingRowPos(inFile::newStream, length);
Contributor

The findStartingRowPos method handles EOFException, so I think you could implement this using AvroIO.findStartingRowPos(..., Long.MAX_VALUE). Then you wouldn't need a call to S3 to find the file length.

That's how I'd implement the util method:

public class Avro {
  ...
  public long rowCount(InputFile file) {
    return AvroIO.findStartingRowPos(file::newStream, Long.MAX_VALUE);
  }
}

@szehon-ho (Member Author) Oct 20, 2021

Done, thanks for the code suggestion.

@szehon-ho (Member Author)

@rdblue @RussellSpitzer addressed the comments, if you guys have time for another round.

@rdblue rdblue merged commit 303f925 into apache:master Oct 20, 2021
@rdblue (Contributor) commented Oct 20, 2021

Thanks, @szehon-ho! Nice work.

@szehon-ho (Member Author)

Thanks for the review!

@kbendick kbendick added this to the Java 0.12.1 Release milestone Oct 27, 2021
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 31, 2021
izchen pushed a commit to izchen/iceberg that referenced this pull request Dec 7, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 15, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 17, 2021