Conversation

@szehon-ho (Member) commented Oct 11, 2021

#3263

As mentioned, this is one of the proposed solutions: set 0 rows for added Avro files, which fixes the immediate problem, as I am not sure of any quick way to get the row count of external Avro files. Was this the original intent?

I wonder if we should run a Spark job that reads the Avro files to compute the real row count, but it might be self-defeating, as add_files is supposed to have a performance benefit over an Iceberg insert.

Also adds a test case to demonstrate the problem.
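For reference, the Spark-job idea above would look roughly like this (assuming the spark-avro data source is on the classpath; avroPath is just a placeholder):

// Hypothetical, not part of this PR: scan the external Avro files with Spark
// to compute the real row count before registering them.
long rowCount = spark.read().format("avro").load(avroPath).count();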

@szehon-ho (Member Author)

@RussellSpitzer @aokolnychyi fyi

@RussellSpitzer (Member) commented Oct 11, 2021

Checking the evaluators, it looks like we really do need a negative 1 here; see:

if (file.recordCount() == 0) {
  return ROWS_CANNOT_MATCH;
}

if (file.recordCount() < 0) {
  // we haven't implemented parsing record count from avro file and thus set record count -1
  // when importing avro tables to iceberg tables. This should be updated once we implemented
  // and set correct record count.
  return ROWS_MIGHT_MATCH;
}

If we set it to 0, then files will always be ignored; if we set it to -1, then that is currently used as a signal that the file has no metric information and must be scanned.

This has been broken ever since we switched from using a direct DataFile constructor.

So I think we only have a few options:

  1. Change the builder so that it allows -1
  2. Go back to using the raw DataFile constructor

I think approach 1 is the safest thing to do here for future-proofing behavior. Specifically, I think we should allow explicitly setting the rowCount to -1 if and only if the rest of the metrics are empty. This should preserve the behavior of allowing a file without metrics but also ensure that we don't have a rowCount of -1 when the metrics are set.
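As a rough illustration of the allowed case (DataFiles.Builder method names here are assumptions, not code from this PR):

// An imported Avro file with no metrics at all; -1 signals that the record count
// is unknown and the file must be scanned.
DataFile placeholder = DataFiles.builder(spec)
    .withPath(path)
    .withFileSizeInBytes(fileSize)
    .withFormat(FileFormat.AVRO)
    .withRecordCount(-1L)
    .withPartitionPath(partitionKey)
    .build();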

@github-actions github-actions bot added the core label Oct 11, 2021
@szehon-ho (Member Author)

Thanks for catching that. Done with approach 1, and also fixed checkstyle. Watching the tests.

@szehon-ho (Member Author)

Added the check mentioned in the last comment:

Specifically I think we should allow for explicitly setting the rowCount to -1 if and only if the rest of the metrics are empty.

    .withMetrics(metrics)
    .withPartitionPath(partitionKey)
    .build();

Member

I was saving this white-space for my retirement :nit:

Member Author

Done

@RussellSpitzer (Member) left a comment

I think this is the right thing to do. @rdblue, can I ask you for a quick sanity check?

}

@Test
public void addDataUnpartitionedAvroFile() throws Exception {
Contributor

nit: rename the method; there seems to be another method with almost the same name.

Member Author

Done

);
assertEquals("Iceberg table contains correct data",
expected,
sql("SELECT * FROM %s ORDER BY id", tableName));
Contributor

Not related to this change.
For a COUNT query, we currently do not rely on the record count metrics, since Spark does not push down the count expression.
When Spark supports count pushdown, using -1 as a literal record count will give incorrect results.
Can we also add a COUNT assertion to the test?

Also, since users can read the metrics from manifests and compute the count, it might be a good idea to document the meaning of -1 for the record count metric.

Member Author

Added count test.

I think @kbendick is making the doc change: #3284

    expected,
    sql("SELECT * FROM %s ORDER BY id", tableName));

List<Object[]> expectedCount = Lists.newArrayList();
Member Author

Lists.newArrayList(new Object[]{2L}) is not possible, as the compiler cannot disambiguate between a single Object[] argument and varargs Object.
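Roughly, the added count assertion looks like this (assuming the test class's sql and assertEquals helpers; the two-step construction sidesteps the varargs ambiguity):

// Build the expected result in two steps to avoid the Object[] vs varargs issue.
List<Object[]> expectedCount = Lists.newArrayList();
expectedCount.add(new Object[]{2L});

assertEquals("Iceberg table contains correct record count",
    expectedCount,
    sql("SELECT COUNT(*) FROM %s", tableName));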

Preconditions.checkArgument(recordCount >= 0 ||
        (recordCount == -1 && valueCounts == null && columnSizes == null && nanValueCounts == null &&
            lowerBounds == null && upperBounds == null),
    "Metrics cannot be set if record count is -1.");
Contributor

Why is this change necessary? I don't think that we should allow writing a file without a record count.

@RussellSpitzer (Member) Oct 15, 2021

Because our importAvroPartitions code expects to be able to do this, and our metrics evaluation code assumes the record count can be -1 as well. We could forbid this, but that would require changing the Avro import code to fully scan Avro files before importing. As is, importAvro has been broken since we switched to the builder.

Contributor

Got it. What about getting the correct count in the import code? We already open up Parquet files and read the footer. It wouldn't be too bad to skip through Avro files, like we do to find the count for a specific starting offset: https://github.com/apache/iceberg/blob/master/core/src/main/java/org/apache/iceberg/avro/AvroIO.java#L149

Member

I don't think there is a problem with that; we were just keeping the past behavior, which was to include a row count of -1. I think writing new code to skip through Avro blocks and sum up row counts is fine too.

Contributor

If we leave this (even temporarily), should we change the exception message to something more along the lines of "Metrics shouldn't be set when the record count is -1; -1 is only valid for files which haven't been read yet", or similar?

I don't mind the -1 for now as it's consistent with some current behavior, but I feel like users (or at least framework developers) should be made aware this is not a relatively normal circumstance and that this means the file hasn't been read.

Ideally, we remove the -1 as soon as possible, but I can't think of any valid scenarios presently where we have metrics but not record count.

Member Author

OK, I will give this a try. @kbendick, I guess we won't have -1 anymore in this case?

Member Author

I took @rdblue's suggestion and made an attempt to use the AvroIO method to get the row count, which internally just visits each block once. A potential follow-up could be making this (and even the Parquet/ORC footer reading) into distributed Spark jobs. Added a test.

Need to rebase following the Spark directory refactor.
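A rough sketch of what that import-path change amounts to (class and builder method names here are assumptions from memory, not the exact code in this PR):

// Hypothetical sketch: compute a real record count for an imported Avro file,
// instead of the -1 placeholder, by visiting each Avro block once.
InputFile inFile = HadoopInputFile.fromLocation(avroPath, conf);
long fileLength = inFile.getLength();
long recordCount = AvroIO.findStartingRowPos(inFile::newStream, fileLength);

DataFile dataFile = DataFiles.builder(spec)
    .withPath(inFile.location())
    .withFileSizeInBytes(fileLength)
    .withFormat(FileFormat.AVRO)
    .withRecordCount(recordCount)
    .withPartitionPath(partitionKey)
    .build();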

build.gradle Outdated
}

testImplementation project(path: ':iceberg-hive-metastore', configuration: 'testArtifacts')

Contributor

Can you revert this change? It could cause conflicts.

Member Author

Yeah, I just rebased and moved it to the appropriate build.gradle; hope it's ok.

@szehon-ho force-pushed the add_file_avro_master branch from 9a5cd9d to bd71a89 on October 19, 2021 18:56

@Test
public void addAvroFile() throws Exception {
// Spark Session Catalog cannot load metadata tables
Member Author

This runs on the two other test parameters (Hive catalog, Hadoop catalog).


DataWriter<Record> dataWriter = Avro.writeData(file)
    .schema(schema)
    .createWriterFunc(org.apache.iceberg.data.avro.DataWriter::create)
Contributor

Does this need to be fully-qualified?

sql(createIceberg, tableName);

Object result = scalarSql("CALL %s.system.add_files('%s', '`avro`.`%s`')",
    catalogName, tableName, path);
Contributor

This is actually dangerous and we probably want to disallow it. We should only import data files that do not have field IDs. Otherwise, the field IDs may not match and you could get strange behavior. I'd prefer if the test used an Avro file written without Iceberg support to ensure it doesn't have field IDs. Not a huge problem, but eventually I think we should catch that there were IDs in the imported file and fail if they don't match the table schema's IDs.

@szehon-ho (Member Author) Oct 20, 2021

Got it; I was not familiar with this. I changed to using native Avro writers; does that do the trick?

Re: the fieldId check, sounds good; we can probably look at that and also add general schema validation while importing the file.
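For reference, writing the test file with the vanilla Avro API (so no Iceberg field IDs end up in the file) looks roughly like this sketch (the schema, file handling, and temp rule are assumptions, not the exact test code):

// Hypothetical sketch: write an Avro file with plain Avro writers so the file
// carries no Iceberg field IDs.
org.apache.avro.Schema avroSchema = SchemaBuilder.record("record")
    .fields()
    .requiredLong("id")
    .requiredString("data")
    .endRecord();

File avroFile = temp.newFile("test.avro");  // temp: an assumed JUnit TemporaryFolder rule
try (DataFileWriter<GenericRecord> writer =
         new DataFileWriter<>(new GenericDatumWriter<>(avroSchema))) {
  writer.create(avroSchema, avroFile);

  GenericRecord record = new GenericData.Record(avroSchema);
  record.put("id", 1L);
  record.put("data", "a");
  writer.append(record);
}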

}

- static long findStartingRowPos(Supplier<SeekableInputStream> open, long start) {
+ public static long findStartingRowPos(Supplier<SeekableInputStream> open, long start) {
Contributor

Rather than exposing this, could you add a helper method to the already public Avro class?

@RussellSpitzer (Member) Oct 19, 2021

+1 on something like "public static long readRowCount(path)"

long length = inFile.getLength();

// Seeking to the end will count all the rows.
long rowCount = AvroIO.findStartingRowPos(inFile::newStream, length);
Contributor

The findStartingRowPos method handles EOFException, so I think you could implement this using AvroIO.findStartingRowPos(..., Long.MAX_VALUE). Then you wouldn't need a call to S3 to find the file length.

That's how I'd implement the util method:

public class Avro {
  ...
  public long rowCount(InputFile file) {
    return AvroIO.findStartingRowPos(file::newStream, Long.MAX_VALUE);
  }
}

@szehon-ho (Member Author) Oct 20, 2021

Done, thanks for the code suggestion.

@szehon-ho (Member Author)

@rdblue @RussellSpitzer addressed the comments, if you guys have time for another round.

@rdblue rdblue merged commit 303f925 into apache:master Oct 20, 2021
@rdblue (Contributor) commented Oct 20, 2021

Thanks, @szehon-ho! Nice work.

@szehon-ho (Member Author)

Thanks for the review!

@kbendick kbendick added this to the Java 0.12.1 Release milestone Oct 27, 2021
kbendick pushed a commit to kbendick/iceberg that referenced this pull request Oct 31, 2021
izchen pushed a commit to izchen/iceberg that referenced this pull request Dec 7, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 15, 2021
Initial-neko pushed a commit to Initial-neko/iceberg that referenced this pull request Dec 17, 2021