Spark: Test reading default values in Spark #11832

Conversation
-        .isInstanceOf(IllegalArgumentException.class)
-        .hasMessage("Missing required field: missing_str");
+        .hasRootCauseInstanceOf(IllegalArgumentException.class)
+        .hasMessageContaining("Missing required field: missing_str");
This was needed to validate the reader failure in testMissingRequiredWithoutDefault for Spark scans: the failure happens on executors, so it is wrapped in a SparkException when it is rethrown on the driver.
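The resulting check looks roughly like this (a sketch, not the exact test code; `readDataFrame` is a hypothetical stand-in for whatever helper triggers the Spark scan):

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.spark.SparkException;

// inside the test body
assertThatThrownBy(() -> readDataFrame(readSchema).collectAsList())
    // the failure happens on an executor, so the driver surfaces it
    // wrapped in a SparkException
    .isInstanceOf(SparkException.class)
    // assert on the root cause instead of the top-level exception type
    .hasRootCauseInstanceOf(IllegalArgumentException.class)
    .hasMessageContaining("Missing required field: missing_str");
```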
     writeAndValidate(writeSchema, readSchema);
   }

-  protected void withSQLConf(Map<String, String> conf, Action action) throws IOException {
This was unused.
     List<Types.NestedField> fields = struct.fields();
-    for (int i = 0; i < fields.size(); i += 1) {
-      Type fieldType = fields.get(i).type();
+    for (int readPos = 0; readPos < fields.size(); readPos += 1) {
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public abstract class DataFrameWriteTestBase extends ScanTestBase {
New base suite for tests of data frame writes, which replaces TestDataFrameWrites and ParameterizedAvroDataTest.
import org.junit.jupiter.api.io.TempDir;

/** An AvroDataScan test that validates data by reading through Spark */
public abstract class ScanTestBase extends AvroDataTest {
New base class for scan tests (TestAvroScan, TestParquetScan, TestParquetVectorizedScan).
  @Parameters(name = "format = {0}")
  public static Collection<String> parameters() {
    return Arrays.asList("parquet", "avro", "orc");
This was broken into DataFrameWriteTestBase and subclasses for each format: TestAvroDataFrameWrite, TestParquetDataFrameWrite, and TestORCDataFrameWrite.
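For illustration, a per-format subclass might look like this (hypothetical sketch; the `format()` hook is an assumption, and the real classes may select the format through a constructor or a different mechanism):

```java
import org.apache.iceberg.FileFormat;

// Hypothetical sketch of one of the per-format suites
public class TestParquetDataFrameWrite extends DataFrameWriteTestBase {
  @Override
  protected FileFormat format() {
    return FileFormat.PARQUET; // the Avro and ORC suites return AVRO and ORC
  }
}
```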
  }

  @TestTemplate
  public void testWriteWithCustomDataLocation() throws IOException {
Replaced by DataFrameWriteTestBase#testAlternateLocation.
  }

  @TestTemplate
  public void testNullableWithWriteOption() throws IOException {
This assumes Spark 2.x, so it is no longer needed.
  }

  @TestTemplate
  public void testNullableWithSparkSqlOption() throws IOException {
This assumes Spark 2.x, so it is no longer needed.
  }

  @TestTemplate
  public void testFaultToleranceOnWrite() throws IOException {
I dropped this test because it tests basic Spark behavior and doesn't belong in scan and write tests for specific schemas. I didn't move it anywhere because I don't think it is a valuable test: Spark stage failures throw exceptions and don't commit. I think it was originally trying to check for side effects, but that isn't necessary in Iceberg.
Fokko left a comment
Thanks for the reviews, @Fokko!
This updates Spark's tests for scans and data frame writes to validate default values.
This fixes problems found in testing:
- ReassignIds was dropping defaults
- SchemaParser did not support either initial-default or write-default
- SchemaParser did not have a test suite

This also refactors the data frame writer tests and removes ParameterizedAvroDataTest, which was an unnecessary copy of AvroDataTest. To avoid needing the duplicate test suite, this updates the tests to inherit from a base class like the scan tests. Last, a few unnecessary tests have been removed. One was testing basic Spark behavior (no commit if an action fails) and the others were only valid for Spark 2.x.
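For example, the round trip that SchemaParser previously lost looks roughly like this (a sketch assuming the NestedField builder API from the default-values work; the exact builder method names may differ):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

public class SchemaDefaultsRoundTrip {
  public static void main(String[] args) {
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional("data")
                .withId(2)
                .ofType(Types.StringType.get())
                // value for rows written before the column existed
                .withInitialDefault(Literal.of("unknown"))
                // value applied when a writer omits the column
                .withWriteDefault(Literal.of("unknown"))
                .build());

    // the field JSON should now carry "initial-default" and "write-default",
    // and parsing it back should preserve both
    String json = SchemaParser.toJson(schema);
    Schema roundTripped = SchemaParser.fromJson(json);
    System.out.println(json);
  }
}
```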