Spark: Test reading default values in Spark #11832

Conversation
-        .isInstanceOf(IllegalArgumentException.class)
-        .hasMessage("Missing required field: missing_str");
+        .hasRootCauseInstanceOf(IllegalArgumentException.class)
+        .hasMessageContaining("Missing required field: missing_str");
This was needed to validate the reader failure in testMissingRequiredWithoutDefault for Spark scans: the failure happens on executors, so it is wrapped in a SparkException when it is rethrown on the driver.
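The resulting check looks roughly like this (a sketch, not the exact test code; `readDataFrame` is a hypothetical stand-in for whatever helper triggers the Spark scan):

```java
import static org.assertj.core.api.Assertions.assertThatThrownBy;

import org.apache.spark.SparkException;

// inside the test body
assertThatThrownBy(() -> readDataFrame(readSchema).collectAsList())
    // the failure happens on an executor, so the driver surfaces it
    // wrapped in a SparkException
    .isInstanceOf(SparkException.class)
    // assert on the root cause instead of the top-level exception type
    .hasRootCauseInstanceOf(IllegalArgumentException.class)
    .hasMessageContaining("Missing required field: missing_str");
```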
     writeAndValidate(writeSchema, readSchema);
   }

-  protected void withSQLConf(Map<String, String> conf, Action action) throws IOException {
This was unused.
     List<Types.NestedField> fields = struct.fields();
-    for (int i = 0; i < fields.size(); i += 1) {
-      Type fieldType = fields.get(i).type();
+    for (int readPos = 0; readPos < fields.size(); readPos += 1) {
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

public abstract class DataFrameWriteTestBase extends ScanTestBase {
New base suite for tests of data frame writes, which replaces TestDataFrameWrites and ParameterizedAvroDataTest.
import org.junit.jupiter.api.io.TempDir;

/** An AvroDataScan test that validates data by reading through Spark */
public abstract class ScanTestBase extends AvroDataTest {
New base class for scan tests (TestAvroScan, TestParquetScan, TestParquetVectorizedScan).
  @Parameters(name = "format = {0}")
  public static Collection<String> parameters() {
    return Arrays.asList("parquet", "avro", "orc");
This was broken into DataFrameWriteTestBase and subclasses for each format: TestAvroDataFrameWrite, TestParquetDataFrameWrite, and TestORCDataFrameWrite.
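For illustration, a per-format subclass might look like this (hypothetical sketch; the `format()` hook is an assumption, and the real classes may select the format through a constructor or a different mechanism):

```java
import org.apache.iceberg.FileFormat;

// Hypothetical sketch of one of the per-format suites
public class TestParquetDataFrameWrite extends DataFrameWriteTestBase {
  @Override
  protected FileFormat format() {
    return FileFormat.PARQUET; // the Avro and ORC suites return AVRO and ORC
  }
}
```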
  }

  @TestTemplate
  public void testWriteWithCustomDataLocation() throws IOException {
Replaced by DataFrameWriteTestBase#testAlternateLocation.
  }

  @TestTemplate
  public void testNullableWithWriteOption() throws IOException {
This assumes Spark 2.x, so it is no longer needed.
  }

  @TestTemplate
  public void testNullableWithSparkSqlOption() throws IOException {
This assumes Spark 2.x, so it is no longer needed.
  }

  @TestTemplate
  public void testFaultToleranceOnWrite() throws IOException {
I dropped this test because it tests basic Spark behavior and doesn't belong in scan and write tests for specific schemas. I didn't move it anywhere because I don't think it is a valuable test: Spark stage failures throw exceptions and don't commit. I think it was originally trying to check for side effects, but that isn't necessary in Iceberg.
Fokko left a comment
Thanks for the reviews, @Fokko!
This updates Spark's tests for scans and data frame writes to validate default values.
This fixes problems found in testing:
- ReassignIds was dropping defaults
- SchemaParser did not support either initial-default or write-default
- SchemaParser did not have a test suite

This also refactors the data frame writer tests and removes ParameterizedAvroDataTest, which was an unnecessary copy of AvroDataTest. To avoid needing the duplicate test suite, this updates the tests to inherit from a base class like the scan tests. Last, a few unnecessary tests have been removed. One was testing basic Spark behavior (no commit if an action fails) and the others were only valid for Spark 2.x.
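For example, the round trip that SchemaParser previously lost looks roughly like this (a sketch assuming the NestedField builder API from the default-values work; the exact builder method names may differ):

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.SchemaParser;
import org.apache.iceberg.expressions.Literal;
import org.apache.iceberg.types.Types;

public class SchemaDefaultsRoundTrip {
  public static void main(String[] args) {
    Schema schema =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional("data")
                .withId(2)
                .ofType(Types.StringType.get())
                // value for rows written before the column existed
                .withInitialDefault(Literal.of("unknown"))
                // value applied when a writer omits the column
                .withWriteDefault(Literal.of("unknown"))
                .build());

    // the field JSON should now carry "initial-default" and "write-default",
    // and parsing it back should preserve both
    String json = SchemaParser.toJson(schema);
    Schema roundTripped = SchemaParser.fromJson(json);
    System.out.println(json);
  }
}
```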