iceberg-spark changes for vectorized reads #828

samarthjain · 2020-03-05T19:16:45Z

No description provided.

spark/src/main/java/org/apache/iceberg/spark/arrow/ArrowUtils.java

spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ColumnarBatchReaders.java

spark/src/main/java/org/apache/iceberg/spark/data/vectorized/IcebergArrowColumnVector.java

spark/src/main/java/org/apache/iceberg/spark/source/Reader.java

spark/src/main/java/org/apache/iceberg/spark/source/InternalRowTaskDataReader.java

spark/src/main/java/org/apache/iceberg/spark/source/ColumnarBatchTaskDataReader.java

rdblue · 2020-06-03T23:22:13Z

I just ran a test with non-dictionary data using all supported primitives and 100 million records, and it passed. That, combined with mostly good coverage, gives me a lot of confidence that this is (mostly) correct. Nice work, @samarthjain!

rdblue · 2020-06-05T17:22:33Z

Looks like the failures are checkstyle violations:

> Task :iceberg-spark:checkstyleMain
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:22:1: Use org.apache.iceberg.relocated.* classes from bundled-guava module instead. [BanUnrelocatedGuavaClasses]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:23:1: Use org.apache.iceberg.relocated.* classes from bundled-guava module instead. [BanUnrelocatedGuavaClasses]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/VectorizedSparkParquetReaders.java:24:1: Use org.apache.iceberg.relocated.* classes from bundled-guava module instead. [BanUnrelocatedGuavaClasses]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/source/BatchDataReader.java:22:1: Use org.apache.iceberg.relocated.* classes from bundled-guava module instead. [BanUnrelocatedGuavaClasses]
[ant:checkstyle] [ERROR] /home/travis/build/apache/iceberg/spark/src/main/java/org/apache/iceberg/spark/source/Reader.java:22:1: Use org.apache.iceberg.relocated.* classes from bundled-guava module instead. [BanUnrelocatedGuavaClasses]
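
For illustration, the `BanUnrelocatedGuavaClasses` rule is asking for Guava imports to point at the relocated package instead of `com.google.common`. The sketch below shows that rewrite on import lines. The helper class `RelocateGuavaImports` is hypothetical and not part of the build, and the prefix `org.apache.iceberg.relocated.com.google.common` is assumed from the error message above.

```java
import java.util.List;

// Hypothetical helper, for illustration only: rewrites a plain Guava import
// line to the relocated form that the checkstyle rule asks for.
class RelocateGuavaImports {
  static final String UNRELOCATED = "import com.google.common.";
  // Assumed relocated prefix, inferred from the checkstyle message.
  static final String RELOCATED = "import org.apache.iceberg.relocated.com.google.common.";

  static String relocate(String line) {
    // Only lines that start with the unrelocated Guava prefix are changed.
    return line.startsWith(UNRELOCATED)
        ? RELOCATED + line.substring(UNRELOCATED.length())
        : line;
  }

  public static void main(String[] args) {
    List<String> imports = List.of(
        "import com.google.common.base.Preconditions;",
        "import java.util.List;");
    imports.stream().map(RelocateGuavaImports::relocate).forEach(System.out::println);
  }
}
```

In practice the fix is a manual edit of the flagged import lines (lines 22 to 24 in the files listed above); the helper only makes the before and after forms explicit.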

rdblue · 2020-06-12T23:39:45Z

...c/test/java/org/apache/iceberg/spark/data/parquet/vectorized/TestParquetVectorizedReads.java

+  }
+
+  protected int getNumRows() {
+    return NUM_ROWS;


Why make this a method? So it can be overridden?
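
If overriding is the intent, the pattern would look roughly like the sketch below. The class names, the row counts, and the `generateData` helper are all hypothetical and are not taken from this PR; only the shape of `getNumRows()` mirrors the diff above.

```java
// Hypothetical sketch: names and constants are illustrative, not from the PR.
class OverrideDemo {
  public static void main(String[] args) {
    System.out.println(new BaseReadTest().generateData().length);
    System.out.println(new SmallReadTest().generateData().length);
  }
}

class BaseReadTest {
  private static final int NUM_ROWS = 200_000; // assumed default, for illustration

  // Subclasses can override this to run the same checks with a different row count.
  protected int getNumRows() {
    return NUM_ROWS;
  }

  // Stand-in for test data generation: produces getNumRows() sequential values.
  int[] generateData() {
    int[] rows = new int[getNumRows()];
    for (int i = 0; i < rows.length; i++) {
      rows[i] = i;
    }
    return rows;
  }
}

class SmallReadTest extends BaseReadTest {
  @Override
  protected int getNumRows() {
    return 10;
  }
}
```

Because `generateData` calls the method rather than reading the constant directly, a subclass changes the data size without duplicating any test logic.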

Summary of changes:

1) The following test cases were added:
   - Test for the code path where optional values are mostly null
   - Test for the case where containers are not reused for every batch
   - Test to verify that Arrow's validity vector is set correctly when setArrowValidityVector = true
2) The container-reuse logic is now similar to the row-based read path.
3) The nullability holder is now always set. The Arrow validity vector is set only to supply complete Arrow vectors when they are requested.
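
To make the last point concrete, here is a minimal sketch of the idea, not the code in this PR. The class `BatchNulls` is hypothetical, a plain `boolean[]` stands in for the nullability holder, and a `java.util.BitSet` stands in for Arrow's validity buffer. Only the flag name `setArrowValidityVector` comes from the summary above.

```java
import java.util.BitSet;

// Hypothetical sketch: null tracking that is always recorded, plus a
// validity bitmap that is filled in only when explicitly requested.
class NullTrackingDemo {
  public static void main(String[] args) {
    Integer[] column = {1, null, 3, null};
    BatchNulls full = BatchNulls.read(column, true);
    BatchNulls holderOnly = BatchNulls.read(column, false);
    System.out.println(full.isNullAt(1));
    System.out.println(holderOnly.validity == null);
  }
}

class BatchNulls {
  final boolean[] isNull; // stand-in for the nullability holder; always populated
  final BitSet validity;  // stand-in for the Arrow validity vector; null unless requested

  private BatchNulls(boolean[] isNull, BitSet validity) {
    this.isNull = isNull;
    this.validity = validity;
  }

  static BatchNulls read(Integer[] values, boolean setArrowValidityVector) {
    boolean[] isNull = new boolean[values.length];
    BitSet validity = setArrowValidityVector ? new BitSet(values.length) : null;
    for (int i = 0; i < values.length; i++) {
      if (values[i] == null) {
        isNull[i] = true;   // recorded on every read
      } else if (validity != null) {
        validity.set(i);    // a set bit means the value is present
      }
    }
    return new BatchNulls(isNull, validity);
  }

  boolean isNullAt(int index) {
    return isNull[index];
  }
}
```

Null checks go through the holder in both modes, so the extra work of maintaining the bitmap is only paid when a caller needs complete Arrow vectors.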

rdblue · 2020-06-15T22:16:56Z

Thanks for all the hard work, @samarthjain! I think this is ready to merge.
