Description
The vectorized reader does not currently support row-level deletes; it is explicitly turned off. See this check in `iceberg/spark3/src/main/java/org/apache/iceberg/spark/source/SparkBatchScan.java` (line 180 at commit 80ff749):

```java
boolean readUsingBatch = batchReadsEnabled && hasNoDeleteFiles && (allOrcFileScanTasks || ...
```
I'm working on a solution to enable vectorized reading for row-level deletes. The idea is to filter out deleted rows from the batch that Iceberg returns for Spark to consume. The challenge is that `ColumnarBatch` is a Spark class and is final, so we cannot extend it in Iceberg. Of course, we could filter out deleted rows by iterating over the batch and constructing a new batch object, but that raises a significant performance concern. I will propose making `ColumnarBatch` non-final on the Spark side and hope the change is accepted; otherwise, we need to think about other ways to approach this feature.
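For reference, here is a minimal sketch of the iterate-and-rebuild fallback mentioned above, just to make the performance concern concrete. The method name `filterDeleted` and the `deletedPositions` parameter are hypothetical, and only int/long columns are handled to keep the sketch short; real code would need to cover every supported Spark type and would copy every surviving row, which is exactly the overhead we want to avoid.

```java
import java.util.Set;

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public class DeleteFilterSketch {

  // Rebuilds a batch without the rows at the given (hypothetical) deleted positions.
  static ColumnarBatch filterDeleted(ColumnarBatch batch, Set<Integer> deletedPositions, StructType schema) {
    int numCols = batch.numCols();
    int numRows = batch.numRows();
    OnHeapColumnVector[] filtered = OnHeapColumnVector.allocateColumns(numRows, schema);

    int outRow = 0;
    for (int row = 0; row < numRows; row++) {
      if (deletedPositions.contains(row)) {
        continue; // skip rows removed by position/equality deletes
      }
      for (int col = 0; col < numCols; col++) {
        ColumnVector src = batch.column(col);
        if (src.isNullAt(row)) {
          filtered[col].putNull(outRow);
        } else if (DataTypes.IntegerType.equals(src.dataType())) {
          filtered[col].putInt(outRow, src.getInt(row));
        } else if (DataTypes.LongType.equals(src.dataType())) {
          filtered[col].putLong(outRow, src.getLong(row));
        } else {
          // a real implementation would handle all Spark types here
          throw new UnsupportedOperationException("type not handled in this sketch: " + src.dataType());
        }
      }
      outRow++;
    }

    ColumnarBatch result = new ColumnarBatch(filtered);
    result.setNumRows(outRow);
    return result;
  }
}
```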
Any feedback?
cc @aokolnychyi @rdblue @RussellSpitzer @jackye1995 @sunchao @chenjunjiedada