Support row-level delete in vectorized reader #3141

@flyrain

Description

The vectorized reader does not currently support row-level deletes. Batch reads are turned off whenever a scan has delete files; see this code:

boolean readUsingBatch = batchReadsEnabled && hasNoDeleteFiles && (allOrcFileScanTasks ||

I'm working on a solution to enable vectorized reads with row-level deletes. The idea is to filter out deleted rows when Iceberg returns a batch for Spark to consume. The challenge is that ColumnarBatch is a final class from Spark, so we cannot extend it in Iceberg. We could filter out deleted rows by iterating over the batch and constructing a new batch object, but that would be a significant performance concern. I will propose making ColumnarBatch non-final on the Spark side. Hopefully that is accepted; otherwise, we need to think about other ways to approach this feature.
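
To make the copy-based fallback concrete, here is a minimal sketch in plain Java. It does not use Spark or Iceberg classes: an `int[]` stands in for a column vector, and the class name `DeleteFilterSketch`, the helpers `liveRowIds` and `copyLiveValues`, and the `batchStartPos` parameter are all hypothetical names made up for illustration. It only shows why the approach is costly: every column of every batch gets copied once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DeleteFilterSketch {

  // Hypothetical helper: returns the in-batch indexes of rows that are NOT deleted.
  // batchStartPos is assumed to be the file position of the first row in the batch.
  static int[] liveRowIds(int numRows, Set<Long> deletedPositions, long batchStartPos) {
    int[] rowIds = new int[numRows];
    int live = 0;
    for (int i = 0; i < numRows; i++) {
      if (!deletedPositions.contains(batchStartPos + i)) {
        rowIds[live++] = i;
      }
    }
    return Arrays.copyOf(rowIds, live);
  }

  // Copy approach: materialize a new column that holds only the live values.
  // This is the extra per-column, per-batch work that raises the perf concern.
  static int[] copyLiveValues(int[] column, int[] rowIds) {
    int[] out = new int[rowIds.length];
    for (int i = 0; i < rowIds.length; i++) {
      out[i] = column[rowIds[i]];
    }
    return out;
  }

  public static void main(String[] args) {
    int[] column = {10, 11, 12, 13, 14};  // stand-in for one column vector
    Set<Long> deleted = new HashSet<>(Arrays.asList(101L, 103L));

    int[] rowIds = liveRowIds(column.length, deleted, 100L);
    System.out.println(Arrays.toString(rowIds));                          // [0, 2, 4]
    System.out.println(Arrays.toString(copyLiveValues(column, rowIds)));  // [10, 12, 14]
  }
}
```

Note that the `rowIds` array is computed once per batch and shared by all columns; only the value copy is repeated per column.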

Any feedback?

cc @aokolnychyi @rdblue @RussellSpitzer @jackye1995 @sunchao @chenjunjiedada
