Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Easy way to zero-copy IPC buffers. #5165

Closed
wants to merge 7 commits into from

Conversation

comath
Copy link
Contributor

@comath comath commented Dec 4, 2023

Which issue does this PR close?

Closes #5153

Rationale for this change

There are ways to memmap buffers, if you know the header information, but no easy way to use the existing IPC file format in a zero-copy way.

What changes are included in this PR?

Reused the IPC reader code to read the header and footer information from an IPC file, but swapped out the read/copy of the buffer for the Buffer creation with Buffer::from_custom_allocation. Minor savings can be had by not allocating the arbitrary-sized footer and message read, but the easy way to do this requires a nightly API in the Cursor object.

This duplicated some code, but it's messy to deduplicate because most of it is boilerplate around reading the flatbuffer messages. The better way to deduplicate is to change the FileReader to optionally accept a buffer and conditionally use that in the 2 locations where buffers are actually read. Perfectly happy to do it this way as well.

Are there any user-facing changes?

There's a new struct that mirrors the current IPC reader struct that passes back zero-copy buffers.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Dec 4, 2023
@tustvold
Copy link
Contributor

tustvold commented Dec 5, 2023

Rather than duplicating the reader perhaps we could extract a BufferRead trait that returns Buffer and has an implementation for Buffer along with any T: Read + Seek. FileReader could then be updated to use this trait?

Something like

trait BufferRead {
    fn read(&mut self, range: Range<u64>) -> Result<Buffer>;
}

@tustvold
Copy link
Contributor

tustvold commented Dec 5, 2023

As written this is now a breaking API change... I'll have a play over tomorrow and see what I can come up with

@comath
Copy link
Contributor Author

comath commented Dec 5, 2023

I have a way that goes:

/// Arrow File reader
pub struct FileReader<R: Read + Seek> {
    /// Buffered file reader that supports reading and seeking
    reader: BufReader<R>,

    /// Optional Buffer for when we want to map the underlying data without copying
    buffer: Option<BufferReader>,

    ......
}

impl FileReader<Cursor<BufferReader>> {
    pub fn try_new_map(buffer: Buffer, projection: Option<Vec<usize>>) -> Result<Self, ArrowError> {
        let buffer = BufferReader { inter: buffer };
        let cursor = Cursor::new(buffer.clone());
        Self::try_new_inter(cursor, projection, Some(buffer))
    }
}

impl<R: Read + Seek> FileReader<R> {
    pub fn try_new(reader: R, projection: Option<Vec<usize>>) -> Result<Self, ArrowError> {
        Self::try_new_inter(reader, projection, None)
    }

    fn try_new_inter(
        reader: R,
        projection: Option<Vec<usize>>,
        buffer: Option<BufferReader>,
    ) -> Result<Self, ArrowError> {
        let mut reader = BufReader::new(reader);
        .......
    }
}

It then checks if the optional buffer is there during the dictionary loop and maybe_next call. If there is a buffer it maps the memory instead of copying it. The BufferReader is a struct wrapping Buffer that just implements AsRef<[u8]> so that we can put a Cursor on it.

It's here and I can merge it with the current state if you want to go this direction.

AsRef<[u8]> instead of a breaking trait.
@tustvold
Copy link
Contributor

tustvold commented Dec 7, 2023

FYI I've started work on a slightly different approach to this, starting with - #5179

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Request to Memmap Arrow IPC files on disk
2 participants