Skip to content

Conversation

srilman
Copy link
Contributor

@srilman srilman commented Sep 5, 2025

Changes Made

I need it for both a process UDF PR and for the dashboard stuff, so just moved it to this PR to rebase the others on top of it. Lmk if you want me to add a test for this specifically, but I do test it E2E in the others.

@srilman srilman requested a review from colin-ho September 5, 2025 00:36
@github-actions github-actions bot added the feat label Sep 5, 2025
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR adds Arrow IPC (Inter-Process Communication) serialization and deserialization functionality to the RecordBatch type in Daft. The changes introduce two key methods: to_ipc_stream() for converting RecordBatches to IPC stream format bytes, and from_ipc_stream() for reconstructing RecordBatches from IPC stream bytes.

The implementation spans both the Rust core (src/daft-recordbatch/src/lib.rs) and Python bindings (src/daft-recordbatch/src/python.rs). In the Rust implementation, the methods utilize Arrow2's IPC writer and reader with no compression, handling schema conversion and proper error handling. The Python bindings expose these methods as from_ipc_stream() (static method) and to_ipc_stream() (instance method), following the existing pattern for similar conversion methods in the codebase.

These methods enable efficient serialization of columnar data while preserving schema information, which is essential for the planned process UDF and dashboard functionality mentioned in the PR description. The IPC format is a standard Arrow serialization format that's well-suited for inter-process communication, data transfer between systems, and temporary storage scenarios. The implementation properly releases the Python GIL during potentially expensive operations using py.allow_threads().

Confidence score: 4/5

  • This PR is safe to merge with minimal risk as it adds well-contained functionality using established Arrow patterns
  • Score reflects solid implementation following existing codebase patterns, though lacks dedicated unit tests in this PR
  • Pay close attention to the assertion in from_ipc_stream() method that expects exactly one table, which could panic if assumptions are violated

2 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

Copy link

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 0% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.88%. Comparing base (d321b39) to head (6925f9c).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-recordbatch/src/lib.rs 0.00% 31 Missing ⚠️
src/daft-recordbatch/src/python.rs 0.00% 10 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #5143      +/-   ##
==========================================
- Coverage   71.90%   71.88%   -0.02%     
==========================================
  Files         953      953              
  Lines      130550   130593      +43     
==========================================
+ Hits        93866    93874       +8     
- Misses      36684    36719      +35     
Files with missing lines Coverage Δ
src/daft-recordbatch/src/python.rs 26.45% <0.00%> (-0.57%) ⬇️
src/daft-recordbatch/src/lib.rs 77.95% <0.00%> (-2.82%) ⬇️

... and 4 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@srilman srilman merged commit 1368e0d into main Sep 5, 2025
42 of 43 checks passed
@srilman srilman deleted the slade/recordbatch-ipc branch September 5, 2025 02:56
venkateshdb pushed a commit to venkateshdb/Daft that referenced this pull request Sep 6, 2025
## Changes Made

I need it for both a process UDF PR and for the dashboard stuff, so just
moved it to this PR to rebase the others on top of it. Lmk if you want
me to add a test for this specifically, but I do test it E2E in the
others.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants