Merged
31 changes: 16 additions & 15 deletions docs/source/format/CDataInterface.rst
@@ -22,38 +22,39 @@ The Arrow C data interface
Rationale
=========

-Apache Arrow aims to be a universal in-memory format for the representation
-of tabular ("columnar") data, but some projects may face a difficult
-choice between either depending on a fast-evolving dependency such as the
-Arrow C++ library, or having to implement adapters for data interchange
-(for example by reimplementing the Arrow IPC format, which is non-trivial
-and does not allow cross-runtime zero-copy).
+Apache Arrow is designed to be a universal in-memory format for the representation
+of tabular ("columnar") data. However, some projects may face a difficult
+choice between either depending on a fast-evolving project such as the
+Arrow C++ library, or having to reimplement adapters for data interchange,
+which may require significant, redundant development effort.

The Arrow C data interface defines a very small, stable set of C definitions
that can be easily *copied* in any project's source code and used for columnar
data interchange in the Arrow format. For non-C/C++ languages and runtimes,
it should be almost as easy to translate the C definitions into the
corresponding C FFI declarations.

-Applications and libraries can therefore choose between tight integration
+Applications and libraries can therefore work with Arrow memory without
+necessarily using Arrow libraries or reinventing the wheel. Developers can
+choose between tight integration
with the Arrow *software project* (benefitting from the growing array of
facilities exposed by e.g. the C++ or Java implementations of Apache Arrow,
but with the cost of a dependency) or minimal integration with the Arrow
-*format*.
+*format* only.

Goals
-----

* Expose an ABI-stable interface.
-* Easy for third-party projects to implement support for (including partial
+* Make it easy for third-party projects to implement support for (including partial
support where sufficient), with little initial investment.
-* Zero-copy sharing of Arrow data between independent runtimes
+* Allow zero-copy sharing of Arrow data between independent runtimes
and components running in the same process.
-* Match the Arrow array concepts closely, to avoid the development of
+* Match the Arrow array concepts closely to avoid the development of
yet another marshalling layer.
* Avoid the need for one-to-one adaptation layers such as the limited
JPype-based bridge between Java and Python.
-* Allow integration without an explicit dependency (either at compile-time
+* Enable integration without an explicit dependency (either at compile-time
or runtime) on the Arrow software project.

Ideally, the Arrow C data interface can become a low-level *lingua franca*
Author

Will we change this wording to be more declarative and less hypothetical when this proposal is accepted?

Owner

@pitrou pitrou Oct 17, 2019

We can. But a spec alone can't determine that it's sufficiently used to actually become a lingua franca ;-)

@@ -79,9 +80,9 @@ Pros of the C data interface vs. the IPC format:

Pros of the IPC format vs. the data interface:

-* Works accross processes and machines.
-* Allows data storage and persistency.
-* Being a streamable format, has room for composing more features (such as
+* Works across processes and machines.
+* Allows data storage and persistence.
+* Being a streamable format, the IPC format has room for composing more features (such as
integrity checks, compression...).

Data type description -- format strings
Author

Other questions (I can't comment that far outside the diff):

  • L322 "If omitted, MUST be 0." That doesn't sound right.
  • Should we note below that you can use the ArrowArray struct shape to wrap up RecordBatches and other things?
  • "Examples" should probably point to the C++/Python/R stuff we did, right? (If not in this proposal, then later)

Owner

"If omitted, MUST be 0": I mean that if the information is meant to be omitted, then the field should be set to 0. Do you want to propose another wording?

Exporting RecordBatches: yes, I could add a paragraph somewhere later.

I also could add pointers to in-progress implementations somewhere towards the end.

Author

I guess what was confusing to me was that setting a field to 0 doesn't sound like omitting it. If I were omitting it, I wouldn't set anything. It sounds like you need to set it always, but if you don't want to enable any of the options, it should be 0--is that right?

Owner

If you don't set anything, then it could be any junk value, though (C doesn't zero-initialize stuff automatically).

Owner

So, yes, it needs to be set to the value meaning "omitted" (in R it would be NA? :-)).
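
To make the point about junk values concrete, here is a minimal sketch in C. It assumes a hypothetical `ExampleArray` struct and a hypothetical `EXAMPLE_FLAG_NULLABLE` value; neither is the definition from the spec under review, and the helper names are made up for illustration. The idea is only that a producer always writes the optional field, using 0 to mean "nothing set", rather than leaving it untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an exported struct; NOT the real Arrow definition. */
struct ExampleArray {
  const char* format;  /* type description string */
  int64_t flags;       /* optional bitfield; 0 means "no flags set" */
  int64_t n_children;
};

/* Hypothetical flag value, for illustration only. */
#define EXAMPLE_FLAG_NULLABLE 2

/* A struct with automatic storage starts with indeterminate contents in C,
 * so the producer zero-initializes it explicitly. Every optional field then
 * holds its "omitted" value (0) until it is deliberately set. */
static void example_array_init(struct ExampleArray* array, const char* format) {
  assert(array != NULL);
  *array = (struct ExampleArray){0};
  array->format = format;
}

/* A consumer can test flags safely only because 0 is guaranteed when
 * the producer had nothing to say. */
static int example_array_is_nullable(const struct ExampleArray* array) {
  return (array->flags & EXAMPLE_FLAG_NULLABLE) != 0;
}
```

With this shape, "omitting" the flags amounts to calling `example_array_init` and never touching `flags` afterwards; a consumer reading the field then sees 0 rather than whatever happened to be on the stack.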
