ARROW-8797: [C++] Read RecordBatch in a different endian #7507
Conversation
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename the pull request title in the following format? See also:
cpp/src/arrow/ipc/read_write_test.cc
Outdated
For now, I intentionally comment out this test since this type of communication is not supported.
cpp/src/arrow/ipc/read_write_test.cc
Outdated
Should we always convert the schema's endianness to native endian?
I think that we should keep the original schema's endianness for serialization.
Did we discuss this?
Thank you for your comment. You are right. We should keep the original schema until ReadRecordBatch.
Then, if LoadRecordBatchSubset automatically converts the endianness based on the schema's endianness, it will also update the endianness flag in the schema.
I will move this assertion after Do*RoundTrip.
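To illustrate the intended behavior, here is a minimal, self-contained sketch in plain C++ (FakeBatch, LoadWithNativeEndianness, and the Endianness enum are illustrative stand-ins, not Arrow's LoadRecordBatchSubset/ArrayLoader code): convert only when the declared endianness differs from the host's, then update the flag so callers always see a native-endian batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative stand-ins only; in the real change the flag lives on arrow::Schema.
enum class Endianness { Little, Big };

struct FakeBatch {
  Endianness endianness;            // declared endianness of the column below
  std::vector<int32_t> int32_data;  // pretend this is a single int32 column
};

// If the declared endianness differs from the host's, byte-swap the values and
// update the flag so downstream readers always see native-endian data.
void LoadWithNativeEndianness(FakeBatch* batch, Endianness host) {
  if (batch->endianness == host) return;  // same endianness: nothing to do
  for (int32_t& value : batch->int32_data) {
    auto* bytes = reinterpret_cast<uint8_t*>(&value);
    std::reverse(bytes, bytes + sizeof(int32_t));
  }
  batch->endianness = host;  // the loaded batch now reports native endianness
}

int main() {
  // Simulate loading on a little-endian host: the value 1 stored in the
  // opposite byte order appears as 0x01000000 until it is swapped.
  FakeBatch batch{Endianness::Big, {0x01000000}};
  LoadWithNativeEndianness(&batch, Endianness::Little);
  assert(batch.int32_data[0] == 1 && batch.endianness == Endianness::Little);
  return 0;
}
```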
cpp/src/arrow/ipc/read_write_test.cc
Outdated
I'm not sure how difficult it will be to implement this idea...
How about enhancing the existing IpcTestFixture instead of adding a new fixture?
```diff
diff --git a/cpp/src/arrow/ipc/read_write_test.cc b/cpp/src/arrow/ipc/read_write_test.cc
index 923670834..1e03472b8 100644
--- a/cpp/src/arrow/ipc/read_write_test.cc
+++ b/cpp/src/arrow/ipc/read_write_test.cc
@@ -397,6 +397,9 @@ class IpcTestFixture : public io::MemoryMapFixture, public ExtensionTypesMixin {
     ASSERT_OK_AND_ASSIGN(result, DoLargeRoundTrip(batch, /*zero_data=*/true));
     CheckReadResult(*result, batch);
+
+    ASSERT_OK_AND_ASSIGN(result, DoEndiannessRoundTrip(batch));
+    CheckReadResult(*result, batch);
   }
 
   void CheckRoundtrip(const std::shared_ptr<Array>& array,
                       IpcWriteOptions options = IpcWriteOptions::Defaults(),
```

DoEndiannessRoundTrip() will:
- create a nonnative endianness record batch by swapping the given record batch's endianness
  - We need to implement this endianness conversion feature, but how difficult is it...?
- run DoStandardRoundTrip() against the nonnative endianness record batch
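A rough sketch of what that fixture method could look like; SwapEndianness is a hypothetical placeholder for the conversion helper that does not exist yet, and the DoStandardRoundTrip arguments are only approximated from the existing fixture.

```cpp
// Outline only: SwapEndianness is hypothetical, and the exact
// DoStandardRoundTrip signature follows the existing fixture loosely.
Result<std::shared_ptr<RecordBatch>> DoEndiannessRoundTrip(
    const std::shared_ptr<RecordBatch>& batch) {
  // 1. Build a batch whose buffers (and schema endianness flag) use the
  //    non-native byte order by swapping the given native-endian batch.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<RecordBatch> swapped,
                        SwapEndianness(*batch));

  // 2. Run the normal write/read round trip on the swapped batch; the reader
  //    is expected to return native-endian data, so the caller can compare
  //    the result directly against the original batch.
  DictionaryMemo dictionary_memo;
  return DoStandardRoundTrip(*swapped, IpcWriteOptions::Defaults(),
                             &dictionary_memo);
}
```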
BTW, should we always convert endianness in IPC? If users just relay data (A (little, generates data) -> B (big, doesn't touch data) -> C (little, processes data)), it'll be inefficient.
Or should we convert endianness explicitly outside of IPC by providing Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness) or something?
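For the relay case, only the consumer needs to pay for the conversion. Below is a small, self-contained sketch of the host-endianness check such an explicit, out-of-IPC conversion step (e.g. the proposed EnsureEndianness) would perform; the Endianness enum and HostEndianness function are illustrative names only.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

enum class Endianness { Little, Big };

// Detect the byte order of the machine this code runs on.
Endianness HostEndianness() {
  const uint16_t probe = 0x0102;
  uint8_t first_byte;
  std::memcpy(&first_byte, &probe, 1);
  return first_byte == 0x02 ? Endianness::Little : Endianness::Big;
}

int main() {
  // A (little) -> B (big, relays untouched) -> C (little, processes).
  // B never inspects values, so it skips the check; only C runs it.
  const Endianness declared_by_sender = Endianness::Little;
  if (declared_by_sender != HostEndianness()) {
    std::cout << "convert before processing (e.g. an EnsureEndianness-style call)\n";
  } else {
    std::cout << "no conversion needed; process the data as-is\n";
  }
  return 0;
}
```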
Good point. My current idea is to implement another version of ArrayLoader (its LoadType methods), which is called from LoadRecordBatchSubset. The existing version performs no endian conversion; the other version performs the conversion. At that level we know the type of each element, so I imagine it is easy to swap the endianness. What do you think?
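To make that concrete, here is a minimal, self-contained sketch (not the actual ArrayLoader code; the helper name is made up) of swapping a buffer of fixed-width values once the element width is known from the type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: reverse the byte order of every fixed-width value in a
// buffer, given the value width in bytes. Once the loader knows the element
// type (and hence its width), this is all a primitive column's data needs.
void SwapFixedWidthBuffer(uint8_t* data, int64_t length, int byte_width) {
  for (int64_t i = 0; i < length; ++i) {
    std::reverse(data + i * byte_width, data + (i + 1) * byte_width);
  }
}

int main() {
  // Example: an int32 buffer {1, 2} swapped to the opposite byte order and back.
  std::vector<int32_t> values = {1, 2};
  auto* bytes = reinterpret_cast<uint8_t*>(values.data());
  SwapFixedWidthBuffer(bytes, /*length=*/2, /*byte_width=*/4);  // now non-native
  SwapFixedWidthBuffer(bytes, /*length=*/2, /*byte_width=*/4);  // back to native
  assert(values[0] == 1 && values[1] == 2);
  return 0;
}
```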
Adding a new API Result<std::shared_ptr<RecordBatch>> RecordBatch::EnsureEndianness(Endianness) is a good idea to avoid unnecessary conversion in the above scenario. Technically, it could work correctly.
I have two questions.
- I imagine that this API would be called explicitly by the user. If so, is this change acceptable to users? Of course, most scenarios (communication between platforms of the same endianness) work well without calling this new API.
- This new API needs to traverse all of the elements again when data conversion is required, which may lead to additional overhead.
What do you think? cc @pitrou
@kiszk I would suggest creating a LE and BE example corpus in apache/arrow-testing. You can use the integration test command line tools to create point-of-truth JSON files and then use the JSON_TO_ARROW mode of the integration test executable to create IPC binary files that can be checked in.
@wesm Thank you for your suggestion. I will pursue the approach that you suggested. I will check the integration test command line tool and the integration test with the JSON_TO_ARROW mode.
@wesm Can I ask a question? I am writing a test case after creating a point-of-truth JSON file and generating two arrow binary files for BE and LE. Is my approach what you suggested? I am writing a function: https://gist.github.com/kiszk/bccb2640a68b706049c1631ba9514eae
I was suggesting to generate BE and LE files to be put in this directory. Then you can specify the respective directory of, e.g., BE files to be read on an LE platform. I think it's better to stick with the integration test tools for this rather than writing new C++ code.
@wesm Thank you for your clarification. I had thought that I would check in binary files and write C++ code to validate the read results. Now I realize that the integration test tools can do this.
Yes, you need to use the
This commit succeeds in running an integration test with primitive types, like this.
After apache/arrow-testing#41 is merged, I will update the
ci/scripts/integration_arrow.sh
Outdated
Since other languages (e.g. Java) cause integration test failures, a future separate PR will fix the failures in those languages.
This PR focuses on the C++ integration test.
You mean even the little-endian cases fail on Java?
You didn't answer my question above :-)
Also, instead of running archery integration a second time, you should add the new "gold dirs" to the command above, and add selective skips like in https://github.com/apache/arrow/blob/master/dev/archery/archery/integration/runner.py#L129
Oh, sorry, I should have mentioned you here.
The answer is yes. It fails on Java at least.
Sure, I will update runner.py instead of integration_arrow.sh.
Thank you very much @kiszk! I'll merge once CI is green.
The CI failure is unrelated (see https://issues.apache.org/jira/browse/ARROW-11668).
@pitrou I really appreciate your valuable feedback.
This PR adds a test for receiving a RecordBatch of a different endianness (e.g. receiving a RecordBatch with a big-endian schema on a little-endian platform).
This PR makes the following changes:
- Add endianness to arrow::schema to represent the endianness of the Array data in a RecordBatch.
- Add IpcReadOption.use_native_endian (true by default).

Regarding the use_native_endian option, other possible choices are as follows:
- Convert lazily, only when a column is actually read.
  Pros: avoids conversion for columns that will not be read.
  Cons: complex management of the endianness of each column, and inconsistent endianness between the schema and the column data.
- Keep the original schema's endianness.
  Pros: the original schema can be kept without batch conversion.
  Cons: 1) each RecordBatch may need an additional field to indicate whether endian conversion is necessary; 2) test cases would need to be updated to accept different endianness between the expected and actual schemas.
Now, this PR uses the simplest approach (readers always see native endianness in schemas), which eagerly converts all of the columns in a batch.
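As a usage sketch of the resulting behavior: reading an IPC file that may have been written on a machine with a different byte order would look roughly like this. The use_native_endian field name follows this description and is left commented out, since the exact field name in the released API may differ; the other calls are the standard Arrow file-reader API.

```cpp
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/ipc/options.h"
#include "arrow/ipc/reader.h"
#include "arrow/record_batch.h"
#include "arrow/result.h"
#include "arrow/status.h"

// Read every batch of an IPC file; with the new option enabled (the default),
// the returned batches are always native-endian.
arrow::Status ReadPossiblyForeignEndianFile(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  auto options = arrow::ipc::IpcReadOptions::Defaults();
  // options.use_native_endian = true;  // field name as used in this PR's description

  ARROW_ASSIGN_OR_RAISE(
      auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
  for (int i = 0; i < reader->num_record_batches(); ++i) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::RecordBatch> batch,
                          reader->ReadRecordBatch(i));
    // batch and batch->schema() now use the platform's native endianness.
  }
  return arrow::Status::OK();
}
```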
TODO
@kou helped me greatly in creating this PR (with a prototype of the arrow::schema change and by teaching me about RecordBatch).