ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++ #7030

zhztheplayer · 2020-04-24T09:44:33Z

Add following Datasets APIs to Java:

DatasetFactory
Dataset
Scanner
ScanTask
ScanTask.BatchIterator

Add a native dataset path to bridge c++ Datasets components to Java:

NativeDatasetFactory (c++ class : DatasetFactory)
NativeDataset (c++ class: Dataset)
NativeScanner (c++ class: DisposableScannerAdaptor)

Following c++ components are not JNI-mapped to keep the initial implementation simple:

Fragment
ScanTask
(arrow::)RecordBatchIterator

Add following API to FileSystemDatasetFactory to avoid passing file system objects via JNI bridge:

FileSystemDatasetFactory::Make( std::string uri, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)

Unit tests are based on FileSystemDatasetFactory.

github-actions · 2020-04-24T09:57:49Z

https://issues.apache.org/jira/browse/ARROW-7808

emkornfield · 2020-04-25T05:19:20Z

cpp/src/jni/dataset/concurrent_map.h

Please do not copy and paste this file, factor it out to a common place that can be used by all JNI implementations.

Maybe move concurrent_map.h out into src/jni/? By the way, should we somehow migrate src/gandiva/jni to src/jni/gandiva, or migrate src/jni to src/arrow/jni? It seems that currently they are organized inconsistently.

see comments below, I don't think we should be using concurrent_map in general, especially for new use cases so we can drop it for now.

By the way, should we somehow migrate src/gandiva/jni to src/jni/gandiva, or migrate src/jni to src/arrow/jni? It seems that currently they are organized inconsistently.
I don't have a strong opinion here probably best to discuss this on the mailing list.

the file has been removed

emkornfield · 2020-04-25T05:25:51Z

cpp/src/jni/dataset/jni_wrapper.cpp

I don't think these macros are necessary. you should be able to use a normal (templated) function since you throw an exception instead of return.

Done. Thanks!

emkornfield · 2020-04-25T05:28:37Z

cpp/src/jni/dataset/jni_wrapper.cpp

it looks like there are a lot of FIXMEs can you address those?

ok. fixmes like this one are by my mistake.

emkornfield · 2020-04-25T05:30:37Z

cpp/src/jni/dataset/jni_wrapper.cpp

this still strikes me as a bad pattern.

https://stackoverflow.com/questions/26375215/c-shared-ptr-and-java-native-object-ownership/26377515 seems better.

Will do. But I kind of like the current pattern - it benefits more or less when debugging/tracing (self-growing id, instance cache, etc.), right? Is there any other consideration making it bad?

extra memory and resolution overhead. The instance cache should be handled by the underlying allocator which should be highly optimized for returning pointer sized values.

Could you explain how a self-growing ID is useful for debugging purposes? If it is required would a DEBUG log with only an atomic int associated with each new ID be sufficient?

Could you explain how a self-growing ID is useful for debugging purposes?

It may not be a considerable bit deal but growing ID numbers can be much smaller and more contextual. E.g. when we experience a JVM breakpoint debugging against some sort of abnormal duplicated buffer creation, if five different buffers with ID 10-15 are created one would easily remember it. Then if ID 12 is returned from JNI again then one instantly realizes that something must went wrong there.

If it is required would a DEBUG log with only an atomic int associated with each new ID be sufficient?

Yes I think if we are debugging with logs rather than breakpoints, what type of ID is used is indeed not that important because we can use control + F to highlight a specific instance anyway.

And as the performance impact is considered I don't have strong inclination. So let's just use pointers?

Another thing is we already use maps to cache JNI objects in other modules (gandiva-jni, orc-jni), we may need to unify the approaches somehow.

emkornfield · 2020-04-25T05:35:53Z

cpp/src/jni/dataset/proto/Types.proto

why is this file being introduced?

Currently it is for passing datasets filter via JNI. I see Wes has recommended not to include filter bindings (in the bottom comment) in this version. I may consider removing it in a later commit.

I think it's fine if you have a filter API in Java, but I think you should try to limit the exposure of the particular details of the current C++ API in Java. If the Java API has a 1-1 correspondence with the C++ API that's what I'm advising you against

Thanks, I think I would understand any of the concerns against a too detailed mapping. (actually currently Fragment is not mapped to Java-side so it's not yet a 1-1 mapping) I'll try remove more JNI stuffs.

As for filter, If I understand correctly it's OK to keep Java API but the JNI mapping for filters is considered fragile, right? I can remove the mapping anyway but when users read parquet files from Java they'll not be able to filter row groups to reduce I/O. Which is extremely important when low-selectivity filter is specified.

As for filter, If I understand correctly it's OK to keep Java API but the JNI mapping for filters is considered fragile, right? I can remove the mapping anyway but when users read parquet files from Java they'll not be able to filter row groups to reduce I/O. Which is extremely important when low-selectivity filter is specified.

I'm advising you to define some kind of abstraction layer so that the details of the C++ API are not exposed in the Java API. So if there are changes in the C++ API, then refactoring can take place only at the JNI C++ code and no changes will be required in Java.

I see. At this time I'll try keeping them detached as far as possible.

Can you start a thread on the mailing list about filter support. I'm concerned with a proto specific to the java implementation of datasets. Especially, because I think this essentially replicates what is in the Gandiva proto. We should ideally have one representation for all the languages that are going to bind to the same functionality.

Thanks for your suggestion, but they don't seem like the same functionality to me for dataset expression and Gandiva expression... In C++ code they are designed and placed differently and we'll face problems merging the JNI abstraction together. e.g. arrow::dataset::FieldExpression doesn't retain data type information (it is only represented by field name) where gandiva::FieldNode is type sensitive... I don't think it's a good idea to use any workaround to make them suitable for each other. What do you think?

I do not have a solution off the top of my head. But creating a new java->C++ protobuf abstraction seems like the the wrong way to go.

As far as i can tell gandiva protobuf declare column type optional. This needs more design which is why a discussion/proposal on the ML is warranted.

It might make sense to split out the filtering components of this PR so it doesn't get held up by the conversation.

Yes I prefer removing them from this PR as I think of the implementation immature too. I can open ticket/discussion later.

java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java

emkornfield · 2020-04-25T05:37:51Z

java/dataset/src/main/java/org/apache/arrow/dataset/filter/Filter.java

leave this out if it isn't implemented.

Removed FilterImpl and Filter is now made a class rather than interface. I think it's not prepared to be a user-friendly interface so far.

I've marked Filter as experimental in class comment.

emkornfield · 2020-04-25T05:39:00Z

java/dataset/src/main/java/org/apache/arrow/dataset/jni/JniWrapper.java

these methods need javadoc.

Javadoc added

emkornfield · 2020-04-25T05:42:38Z

java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanTask.java

can we call this "Iterator"?

Using Iterator is OK but it may collide with java.util.Iterator in scenarios. Should it be "BatchIterator" or something?

I would expect ScanTask.Iterator would be used consistently to avoid collision. BatchIterator is also code with me.

emkornfield · 2020-04-25T05:43:23Z

java/dataset/src/main/java/org/apache/arrow/dataset/scanner/ScanTask.java

Done. Now it is changed to iterate ArrowRecordBatch rather than VectorSchemaRoot.

emkornfield · 2020-04-25T05:43:52Z

java/dataset/src/main/java/org/apache/arrow/dataset/source/DatasetFactory.java

emkornfield · 2020-04-25T05:45:20Z

java/dataset/src/main/java/org/apache/arrow/memory/NativeUnderlingMemory.java

this should probably be in a different common package if we want to wrap the C++ allocator. It isn't clear to me that we should do this because we lose java native heap caps IIUC.

ummm if I am not wrong by "java native heap" you meant jvm direct memory right? I think it's possible to new a new type of c++ allocator to allocate for direct mem but I suggest to do so in a separate ticket. We have orc adaptor doing things worse in this part and we can optimize them together.

sounds ok to me.

I think we should probably hold off on merging until we resolve this. Just because another part of the codebase has debt or weaknesses doesn't mean we should introduce more of that.

Hi @jacques-n,

Thanks for your suggestion but I think the "weaknesses" of current implementation might be somehow overrated. Specifically for orc adaptor it's of course urgent to improve its memory allocation process because all of the allocated memory by orc is escaping from Java allocator. However we don't have the same problem here in Datasets as allocator is going to manage all the native memory. The only issue we may have is that OOM error might be thrown after a buffer is allocated rather than before we create it. But because OOM will be finally thrown and over-allocated memory will be finally released, the impact will be very very small.

And I knew we can possibly change to allocate Java memory directly from C++ via JNI, but codes will become messy and performance will turn to worse. That is why I didn't propose such a solution in this PR.

In reality, this is not at all the case. There are two ways that you run out of memory in Java: you run out of available allocation space OR you hit the limit at the JVM level. The limit at the JVM level can be influenced by things like fragmentation so in nearly all cases you hit that limit before the allocator managed limit. In your patch, you only are respecting the allocator set limit, not the jvm limit. This means a user could configure java to allow X of memory and the process actually will use 2X of memory, leading to issues in many systems.

In reality, this is not at all the case. There are two ways that you run out of memory in Java: you run out of available allocation space OR you hit the limit at the JVM level. The limit at the JVM level can be influenced by things like fragmentation so in nearly all cases you hit that limit before the allocator managed limit. In your patch, you only are respecting the allocator set limit, not the jvm limit. This means a user could configure java to allow X of memory and the process actually will use 2X of memory, leading to issues in many systems.

Making sense overall. So we may have to use a C++-to-Java/Netty allocation path in the first version right? I'll try it out.

Not necessarily. When you allocate and deallocate memory in c++ you can call (via reflection) java.nio.Bits.reserveMemory() and unreserveMemory(). If you wanted to be extra fancy you could actually just do the reads/writes to the underlying atomics. Alternatively, you could build a pooling impl for C++ that uses the JNIEnv.NewDirectByteBuffer().

emkornfield · 2020-04-25T05:47:29Z

java/dataset/src/main/java/org/apache/arrow/util/SchemaUtils.java

this probably belongs in the vector package.

emkornfield · 2020-04-25T05:47:48Z

java/dataset/src/main/java/org/apache/arrow/util/SchemaUtils.java

why not make these static methods? instead of using a singleton pattern?

It's just the way I got used to. Singleton makes it easier to extend things. As I noticed we use static pattern in other places I have changed to static now.

emkornfield · 2020-04-25T05:49:11Z

java/dataset/src/test/java/org/apache/arrow/util/SchemaUtilsTest.java

generally, I think we static import assertEquals.

emkornfield · 2020-04-25T05:49:40Z

java/pom.xml

revert this?

my mistake, reverted. Thanks!

emkornfield · 2020-04-25T05:52:34Z

java/dataset/src/test/java/org/apache/arrow/dataset/jni/NativeDatasetTest.java

please don't check the binary file into this repo. It would be better to generate a file on the fly, so readers of the test can understand what data exists there. If a standalone parquet file is really needed arrow-testing is the place to put it

Yes I did not mark the PR reviewable mainly because test codes are messy. I am working on refactoring

emkornfield · 2020-04-25T05:56:29Z

cpp/src/jni/dataset/jni_wrapper.cpp

it seems like a a bad idea to collect an iterator if the intention is for the iterator to provide laziness?

The logic is now moved into DisposableScannerAdaptor, which doesn't collect these tasks anymore. Also removed the collect function.

emkornfield · 2020-04-25T05:57:29Z

cpp/src/jni/dataset/jni_wrapper.cpp

style nit: I think Pure C++ methods should be TranslateFilter

~~done but I'll remove the entire function~~

wesm · 2020-04-27T01:24:34Z

As a high level advisory: the Datasets C++ API should still be regarded as alpha-stage, so I recommend making the JNI bindings as minimal as possible (to satisfy required functionality) so that refactoring is not too painful.

Additionally, I recommend that you do not build JNI bindings for the Expression classes in arrow/dataset/filter.h -- this code is likely to change quite a bit as the associated query engine project gets off the ground this year.

zhztheplayer · 2020-04-28T10:43:59Z

Hi @wesm @emkornfield , based on the discussion in the latest commit I removed some of the JNI mappings of c++ components. Now only Dataset(Factory) and Scanner are there to be bridged. Would you think of this a feasible direction?

emkornfield · 2020-04-29T05:31:09Z

I'll let @wesm comment on the specifics of should be JNI bound.

What I would prefer not to see is a custom proto living on the java side solely for describing filters.

zhztheplayer · 2020-04-29T07:15:29Z

What I would prefer not to see is a custom proto living on the java side solely for describing filters.

Yea the current approach is just working but proto-generated classes are totally unfriendly to use and evolve. I may intend to mark them all experimental, or just develop a new way in this PR, I don't know. Let me know if you have more suggestions on this.

pitrou · 2021-01-07T08:09:09Z

@zhztheplayer Yes, all Arrow C++ code should be inside our own namespaces to avoid potential name clashes.
Also, for private functions and classes it is recommended to put them inside the anonymous namespace (in addition to the arrow::... namespace) to hide them from other modules being linked.

zhztheplayer · 2021-01-25T07:42:11Z

Hi @emkornfield @pitrou

I think I have addressed existing comments so would you like to take another look? Thanks a lot.

Based on some recent comments I think we may don't have to make 100% consensus on the current design of ReservationListenableMemoryPool or relevant things instantly, if this part is in a way controversial I could split out the part from this PR then create a new one for it if we can possibly get this merged.

And if you have other suggestions on the whole design or architecture of the basic functionality please be free to add comments and I am happy to make further refactors accordingly.

emkornfield · 2021-01-30T05:24:51Z

cpp/src/jni/dataset/jni_wrapper.cpp

nit: format_id

emkornfield · 2021-01-30T05:31:38Z

cpp/src/jni/dataset/jni_wrapper.cpp

this causes an extra copy just construct directly with chars.

Thanks. Didn't know the constructor implies a copy.

emkornfield · 2021-01-30T05:32:17Z

cpp/src/jni/dataset/jni_wrapper.cpp

since copied is unused just pass a nullptr (according to JNI docs it seems like this is allowed. /copied=/nullptr

Right. Done

emkornfield · 2021-01-30T05:32:48Z

cpp/src/jni/dataset/jni_wrapper.cpp

checked_cast

Changed to reinterpret_cast. Correct me if I am wrong we cannot use arrow::internal::checked_cast (if you meant to use this one) here since class _jobject doesn't include a virtual members for dynamic_cast: https://github.com/AdoptOpenJDK/openjdk-jdk/blob/9a9add8825a040565051a09010b29b099c2e7d49/jdk/src/share/javavm/export/jni.h#L67-L80

emkornfield · 2021-01-30T05:41:04Z

cpp/src/jni/dataset/jni_wrapper.cpp

there is no reason for the pimpl pattern if both interface and implementation are bot in the .cpp file.

but if there aren't any tests that test this functionality, then I would recommend splitting this out to another .h/.cc file under this directory so tests can be written.

I've made some refactors on jni_wrapper.cpp. Now there is a new source file jni_util.cpp/jni_util.h for storing util members in namespace arrow::dataset::jni.

Added some unit tests

emkornfield · 2021-01-30T05:41:30Z

cpp/src/jni/dataset/jni_wrapper.cpp

nit: Ithink this can be an anonymous namespace?

See above. Do we still have to do this after the refactors

emkornfield · 2021-01-30T05:47:53Z

cpp/src/jni/dataset/jni_wrapper.cpp

any reason for protected an not private?

emkornfield · 2021-01-30T05:50:28Z

java/dataset/src/main/java/org/apache/arrow/dataset/jni/DirectReservationListener.java

this is misleading, it should be less then int32 max value.

Done, thanks.

emkornfield · 2021-01-30T05:56:16Z

java/dataset/src/test/java/org/apache/arrow/dataset/file/TestFileSystemDataset.java

could we separate out allocation tests to their own file?

zhztheplayer · 2021-02-03T14:09:14Z

Hi @emkornfield. I've addressed the new comments within the latest commit, please review. Previous commits have been merged into a single one.

emkornfield · 2021-02-07T23:10:52Z

cpp/src/jni/dataset/jni_util.h

+namespace dataset {
+namespace jni {
+
+class JniPendingException : public std::runtime_error {


does this need to be in the header? Please add docs?

Moved these error handling tools back to jni_wrapper.cc.

I was thinking of preparing some common utilities for both jni/dataset jni/orc (or maybe gandiva/jni). But we can actually do this later in some other tickets. The utilities also can not work well without the macros in jni_wrapper.cc.

Added some docs to other interfaces in this header.

yeah, I think this is a good thought and it would be great to see some refactoring here to share common code. Thank you for taking the time to address all the concerns around this.

emkornfield · 2021-02-07T23:12:29Z

java/vector/src/main/java/org/apache/arrow/vector/util/SchemaUtility.java

+ * Schema utility class including serialization and deserialization.
+ */
+public class SchemaUtility {
+


private constructor to prevent initialization?

emkornfield

Two minor comments and then I think we can merge this.

zhztheplayer · 2021-02-08T15:16:22Z

Thanks @emkornfield. I've made updates accordingly.

ZhaoNeil · 2022-07-13T13:26:57Z

Hi, I have a question about why Java implementation still uses C++ memory allocation?

Add following Datasets APIs to Java: - DatasetFactory - Dataset - Scanner - ScanTask - ScanTask.BatchIterator Add a native dataset path to bridge c++ Datasets components to Java: - NativeDatasetFactory (c++ class : DatasetFactory) - NativeDataset (c++ class: Dataset) - NativeScanner (c++ class: DisposableScannerAdaptor) Following c++ components are not JNI-mapped to keep the initial implementation simple: - Fragment - ScanTask - (arrow::)RecordBatchIterator Add following API to `FileSystemDatasetFactory` to avoid passing file system objects via JNI bridge: - `FileSystemDatasetFactory::Make( std::string uri, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)` Unit tests are based on `FileSystemDatasetFactory`. Closes apache#7030 from zhztheplayer/ARROW-7808 Authored-by: Hongze Zhang <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>

zhztheplayer force-pushed the ARROW-7808 branch from 55836f5 to a1178d8 Compare April 24, 2020 09:45

nealrichardson requested a review from bkietz April 24, 2020 14:43

emkornfield reviewed Apr 25, 2020

View reviewed changes

java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java Outdated Show resolved Hide resolved

emkornfield reviewed Apr 25, 2020

View reviewed changes

wesm changed the title ~~ARROW-7808: [Java][Dataset] Implement Datasets Java API~~ ARROW-7808: [Java][Dataset] Implement Datasets Java API by JNI to C++ Apr 27, 2020

fsaintjacques added the Component: Java label Apr 27, 2020

zhztheplayer force-pushed the ARROW-7808 branch from 349858f to 38cf4c8 Compare April 28, 2020 07:48

zhztheplayer force-pushed the ARROW-7808 branch from 591a841 to 89d6ce0 Compare April 29, 2020 07:28

emkornfield reviewed Jan 30, 2021

View reviewed changes

github-actions bot added the Component: C++ label Feb 1, 2021

Datasets Java API

d484521

zhztheplayer force-pushed the ARROW-7808 branch from b7ed735 to 38e8e2b Compare February 2, 2021 02:06

Several fixes

42ff0d2

zhztheplayer force-pushed the ARROW-7808 branch from 17aad82 to 42ff0d2 Compare February 3, 2021 14:09

emkornfield reviewed Feb 7, 2021

View reviewed changes

emkornfield approved these changes Feb 7, 2021

View reviewed changes

Several fixes

d68b66b

zhztheplayer force-pushed the ARROW-7808 branch from 35df503 to d68b66b Compare February 8, 2021 14:42

emkornfield closed this in d25660e Feb 9, 2021

asfimport mentioned this pull request Jul 13, 2022

[Java][Dataset] Implement Datasets Java API #17055

Closed

ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++ #7030

ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++ #7030

Uh oh!

Conversation

zhztheplayer commented Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented Apr 24, 2020

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer Apr 26, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer May 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer Apr 27, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer Apr 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer Apr 28, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zhztheplayer commented Apr 24, 2020 •

edited

Loading

zhztheplayer Apr 26, 2020 •

edited

Loading

zhztheplayer May 11, 2020 •

edited

Loading

zhztheplayer Apr 27, 2020 •

edited

Loading

zhztheplayer Apr 28, 2020 •

edited

Loading

zhztheplayer Apr 28, 2020 •

edited

Loading

zhztheplayer Jun 15, 2020 •

edited

Loading

jacques-n Jul 6, 2020 •

edited

Loading