-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-7808: [Java][Dataset] Implement Dataset Java API by JNI to C++ #7030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
55836f5 to
a1178d8
Compare
cpp/src/jni/dataset/concurrent_map.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not copy and paste this file, factor it out to a common place that can be used by all JNI implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe move concurrent_map.h out into src/jni/? By the way, should we somehow migrate src/gandiva/jni to src/jni/gandiva, or migrate src/jni to src/arrow/jni? It seems that currently they are organized inconsistently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comments below, I don't think we should be using concurrent_map in general, especially for new use cases so we can drop it for now.
By the way, should we somehow migrate src/gandiva/jni to src/jni/gandiva, or migrate src/jni to src/arrow/jni? It seems that currently they are organized inconsistently.
I don't have a strong opinion here probably best to discuss this on the mailing list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the file has been removed
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think these macros are necessary. you should be able to use a normal (templated) function since you throw an exception instead of return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks!
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like there are a lot of FIXMEs can you address those?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok. fixmes like this one are by my mistake.
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this still strikes me as a bad pattern.
https://stackoverflow.com/questions/26375215/c-shared-ptr-and-java-native-object-ownership/26377515 seems better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do. But I kind of like the current pattern - it benefits more or less when debugging/tracing (self-growing id, instance cache, etc.), right? Is there any other consideration making it bad?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra memory and resolution overhead. The instance cache should be handled by the underlying allocator which should be highly optimized for returning pointer sized values.
Could you explain how a self-growing ID is useful for debugging purposes? If it is required would a DEBUG log with only an atomic int associated with each new ID be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain how a self-growing ID is useful for debugging purposes?
It may not be a considerable bit deal but growing ID numbers can be much smaller and more contextual. E.g. when we experience a JVM breakpoint debugging against some sort of abnormal duplicated buffer creation, if five different buffers with ID 10-15 are created one would easily remember it. Then if ID 12 is returned from JNI again then one instantly realizes that something must went wrong there.
If it is required would a DEBUG log with only an atomic int associated with each new ID be sufficient?
Yes I think if we are debugging with logs rather than breakpoints, what type of ID is used is indeed not that important because we can use control + F to highlight a specific instance anyway.
And as the performance impact is considered I don't have strong inclination. So let's just use pointers?
Another thing is we already use maps to cache JNI objects in other modules (gandiva-jni, orc-jni), we may need to unify the approaches somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this file being introduced?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently it is for passing datasets filter via JNI. I see Wes has recommended not to include filter bindings (in the bottom comment) in this version. I may consider removing it in a later commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine if you have a filter API in Java, but I think you should try to limit the exposure of the particular details of the current C++ API in Java. If the Java API has a 1-1 correspondence with the C++ API that's what I'm advising you against
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think I would understand any of the concerns against a too detailed mapping. (actually currently Fragment is not mapped to Java-side so it's not yet a 1-1 mapping) I'll try remove more JNI stuffs.
As for filter, If I understand correctly it's OK to keep Java API but the JNI mapping for filters is considered fragile, right? I can remove the mapping anyway but when users read parquet files from Java they'll not be able to filter row groups to reduce I/O. Which is extremely important when low-selectivity filter is specified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As for filter, If I understand correctly it's OK to keep Java API but the JNI mapping for filters is considered fragile, right? I can remove the mapping anyway but when users read parquet files from Java they'll not be able to filter row groups to reduce I/O. Which is extremely important when low-selectivity filter is specified.
I'm advising you to define some kind of abstraction layer so that the details of the C++ API are not exposed in the Java API. So if there are changes in the C++ API, then refactoring can take place only at the JNI C++ code and no changes will be required in Java.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. At this time I'll try keeping them detached as far as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you start a thread on the mailing list about filter support. I'm concerned with a proto specific to the java implementation of datasets. Especially, because I think this essentially replicates what is in the Gandiva proto. We should ideally have one representation for all the languages that are going to bind to the same functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestion, but they don't seem like the same functionality to me for dataset expression and Gandiva expression... In C++ code they are designed and placed differently and we'll face problems merging the JNI abstraction together. e.g. arrow::dataset::FieldExpression doesn't retain data type information (it is only represented by field name) where gandiva::FieldNode is type sensitive... I don't think it's a good idea to use any workaround to make them suitable for each other. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not have a solution off the top of my head. But creating a new java->C++ protobuf abstraction seems like the the wrong way to go.
As far as i can tell gandiva protobuf declare column type optional. This needs more design which is why a discussion/proposal on the ML is warranted.
It might make sense to split out the filtering components of this PR so it doesn't get held up by the conversation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I prefer removing them from this PR as I think of the implementation immature too. I can open ticket/discussion later.
java/dataset/src/main/java/org/apache/arrow/dataset/file/JniWrapper.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
leave this out if it isn't implemented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed FilterImpl and Filter is now made a class rather than interface. I think it's not prepared to be a user-friendly interface so far.
I've marked Filter as experimental in class comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these methods need javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we call this "Iterator"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Iterator is OK but it may collide with java.util.Iterator in scenarios. Should it be "BatchIterator" or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect ScanTask.Iterator would be used consistently to avoid collision. BatchIterator is also code with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Now it is changed to iterate ArrowRecordBatch rather than VectorSchemaRoot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should probably be in a different common package if we want to wrap the C++ allocator. It isn't clear to me that we should do this because we lose java native heap caps IIUC.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ummm if I am not wrong by "java native heap" you meant jvm direct memory right? I think it's possible to new a new type of c++ allocator to allocate for direct mem but I suggest to do so in a separate ticket. We have orc adaptor doing things worse in this part and we can optimize them together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds ok to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should probably hold off on merging until we resolve this. Just because another part of the codebase has debt or weaknesses doesn't mean we should introduce more of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @jacques-n,
Thanks for your suggestion but I think the "weaknesses" of current implementation might be somehow overrated. Specifically for orc adaptor it's of course urgent to improve its memory allocation process because all of the allocated memory by orc is escaping from Java allocator. However we don't have the same problem here in Datasets as allocator is going to manage all the native memory. The only issue we may have is that OOM error might be thrown after a buffer is allocated rather than before we create it. But because OOM will be finally thrown and over-allocated memory will be finally released, the impact will be very very small.
And I knew we can possibly change to allocate Java memory directly from C++ via JNI, but codes will become messy and performance will turn to worse. That is why I didn't propose such a solution in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reality, this is not at all the case. There are two ways that you run out of memory in Java: you run out of available allocation space OR you hit the limit at the JVM level. The limit at the JVM level can be influenced by things like fragmentation so in nearly all cases you hit that limit before the allocator managed limit. In your patch, you only are respecting the allocator set limit, not the jvm limit. This means a user could configure java to allow X of memory and the process actually will use 2X of memory, leading to issues in many systems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In reality, this is not at all the case. There are two ways that you run out of memory in Java: you run out of available allocation space OR you hit the limit at the JVM level. The limit at the JVM level can be influenced by things like fragmentation so in nearly all cases you hit that limit before the allocator managed limit. In your patch, you only are respecting the allocator set limit, not the jvm limit. This means a user could configure java to allow X of memory and the process actually will use 2X of memory, leading to issues in many systems.
Making sense overall. So we may have to use a C++-to-Java/Netty allocation path in the first version right? I'll try it out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not necessarily. When you allocate and deallocate memory in c++ you can call (via reflection) java.nio.Bits.reserveMemory() and unreserveMemory(). If you wanted to be extra fancy you could actually just do the reads/writes to the underlying atomics. Alternatively, you could build a pooling impl for C++ that uses the JNIEnv.NewDirectByteBuffer().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this probably belongs in the vector package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not make these static methods? instead of using a singleton pattern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just the way I got used to. Singleton makes it easier to extend things. As I noticed we use static pattern in other places I have changed to static now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally, I think we static import assertEquals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
java/pom.xml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my mistake, reverted. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't check the binary file into this repo. It would be better to generate a file on the fly, so readers of the test can understand what data exists there. If a standalone parquet file is really needed arrow-testing is the place to put it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I did not mark the PR reviewable mainly because test codes are messy. I am working on refactoring
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like a a bad idea to collect an iterator if the intention is for the iterator to provide laziness?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic is now moved into DisposableScannerAdaptor, which doesn't collect these tasks anymore. Also removed the collect function.
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style nit: I think Pure C++ methods should be TranslateFilter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done but I'll remove the entire function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
|
As a high level advisory: the Datasets C++ API should still be regarded as alpha-stage, so I recommend making the JNI bindings as minimal as possible (to satisfy required functionality) so that refactoring is not too painful. Additionally, I recommend that you do not build JNI bindings for the Expression classes in arrow/dataset/filter.h -- this code is likely to change quite a bit as the associated query engine project gets off the ground this year. |
349858f to
38cf4c8
Compare
|
Hi @wesm @emkornfield , based on the discussion in the latest commit I removed some of the JNI mappings of c++ components. Now only Dataset(Factory) and Scanner are there to be bridged. Would you think of this a feasible direction? |
|
I'll let @wesm comment on the specifics of should be JNI bound. What I would prefer not to see is a custom proto living on the java side solely for describing filters. |
Yea the current approach is just working but proto-generated classes are totally unfriendly to use and evolve. I may intend to mark them all experimental, or just develop a new way in this PR, I don't know. Let me know if you have more suggestions on this. |
591a841 to
89d6ce0
Compare
|
@zhztheplayer Yes, all Arrow C++ code should be inside our own namespaces to avoid potential name clashes. |
|
I think I have addressed existing comments so would you like to take another look? Thanks a lot. Based on some recent comments I think we may don't have to make 100% consensus on the current design of And if you have other suggestions on the whole design or architecture of the basic functionality please be free to add comments and I am happy to make further refactors accordingly. |
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: format_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this causes an extra copy just construct directly with chars.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Didn't know the constructor implies a copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since copied is unused just pass a nullptr (according to JNI docs it seems like this is allowed. /copied=/nullptr
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Done
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checked_cast
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to reinterpret_cast. Correct me if I am wrong we cannot use arrow::internal::checked_cast (if you meant to use this one) here since class _jobject doesn't include a virtual members for dynamic_cast: https://github.com/AdoptOpenJDK/openjdk-jdk/blob/9a9add8825a040565051a09010b29b099c2e7d49/jdk/src/share/javavm/export/jni.h#L67-L80
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is no reason for the pimpl pattern if both interface and implementation are bot in the .cpp file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but if there aren't any tests that test this functionality, then I would recommend splitting this out to another .h/.cc file under this directory so tests can be written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made some refactors on jni_wrapper.cpp. Now there is a new source file jni_util.cpp/jni_util.h for storing util members in namespace arrow::dataset::jni.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some unit tests
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Ithink this can be an anonymous namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. Do we still have to do this after the refactors
cpp/src/jni/dataset/jni_wrapper.cpp
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason for protected an not private?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is misleading, it should be less then int32 max value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we separate out allocation tests to their own file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
b7ed735 to
38e8e2b
Compare
17aad82 to
42ff0d2
Compare
|
Hi @emkornfield. I've addressed the new comments within the latest commit, please review. Previous commits have been merged into a single one. |
cpp/src/jni/dataset/jni_util.h
Outdated
| namespace dataset { | ||
| namespace jni { | ||
|
|
||
| class JniPendingException : public std::runtime_error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be in the header? Please add docs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved these error handling tools back to jni_wrapper.cc.
I was thinking of preparing some common utilities for both jni/dataset jni/orc (or maybe gandiva/jni). But we can actually do this later in some other tickets. The utilities also can not work well without the macros in jni_wrapper.cc.
Added some docs to other interfaces in this header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I think this is a good thought and it would be great to see some refactoring here to share common code. Thank you for taking the time to address all the concerns around this.
| * Schema utility class including serialization and deserialization. | ||
| */ | ||
| public class SchemaUtility { | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private constructor to prevent initialization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
emkornfield
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor comments and then I think we can merge this.
35df503 to
d68b66b
Compare
|
Thanks @emkornfield. I've made updates accordingly. |
|
Hi, I have a question about why Java implementation still uses C++ memory allocation? |
Add following Datasets APIs to Java: - DatasetFactory - Dataset - Scanner - ScanTask - ScanTask.BatchIterator Add a native dataset path to bridge c++ Datasets components to Java: - NativeDatasetFactory (c++ class : DatasetFactory) - NativeDataset (c++ class: Dataset) - NativeScanner (c++ class: DisposableScannerAdaptor) Following c++ components are not JNI-mapped to keep the initial implementation simple: - Fragment - ScanTask - (arrow::)RecordBatchIterator Add following API to `FileSystemDatasetFactory` to avoid passing file system objects via JNI bridge: - `FileSystemDatasetFactory::Make( std::string uri, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)` Unit tests are based on `FileSystemDatasetFactory`. Closes apache#7030 from zhztheplayer/ARROW-7808 Authored-by: Hongze Zhang <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>
Add following Datasets APIs to Java:
Add a native dataset path to bridge c++ Datasets components to Java:
Following c++ components are not JNI-mapped to keep the initial implementation simple:
Add following API to
FileSystemDatasetFactoryto avoid passing file system objects via JNI bridge:FileSystemDatasetFactory::Make( std::string uri, std::shared_ptr<FileFormat> format, FileSystemFactoryOptions options)Unit tests are based on
FileSystemDatasetFactory.