WIP: Refactor arrow functions; Add support for more types #14
Conversation
@BryanCutler Li and I are going to spend a bunch of time on this over the next few weeks leading up to Spark Summit (and beyond) -- how can we coordinate to best align efforts? I'm also interested in providing an alternate code path for UDF evaluation, but I'm not sure how complicated it would be to share code between the collect* functions and the streaming UDF evaluator. For my part, I can fill in feature gaps in Arrow C++/Python -- decimal support is one thing that comes to mind.
package org.apache.spark.sql.arrow

trait ColumnWriter {
}
Is this planned to be used later? Ok if I don't merge it now?
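For context, here is a minimal sketch of what per-type ColumnWriter implementations could look like. The IntColumnWriter/StringColumnWriter names and the ArrayBuffer backing are illustrative assumptions standing in for Arrow vectors, not the PR's actual code.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sketch: one writer per column type; an ArrayBuffer stands in
// for the Arrow vector the real implementation would fill.
trait ColumnWriter {
  def write(row: InternalRow, ordinal: Int): Unit
}

class IntColumnWriter extends ColumnWriter {
  private val values = ArrayBuffer.empty[Option[Int]]
  override def write(row: InternalRow, ordinal: Int): Unit = {
    if (row.isNullAt(ordinal)) values += None
    else values += Some(row.getInt(ordinal))
  }
}

class StringColumnWriter extends ColumnWriter {
  private val values = ArrayBuffer.empty[Option[String]]
  override def write(row: InternalRow, ordinal: Int): Unit = {
    if (row.isNullAt(ordinal)) values += None
    else values += Some(row.getUTF8String(ordinal).toString)
  }
}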
case class NullInts(a: Integer)
case class NullStrings(value: String)
This looks mostly borrowed from SQLTestData.scala. Mind if I try to rework the tests to use that test data and avoid adding this file?
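If the separate test-data file does go away, the same rows could instead be defined inline in the suite. A rough sketch, assuming Spark's existing test infrastructure (QueryTest, SharedSQLContext, testImplicits); the ArrowSuite name and the assertions are placeholders, and the case classes are redefined here only to keep the example self-contained.

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext

case class NullInts(a: Integer)
case class NullStrings(value: String)

class ArrowSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("columns with nulls") {
    // Build the test rows inline instead of adding a separate test-data file.
    val nullInts = Seq(NullInts(1), NullInts(null), NullInts(3)).toDF()
    val nullStrings = Seq(NullStrings("a"), NullStrings(null)).toDF()

    assert(nullInts.filter($"a".isNull).count() === 1)
    assert(nullStrings.filter($"value".isNull).count() === 1)
  }
}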
Thanks @icexelloss, looks good. I'll merge after fixing up the test data.
I'd like to merge what we have so far back to the Spark PR to get some other eyes on it and show some preliminary benchmarks. I think it's close to being ready, but we need to add more tests and be clear about what is and isn't supported. I'll try to put together a checklist of what to do. @icexelloss, if you could add support for nested types and help with test coverage, that would be great! @wesm, any feature gaps in Arrow you can fill in would be nice so that we can push to get this into Spark after the Arrow 0.2 release.
We had discussed this too for follow-up work, but I haven't looked into it yet, so I'm not sure what it would take either.
Thanks for the update. If you all could help me by driving Arrow feature requirements from the Spark side (e.g. failing unit tests because we're missing this or that type implementation), versus the other way around, that would be very helpful. I should be able to turn around work pretty quickly as needed over the next couple of weeks.
Sure thing @wesm, will try to write tests to exercise different types.
Bryan,
Wes and I kept an internal list of things we would like to do:
(1) Support for more types. Besides the types that are currently supported, there are:
BinaryType, TimestampType, DateType
ArrayType, MapType, StructType
toPandas might not benefit from the last three (nested) types right now, but the idea is to design the code so that it can support nested types as well (see the sketch after this list).
(2) Benchmark. benchmark.py tests conversion for Int/Double with Spark local mode. We might want a more convincing benchmark that shows results for:
- cluster mode
- a more complex schema
(3) Profiling. Profile to eliminate inefficiency in our implementation.
(4) Support multiple record batches and reduce the memory footprint. The current implementation converts the entire dataframe to one record batch. We want an implementation that is more memory efficient.
(5) Arrow-based serialization of pandas dataframes to Spark dataframes in SparkSession.createDataFrame.
What do you think?
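Regarding item (1), a rough sketch of what per-type dispatch could look like on the Scala side. The writer classes here are empty stubs standing in for writers like the ones sketched in the review comment above, and all the names are assumptions rather than the PR's current code.

import org.apache.spark.sql.types._

// Empty stubs standing in for real per-type writers.
trait ColumnWriter
class IntColumnWriter extends ColumnWriter
class DoubleColumnWriter extends ColumnWriter
class StringColumnWriter extends ColumnWriter
class BinaryColumnWriter extends ColumnWriter
class TimestampColumnWriter extends ColumnWriter
class DateColumnWriter extends ColumnWriter
class ArrayColumnWriter(element: ColumnWriter) extends ColumnWriter

object ColumnWriters {
  // Pick a writer from the Spark SQL data type; a nested type recurses into a
  // writer for its element type, so the same structure could later cover
  // ArrayType, MapType and StructType.
  def columnWriterFor(dataType: DataType): ColumnWriter = dataType match {
    case IntegerType   => new IntColumnWriter
    case DoubleType    => new DoubleColumnWriter
    case StringType    => new StringColumnWriter
    case BinaryType    => new BinaryColumnWriter
    case TimestampType => new TimestampColumnWriter
    case DateType      => new DateColumnWriter
    case ArrayType(elementType, _) => new ArrayColumnWriter(columnWriterFor(elementType))
    case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
  }
}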
Bryan,
Regarding the test data, I am a bit reluctant to reuse SQLTestData completely because:
(1) It creates a dependency on SQLTestData: if someone changes SQLTestData in the future, they may need to fix the arrow test file as well.
(2) SQLTestData might not be complex enough for arrow unit testing, so we might end up creating arrow test data anyway.
We should probably also ask the Spark folks what their preferences are. For now I can add test data in a temporary place and we can figure this out later.
Thanks @icexelloss, that sounds like a good list of things to work on. I'll respond based on what I think might be good for getting SPARK-13534 merged, which might not be easy, so it will be best to keep things simple and the scope to a minimum. Once Arrow is a dependency in Spark, follow-on work will be much easier to get merged.
For items 3 and 4: anything we can do to get better speedup and efficiency will help. However, if creating multiple batches at the worker level complicates the code too much, it may be better to save that for follow-up (a rough sketch of one approach follows this comment). Keeping things simple for reviewers who have never seen Arrow will make it easier.
Regarding the SQLTestData, it's fine to use -- that's why it's in SharedSQLContext. It's doubtful that this data would change, as they would need to update all dependent tests. As we need other test data, try to keep it in ArrowSuite for now and we can adjust later if needed. Thanks to you and Wes for the help on this. I'll focus on more tests and pinning down Arrow requirements.
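On the multiple-record-batch point (items 3 and 4 above), one low-complexity shape for bounding worker memory would be to slice each partition's row iterator into fixed-size chunks and convert one chunk at a time. A rough sketch, where rowsToRecordBatch is a hypothetical stand-in for the existing single-batch conversion and the default batch size is arbitrary.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

object ArrowBatching {
  // Hypothetical stand-in for the existing conversion of a set of rows into
  // one Arrow record batch payload.
  def rowsToRecordBatch(rows: Seq[InternalRow], schema: StructType): Array[Byte] = ???

  // Slice the partition's iterator into fixed-size chunks so only one chunk
  // of rows is materialized at a time, producing several smaller batches
  // instead of one batch for the whole dataframe.
  def toRecordBatches(
      rows: Iterator[InternalRow],
      schema: StructType,
      maxRecordsPerBatch: Int = 10000): Iterator[Array[Byte]] = {
    rows.grouped(maxRecordsPerBatch).map(chunk => rowsToRecordBatch(chunk, schema))
  }
}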
…cala changed tests to use existing SQLTestData and removed unused files closes #14
This PR includes:
(1) Fix conversion for String type
(2) Refactor related functions into arrow.scala
This is a work in progress; next I would like to add support for nested types and refactor internalRowsToArrowRecordBatch (use a different ColumnWriter for each type), as sketched below.
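A rough sketch of the shape that refactor could take: choose one writer per field up front and drive every writer over the rows, instead of matching on the data type inside the row loop. The signature, the ColumnWriter interface, and columnWriterFor are assumptions for illustration, not the function's actual interface.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, StructType}

object ArrowConvertersSketch {

  trait ColumnWriter {
    def write(row: InternalRow, ordinal: Int): Unit
  }

  // Placeholder: would return an IntColumnWriter, StringColumnWriter, etc.
  def columnWriterFor(dataType: DataType): ColumnWriter = ???

  // One writer per field; each row is appended column by column, and the
  // filled writers are what the record batch would then be built from.
  def internalRowsToArrowRecordBatch(
      rows: Iterator[InternalRow],
      schema: StructType): Array[ColumnWriter] = {
    val writers = schema.fields.map(f => columnWriterFor(f.dataType))
    rows.foreach { row =>
      var i = 0
      while (i < writers.length) {
        writers(i).write(row, i)
        i += 1
      }
    }
    writers
  }
}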