WIP: Refactor arrow functions; Add support for more types #14
Conversation
@BryanCutler Li and I are going to spend a bunch of time on this over the next few weeks leading up to Spark Summit (and beyond) -- how can we coordinate to best align efforts? I'm also interested in providing an alternate code path for UDF evaluation, but I'm not sure how complicated it would be to share code between the collect* functions and the streaming UDF evaluator. For my part, I can fill in feature gaps in Arrow C++/Python -- decimal support is one thing that comes to mind.
package org.apache.spark.sql.arrow

trait ColumnWriter {
}
Is this planned to be used later? Ok if I don't merge it now?
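For context, here is a minimal sketch of what per-type ColumnWriter implementations could look like. The IntColumnWriter/StringColumnWriter names and the ArrayBuffer backing are illustrative assumptions standing in for Arrow vectors, not the PR's actual code.

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical sketch: one writer per column type; an ArrayBuffer stands in
// for the Arrow vector the real implementation would fill.
trait ColumnWriter {
  def write(row: InternalRow, ordinal: Int): Unit
}

class IntColumnWriter extends ColumnWriter {
  private val values = ArrayBuffer.empty[Option[Int]]
  override def write(row: InternalRow, ordinal: Int): Unit = {
    if (row.isNullAt(ordinal)) values += None
    else values += Some(row.getInt(ordinal))
  }
}

class StringColumnWriter extends ColumnWriter {
  private val values = ArrayBuffer.empty[Option[String]]
  override def write(row: InternalRow, ordinal: Int): Unit = {
    if (row.isNullAt(ordinal)) values += None
    else values += Some(row.getUTF8String(ordinal).toString)
  }
}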
case class NullInts(a: Integer)
case class NullStrings(value: String)
This looks mostly borrowed from SQLTestData.scala. Mind if I try to rework the tests to use that test data and avoid adding this file?
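If the separate test-data file does go away, the same rows could instead be defined inline in the suite. A rough sketch, assuming Spark's existing test infrastructure (QueryTest, SharedSQLContext, testImplicits); the ArrowSuite name and the assertions are placeholders, and the case classes are redefined here only to keep the example self-contained.

import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSQLContext

case class NullInts(a: Integer)
case class NullStrings(value: String)

class ArrowSuite extends QueryTest with SharedSQLContext {
  import testImplicits._

  test("columns with nulls") {
    // Build the test rows inline instead of adding a separate test-data file.
    val nullInts = Seq(NullInts(1), NullInts(null), NullInts(3)).toDF()
    val nullStrings = Seq(NullStrings("a"), NullStrings(null)).toDF()

    assert(nullInts.filter($"a".isNull).count() === 1)
    assert(nullStrings.filter($"value".isNull).count() === 1)
  }
}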
Thanks @icexelloss, looks good. I'll merge after fixing up the test data.
I'd like to merge what we have so far back to the Spark PR to get some other eyes on it and show some preliminary benchmarks. I think it's close to being ready, but we need to add more tests and be clear about what is and isn't supported. I'll try to put together a checklist of what to do. @icexelloss, if you could add support for nested types and help with test coverage, that would be great! @wesm, any feature gaps in Arrow you can fill in would be nice so that we can push to get this into Spark after the Arrow 0.2 release.
We had discussed this too for follow-up work, but I haven't looked into it yet, so I'm not sure what it would take either.
Thanks for the update. If you all could help me by driving Arrow feature requirements from the Spark side (e.g. failing unit tests because we're missing this or that type implementation), versus the other way around, that would be very helpful. I should be able to turn around work pretty quickly as needed over the next couple of weeks.
Sure thing @wesm, will try to write tests to exercise different types.
Bryan,
Wes and I kept an internal list of things we would like to do:
(1) Support for more types. Besides the types that are currently supported, there are:
BinaryType, TimestampType, DateType
ArrayType, MapType, StructType
toPandas might not benefit from the last three (nested) types right now, but the idea is to design the code so that it can support nested types as well (see the sketch after this list).
(2) Benchmark. benchmark.py tests conversion for Int/Double with Spark local mode. We might want a more convincing benchmark that shows results for:
- cluster mode
- a more complex schema
(3) Profiling. Profile to eliminate inefficiency in our implementation.
(4) Support multiple record batches and reduce the memory footprint. The current implementation converts the entire dataframe to one record batch. We want an implementation that is more memory efficient.
(5) Arrow-based serialization of pandas dataframes to Spark dataframes in SparkSession.createDataFrame.
What do you think?
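Regarding item (1), a rough sketch of what per-type dispatch could look like on the Scala side. The writer classes here are empty stubs standing in for writers like the ones sketched in the review comment above, and all the names are assumptions rather than the PR's current code.

import org.apache.spark.sql.types._

// Empty stubs standing in for real per-type writers.
trait ColumnWriter
class IntColumnWriter extends ColumnWriter
class DoubleColumnWriter extends ColumnWriter
class StringColumnWriter extends ColumnWriter
class BinaryColumnWriter extends ColumnWriter
class TimestampColumnWriter extends ColumnWriter
class DateColumnWriter extends ColumnWriter
class ArrayColumnWriter(element: ColumnWriter) extends ColumnWriter

object ColumnWriters {
  // Pick a writer from the Spark SQL data type; a nested type recurses into a
  // writer for its element type, so the same structure could later cover
  // ArrayType, MapType and StructType.
  def columnWriterFor(dataType: DataType): ColumnWriter = dataType match {
    case IntegerType   => new IntColumnWriter
    case DoubleType    => new DoubleColumnWriter
    case StringType    => new StringColumnWriter
    case BinaryType    => new BinaryColumnWriter
    case TimestampType => new TimestampColumnWriter
    case DateType      => new DateColumnWriter
    case ArrayType(elementType, _) => new ArrayColumnWriter(columnWriterFor(elementType))
    case other => throw new UnsupportedOperationException(s"Unsupported type: $other")
  }
}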
Bryan,
Regarding the test data, I am a bit reluctant to reuse SQLTestData completely because:
(1) It creates a dependency on SQLTestData: if someone changes SQLTestData in the future, they may need to fix the arrow test file as well.
(2) SQLTestData might not be complex enough for arrow unit testing, so we might end up creating arrow test data anyway.
We should probably also ask the Spark folks what their preferences are. For now I can add test data in a temporary place and we can figure this out later.
Thanks @icexelloss, that sounds like a good list of things to work on. I'll respond based on what I think might be good for getting SPARK-13534 merged, which might not be easy, so it will be best to keep things simple and the scope to a minimum. Once Arrow is a dependency in Spark, follow-on work will be much easier to get merged.
For items 3 and 4: anything we can do to get better speedup and efficiency will help. However, if creating multiple batches at the worker level complicates the code too much, it may be better to save that for follow-up (a rough sketch of one approach follows this comment). Keeping things simple for reviewers who have never seen Arrow will make it easier.
Regarding the SQLTestData, it's fine to use -- that's why it's in SharedSQLContext. It's doubtful that this data would change, as they would need to update all dependent tests. As we need other test data, try to keep it in ArrowSuite for now and we can adjust later if needed. Thanks to you and Wes for the help on this. I'll focus on more tests and pinning down Arrow requirements.
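On the multiple-record-batch point (items 3 and 4 above), one low-complexity shape for bounding worker memory would be to slice each partition's row iterator into fixed-size chunks and convert one chunk at a time. A rough sketch, where rowsToRecordBatch is a hypothetical stand-in for the existing single-batch conversion and the default batch size is arbitrary.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

object ArrowBatching {
  // Hypothetical stand-in for the existing conversion of a set of rows into
  // one Arrow record batch payload.
  def rowsToRecordBatch(rows: Seq[InternalRow], schema: StructType): Array[Byte] = ???

  // Slice the partition's iterator into fixed-size chunks so only one chunk
  // of rows is materialized at a time, producing several smaller batches
  // instead of one batch for the whole dataframe.
  def toRecordBatches(
      rows: Iterator[InternalRow],
      schema: StructType,
      maxRecordsPerBatch: Int = 10000): Iterator[Array[Byte]] = {
    rows.grouped(maxRecordsPerBatch).map(chunk => rowsToRecordBatch(chunk, schema))
  }
}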
…cala changed tests to use existing SQLTestData and removed unused files closes #14
This PR includes:
(1) Fix conversion for String type
(2) Refactor related functions into arrow.scala
This is a work in progress; next I would like to add support for nested types and refactor internalRowsToArrowRecordBatch (use a different ColumnWriter for each type), as sketched below.
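A rough sketch of the shape that refactor could take: choose one writer per field up front and drive every writer over the rows, instead of matching on the data type inside the row loop. The signature, the ColumnWriter interface, and columnWriterFor are assumptions for illustration, not the function's actual interface.

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, StructType}

object ArrowConvertersSketch {

  trait ColumnWriter {
    def write(row: InternalRow, ordinal: Int): Unit
  }

  // Placeholder: would return an IntColumnWriter, StringColumnWriter, etc.
  def columnWriterFor(dataType: DataType): ColumnWriter = ???

  // One writer per field; each row is appended column by column, and the
  // filled writers are what the record batch would then be built from.
  def internalRowsToArrowRecordBatch(
      rows: Iterator[InternalRow],
      schema: StructType): Array[ColumnWriter] = {
    val writers = schema.fields.map(f => columnWriterFor(f.dataType))
    rows.foreach { row =>
      var i = 0
      while (i < writers.length) {
        writers(i).write(row, i)
        i += 1
      }
    }
    writers
  }
}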