
Conversation

@icexelloss
Collaborator

This PR includes:
(1) Fix conversion for String type
(2) Refactor related functions into arrow.scala

This is a work in progress; next, I would like to add support for nested types and refactor internalRowsToArrowRecordBatch to use a different ColumnWriter for each type.
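
As a rough illustration of the planned refactor, here is a self-contained sketch of per-type ColumnWriter dispatch. All names are hypothetical stand-ins, not the actual arrow.scala code:

```scala
// Illustrative only: models the proposed "one ColumnWriter per data type"
// refactor without depending on actual Spark or Arrow classes.
sealed trait DataType
case object IntegerType extends DataType
case object StringType  extends DataType

trait ColumnWriter {
  def write(value: Any): Unit
}

class IntColumnWriter extends ColumnWriter {
  private val values = scala.collection.mutable.ArrayBuffer.empty[Int]
  override def write(value: Any): Unit = values += value.asInstanceOf[Int]
}

class StringColumnWriter extends ColumnWriter {
  private val values = scala.collection.mutable.ArrayBuffer.empty[String]
  override def write(value: Any): Unit = values += value.asInstanceOf[String]
}

object ColumnWriter {
  // Dispatch on the column's data type, as the refactor proposes.
  def forType(dt: DataType): ColumnWriter = dt match {
    case IntegerType => new IntColumnWriter
    case StringType  => new StringColumnWriter
  }
}
```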

@wesm

wesm commented Jan 12, 2017

@BryanCutler Li and I are going to spend a bunch of time on this the next few weeks leading up to Spark Summit (and beyond) -- how can we coordinate to best align efforts? I'm also interested in providing an alternate code path for UDF evaluation, but I'm not sure how complicated it would be to share code between the collect* functions and the streaming UDF evaluator in PythonRDD.scala.

For my part, I can fill in feature gaps in Arrow C++/Python -- decimal support is one thing that comes to mind.

package org.apache.spark.sql.arrow

// Placeholder for the planned per-type column writers; no methods defined yet.
trait ColumnWriter {
}
Owner


Is this planned to be used later? OK if I don't merge it now?

case class NullInts(a: Integer)
case class NullStrings(value: String)
}

Owner


This looks mostly borrowed from SQLTestData.scala. Mind if I try to rework the tests to use that test data and avoid adding this file?

@BryanCutler
Owner

Thanks @icexelloss, looks good. I'll merge after fixing up the test data.

@BryanCutler
Owner

> Li and I are going to spend a bunch of time on this the next few weeks leading up to Spark Summit (and beyond) -- how can we coordinate to best align efforts?

I'd like to merge what we have so far back to the Spark PR to get some other eyes on it and show some preliminary benchmarks. I think it's close to being ready, but we need to add more tests and be clear about what is and isn't supported. I'll try to put together a checklist of what to do.

@icexelloss, if you could add support for nested types and help with test coverage, that would be great! @wesm , any feature gaps in Arrow you can fill in would be nice so that we can push to get this in Spark after the Arrow 0.2 release.

@BryanCutler
Owner

> I'm also interested in providing an alternate code path for UDF evaluation, but I'm not sure how complicated it would be to share code between the collect* functions and the streaming UDF evaluator in PythonRDD.scala.

We had discussed this too as follow-up work, but I haven't looked into it yet, so I'm not sure what it would take either.

@wesm

wesm commented Jan 12, 2017

Thanks for the update. If you all could help me by driving Arrow feature requirements from the Spark side (e.g. failing unit tests because we're missing this or that type implementation), rather than the other way around, that would be very helpful. I should be able to turn work around pretty quickly as needed over the next couple of weeks.

@BryanCutler
Owner

Sure thing @wesm, I'll try to write tests that exercise the different types.

@icexelloss
Collaborator Author

icexelloss commented Jan 13, 2017 via email

@icexelloss
Collaborator Author

icexelloss commented Jan 13, 2017 via email

@BryanCutler
Owner

Thanks @icexelloss, that sounds like a good list of things to work on. I'll respond based on what I think might be good for getting SPARK-13534 merged. That might not be easy, so it will be best to keep things simple and the scope to a minimum. Once Arrow is a dependency in Spark, follow-on work will be much easier to get merged.

1. I'm not sure what type support would be needed for this first iteration. If we only enable Arrow behind a flag in toPandas, then we might be able to get away with less than full support (a rough sketch follows this list). Probably best to bring it up with the other PySpark folks.

2. Having clear benchmarks that show speedups for different schemas is definitely good. I think local mode is fine for now, because we are doing a collect on the data anyway.

3, 4. Anything we can do to get better speedup and efficiency will help. However, if creating multiple batches at the worker level complicates the code too much, it may be better to save that for follow-up. Keeping things simple for reviewers who have never seen Arrow will make it easier.

5. This is out of scope for this PR, but it would be great as an immediate follow-up JIRA. We had talked about this too and would like to help out on it.
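
To illustrate point 1, here is a minimal self-contained sketch of flag-gated, partial type support; the flag and all names here are hypothetical, not an actual Spark config or API:

```scala
// Illustrative only: take the Arrow path when the flag is on and every
// column type is supported; otherwise fall back to the existing collect.
sealed trait DataType
case object IntegerType extends DataType
case object StringType  extends DataType
case object DecimalType extends DataType

object ArrowCollect {
  // Hypothetical switch, standing in for a real SQL config flag.
  val arrowEnabled: Boolean = true

  private def isSupported(dt: DataType): Boolean = dt match {
    case IntegerType | StringType => true
    case _                        => false // e.g. DecimalType, pending Arrow support
  }

  // Decide which code path a toPandas-style collection should use.
  def useArrowPath(schema: Seq[DataType]): Boolean =
    arrowEnabled && schema.forall(isSupported)
}
```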

Regarding the SQLTestData, it's fine to use - that's why it's in SharedSQLContext. It's unlikely that data will change, since doing so would require updating all dependent tests. As we need other test data, let's keep it in ArrowSuite for now and adjust later if needed.

Thanks to you and Wes for the help on this. I'll focus on more tests and pinning down Arrow requirements.

BryanCutler pushed a commit that referenced this pull request Jan 24, 2017
…cala

changed tests to use existing SQLTestData and removed unused files

closes #14
BryanCutler pushed a commit that referenced this pull request Feb 23, 2017
…cala

changed tests to use existing SQLTestData and removed unused files

closes #14