Implement Arrow column writers #16

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Closed

icexelloss wants to merge 5 commits into BryanCutler:arrow-integration from icexelloss:arrow-integration

Collaborator

icexelloss commented Jan 17, 2017

What changes were proposed in this pull request?

I have refactored arrow serialization into arrow column writers

How was this patch tested?

The two tests (int and string) in ArrowSuite pass

Owner

BryanCutler commented Jan 18, 2017

Thanks for this @icexelloss. It looks good in general but like I mentioned before the Spark committers are very reluctant to add any files or classes. Since we are only working with toPandas() right now, it will be hard to justify adding so much for a small corner of Spark functionality. I would not like to add any more files or classes until we get a committer to sign off on what we have currently. So how about you hold off on this until then?

wesm commented Jan 18, 2017

Since this is Scala, can't these classes all go in a single file?

Collaborator Author

icexelloss commented Jan 18, 2017

Bryan and Wes, I moved the column writers to Arrow.scala

Collaborator Author

icexelloss commented Jan 18, 2017

I added support for more types and switched to use arrow NullableVector

Once #17 is merged, I will rebase and add tests for uncovered types

Collaborator Author

icexelloss commented Jan 18, 2017

here are some performance issues I found with this change:

(For 1 million long and double)

Before:
internalRowsToArrowRecordBatch
[0.045898734002548736, 0.05031404899636982, 0.044390044997271616, 0.03245397000137018, 0.030072718996962067, 0.03008368899463676, 0.029609725999762304, 0.02923199100041529, 0.029868764999264386, 0.02980228500382509]

After:
internalRowsToArrowRecordBatch
[0.17399869799555745, 0.06763976799993543, 0.17177338899637107, 0.11198228599823778, 0.08500225299940212, 0.12630318199808244, 0.08831146299780812, 0.11189435399865033, 0.08414604800054803, 0.11313620099826949]

This contributes to ~20% slowness to toPandas()

I will spend some time looking into that

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

Usually explicit imports are preferred. Is it basically an import for every vector type?

Collaborator Author

icexelloss Jan 18, 2017

This is similar to import org.apache.spark.sql.types._ , otherwise we end up with a very long import (10+)

Owner

BryanCutler Jan 19, 2017

Ok, it will probably be fine like this then

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

lets move this to a ColumnWriter object apply() method

Collaborator Author

icexelloss Jan 18, 2017

Will do

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

This and subclasses should probably be package private

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

nit: remove empty line

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

I think it might be a little clearer to check for null here, like before and just define writeData and writeNull in the ColumnWriter interface. Is that do-able?

Collaborator Author

icexelloss Jan 18, 2017

Will do

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

Where is this implicit used? Generally these are somewhat frowned upon, can it be done without this?

Collaborator Author

icexelloss Jan 18, 2017

There isn't a scala function to turn a boolean into a int, that's what is this for. I will change it to be non implicit

BryanCutler reviewed

View reviewed changes

sql/core/src/main/scala/org/apache/spark/sql/Arrow.scala Outdated

Owner

BryanCutler Jan 18, 2017

Maybe some of these classes can be consolidated with a generic class definition?

Collaborator Author

icexelloss Jan 18, 2017

It has already been pretty much consolidated. Each column writers have different valueVector type, valueMutator type and writeData. The only part that is same among them is writeNull(), however, arrow doesn't have a nullable mutator interface, therefore this needs to be this way

Owner

BryanCutler commented Jan 18, 2017

Ok, it looks better with these classes in one file and the ColumnWriter interface is a good way to do this. Just some minor comments, otherwise looking good.

Owner

BryanCutler commented Jan 18, 2017 •

edited

Loading

Once #17 is merged, I will rebase and add tests for uncovered types

Ok, I'll go ahead and merge that first

This contributes to ~20% slowness to toPandas()

Any idea what was causing this?

icexelloss added 3 commits

January 18, 2017 19:03


          Implement Arrow column writers

8c7de51


          Move column writers to Arrow.scala

0fd25ea


          Add support for more types; Switch to arrow NullableVector

48155e0

Collaborator Author

icexelloss commented Jan 19, 2017

I boxed primitives by accidents. It is fixed now.

Before:
internalRowsToArrowRecordBatch
[0.18191100400144933, 0.036740217001351994, 0.03310888200212503, 0.02675118299521273, 0.02415680999547476, 0.02392288400005782, 0.023824697003874462, 0.023984511994058266, 0.022978036002314184, 0.026774197998747695]

After:
internalRowsToArrowRecordBatch
[0.2175193030052469, 0.032566589005000424, 0.030894666000676807, 0.031016137996630277, 0.06971160900138784, 0.02801599200029159, 0.028111334002460353, 0.028634793998207897, 0.027740458004700486, 0.0285637749984744]


          Address comments

3ede886

Collaborator Author

icexelloss commented Jan 19, 2017

Comments addressed

Collaborator Author

icexelloss commented Jan 19, 2017

Oops, some tests are failing. Need to fix those.

icexelloss changed the title ~~Implement Arrow column writers~~ WIP: Implement Arrow column writers


          Fix tests

f432bac

icexelloss changed the title ~~WIP: Implement Arrow column writers~~ Implement Arrow column writers

Collaborator Author

icexelloss commented Jan 19, 2017

Tests pass

BryanCutler closed this in

bb37bf1

Owner

BryanCutler commented Jan 19, 2017 •

edited

Loading

LGTM, merged. Thanks @icexelloss!

BryanCutler pushed a commit that referenced this pull request


          Implement Arrow column writers

f672f09

Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes #16

BryanCutler pushed a commit that referenced this pull request


          Implement Arrow column writers

bdba357

Move column writers to Arrow.scala

Add support for more types; Switch to arrow NullableVector

closes #16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet