Implement Arrow column writers #16
|
Thanks for this @icexelloss. It looks good in general but like I mentioned before the Spark committers are very reluctant to add any files or classes. Since we are only working with |
|
Since this is Scala, can't these classes all go in a single file? |
|
Bryan and Wes, I moved the column writers to Arrow.scala |
|
I added support for more types and switched to Arrow's NullableVector. Once #17 is merged, I will rebase and add tests for the uncovered types. |
|
Here are some performance issues I found with this change (for 1 million longs and doubles): Before: After: This makes toPandas() about 20% slower. I will spend some time looking into that.
Usually explicit imports are preferred. Is it basically an import for every vector type?
This is similar to import org.apache.spark.sql.types._; otherwise we end up with a very long import list (10+).
Ok, it will probably be fine like this then
Let's move this to an apply() method on a ColumnWriter object.
Will do
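A minimal sketch of what a companion-object apply() factory could look like. The PR's actual code is not shown in this thread, so every name here (SketchType, SketchWriter, and the concrete writers) is a hypothetical stand-in rather than a Spark or Arrow class; the point is only that the type-to-writer dispatch lives in one place.

```scala
// Hypothetical stand-ins for Spark SQL data types (not the real DataType classes).
sealed trait SketchType
case object SketchInt extends SketchType
case object SketchLong extends SketchType
case object SketchString extends SketchType
case object SketchBinary extends SketchType // deliberately left unsupported below

// Hypothetical stand-in for the ColumnWriter interface discussed above.
trait SketchWriter { def typeName: String }
final class SketchIntWriter extends SketchWriter { val typeName: String = "int" }
final class SketchLongWriter extends SketchWriter { val typeName: String = "long" }
final class SketchStringWriter extends SketchWriter { val typeName: String = "string" }

// Companion object: callers write SketchWriter(dataType) instead of matching
// on the data type themselves.
object SketchWriter {
  def apply(dataType: SketchType): SketchWriter = dataType match {
    case SketchInt    => new SketchIntWriter
    case SketchLong   => new SketchLongWriter
    case SketchString => new SketchStringWriter
    case other =>
      throw new UnsupportedOperationException(s"Unsupported data type: $other")
  }
}
```

With this shape, adding a new type means adding one writer class and one case in apply().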
This and subclasses should probably be package private
nit: remove empty line
I think it might be a little clearer to check for null here, like before, and just define writeData and writeNull in the ColumnWriter interface. Is that doable?
Will do
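The suggestion above could be sketched roughly as follows: the null check happens once in write(), and each concrete writer only supplies writeData and writeNull. This is an illustration under assumptions, not the PR's code. SketchLongVector is a hypothetical in-memory stand-in for an Arrow nullable vector, and passing values as Any is a simplification (it boxes primitives, which a real writer would want to avoid).

```scala
// Hypothetical stand-in for an Arrow nullable vector; not the real Arrow API.
final class SketchLongVector(capacity: Int) {
  private val values = new Array[Long](capacity)
  private val valid = new Array[Boolean](capacity)
  def set(i: Int, v: Long): Unit = { values(i) = v; valid(i) = true }
  def setNull(i: Int): Unit = valid(i) = false
  def isNull(i: Int): Boolean = !valid(i)
  def get(i: Int): Long = values(i)
}

// The null check lives in write(); subclasses implement only the two hooks.
trait SketchColumnWriter {
  protected var count: Int = 0
  protected def writeData(value: Any): Unit
  protected def writeNull(): Unit

  final def write(value: Any): Unit = {
    if (value == null) writeNull() else writeData(value)
    count += 1 // advance to the next slot whether or not the value was null
  }

  def valueCount: Int = count
}

final class SketchLongColumnWriter(vector: SketchLongVector) extends SketchColumnWriter {
  override protected def writeData(value: Any): Unit =
    vector.set(count, value.asInstanceOf[Long])
  override protected def writeNull(): Unit =
    vector.setNull(count)
}
```

A caller then appends values with writer.write(value) and never branches on null itself.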
Where is this implicit used? Generally these are somewhat frowned upon, can it be done without this?
There isn't a Scala function to turn a Boolean into an Int; that's what this is for. I will change it to be non-implicit.
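A minimal sketch of the non-implicit version, assuming the conversion only needs to map true to 1 and false to 0. The object and method names are hypothetical; the thread does not show the actual helper.

```scala
object SketchConversions {
  // Explicit conversion: call sites must name it, so a Boolean is never
  // silently turned into an Int the way an implicit conversion would allow.
  def booleanToInt(b: Boolean): Int = if (b) 1 else 0
}
```

A call site then reads SketchConversions.booleanToInt(flag), which keeps the conversion visible in the code under review.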
Maybe some of these classes can be consolidated with a generic class definition?
It has already been pretty much consolidated. Each column writer has a different valueVector type, valueMutator type, and writeData. The only part that is the same among them is writeNull(); however, Arrow doesn't have a nullable mutator interface, so it needs to stay this way.
|
Ok, it looks better with these classes in one file and the |
Ok, I'll go ahead and merge that first
Any idea what was causing this? |
|
I boxed primitives by accident. It is fixed now. Before: After: |
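For readers unfamiliar with the problem, here is a generic illustration of how primitives get boxed by accident in Scala. The thread does not say where the boxing occurred in this PR, so this is not a reconstruction of the actual bug; all names are hypothetical.

```scala
object BoxingSketch {
  // Generic signature: T erases to Object on the JVM, so a Long argument is
  // boxed to java.lang.Long at the call site and unboxed again here.
  def writeGeneric[T](value: T, sink: Array[Long], i: Int): Unit =
    sink(i) = value.asInstanceOf[Long]

  // Primitive signature: the value stays a JVM long, with no allocation.
  def writePrimitive(value: Long, sink: Array[Long], i: Int): Unit =
    sink(i) = value

  // Shows what the generic path actually receives at runtime.
  def runtimeClassOf[T](value: T): String =
    value.asInstanceOf[AnyRef].getClass.getName
}
```

Both methods store the same result, but on a hot path over a million values the generic one can add one allocation per value, which is the kind of overhead that shows up as a slowdown like the one reported above.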
|
Comments addressed |
|
Oops, some tests are failing. Need to fix those. |
|
Tests pass |
|
LGTM, merged. Thanks @icexelloss! |
Move column writers to Arrow.scala; add support for more types; switch to Arrow NullableVector. Closes #16
What changes were proposed in this pull request?
I have refactored Arrow serialization into Arrow column writers.
How was this patch tested?
The two tests (int and string) in ArrowSuite pass